OpenStack: Recover Galera Cluster
OpenStack MySQL (MariaDB Galera Cluster) Recovery
Problem: Your MySQL Secondary Database will not start because of disk space, InnoDB problems, etc.
This hit me when the Keystone token cleanup got fouled up and I ended up with 900K expired token records. At that point the database was hosed and would not recover because transaction logs were greater than 1GB (default max replication size).
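If you suspect the same failure mode, a quick check of how bad the token pile-up is can help. A rough sketch, assuming the stock keystone.token table with an expires column (adjust names for your deployment):
    # Rough check (assumes the stock keystone.token schema):
    # count the expired tokens waiting to be purged.
    mysql -u[user] -p[password] -e \
      "SELECT COUNT(*) FROM keystone.token WHERE expires < NOW();"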
Helpful links on MySQL recovery:
- http://blackbird.si/mysql-corrupted-innodb-tables-recovery-step-by-step-guide/
- http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
- https://www.percona.com/blog/2015/10/26/how-big-can-your-galera-transactions-be/ – the notes on binlog_row_image=minimal do not apply as we are running MariaDB 5.5
- https://www.percona.com/forums/questions-discussions/percona-xtradb-cluster/33907-got-error-5-during-commit-wsrep_max_ws_size-limit
- http://severalnines.com/blog/9-tips-going-production-galera-cluster-mysql – great guide on going to production with a Galera cluster
Solved this by doing the following:
- Stop the database on both primary (120) and secondary (220) lvosmysql database instances.
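Stopping MariaDB depends on your packaging; a rough sketch, where the service name (mariadb) is an assumption – yours may be mysql or mysqld:
    # Run on both the primary and the secondary.
    sudo service mariadb stop      # or: sudo systemctl stop mariadb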
- On the primary, start the database manually:
    sudo su - mysql /usr/bin/mysqld_safe --basedir=/usr
Wait for the database to come up cleanly (review /var/log/mariadb/mariadb.log and do a test connection to verify).
- On the secondary, because InnoDB was corrupted I had to add the following to /etc/my.cnf:
    # settings to recover in emergency
    innodb_force_recovery=5
    innodb_purge_threads=0
    port=8881
NB: The port changes to keep the database from being hammered during recovery.
Then run the database manually as with the master:
    sudo su - mysql /usr/bin/mysqld_safe --basedir=/usr
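Optionally, sanity-check that the recovery instance is only listening on the alternate port; a rough sketch:
    # The recovering instance should be listening on 8881, not 3306.
    ss -tlnp | grep -E ':3306|:8881'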
This process works because the secondary will first detect that it needs the entire /var/lib/mysql/ibdata1 file; this is a Good Thing because it (in effect) forces the secondary to rebuild itself from the master. You can verify this by checking for rsync in the process list (I used lsof for this):
    [root@lvosmysql220 mariadb]# lsof | grep rsync
    wsrep_sst  6711 mysql  255r  REG  253,0        8771   45942 /usr/bin/wsrep_sst_rsync
    rsync      6738 mysql  cwd   DIR  253,0        4096  814419 /var/lib/mysql
    [...]
    rsync      6754 mysql   11r  REG  253,0 18733858816 1258906 /var/lib/mysql/ibdata1
    [...]
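You can also watch the state transfer from the Galera side; a rough sketch, run against the primary (the donor), which should report something like Donor/Desynced while the SST is in flight:
    # Rough check on the primary while the SST runs:
    mysql -u[user] -p[password] -e \
      "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"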
Once the ibdata1 file is transferred, the database promptly halts, because innodb_force_recovery=5 places the database in read-only recovery mode – which, since the entire database has just been re-synced from the master, is no longer necessary. So comment out the emergency settings in /etc/my.cnf and manually restart the database on the secondary.
- At this point, both primary and secondary database hosts should be synchronized and replication should report OK. The next step is to get the underlying data back into a good, committed state on both nodes; in my case this meant a painful session of deleting 1000 Keystone token records at a time (to keep the transaction log / replication processes from being overloaded). That took several hours.
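A rough sketch of that kind of batched cleanup (again assuming the stock keystone.token schema – adjust names to taste):
    # Rough sketch of a batched cleanup (assumes the stock keystone.token schema).
    # Small batches keep each replicated write-set well under the size limit.
    while true; do
      ROWS=$(mysql -u[user] -p[password] -N -e \
        "DELETE FROM keystone.token WHERE expires < NOW() LIMIT 1000; SELECT ROW_COUNT();")
      [ "$ROWS" -eq 0 ] && break
      sleep 1
    done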
In your case, you will need to troubleshoot why your secondary database host failed to start and correct as needed.
- Once the databases are finally at a good point (in my case, when all 900K worth of expired Keystone token records were deleted and committed to primary / secondary), you can stop the database on each server (remember: they are running from a manual prompt):
    mysqladmin -u[user] -p[password] -h[host] shutdown
You run the above as root, and you *wait for a clean shutdown* on each node. I recommend a full reboot of each server and careful verification that MySQL (MariaDB) starts up correctly after the reboot completes.
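A rough post-reboot check along these lines confirms the cluster re-formed (a two-node cluster should report a size of 2, a Synced state, and wsrep_ready ON):
    # Rough post-reboot check, run on each node:
    mysql -u[user] -p[password] -e \
      "SHOW GLOBAL STATUS WHERE Variable_name IN
       ('wsrep_cluster_size','wsrep_local_state_comment','wsrep_ready');"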
That is all.