In this post I will talk about how to measure replication lag and check/fix replication inconsistencies with this tools:
pt-heartbeat
pt-table-checksum
pt-table-sync
I am using environment from previous blog post.
Master-Master replication with MariaDB 10.0.16 database on Debian 7.
Install Percona Toolkit on both nodes:
$ sudo wget percona.com/get/percona-toolkit.deb $ sudo apt-get install libterm-readkey-perl $ sudo dpkg -i percona-toolkit.deb
I will create percona database where I will store tables needed for various checks. Also I will create percona user which will be used with Percona tools.
MASTER1
MariaDB [(none)]> create database percona; MariaDB [(none)]> grant all privileges on *.* to 'percona'@'master1.localdomain' identified by 'percona'; MariaDB [(none)]> grant all privileges on *.* to 'percona'@'localhost' identified by 'percona'; MariaDB [(none)]> flush privileges;
MASTER2
MariaDB [(none)]> grant all privileges on *.* to 'percona'@'master2.localdomain' identified by 'percona'; MariaDB [(none)]> grant all privileges on *.* to 'percona'@'localhost' identified by 'percona'; MariaDB [(none)]> flush privileges;
MONITOR REPLICATION LAG
So, I have replication running and I want to be sure that everything is working fine.
Typical method to monitor replication lag would be to run SLAVE STATUS and look at Seconds_Behind_Master. But Seconds_Behind_Master is not always accurate.
Percona Toolkit has a tool to monitor replication delay called pt-heartbeat.
We must create heartbeat table on the master manually or using --create-table option and heartbeat table must contain one heartbeat row. This table will be updated in interval we specify by pt-heartbeat. Slave will actively check table and calculate time delay.
Create heartbeat table and start daemonized process to update percona.heartbeat table.
MASTER1
$ pt-heartbeat -upercona -ppercona -D percona --update master1 --daemonize --create-table
MASTER2
Start pt-heartbeat.
$ pt-heartbeat -upercona -ppercona --update --database percona
MASTER1
Monitor replication slave lag.
$ pt-heartbeat -upercona -ppercona -D percona --monitor -h master2 0.00s [ 0.00s, 0.00s, 0.00s ] 0.00s [ 0.00s, 0.00s, 0.00s ] 0.00s [ 0.00s, 0.00s, 0.00s ] 0.00s [ 0.00s, 0.00s, 0.00s ] 0.00s [ 0.00s, 0.00s, 0.00s ] 0.00s [ 0.00s, 0.00s, 0.00s ] 0.00s [ 0.00s, 0.00s, 0.00s ] 0.00s [ 0.00s, 0.00s, 0.00s ] 0.00s [ 0.00s, 0.00s, 0.00s ] 0.00s [ 0.00s, 0.00s, 0.00s ] 0.00s [ 0.00s, 0.00s, 0.00s ] 0.00s [ 0.00s, 0.00s, 0.00s ] 0.00s [ 0.00s, 0.00s, 0.00s ] 0.01s [ 0.00s, 0.00s, 0.00s ] 0.00s [ 0.00s, 0.00s, 0.00s ] 0.00s [ 0.00s, 0.00s, 0.00s ]
CHECK REPLICATION INCONSISTENCIES
If we want to check replication integrity we can use pt-table-checksum tool.
Run tool on master server. It will automatically detect slave servers and connect to them to do some safety checks. After that it runs checksums on the tables of the master database and reports results in the checksum table. This results are then compared with the results on the slave whether the data differs.
You can inspect that table anytime - in this example percona.checksums table.
If there are no different rows in the tables between master and slave database DIFF will show 0.
$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master2 TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE 02-02T20:58:15 0 0 5 1 0 1.134 testdb.users
MASTER2
MariaDB [testdb]> create table address (id int auto_increment primary key, city varchar(30)); Query OK, 0 rows affected (0.06 sec) MariaDB [testdb]> insert into address (city) values ('New York'); Query OK, 1 row affected (0.07 sec) MariaDB [testdb]> insert into address (city) values ('LA'); Query OK, 1 row affected (0.06 sec) MariaDB [testdb]> insert into address (city) values ('Zagreb'); Query OK, 1 row affected (0.13 sec)
MASTER1
$ pt-table-checksum -upercona -ppercona --replicate percona.checksums --databases testdb -h master2 TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE 02-02T20:59:16 0 0 3 1 0 1.032 testdb.address 02-02T20:59:17 0 0 5 1 0 1.120 testdb.users
$ pt-table-checksum -upercona -ppercona --replicate=percona.checksums --replicate-check-only --databases=testdb master1
Nothing received in output which means that testdb database is in sync with slave.
Insert some test data:
MariaDB [testdb]> create table animals (id int not null auto_increment, -> name char(30) not null, -> primary key(id)); Query OK, 0 rows affected (0.04 sec) MariaDB [testdb]> insert into animals (name) values ('dog'),('cat'),('whale'); Query OK, 3 rows affected (0.00 sec) Records: 3 Duplicates: 0 Warnings: 0 MariaDB [testdb]> create table countries (id int not null auto_increment, -> name varchar(30), -> primary key(id)); Query OK, 0 rows affected (0.09 sec) MariaDB [testdb]> insert into countries(name) values ('Croatia'),('England'),('USA'),('Island'); Query OK, 4 rows affected (0.00 sec) Records: 4 Duplicates: 0 Warnings: 0 MariaDB [testdb]> select * from animals; +----+-------+ | id | name | +----+-------+ | 1 | dog | | 2 | cat | | 3 | whale | +----+-------+ 3 rows in set (0.00 sec) MariaDB [testdb]> select * from countries; +----+---------+ | id | name | +----+---------+ | 1 | Croatia | | 2 | England | | 3 | USA | | 4 | Island | +----+---------+ 4 rows in set (0.00 sec)
Check if database is in sync:
$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master1 TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE 02-02T21:03:49 0 0 3 1 0 0.177 testdb.address 02-02T21:03:49 0 0 3 1 0 0.045 testdb.animals 02-02T21:03:49 0 0 4 1 0 0.049 testdb.countries 02-02T21:03:49 0 0 5 1 0 0.037 testdb.users
RESYNC REPLICA FROM THE MASTER
Lets make database on MASTER2 out-of-sync and create some differences between databases.
MASTER2
Instead of stopping replication process, I will temporarily disable binary logging on MASTER2 server.
MariaDB [testdb]> SET SQL_LOG_BIN=0; Query OK, 0 rows affected (0.00 sec)
Make same data modifications.
MariaDB [testdb]> insert into animals (name) values ('Ostrich'),('Penguin'); Query OK, 2 rows affected (0.04 sec) Records: 2 Duplicates: 0 Warnings: 0 MariaDB [testdb]> delete from countries where id=2; Query OK, 1 row affected (0.01 sec) MariaDB [testdb]> create table colors (name varchar(30)); Query OK, 0 rows affected (0.10 sec) MariaDB [testdb]> insert into colors(name) values ('Red'),('Blue'); Query OK, 2 rows affected (0.02 sec) Records: 2 Duplicates: 0 Warnings: 0
Enable binary logging again.
MariaDB [testdb]> SET SQL_LOG_BIN=1; Query OK, 0 rows affected (0.00 sec)
MASTER1
MariaDB [testdb]> select * from animals; +----+-------+ | id | name | +----+-------+ | 1 | dog | | 2 | cat | | 3 | whale | +----+-------+ 3 rows in set (0.00 sec) MariaDB [testdb]> select * from countries; +----+---------+ | id | name | +----+---------+ | 1 | Croatia | | 2 | England | | 3 | USA | | 4 | Island | +----+---------+ 4 rows in set (0.00 sec) MariaDB [testdb]> show tables; +------------------+ | Tables_in_testdb | +------------------+ | address | | animals | | countries | | users | +------------------+ 4 rows in set (0.00 sec)
MASTER2
MariaDB [testdb]> select * from animals; +----+---------+ | id | name | +----+---------+ | 1 | dog | | 2 | cat | | 3 | whale | | 4 | Ostrich | | 5 | Penguin | +----+---------+ 5 rows in set (0.00 sec) MariaDB [testdb]> select * from countries; +----+---------+ | id | name | +----+---------+ | 1 | Croatia | | 3 | USA | | 4 | Island | +----+---------+ 3 rows in set (0.00 sec) MariaDB [testdb]> show tables; +------------------+ | Tables_in_testdb | +------------------+ | address | | animals | | colors | | countries | | users | +------------------+ 5 rows in set (0.00 sec)
Notice that there are some inconsistencies between databases and there isn’t any built-in tool that will notify us about that. Replication is working fine, even though replica has different data than master.
With pt-table-checksum we will check data differences between databases.
MASTER1
$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master1 TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE 02-02T21:11:23 0 0 3 1 0 0.106 testdb.address 02-02T21:11:23 0 1 3 1 0 0.053 testdb.animals 02-02T21:11:24 0 1 4 1 0 0.046 testdb.countries 02-02T21:11:24 0 0 5 1 0 0.042 testdb.users $ pt-table-checksum -upercona -ppercona --replicate=percona.checksums --replicate-check-only --databases=testdb master1 Differences on master2 TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY testdb.animals 1 2 1 testdb.countries 1 -1 1
Notice how tool reported differences in DIFFS column.
Synchronizing data between servers in master-master configuration is not trivial task. You have to think about which process is changing data where and be very careful to avoid data corruption.
In master-master configuration data changes are replicated between nodes and statements executed on "slave" node are replicated to the master.
Maybe the best approach would be to stop replication, restore replica from backup or reclone whole server and start replication again. You can also dump only affected data with mysqldump and reload it.
As this is my testing environment I will try to resolve differences using pt-table-sync tool from Percona toolkit.
First I will use tool with --print option which will only display me queries that will resolve differences. I will inspect those queries before executing them on the slave server.
These queries could be executed manually also.
$ pt-table-sync -upercona -ppercona --sync-to-master --databases testdb --transaction --lock=1 --verbose master2 --print # Syncing h=master2,p=...,u=percona # DELETE REPLACE INSERT UPDATE ALGORITHM START END EXIT DATABASE.TABLE # 0 0 0 0 Chunk 22:13:17 22:13:17 0 testdb.address DELETE FROM `testdb`.`animals` WHERE `id`='4' LIMIT 1 /*percona-toolkit src_db:testdb src_tbl:animals src_dsn:P=3306,h=master1,p=...,u=percona dst_db:testdb dst_tbl:animals dst_dsn:h=master2,p=...,u=percona lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:7723 user:msutic host:master1*/; DELETE FROM `testdb`.`animals` WHERE `id`='5' LIMIT 1 /*percona-toolkit src_db:testdb src_tbl:animals src_dsn:P=3306,h=master1,p=...,u=percona dst_db:testdb dst_tbl:animals dst_dsn:h=master2,p=...,u=percona lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:7723 user:msutic host:master1*/; # 2 0 0 0 Chunk 22:13:17 22:13:17 2 testdb.animals REPLACE INTO `testdb`.`countries`(`id`, `name`) VALUES ('2', 'England') /*percona-toolkit src_db:testdb src_tbl:countries src_dsn:P=3306,h=master1,p=...,u=percona dst_db:testdb dst_tbl:countries dst_dsn:h=master2,p=...,u=percona lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:7723 user:msutic host:master1*/; # 0 1 0 0 Chunk 22:13:17 22:13:17 2 testdb.countries # 0 0 0 0 Chunk 22:13:17 22:13:17 0 testdb.users
Set --execute option to execute those queries.
With --sync-to-master option we will treat MASTER2 server as a slave.
$ pt-table-sync -upercona -ppercona --sync-to-master --databases testdb --transaction --lock=1 --verbose master2 --execute # Syncing h=master2,p=...,u=percona # DELETE REPLACE INSERT UPDATE ALGORITHM START END EXIT DATABASE.TABLE # 0 0 0 0 Chunk 22:19:51 22:19:51 0 testdb.address # 2 0 0 0 Chunk 22:19:51 22:19:51 2 testdb.animals # 0 1 0 0 Chunk 22:19:51 22:19:51 2 testdb.countries # 0 0 0 0 Chunk 22:19:51 22:19:51 0 testdb.users
Output shows that differences are successfully resolved with two DELETE and one REPLACE operation on specified tables.
Let’s run another check to verify if differences still exist.
$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master1 TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE 02-02T22:21:30 0 0 3 1 0 0.549 testdb.address 02-02T22:21:30 0 0 3 1 0 0.048 testdb.animals 02-02T22:21:30 0 0 4 1 0 0.043 testdb.countries 02-02T22:21:30 0 0 5 1 0 0.049 testdb.users
DIFFS columns shows only 0 which means that tables are in sync.
What if I run checksums on MASTER2 server.
MASTER2
$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master2 TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE 02-02T22:24:16 0 0 3 1 0 0.072 testdb.address 02-02T22:24:16 0 0 3 1 0 0.048 testdb.animals 02-02T22:24:16 Skipping table testdb.colors because it has problems on these replicas: Table testdb.colors does not exist on replica master1 This can break replication. If you understand the risks, specify --no-check-slave-tables to disable this check. 02-02T22:24:16 Error checksumming table testdb.colors: DBD::mysql::db selectrow_hashref failed: Table 'testdb.colors' doesn't exist [for Statement "EXPLAIN SELECT * FROM `testdb`.`colors` WHERE 1=1"] at /usr/bin/pt-table-checksum line 6595. 02-02T22:24:16 1 0 0 0 0 0.003 testdb.colors 02-02T22:24:16 0 0 4 1 0 0.044 testdb.countries 02-02T22:24:16 0 0 5 1 0 0.043 testdb.users
Output shows error because table testdb.colors exists on MASTER2 but not in MASTER1.
I know that MASTER1 has "correct" data so I will just drop testdb.colors table on MASTER2 node.
MariaDB [testdb]> drop table if exists testdb.colors; Query OK, 0 rows affected (0.05 sec)
Run check again:
$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master2 TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE 02-02T22:26:43 0 0 3 1 0 0.322 testdb.address 02-02T22:26:43 0 0 3 1 0 0.056 testdb.animals 02-02T22:26:43 0 0 4 1 0 0.050 testdb.countries 02-02T22:26:43 0 0 5 1 0 0.045 testdb.users
Now we have synced databases.
If we use --quiet option tool will print out row per table only if there are some differences. This is nice way to run tool from a cron job and send mail only if there is non-zero exit status.
$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master1 --quiet (no rows)
REFERENCES
http://www.percona.com/doc/percona-toolkit/2.2/pt-table-sync.html
http://www.percona.com/doc/percona-toolkit/2.2/pt-table-checksum.html
http://www.percona.com/software/percona-toolkit
0 Comments:
Post a Comment