Thursday, February 5, 2015

MariaDB - Measure Replicaton Lag and Check / Fix Replication Inconsistencies using Percona tools

Percona Toolkit is collection of command-line tools to perform many MySQL tasks like creating backups, finding duplicate indexes, managing replication, etc.

In this post I will talk about how to measure replication lag and check/fix replication inconsistencies with this tools:
pt-heartbeat
pt-table-checksum
pt-table-sync


I am using environment from previous blog post.
Master-Master replication with MariaDB 10.0.16 database on Debian 7.


Install Percona Toolkit on both nodes:

$ sudo wget percona.com/get/percona-toolkit.deb

$ sudo apt-get install libterm-readkey-perl
$ sudo dpkg -i percona-toolkit.deb  


I will create percona database where I will store tables needed for various checks. Also I will create percona user which will be used with Percona tools.



MASTER1

MariaDB [(none)]> create database percona;

MariaDB [(none)]> grant all privileges on *.* to 'percona'@'master1.localdomain' identified by 'percona';

MariaDB [(none)]> grant all privileges on *.* to 'percona'@'localhost' identified by 'percona';

MariaDB [(none)]> flush privileges;


MASTER2

MariaDB [(none)]> grant all privileges on *.* to 'percona'@'master2.localdomain' identified by 'percona';

MariaDB [(none)]> grant all privileges on *.* to 'percona'@'localhost' identified by 'percona';

MariaDB [(none)]> flush privileges;




MONITOR REPLICATION LAG


So, I have replication running and I want to be sure that everything is working fine.
Typical method to monitor replication lag would be to run SLAVE STATUS and look at Seconds_Behind_Master. But Seconds_Behind_Master is not always accurate.

Percona Toolkit has a tool to monitor replication delay called pt-heartbeat.

We must create heartbeat table on the master manually or using --create-table option and heartbeat table must contain one heartbeat row. This table will be updated in interval we specify by pt-heartbeat. Slave will actively check table and calculate time delay.


Create heartbeat table and start daemonized process to update percona.heartbeat table.


MASTER1

$ pt-heartbeat -upercona -ppercona -D percona --update master1 --daemonize --create-table


MASTER2
Start pt-heartbeat.

$ pt-heartbeat -upercona -ppercona --update --database percona


MASTER1
Monitor replication slave lag.

$ pt-heartbeat -upercona -ppercona -D percona --monitor -h master2
0.00s [  0.00s,  0.00s,  0.00s ]
0.00s [  0.00s,  0.00s,  0.00s ]
0.00s [  0.00s,  0.00s,  0.00s ]
0.00s [  0.00s,  0.00s,  0.00s ]
0.00s [  0.00s,  0.00s,  0.00s ]
0.00s [  0.00s,  0.00s,  0.00s ]
0.00s [  0.00s,  0.00s,  0.00s ]
0.00s [  0.00s,  0.00s,  0.00s ]
0.00s [  0.00s,  0.00s,  0.00s ]
0.00s [  0.00s,  0.00s,  0.00s ]
0.00s [  0.00s,  0.00s,  0.00s ]
0.00s [  0.00s,  0.00s,  0.00s ]
0.00s [  0.00s,  0.00s,  0.00s ]
0.01s [  0.00s,  0.00s,  0.00s ]
0.00s [  0.00s,  0.00s,  0.00s ]
0.00s [  0.00s,  0.00s,  0.00s ]





CHECK REPLICATION INCONSISTENCIES


If we want to check replication integrity we can use pt-table-checksum tool.

Run tool on master server. It will automatically detect slave servers and connect to them to do some safety checks. After that it runs checksums on the tables of the master database and reports results in the checksum table. This results are then compared with the results on the slave whether the data differs.
You can inspect that table anytime - in this example percona.checksums table.

If there are no different rows in the tables between master and slave database DIFF will show 0.


$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master2
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-02T20:58:15      0      0        5       1       0   1.134 testdb.users


MASTER2

MariaDB [testdb]> create table address (id int auto_increment primary key, city varchar(30));
Query OK, 0 rows affected (0.06 sec)

MariaDB [testdb]> insert into address (city) values ('New York');
Query OK, 1 row affected (0.07 sec)

MariaDB [testdb]> insert into address (city) values ('LA');
Query OK, 1 row affected (0.06 sec)

MariaDB [testdb]> insert into address (city) values ('Zagreb');
Query OK, 1 row affected (0.13 sec)


MASTER1

$ pt-table-checksum -upercona -ppercona --replicate percona.checksums --databases testdb -h master2
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-02T20:59:16      0      0        3       1       0   1.032 testdb.address
02-02T20:59:17      0      0        5       1       0   1.120 testdb.users


$ pt-table-checksum -upercona -ppercona --replicate=percona.checksums --replicate-check-only --databases=testdb master1


Nothing received in output which means that testdb database is in sync with slave.


Insert some test data:

MariaDB [testdb]> create table animals (id int not null auto_increment,
    -> name char(30) not null,
    -> primary key(id));
Query OK, 0 rows affected (0.04 sec)

MariaDB [testdb]> insert into animals (name) values ('dog'),('cat'),('whale');
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0


MariaDB [testdb]> create table countries (id int not null auto_increment,
    -> name varchar(30),
    -> primary key(id));
Query OK, 0 rows affected (0.09 sec)

MariaDB [testdb]> insert into countries(name) values ('Croatia'),('England'),('USA'),('Island');
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0

MariaDB [testdb]> select * from animals;
+----+-------+
| id | name  |
+----+-------+
|  1 | dog   |
|  2 | cat   |
|  3 | whale |
+----+-------+
3 rows in set (0.00 sec)

MariaDB [testdb]>  select * from countries;
+----+---------+
| id | name    |
+----+---------+
|  1 | Croatia |
|  2 | England |
|  3 | USA     |
|  4 | Island  |
+----+---------+
4 rows in set (0.00 sec)


Check if database is in sync:

$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master1
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-02T21:03:49      0      0        3       1       0   0.177 testdb.address
02-02T21:03:49      0      0        3       1       0   0.045 testdb.animals
02-02T21:03:49      0      0        4       1       0   0.049 testdb.countries
02-02T21:03:49      0      0        5       1       0   0.037 testdb.users





RESYNC REPLICA FROM THE MASTER


Lets make database on MASTER2 out-of-sync and create some differences between databases.


MASTER2

Instead of stopping replication process, I will temporarily disable binary logging on MASTER2 server.

MariaDB [testdb]> SET SQL_LOG_BIN=0;
Query OK, 0 rows affected (0.00 sec)


Make same data modifications.

MariaDB [testdb]> insert into animals (name) values ('Ostrich'),('Penguin');
Query OK, 2 rows affected (0.04 sec)
Records: 2  Duplicates: 0  Warnings: 0

MariaDB [testdb]> delete from countries where id=2;
Query OK, 1 row affected (0.01 sec)


MariaDB [testdb]> create table colors (name varchar(30));
Query OK, 0 rows affected (0.10 sec)

MariaDB [testdb]> insert into colors(name) values ('Red'),('Blue');
Query OK, 2 rows affected (0.02 sec)
Records: 2  Duplicates: 0  Warnings: 0


Enable binary logging again.

MariaDB [testdb]> SET SQL_LOG_BIN=1;
Query OK, 0 rows affected (0.00 sec)



MASTER1

MariaDB [testdb]> select * from animals;
+----+-------+
| id | name  |
+----+-------+
|  1 | dog   |
|  2 | cat   |
|  3 | whale |
+----+-------+
3 rows in set (0.00 sec)

MariaDB [testdb]> select * from countries;
+----+---------+
| id | name    |
+----+---------+
|  1 | Croatia |
|  2 | England |
|  3 | USA     |
|  4 | Island  |
+----+---------+
4 rows in set (0.00 sec)

MariaDB [testdb]> show tables;
+------------------+
| Tables_in_testdb |
+------------------+
| address          |
| animals          |
| countries        |
| users            |
+------------------+
4 rows in set (0.00 sec)


MASTER2

MariaDB [testdb]> select * from animals;
+----+---------+
| id | name    |
+----+---------+
|  1 | dog     |
|  2 | cat     |
|  3 | whale   |
|  4 | Ostrich |
|  5 | Penguin |
+----+---------+
5 rows in set (0.00 sec)

MariaDB [testdb]> select * from countries;
+----+---------+
| id | name    |
+----+---------+
|  1 | Croatia |
|  3 | USA     |
|  4 | Island  |
+----+---------+
3 rows in set (0.00 sec)

MariaDB [testdb]> show tables;
+------------------+
| Tables_in_testdb |
+------------------+
| address          |
| animals          |
| colors           |
| countries        |
| users            |
+------------------+
5 rows in set (0.00 sec)


Notice that there are some inconsistencies between databases and there isn’t any built-in tool that will notify us about that. Replication is working fine, even though replica has different data than master.

With pt-table-checksum we will check data differences between databases.


MASTER1

$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master1
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-02T21:11:23      0      0        3       1       0   0.106 testdb.address
02-02T21:11:23      0      1        3       1       0   0.053 testdb.animals
02-02T21:11:24      0      1        4       1       0   0.046 testdb.countries
02-02T21:11:24      0      0        5       1       0   0.042 testdb.users


$ pt-table-checksum -upercona -ppercona --replicate=percona.checksums --replicate-check-only --databases=testdb master1
Differences on master2
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
testdb.animals 1 2 1
testdb.countries 1 -1 1

Notice how tool reported differences in DIFFS column.


Synchronizing data between servers in master-master configuration is not trivial task. You have to think about which process is changing data where and be very careful to avoid data corruption.

In master-master configuration data changes are replicated between nodes and statements executed on "slave" node are replicated to the master.

Maybe the best approach would be to stop replication, restore replica from backup or reclone whole server and start replication again. You can also dump only affected data with mysqldump and reload it.


As this is my testing environment I will try to resolve differences using pt-table-sync tool from Percona toolkit.


First I will use tool with --print option which will only display me queries that will resolve differences. I will inspect those queries before executing them on the slave server.
These queries could be executed manually also.

$ pt-table-sync -upercona -ppercona --sync-to-master --databases testdb --transaction --lock=1 --verbose master2 --print

# Syncing h=master2,p=...,u=percona
# DELETE REPLACE INSERT UPDATE ALGORITHM START    END      EXIT DATABASE.TABLE
#      0       0      0      0 Chunk     22:13:17 22:13:17 0    testdb.address
DELETE FROM `testdb`.`animals` WHERE `id`='4' LIMIT 1 /*percona-toolkit src_db:testdb src_tbl:animals src_dsn:P=3306,h=master1,p=...,u=percona dst_db:testdb dst_tbl:animals dst_dsn:h=master2,p=...,u=percona lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:7723 user:msutic host:master1*/;
DELETE FROM `testdb`.`animals` WHERE `id`='5' LIMIT 1 /*percona-toolkit src_db:testdb src_tbl:animals src_dsn:P=3306,h=master1,p=...,u=percona dst_db:testdb dst_tbl:animals dst_dsn:h=master2,p=...,u=percona lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:7723 user:msutic host:master1*/;
#      2       0      0      0 Chunk     22:13:17 22:13:17 2    testdb.animals
REPLACE INTO `testdb`.`countries`(`id`, `name`) VALUES ('2', 'England') /*percona-toolkit src_db:testdb src_tbl:countries src_dsn:P=3306,h=master1,p=...,u=percona dst_db:testdb dst_tbl:countries dst_dsn:h=master2,p=...,u=percona lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:7723 user:msutic host:master1*/;
#      0       1      0      0 Chunk     22:13:17 22:13:17 2    testdb.countries
#      0       0      0      0 Chunk     22:13:17 22:13:17 0    testdb.users


Set --execute option to execute those queries.
With --sync-to-master option we will treat MASTER2 server as a slave.


$ pt-table-sync -upercona -ppercona --sync-to-master --databases testdb --transaction --lock=1 --verbose master2 --execute

# Syncing h=master2,p=...,u=percona
# DELETE REPLACE INSERT UPDATE ALGORITHM START    END      EXIT DATABASE.TABLE
#      0       0      0      0 Chunk     22:19:51 22:19:51 0    testdb.address
#      2       0      0      0 Chunk     22:19:51 22:19:51 2    testdb.animals
#      0       1      0      0 Chunk     22:19:51 22:19:51 2    testdb.countries
#      0       0      0      0 Chunk     22:19:51 22:19:51 0    testdb.users


Output shows that differences are successfully resolved with two DELETE and one REPLACE operation on specified tables.

Let’s run another check to verify if differences still exist.

$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master1

            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-02T22:21:30      0      0        3       1       0   0.549 testdb.address
02-02T22:21:30      0      0        3       1       0   0.048 testdb.animals
02-02T22:21:30      0      0        4       1       0   0.043 testdb.countries
02-02T22:21:30      0      0        5       1       0   0.049 testdb.users

DIFFS columns shows only 0 which means that tables are in sync.




What if I run checksums on MASTER2 server.


MASTER2

$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master2

            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-02T22:24:16      0      0        3       1       0   0.072 testdb.address
02-02T22:24:16      0      0        3       1       0   0.048 testdb.animals
02-02T22:24:16 Skipping table testdb.colors because it has problems on these replicas:
Table testdb.colors does not exist on replica master1
This can break replication.  If you understand the risks, specify --no-check-slave-tables to disable this check.
02-02T22:24:16 Error checksumming table testdb.colors: DBD::mysql::db selectrow_hashref failed: Table 'testdb.colors' doesn't exist [for Statement "EXPLAIN SELECT * FROM `testdb`.`colors` WHERE 1=1"] at /usr/bin/pt-table-checksum line 6595.

02-02T22:24:16      1      0        0       0       0   0.003 testdb.colors
02-02T22:24:16      0      0        4       1       0   0.044 testdb.countries
02-02T22:24:16      0      0        5       1       0   0.043 testdb.users


Output shows error because table testdb.colors exists on MASTER2 but not in MASTER1.

I know that MASTER1 has "correct" data so I will just drop testdb.colors table on MASTER2 node.

MariaDB [testdb]> drop table if exists testdb.colors;
Query OK, 0 rows affected (0.05 sec)


Run check again:

$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master2
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
02-02T22:26:43      0      0        3       1       0   0.322 testdb.address
02-02T22:26:43      0      0        3       1       0   0.056 testdb.animals
02-02T22:26:43      0      0        4       1       0   0.050 testdb.countries
02-02T22:26:43      0      0        5       1       0   0.045 testdb.users


Now we have synced databases.



If we use --quiet option tool will print out row per table only if there are some differences. This is nice way to run tool from a cron job and send mail only if there is non-zero exit status.

$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master1 --quiet
(no rows)




REFERENCES
http://www.percona.com/doc/percona-toolkit/2.2/pt-table-sync.html
http://www.percona.com/doc/percona-toolkit/2.2/pt-table-checksum.html
http://www.percona.com/software/percona-toolkit


0 Comments:

Post a Comment