Tuesday, November 16, 2021

PostgreSQL HA - Patroni, ETCD, HAProxy, Keepalived - Test failure scenarios

I will test a few failover/maintenance scenarios and show the results in this blog post.

Just to mention, this is not a proper production test. Before considering this setup for production, it would be wise to put the cluster under a realistic load, simulate slow IO response times, memory crashes, etc., and check the cluster's behavior.

In these tests I am only starting/stopping resources in various scenarios and observing the results.

Standby tests

1. Kill PostgreSQL process - downtime: 10 secs (one RO node left)
   - No problems for the writer process.
   - Patroni brought the PostgreSQL instance back up automatically.
2. Stop the PostgreSQL process - downtime: 6 secs (one RO node left)
   - No problems for the read-write process.
   - Patroni brought the PostgreSQL instance back up automatically.
3. Reboot the server - downtime: 27 secs (one RO node left)
   - No problems for the write process.
   - Patroni started on boot and brought the PostgreSQL instance up automatically.
4. Stop the Patroni process - downtime: 25 secs (one RO node left)
   - No problems for the write process.
   - Stopping Patroni stopped the PostgreSQL process and excluded the 192.168.56.53 node from the cluster.
   - After starting Patroni, it brought up the PostgreSQL instance and the node rejoined the cluster automatically.
Master tests

1. Kill PostgreSQL process - downtime: 10 secs (RW)
   - After killing the PostgreSQL process, Patroni brought the service back to the running state.
   - No disruption for the read-only requests.
2. Stop the PostgreSQL process - downtime: 7 secs (RW)
   - Patroni brought PostgreSQL back to the running state. An election was not triggered.
3. Reboot the server - downtime: 17 secs (RW)
   - Failover happened and one of the standby servers was elected as the new master.
   - On the old master server, Patroni brought up PostgreSQL and performed pg_rewind to recreate the replica.
4. Stop the Patroni process - downtime: 10 secs (RW)
   - Patroni stopped the PostgreSQL instance and a new master node was elected.
   - After starting Patroni, the old master server was rewound using pg_rewind and joined the cluster as a new replica.
Network isolation tests

1. Network isolate the master server - downtime: 31 secs (RW)
   - A new master was elected.
   - Bringing back communication to the old master server did not rejoin it as a replica automatically.
   - After restarting Patroni, it brought up the PostgreSQL instance on 192.168.56.53 as a replica.
2. Network isolate a standby server - downtime: 0 secs (RW)
   - The isolated standby server was excluded from the cluster configuration.
   - After bringing back communication, the standby node rejoined the cluster automatically.


Pinging the cluster on the read-write interface (port 5000):

while true; do echo "select inet_server_addr(),now()::timestamp" | psql -Upostgres -h192.168.56.100 -p5000 -t; sleep 1; done
Pinging the cluster on the read-only interface (port 5001):

while true; do echo "select inet_server_addr(),now()::timestamp" | psql -Upostgres -h192.168.56.100 -p5001 -t; sleep 1; done
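The downtime values quoted in this post are simply the gap between the last successful response before the failure and the first response after recovery. A minimal sketch of that arithmetic, assuming GNU date and reusing two timestamps from the standby reboot test below:

```shell
# Gap between the last reply before the failure and the first reply after recovery.
# Timestamps are copied from the standby reboot test output (GNU date assumed).
t_last="2021-11-11 20:10:12.314665"   # last reply from 192.168.56.53 before reboot
t_first="2021-11-11 20:10:39.644502"  # first reply after Patroni restarted PostgreSQL
gap=$(( $(date -d "$t_first" +%s) - $(date -d "$t_last" +%s) ))
echo "downtime: ${gap} secs"
```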

Standby Tests

1. Kill PostgreSQL process

Read Write

192.168.56.52 | 2021-11-11 19:54:45.462582
192.168.56.52 | 2021-11-11 19:54:46.483056
192.168.56.52 | 2021-11-11 19:54:47.502918
192.168.56.52 | 2021-11-11 19:54:48.522746
192.168.56.52 | 2021-11-11 19:54:49.544109
192.168.56.52 | 2021-11-11 19:54:50.564185
192.168.56.52 | 2021-11-11 19:54:51.585437
192.168.56.52 | 2021-11-11 19:54:52.607154
192.168.56.52 | 2021-11-11 19:54:53.628248
192.168.56.52 | 2021-11-11 19:54:54.649941
192.168.56.52 | 2021-11-11 19:54:55.671482
No problems for the writer process.

Read Only

192.168.56.53 | 2021-11-11 19:54:45.505932 <<-- KILL POSTGRES PROCESS ON 192.168.56.53
192.168.56.51 | 2021-11-11 19:54:46.562742

psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

 192.168.56.51 | 2021-11-11 19:54:50.598035

psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

 192.168.56.51 | 2021-11-11 19:54:54.633922
 192.168.56.51 | 2021-11-11 19:54:55.655928
 192.168.56.51 | 2021-11-11 19:54:56.679855
 192.168.56.51 | 2021-11-11 19:54:57.70306
 192.168.56.51 | 2021-11-11 19:54:58.725866
 192.168.56.51 | 2021-11-11 19:54:59.749008
 192.168.56.51 | 2021-11-11 19:55:00.770238
 192.168.56.51 | 2021-11-11 19:55:01.791585
 192.168.56.53 | 2021-11-11 19:55:02.779865 <<-- PATRONI BROUGHT POSTGRESQL PROCESS
 192.168.56.51 | 2021-11-11 19:55:03.835348
 192.168.56.53 | 2021-11-11 19:55:04.825825
 192.168.56.51 | 2021-11-11 19:55:05.890109
After 10 secs Patroni brought PostgreSQL back automatically.

2. Stop the PostgreSQL process

Read Write

192.168.56.52 | 2021-11-11 20:05:18.785093
192.168.56.52 | 2021-11-11 20:05:19.806449
192.168.56.52 | 2021-11-11 20:05:20.82694
192.168.56.52 | 2021-11-11 20:05:21.847219
192.168.56.52 | 2021-11-11 20:05:22.868177
192.168.56.52 | 2021-11-11 20:05:23.888856
192.168.56.52 | 2021-11-11 20:05:24.90578
No problems for the read-write process.

Read Only

 
192.168.56.53 | 2021-11-11 20:05:18.990093 <<-- STOP POSTGRESQL PROCESS
 192.168.56.51 | 2021-11-11 20:05:20.04388

psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

 192.168.56.51 | 2021-11-11 20:05:24.08155
 192.168.56.53 | 2021-11-11 20:05:25.073322 <<-- PATRONI BROUGHT POSTGRESQL PROCESS
 
Patroni brought the PostgreSQL instance back in 6 seconds.

3. Reboot the server

Read Write

192.168.56.52 | 2021-11-11 20:10:13.171874
192.168.56.52 | 2021-11-11 20:10:14.193623
192.168.56.52 | 2021-11-11 20:10:15.217776
192.168.56.52 | 2021-11-11 20:10:16.239323
192.168.56.52 | 2021-11-11 20:10:17.257308
192.168.56.52 | 2021-11-11 20:10:18.27552
192.168.56.52 | 2021-11-11 20:10:19.292373
192.168.56.52 | 2021-11-11 20:10:20.310198
192.168.56.52 | 2021-11-11 20:10:21.32735
192.168.56.52 | 2021-11-11 20:10:22.343773
192.168.56.52 | 2021-11-11 20:10:23.361844
192.168.56.52 | 2021-11-11 20:10:24.38691
192.168.56.52 | 2021-11-11 20:10:25.407598
192.168.56.52 | 2021-11-11 20:10:26.429343
192.168.56.52 | 2021-11-11 20:10:27.450577
192.168.56.52 | 2021-11-11 20:10:28.471854
192.168.56.52 | 2021-11-11 20:10:29.492637
192.168.56.52 | 2021-11-11 20:10:30.512336
192.168.56.52 | 2021-11-11 20:10:31.533257
192.168.56.52 | 2021-11-11 20:10:32.554038
192.168.56.52 | 2021-11-11 20:10:33.574338
192.168.56.52 | 2021-11-11 20:10:34.596119
192.168.56.52 | 2021-11-11 20:10:35.615495
192.168.56.52 | 2021-11-11 20:10:36.637819
192.168.56.52 | 2021-11-11 20:10:37.659621
192.168.56.52 | 2021-11-11 20:10:38.682478
192.168.56.52 | 2021-11-11 20:10:39.703187
192.168.56.52 | 2021-11-11 20:10:40.727444
No problems for the write process.

Read Only

192.168.56.53 | 2021-11-11 20:10:12.314665 <<-- REBOOT THE 192.168.56.53 SERVER
 192.168.56.51 | 2021-11-11 20:10:13.304627

psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
 192.168.56.51 | 2021-11-11 20:10:24.340825

psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
 192.168.56.51 | 2021-11-11 20:10:29.42999
 192.168.56.51 | 2021-11-11 20:10:30.44846
 192.168.56.51 | 2021-11-11 20:10:31.470978
 192.168.56.51 | 2021-11-11 20:10:32.49244
 192.168.56.51 | 2021-11-11 20:10:33.515443
 192.168.56.51 | 2021-11-11 20:10:34.53563
 192.168.56.51 | 2021-11-11 20:10:35.553104
 192.168.56.51 | 2021-11-11 20:10:36.572375
 192.168.56.51 | 2021-11-11 20:10:37.595694
 192.168.56.51 | 2021-11-11 20:10:38.620022
 192.168.56.53 | 2021-11-11 20:10:39.644502 <<-- PATRONI STARTED ON BOOT AND STARTED POSTGRESQL PROCESS
 
Patroni started on boot and brought the PostgreSQL instance up automatically after 27 secs.

4. Stop the Patroni process

Read Write

192.168.56.52 | 2021-11-11 20:25:01.931924
192.168.56.52 | 2021-11-11 20:25:02.954774
192.168.56.52 | 2021-11-11 20:25:03.975514
192.168.56.52 | 2021-11-11 20:25:04.99868
192.168.56.52 | 2021-11-11 20:25:06.021456
192.168.56.52 | 2021-11-11 20:25:07.048917
192.168.56.52 | 2021-11-11 20:25:08.071156
192.168.56.52 | 2021-11-11 20:25:09.093902
192.168.56.52 | 2021-11-11 20:25:10.117138
192.168.56.52 | 2021-11-11 20:25:11.138296
192.168.56.52 | 2021-11-11 20:25:12.159975
192.168.56.52 | 2021-11-11 20:25:13.186149
192.168.56.52 | 2021-11-11 20:25:14.20717
192.168.56.52 | 2021-11-11 20:25:15.2286
No problems for the write process.

Read Only

Stopping Patroni stopped the PostgreSQL process and excluded the 192.168.56.53 node from the cluster.

+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 9 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 9 | |
+-----------+---------------+---------+---------+----+-----------+
192.168.56.51 | 2021-11-11 20:24:52.731887

psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.
192.168.56.51 | 2021-11-11 20:24:56.772703

psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.
192.168.56.51 | 2021-11-11 20:25:00.811616

192.168.56.51 | 2021-11-11 20:25:01.837298
192.168.56.51 | 2021-11-11 20:25:02.860275
192.168.56.51 | 2021-11-11 20:25:03.8829
192.168.56.51 | 2021-11-11 20:25:04.906505
192.168.56.51 | 2021-11-11 20:25:05.932158
Start Patroni.

After starting, Patroni brought up the PostgreSQL instance and the node joined the cluster automatically.

+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 9 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 9 | |
| psql13n53 | 192.168.56.53 | Replica | running | 9 | 0 |
+-----------+---------------+---------+---------+----+-----------+
192.168.56.51 | 2021-11-11 20:28:18.473041
192.168.56.51 | 2021-11-11 20:28:19.495974
192.168.56.51 | 2021-11-11 20:28:20.518773
192.168.56.51 | 2021-11-11 20:28:21.541587
192.168.56.51 | 2021-11-11 20:28:22.563967
192.168.56.51 | 2021-11-11 20:28:23.586971
192.168.56.51 | 2021-11-11 20:28:24.608738
192.168.56.53 | 2021-11-11 20:28:25.63165
It took 7 seconds to route traffic to the standby node after starting the Patroni process.


Master Tests

1. Kill PostgreSQL process

Read Write

 192.168.56.52 | 2021-11-11 20:40:55.246602
 192.168.56.52 | 2021-11-11 20:40:56.270163 <<-- KILL POSTGRESQL PROCESS.

psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

 192.168.56.52 | 2021-11-11 20:41:05.318191 <<-- PATRONI BROUGHT POSTGRESQL SERVICE BACK
 192.168.56.52 | 2021-11-11 20:41:06.341719
After killing the PostgreSQL process, Patroni brought the service back to the running state. We had 10 secs of downtime for the writer process.

Read Only

192.168.56.51 | 2021-11-11 20:40:56.774198
192.168.56.53 | 2021-11-11 20:40:57.797533
192.168.56.51 | 2021-11-11 20:40:58.821054
192.168.56.53 | 2021-11-11 20:40:59.843738
192.168.56.51 | 2021-11-11 20:41:00.86877
192.168.56.53 | 2021-11-11 20:41:01.889666
192.168.56.51 | 2021-11-11 20:41:02.912988
192.168.56.53 | 2021-11-11 20:41:03.933952
192.168.56.51 | 2021-11-11 20:41:05.045196
192.168.56.53 | 2021-11-11 20:41:06.078416
No disruption for the read-only requests.

2. Stop the PostgreSQL process

Read Write

  
192.168.56.52 | 2021-11-11 20:52:01.251009

psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.
psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.

192.168.56.52 | 2021-11-11 20:52:08.301596
Patroni brought PostgreSQL back to the running state. An election was not triggered. There were 7 secs of downtime for the writer process.

Read Only

  
192.168.56.53 | 2021-11-11 20:52:01.53767
192.168.56.51 | 2021-11-11 20:52:02.561452
192.168.56.53 | 2021-11-11 20:52:03.583391
192.168.56.51 | 2021-11-11 20:52:04.609092
192.168.56.53 | 2021-11-11 20:52:05.631433
192.168.56.51 | 2021-11-11 20:52:06.656341
192.168.56.53 | 2021-11-11 20:52:07.677131
192.168.56.51 | 2021-11-11 20:52:08.701682
192.168.56.53 | 2021-11-11 20:52:09.730157
3. Reboot the server

Read Write

192.168.56.52 | 2021-11-11 20:59:31.49515

psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

192.168.56.51 | 2021-11-11 20:59:48.650256 <<-- SERVER 192.168.56.51 ELECTED AS THE NEW MASTER:
192.168.56.51 | 2021-11-11 20:59:49.669785
192.168.56.51 | 2021-11-11 20:59:50.687517
Failover happened and one of the standby servers was elected as the new master. We had 17 seconds of downtime for the writer process. On the old master server, Patroni brought up PostgreSQL and performed pg_rewind to recreate the replica.

  
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 14 | |
| psql13n52 | 192.168.56.52 | Replica | running | 14 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 14 | 0 |
+-----------+---------------+---------+---------+----+-----------+
Read Only

192.168.56.51 | 2021-11-11 20:59:29.053858
192.168.56.53 | 2021-11-11 20:59:30.07594
192.168.56.51 | 2021-11-11 20:59:31.105134
192.168.56.53 | 2021-11-11 20:59:32.123691
192.168.56.51 | 2021-11-11 20:59:33.152343
192.168.56.53 | 2021-11-11 20:59:34.170016
192.168.56.51 | 2021-11-11 20:59:35.199209
192.168.56.53 | 2021-11-11 20:59:36.21726
192.168.56.51 | 2021-11-11 20:59:37.238567
192.168.56.53 | 2021-11-11 20:59:38.251579
192.168.56.51 | 2021-11-11 20:59:39.273968
192.168.56.53 | 2021-11-11 20:59:40.288168
192.168.56.51 | 2021-11-11 20:59:41.308803
192.168.56.53 | 2021-11-11 20:59:42.32304
192.168.56.53 | 2021-11-11 20:59:43.339712
192.168.56.53 | 2021-11-11 20:59:44.357711
192.168.56.53 | 2021-11-11 20:59:45.375188
192.168.56.53 | 2021-11-11 20:59:46.395121
192.168.56.53 | 2021-11-11 20:59:47.411711
192.168.56.53 | 2021-11-11 20:59:48.428075
192.168.56.53 | 2021-11-11 20:59:49.445494
192.168.56.53 | 2021-11-11 20:59:50.462092
There were no disruptions for the read-only requests.

4. Stop the Patroni process

Read Write

192.168.56.51 | 2021-11-11 21:08:25.526132

psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.
psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.
psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.

192.168.56.53 | 2021-11-11 21:08:35.60005
192.168.56.53 | 2021-11-11 21:08:36.62634
192.168.56.53 | 2021-11-11 21:08:37.651523
  
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n52 | 192.168.56.52 | Replica | running | 15 | 0 |
| psql13n53 | 192.168.56.53 | Leader | running | 15 | |
+-----------+---------------+---------+---------+----+-----------+
Patroni stopped the PostgreSQL instance and a new master node was elected. We had 10 secs of downtime for the writer process.

After starting Patroni, the old master server was rewound using pg_rewind and joined the cluster as a new replica.

+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 15 | 0 |
| psql13n52 | 192.168.56.52 | Replica | running | 15 | 0 |
| psql13n53 | 192.168.56.53 | Leader | running | 15 | |
+-----------+---------------+---------+---------+----+-----------+
Read Only

192.168.56.53 | 2021-11-11 21:08:25.3516
192.168.56.52 | 2021-11-11 21:08:26.374974
192.168.56.53 | 2021-11-11 21:08:27.397898
192.168.56.52 | 2021-11-11 21:08:28.432293
192.168.56.53 | 2021-11-11 21:08:29.455458
192.168.56.52 | 2021-11-11 21:08:30.479256
192.168.56.53 | 2021-11-11 21:08:31.500499
192.168.56.52 | 2021-11-11 21:08:32.525148
192.168.56.53 | 2021-11-11 21:08:33.54793
192.168.56.52 | 2021-11-11 21:08:34.571675
192.168.56.52 | 2021-11-11 21:08:35.610965
192.168.56.52 | 2021-11-11 21:08:36.639712
There were no disruptions for the read-only process.


Network Isolation Tests

1. Network isolate the master server

Read Write

192.168.56.53 | 2021-11-11 22:15:37.376374 <<-- COMMUNICATION BLOCKED

192.168.56.52 | 2021-11-11 22:15:08.169172
192.168.56.52 | 2021-11-11 22:15:09.190167
192.168.56.52 | 2021-11-11 22:15:10.211688
192.168.56.52 | 2021-11-11 22:15:11.232966
192.168.56.52 | 2021-11-11 22:15:12.254794
192.168.56.52 | 2021-11-11 22:15:13.276149
192.168.56.52 | 2021-11-11 22:15:14.29847
192.168.56.52 | 2021-11-11 22:15:15.319335
192.168.56.52 | 2021-11-11 22:15:16.343936
Communication was blocked on the master (read-write) node. A new master was elected. We had 31 secs of downtime for the writer application.

  
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 16 | |
+-----------+---------------+---------+---------+----+-----------+
Bringing back communication to the old master server did not rejoin it as a replica automatically. After restarting Patroni, it brought up the PostgreSQL instance on 192.168.56.53 as a replica.

+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 16 | |
| psql13n53 | 192.168.56.53 | Replica | running | 16 | 0 |
+-----------+---------------+---------+---------+----+-----------+
Read Only

192.168.56.51 | 2021-11-11 22:15:11.676438
192.168.56.51 | 2021-11-11 22:15:12.699285
192.168.56.51 | 2021-11-11 22:15:13.722465
192.168.56.51 | 2021-11-11 22:15:14.74705
192.168.56.51 | 2021-11-11 22:15:15.77105
192.168.56.51 | 2021-11-11 22:15:16.794407
192.168.56.51 | 2021-11-11 22:15:17.816547
192.168.56.51 | 2021-11-11 22:15:18.838761

192.168.56.53 | 2021-11-11 22:25:57.360616
192.168.56.51 | 2021-11-11 22:25:58.390982
192.168.56.53 | 2021-11-11 22:25:59.42245
192.168.56.51 | 2021-11-11 22:26:00.450804
192.168.56.53 | 2021-11-11 22:26:01.480687
192.168.56.51 | 2021-11-11 22:26:02.510569
192.168.56.53 | 2021-11-11 22:26:03.540663
192.168.56.51 | 2021-11-11 22:26:04.574112
192.168.56.53 | 2021-11-11 22:26:05.606363
192.168.56.51 | 2021-11-11 22:26:06.635608
After adding the old master server back to the cluster as a replica, it started to accept read-only requests.

2. Network isolate a standby server

Read Write

192.168.56.52 | 2021-11-11 22:28:22.539789
192.168.56.52 | 2021-11-11 22:28:23.559629
192.168.56.52 | 2021-11-11 22:28:24.580749
192.168.56.52 | 2021-11-11 22:28:25.925264
192.168.56.52 | 2021-11-11 22:28:26.946179
192.168.56.52 | 2021-11-11 22:28:27.969459
192.168.56.52 | 2021-11-11 22:28:28.991379
192.168.56.52 | 2021-11-11 22:28:30.013173
192.168.56.52 | 2021-11-11 22:28:31.032617
192.168.56.52 | 2021-11-11 22:28:32.053455
192.168.56.52 | 2021-11-11 22:28:33.074863
192.168.56.52 | 2021-11-11 22:28:34.096192
192.168.56.52 | 2021-11-11 22:28:35.116744
There were no problems for the writer application.

Read Only

192.168.56.51 | 2021-11-11 22:28:03.186052
192.168.56.53 | 2021-11-11 22:28:04.208455
192.168.56.51 | 2021-11-11 22:28:05.23119
192.168.56.51 | 2021-11-11 22:28:16.665654

psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.

192.168.56.51 | 2021-11-11 22:28:30.700107
192.168.56.51 | 2021-11-11 22:28:31.721761
192.168.56.51 | 2021-11-11 22:28:32.744021
192.168.56.51 | 2021-11-11 22:28:33.766453
192.168.56.51 | 2021-11-11 22:28:34.789146
192.168.56.51 | 2021-11-11 22:28:35.811602
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 16 | |
+-----------+---------------+---------+---------+----+-----------+
The isolated standby server was excluded from the cluster configuration.

After bringing back communication, the standby node rejoined the cluster automatically.

+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 16 | |
| psql13n53 | 192.168.56.53 | Replica | running | 16 | 0 |
+-----------+---------------+---------+---------+----+-----------+

Switchover

Manually switch the primary role to one of the replicas and bring the old primary back into the cluster as a new replica.

$ patronictl -c /opt/app/patroni/etc/postgresql.yml list

+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 17 | |
| psql13n52 | 192.168.56.52 | Replica | running | 17 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 17 | 0 |
+-----------+---------------+---------+---------+----+-----------+
$ patronictl -c /opt/app/patroni/etc/postgresql.yml switchover
Master [psql13n51]:
Candidate ['psql13n52', 'psql13n53'] []: psql13n52
When should the switchover take place (e.g. 2021-11-15T22:08 ) [now]:
Current cluster topology
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 17 | |
| psql13n52 | 192.168.56.52 | Replica | running | 17 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 17 | 0 |
+-----------+---------------+---------+---------+----+-----------+
Are you sure you want to switchover cluster postgres, demoting current master psql13n51? [y/N]: y
2021-11-15 21:10:28.05685 Successfully switched over to "psql13n52"
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | stopped | | unknown |
| psql13n52 | 192.168.56.52 | Leader | running | 17 | |
| psql13n53 | 192.168.56.53 | Replica | running | 17 | 0 |
+-----------+---------------+---------+---------+----+-----------+
192.168.56.51 | 2021-11-15 21:10:21.190417
192.168.56.51 | 2021-11-15 21:10:22.223856
192.168.56.51 | 2021-11-15 21:10:23.259458
192.168.56.51 | 2021-11-15 21:10:24.293523
192.168.56.51 | 2021-11-15 21:10:25.329155

psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.
192.168.56.51 | 2021-11-15 21:10:30.379076
192.168.56.51 | 2021-11-15 21:10:31.40607
192.168.56.52 | 2021-11-15 21:10:32.417283
192.168.56.51 | 2021-11-15 21:10:33.450491
192.168.56.52 | 2021-11-15 21:10:34.468676
192.168.56.52 | 2021-11-15 21:10:35.494665
192.168.56.52 | 2021-11-15 21:10:36.517738
192.168.56.52 | 2021-11-15 21:10:37.541415
192.168.56.52 | 2021-11-15 21:10:38.567083
Node 192.168.56.52 became the new primary and 192.168.56.51 joined the cluster as a new replica. Downtime for the read-write interface was 7 secs.

+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 18 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 18 | |
| psql13n53 | 192.168.56.53 | Replica | running | 18 | 0 |
+-----------+---------------+---------+---------+----+-----------+

Failover

Although failover can also be triggered manually, it is mostly executed automatically, when the leader node becomes unavailable for an unplanned reason. We have already seen automatic failovers in the previous tests.

For this test I will trigger a failover manually.

patronictl -c /opt/app/patroni/etc/postgresql.yml failover
Candidate ['psql13n51', 'psql13n53'] []: psql13n51
Current cluster topology
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 18 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 18 | |
| psql13n53 | 192.168.56.53 | Replica | running | 18 | 0 |
+-----------+---------------+---------+---------+----+-----------+

Are you sure you want to failover cluster postgres, demoting current master psql13n52? [y/N]: y
2021-11-15 21:25:04.85489 Successfully failed over to "psql13n51"
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 18 | |
| psql13n52 | 192.168.56.52 | Replica | stopped | | unknown |
| psql13n53 | 192.168.56.53 | Replica | running | 18 | 0 |
+-----------+---------------+---------+---------+----+-----------+
Node 192.168.56.51 became the new master and node 192.168.56.52 joined the cluster as a replica. Downtime was 7 secs.

+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 19 | |
| psql13n52 | 192.168.56.52 | Replica | running | 19 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 |
+-----------+---------------+---------+---------+----+-----------+

Maintenance mode

Sometimes it is necessary to do maintenance on a single node and you do not want Patroni to manage the cluster, for example when performing a PostgreSQL upgrade.

When Patroni is paused, it won't change the state of PostgreSQL - it will not try to automatically start the PostgreSQL instance when it is stopped.

For the test we will stop a replica and check whether Patroni starts the database automatically as in the previous tests.

[postgres@psql13n52 ~]$ patronictl -c /opt/app/patroni/etc/postgresql.yml pause
Success: cluster management is paused

[postgres@psql13n51 ~]$ patronictl -c /opt/app/patroni/etc/postgresql.yml list
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 19 | |
| psql13n52 | 192.168.56.52 | Replica | running | 19 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 |
+-----------+---------------+---------+---------+----+-----------+
 Maintenance mode: on
Notice - "Maintenance mode: on".

Stop the replica:

$ pg_ctl -D /var/lib/pgsql/14/data stop
waiting for server to shut down.... done
server stopped
Patroni didn't bring the database back up.

$ patronictl -c /opt/app/patroni/etc/postgresql.yml list
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 19 | |
| psql13n52 | 192.168.56.52 | Replica | stopped | | unknown |
| psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 |
+-----------+---------------+---------+---------+----+-----------+
 Maintenance mode: on
Resume Patroni.

$ patronictl -c /opt/app/patroni/etc/postgresql.yml resume
Success: cluster management is resumed
The node joined the cluster after a few seconds.

$ patronictl -c /opt/app/patroni/etc/postgresql.yml list
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 19 | |
| psql13n52 | 192.168.56.52 | Replica | running | 19 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 |
+-----------+---------------+---------+---------+----+-----------+




References:
http://highscalability.com/blog/2019/9/16/managing-high-availability-in-postgresql-part-iii-patroni.html

Sunday, November 7, 2021

Deploying PostgreSQL 14.0 for High Availability using Patroni, etcd, HAProxy and keepalived on CentOS 8


Patroni is an automatic failover system for PostgreSQL. It provides automatic and manual failover and keeps all vital data in a distributed configuration store (DCS). Database connections do not go directly to the database nodes but are routed via a connection proxy like HAProxy. The proxy determines the active/master node.

By using a proxy for routing connections, the risk of a split-brain scenario is very limited.
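For reference, this is how such routing is typically wired up: HAProxy health-checks each node against Patroni's REST API (port 8008) and sends writes only to the node answering as primary. The sketch below is an assumption of a typical setup for this topology, not the exact configuration used here; listen-section names and timing options are illustrative.

```
listen postgres_write
    bind *:5000
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server psql13n51 192.168.56.51:5432 maxconn 100 check port 8008
    server psql13n52 192.168.56.52:5432 maxconn 100 check port 8008
    server psql13n53 192.168.56.53:5432 maxconn 100 check port 8008

listen postgres_read
    bind *:5001
    mode tcp
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    server psql13n51 192.168.56.51:5432 maxconn 100 check port 8008
    server psql13n52 192.168.56.52:5432 maxconn 100 check port 8008
    server psql13n53 192.168.56.53:5432 maxconn 100 check port 8008
```

Only the node whose Patroni answers 200 on /primary receives traffic on port 5000, which is what makes the failovers in the tests above transparent to the writer.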

With Patroni, all dynamic settings are stored in the DCS in order to have complete consistency across the participating nodes.

In this blog post I will focus on building a Patroni cluster on top of CentOS 8, using clustered etcd and HAProxy for routing database connections to the primary server.
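Keepalived, mentioned in the title, is what serves the 192.168.56.100 address the clients connect to: it floats a virtual IP across the HAProxy nodes. A minimal VRRP sketch; the interface name, router id, priorities, and the pidof-based check are assumptions, not the exact configuration used here:

```
vrrp_script chk_haproxy {
    script "/usr/sbin/pidof haproxy"   # node is eligible only while HAProxy runs
    interval 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth1
    virtual_router_id 51
    priority 101                       # lower priority (e.g. 100) on the backups
    advert_int 1
    virtual_ipaddress {
        192.168.56.100
    }
    track_script {
        chk_haproxy
    }
}
```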


OS setup

Firewalld and SELinux need to be adjusted before configuring the Patroni cluster.


Firewalld

The ports required for operating patroni/etcd/haproxy/postgresql are the following:

5432 - PostgreSQL standard port, not used by PostgreSQL itself but by HAProxy
5000 - PostgreSQL listening port used by HAproxy to route the database connections to write node
5001 - PostgreSQL listening port used by HAproxy to route the database connections to read nodes
2380 - etcd peer urls port required by the etcd members communication
2379 - etcd client port required by any client including patroni to communicate with etcd
8008 - patroni rest api port required by HAProxy to check the nodes status
7000 - HAProxy port to expose the proxy’s statistics
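Assuming firewalld is active with its default zone, the ports above can be opened on every node with a loop like this (a sketch; adapt the zone to your environment):

```shell
# Open all ports required by PostgreSQL/HAProxy/etcd/Patroni, then reload
for port in 5432 5000 5001 2380 2379 8008 7000; do
  sudo firewall-cmd --permanent --add-port=${port}/tcp
done
sudo firewall-cmd --reload
```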

selinux

SELinux by default prevents new services from binding to all IP addresses.

In order to allow HAProxy to bind to the ports required for its functionality, we need to run this command:

sudo setsebool -P haproxy_connect_any=1

Initial setup

$ cat /etc/hosts
192.168.56.51 psql13n51
192.168.56.52 psql13n52
192.168.56.53 psql13n53

ETCD

Etcd is a fault-tolerant, distributed key-value store used to store the state of the Postgres cluster. Via Patroni, all of the Postgres nodes make use of etcd to keep the Postgres cluster up and running.

In production, it may be best to use a larger etcd cluster so that the failure of one etcd node doesn't affect the Postgres servers.

Download and Install the etcd Binaries (All nodes)

Install etcd on all three nodes.

ETCD_VER=v3.5.1

# choose either URL
GOOGLE_URL=https://storage.googleapis.com/etcd
GITHUB_URL=https://github.com/etcd-io/etcd/releases/download
DOWNLOAD_URL=${GOOGLE_URL}

rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
rm -rf /tmp/etcd-download-test && mkdir -p /tmp/etcd-download-test

curl -L ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /tmp/etcd-download-test --strip-components=1
rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz

/tmp/etcd-download-test/etcd --version
/tmp/etcd-download-test/etcdctl version
/tmp/etcd-download-test/etcdutl version
Move binaries to /usr/local/bin directory.

mv /tmp/etcd-download-test/etcd* /usr/local/bin/
Check etcd and etcdctl version.

$ etcd --version
etcd Version: 3.5.1
Git SHA: e8732fb5f
Go Version: go1.16.3
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.1
API version: 3.5
Configure Etcd Systemd service:

Create etcd directories and user (All nodes)

Create etcd system user:

sudo groupadd --system etcd
sudo useradd -s /sbin/nologin --system -g etcd etcd
Set /var/lib/etcd/ directory ownership to etcd user:

sudo mkdir -p /var/lib/etcd/
sudo mkdir /etc/etcd
sudo chown -R etcd:etcd /var/lib/etcd/
sudo chmod -R 700 /var/lib/etcd/
Configure the etcd on all nodes.

On each server, save these variables by running the commands below.

INT_NAME="eth1"
#INT_NAME="ens3"
ETCD_HOST_IP=$(ip addr show $INT_NAME | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)
ETCD_NAME=$(hostname -s)
Where:
INT_NAME - The name of your network interface to be used for cluster traffic. Change it to match your server configuration.
ETCD_HOST_IP - The internal IP address of the specified network interface. This is used to serve client requests and communicate with etcd cluster peers.
ETCD_NAME – Each etcd member must have a unique name within the cluster. The command above sets the etcd name to match the hostname of the current compute instance.

Check variables to confirm they have correct values:

echo $INT_NAME
echo $ETCD_HOST_IP
echo $ETCD_NAME
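To see what the ETCD_HOST_IP pipeline actually extracts, here is the same grep/awk/cut chain applied to a captured sample line of `ip addr show` output (the sample line is illustrative, not from a live host):

```shell
# Sample line as printed by `ip addr show eth1` (illustrative data)
sample='    inet 192.168.56.51/24 brd 192.168.56.255 scope global eth1'
# Same pipeline as above: keep the inet line, take field 2, strip the /24 prefix length
ip=$(echo "$sample" | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)
echo "$ip"   # → 192.168.56.51
```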
Once all variables are set, create the etcd.service systemd unit file:

Create a systemd service file for etcd. Replace --listen-client-urls with your server IPs.
etcd 3.5 enables only the v3 API by default, while Patroni (through the python-etcd client used here) still speaks the v2 API, so it is important to set --enable-v2=true.

cat << EOF > /lib/systemd/system/etcd.service
[Unit]
Description=etcd service
Documentation=https://github.com/coreos/etcd
 
[Service]
User=etcd
Type=notify
ExecStart=/usr/local/bin/etcd \\
 --name ${ETCD_NAME} \\
 --enable-v2=true \\
 --data-dir /var/lib/etcd \\
 --initial-advertise-peer-urls http://${ETCD_HOST_IP}:2380 \\
 --listen-peer-urls http://${ETCD_HOST_IP}:2380 \\
 --listen-client-urls http://${ETCD_HOST_IP}:2379,http://127.0.0.1:2379 \\
 --advertise-client-urls http://${ETCD_HOST_IP}:2379 \\
 --initial-cluster-token etcd-cluster-1 \\
 --initial-cluster psql13n51=http://192.168.56.51:2380,psql13n52=http://192.168.56.52:2380,psql13n53=http://192.168.56.53:2380 \\
 --initial-cluster-state new \\
 --heartbeat-interval 1000 \\
 --election-timeout 5000
Restart=on-failure
RestartSec=5
 
[Install]
WantedBy=multi-user.target
EOF
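A note on the two timing flags at the end of the unit file: etcd requires the election timeout to be larger than the heartbeat interval, and the etcd tuning guide suggests a ratio of roughly 10x (the defaults are 100 ms and 1000 ms). The values above give a 5x ratio, which still satisfies the hard requirement:

```shell
# Sanity check on the unit file's timing flags (values copied from above, in ms)
heartbeat=1000
election=5000
# Hard requirement: election timeout must exceed the heartbeat interval
[ "$election" -gt "$heartbeat" ] && echo "ok: election timeout is $(( election / heartbeat ))x the heartbeat"
```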
For CentOS / RHEL Linux distributions, set SELinux mode to permissive.

sudo setenforce 0
sudo sed -i 's/^SELINUX=.*/SELINUX=permissive/g' /etc/selinux/config
If you have an active firewall service, allow ports 2379 and 2380.

# RHEL / CentOS / Fedora firewalld
sudo firewall-cmd --add-port={2379,2380}/tcp --permanent
sudo firewall-cmd --reload

# Ubuntu/Debian
sudo ufw allow proto tcp from any to any port 2379,2380
Bootstrap The etcd Cluster

Once all the configurations are applied on the three servers, start and enable the newly created etcd service on all the nodes. The first server will act as the bootstrap node, and one node will be automatically elected as leader once the service is started on all three nodes.

# systemctl daemon-reload
# systemctl enable etcd
# systemctl start etcd.service
# systemctl status -l etcd.service
Test Etcd Cluster installation

Test your setup by listing the etcd cluster members:

# etcdctl member list
Check leader on host:

# etcdctl endpoint status --write-out=table
Also check cluster health by running the command:

# etcdctl endpoint health

127.0.0.1:2379 is healthy: successfully committed proposal: took = 4.383594ms
Let’s also try writing to etcd.

# etcdctl put /message "Hello World"
Read the value of message back – it should work from any node.

# etcdctl get /message
Hello World
Watchdog

Watchdog devices reset the whole system when they do not receive a keepalive heartbeat within a specified timeframe. This adds an additional layer of fail-safe in case the usual Patroni split-brain protection mechanisms fail. Deploying a watchdog is recommended when running a PostgreSQL HA configuration in production.

Install on all nodes.

yum -y install watchdog

/sbin/modprobe softdog
Patroni will be the component interacting with the watchdog device. Since Patroni is run by the postgres user, we need to either set the permissions of the watchdog device open enough so the postgres user can write to it or make the device owned by postgres itself, which we consider a safer approach (as it is more restrictive):

Include the softdog kernel module so it loads on CentOS boot. It is better to load the module through the default CentOS mechanism, /etc/rc.modules, rather than via /etc/rc.local:

echo modprobe softdog >> /etc/rc.modules
chmod +x /etc/rc.modules

Create a udev rule so the watchdog device is owned by the postgres user:

sudo sh -c 'echo "KERNEL==\"watchdog\", OWNER=\"postgres\", GROUP=\"postgres\"" >> /etc/udev/rules.d/61-watchdog.rules'
Check whether the module is blacklisted by default or a stray file with such a directive is still lingering around:

$ grep blacklist /lib/modprobe.d/* /etc/modprobe.d/* | grep softdog

If the command returns a match, edit that file on each node to remove the blacklist line, then restart the servers. Verify the module is loaded:

$ lsmod | grep softdog
softdog                16384  0

[root@localhost ~]# ls -l /dev/watchdog*
crw-------. 1 root root  10, 130 Nov  5 11:13 /dev/watchdog
crw-------. 1 root root 248,   0 Nov  5 11:13 /dev/watchdog0
PostgreSQL

Install PostgreSQL on all nodes.

By default, the built-in postgresql DNF module enables an older PostgreSQL version and does not include PostgreSQL 14. Confirm with this command:

sudo dnf module list postgresql
Install the repository RPM using this command:

sudo dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
Then to avoid conflicts, let us disable the built-in PostgreSQL module:

sudo dnf -qy module disable postgresql
Finally install PostgreSQL 14 server:

sudo dnf install -y postgresql14-server
Let’s also install the Contrib package which provides several additional features for the PostgreSQL database system:

sudo dnf install -y postgresql14-contrib
An important concept to understand in a PostgreSQL HA environment like this one is that PostgreSQL should not be started automatically by systemd during the server initialization: we should leave it to Patroni to fully manage it, including the process of starting and stopping the server. Thus, we should disable the service:

sudo systemctl disable postgresql-14
Start with a fresh new PostgreSQL setup and let Patroni bootstrap the cluster. Remove the data directory that has been created as part of the PostgreSQL installation:

sudo systemctl stop postgresql-14
sudo rm -fr /var/lib/pgsql/14/data
Patroni

Patroni is a cluster manager used to customize and automate deployment and maintenance of PostgreSQL HA (High Availability) clusters. Check for the latest available release on its GitHub page.


Install Patroni and python client for ETCD on all 3 nodes:

# yum install patroni-etcd
# yum install python3-etcd
If you have an active firewall service, allow port 5432 on all nodes.

# RHEL / CentOS / Fedora firewalld
sudo firewall-cmd --add-port=5432/tcp --permanent
sudo firewall-cmd --reload

# Ubuntu/Debian
sudo ufw allow proto tcp from any to any port 5432
Alternatively, the etcd Python client can be installed with pip:

pip install python-etcd
Here’s the configuration file we have used for psql13n51:

cat /opt/app/patroni/etc/postgresql.yml

scope: postgres
name: psql13n51

restapi:
    listen: 0.0.0.0:8008
    connect_address: 192.168.56.51:8008

etcd:
    host: psql13n51:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        logging_collector: "on"
        max_wal_senders: 5
        max_replication_slots: 5

  initdb:
  - encoding: UTF8
  - data-checksums

  pg_hba:
  - host replication replicator 127.0.0.1/32 trust
  - host replication replicator 192.168.56.1/24 md5
  - host all all 192.168.56.1/24 md5
  - host all all 0.0.0.0/0 md5

  users:
    admin:
      password: admin
      options:
        - createrole
        - createdb

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.56.51:5432
  data_dir: "/var/lib/pgsql/14/data"
  bin_dir: "/usr/pgsql-14/bin"
  pgpass: /tmp/pgpass
  authentication:
    replication:
      username: replicator
      password: vagrant
    superuser:
      username: postgres
      password: vagrant
  parameters:
    unix_socket_directories: '/var/run/postgresql'

watchdog:
  mode: required
  device: /dev/watchdog
  safety_margin: 5

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false
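One relationship in this file worth calling out: with `mode: required`, Patroni arms the watchdog using the DCS `ttl` and the `safety_margin` above. As I understand Patroni's behavior, the watchdog timeout is set to ttl minus safety_margin, so a hung node self-resets before its leader lease expires:

```shell
# Watchdog timeout implied by the settings above (ttl and safety_margin from the YAML)
ttl=30
safety_margin=5
echo "watchdog timeout: $(( ttl - safety_margin ))s"   # → watchdog timeout: 25s
```

This is consistent with the "Software Watchdog activated with 25 second timeout" line Patroni logs at bootstrap.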
Validate configuration:

patroni --validate-config /opt/app/patroni/etc/postgresql.yml
Let's bootstrap the cluster as the postgres user, using the parameters from the YAML file:

# sudo su - postgres
patroni /opt/app/patroni/etc/postgresql.yml
2021-11-06 07:20:41,692 INFO: postmaster pid=1863
2021-11-06 07:20:41.704 UTC [1863] LOG:  redirecting log output to logging collector process
2021-11-06 07:20:41.704 UTC [1863] HINT:  Future log output will appear in directory "log".
localhost:5432 - rejecting connections
localhost:5432 - accepting connections
2021-11-06 07:20:41,807 INFO: establishing a new patroni connection to the postgres cluster
2021-11-06 07:20:41,820 INFO: running post_bootstrap
2021-11-06 07:20:41,844 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
2021-11-06 07:20:41,864 INFO: initialized a new cluster
2021-11-06 07:20:51,859 INFO: no action. I am (psql13n51) the leader with the lock
2021-11-06 07:20:51,880 INFO: no action. I am (psql13n51) the leader with the lock
2021-11-06 07:21:01,883 INFO: no action. I am (psql13n51) the leader with the lock
2021-11-06 07:21:11,877 INFO: no action. I am (psql13n51) the leader with the lock
2021-11-06 07:21:21,878 INFO: no action. I am (psql13n51) the leader with the lock
2021-11-06 07:21:31,877 INFO: no action. I am (psql13n51) the leader with the lock
2021-11-06 07:21:41,880 INFO: no action. I am (psql13n51) the leader with the lock
Next, edit the postgresql.yml file on the psql13n52 node and add the following configuration parameters.
Make sure you change the name, etcd host, and restapi/postgresql connect_address values:

cat /opt/app/patroni/etc/postgresql.yml

scope: postgres
name: psql13n52

restapi:
    listen: 0.0.0.0:8008
    connect_address: 192.168.56.52:8008

etcd:
    host: psql13n52:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        logging_collector: "on"
        max_wal_senders: 5
        max_replication_slots: 5

  initdb:
  - encoding: UTF8
  - data-checksums

  pg_hba:
  - host replication replicator 127.0.0.1/32 trust
  - host replication replicator 192.168.56.1/24 md5
  - host all all 192.168.56.1/24 md5
  - host all all 0.0.0.0/0 md5

  users:
    admin:
      password: admin
      options:
        - createrole
        - createdb

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.56.52:5432
  data_dir: "/var/lib/pgsql/14/data"
  bin_dir: "/usr/pgsql-14/bin"
  pgpass: /tmp/pgpass
  authentication:
    replication:
      username: replicator
      password: vagrant
    superuser:
      username: postgres
      password: vagrant
  parameters:
    unix_socket_directories: '/var/run/postgresql'

watchdog:
  mode: required
  device: /dev/watchdog
  safety_margin: 5

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false
Validate configuration:

patroni --validate-config /opt/app/patroni/etc/postgresql.yml
Run as postgres user on psql13n52 node:

# sudo su - postgres
patroni /opt/app/patroni/etc/postgresql.yml
2021-11-06 07:23:25,827 INFO: Selected new etcd server http://192.168.56.53:2379
2021-11-06 07:23:25,831 INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-11-06 07:23:25,839 INFO: Lock owner: psql13n51; I am psql13n52
2021-11-06 07:23:25,843 INFO: trying to bootstrap from leader 'psql13n51'
2021-11-06 07:23:26,237 INFO: replica has been created using basebackup
2021-11-06 07:23:26,238 INFO: bootstrapped from leader 'psql13n51'
2021-11-06 07:23:26,398 INFO: postmaster pid=1506
localhost:5432 - no response
2021-11-06 07:23:26.435 UTC [1506] LOG:  redirecting log output to logging collector process
2021-11-06 07:23:26.435 UTC [1506] HINT:  Future log output will appear in directory "log".
localhost:5432 - accepting connections
localhost:5432 - accepting connections
2021-11-06 07:23:27,449 INFO: Lock owner: psql13n51; I am psql13n52
2021-11-06 07:23:27,449 INFO: establishing a new patroni connection to the postgres cluster
2021-11-06 07:23:27,478 INFO: no action. I am a secondary (psql13n52) and following a leader (psql13n51)
2021-11-06 07:23:31,874 INFO: no action. I am a secondary (psql13n52) and following a leader (psql13n51)
2021-11-06 07:23:41,879 INFO: no action. I am a secondary (psql13n52) and following a leader (psql13n51)
Next, edit the postgresql.yml file on psql13n53 in the same way:

cat /opt/app/patroni/etc/postgresql.yml

scope: postgres
name: psql13n53

restapi:
    listen: 0.0.0.0:8008
    connect_address: 192.168.56.53:8008

etcd:
    host: psql13n53:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        logging_collector: "on"
        max_wal_senders: 5
        max_replication_slots: 5

  initdb:
  - encoding: UTF8
  - data-checksums

  pg_hba:
  - host replication replicator 127.0.0.1/32 trust
  - host replication replicator 192.168.56.1/24 md5
  - host all all 192.168.56.1/24 md5
  - host all all 0.0.0.0/0 md5

  users:
    admin:
      password: admin
      options:
        - createrole
        - createdb

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 192.168.56.53:5432
  data_dir: "/var/lib/pgsql/14/data"
  bin_dir: "/usr/pgsql-14/bin"
  pgpass: /tmp/pgpass
  authentication:
    replication:
      username: replicator
      password: vagrant
    superuser:
      username: postgres
      password: vagrant
  parameters:
    unix_socket_directories: '/var/run/postgresql'

watchdog:
  mode: required
  device: /dev/watchdog
  safety_margin: 5

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false
Validate configuration:

patroni --validate-config /opt/app/patroni/etc/postgresql.yml
Run as postgres user:

# sudo su - postgres
patroni /opt/app/patroni/etc/postgresql.yml
2021-11-06 07:25:26,664 INFO: Selected new etcd server http://192.168.56.53:2379
2021-11-06 07:25:26,667 INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-11-06 07:25:26,673 INFO: Lock owner: psql13n51; I am psql13n53
2021-11-06 07:25:26,676 INFO: trying to bootstrap from leader 'psql13n51'
2021-11-06 07:25:27,102 INFO: replica has been created using basebackup
2021-11-06 07:25:27,102 INFO: bootstrapped from leader 'psql13n51'
2021-11-06 07:25:27,262 INFO: postmaster pid=1597
localhost:5432 - no response
2021-11-06 07:25:27.299 UTC [1597] LOG:  redirecting log output to logging collector process
2021-11-06 07:25:27.299 UTC [1597] HINT:  Future log output will appear in directory "log".
localhost:5432 - accepting connections
localhost:5432 - accepting connections
2021-11-06 07:25:28,312 INFO: Lock owner: psql13n51; I am psql13n53
2021-11-06 07:25:28,313 INFO: establishing a new patroni connection to the postgres cluster
2021-11-06 07:25:28,340 INFO: no action. I am a secondary (psql13n53) and following a leader (psql13n51)
2021-11-06 07:25:31,877 INFO: no action. I am a secondary (psql13n53) and following a leader (psql13n51)
Check the state of the Patroni cluster:

# patronictl -c /opt/app/patroni/etc/postgresql.yml list

+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member    | Host          | Role    | State   | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader  | running |  1 |           |
| psql13n52 | 192.168.56.52 | Replica | running |  1 |         0 |
| psql13n53 | 192.168.56.53 | Replica | running |  1 |         0 |
+-----------+---------------+---------+---------+----+-----------+
Node psql13n51 started the Patroni cluster, so it was automatically made the leader – and thus the primary/master PostgreSQL server. Nodes psql13n52 and psql13n53 are configured as read replicas (as the hot_standby option was enabled in Patroni’s configuration file).

Check PostgreSQL configuration parameters:

patronictl -c /opt/app/patroni/etc/postgresql.yml show-config postgres

loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    hot_standby: 'on'
    logging_collector: 'on'
    max_replication_slots: 5
    max_wal_senders: 5
    wal_level: replica
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 30
With the configuration file in place, and now that the etcd cluster is up, all that remains is to run Patroni as a systemd service.

Configure the Patroni service on every node:

# vi /etc/systemd/system/patroni.service
 
[Unit]
Description=Runners to orchestrate a high-availability PostgreSQL
After=syslog.target network.target etcd.target
 
[Service]
Type=simple
 
User=postgres
Group=postgres
 
ExecStart=/usr/bin/patroni /opt/app/patroni/etc/postgresql.yml
 
KillMode=process
 
TimeoutSec=30
 
Restart=no
 
[Install]
WantedBy=multi-user.target
# systemctl status patroni
# systemctl start patroni
# systemctl enable patroni
 
# systemctl status etcd
# systemctl enable etcd
Reboot all 3 nodes:

# reboot
Check the status of the Patroni service after the reboot; the service should be up and running.

# systemctl status patroni
● patroni.service - Runners to orchestrate a high-availability PostgreSQL
   Loaded: loaded (/etc/systemd/system/patroni.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2021-11-05 23:33:22 UTC; 8h ago
 Main PID: 705 (patroni)
    Tasks: 14 (limit: 11401)
   Memory: 146.1M
   CGroup: /system.slice/patroni.service
           ├─ 705 /usr/bin/python3 /usr/bin/patroni /opt/app/patroni/etc/postgresql.yml
           ├─1278 /usr/pgsql-14/bin/postgres -D /var/lib/pgsql/14/data --config-file=/var/lib/pgsql/14/data/postgresql.conf --listen_addresses=0.0.0.0 --port=5432 --cluster_name=postgres --wal_level=replica --hot_standby=on --max_conn>
           ├─1280 postgres: postgres: logger
           ├─1282 postgres: postgres: checkpointer
           ├─1283 postgres: postgres: background writer
           ├─1284 postgres: postgres: stats collector
           ├─1359 postgres: postgres: postgres postgres 127.0.0.1(37558) idle
           ├─1371 postgres: postgres: walwriter
           ├─1372 postgres: postgres: autovacuum launcher
           └─1373 postgres: postgres: logical replication launcher

Nov 06 07:43:00 psql13n51 patroni[705]: 2021-11-06 07:43:00,048 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
Nov 06 07:43:00 psql13n51 patroni[705]: 2021-11-06 07:43:00,058 INFO: promoted self to leader by acquiring session lock
Nov 06 07:43:00 psql13n51 patroni[705]: server promoting
Nov 06 07:43:00 psql13n51 patroni[705]: 2021-11-06 07:43:00,064 INFO: cleared rewind state after becoming the leader
Check the state of the Patroni cluster again:

# patronictl -c /opt/app/patroni/etc/postgresql.yml list
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member    | Host          | Role    | State   | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader  | running |  5 |           |
| psql13n52 | 192.168.56.52 | Replica | running |  5 |         0 |
| psql13n53 | 192.168.56.53 | Replica | running |  5 |         0 |
+-----------+---------------+---------+---------+----+-----------+
PostgreSQL itself stays disabled from automatic startup: it is fully managed by Patroni, which we enabled as a service above.

Keepalived

Keepalived is used for virtual IP failover between the servers.
Download the latest Keepalived release from https://www.keepalived.org/download.html.

Run this on all 3 nodes:

wget https://www.keepalived.org/software/keepalived-2.2.4.tar.gz
Unpack archive and configure:

# tar xvfz keepalived-2.2.4.tar.gz
# cd keepalived-2.2.4
# ./configure
Fix any errors before running the make command.

In my case I needed to install the openssl and libnl3 development packages.

configure: error:
  !!! OpenSSL is not properly installed on your system. !!!
  !!! Can not include OpenSSL headers files.            !!!

yum -y install openssl openssl-devel

# ./configure
*** WARNING - this build will not support IPVS with IPv6. Please install libnl/libnl-3 dev libraries to support IPv6 with IPVS.

yum -y install libnl3 libnl3-devel

# ./configure
When there is no error then run make and make install:

# make && make install
For Keepalived startup the same service is created on all servers.

# cat /etc/systemd/system/keepalived.service
[Unit]
Description=LVS and VRRP High Availability Monitor
After=network-online.target syslog.target
Wants=network-online.target
[Service]
Type=forking
PIDFile=/run/keepalived.pid
KillMode=process
EnvironmentFile=-/usr/local/etc/sysconfig/keepalived
ExecStart=/usr/local/sbin/keepalived $KEEPALIVED_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
[Install]
WantedBy=multi-user.target
Before starting Keepalived, we have to create its configuration file on all servers.

psql13n51:

# cat /etc/keepalived/keepalived.conf
global_defs {
}
vrrp_script chk_haproxy { # Requires keepalived-1.1.13
    script "killall -0 haproxy" # widely used idiom
    interval 2 # check every 2 seconds
    weight 2 # add 2 points of prio if OK
}
vrrp_instance VI_1 {
    interface eth0
    state MASTER # or "BACKUP" on backup
    priority 101 # 101 on master, 100 on backup
    virtual_router_id 51
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    virtual_ipaddress {
        192.168.56.100
    }
    track_script {
        chk_haproxy
    }
}
psql13n52 & psql13n53:

# cat /etc/keepalived/keepalived.conf
global_defs {
}
vrrp_script chk_haproxy { # Requires keepalived-1.1.13
    script "killall -0 haproxy" # widely used idiom
    interval 2 # check every 2 seconds
    weight 2 # add 2 points of prio if OK
}
vrrp_instance VI_1 {
    interface eth0
    state BACKUP # or "BACKUP" on backup
    priority 100 # 101 on master, 100 on backup
    virtual_router_id 51
    authentication {
        auth_type PASS
        auth_pass 1234
    }
    virtual_ipaddress {
        192.168.56.100
    }
    track_script {
        chk_haproxy
    }
}
Now start keepalived service on all servers

# systemctl start keepalived
# systemctl status keepalived
# systemctl enable keepalived
The VIP 192.168.56.100 runs on one server at a time and will automatically fail over to another server if there is any issue with the current holder.
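The failover decision behind this is plain VRRP priority arithmetic: chk_haproxy adds `weight 2` to a node's base priority as long as `killall -0 haproxy` succeeds, and the highest effective priority holds the VIP. With the base priorities from the configs above:

```shell
# Effective VRRP priorities with the configs above (weight 2 added while HAProxy is alive)
master_base=101
backup_base=100
weight=2
echo "healthy master:        $(( master_base + weight ))"   # 103, holds the VIP
echo "healthy backup:        $(( backup_base + weight ))"   # 102
echo "master, haproxy dead:  $master_base"                  # 101, loses the VIP to a healthy backup
```

So the VIP moves not only when a whole server goes down, but also when just the HAProxy process dies on the VIP holder.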

HAProxy

Instead of connecting directly to the database server, the application will be connecting to the proxy instead, which will forward the request to PostgreSQL. When HAproxy is used for this, it is also possible to route read requests to one or more replicas, for load balancing. With HAproxy, this is done by providing two different ports for the application to connect. We opted for the following setup:

Writes → 5000
Reads → 5001

HAProxy is a lightweight service; it can be installed on a dedicated server or, as in our case, on the database servers themselves.

sudo yum -y install haproxy
Set configuration on all nodes:

$ cat /etc/haproxy/haproxy.cfg
global
    maxconn 100
 
defaults
    log    global
    mode    tcp
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s
 
listen stats
    mode http
    bind *:7000
    stats enable
    stats uri /
 
listen primary
    bind *:5000
    option httpchk OPTIONS /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server psql13n51 psql13n51:5432 maxconn 100 check port 8008
    server psql13n52 psql13n52:5432 maxconn 100 check port 8008
    server psql13n53 psql13n53:5432 maxconn 100 check port 8008
 
listen standbys
    balance roundrobin
    bind *:5001
    option httpchk OPTIONS /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server psql13n51 psql13n51:5432 maxconn 100 check port 8008
    server psql13n52 psql13n52:5432 maxconn 100 check port 8008
    server psql13n53 psql13n53:5432 maxconn 100 check port 8008
Note there are two sections: primary, using port 5000, and standbys, using port 5001. All three nodes are included in both sections: that’s because they are all potential candidates to be either primary or secondary.

For HAproxy to know which role each node currently has, it will send an HTTP request to port 8008 of the node: Patroni will answer. Patroni provides a built-in REST API support for health check monitoring that integrates perfectly with HAproxy for this:

$ curl -s http://psql13n51:8008
{"state": "running", "postmaster_start_time": "2021-11-06 15:01:56.197081+00:00", "role": "replica", "server_version": 140000, "cluster_unlocked": false, "xlog": {"received_location": 83888920, "replayed_location": 83888920, "replayed_timestamp": null, "paused": false}, "timeline": 6, "database_system_identifier": "7027353509639501631", "patroni": {"version": "2.1.1", "scope": "postgres"}}
Notice how we received "role": "replica" from Patroni when we sent HTTP request to standby server. HAProxy uses this information for query routing.
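Strictly speaking, HAProxy never parses this JSON: the /master and /replica endpoints answer the OPTIONS health check with HTTP 200 or 503, and `http-check expect status 200` keys off the status code alone. If you want to pull the role out of the JSON by hand, for example in a monitoring script without jq, a sed one-liner is enough (the response string below is trimmed from the curl output above):

```shell
# Trimmed sample of the Patroni REST response shown above
resp='{"state": "running", "role": "replica", "timeline": 6}'
# Extract the value of the "role" key without jq
role=$(echo "$resp" | sed -n 's/.*"role": "\([^"]*\)".*/\1/p')
echo "$role"   # → replica
```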

We configured the standbys group to balance read-requests in a round-robin fashion, so each connection request (or reconnection) will alternate between the available replicas.

Let’s start HAProxy on all three nodes:

# systemctl enable haproxy.service
# systemctl start haproxy.service
# systemctl status haproxy.service
Create a password file so we can test master and standby connections:

sudo su - postgres
echo "localhost:5000:postgres:postgres:vagrant" > ~/.pgpass
echo "localhost:5001:postgres:postgres:vagrant" >> ~/.pgpass
chmod 0600 ~/.pgpass
We can then execute two read-requests to verify the round-robin mechanism is working as intended:

$ psql -Upostgres -hlocalhost -p5001 -t -c "select inet_server_addr()"
192.168.56.51

$ psql -Upostgres -hlocalhost -p5001 -t -c "select inet_server_addr()"
192.168.56.52
Master (read/write) connection:

$ psql -Upostgres -hlocalhost -p5000 -t -c "select inet_server_addr()"
192.168.56.53
You can also check the state of HAproxy by visiting http://192.168.56.51:7000/ on your browser.

# patronictl -c /opt/app/patroni/etc/postgresql.yml list

+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member    | Host          | Role    | State   | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running |  7 |         0 |
| psql13n52 | 192.168.56.52 | Replica | running |  7 |         0 |
| psql13n53 | 192.168.56.53 | Leader  | running |  7 |           |
+-----------+---------------+---------+---------+----+-----------+

We have set up a three-node Patroni cluster with no single point of failure (SPOF). In the next part we will test disaster and failover scenarios on this configuration under an active read/write workload.


Reference:
https://github.com/zalando/patroni
https://patroni.readthedocs.io/en/latest/
https://www.percona.com/blog/2021/06/11/postgresql-ha-with-patroni-your-turn-to-test-failure-scenarios/
https://blog.dbi-services.com/postgresql-high-availabilty-patroni-ectd-haproxy-keepalived/
https://digitalis.io/blog/technology/part1-postgresql-ha-patroni-etcd-haproxy/


Wednesday, November 11, 2020

ProxySQL - Throttle for MySQL queries

ProxySQL is a great high availability and load balancing solution and it is mostly used for such purposes. But ProxySQL offers much more.

One of the nice features is the throttling mechanism for queries to the backends.

Imagine you have a very active system where applications execute queries at a very high rate, which is not so unusual nowadays. If just one of those queries slows down, you could easily end up with many active sessions running the same query. A single problematic query could cause high resource usage and general slowness.

Usually the DBA is called, but the DBA cannot modify the query, disable the problematic application, or change the database model without detailed analysis.

But ProxySQL can help.
Using ProxySQL we can delay execution of the problematic queries.
Yes, the specific application requests would still be slow, but we would avoid a general problem/downtime and "buy" some time for the fix.


Let's simulate such a situation in the test environment.

Run benchmark test using sysbench.

NUM_THREADS=1
TEST_DIR=/usr/share/sysbench/tests/include/oltp_legacy

sysbench \
  --test=${TEST_DIR}/oltp_simple.lua \
  --oltp-table-size=2000000 \
  --time=300 \
  --max-requests=0 \
  --mysql-table-engine=InnoDB \
  --mysql-user=sbtest \
  --mysql-password=sbtest \
  --mysql-port=3307 \
  --mysql-host=192.168.56.25 \
  --mysql-engine-trx=yes \
  --num-threads=$NUM_THREADS \
  prepare
  
sysbench \
  --test=${TEST_DIR}/oltp_simple.lua \
  --oltp-table-size=2000000 \
  --time=180 \
  --max-requests=0 \
  --mysql-table-engine=InnoDB \
  --mysql-user=sbtest \
  --mysql-password=sbtest \
  --mysql-port=3307 \
  --mysql-host=192.168.56.25 \
  --mysql-engine-trx=yes \
  --num-threads=$NUM_THREADS \
  run

Enable the throttling mechanism and delay execution of all queries globally by setting "mysql-default_query_delay=100" (milliseconds).

  
ProxySQLServer> set mysql-default_query_delay=100;
ProxySQLServer> LOAD MYSQL VARIABLES TO RUNTIME; 
ProxySQLServer> SAVE MYSQL VARIABLES TO DISK;

Run the test again and check latency (ms).

  
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            1774
        write:                           0
        other:                           0
        total:                           1774
    transactions:                        1774   (9.85 per sec.)
    queries:                             1774   (9.85 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          180.0942s
    total number of events:              1774

Latency (ms):
         min:                                  100.76 <<<<<<<<<<<<<<<<<
         avg:                                  101.51 <<<<< Throttling
         max:                                  129.17 <<<<<<<<<<<<<<<<<
         95th percentile:                      102.97
         sum:                               180083.66

Threads fairness:
    events (avg/stddev):           1774.0000/0.00
    execution time (avg/stddev):   180.0837/0.00

Disable throttling and reset ProxySQL counters.

    
ProxySQLServer> set mysql-default_query_delay=0;
ProxySQLServer> LOAD MYSQL VARIABLES TO RUNTIME; SAVE MYSQL VARIABLES TO DISK;
ProxySQLServer> select * from stats_mysql_query_digest_reset;

Check latency(ms).

  
Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            641413
        write:                           0
        other:                           0
        total:                           641413
    transactions:                        641413 (3563.38 per sec.)
    queries:                             641413 (3563.38 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          180.0004s
    total number of events:              641413

Latency (ms):
         min:                                    0.19
         avg:                                    0.28
         max:                                   44.45
         95th percentile:                        0.43
         sum:                               179252.76

Threads fairness:
    events (avg/stddev):           641413.0000/0.00
    execution time (avg/stddev):   179.2528/0.00

Enable throttling for a specific query only, using ProxySQL mysql query rules.

 
-- Find problematic query

ProxySQLServer> select hostgroup,username,count_star,
(count_star/(select Variable_Value from stats_mysql_global where Variable_Name='ProxySQL_Uptime')) 
as avg_per_sec, digest, digest_text from stats_mysql_query_digest order by count_star desc limit 10;

+-----------+----------+------------+-------------+--------------------+----------------------------------+
| hostgroup | username | count_star | avg_per_sec | digest             | digest_text                      |
+-----------+----------+------------+-------------+--------------------+----------------------------------+
| 2         | sbtest   | 641413     | 78          | 0xBF001A0C13781C1D | SELECT c FROM sbtest1 WHERE id=? |
+-----------+----------+------------+-------------+--------------------+----------------------------------+
1 row in set (0.00 sec)


-- Reset counters

ProxySQLServer> select * from stats_mysql_query_digest_reset;
+-----------+------------+----------+----------------+--------------------+----------------------------------+------------+------------+------------+-----------+----------+----------+-------------------+---------------+
| hostgroup | schemaname | username | client_address | digest             | digest_text                      | count_star | first_seen | last_seen  | sum_time  | min_time | max_time | sum_rows_affected | sum_rows_sent |
+-----------+------------+----------+----------------+--------------------+----------------------------------+------------+------------+------------+-----------+----------+----------+-------------------+---------------+
| 2         | sbtest     | sbtest   |                | 0xBF001A0C13781C1D | SELECT c FROM sbtest1 WHERE id=? | 641413     | 1601934890 | 1601935070 | 153023170 | 159      | 44349    | 0                 | 214399        |
+-----------+------------+----------+----------------+--------------------+----------------------------------+------------+------------+------------+-----------+----------+----------+-------------------+---------------+
1 row in set (0.00 sec)

Insert a mysql query rule to enable throttling just for that specific query.

   
ProxySQLServer> insert into mysql_query_rules(rule_id,active,digest,delay,apply) values (1,1,'0xBF001A0C13781C1D',100,1);
Query OK, 1 row affected (0.00 sec)

LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
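To verify the rule is actually matching, you can check its hit counter, and drop the rule once the application-side fix is deployed (a sketch using rule_id 1 as inserted above; `stats_mysql_query_rules` is part of the ProxySQL admin schema):

```sql
-- How many queries have matched the throttling rule so far
SELECT rule_id, hits FROM stats_mysql_query_rules;

-- Remove the throttle when it is no longer needed
DELETE FROM mysql_query_rules WHERE rule_id = 1;
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
```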

Compare "min_time" between executions.

 
Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            1773
        write:                           0
        other:                           0
        total:                           1773
    transactions:                        1773   (9.85 per sec.)
    queries:                             1773   (9.85 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          180.0325s
    total number of events:              1773

Latency (ms):
         min:                                  100.78 <<<<<<<<<<<<<<<<<
         avg:                                  101.53 <<<<< Throttling
         max:                                  104.77 <<<<<<<<<<<<<<<<<
         95th percentile:                      102.97
         sum:                               180021.34

Threads fairness:
    events (avg/stddev):           1773.0000/0.00
    execution time (avg/stddev):   180.0213/0.00

 
ProxySQLServer> select * from stats_mysql_query_digest_reset;
+-----------+------------+----------+----------------+--------------------+----------------------------------+------------+------------+------------+-----------+----------+----------+-------------------+---------------+
| hostgroup | schemaname | username | client_address | digest             | digest_text                      | count_star | first_seen | last_seen  | sum_time  | min_time | max_time | sum_rows_affected | sum_rows_sent |
+-----------+------------+----------+----------------+--------------------+----------------------------------+------------+------------+------------+-----------+----------+----------+-------------------+---------------+
| 2         | sbtest     | sbtest   |                | 0xBF001A0C13781C1D | SELECT c FROM sbtest1 WHERE id=? | 1773       | 1601935408 | 1601935588 | 179697522 | 100681   | 104195   | 0                 | 594           |
+-----------+------------+----------+----------------+--------------------+----------------------------------+------------+------------+------------+-----------+----------+----------+-------------------+---------------+
1 row in set (0.01 sec)

Wednesday, January 24, 2018

Galera Cluster Schema Changes, Row Based Replication and Data Inconsistency

Galera Cluster is a virtually synchronous multi-master replication plug-in. With Galera Cluster an application can write to any node, and transactions are then applied to all servers via row-based replication events.

This is built-in MySQL row-based replication, which supports replication with differing table definitions between Master and Slave.
So, when using row-based replication, the source and target tables do not have to be identical. A table on the master can have more or fewer columns or use different data types.

But there are limitations you must watch for, depending on the MySQL version you are running:
- The database and table names must be the same on both Master and Slave
- Columns must be in the same order before any additional column
- Each extra column must have a default value
- ...

Newer MySQL versions may tolerate more differences between source and target table - check documentation for your version.


I want to show you what could happen to your data if you do not pay attention to these limitations.


Suppose I have a 3-node MariaDB Galera Cluster with table t.
I want to add several columns to the table while the database is being used by an application.

For this task I will use the built-in Rolling Schema Upgrade (RSU) method, which lets me perform schema changes on a node without impacting the rest of the cluster.

Add column c4 to table t, following the rules above for row-based replication.

Table t has three columns and one row inserted.
NODE1

MariaDB [testdb]> create table t (c1 varchar(10), c2 varchar(10), c3 varchar(10));
Query OK, 0 rows affected (0.37 sec)

MariaDB [testdb]> insert into t values ('n1-1','n1-1','n1-1');
Query OK, 1 row affected (0.00 sec)


NODE2

MariaDB [testdb]> select * from t;
+------+------+------+
| c1   | c2   | c3   |
+------+------+------+
| n1-1 | n1-1 | n1-1 |
+------+------+------+
1 row in set (0.00 sec)


NODE3

MariaDB [testdb]> select * from t;
+------+------+------+
| c1   | c2   | c3   |
+------+------+------+
| n1-1 | n1-1 | n1-1 |
+------+------+------+
1 row in set (0.01 sec)

I will enable RSU mode, which ensures that this server will not impact the rest of the cluster during the ALTER command execution.

Add column c4 and INSERT a row, simulating application activity.
MariaDB [testdb]> set session wsrep_OSU_method='RSU';
Query OK, 0 rows affected (0.00 sec)

MariaDB [testdb]> alter table t add column c4 varchar(10);
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

MariaDB [testdb]> set session wsrep_OSU_method='TOI';
Query OK, 0 rows affected (0.00 sec)

MariaDB [testdb]> insert into t(c1,c2,c3) values ('n1-1','n1-1','n1-1');
Query OK, 1 row affected (0.13 sec)

While the table definition differs between Node1 and the rest of the cluster, INSERT a few more rows on the other nodes.

NODE2

insert into t(c1,c2,c3) values ('n2-1','n2-1','n2-1');


NODE3

insert into t(c1,c2,c3) values ('n3-1','n3-1','n3-1');

Check rows from table t.
NODE1

MariaDB [testdb]> select * from t;
+------+------+------+------+
| c1   | c2   | c3   | c4   |
+------+------+------+------+
| n1-1 | n1-1 | n1-1 | NULL |
| n1-1 | n1-1 | n1-1 | NULL |
| n2-1 | n2-1 | n2-1 | NULL |
| n3-1 | n3-1 | n3-1 | NULL |
+------+------+------+------+
4 rows in set (0.00 sec)

NODE2

MariaDB [testdb]> select * from t;
+------+------+------+
| c1   | c2   | c3   |
+------+------+------+
| n1-1 | n1-1 | n1-1 |
| n1-1 | n1-1 | n1-1 |
| n2-1 | n2-1 | n2-1 |
| n3-1 | n3-1 | n3-1 |
+------+------+------+
4 rows in set (0.00 sec)


NODE3

MariaDB [testdb]> select * from t;
+------+------+------+
| c1   | c2   | c3   |
+------+------+------+
| n1-1 | n1-1 | n1-1 |
| n1-1 | n1-1 | n1-1 |
| n2-1 | n2-1 | n2-1 |
| n3-1 | n3-1 | n3-1 |
+------+------+------+
4 rows in set (0.01 sec)

As you can see, everything is still OK with my data.

Add the new column on Node2 and Node3, following the same steps as on Node1.

NODE2

MariaDB [testdb]> set session wsrep_OSU_method='RSU';
Query OK, 0 rows affected (0.00 sec)

MariaDB [testdb]> alter table t add column c4 varchar(10);
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

MariaDB [testdb]> set session wsrep_OSU_method='TOI';
Query OK, 0 rows affected (0.00 sec)


NODE3

MariaDB [testdb]> set session wsrep_OSU_method='RSU';
Query OK, 0 rows affected (0.00 sec)

MariaDB [testdb]> alter table t add column c4 varchar(10);
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

MariaDB [testdb]> set session wsrep_OSU_method='TOI';
Query OK, 0 rows affected (0.00 sec)

And my task is complete. I have successfully changed the table definition.


But what happens if I add a new column between existing columns?
Remember, this is not permitted for row-based replication and can cause replication to break, or something even worse.

Enable RSU mode on Node1 and add new column c11 after column c1.
INSERT a row, simulating an active application during the schema change.

NODE1

MariaDB [testdb]> set session wsrep_OSU_method='RSU';
Query OK, 0 rows affected (0.00 sec)

MariaDB [testdb]>
MariaDB [testdb]> alter table t add column c11 varchar(10) after c1;
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

MariaDB [testdb]> set session wsrep_OSU_method='TOI';
Query OK, 0 rows affected (0.00 sec)

MariaDB [testdb]> insert into t(c1,c2,c3) values ('n1-1','n1-1','n1-1');
Query OK, 1 row affected (0.01 sec)

MariaDB [testdb]> select * from t;
+------+------+------+------+------+
| c1   | c11  | c2   | c3   | c4   |
+------+------+------+------+------+
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n2-1 | NULL | n2-1 | n2-1 | NULL |
| n3-1 | NULL | n3-1 | n3-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
+------+------+------+------+------+
5 rows in set (0.00 sec)

INSERT a row on the other nodes, since Galera Cluster allows writes on any node in the cluster.

NODE2

MariaDB [testdb]> insert into t(c1,c2,c3) values ('n2-1','n2-1','n2-1');
Query OK, 1 row affected (0.01 sec)

MariaDB [testdb]> select * from t;
+------+------+------+------+
| c1   | c2   | c3   | c4   |
+------+------+------+------+
| n1-1 | n1-1 | n1-1 | NULL |
| n1-1 | n1-1 | n1-1 | NULL |
| n2-1 | n2-1 | n2-1 | NULL |
| n3-1 | n3-1 | n3-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 |
| n2-1 | n2-1 | n2-1 | NULL |
+------+------+------+------+
6 rows in set (0.00 sec)


NODE3

MariaDB [testdb]> insert into t(c1,c2,c3) values ('n3-1','n3-1','n3-1');
Query OK, 1 row affected (0.01 sec)

MariaDB [testdb]> select * from t;
+------+------+------+------+
| c1   | c2   | c3   | c4   |
+------+------+------+------+
| n1-1 | n1-1 | n1-1 | NULL |
| n1-1 | n1-1 | n1-1 | NULL |
| n2-1 | n2-1 | n2-1 | NULL |
| n3-1 | n3-1 | n3-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 |
| n2-1 | n2-1 | n2-1 | NULL |
| n3-1 | n3-1 | n3-1 | NULL |
+------+------+------+------+
7 rows in set (0.00 sec)

The INSERT commands executed successfully and replication looks fine.
There are no errors in error.log suggesting any problem.

But check the content of table t on the first node, where the new column was added.

NODE1

MariaDB [testdb]> select * from t;
+------+------+------+------+------+
| c1   | c11  | c2   | c3   | c4   |
+------+------+------+------+------+
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n2-1 | NULL | n2-1 | n2-1 | NULL |
| n3-1 | NULL | n3-1 | n3-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n2-1 | n2-1 | n2-1 | NULL | NULL |
| n3-1 | n3-1 | n3-1 | NULL | NULL |
+------+------+------+------+------+
7 rows in set (0.00 sec)

Notice how the rows differ between nodes, and we should have exactly the same data on all three nodes.


Let's complete schema changes on other two nodes.
NODE2

MariaDB [testdb]> set session wsrep_OSU_method='RSU';
Query OK, 0 rows affected (0.00 sec)

MariaDB [testdb]> alter table t add column c11 varchar(10) after c1;
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

MariaDB [testdb]> set session wsrep_OSU_method='TOI';
Query OK, 0 rows affected (0.00 sec)


NODE3

MariaDB [testdb]> set session wsrep_OSU_method='RSU';
Query OK, 0 rows affected (0.00 sec)

MariaDB [testdb]> alter table t add column c11 varchar(10) after c1;
Query OK, 0 rows affected (0.34 sec)
Records: 0  Duplicates: 0  Warnings: 0

MariaDB [testdb]> set session wsrep_OSU_method='TOI';
Query OK, 0 rows affected (0.00 sec)

I have successfully added the new column without breaking replication, and everything seems OK, but my data is not consistent between the nodes.

NODE1

MariaDB [testdb]> select * from t;
+------+------+------+------+------+
| c1   | c11  | c2   | c3   | c4   |
+------+------+------+------+------+
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n2-1 | NULL | n2-1 | n2-1 | NULL |
| n3-1 | NULL | n3-1 | n3-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n2-1 | n2-1 | n2-1 | NULL | NULL |
| n3-1 | n3-1 | n3-1 | NULL | NULL |
+------+------+------+------+------+
7 rows in set (0.00 sec)


NODE2

MariaDB [testdb]> select * from t;
+------+------+------+------+------+
| c1   | c11  | c2   | c3   | c4   |
+------+------+------+------+------+
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n2-1 | NULL | n2-1 | n2-1 | NULL |
| n3-1 | NULL | n3-1 | n3-1 | NULL |
| n1-1 | NULL | NULL | n1-1 | n1-1 |
| n2-1 | NULL | n2-1 | n2-1 | NULL |
| n3-1 | NULL | n3-1 | n3-1 | NULL |
+------+------+------+------+------+
7 rows in set (0.00 sec)


NODE3

MariaDB [testdb]> select * from t;
+------+------+------+------+------+
| c1   | c11  | c2   | c3   | c4   |
+------+------+------+------+------+
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n2-1 | NULL | n2-1 | n2-1 | NULL |
| n3-1 | NULL | n3-1 | n3-1 | NULL |
| n1-1 | NULL | NULL | n1-1 | n1-1 |
| n2-1 | NULL | n2-1 | n2-1 | NULL |
| n3-1 | NULL | n3-1 | n3-1 | NULL |
+------+------+------+------+------+
7 rows in set (0.00 sec)


Data inconsistency is the worst problem that can happen in a synchronous cluster configuration.
It can happen without any notice, but sooner or later it will break the replication apply process and the failing node will be excluded from the cluster.
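A simple way to spot such divergence before it breaks the apply process is to compare table checksums across the nodes (CHECKSUM TABLE is a standard MySQL/MariaDB statement; this only detects the problem, it does not repair it):

```sql
-- Run on every node and compare the Checksum values;
-- differing values mean the nodes have diverged
CHECKSUM TABLE testdb.t;
```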





REFERENCE
https://dev.mysql.com/doc/refman/5.7/en/replication-features-differing-tables.html


Saturday, January 20, 2018

Global variable wsrep_OSU_method=RSU ignored on MariaDB Galera Cluster 10.0.32

Galera Cluster is a great solution for database clustering, but it has some limitations. One of the most critical is how Galera Cluster performs schema changes.

From Galera Cluster documentation:
http://galeracluster.com/documentation-webpages/clusterstallonalter.html
What’s happening is a side effect of a multi-master cluster with several appliers.
The cluster needs to control when a DDL statement ends in relation to other 
transactions, in order to deterministically detect conflicts and schedule
parallel appliers. Effectively, the DDL statement must execute in isolation.

Galera Cluster has a 65K window of tolerance for transactions applied in parallel,
but the cluster must wait when ALTER commands take too long.

Basically, if you run a DDL statement on a slightly bigger table, you will probably freeze your cluster for the duration of the DDL statement. And you cannot cancel this operation.

The built-in alternative for performing schema changes is the "Rolling Schema Upgrade" (RSU) method.

The MariaDB Galera Cluster documentation has a note on how to perform DDL changes on Galera Cluster.
https://mariadb.com/kb/en/library/tips-on-converting-to-galera/
A "fast" DDL may as well be done via TOI. This is a tentative list of such:
* CREATE/DROP/RENAME DATABASE/TABLE
* ALTER to change DEFAULT
* ALTER to change definition of ENUM or SET (see caveats in manual)
* Certain PARTITION ALTERs that are fast.
* DROP INDEX (other than PRIMARY KEY)
* ADD INDEX?
* Other ALTERs on 'small' tables.
* With 5.6 and especially 5.7 having a lot of ALTER ALGORITHM=INPLACE cases, 
check which ALTERs should be done which way.
Otherwise, use RSU. Do the following separately for each node: 
1. SET GLOBAL wsrep_OSU_method='RSU'; -- This also takes the node out of the cluster. 
2. ALTER TABLE ... 
3. SET GLOBAL wsrep_OSU_method='TOI'; -- Puts back in, leading to resync 
(hopefully a quick IST, not a slow SST)

Let’s test this solution.

My testing environment:
Server version: 10.0.32-MariaDB-1~trusty-wsrep mariadb.org binary distribution, wsrep_25.20.rc3fc46e

I will add a column to the sbtest table.

Node1:
SET GLOBAL wsrep_OSU_method='RSU';

alter table sbtest add column text varchar(20);

In error.log I didn't notice any information about the node being desynced from the cluster.

Node2 lock check:
MariaDB [(none)]> select trx_query,trx_tables_locked
                  from information_schema.innodb_trx;
+------------------------------------------------------+-------------------+
| trx_query                                            | trx_tables_locked |
+------------------------------------------------------+-------------------+
| alter table sbtest add column text varchar(20)       |                 1 |
+------------------------------------------------------+-------------------+
1 row in set (0.00 sec)

Even though I globally enabled the RSU method on Node1 with the statement noted in the documentation, Node1 was not desynced from the cluster and the DDL statement was executed on the other nodes.
If this were a production database, I would have big problems during the ALTER command.

But what if I enable the RSU method using the session-level variable?

Node1:
SET SESSION wsrep_OSU_method=RSU;

alter table sbtest add column text1 varchar(20);

SET SESSION wsrep_OSU_method=TOI;

From error.log:
180120 11:26:34 [Note] WSREP: Member 0.0 (mariadb-test-1) desyncs itself from group
180120 11:26:34 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 8)
180120 11:26:34 [Note] WSREP: Provider paused at b95703dd-fd37-11e7-a31f-1af15b5193fa:8 (21)
180120 11:35:03 [Note] WSREP: resuming provider at 21
180120 11:35:03 [Note] WSREP: Provider resumed.
180120 11:35:03 [Note] WSREP: Member 0.0 (mariadb-test-1) resyncs itself to group
180120 11:35:03 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 8)
180120 11:35:03 [Note] WSREP: Member 0.0 (mariadb-test-1) synced with group.
180120 11:35:03 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 8)
180120 11:35:03 [Note] WSREP: Synchronized with group, ready for connections

Node2 lock check:
MariaDB [(none)]> select trx_query,trx_tables_locked 
                  from information_schema.innodb_trx;
Empty set (0.00 sec)

Now Node1 is desynced from the cluster and the ALTER statement runs locally, without impact on the rest of the cluster.
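Besides the error log, the desync is also visible in the wsrep status variables while the session-level RSU ALTER is running (run this on Node1; the value mirrors the state transitions logged above):

```sql
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
-- Donor/Desynced while the node is out of the cluster,
-- Synced again after wsrep_OSU_method is set back to TOI
```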


I don't know whether this behaviour is expected for this version, or if it is a bug.

When running Galera Cluster in production it is VERY important to take some time and perform tests, especially if you need to run ALTER statements on big tables.
Expect the unexpected, even when running statements from the documentation.


Monday, November 13, 2017

HASH GROUP BY not used when using more than 354 aggregate functions

A few days ago we had a performance problem with one of our main application views. This was a complex view that used a lot of aggregate functions to transpose rows into columns.

When a developer added a few more aggregate functions for new columns, query performance changed significantly.

After a quick analysis I noticed one change in the execution plan.

The HASH GROUP BY aggregation was replaced with a less performant SORT GROUP BY. I tried to force HASH GROUP BY using hints, but nothing helped.


We reproduced the problem using dummy tables, and then a colleague found what was triggering the plan change.

In this example I have a query with 354 unique aggregate functions which uses HASH GROUP BY.

SELECT
          *
      FROM (SELECT LEVEL ID
             FROM DUAL CONNECT BY LEVEL < 1000) VANJSKI,
          (  SELECT
                    123 UNUTARNJI_ID,
                     sum(1) kolona0,
                     sum(1) kolona1,
                     sum(2) kolona2,
...
...
...
                     sum(350) kolona350    ,
                     sum(351) kolona351    ,
                     sum(352) kolona352    ,
                     sum(353) kolona353    ,
                     sum(354) kolona354
               FROM DUAL
           GROUP BY 123) UNUTARNJI
    WHERE     VANJSKI.ID = UNUTARNJI.UNUTARNJI_ID(+);

Plan hash value: 2294628051
---------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| A-Rows |   A-Time   |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |      |      1 |        |       |     5 (100)|    999 |00:00:00.01 |       |       |          |
|*  1 |  HASH JOIN OUTER               |      |      1 |      1 |  4631 |     5  (20)|    999 |00:00:00.01 |  2293K|  2293K| 1549K (0)|
|   2 |   VIEW                         |      |      1 |      1 |    13 |     2   (0)|    999 |00:00:00.01 |       |       |          |
|   3 |    CONNECT BY WITHOUT FILTERING|      |      1 |        |       |            |    999 |00:00:00.01 |       |       |          |
|   4 |     FAST DUAL                  |      |      1 |      1 |       |     2   (0)|      1 |00:00:00.01 |       |       |          |
|   5 |   VIEW                         |      |      1 |      1 |  4618 |     2   (0)|      1 |00:00:00.01 |       |       |          |
|   6 |    HASH GROUP BY               |      |      1 |      1 |       |     2   (0)|      1 |00:00:00.01 |   677K|   677K|  723K (0)|
|   7 |     FAST DUAL                  |      |      1 |      1 |       |     2   (0)|      1 |00:00:00.01 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------

Notice what happens if I change the "sum(1) kolona0" function to "sum(355) kolona0", bringing the number of unique aggregate functions to 355.

SELECT
          *
      FROM (SELECT LEVEL ID
             FROM DUAL CONNECT BY LEVEL < 1000) VANJSKI,
          (  SELECT
                    123 UNUTARNJI_ID,
                     sum(355) kolona0,
                     sum(1) kolona1,
                     sum(2) kolona2,
...
...
...
                     sum(350) kolona350    ,
                     sum(351) kolona351    ,
                     sum(352) kolona352    ,
                     sum(353) kolona353    ,
                     sum(354) kolona354
               FROM DUAL
           GROUP BY 123) UNUTARNJI
    WHERE     VANJSKI.ID = UNUTARNJI.UNUTARNJI_ID(+);

Plan hash value: 2326946862
---------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| A-Rows |   A-Time   |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |      |      1 |        |       |     5 (100)|    999 |00:00:00.01 |       |       |          |
|*  1 |  HASH JOIN OUTER               |      |      1 |      1 |  4631 |     5  (20)|    999 |00:00:00.01 |  2293K|  2293K| 1645K (0)|
|   2 |   VIEW                         |      |      1 |      1 |    13 |     2   (0)|    999 |00:00:00.01 |       |       |          |
|   3 |    CONNECT BY WITHOUT FILTERING|      |      1 |        |       |            |    999 |00:00:00.01 |       |       |          |
|   4 |     FAST DUAL                  |      |      1 |      1 |       |     2   (0)|      1 |00:00:00.01 |       |       |          |
|   5 |   VIEW                         |      |      1 |      1 |  4618 |     2   (0)|      1 |00:00:00.01 |       |       |          |
|   6 |    SORT GROUP BY               |      |      1 |      1 |       |     2   (0)|      1 |00:00:00.01 | 20480 | 20480 |18432  (0)|
|   7 |     FAST DUAL                  |      |      1 |      1 |       |     2   (0)|      1 |00:00:00.01 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------

Query execution plan changed - HASH GROUP BY was replaced with SORT GROUP BY.


This was obviously a limitation of HASH GROUP BY, but I couldn't find more information in the Oracle docs or via Google, so I asked Oracle Support for confirmation.

From Oracle Support I received the answer that a similar case had been filed as a bug and closed as "not a bug", without a workaround. With the default DB_BLOCK_SIZE, the limit is 354 aggregate functions.

To solve the performance problem we changed the view to avoid the HASH GROUP BY limitation.
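One possible rewrite (a sketch, not necessarily the exact change we made) is to split the aggregates across two inline views, each staying under the 354-function limit, and join them on the group key, shown here with just a few columns:

```sql
SELECT a.UNUTARNJI_ID, a.kolona1, a.kolona2, b.kolona3, b.kolona4
  FROM (SELECT 123 UNUTARNJI_ID, sum(1) kolona1, sum(2) kolona2
          FROM DUAL GROUP BY 123) a
  JOIN (SELECT 123 UNUTARNJI_ID, sum(3) kolona3, sum(4) kolona4
          FROM DUAL GROUP BY 123) b
    ON a.UNUTARNJI_ID = b.UNUTARNJI_ID;
```

Each inline view then aggregates fewer than 354 functions and remains eligible for HASH GROUP BY.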

Testing environment - Oracle Database 12c Enterprise Edition Release 12.1.0.2.0