Tuesday, November 16, 2021

PostgreSQL HA - Patroni, ETCD, HAProxy, Keepalived - Test failure scenarios

In this blog post I will test a few failover and maintenance scenarios and show the results.

Just to mention, this is not a proper production test. Before considering this setup for production it would be good to put the cluster under a realistic load, simulate slow IO response times, memory crashes, etc., and check the cluster behavior.

In these tests I am only checking how the cluster behaves when resources are started and stopped in various scenarios.

Standby tests

1. Kill PostgreSQL process - Downtime: 10 secs on one RO node
   - No problems for the writer process.
   - Patroni brought the PostgreSQL instance back automatically.
2. Stop the PostgreSQL process - Downtime: 27 secs on one RO node
   - No problems for the read-write process.
   - Patroni brought the PostgreSQL instance back automatically.
3. Reboot the server - Downtime: 27 secs on one RO node
   - No problems for the write process.
   - Patroni started on boot and brought the PostgreSQL instance up automatically.
4. Stop the Patroni process - Downtime: 25 secs on one RO node
   - No problem for the write process.
   - Stopping Patroni stopped the PostgreSQL process and excluded the 192.168.56.53 node from the cluster.
   - After starting Patroni, it brought the PostgreSQL instance up and the node rejoined the cluster automatically.
Master tests

1. Kill PostgreSQL process - Downtime: 10 secs on the RW interface
   - After killing the PostgreSQL process, Patroni brought the service back to the running state.
   - No disruption for the read-only requests.
2. Stop the PostgreSQL process - Downtime: 7 secs on the RW interface
   - Patroni brought PostgreSQL back to the running state. An election was not triggered.
3. Reboot the server - Downtime: 17 secs on the RW interface
   - Failover happened and one of the slave servers was elected as the new master.
   - On the old master server, Patroni brought PostgreSQL up and performed pg_rewind to recreate it as a replica.
4. Stop the Patroni process - Downtime: 10 secs on the RW interface
   - Patroni stopped the PostgreSQL instance and a new master node was elected.
   - After starting Patroni, the old master server was rewound using pg_rewind and joined the cluster as a new replica.
Network isolation tests

1. Network isolate the master server - Downtime: 31 secs on the RW interface
   - A new master was elected.
   - Bringing communication back to the old master server did not bring it back as a replica automatically.
   - After restarting Patroni, the PostgreSQL instance on 192.168.56.53 came back as a replica.
2. Network isolate a slave server - Downtime: 0 secs on the RW interface
   - The isolated standby server was excluded from the cluster configuration.
   - After bringing communication back, the standby node rejoined the cluster automatically.


Pinging the cluster on the read-write interface (port 5000):

while true; do echo "select inet_server_addr(),now()::timestamp" | psql -Upostgres -h192.168.56.100 -p5000 -t; sleep 1; done
Pinging the cluster on the read-only interface (port 5001):

while true; do echo "select inet_server_addr(),now()::timestamp" | psql -Upostgres -h192.168.56.100 -p5001 -t; sleep 1; done
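
For context, 192.168.56.100 is the Keepalived-managed virtual IP in front of HAProxy, and HAProxy decides where to route each port by probing the Patroni REST API. Below is a minimal sketch of the relevant haproxy.cfg backends - it assumes the default Patroni REST API port 8008 and is illustrative, not a copy of the exact configuration used here:

listen primary
    bind *:5000
    option httpchk OPTIONS /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server psql13n51 192.168.56.51:5432 maxconn 100 check port 8008
    server psql13n52 192.168.56.52:5432 maxconn 100 check port 8008
    server psql13n53 192.168.56.53:5432 maxconn 100 check port 8008

listen standby
    bind *:5001
    balance roundrobin
    option httpchk OPTIONS /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server psql13n51 192.168.56.51:5432 maxconn 100 check port 8008
    server psql13n52 192.168.56.52:5432 maxconn 100 check port 8008
    server psql13n53 192.168.56.53:5432 maxconn 100 check port 8008

Patroni answers /master with HTTP 200 only on the leader and /replica with HTTP 200 only on healthy replicas, which is why the read-only pings below alternate between the standby nodes while the read-write pings always hit the leader.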

Standby Tests

1. Kill PostgreSQL process
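
The kill is performed on the standby node 192.168.56.53. A minimal way to simulate this kind of crash - assuming the same PostgreSQL 14 data directory shown later in the maintenance-mode test - is to kill the postmaster via the PID stored in postmaster.pid:

# on 192.168.56.53 - simulate a crash; the first line of postmaster.pid is the postmaster PID
$ sudo kill -9 "$(head -1 /var/lib/pgsql/14/data/postmaster.pid)"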

Read Write

192.168.56.52 | 2021-11-11 19:54:45.462582
192.168.56.52 | 2021-11-11 19:54:46.483056
192.168.56.52 | 2021-11-11 19:54:47.502918
192.168.56.52 | 2021-11-11 19:54:48.522746
192.168.56.52 | 2021-11-11 19:54:49.544109
192.168.56.52 | 2021-11-11 19:54:50.564185
192.168.56.52 | 2021-11-11 19:54:51.585437
192.168.56.52 | 2021-11-11 19:54:52.607154
192.168.56.52 | 2021-11-11 19:54:53.628248
192.168.56.52 | 2021-11-11 19:54:54.649941
192.168.56.52 | 2021-11-11 19:54:55.671482
No problems for the writer process.

Read Only

192.168.56.53 | 2021-11-11 19:54:45.505932 <<-- KILL POSTGRES PROCESS ON 192.168.56.53
192.168.56.51 | 2021-11-11 19:54:46.562742

psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

 192.168.56.51 | 2021-11-11 19:54:50.598035

psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

 192.168.56.51 | 2021-11-11 19:54:54.633922
 192.168.56.51 | 2021-11-11 19:54:55.655928
 192.168.56.51 | 2021-11-11 19:54:56.679855
 192.168.56.51 | 2021-11-11 19:54:57.70306
 192.168.56.51 | 2021-11-11 19:54:58.725866
 192.168.56.51 | 2021-11-11 19:54:59.749008
 192.168.56.51 | 2021-11-11 19:55:00.770238
 192.168.56.51 | 2021-11-11 19:55:01.791585
 192.168.56.53 | 2021-11-11 19:55:02.779865 <<-- PATRONI BROUGHT POSTGRESQL PROCESS
 192.168.56.51 | 2021-11-11 19:55:03.835348
 192.168.56.53 | 2021-11-11 19:55:04.825825
 192.168.56.51 | 2021-11-11 19:55:05.890109
After 10 secs Patroni brought PostgreSQL back automatically.

2. Stop the PostgreSQL process
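
The stop was presumably issued with pg_ctl on the standby, the same call used to stop the replica in the maintenance-mode test at the end of this post (the data directory is an assumption based on that test):

# on 192.168.56.53, as the postgres user - clean shutdown of the standby instance
$ pg_ctl -D /var/lib/pgsql/14/data stop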

Read Write

192.168.56.52 | 2021-11-11 20:05:18.785093
192.168.56.52 | 2021-11-11 20:05:19.806449
192.168.56.52 | 2021-11-11 20:05:20.82694
192.168.56.52 | 2021-11-11 20:05:21.847219
192.168.56.52 | 2021-11-11 20:05:22.868177
192.168.56.52 | 2021-11-11 20:05:23.888856
192.168.56.52 | 2021-11-11 20:05:24.90578
No problems for the read-write process.

Read Only

 
192.168.56.53 | 2021-11-11 20:05:18.990093 <<-- STOP POSTGRESQL PROCESS
 192.168.56.51 | 2021-11-11 20:05:20.04388

psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

 192.168.56.51 | 2021-11-11 20:05:24.08155
 192.168.56.53 | 2021-11-11 20:05:25.073322 <<-- PATRONI BROUGHT POSTGRESQL PROCESS
 
Patroni brought the PostgreSQL instance back in 6 seconds.

3. Reboot the server

Read Write

192.168.56.52 | 2021-11-11 20:10:13.171874
192.168.56.52 | 2021-11-11 20:10:14.193623
192.168.56.52 | 2021-11-11 20:10:15.217776
192.168.56.52 | 2021-11-11 20:10:16.239323
192.168.56.52 | 2021-11-11 20:10:17.257308
192.168.56.52 | 2021-11-11 20:10:18.27552
192.168.56.52 | 2021-11-11 20:10:19.292373
192.168.56.52 | 2021-11-11 20:10:20.310198
192.168.56.52 | 2021-11-11 20:10:21.32735
192.168.56.52 | 2021-11-11 20:10:22.343773
192.168.56.52 | 2021-11-11 20:10:23.361844
192.168.56.52 | 2021-11-11 20:10:24.38691
192.168.56.52 | 2021-11-11 20:10:25.407598
192.168.56.52 | 2021-11-11 20:10:26.429343
192.168.56.52 | 2021-11-11 20:10:27.450577
192.168.56.52 | 2021-11-11 20:10:28.471854
192.168.56.52 | 2021-11-11 20:10:29.492637
192.168.56.52 | 2021-11-11 20:10:30.512336
192.168.56.52 | 2021-11-11 20:10:31.533257
192.168.56.52 | 2021-11-11 20:10:32.554038
192.168.56.52 | 2021-11-11 20:10:33.574338
192.168.56.52 | 2021-11-11 20:10:34.596119
192.168.56.52 | 2021-11-11 20:10:35.615495
192.168.56.52 | 2021-11-11 20:10:36.637819
192.168.56.52 | 2021-11-11 20:10:37.659621
192.168.56.52 | 2021-11-11 20:10:38.682478
192.168.56.52 | 2021-11-11 20:10:39.703187
192.168.56.52 | 2021-11-11 20:10:40.727444
No problems for the write process.

Read Only

192.168.56.53 | 2021-11-11 20:10:12.314665 <<-- REBOOT THE 192.168.56.53 SERVER
 192.168.56.51 | 2021-11-11 20:10:13.304627

psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
 192.168.56.51 | 2021-11-11 20:10:24.340825

psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
 192.168.56.51 | 2021-11-11 20:10:29.42999
 192.168.56.51 | 2021-11-11 20:10:30.44846
 192.168.56.51 | 2021-11-11 20:10:31.470978
 192.168.56.51 | 2021-11-11 20:10:32.49244
 192.168.56.51 | 2021-11-11 20:10:33.515443
 192.168.56.51 | 2021-11-11 20:10:34.53563
 192.168.56.51 | 2021-11-11 20:10:35.553104
 192.168.56.51 | 2021-11-11 20:10:36.572375
 192.168.56.51 | 2021-11-11 20:10:37.595694
 192.168.56.51 | 2021-11-11 20:10:38.620022
 192.168.56.53 | 2021-11-11 20:10:39.644502 <<-- PATRONI STARTED ON BOOT AND STARTED THE POSTGRESQL PROCESS
 
Patroni started on boot and brought the PostgreSQL instance up automatically within 27 secs.

4. Stop the Patroni process
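
Assuming Patroni runs as a systemd unit named patroni (the unit name is an assumption, not confirmed by this setup), stopping and later restarting it on the standby would look like this:

# on 192.168.56.53 - stopping Patroni also stops the managed PostgreSQL instance
$ sudo systemctl stop patroni
# later, bring it back
$ sudo systemctl start patroni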

Read Write

192.168.56.52 | 2021-11-11 20:25:01.931924
192.168.56.52 | 2021-11-11 20:25:02.954774
192.168.56.52 | 2021-11-11 20:25:03.975514
192.168.56.52 | 2021-11-11 20:25:04.99868
192.168.56.52 | 2021-11-11 20:25:06.021456
192.168.56.52 | 2021-11-11 20:25:07.048917
192.168.56.52 | 2021-11-11 20:25:08.071156
192.168.56.52 | 2021-11-11 20:25:09.093902
192.168.56.52 | 2021-11-11 20:25:10.117138
192.168.56.52 | 2021-11-11 20:25:11.138296
192.168.56.52 | 2021-11-11 20:25:12.159975
192.168.56.52 | 2021-11-11 20:25:13.186149
192.168.56.52 | 2021-11-11 20:25:14.20717
192.168.56.52 | 2021-11-11 20:25:15.2286
No problem for the write process.

Read Only

Stopping Patroni stopped the PostgreSQL process and excluded the 192.168.56.53 node from the cluster.

+-----------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 9 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 9 | |
+-----------+---------------+---------+---------+----+-----------+
192.168.56.51 | 2021-11-11 20:24:52.731887

psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.
192.168.56.51 | 2021-11-11 20:24:56.772703

psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.
192.168.56.51 | 2021-11-11 20:25:00.811616

192.168.56.51 | 2021-11-11 20:25:01.837298
192.168.56.51 | 2021-11-11 20:25:02.860275
192.168.56.51 | 2021-11-11 20:25:03.8829
192.168.56.51 | 2021-11-11 20:25:04.906505
192.168.56.51 | 2021-11-11 20:25:05.932158
Start Patroni.

After starting Patroni, it brought the PostgreSQL instance up and the node joined the cluster automatically.

+-----------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 9 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 9 | |
| psql13n53 | 192.168.56.53 | Replica | running | 9 | 0 |
+-----------+---------------+---------+---------+----+-----------+
192.168.56.51 | 2021-11-11 20:28:18.473041
192.168.56.51 | 2021-11-11 20:28:19.495974
192.168.56.51 | 2021-11-11 20:28:20.518773
192.168.56.51 | 2021-11-11 20:28:21.541587
192.168.56.51 | 2021-11-11 20:28:22.563967
192.168.56.51 | 2021-11-11 20:28:23.586971
192.168.56.51 | 2021-11-11 20:28:24.608738
192.168.56.53 | 2021-11-11 20:28:25.63165
It took 7 seconds to route traffic to the standby node after starting the Patroni process.


Master Tests

1. Kill PostgreSQL process

Read Write

 192.168.56.52 | 2021-11-11 20:40:55.246602
 192.168.56.52 | 2021-11-11 20:40:56.270163 <<-- KILL POSTGRESQL PROCESS.

psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

 192.168.56.52 | 2021-11-11 20:41:05.318191 <<-- PATRONI BROUGHT POSTGRESQL SERVICE BACK
 192.168.56.52 | 2021-11-11 20:41:06.341719
After killing the PostgreSQL process, Patroni brought the service back to the running state. We had 10 secs of downtime for the writer process.

Read Only

192.168.56.51 | 2021-11-11 20:40:56.774198
192.168.56.53 | 2021-11-11 20:40:57.797533
192.168.56.51 | 2021-11-11 20:40:58.821054
192.168.56.53 | 2021-11-11 20:40:59.843738
192.168.56.51 | 2021-11-11 20:41:00.86877
192.168.56.53 | 2021-11-11 20:41:01.889666
192.168.56.51 | 2021-11-11 20:41:02.912988
192.168.56.53 | 2021-11-11 20:41:03.933952
192.168.56.51 | 2021-11-11 20:41:05.045196
192.168.56.53 | 2021-11-11 20:41:06.078416
No disruption for the read-only requests.

2. Stop the PostgreSQL process

Read Write

  
192.168.56.52 | 2021-11-11 20:52:01.251009

psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.
psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.

192.168.56.52 | 2021-11-11 20:52:08.301596
Patroni brought PostgreSQL back to the running state. An election was not triggered. There were 7 secs of downtime for the writer process.

Read Only

  
192.168.56.53 | 2021-11-11 20:52:01.53767
192.168.56.51 | 2021-11-11 20:52:02.561452
192.168.56.53 | 2021-11-11 20:52:03.583391
192.168.56.51 | 2021-11-11 20:52:04.609092
192.168.56.53 | 2021-11-11 20:52:05.631433
192.168.56.51 | 2021-11-11 20:52:06.656341
192.168.56.53 | 2021-11-11 20:52:07.677131
192.168.56.51 | 2021-11-11 20:52:08.701682
192.168.56.53 | 2021-11-11 20:52:09.730157
3. Reboot the server

Read Write

192.168.56.52 | 2021-11-11 20:59:31.49515

psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
psql: error: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

192.168.56.51 | 2021-11-11 20:59:48.650256 <<-- SERVER 192.168.56.51 ELECTED AS THE NEW MASTER:
192.168.56.51 | 2021-11-11 20:59:49.669785
192.168.56.51 | 2021-11-11 20:59:50.687517
Failover happened and one of the slave servers was elected as the new master. We had 17 seconds of downtime for the writer process. On the old master server, Patroni brought PostgreSQL up and performed pg_rewind to recreate it as a replica.

  
+-----------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 14 | |
| psql13n52 | 192.168.56.52 | Replica | running | 14 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 14 | 0 |
+-----------+---------------+---------+---------+----+-----------+
Read Only

192.168.56.51 | 2021-11-11 20:59:29.053858
192.168.56.53 | 2021-11-11 20:59:30.07594
192.168.56.51 | 2021-11-11 20:59:31.105134
192.168.56.53 | 2021-11-11 20:59:32.123691
192.168.56.51 | 2021-11-11 20:59:33.152343
192.168.56.53 | 2021-11-11 20:59:34.170016
192.168.56.51 | 2021-11-11 20:59:35.199209
192.168.56.53 | 2021-11-11 20:59:36.21726
192.168.56.51 | 2021-11-11 20:59:37.238567
192.168.56.53 | 2021-11-11 20:59:38.251579
192.168.56.51 | 2021-11-11 20:59:39.273968
192.168.56.53 | 2021-11-11 20:59:40.288168
192.168.56.51 | 2021-11-11 20:59:41.308803
192.168.56.53 | 2021-11-11 20:59:42.32304
192.168.56.53 | 2021-11-11 20:59:43.339712
192.168.56.53 | 2021-11-11 20:59:44.357711
192.168.56.53 | 2021-11-11 20:59:45.375188
192.168.56.53 | 2021-11-11 20:59:46.395121
192.168.56.53 | 2021-11-11 20:59:47.411711
192.168.56.53 | 2021-11-11 20:59:48.428075
192.168.56.53 | 2021-11-11 20:59:49.445494
192.168.56.53 | 2021-11-11 20:59:50.462092
There were no disruptions for the read-only requests.

4. Stop the Patroni process

Read Write

192.168.56.51 | 2021-11-11 21:08:25.526132

psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.
psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.
psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.

192.168.56.53 | 2021-11-11 21:08:35.60005
192.168.56.53 | 2021-11-11 21:08:36.62634
192.168.56.53 | 2021-11-11 21:08:37.651523
  
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n52 | 192.168.56.52 | Replica | running | 15 | 0 |
| psql13n53 | 192.168.56.53 | Leader | running | 15 | |
+-----------+---------------+---------+---------+----+-----------+
Patroni stopped the PostgreSQL instance and a new master node was elected. We had 10 secs of downtime for the writer process.

After starting Patroni, the old master server was rewound using pg_rewind and joined the cluster as a new replica.

+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 15 | 0 |
| psql13n52 | 192.168.56.52 | Replica | running | 15 | 0 |
| psql13n53 | 192.168.56.53 | Leader | running | 15 | |
+-----------+---------------+---------+---------+----+-----------+
Read Only

192.168.56.53 | 2021-11-11 21:08:25.3516
192.168.56.52 | 2021-11-11 21:08:26.374974
192.168.56.53 | 2021-11-11 21:08:27.397898
192.168.56.52 | 2021-11-11 21:08:28.432293
192.168.56.53 | 2021-11-11 21:08:29.455458
192.168.56.52 | 2021-11-11 21:08:30.479256
192.168.56.53 | 2021-11-11 21:08:31.500499
192.168.56.52 | 2021-11-11 21:08:32.525148
192.168.56.53 | 2021-11-11 21:08:33.54793
192.168.56.52 | 2021-11-11 21:08:34.571675
192.168.56.52 | 2021-11-11 21:08:35.610965
192.168.56.52 | 2021-11-11 21:08:36.639712
There were no disruptions for the read-only process.


Network Isolation Tests

1. Network isolate master server from the configuration
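
One way to simulate the isolation - a sketch, not necessarily the exact commands used for this test - is to drop all traffic between the current master and the rest of the 192.168.56.0/24 network with iptables, and flush the rules afterwards to restore connectivity:

# on the current master - block communication with the other cluster members
$ sudo iptables -A INPUT -s 192.168.56.0/24 -j DROP
$ sudo iptables -A OUTPUT -d 192.168.56.0/24 -j DROP
# restore connectivity when the test is done
$ sudo iptables -F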

Read Write

192.168.56.53 | 2021-11-11 22:15:37.376374 <<-- COMMUNICATION BLOCKED

192.168.56.52 | 2021-11-11 22:15:08.169172
192.168.56.52 | 2021-11-11 22:15:09.190167
192.168.56.52 | 2021-11-11 22:15:10.211688
192.168.56.52 | 2021-11-11 22:15:11.232966
192.168.56.52 | 2021-11-11 22:15:12.254794
192.168.56.52 | 2021-11-11 22:15:13.276149
192.168.56.52 | 2021-11-11 22:15:14.29847
192.168.56.52 | 2021-11-11 22:15:15.319335
192.168.56.52 | 2021-11-11 22:15:16.343936
Communication was blocked on the master (read/write) node. A new master was elected. We had 31 secs of downtime for the writer application.

  
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 16 | |
+-----------+---------------+---------+---------+----+-----------+
Bringing communication back to the old master server did not rejoin it to the cluster as a replica automatically. After restarting Patroni, it brought the PostgreSQL instance on 192.168.56.53 back as a replica.

+-----------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 16 | |
| psql13n53 | 192.168.56.53 | Replica | running | 16 | 0 |
+-----------+---------------+---------+---------+----+-----------+
Read Only

192.168.56.51 | 2021-11-11 22:15:11.676438
192.168.56.51 | 2021-11-11 22:15:12.699285
192.168.56.51 | 2021-11-11 22:15:13.722465
192.168.56.51 | 2021-11-11 22:15:14.74705
192.168.56.51 | 2021-11-11 22:15:15.77105
192.168.56.51 | 2021-11-11 22:15:16.794407
192.168.56.51 | 2021-11-11 22:15:17.816547
192.168.56.51 | 2021-11-11 22:15:18.838761

192.168.56.53 | 2021-11-11 22:25:57.360616
192.168.56.51 | 2021-11-11 22:25:58.390982
192.168.56.53 | 2021-11-11 22:25:59.42245
192.168.56.51 | 2021-11-11 22:26:00.450804
192.168.56.53 | 2021-11-11 22:26:01.480687
192.168.56.51 | 2021-11-11 22:26:02.510569
192.168.56.53 | 2021-11-11 22:26:03.540663
192.168.56.51 | 2021-11-11 22:26:04.574112
192.168.56.53 | 2021-11-11 22:26:05.606363
192.168.56.51 | 2021-11-11 22:26:06.635608
After the old master server was added back to the cluster configuration as a replica, it started to accept read-only requests.

2. Network isolate slave server from the configuration

Read Write

192.168.56.52 | 2021-11-11 22:28:22.539789
192.168.56.52 | 2021-11-11 22:28:23.559629
192.168.56.52 | 2021-11-11 22:28:24.580749
192.168.56.52 | 2021-11-11 22:28:25.925264
192.168.56.52 | 2021-11-11 22:28:26.946179
192.168.56.52 | 2021-11-11 22:28:27.969459
192.168.56.52 | 2021-11-11 22:28:28.991379
192.168.56.52 | 2021-11-11 22:28:30.013173
192.168.56.52 | 2021-11-11 22:28:31.032617
192.168.56.52 | 2021-11-11 22:28:32.053455
192.168.56.52 | 2021-11-11 22:28:33.074863
192.168.56.52 | 2021-11-11 22:28:34.096192
192.168.56.52 | 2021-11-11 22:28:35.116744
There was no problem for the writer application.

Read Only

192.168.56.51 | 2021-11-11 22:28:03.186052
192.168.56.53 | 2021-11-11 22:28:04.208455
192.168.56.51 | 2021-11-11 22:28:05.23119
192.168.56.51 | 2021-11-11 22:28:16.665654

psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.

192.168.56.51 | 2021-11-11 22:28:30.700107
192.168.56.51 | 2021-11-11 22:28:31.721761
192.168.56.51 | 2021-11-11 22:28:32.744021
192.168.56.51 | 2021-11-11 22:28:33.766453
192.168.56.51 | 2021-11-11 22:28:34.789146
192.168.56.51 | 2021-11-11 22:28:35.811602
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 16 | |
+-----------+---------------+---------+---------+----+-----------+
The isolated standby server was excluded from the cluster configuration.

After bringing communication back, the standby node rejoined the cluster automatically.

+-----------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 16 | |
| psql13n53 | 192.168.56.53 | Replica | running | 16 | 0 |
+-----------+---------------+---------+---------+----+-----------+

Switchover

Here I manually switch the primary role to one of the replicas and bring the old primary back into the cluster as a new replica.

$ patronictl -c /opt/app/patroni/etc/postgresql.yml list

+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 17 | |
| psql13n52 | 192.168.56.52 | Replica | running | 17 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 17 | 0 |
+-----------+---------------+---------+---------+----+-----------+
$ patronictl -c /opt/app/patroni/etc/postgresql.yml switchover
Master [psql13n51]:
Candidate ['psql13n52', 'psql13n53'] []: psql13n52
When should the switchover take place (e.g. 2021-11-15T22:08 ) [now]:
Current cluster topology
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 17 | |
| psql13n52 | 192.168.56.52 | Replica | running | 17 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 17 | 0 |
+-----------+---------------+---------+---------+----+-----------+
Are you sure you want to switchover cluster postgres, demoting current master psql13n51? [y/N]: y
2021-11-15 21:10:28.05685 Successfully switched over to "psql13n52"
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | stopped | | unknown |
| psql13n52 | 192.168.56.52 | Leader | running | 17 | |
| psql13n53 | 192.168.56.53 | Replica | running | 17 | 0 |
+-----------+---------------+---------+---------+----+-----------+
192.168.56.51 | 2021-11-15 21:10:21.190417
192.168.56.51 | 2021-11-15 21:10:22.223856
192.168.56.51 | 2021-11-15 21:10:23.259458
192.168.56.51 | 2021-11-15 21:10:24.293523
192.168.56.51 | 2021-11-15 21:10:25.329155

psql: error: server closed the connection unexpectedly
   This probably means the server terminated abnormally
   before or while processing the request.
192.168.56.51 | 2021-11-15 21:10:30.379076
192.168.56.51 | 2021-11-15 21:10:31.40607
192.168.56.52 | 2021-11-15 21:10:32.417283
192.168.56.51 | 2021-11-15 21:10:33.450491
192.168.56.52 | 2021-11-15 21:10:34.468676
192.168.56.52 | 2021-11-15 21:10:35.494665
192.168.56.52 | 2021-11-15 21:10:36.517738
192.168.56.52 | 2021-11-15 21:10:37.541415
192.168.56.52 | 2021-11-15 21:10:38.567083
Node 192.168.56.52 became the new primary and 192.168.56.51 joined the cluster as a new replica. Downtime for the read-write interface was 7 secs.

+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 18 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 18 | |
| psql13n53 | 192.168.56.53 | Replica | running | 18 | 0 |
+-----------+---------------+---------+---------+----+-----------+
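
The same switchover can also be run non-interactively, which is handy for scripting. The flags below are from patronictl in Patroni 2.x; treat this as a sketch and check patronictl switchover --help on your version:

$ patronictl -c /opt/app/patroni/etc/postgresql.yml switchover --master psql13n51 --candidate psql13n52 --force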

Failover

Although a failover can also be triggered manually, it is mostly executed automatically when the leader node becomes unavailable for an unplanned reason. We have already seen automatic failovers in the previous tests.

For this test I will trigger a failover manually.

patronictl -c /opt/app/patroni/etc/postgresql.yml failover
Candidate ['psql13n51', 'psql13n53'] []: psql13n51
Current cluster topology
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 18 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 18 | |
| psql13n53 | 192.168.56.53 | Replica | running | 18 | 0 |
+-----------+---------------+---------+---------+----+-----------+

Are you sure you want to failover cluster postgres, demoting current master psql13n52? [y/N]: y
2021-11-15 21:25:04.85489 Successfully failed over to "psql13n51"
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 18 | |
| psql13n52 | 192.168.56.52 | Replica | stopped | | unknown |
| psql13n53 | 192.168.56.53 | Replica | running | 18 | 0 |
+-----------+---------------+---------+---------+----+-----------+
Node 192.168.56.51 became the new master and node 192.168.56.52 joined the cluster as a replica. Downtime was 7 secs.

+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 19 | |
| psql13n52 | 192.168.56.52 | Replica | running | 19 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 |
+-----------+---------------+---------+---------+----+-----------+
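
As with the switchover, the failover can also be triggered non-interactively (again a sketch based on Patroni 2.x flags; verify with patronictl failover --help):

$ patronictl -c /opt/app/patroni/etc/postgresql.yml failover --candidate psql13n51 --force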

Maintenance mode

Sometimes it is necessary to do maintenance on a single node and you do not want Patroni to manage the cluster, for example when performing a PostgreSQL upgrade.

When Patroni is paused, it won't change the state of PostgreSQL - it will not try to automatically start the PostgreSQL instance when it is stopped.

For the test we will stop a replica and check whether Patroni starts the database automatically as in the previous tests.

[postgres@psql13n52 ~]$ patronictl -c /opt/app/patroni/etc/postgresql.yml pause
Success: cluster management is paused

[postgres@psql13n51 ~]$ patronictl -c /opt/app/patroni/etc/postgresql.yml list
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 19 | |
| psql13n52 | 192.168.56.52 | Replica | running | 19 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 |
+-----------+---------------+---------+---------+----+-----------+
 Maintenance mode: on
Notice - "Maintenance mode: on".
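
Besides the patronictl list output, the paused state can also be checked through the Patroni REST API. Assuming the default REST API port 8008, the /patroni status endpoint should report "pause": true while maintenance mode is on (a sketch; verify against your Patroni version):

$ curl -s http://192.168.56.51:8008/patroni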

Replica is stopped:

$ pg_ctl -D /var/lib/pgsql/14/data stop
waiting for server to shut down.... done
server stopped
Patroni didn't bring the database back up.

$ patronictl -c /opt/app/patroni/etc/postgresql.yml list
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 19 | |
| psql13n52 | 192.168.56.52 | Replica | stopped | | unknown |
| psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 |
+-----------+---------------+---------+---------+----+-----------+
 Maintenance mode: on
Resume Patroni.

$ patronictl -c /opt/app/patroni/etc/postgresql.yml resume
Success: cluster management is resumed
The node joined the cluster after a few seconds.

$ patronictl -c /opt/app/patroni/etc/postgresql.yml list
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 19 | |
| psql13n52 | 192.168.56.52 | Replica | running | 19 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 |
+-----------+---------------+---------+---------+----+-----------+




References:
http://highscalability.com/blog/2019/9/16/managing-high-availability-in-postgresql-part-iii-patroni.html

Comments:

  1. Hi Marko, did you maybe consider to test failure of etcd?
    Some things to test:
    1. Etcd lost quorum 2 nodes restart + PgSQL Primary node restarts
    2. Etcd network partition

    Thanks for this post it's great and some nice info :)

  2. Hello!

    I have only tested an ETCD stop/crash on one node. This configuration is only expected to survive the failure of one node. I have also tested a hard shutdown of all the nodes in the configuration, simulating a "power outage".

    But I haven't tested the scenarios you noted.

    It would be interesting to test how the cluster behaves when ETCD loses quorum, and whether the HA configuration would survive a possible split-brain situation and data inconsistencies.

    Thanks for the comment.

    Regards,
    Marko
