Just to mention, this is not proper production test. Before considering this setup for the production it would be great to put cluster under proper load, simulate slow IO response time, memory crashes, etc. and check cluster behavior.
In this tests I am only checking start/stop resources in various scenarios.
Standby tests
No. | Test Scenario | Downtime | Observation |
---|---|---|---|
1. | Kill PostgreSQL process | 10 secs 1 RO node | - No problems for writer process. - Patroni started on the boot and brought PostgreSQL instance automatically |
2. | Stop the PostgreSQL process | 27 secs 1 RO node | - No problems for read write process. - Patroni started on the boot and brought PostgreSQL instance automatically |
3. | Reboot the server | 27 secs 1 RO node | - No problems for the write process. - Patroni started on the boot and brought PostgreSQL instance automatically. |
4. | Stop the Patroni process | 25 secs 1 RO node | - No problem for the write process. - Stopping Patroni stopped PostgreSQL process and excluded 192.168.56.53 node from the cluster. - After starting Patroni brought PosgreSQL instance and joined to the cluster automatically. |
No. | Test Scenario | Downtime | Observation |
---|---|---|---|
1. | Kill PostgreSQL process | 10 secs RW | - After killing PostgreSQL process Patroni brought service back to the running state. - No distruption for the read only requests. |
2. | Stop the PostgreSQL process | 7 secs RW | - Patroni brought PostgreSQL to the running state. Election was not triggered. |
3. | Reboot the server | 17 secs RW | - Faiover happened and one of the slave servers was elected as the new master. - On the old master server, Petroni brought PostgreSQL and performed pg_rewind to create replica. |
4. | Stop Patroni process | 10 ses RW | - Patroni stopped PostgreSQL instance and new master node was elected. - After starting Patroni, old master server was rewound using pg_rewind and new replica joined to the cluster. |
No. | Test Scenario | Downtime | Observation |
---|---|---|---|
1. | Network isolate master server from the configuration | 31 secs RW | - New master was elected. - Bringing back communication to old master server did not bring old master server as replica automatically. - Restarting Patroni it brought PostgreSQL instance on 192.168.56.53 as replica. |
2. | Network isolate slave server from the configuration | 0 secs RW | - Isolated standby server was excluded from the cluster configuration. - Bringing back communication to the standby server node rejoined to the cluster automatically. |
Pinging cluster on the read write interface (port 5000):
while true; do echo "select inet_server_addr(),now()::timestamp" | psql -Upostgres -h192.168.56.100 -p5000 -t; sleep 1; donePinging cluster on the read only interface (port 5001):
while true; do echo "select inet_server_addr(),now()::timestamp" | psql -Upostgres -h192.168.56.100 -p5001 -t; sleep 1; done
Standby Tests
1. Kill PostgreSQL process
Read Write
192.168.56.52 | 2021-11-11 19:54:45.462582 192.168.56.52 | 2021-11-11 19:54:46.483056 192.168.56.52 | 2021-11-11 19:54:47.502918 192.168.56.52 | 2021-11-11 19:54:48.522746 192.168.56.52 | 2021-11-11 19:54:49.544109 192.168.56.52 | 2021-11-11 19:54:50.564185 192.168.56.52 | 2021-11-11 19:54:51.585437 192.168.56.52 | 2021-11-11 19:54:52.607154 192.168.56.52 | 2021-11-11 19:54:53.628248 192.168.56.52 | 2021-11-11 19:54:54.649941 192.168.56.52 | 2021-11-11 19:54:55.671482No problems for writer process.
Read Only
192.168.56.53 | 2021-11-11 19:54:45.505932 <<-- KILL POSTGRES PROCESS ON 192.168.56.53 192.168.56.51 | 2021-11-11 19:54:46.562742 psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 192.168.56.51 | 2021-11-11 19:54:50.598035 psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 192.168.56.51 | 2021-11-11 19:54:54.633922 192.168.56.51 | 2021-11-11 19:54:55.655928 192.168.56.51 | 2021-11-11 19:54:56.679855 192.168.56.51 | 2021-11-11 19:54:57.70306 192.168.56.51 | 2021-11-11 19:54:58.725866 192.168.56.51 | 2021-11-11 19:54:59.749008 192.168.56.51 | 2021-11-11 19:55:00.770238 192.168.56.51 | 2021-11-11 19:55:01.791585 192.168.56.53 | 2021-11-11 19:55:02.779865 <<-- PATRONI BROUGHT POSTGRESQL PROCESS 192.168.56.51 | 2021-11-11 19:55:03.835348 192.168.56.53 | 2021-11-11 19:55:04.825825 192.168.56.51 | 2021-11-11 19:55:05.890109After 10 secs Patroni brought PostgreSQL automatically.
2. Stop the PostgreSQL process
Read Write
192.168.56.52 | 2021-11-11 20:05:18.785093 192.168.56.52 | 2021-11-11 20:05:19.806449 192.168.56.52 | 2021-11-11 20:05:20.82694 192.168.56.52 | 2021-11-11 20:05:21.847219 192.168.56.52 | 2021-11-11 20:05:22.868177 192.168.56.52 | 2021-11-11 20:05:23.888856 192.168.56.52 | 2021-11-11 20:05:24.90578No problems for read write process.
Read Only
192.168.56.53 | 2021-11-11 20:05:18.990093 <<-- STOP POSTGRESQL PROCESS 192.168.56.51 | 2021-11-11 20:05:20.04388 psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 192.168.56.51 | 2021-11-11 20:05:24.08155 192.168.56.53 | 2021-11-11 20:05:25.073322 <<-- PATRONI BROUGHT POSTGRESQL PROCESSPatroni brough PostgreSQL instance in 6 seconds.
3. Reboot the server
Read Write
192.168.56.52 | 2021-11-11 20:10:13.171874 192.168.56.52 | 2021-11-11 20:10:14.193623 192.168.56.52 | 2021-11-11 20:10:15.217776 192.168.56.52 | 2021-11-11 20:10:16.239323 192.168.56.52 | 2021-11-11 20:10:17.257308 192.168.56.52 | 2021-11-11 20:10:18.27552 192.168.56.52 | 2021-11-11 20:10:19.292373 192.168.56.52 | 2021-11-11 20:10:20.310198 192.168.56.52 | 2021-11-11 20:10:21.32735 192.168.56.52 | 2021-11-11 20:10:22.343773 192.168.56.52 | 2021-11-11 20:10:23.361844 192.168.56.52 | 2021-11-11 20:10:24.38691 192.168.56.52 | 2021-11-11 20:10:25.407598 192.168.56.52 | 2021-11-11 20:10:26.429343 192.168.56.52 | 2021-11-11 20:10:27.450577 192.168.56.52 | 2021-11-11 20:10:28.471854 192.168.56.52 | 2021-11-11 20:10:29.492637 192.168.56.52 | 2021-11-11 20:10:30.512336 192.168.56.52 | 2021-11-11 20:10:31.533257 192.168.56.52 | 2021-11-11 20:10:32.554038 192.168.56.52 | 2021-11-11 20:10:33.574338 192.168.56.52 | 2021-11-11 20:10:34.596119 192.168.56.52 | 2021-11-11 20:10:35.615495 192.168.56.52 | 2021-11-11 20:10:36.637819 192.168.56.52 | 2021-11-11 20:10:37.659621 192.168.56.52 | 2021-11-11 20:10:38.682478 192.168.56.52 | 2021-11-11 20:10:39.703187 192.168.56.52 | 2021-11-11 20:10:40.727444No problems for the write process.
Read Only
192.168.56.53 | 2021-11-11 20:10:12.314665 <<-- REBOOT THE 192.168.56.53 SERVER 192.168.56.51 | 2021-11-11 20:10:13.304627 psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 192.168.56.51 | 2021-11-11 20:10:24.340825 psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 192.168.56.51 | 2021-11-11 20:10:29.42999 192.168.56.51 | 2021-11-11 20:10:30.44846 192.168.56.51 | 2021-11-11 20:10:31.470978 192.168.56.51 | 2021-11-11 20:10:32.49244 192.168.56.51 | 2021-11-11 20:10:33.515443 192.168.56.51 | 2021-11-11 20:10:34.53563 192.168.56.51 | 2021-11-11 20:10:35.553104 192.168.56.51 | 2021-11-11 20:10:36.572375 192.168.56.51 | 2021-11-11 20:10:37.595694 192.168.56.51 | 2021-11-11 20:10:38.620022 192.168.56.53 | 2021-11-11 20:10:39.644502 <<-- PATRONI STARTED ON THE BOOT AND STARTET POSTGRESQL PROCESSPatroni started on the boot and brought PostgreSQL instance automatically in 27 secs.
4. Stop the Patroni process
Read Write
192.168.56.52 | 2021-11-11 20:25:01.931924 192.168.56.52 | 2021-11-11 20:25:02.954774 192.168.56.52 | 2021-11-11 20:25:03.975514 192.168.56.52 | 2021-11-11 20:25:04.99868 192.168.56.52 | 2021-11-11 20:25:06.021456 192.168.56.52 | 2021-11-11 20:25:07.048917 192.168.56.52 | 2021-11-11 20:25:08.071156 192.168.56.52 | 2021-11-11 20:25:09.093902 192.168.56.52 | 2021-11-11 20:25:10.117138 192.168.56.52 | 2021-11-11 20:25:11.138296 192.168.56.52 | 2021-11-11 20:25:12.159975 192.168.56.52 | 2021-11-11 20:25:13.186149 192.168.56.52 | 2021-11-11 20:25:14.20717 192.168.56.52 | 2021-11-11 20:25:15.2286No problem for the write process.
Read Only
Stopping Patroni stopped PostgreSQL process and excluded 192.168.56.53 node from the cluster.
+-----------+---------------+---------+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | + Cluster: postgres (7027353509639501631) ------+----+-----------+ | psql13n51 | 192.168.56.51 | Replica | running | 9 | 0 | | psql13n52 | 192.168.56.52 | Leader | running | 9 | | +-----------+---------------+---------+---------+----+-----------+ 192.168.56.51 | 2021-11-11 20:24:52.731887 psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 192.168.56.51 | 2021-11-11 20:24:56.772703 psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 192.168.56.51 | 2021-11-11 20:25:00.811616 192.168.56.51 | 2021-11-11 20:25:01.837298 192.168.56.51 | 2021-11-11 20:25:02.860275 192.168.56.51 | 2021-11-11 20:25:03.8829 192.168.56.51 | 2021-11-11 20:25:04.906505 192.168.56.51 | 2021-11-11 20:25:05.932158Start Patroni.
After starting Patroni brought PosgreSQL instance and joined to the cluster automatically.
+-----------+---------------+---------+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | + Cluster: postgres (7027353509639501631) ------+----+-----------+ | psql13n51 | 192.168.56.51 | Replica | running | 9 | 0 | | psql13n52 | 192.168.56.52 | Leader | running | 9 | | | psql13n53 | 192.168.56.53 | Replica | running | 9 | 0 | +-----------+---------------+---------+---------+----+-----------+ 192.168.56.51 | 2021-11-11 20:28:18.473041 192.168.56.51 | 2021-11-11 20:28:19.495974 192.168.56.51 | 2021-11-11 20:28:20.518773 192.168.56.51 | 2021-11-11 20:28:21.541587 192.168.56.51 | 2021-11-11 20:28:22.563967 192.168.56.51 | 2021-11-11 20:28:23.586971 192.168.56.51 | 2021-11-11 20:28:24.608738 192.168.56.53 | 2021-11-11 20:28:25.63165It took 7 seconds to route traffic on stanby node after starting Patroni process.
Master Tests
1. Kill PostgreSQL process
Read Write
192.168.56.52 | 2021-11-11 20:40:55.246602 192.168.56.52 | 2021-11-11 20:40:56.270163 <<-- KILL POSTGRESQL PROCESS. psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 192.168.56.52 | 2021-11-11 20:41:05.318191 <<-- PATRONI BROUGHT POSTGRESQL SERVICE BACK 192.168.56.52 | 2021-11-11 20:41:06.341719After killing PostgreSQL process Patroni brought service back to the running state. We had 10 secs downtime for the writer process.
Read Only
192.168.56.51 | 2021-11-11 20:40:56.774198 192.168.56.53 | 2021-11-11 20:40:57.797533 192.168.56.51 | 2021-11-11 20:40:58.821054 192.168.56.53 | 2021-11-11 20:40:59.843738 192.168.56.51 | 2021-11-11 20:41:00.86877 192.168.56.53 | 2021-11-11 20:41:01.889666 192.168.56.51 | 2021-11-11 20:41:02.912988 192.168.56.53 | 2021-11-11 20:41:03.933952 192.168.56.51 | 2021-11-11 20:41:05.045196 192.168.56.53 | 2021-11-11 20:41:06.078416No distruption for the read only requests.
2. Stop the PostgreSQL process
Read Write
192.168.56.52 | 2021-11-11 20:52:01.251009 psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 192.168.56.52 | 2021-11-11 20:52:08.301596Patroni brought PostgreSQL to the running state. Election was not triggered. There was 7 secs downtime for the writer process.
Read Only
192.168.56.53 | 2021-11-11 20:52:01.53767 192.168.56.51 | 2021-11-11 20:52:02.561452 192.168.56.53 | 2021-11-11 20:52:03.583391 192.168.56.51 | 2021-11-11 20:52:04.609092 192.168.56.53 | 2021-11-11 20:52:05.631433 192.168.56.51 | 2021-11-11 20:52:06.656341 192.168.56.53 | 2021-11-11 20:52:07.677131 192.168.56.51 | 2021-11-11 20:52:08.701682 192.168.56.53 | 2021-11-11 20:52:09.7301573. Reboot the server
Read Write
192.168.56.52 | 2021-11-11 20:59:31.49515 psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 192.168.56.51 | 2021-11-11 20:59:48.650256 <<-- SERVER 192.168.56.51 ELECTED AS THE NEW MASTER: 192.168.56.51 | 2021-11-11 20:59:49.669785 192.168.56.51 | 2021-11-11 20:59:50.687517Faiover happened and one of the slave servers was elected as the new master. We had 17 seconds downtime for the writer process. On the old master server, Petroni brought PostgreSQL and performed pg_rewind to create replica.
+-----------+---------------+---------+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | + Cluster: postgres (7027353509639501631) ------+----+-----------+ | psql13n51 | 192.168.56.51 | Leader | running | 14 | | | psql13n52 | 192.168.56.52 | Replica | running | 14 | 0 | | psql13n53 | 192.168.56.53 | Replica | running | 14 | 0 | +-----------+---------------+---------+---------+----+-----------+Read Only
192.168.56.51 | 2021-11-11 20:59:29.053858 192.168.56.53 | 2021-11-11 20:59:30.07594 192.168.56.51 | 2021-11-11 20:59:31.105134 192.168.56.53 | 2021-11-11 20:59:32.123691 192.168.56.51 | 2021-11-11 20:59:33.152343 192.168.56.53 | 2021-11-11 20:59:34.170016 192.168.56.51 | 2021-11-11 20:59:35.199209 192.168.56.53 | 2021-11-11 20:59:36.21726 192.168.56.51 | 2021-11-11 20:59:37.238567 192.168.56.53 | 2021-11-11 20:59:38.251579 192.168.56.51 | 2021-11-11 20:59:39.273968 192.168.56.53 | 2021-11-11 20:59:40.288168 192.168.56.51 | 2021-11-11 20:59:41.308803 192.168.56.53 | 2021-11-11 20:59:42.32304 192.168.56.53 | 2021-11-11 20:59:43.339712 192.168.56.53 | 2021-11-11 20:59:44.357711 192.168.56.53 | 2021-11-11 20:59:45.375188 192.168.56.53 | 2021-11-11 20:59:46.395121 192.168.56.53 | 2021-11-11 20:59:47.411711 192.168.56.53 | 2021-11-11 20:59:48.428075 192.168.56.53 | 2021-11-11 20:59:49.445494 192.168.56.53 | 2021-11-11 20:59:50.462092There was no disrupions for the read only requests.
3. Stop Patroni process
Read Write
192.168.56.51 | 2021-11-11 21:08:25.526132 psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 192.168.56.53 | 2021-11-11 21:08:35.60005 192.168.56.53 | 2021-11-11 21:08:36.62634 192.168.56.53 | 2021-11-11 21:08:37.651523
+ Cluster: postgres (7027353509639501631) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------+---------+---------+----+-----------+ | psql13n52 | 192.168.56.52 | Replica | running | 15 | 0 | | psql13n53 | 192.168.56.53 | Leader | running | 15 | | +-----------+---------------+---------+---------+----+-----------+Patroni stopped PostgreSQL instance and new master node was elected. We had 10 secs of downtime for the writer process.
After starting Patroni, old master server was rewound using pg_rewind and new replica joined to the cluster.
+ Cluster: postgres (7027353509639501631) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------+---------+---------+----+-----------+ | psql13n51 | 192.168.56.51 | Replica | running | 15 | 0 | | psql13n52 | 192.168.56.52 | Replica | running | 15 | 0 | | psql13n53 | 192.168.56.53 | Leader | running | 15 | | +-----------+---------------+---------+---------+----+-----------+Read Only
192.168.56.53 | 2021-11-11 21:08:25.3516 192.168.56.52 | 2021-11-11 21:08:26.374974 192.168.56.53 | 2021-11-11 21:08:27.397898 192.168.56.52 | 2021-11-11 21:08:28.432293 192.168.56.53 | 2021-11-11 21:08:29.455458 192.168.56.52 | 2021-11-11 21:08:30.479256 192.168.56.53 | 2021-11-11 21:08:31.500499 192.168.56.52 | 2021-11-11 21:08:32.525148 192.168.56.53 | 2021-11-11 21:08:33.54793 192.168.56.52 | 2021-11-11 21:08:34.571675 192.168.56.52 | 2021-11-11 21:08:35.610965 192.168.56.52 | 2021-11-11 21:08:36.639712There was no disruptions for the read only process.
Network Isolation Tests
1. Network isolate master server from the configuration
Read Write
192.168.56.53 | 2021-11-11 22:15:37.376374 <<-- COMMUNICATION BLOCKED 192.168.56.52 | 2021-11-11 22:15:08.169172 192.168.56.52 | 2021-11-11 22:15:09.190167 192.168.56.52 | 2021-11-11 22:15:10.211688 192.168.56.52 | 2021-11-11 22:15:11.232966 192.168.56.52 | 2021-11-11 22:15:12.254794 192.168.56.52 | 2021-11-11 22:15:13.276149 192.168.56.52 | 2021-11-11 22:15:14.29847 192.168.56.52 | 2021-11-11 22:15:15.319335 192.168.56.52 | 2021-11-11 22:15:16.343936Communication was blocked on the master (read/write) node. New master was elected. We had 31 secs downtime for the writer application.
+ Cluster: postgres (7027353509639501631) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------+---------+---------+----+-----------+ | psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 | | psql13n52 | 192.168.56.52 | Leader | running | 16 | | +-----------+---------------+---------+---------+----+-----------+Bringing back communication to old master server did not bring old master server as replica automatically. Restarting Patroni it brought PostgreSQL instance on 192.168.56.53 as replica.
+-----------+---------------+---------+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | + Cluster: postgres (7027353509639501631) ------+----+-----------+ | psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 | | psql13n52 | 192.168.56.52 | Leader | running | 16 | | | psql13n53 | 192.168.56.53 | Replica | running | 16 | 0 | +-----------+---------------+---------+---------+----+-----------+Read Only
192.168.56.51 | 2021-11-11 22:15:11.676438 192.168.56.51 | 2021-11-11 22:15:12.699285 192.168.56.51 | 2021-11-11 22:15:13.722465 192.168.56.51 | 2021-11-11 22:15:14.74705 192.168.56.51 | 2021-11-11 22:15:15.77105 192.168.56.51 | 2021-11-11 22:15:16.794407 192.168.56.51 | 2021-11-11 22:15:17.816547 192.168.56.51 | 2021-11-11 22:15:18.838761 192.168.56.53 | 2021-11-11 22:25:57.360616 192.168.56.51 | 2021-11-11 22:25:58.390982 192.168.56.53 | 2021-11-11 22:25:59.42245 192.168.56.51 | 2021-11-11 22:26:00.450804 192.168.56.53 | 2021-11-11 22:26:01.480687 192.168.56.51 | 2021-11-11 22:26:02.510569 192.168.56.53 | 2021-11-11 22:26:03.540663 192.168.56.51 | 2021-11-11 22:26:04.574112 192.168.56.53 | 2021-11-11 22:26:05.606363 192.168.56.51 | 2021-11-11 22:26:06.635608Afer adding old master server to cluster configuration as replica it started to accept read only requests.
2. Network isolate slave server from the configuration
Read Write
192.168.56.52 | 2021-11-11 22:28:22.539789 192.168.56.52 | 2021-11-11 22:28:23.559629 192.168.56.52 | 2021-11-11 22:28:24.580749 192.168.56.52 | 2021-11-11 22:28:25.925264 192.168.56.52 | 2021-11-11 22:28:26.946179 192.168.56.52 | 2021-11-11 22:28:27.969459 192.168.56.52 | 2021-11-11 22:28:28.991379 192.168.56.52 | 2021-11-11 22:28:30.013173 192.168.56.52 | 2021-11-11 22:28:31.032617 192.168.56.52 | 2021-11-11 22:28:32.053455 192.168.56.52 | 2021-11-11 22:28:33.074863 192.168.56.52 | 2021-11-11 22:28:34.096192 192.168.56.52 | 2021-11-11 22:28:35.116744There was no problem for the writer applications.
Read Only
192.168.56.51 | 2021-11-11 22:28:03.186052 192.168.56.53 | 2021-11-11 22:28:04.208455 192.168.56.51 | 2021-11-11 22:28:05.23119 192.168.56.51 | 2021-11-11 22:28:16.665654 psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 192.168.56.51 | 2021-11-11 22:28:30.700107 192.168.56.51 | 2021-11-11 22:28:31.721761 192.168.56.51 | 2021-11-11 22:28:32.744021 192.168.56.51 | 2021-11-11 22:28:33.766453 192.168.56.51 | 2021-11-11 22:28:34.789146 192.168.56.51 | 2021-11-11 22:28:35.811602
+ Cluster: postgres (7027353509639501631) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------+---------+---------+----+-----------+ | psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 | | psql13n52 | 192.168.56.52 | Leader | running | 16 | | +-----------+---------------+---------+---------+----+-----------+Isolated standby server was excluded from the cluster configuration.
Bringing back communication to the standby server node rejoined to the cluster automatically.
+-----------+---------------+---------+---------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | + Cluster: postgres (7027353509639501631) ------+----+-----------+ | psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 | | psql13n52 | 192.168.56.52 | Leader | running | 16 | | | psql13n53 | 192.168.56.53 | Replica | running | 16 | 0 | +-----------+---------------+---------+---------+----+-----------+
Switchover
Manually trigger switch of the primary node to one of the replicas and bring the old primary as a new replica into the cluster.
$ patronictl -c /opt/app/patroni/etc/postgresql.yml list + Cluster: postgres (7027353509639501631) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------+---------+---------+----+-----------+ | psql13n51 | 192.168.56.51 | Leader | running | 17 | | | psql13n52 | 192.168.56.52 | Replica | running | 17 | 0 | | psql13n53 | 192.168.56.53 | Replica | running | 17 | 0 | +-----------+---------------+---------+---------+----+-----------+
$ patronictl -c /opt/app/patroni/etc/postgresql.yml switchover Master [psql13n51]: Candidate ['psql13n52', 'psql13n53'] []: psql13n52 When should the switchover take place (e.g. 2021-11-15T22:08 ) [now]: Current cluster topology + Cluster: postgres (7027353509639501631) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------+---------+---------+----+-----------+ | psql13n51 | 192.168.56.51 | Leader | running | 17 | | | psql13n52 | 192.168.56.52 | Replica | running | 17 | 0 | | psql13n53 | 192.168.56.53 | Replica | running | 17 | 0 | +-----------+---------------+---------+---------+----+-----------+ Are you sure you want to switchover cluster postgres, demoting current master psql13n51? [y/N]: y 2021-11-15 21:10:28.05685 Successfully switched over to "psql13n52" + Cluster: postgres (7027353509639501631) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------+---------+---------+----+-----------+ | psql13n51 | 192.168.56.51 | Replica | stopped | | unknown | | psql13n52 | 192.168.56.52 | Leader | running | 17 | | | psql13n53 | 192.168.56.53 | Replica | running | 17 | 0 | +-----------+---------------+---------+---------+----+-----------+
192.168.56.51 | 2021-11-15 21:10:21.190417 192.168.56.51 | 2021-11-15 21:10:22.223856 192.168.56.51 | 2021-11-15 21:10:23.259458 192.168.56.51 | 2021-11-15 21:10:24.293523 192.168.56.51 | 2021-11-15 21:10:25.329155 psql: error: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. 192.168.56.51 | 2021-11-15 21:10:30.379076 192.168.56.51 | 2021-11-15 21:10:31.40607 192.168.56.52 | 2021-11-15 21:10:32.417283 192.168.56.51 | 2021-11-15 21:10:33.450491 192.168.56.52 | 2021-11-15 21:10:34.468676 192.168.56.52 | 2021-11-15 21:10:35.494665 192.168.56.52 | 2021-11-15 21:10:36.517738 192.168.56.52 | 2021-11-15 21:10:37.541415 192.168.56.52 | 2021-11-15 21:10:38.567083Node 192.168.56.52 bacame new primary node and 192.168.56.51 joined as the new replica to the cluster. Downtime for the read write node was 7 secs.
+ Cluster: postgres (7027353509639501631) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------+---------+---------+----+-----------+ | psql13n51 | 192.168.56.51 | Replica | running | 18 | 0 | | psql13n52 | 192.168.56.52 | Leader | running | 18 | | | psql13n53 | 192.168.56.53 | Replica | running | 18 | 0 | +-----------+---------------+---------+---------+----+-----------+
Failover
Although failover can also be triggered manually it is mostly executed automatically, when leader node is unavailable for unplanned reason. We have noticed automatic failovers in previous tests.
For a test I will trigger failover manually.
patronictl -c /opt/app/patroni/etc/postgresql.yml failover Candidate ['psql13n51', 'psql13n53'] []: psql13n51 Current cluster topology + Cluster: postgres (7027353509639501631) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------+---------+---------+----+-----------+ | psql13n51 | 192.168.56.51 | Replica | running | 18 | 0 | | psql13n52 | 192.168.56.52 | Leader | running | 18 | | | psql13n53 | 192.168.56.53 | Replica | running | 18 | 0 | +-----------+---------------+---------+---------+----+-----------+ Are you sure you want to failover cluster postgres, demoting current master psql13n52? [y/N]: y 2021-11-15 21:25:04.85489 Successfully failed over to "psql13n51" + Cluster: postgres (7027353509639501631) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------+---------+---------+----+-----------+ | psql13n51 | 192.168.56.51 | Leader | running | 18 | | | psql13n52 | 192.168.56.52 | Replica | stopped | | unknown | | psql13n53 | 192.168.56.53 | Replica | running | 18 | 0 | +-----------+---------------+---------+---------+----+-----------+Node 192.168.56.51 became new master and node 192.168.56.52 joined to the cluster as a replica. Downtime was 7 secs.
+ Cluster: postgres (7027353509639501631) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------+---------+---------+----+-----------+ | psql13n51 | 192.168.56.51 | Leader | running | 19 | | | psql13n52 | 192.168.56.52 | Replica | running | 19 | 0 | | psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 | +-----------+---------------+---------+---------+----+-----------+
Maintenance mode
Sometimes it is necessary to do maintenance on a single node and you do not want Petroni to manage the cluster. For example when performing PostgreSQL upgrade.
When Patroni is paused, it won't change the state of the PostgreSQL - it will not to try automatically start PostgreSQL cluster when it is stopped.
For the test we will stop the replica and test if Petroni will start database automatically as in previous tests.
[postgres@psql13n52 ~]$ patronictl -c /opt/app/patroni/etc/postgresql.yml pause Success: cluster management is paused
[postgres@psql13n51 ~]$ patronictl -c /opt/app/patroni/etc/postgresql.yml list + Cluster: postgres (7027353509639501631) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------+---------+---------+----+-----------+ | psql13n51 | 192.168.56.51 | Leader | running | 19 | | | psql13n52 | 192.168.56.52 | Replica | running | 19 | 0 | | psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 | +-----------+---------------+---------+---------+----+-----------+ Maintenance mode: onNotice - "Maintenance mode: on".
Replica is stopped:
$ pg_ctl -D /var/lib/pgsql/14/data stop waiting for server to shut down.... done server stoppedPatroni didn't brought up database.
$ patronictl -c /opt/app/patroni/etc/postgresql.yml list + Cluster: postgres (7027353509639501631) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------+---------+---------+----+-----------+ | psql13n51 | 192.168.56.51 | Leader | running | 19 | | | psql13n52 | 192.168.56.52 | Replica | stopped | | unknown | | psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 | +-----------+---------------+---------+---------+----+-----------+ Maintenance mode: onResume Patroni.
$ patronictl -c /opt/app/patroni/etc/postgresql.yml resume Success: cluster management is resumedNode joined the cluster after few seconds.
$ patronictl -c /opt/app/patroni/etc/postgresql.yml list + Cluster: postgres (7027353509639501631) ------+----+-----------+ | Member | Host | Role | State | TL | Lag in MB | +-----------+---------------+---------+---------+----+-----------+ | psql13n51 | 192.168.56.51 | Leader | running | 19 | | | psql13n52 | 192.168.56.52 | Replica | running | 19 | 0 | | psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 | +-----------+---------------+---------+---------+----+-----------+
References:
http://highscalability.com/blog/2019/9/16/managing-high-availability-in-postgresql-part-iii-patroni.html
Hi Marko, did you maybe consider to test failure of etcd?
ReplyDeleteSome things to test:
1. Etcd lost quorum 2 nodes restart + PgSQL Primary node restarts
2. Etcd network partition
Thanks for this post it's great and some nice info :)
Hello!
ReplyDeleteI have only tested ETCD stop/crash on the one node. For this configuration it is expected to survive failures only on the one node. Although I have tested hard shutdown of all the nodes in the configuration simulating "power outage".
But haven't tested scenarios you noted.
It would be interesting to test how will cluster behave when ETCD loses quorum. Will HA configuration survive possible split-brain situation and possible data inconsistencies.
Thanks for the comment.
Regards,
Marko
This comment has been removed by a blog administrator.
ReplyDelete