<span style="font-size: x-large;"><b>PostgreSQL HA - Patroni, ETCD, HAProxy, Keepalived - Test failure scenarios</b></span><br>
<br>
I will test a few failover/maintenance scenarios and show the results in this blog post. <br>
<br>
Just to mention, this is <b>not</b> a proper production test. Before considering this setup for production it would be good to put the cluster under proper load, simulate slow IO response times, memory crashes, etc. and check the cluster behavior.<br>
<br>
In these tests I am only checking starting/stopping resources in various scenarios.<br>
<br>
<span id="fullpost">
<b>Standby tests</b><br>
<br>
<span style="font-size: small;">
<table>
<thead>
<tr>
<th>No.</th>
<th>Test Scenario</th>
<th>Downtime</th>
<th>Observation</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.</td>
<td>Kill PostgreSQL process</td>
<td>10 secs (1 RO node)</td>
<td>- No problems for the writer process. <br> - Patroni brought the PostgreSQL instance back automatically.</td>
</tr>
<tr>
<td>2.</td>
<td>Stop the PostgreSQL process</td>
<td>27 secs (1 RO node)</td>
<td>- No problems for the read-write process. <br> - Patroni brought the PostgreSQL instance back automatically.</td>
</tr>
<tr>
<td>3.</td>
<td>Reboot the server</td>
<td>27 secs (1 RO node)</td>
<td>- No problems for the write process. <br> - Patroni started on boot and brought the PostgreSQL instance up automatically.</td>
</tr>
<tr>
<td>4.</td>
<td>Stop the Patroni process</td>
<td>25 secs (1 RO node)</td>
<td>- No problem for the write process. <br> - Stopping Patroni stopped the PostgreSQL process and excluded the 192.168.56.53 node from the cluster. <br> - After Patroni was started again, it brought the PostgreSQL instance up and rejoined it to the cluster automatically.</td>
</tr>
</tbody>
</table>
</span>
<b>Master tests</b><br>
<br>
<span style="font-size: small;">
<table>
<thead>
<tr>
<th>No.</th>
<th>Test Scenario</th>
<th>Downtime</th>
<th>Observation</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.</td>
<td>Kill PostgreSQL process</td>
<td>10 secs RW</td>
<td>- After killing the PostgreSQL process, Patroni brought the service back to the running state. <br> - No disruption for the read-only requests.</td>
</tr>
<tr>
<td>2.</td>
<td>Stop the PostgreSQL process</td>
<td>7 secs RW</td>
<td>- Patroni brought PostgreSQL back to the running state. An election was not triggered.</td>
</tr>
<tr>
<td>3.</td>
<td>Reboot the server</td>
<td>17 secs RW</td>
<td>- Failover happened and one of the slave servers was elected as the new master. <br> - On the old master server, Patroni brought PostgreSQL up and performed pg_rewind to create a replica.</td>
</tr>
<tr>
<td>4.</td>
<td>Stop Patroni process</td>
<td>10 secs RW</td>
<td>- Patroni stopped the PostgreSQL instance and a new master node was elected. <br> - After starting Patroni, the old master server was rewound using pg_rewind and joined the cluster as a new replica.</td>
</tr>
</tbody>
</table>
</span>
<b>Network isolation tests</b><br>
<br>
<span style="font-size: small;">
<table>
<thead>
<tr>
<th>No.</th>
<th>Test Scenario</th>
<th>Downtime</th>
<th>Observation</th>
</tr>
</thead>
<tbody>
<tr>
<td>1.</td>
<td>Network isolate master server from the configuration</td>
<td>31 secs RW</td>
<td>- A new master was elected. <br> - Bringing communication back to the old master server did not return it to the cluster as a replica automatically. <br> - Restarting Patroni brought the PostgreSQL instance on 192.168.56.53 back as a replica.</td>
</tr>
<tr>
<td>2.</td>
<td>Network isolate slave server from the configuration</td>
<td>0 secs RW</td>
<td>- The isolated standby server was excluded from the cluster configuration. <br> - After communication was brought back, the standby node rejoined the cluster automatically.</td>
</tr>
</tbody>
</table>
</span>
<br>
<br>
Pinging the cluster on the <b>read write</b> interface (port 5000): <br>
<br>
<pre class="brush: text">while true; do echo "select inet_server_addr(),now()::timestamp" | psql -Upostgres -h192.168.56.100 -p5000 -t; sleep 1; done
</pre>
Pinging the cluster on the <b>read only</b> interface (port 5001): <br>
<br>
<pre class="brush: text">while true; do echo "select inet_server_addr(),now()::timestamp" | psql -Upostgres -h192.168.56.100 -p5001 -t; sleep 1; done
</pre>
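Here 192.168.56.100 is the floating virtual IP that keepalived keeps on one of the nodes. A minimal keepalived.conf sketch for such a VIP is shown below; the interface name (eth1) and the priority values are assumptions and must match your environment:<br>
<br>
<pre class="brush: text">vrrp_instance VI_1 {
    state MASTER               # BACKUP on the other nodes
    interface eth1             # assumed interface on the 192.168.56.0/24 network
    virtual_router_id 51
    priority 100               # use a lower priority on the BACKUP nodes
    advert_int 1
    virtual_ipaddress {
        192.168.56.100
    }
}
</pre>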
<br>
<span style="font-size: x-large;">Standby Tests</span><br>
<br>
<span style="font-size: large;">1. Kill PostgreSQL process</span><br>
<br>
<b>Read Write</b><br>
<br>
<pre class="brush: text">192.168.56.52 | 2021-11-11 19:54:45.462582
192.168.56.52 | 2021-11-11 19:54:46.483056
192.168.56.52 | 2021-11-11 19:54:47.502918
192.168.56.52 | 2021-11-11 19:54:48.522746
192.168.56.52 | 2021-11-11 19:54:49.544109
192.168.56.52 | 2021-11-11 19:54:50.564185
192.168.56.52 | 2021-11-11 19:54:51.585437
192.168.56.52 | 2021-11-11 19:54:52.607154
192.168.56.52 | 2021-11-11 19:54:53.628248
192.168.56.52 | 2021-11-11 19:54:54.649941
192.168.56.52 | 2021-11-11 19:54:55.671482
</pre>
No problems for the writer process.<br>
<br>
<b>Read Only</b><br>
<br>
<pre class="brush: text">
192.168.56.53 | 2021-11-11 19:54:45.505932 <<-- KILL POSTGRES PROCESS ON 192.168.56.53
192.168.56.51 | 2021-11-11 19:54:46.562742
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
192.168.56.51 | 2021-11-11 19:54:50.598035
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
192.168.56.51 | 2021-11-11 19:54:54.633922
192.168.56.51 | 2021-11-11 19:54:55.655928
192.168.56.51 | 2021-11-11 19:54:56.679855
192.168.56.51 | 2021-11-11 19:54:57.70306
192.168.56.51 | 2021-11-11 19:54:58.725866
192.168.56.51 | 2021-11-11 19:54:59.749008
192.168.56.51 | 2021-11-11 19:55:00.770238
192.168.56.51 | 2021-11-11 19:55:01.791585
192.168.56.53 | 2021-11-11 19:55:02.779865 <<-- PATRONI BROUGHT POSTGRESQL PROCESS
192.168.56.51 | 2021-11-11 19:55:03.835348
192.168.56.53 | 2021-11-11 19:55:04.825825
192.168.56.51 | 2021-11-11 19:55:05.890109
</pre>
After 10 secs, Patroni brought PostgreSQL back automatically.<br>
<br>
<span style="font-size: large;">2. Stop the PostgreSQL process</span><br>
<br>
<b>Read Write</b><br>
<br>
<pre class="brush: text">192.168.56.52 | 2021-11-11 20:05:18.785093
192.168.56.52 | 2021-11-11 20:05:19.806449
192.168.56.52 | 2021-11-11 20:05:20.82694
192.168.56.52 | 2021-11-11 20:05:21.847219
192.168.56.52 | 2021-11-11 20:05:22.868177
192.168.56.52 | 2021-11-11 20:05:23.888856
192.168.56.52 | 2021-11-11 20:05:24.90578
</pre>
No problems for the read-write process.<br>
<br>
<b>Read Only</b><br>
<br>
<pre class="brush: text">
192.168.56.53 | 2021-11-11 20:05:18.990093 <<-- STOP POSTGRESQL PROCESS
192.168.56.51 | 2021-11-11 20:05:20.04388
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
192.168.56.51 | 2021-11-11 20:05:24.08155
192.168.56.53 | 2021-11-11 20:05:25.073322 <<-- PATRONI BROUGHT POSTGRESQL PROCESS
</pre>
Patroni brought the PostgreSQL instance back in 6 seconds.<br>
<br>
<span style="font-size: large;">3. Reboot the server</span><br>
<br>
<b>Read Write</b><br>
<br>
<pre class="brush: text">192.168.56.52 | 2021-11-11 20:10:13.171874
192.168.56.52 | 2021-11-11 20:10:14.193623
192.168.56.52 | 2021-11-11 20:10:15.217776
192.168.56.52 | 2021-11-11 20:10:16.239323
192.168.56.52 | 2021-11-11 20:10:17.257308
192.168.56.52 | 2021-11-11 20:10:18.27552
192.168.56.52 | 2021-11-11 20:10:19.292373
192.168.56.52 | 2021-11-11 20:10:20.310198
192.168.56.52 | 2021-11-11 20:10:21.32735
192.168.56.52 | 2021-11-11 20:10:22.343773
192.168.56.52 | 2021-11-11 20:10:23.361844
192.168.56.52 | 2021-11-11 20:10:24.38691
192.168.56.52 | 2021-11-11 20:10:25.407598
192.168.56.52 | 2021-11-11 20:10:26.429343
192.168.56.52 | 2021-11-11 20:10:27.450577
192.168.56.52 | 2021-11-11 20:10:28.471854
192.168.56.52 | 2021-11-11 20:10:29.492637
192.168.56.52 | 2021-11-11 20:10:30.512336
192.168.56.52 | 2021-11-11 20:10:31.533257
192.168.56.52 | 2021-11-11 20:10:32.554038
192.168.56.52 | 2021-11-11 20:10:33.574338
192.168.56.52 | 2021-11-11 20:10:34.596119
192.168.56.52 | 2021-11-11 20:10:35.615495
192.168.56.52 | 2021-11-11 20:10:36.637819
192.168.56.52 | 2021-11-11 20:10:37.659621
192.168.56.52 | 2021-11-11 20:10:38.682478
192.168.56.52 | 2021-11-11 20:10:39.703187
192.168.56.52 | 2021-11-11 20:10:40.727444
</pre>
No problems for the write process.<br>
<br>
<b>Read Only</b><br>
<br>
<pre class="brush: text">
192.168.56.53 | 2021-11-11 20:10:12.314665 <<-- REBOOT THE 192.168.56.53 SERVER
192.168.56.51 | 2021-11-11 20:10:13.304627
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
192.168.56.51 | 2021-11-11 20:10:24.340825
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
192.168.56.51 | 2021-11-11 20:10:29.42999
192.168.56.51 | 2021-11-11 20:10:30.44846
192.168.56.51 | 2021-11-11 20:10:31.470978
192.168.56.51 | 2021-11-11 20:10:32.49244
192.168.56.51 | 2021-11-11 20:10:33.515443
192.168.56.51 | 2021-11-11 20:10:34.53563
192.168.56.51 | 2021-11-11 20:10:35.553104
192.168.56.51 | 2021-11-11 20:10:36.572375
192.168.56.51 | 2021-11-11 20:10:37.595694
192.168.56.51 | 2021-11-11 20:10:38.620022
192.168.56.53 | 2021-11-11 20:10:39.644502 <<-- PATRONI STARTED ON THE BOOT AND STARTED POSTGRESQL PROCESS
</pre>
Patroni started on boot and brought the PostgreSQL instance up automatically within 27 secs.<br>
<br>
<span style="font-size: large;">4. Stop the Patroni process</span><br>
<br>
<b>Read Write</b><br>
<br>
<pre class="brush: text">192.168.56.52 | 2021-11-11 20:25:01.931924
192.168.56.52 | 2021-11-11 20:25:02.954774
192.168.56.52 | 2021-11-11 20:25:03.975514
192.168.56.52 | 2021-11-11 20:25:04.99868
192.168.56.52 | 2021-11-11 20:25:06.021456
192.168.56.52 | 2021-11-11 20:25:07.048917
192.168.56.52 | 2021-11-11 20:25:08.071156
192.168.56.52 | 2021-11-11 20:25:09.093902
192.168.56.52 | 2021-11-11 20:25:10.117138
192.168.56.52 | 2021-11-11 20:25:11.138296
192.168.56.52 | 2021-11-11 20:25:12.159975
192.168.56.52 | 2021-11-11 20:25:13.186149
192.168.56.52 | 2021-11-11 20:25:14.20717
192.168.56.52 | 2021-11-11 20:25:15.2286
</pre>
No problem for the write process.<br>
<br>
<b>Read Only</b><br>
<br>
Stopping Patroni stopped the PostgreSQL process and excluded the 192.168.56.53 node from the cluster.<br>
<br>
<pre class="brush: text">+-----------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 9 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 9 | |
+-----------+---------------+---------+---------+----+-----------+
192.168.56.51 | 2021-11-11 20:24:52.731887
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
192.168.56.51 | 2021-11-11 20:24:56.772703
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
192.168.56.51 | 2021-11-11 20:25:00.811616
192.168.56.51 | 2021-11-11 20:25:01.837298
192.168.56.51 | 2021-11-11 20:25:02.860275
192.168.56.51 | 2021-11-11 20:25:03.8829
192.168.56.51 | 2021-11-11 20:25:04.906505
192.168.56.51 | 2021-11-11 20:25:05.932158
</pre>
Start Patroni.<br>
<br>
After starting, Patroni brought the PostgreSQL instance up and it rejoined the cluster automatically.<br>
<br>
<pre class="brush: text">+-----------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 9 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 9 | |
| psql13n53 | 192.168.56.53 | Replica | running | 9 | 0 |
+-----------+---------------+---------+---------+----+-----------+
192.168.56.51 | 2021-11-11 20:28:18.473041
192.168.56.51 | 2021-11-11 20:28:19.495974
192.168.56.51 | 2021-11-11 20:28:20.518773
192.168.56.51 | 2021-11-11 20:28:21.541587
192.168.56.51 | 2021-11-11 20:28:22.563967
192.168.56.51 | 2021-11-11 20:28:23.586971
192.168.56.51 | 2021-11-11 20:28:24.608738
192.168.56.53 | 2021-11-11 20:28:25.63165
</pre>
It took 7 seconds to route traffic to the standby node after starting the Patroni process.<br>
<br>
<br>
<span style="font-size: x-large;">Master Tests</span><br>
<br>
<span style="font-size: large;">1. Kill PostgreSQL process</span><br>
<br>
<b>Read Write</b><br>
<br>
<pre class="brush: text">
192.168.56.52 | 2021-11-11 20:40:55.246602
192.168.56.52 | 2021-11-11 20:40:56.270163 <<-- KILL POSTGRESQL PROCESS.
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
192.168.56.52 | 2021-11-11 20:41:05.318191 <<-- PATRONI BROUGHT POSTGRESQL SERVICE BACK
192.168.56.52 | 2021-11-11 20:41:06.341719
</pre>
After killing the PostgreSQL process, Patroni brought the service back to the running state. We had 10 secs of downtime for the writer process.<br>
<br>
<b>Read Only</b><br>
<br>
<pre class="brush: text">192.168.56.51 | 2021-11-11 20:40:56.774198
192.168.56.53 | 2021-11-11 20:40:57.797533
192.168.56.51 | 2021-11-11 20:40:58.821054
192.168.56.53 | 2021-11-11 20:40:59.843738
192.168.56.51 | 2021-11-11 20:41:00.86877
192.168.56.53 | 2021-11-11 20:41:01.889666
192.168.56.51 | 2021-11-11 20:41:02.912988
192.168.56.53 | 2021-11-11 20:41:03.933952
192.168.56.51 | 2021-11-11 20:41:05.045196
192.168.56.53 | 2021-11-11 20:41:06.078416
</pre>
No disruption for the read-only requests.<br>
<br>
<span style="font-size: large;">2. Stop the PostgreSQL process</span><br>
<br>
<b>Read Write</b><br>
<br>
<pre class="brush: text">
192.168.56.52 | 2021-11-11 20:52:01.251009
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
192.168.56.52 | 2021-11-11 20:52:08.301596
</pre>
Patroni brought PostgreSQL back to the running state. An election was not triggered. There were 7 secs of downtime for the writer process.<br>
<br>
<b>Read Only</b><br>
<br>
<pre class="brush: text">
192.168.56.53 | 2021-11-11 20:52:01.53767
192.168.56.51 | 2021-11-11 20:52:02.561452
192.168.56.53 | 2021-11-11 20:52:03.583391
192.168.56.51 | 2021-11-11 20:52:04.609092
192.168.56.53 | 2021-11-11 20:52:05.631433
192.168.56.51 | 2021-11-11 20:52:06.656341
192.168.56.53 | 2021-11-11 20:52:07.677131
192.168.56.51 | 2021-11-11 20:52:08.701682
192.168.56.53 | 2021-11-11 20:52:09.730157
</pre>
<span style="font-size: large;">3. Reboot the server</span><br>
<br>
<b>Read Write</b><br>
<br>
<pre class="brush: text">
192.168.56.52 | 2021-11-11 20:59:31.49515
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
192.168.56.51 | 2021-11-11 20:59:48.650256 <<-- SERVER 192.168.56.51 ELECTED AS THE NEW MASTER:
192.168.56.51 | 2021-11-11 20:59:49.669785
192.168.56.51 | 2021-11-11 20:59:50.687517
</pre>
Failover happened and one of the slave servers was elected as the new master. We had 17 seconds of downtime for the writer process. On the old master server, Patroni brought PostgreSQL up and performed pg_rewind to create a replica.<br>
<br>
<pre class="brush: text">
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 14 | |
| psql13n52 | 192.168.56.52 | Replica | running | 14 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 14 | 0 |
+-----------+---------------+---------+---------+----+-----------+
</pre>
<b>Read Only</b><br>
<br>
<pre class="brush: text">192.168.56.51 | 2021-11-11 20:59:29.053858
192.168.56.53 | 2021-11-11 20:59:30.07594
192.168.56.51 | 2021-11-11 20:59:31.105134
192.168.56.53 | 2021-11-11 20:59:32.123691
192.168.56.51 | 2021-11-11 20:59:33.152343
192.168.56.53 | 2021-11-11 20:59:34.170016
192.168.56.51 | 2021-11-11 20:59:35.199209
192.168.56.53 | 2021-11-11 20:59:36.21726
192.168.56.51 | 2021-11-11 20:59:37.238567
192.168.56.53 | 2021-11-11 20:59:38.251579
192.168.56.51 | 2021-11-11 20:59:39.273968
192.168.56.53 | 2021-11-11 20:59:40.288168
192.168.56.51 | 2021-11-11 20:59:41.308803
192.168.56.53 | 2021-11-11 20:59:42.32304
192.168.56.53 | 2021-11-11 20:59:43.339712
192.168.56.53 | 2021-11-11 20:59:44.357711
192.168.56.53 | 2021-11-11 20:59:45.375188
192.168.56.53 | 2021-11-11 20:59:46.395121
192.168.56.53 | 2021-11-11 20:59:47.411711
192.168.56.53 | 2021-11-11 20:59:48.428075
192.168.56.53 | 2021-11-11 20:59:49.445494
192.168.56.53 | 2021-11-11 20:59:50.462092
</pre>
There were no disruptions for the read-only requests.<br>
<br>
<span style="font-size: large;">3. Stop Patroni process</span><br>
<br>
<b>Read Write</b><br>
<br>
<pre class="brush: text">192.168.56.51 | 2021-11-11 21:08:25.526132
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
192.168.56.53 | 2021-11-11 21:08:35.60005
192.168.56.53 | 2021-11-11 21:08:36.62634
192.168.56.53 | 2021-11-11 21:08:37.651523
</pre>
<pre class="brush: text">
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n52 | 192.168.56.52 | Replica | running | 15 | 0 |
| psql13n53 | 192.168.56.53 | Leader | running | 15 | |
+-----------+---------------+---------+---------+----+-----------+
</pre>
Patroni stopped the PostgreSQL instance and a new master node was elected. We had 10 secs of downtime for the writer process.<br>
<br>
After starting Patroni, the old master server was rewound using pg_rewind and joined the cluster as a new replica.<br>
<br>
<pre class="brush: text">+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 15 | 0 |
| psql13n52 | 192.168.56.52 | Replica | running | 15 | 0 |
| psql13n53 | 192.168.56.53 | Leader | running | 15 | |
+-----------+---------------+---------+---------+----+-----------+
</pre>
<b>Read Only</b><br>
<br>
<pre class="brush: text">192.168.56.53 | 2021-11-11 21:08:25.3516
192.168.56.52 | 2021-11-11 21:08:26.374974
192.168.56.53 | 2021-11-11 21:08:27.397898
192.168.56.52 | 2021-11-11 21:08:28.432293
192.168.56.53 | 2021-11-11 21:08:29.455458
192.168.56.52 | 2021-11-11 21:08:30.479256
192.168.56.53 | 2021-11-11 21:08:31.500499
192.168.56.52 | 2021-11-11 21:08:32.525148
192.168.56.53 | 2021-11-11 21:08:33.54793
192.168.56.52 | 2021-11-11 21:08:34.571675
192.168.56.52 | 2021-11-11 21:08:35.610965
192.168.56.52 | 2021-11-11 21:08:36.639712
</pre>
There were no disruptions for the read-only process.<br>
<br>
<br>
<span style="font-size: x-large;">Network Isolation Tests</span><br>
<br>
<span style="font-size: large;">1. Network isolate master server from the configuration</span><br>
<br>
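One way to simulate such isolation is to drop all cluster traffic on the node with iptables, as in the sketch below (run on the node being isolated; flush the rules afterwards to restore connectivity):<br>
<br>
<pre class="brush: text"># drop all traffic to/from the other cluster members
iptables -A INPUT -s 192.168.56.0/24 -j DROP
iptables -A OUTPUT -d 192.168.56.0/24 -j DROP

# later, restore connectivity
iptables -F
</pre>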
<b>Read Write</b><br>
<br>
<pre class="brush: text">
192.168.56.53 | 2021-11-11 22:15:37.376374 <<-- COMMUNICATION BLOCKED
192.168.56.52 | 2021-11-11 22:16:08.169172
192.168.56.52 | 2021-11-11 22:16:09.190167
192.168.56.52 | 2021-11-11 22:16:10.211688
192.168.56.52 | 2021-11-11 22:16:11.232966
192.168.56.52 | 2021-11-11 22:16:12.254794
192.168.56.52 | 2021-11-11 22:16:13.276149
192.168.56.52 | 2021-11-11 22:16:14.29847
192.168.56.52 | 2021-11-11 22:16:15.319335
192.168.56.52 | 2021-11-11 22:16:16.343936
</pre>
Communication was blocked on the master (read/write) node. A new master was elected. We had 31 secs of downtime for the writer application.<br>
<br>
<pre class="brush: text">
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 16 | |
+-----------+---------------+---------+---------+----+-----------+
</pre>
Bringing communication back to the old master server did not return it to the cluster as a replica automatically. Restarting Patroni brought the PostgreSQL instance on 192.168.56.53 back as a replica.<br>
<br>
<pre class="brush: text">+-----------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 16 | |
| psql13n53 | 192.168.56.53 | Replica | running | 16 | 0 |
+-----------+---------------+---------+---------+----+-----------+
</pre>
<b>Read Only</b><br>
<br>
<pre class="brush: text">192.168.56.51 | 2021-11-11 22:15:11.676438
192.168.56.51 | 2021-11-11 22:15:12.699285
192.168.56.51 | 2021-11-11 22:15:13.722465
192.168.56.51 | 2021-11-11 22:15:14.74705
192.168.56.51 | 2021-11-11 22:15:15.77105
192.168.56.51 | 2021-11-11 22:15:16.794407
192.168.56.51 | 2021-11-11 22:15:17.816547
192.168.56.51 | 2021-11-11 22:15:18.838761
192.168.56.53 | 2021-11-11 22:25:57.360616
192.168.56.51 | 2021-11-11 22:25:58.390982
192.168.56.53 | 2021-11-11 22:25:59.42245
192.168.56.51 | 2021-11-11 22:26:00.450804
192.168.56.53 | 2021-11-11 22:26:01.480687
192.168.56.51 | 2021-11-11 22:26:02.510569
192.168.56.53 | 2021-11-11 22:26:03.540663
192.168.56.51 | 2021-11-11 22:26:04.574112
192.168.56.53 | 2021-11-11 22:26:05.606363
192.168.56.51 | 2021-11-11 22:26:06.635608
</pre>
After adding the old master server back to the cluster configuration as a replica, it started to accept read-only requests.<br>
<br>
<span style="font-size: large;">2. Network isolate slave server from the configuration</span><br>
<br>
<b>Read Write</b><br>
<br>
<pre class="brush: text">192.168.56.52 | 2021-11-11 22:28:22.539789
192.168.56.52 | 2021-11-11 22:28:23.559629
192.168.56.52 | 2021-11-11 22:28:24.580749
192.168.56.52 | 2021-11-11 22:28:25.925264
192.168.56.52 | 2021-11-11 22:28:26.946179
192.168.56.52 | 2021-11-11 22:28:27.969459
192.168.56.52 | 2021-11-11 22:28:28.991379
192.168.56.52 | 2021-11-11 22:28:30.013173
192.168.56.52 | 2021-11-11 22:28:31.032617
192.168.56.52 | 2021-11-11 22:28:32.053455
192.168.56.52 | 2021-11-11 22:28:33.074863
192.168.56.52 | 2021-11-11 22:28:34.096192
192.168.56.52 | 2021-11-11 22:28:35.116744
</pre>
There was no problem for the writer applications.<br>
<br>
<b>Read Only</b><br>
<br>
<pre class="brush: text">192.168.56.51 | 2021-11-11 22:28:03.186052
192.168.56.53 | 2021-11-11 22:28:04.208455
192.168.56.51 | 2021-11-11 22:28:05.23119
192.168.56.51 | 2021-11-11 22:28:16.665654
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
192.168.56.51 | 2021-11-11 22:28:30.700107
192.168.56.51 | 2021-11-11 22:28:31.721761
192.168.56.51 | 2021-11-11 22:28:32.744021
192.168.56.51 | 2021-11-11 22:28:33.766453
192.168.56.51 | 2021-11-11 22:28:34.789146
192.168.56.51 | 2021-11-11 22:28:35.811602
</pre>
<pre class="brush: text">+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 16 | |
+-----------+---------------+---------+---------+----+-----------+
</pre>
The isolated standby server was excluded from the cluster configuration.<br>
<br>
After communication was brought back, the standby node rejoined the cluster automatically.<br>
<br>
<pre class="brush: text">+-----------+---------------+---------+---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 16 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 16 | |
| psql13n53 | 192.168.56.53 | Replica | running | 16 | 0 |
+-----------+---------------+---------+---------+----+-----------+
</pre>
<br>
<span style="font-size: x-large;">Switchover</span> <br>
<br>
Manually trigger a switchover of the primary node to one of the replicas and bring the old primary back into the cluster as a new replica.<br>
<br>
<pre class="brush: text">$ patronictl -c /opt/app/patroni/etc/postgresql.yml list
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 17 | |
| psql13n52 | 192.168.56.52 | Replica | running | 17 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 17 | 0 |
+-----------+---------------+---------+---------+----+-----------+
</pre>
<pre class="brush: text">
$ patronictl -c /opt/app/patroni/etc/postgresql.yml switchover
Master [psql13n51]:
Candidate ['psql13n52', 'psql13n53'] []: psql13n52
When should the switchover take place (e.g. 2021-11-15T22:08 ) [now]:
Current cluster topology
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 17 | |
| psql13n52 | 192.168.56.52 | Replica | running | 17 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 17 | 0 |
+-----------+---------------+---------+---------+----+-----------+
Are you sure you want to switchover cluster postgres, demoting current master psql13n51? [y/N]: y
2021-11-15 21:10:28.05685 Successfully switched over to "psql13n52"
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | stopped | | unknown |
| psql13n52 | 192.168.56.52 | Leader | running | 17 | |
| psql13n53 | 192.168.56.53 | Replica | running | 17 | 0 |
+-----------+---------------+---------+---------+----+-----------+
</pre>
<pre class="brush: text">192.168.56.51 | 2021-11-15 21:10:21.190417
192.168.56.51 | 2021-11-15 21:10:22.223856
192.168.56.51 | 2021-11-15 21:10:23.259458
192.168.56.51 | 2021-11-15 21:10:24.293523
192.168.56.51 | 2021-11-15 21:10:25.329155
psql: error: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
192.168.56.51 | 2021-11-15 21:10:30.379076
192.168.56.51 | 2021-11-15 21:10:31.40607
192.168.56.52 | 2021-11-15 21:10:32.417283
192.168.56.51 | 2021-11-15 21:10:33.450491
192.168.56.52 | 2021-11-15 21:10:34.468676
192.168.56.52 | 2021-11-15 21:10:35.494665
192.168.56.52 | 2021-11-15 21:10:36.517738
192.168.56.52 | 2021-11-15 21:10:37.541415
192.168.56.52 | 2021-11-15 21:10:38.567083
</pre>
Node 192.168.56.52 became the new primary node and 192.168.56.51 joined the cluster as the new replica. Downtime for the read-write node was 7 secs.<br>
<br>
<pre class="brush: text">+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 18 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 18 | |
| psql13n53 | 192.168.56.53 | Replica | running | 18 | 0 |
+-----------+---------------+---------+---------+----+-----------+
</pre>
<br>
<span style="font-size: x-large;">Failover</span><br>
<br>
Although a failover can also be triggered manually, it is mostly executed automatically, when the leader node is unavailable for an unplanned reason. We have seen automatic failovers in the previous tests.<br>
<br>
For this test I will trigger a failover manually.<br>
<br>
<pre class="brush: text">patronictl -c /opt/app/patroni/etc/postgresql.yml failover
Candidate ['psql13n51', 'psql13n53'] []: psql13n51
Current cluster topology
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 18 | 0 |
| psql13n52 | 192.168.56.52 | Leader | running | 18 | |
| psql13n53 | 192.168.56.53 | Replica | running | 18 | 0 |
+-----------+---------------+---------+---------+----+-----------+
Are you sure you want to failover cluster postgres, demoting current master psql13n52? [y/N]: y
2021-11-15 21:25:04.85489 Successfully failed over to "psql13n51"
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 18 | |
| psql13n52 | 192.168.56.52 | Replica | stopped | | unknown |
| psql13n53 | 192.168.56.53 | Replica | running | 18 | 0 |
+-----------+---------------+---------+---------+----+-----------+
</pre>
Node 192.168.56.51 became the new master and node 192.168.56.52 joined the cluster as a replica. Downtime was 7 secs.<br>
<br>
<pre class="brush: text">+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 19 | |
| psql13n52 | 192.168.56.52 | Replica | running | 19 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 |
+-----------+---------------+---------+---------+----+-----------+
</pre>
<br>
<span style="font-size: x-large;">Maintenance mode</span><br>
<br>
Sometimes it is necessary to do maintenance on a single node and you do not want Patroni to manage the cluster, for example when performing a PostgreSQL upgrade.<br>
<br>
When Patroni is paused, it won't change the state of PostgreSQL - for example, it will not try to automatically start the PostgreSQL instance when it is stopped.<br>
<br>
For the test we will stop the replica and check whether Patroni starts the database automatically as in the previous tests.<br>
<br>
<pre class="brush: text">[postgres@psql13n52 ~]$ patronictl -c /opt/app/patroni/etc/postgresql.yml pause
Success: cluster management is paused
</pre>
<pre class="brush: text">
[postgres@psql13n51 ~]$ patronictl -c /opt/app/patroni/etc/postgresql.yml list
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 19 | |
| psql13n52 | 192.168.56.52 | Replica | running | 19 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 |
+-----------+---------------+---------+---------+----+-----------+
Maintenance mode: on
</pre>
Notice - "Maintenance mode: on".<br>
<br>
Replica is stopped:<br>
<br>
<pre class="brush: text">$ pg_ctl -D /var/lib/pgsql/14/data stop
waiting for server to shut down.... done
server stopped
</pre>
Patroni didn't bring the database back up.<br>
<br>
<pre class="brush: text">$ patronictl -c /opt/app/patroni/etc/postgresql.yml list
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 19 | |
| psql13n52 | 192.168.56.52 | Replica | stopped | | unknown |
| psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 |
+-----------+---------------+---------+---------+----+-----------+
Maintenance mode: on
</pre>
Resume Patroni.<br>
<br>
<pre class="brush: text">$ patronictl -c /opt/app/patroni/etc/postgresql.yml resume
Success: cluster management is resumed
</pre>
The node rejoined the cluster after a few seconds.<br>
<br>
<pre class="brush: text">$ patronictl -c /opt/app/patroni/etc/postgresql.yml list
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 19 | |
| psql13n52 | 192.168.56.52 | Replica | running | 19 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 19 | 0 |
+-----------+---------------+---------+---------+----+-----------+
</pre>
<br>
<br>
<br>
<br>
<b>References</b>:<br>
<a href="http://highscalability.com/blog/2019/9/16/managing-high-availability-in-postgresql-part-iii-patroni.html">http://highscalability.com/blog/2019/9/16/managing-high-availability-in-postgresql-part-iii-patroni.html</a>
</span>
<span style="font-size: x-large;"><b>Deploying PostgreSQL 14.0 for High Availability using Patroni, etcd, HAProxy and keepalived on CentOS 8</b></span>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKuDz5yvw2rtzj8o9_sT3hIe4cILa534IvqjK7H7fcBHPFmKaRIDm_r_GL9SUKF_bzxJiVlaavVEs2YhvkMbgVv6Yz4bbPPgk60ZCwcwps7Ct5V8nEbZpWF-bq-BRJDbPsSVs9RvvR-IhW/s0/Screenshot+2021-11-07+at+09.45.57.JPG" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" data-original-height="1146" data-original-width="2048" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKuDz5yvw2rtzj8o9_sT3hIe4cILa534IvqjK7H7fcBHPFmKaRIDm_r_GL9SUKF_bzxJiVlaavVEs2YhvkMbgVv6Yz4bbPPgk60ZCwcwps7Ct5V8nEbZpWF-bq-BRJDbPsSVs9RvvR-IhW/s0/Screenshot+2021-11-07+at+09.45.57.JPG"/></a></div>
</br>
Patroni is an automatic failover system for PostgreSQL. It provides automatic and manual failover and keeps all vital data in a distributed configuration store (DCS). Database connections do not happen directly to the database nodes but are routed via a connection proxy like HAProxy. The proxy determines the active/master node.</br>
</br>
By using a proxy to route connections, the risk of a split-brain scenario is very limited.</br>
</br>
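As an illustration, here is a minimal HAProxy sketch of this routing, assuming the port layout used later in this post (5000 for the primary, 5001 for replicas, Patroni REST API on 8008) and the node addresses configured below. The health checks rely on Patroni's REST API returning HTTP 200 on /master only on the primary, and on /replica only on replicas:</br>
</br>
<pre class="brush: text">
listen primary
    bind *:5000
    option httpchk OPTIONS /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server psql13n51 192.168.56.51:5432 maxconn 100 check port 8008
    server psql13n52 192.168.56.52:5432 maxconn 100 check port 8008
    server psql13n53 192.168.56.53:5432 maxconn 100 check port 8008

listen standbys
    balance roundrobin
    bind *:5001
    option httpchk OPTIONS /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server psql13n51 192.168.56.51:5432 maxconn 100 check port 8008
    server psql13n52 192.168.56.52:5432 maxconn 100 check port 8008
    server psql13n53 192.168.56.53:5432 maxconn 100 check port 8008
</pre>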
By using Patroni, all the dynamic settings are stored in the DCS in order to have complete consistency on the participating nodes.</br>
</br>
In this blog post I will focus on building a Patroni cluster on top of CentOS 8, using etcd for clustering and HAProxy for routing database connections to the primary server.</br>
</br>
</br>
<span id="fullpost">
<span style="font-size: x-large;">OS setup</span></br>
</br>
Firewalld and SELinux need to be adjusted before configuring the Patroni cluster.</br>
</br>
</br>
<span style="font-size: large;">Firewalld</span></br>
</br>
The ports required for operating patroni/etcd/haproxy/postgresql are the following:</br>
</br>
<b>5432</b> - PostgreSQL standard port on which the database instances listen</br>
<b>5000</b> - HAProxy listening port that routes database connections to the write node</br>
<b>5001</b> - HAProxy listening port that routes database connections to the read nodes</br>
<b>2380</b> - etcd peer URLs port, required for communication between the etcd members</br>
<b>2379</b> - etcd client port, required by any client, including Patroni, to communicate with etcd</br>
<b>8008</b> - Patroni REST API port, required by HAProxy to check the nodes' status</br>
<b>7000</b> - HAProxy port that exposes the proxy's statistics</br>
</br>
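With firewalld, these ports can be opened with something like the following sketch (run on each node, opening only the ports for the services that actually run there):</br>
</br>
<pre class="brush: text">
sudo firewall-cmd --add-port={5432,5000,5001,2380,2379,8008,7000}/tcp --permanent
sudo firewall-cmd --reload
</pre>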
<span style="font-size: large;">selinux</span> </br>
</br>
By default, SELinux prevents new services from binding to all the IP addresses.</br>
</br>
In order to allow HAProxy to bind to the ports required for its functionality, we need to run this command:</br>
</br>
<pre class="brush: text">
sudo setsebool -P haproxy_connect_any=1
</pre>
</br>
<span style="font-size: x-large;">Initial setup</span></br>
</br>
<pre class="brush: text">
$ cat /etc/hosts
192.168.56.51 psql13n51
192.168.56.52 psql13n52
192.168.56.53 psql13n53
</pre>
</br>
<span style="font-size: x-large;">ETCD</span></br>
</br>
Etcd is a fault-tolerant, distributed key-value store used to store the state of the Postgres cluster. Via Patroni, all of the Postgres nodes make use of etcd to keep the Postgres cluster up and running.</br>
</br>
In production, it may be best to use a larger etcd cluster so that if one etcd node fails, it doesn't affect the Postgres servers.</br>
</br>
<span style="font-size: large;">Download and Install the etcd Binaries (All nodes)</span> </br>
</br>
Install etcd on all three nodes.</br>
</br>
<pre class="brush: text">
ETCD_VER=v3.5.1
# choose either URL
GOOGLE_URL=https://storage.googleapis.com/etcd
GITHUB_URL=https://github.com/etcd-io/etcd/releases/download
DOWNLOAD_URL=${GOOGLE_URL}
rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
rm -rf /tmp/etcd-download-test && mkdir -p /tmp/etcd-download-test
curl -L ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /tmp/etcd-download-test --strip-components=1
rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
/tmp/etcd-download-test/etcd --version
/tmp/etcd-download-test/etcdctl version
/tmp/etcd-download-test/etcdutl version
</pre>
Move the binaries to the <i>/usr/local/bin</i> directory.</br>
</br>
<pre class="brush: text">
mv /tmp/etcd-download-test/etcd* /usr/local/bin/
</pre>
Check etcd and etcdctl version.</br>
</br>
<pre class="brush: text">
$ etcd --version
etcd Version: 3.5.1
Git SHA: e8732fb5f
Go Version: go1.16.3
Go OS/Arch: linux/amd64
$ etcdctl version
etcdctl version: 3.5.1
API version: 3.5
</pre>
<span style="font-size: large;">Configure Etcd Systemd service:</span> </br>
</br>
<b>Create etcd directories and user (All nodes)</b></br>
</br>
Create etcd system user:</br>
</br>
<pre class="brush: text">
sudo groupadd --system etcd
sudo useradd -s /sbin/nologin --system -g etcd etcd
</pre>
Set /var/lib/etcd/ directory ownership to etcd user:</br>
</br>
<pre class="brush: text">
sudo mkdir -p /var/lib/etcd/
sudo mkdir /etc/etcd
sudo chown -R etcd:etcd /var/lib/etcd/
sudo chmod -R 700 /var/lib/etcd/
</pre>
<b>Configure the etcd on all nodes.</b></br>
</br>
On each server, save these variables by running the commands below.</br>
</br>
<pre class="brush: text">
INT_NAME="eth1"
#INT_NAME="ens3"
ETCD_HOST_IP=$(ip addr show $INT_NAME | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)
ETCD_NAME=$(hostname -s)
</pre>
Where:</br>
<b>INT_NAME</b> - The name of your network interface to be used for cluster traffic. Change it to match your server configuration.</br>
<b>ETCD_HOST_IP</b> - The internal IP address of the specified network interface. This is used to serve client requests and communicate with etcd cluster peers.</br>
<b>ETCD_NAME</b> – Each etcd member must have a unique name within an etcd cluster. The command used here sets the etcd name to match the hostname of the current compute instance.</br>
</br>
Check variables to confirm they have correct values:</br>
</br>
<pre class="brush: text">
echo $INT_NAME
echo $ETCD_HOST_IP
echo $ETCD_NAME
</pre>
Once all variables are set, create the etcd.service systemd unit file. Replace --listen-client-urls with your server IPs.</br>
</br>
For etcd 3.5 the default API is v3, but Patroni doesn't currently support the v3 API, so it is important to set the parameter <b>enable-v2=true</b>.</br>
</br>
<pre class="brush: text">
cat << EOF > /lib/systemd/system/etcd.service
[Unit]
Description=etcd service
Documentation=https://github.com/coreos/etcd
[Service]
User=etcd
Type=notify
ExecStart=/usr/local/bin/etcd \\
--name ${ETCD_NAME} \\
--enable-v2=true \\
--data-dir /var/lib/etcd \\
--initial-advertise-peer-urls http://${ETCD_HOST_IP}:2380 \\
--listen-peer-urls http://${ETCD_HOST_IP}:2380 \\
--listen-client-urls http://${ETCD_HOST_IP}:2379,http://127.0.0.1:2379 \\
--advertise-client-urls http://${ETCD_HOST_IP}:2379 \\
--initial-cluster-token etcd-cluster-1 \\
--initial-cluster psql13n51=http://192.168.56.51:2380,psql13n52=http://192.168.56.52:2380,psql13n53=http://192.168.56.53:2380 \\
--initial-cluster-state new \\
--heartbeat-interval 1000 \\
--election-timeout 5000
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
</pre>
For CentOS / RHEL Linux distributions, set SELinux mode to permissive.</br>
</br>
<pre class="brush: text">
sudo setenforce 0
sudo sed -i 's/^SELINUX=.*/SELINUX=permissive/g' /etc/selinux/config
</pre>
If you have an active firewall service, allow ports 2379 and 2380.</br>
</br>
<pre class="brush: text">
# RHEL / CentOS / Fedora firewalld
sudo firewall-cmd --add-port={2379,2380}/tcp --permanent
sudo firewall-cmd --reload
# Ubuntu/Debian
sudo ufw allow proto tcp from any to any port 2379,2380
</pre>
<span style="font-size: large;">Bootstrap The etcd Cluster</span> </br>
</br>
Once all the configurations are applied on the three servers, start and enable the newly created etcd service on all the nodes. The first server will act as a bootstrap node. One node will be automatically elected as the leader once the service is started on all three nodes.</br>
</br>
<pre class="brush: text">
# systemctl daemon-reload
# systemctl enable etcd
# systemctl start etcd.service
# systemctl status -l etcd.service
</pre>
<span style="font-size: large;">Test Etcd Cluster installation</span></br>
</br>
Test your setup by listing the etcd cluster members:</br>
</br>
<pre class="brush: text">
# etcdctl member list
</pre>
Check the leader on the host:</br>
</br>
<pre class="brush: text">
# etcdctl endpoint status --write-out=table
</pre>
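The command above only asks the local member; to see the status of every member at once, including which one is the leader, all endpoints can be queried together:</br>
</br>
<pre class="brush: text">
# etcdctl --endpoints=192.168.56.51:2379,192.168.56.52:2379,192.168.56.53:2379 endpoint status --write-out=table
</pre>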
Also check cluster health by running the command:</br>
</br>
<pre class="brush: text">
# etcdctl endpoint health
127.0.0.1:2379 is healthy: successfully committed proposal: took = 4.383594ms
</pre>
Let’s also try writing to etcd.</br>
</br>
<pre class="brush: text">
# etcdctl put /message "Hello World"
</pre>
Read the value of the message back – it should work on all nodes.</br>
</br>
<pre class="brush: text">
# etcdctl get /message
Hello World
</pre>
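Since Patroni will talk to etcd over the v2 API, it is also worth confirming that the v2 gateway responds; a quick check, assuming enable-v2=true was set in the unit file above:</br>
</br>
<pre class="brush: text">
# curl -s http://127.0.0.1:2379/v2/members
</pre>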
<span style="font-size: x-large;">Watchdog</span></br>
</br>
Watchdog devices reset the whole system when they do not get a keepalive heartbeat within a specified timeframe. This adds an additional layer of fail-safe protection in case the usual Patroni split-brain protection mechanisms fail.
It is recommended to deploy a watchdog mechanism when running a PostgreSQL HA configuration in production.</br>
</br>
Install on all nodes.</br>
</br>
<pre class="brush: text">
yum -y install watchdog
</pre>
</br>
<pre class="brush: text">
/sbin/modprobe softdog
</pre>
Patroni will be the component interacting with the watchdog device. Since Patroni is run by the postgres user, we need to either set the permissions of the watchdog device open enough so the postgres user can write to it or make the device owned by postgres itself, which we consider a safer approach (as it is more restrictive):</br>
</br>
Next, configure the softdog kernel module to load on CentOS boot. It is better not to load the module via /etc/rc.local; instead, use the default CentOS method of loading modules from /etc/rc.modules:</br>
</br>
<pre class="brush: text">
echo modprobe softdog >> /etc/rc.modules
chmod +x /etc/rc.modules
</pre>
<pre class="brush: text">
sudo sh -c 'echo "KERNEL==\"watchdog\", OWNER=\"postgres\", GROUP=\"postgres\"" >> /etc/udev/rules.d/61-watchdog.rules'
</pre>
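For the new udev rule to take effect without a reboot, reload and re-trigger the rules (a sketch):</br>
</br>
<pre class="brush: text">
sudo udevadm control --reload-rules
sudo udevadm trigger
</pre>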
Check whether the module is blacklisted by default or a stray file with such a directive is still lingering around:</br>
</br>
<pre class="brush: text">
$ grep blacklist /lib/modprobe.d/* /etc/modprobe.d/* |grep softdog
</pre>
If the module is blacklisted, edit that file on each of the nodes to remove the line and restart the servers. Then confirm the module is loaded:</br>
</br>
<pre class="brush: text">
$ lsmod | grep softdog
softdog 16384 0
</pre>
</br>
<pre class="brush: text">
[root@localhost ~]# ls -l /dev/watchdog*
crw-------. 1 root root 10, 130 Nov 5 11:13 /dev/watchdog
crw-------. 1 root root 248, 0 Nov 5 11:13 /dev/watchdog0
</pre>
<span style="font-size: x-large;">PostgreSQL</span></br>
</br>
Install PostgreSQL on all nodes.</br>
</br>
By default, the postgresql module has an older version of PostgreSQL enabled, and the current module stream does not include PostgreSQL 14. Confirm with this command:</br>
</br>
<pre class="brush: text">
sudo dnf module list postgresql
</pre>
Let us install the repository RPM using this command:</br>
</br>
<pre class="brush: text">
sudo dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
</pre>
Then to avoid conflicts, let us disable the built-in PostgreSQL module:</br>
</br>
<pre class="brush: text">
sudo dnf -qy module disable postgresql
</pre>
Finally install PostgreSQL 14 server:</br>
</br>
<pre class="brush: text">
sudo dnf install -y postgresql14-server
</pre>
Let’s also install the Contrib package which provides several additional features for the PostgreSQL database system:</br>
</br>
<pre class="brush: text">
sudo dnf install -y postgresql14-contrib
</pre>
An important concept to understand in a PostgreSQL HA environment like this one is that PostgreSQL should not be started automatically by systemd during the server initialization: we should leave it to Patroni to fully manage it, including the process of starting and stopping the server. Thus, we should disable the service:</br>
</br>
<pre class="brush: text">
sudo systemctl disable postgresql-14
</pre>
Start with a fresh new PostgreSQL setup and let Patroni bootstrap the cluster. Remove the data directory that has been created as part of the PostgreSQL installation:</br>
</br>
<pre class="brush: text">
sudo systemctl stop postgresql-14
sudo rm -fr /var/lib/pgsql/14/data
</pre>
<span style="font-size: x-large;">Patroni</span></br>
</br>
Patroni is a cluster manager used to customize and automate deployment and maintenance of PostgreSQL HA (High Availability) clusters. You should check the latest available release on the GitHub page.</br>
</br>
</br>
Install Patroni and the Python client for etcd on all 3 nodes:</br>
</br>
<pre class="brush: text">
# yum install patroni-etcd
# yum install pyhton3-etcd
</pre>
If you have an active firewall service, allow port 5432 on all nodes.</br>
</br>
<pre class="brush: text">
# RHEL / CentOS / Fedora firewalld
sudo firewall-cmd --add-port=5432/tcp --permanent
sudo firewall-cmd --reload
# Ubuntu/Debian
sudo ufw allow proto tcp from any to any port 5432
</pre>
</br>
<pre class="brush: text">
pip install python-etcd
</pre>
Here’s the configuration file we have used for psql13n51:</br>
</br>
<pre class="brush: text">
cat /opt/app/patroni/etc/postgresql.yml
scope: postgres
name: psql13n51
restapi:
listen: 0.0.0.0:8008
connect_address: 192.168.56.51:8008
etcd:
host: psql13n51:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
wal_level: replica
hot_standby: "on"
logging_collector: "on"
max_wal_senders: 5
max_replication_slots: 5
initdb:
- encoding: UTF8
- data-checksums
pg_hba:
- host replication replicator 127.0.0.1/32 trust
- host replication replicator 192.168.56.1/24 md5
- host all all 192.168.56.1/24 md5
- host all all 0.0.0.0/0 md5
users:
admin:
password: admin
options:
- createrole
- createdb
postgresql:
listen: 0.0.0.0:5432
connect_address: 192.168.56.51:5432
data_dir: "/var/lib/pgsql/14/data"
bin_dir: "/usr/pgsql-14/bin"
pgpass: /tmp/pgpass
authentication:
replication:
username: replicator
password: vagrant
superuser:
username: postgres
password: vagrant
parameters:
unix_socket_directories: '/var/run/postgresql'
watchdog:
mode: required
device: /dev/watchdog
safety_margin: 5
tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
</pre>
Validate configuration:</br>
</br>
<pre class="brush: text">
patroni --validate-config /opt/app/patroni/etc/postgresql.yml
</pre>
Let's bootstrap the cluster as the postgres user, using the parameters from the yml file:</br>
</br>
<pre class="brush: text">
# sudo su - postgres
patroni /opt/app/patroni/etc/postgresql.yml
</pre>
<pre class="brush: text">
2021-11-06 07:20:41,692 INFO: postmaster pid=1863
2021-11-06 07:20:41.704 UTC [1863] LOG: redirecting log output to logging collector process
2021-11-06 07:20:41.704 UTC [1863] HINT: Future log output will appear in directory "log".
localhost:5432 - rejecting connections
localhost:5432 - accepting connections
2021-11-06 07:20:41,807 INFO: establishing a new patroni connection to the postgres cluster
2021-11-06 07:20:41,820 INFO: running post_bootstrap
2021-11-06 07:20:41,844 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
2021-11-06 07:20:41,864 INFO: initialized a new cluster
2021-11-06 07:20:51,859 INFO: no action. I am (psql13n51) the leader with the lock
2021-11-06 07:20:51,880 INFO: no action. I am (psql13n51) the leader with the lock
2021-11-06 07:21:01,883 INFO: no action. I am (psql13n51) the leader with the lock
2021-11-06 07:21:11,877 INFO: no action. I am (psql13n51) the leader with the lock
2021-11-06 07:21:21,878 INFO: no action. I am (psql13n51) the leader with the lock
2021-11-06 07:21:31,877 INFO: no action. I am (psql13n51) the leader with the lock
2021-11-06 07:21:41,880 INFO: no action. I am (psql13n51) the leader with the lock
</pre>
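At this point the REST API configured on port 8008 can also be used to check the node state; for example (a sketch using the endpoints that HAProxy health checks rely on):</br>
</br>
<pre class="brush: text">
# full node status as JSON
curl -s http://192.168.56.51:8008/patroni

# returns HTTP 200 only when this node is the leader
curl -s -o /dev/null -w '%{http_code}\n' http://192.168.56.51:8008/master
</pre>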
Next, edit the postgresql.yml file on the psql13n52 node and add the following configuration parameters.</br>
Make sure you change the name, etcd host name, listen and connect_address values:</br>
</br>
<pre class="brush: text">
cat /opt/app/patroni/etc/postgresql.yml
scope: postgres
name: psql13n52
restapi:
listen: 0.0.0.0:8008
connect_address: 192.168.56.52:8008
etcd:
host: psql13n52:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
wal_level: replica
hot_standby: "on"
logging_collector: "on"
max_wal_senders: 5
max_replication_slots: 5
initdb:
- encoding: UTF8
- data-checksums
pg_hba:
- host replication replicator 127.0.0.1/32 trust
- host replication replicator 192.168.56.1/24 md5
- host all all 192.168.56.1/24 md5
- host all all 0.0.0.0/0 md5
users:
admin:
password: admin
options:
- createrole
- createdb
postgresql:
listen: 0.0.0.0:5432
connect_address: 192.168.56.52:5432
data_dir: "/var/lib/pgsql/14/data"
bin_dir: "/usr/pgsql-14/bin"
pgpass: /tmp/pgpass
authentication:
replication:
username: replicator
password: vagrant
superuser:
username: postgres
password: vagrant
parameters:
unix_socket_directories: '/var/run/postgresql'
watchdog:
mode: required
device: /dev/watchdog
safety_margin: 5
tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
</pre>
Validate configuration:</br>
</br>
<pre class="brush: text">
patroni --validate-config /opt/app/patroni/etc/postgresql.yml
</pre>
Run as postgres user on psql13n52 node:</br>
</br>
<pre class="brush: text">
# sudo su - postgres
patroni /opt/app/patroni/etc/postgresql.yml
</pre>
<pre class="brush: text">
2021-11-06 07:23:25,827 INFO: Selected new etcd server http://192.168.56.53:2379
2021-11-06 07:23:25,831 INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-11-06 07:23:25,839 INFO: Lock owner: psql13n51; I am psql13n52
2021-11-06 07:23:25,843 INFO: trying to bootstrap from leader 'psql13n51'
2021-11-06 07:23:26,237 INFO: replica has been created using basebackup
2021-11-06 07:23:26,238 INFO: bootstrapped from leader 'psql13n51'
2021-11-06 07:23:26,398 INFO: postmaster pid=1506
localhost:5432 - no response
2021-11-06 07:23:26.435 UTC [1506] LOG: redirecting log output to logging collector process
2021-11-06 07:23:26.435 UTC [1506] HINT: Future log output will appear in directory "log".
localhost:5432 - accepting connections
localhost:5432 - accepting connections
2021-11-06 07:23:27,449 INFO: Lock owner: psql13n51; I am psql13n52
2021-11-06 07:23:27,449 INFO: establishing a new patroni connection to the postgres cluster
2021-11-06 07:23:27,478 INFO: no action. I am a secondary (psql13n52) and following a leader (psql13n51)
2021-11-06 07:23:31,874 INFO: no action. I am a secondary (psql13n52) and following a leader (psql13n51)
2021-11-06 07:23:41,879 INFO: no action. I am a secondary (psql13n52) and following a leader (psql13n51)
</pre>
Next, edit the postgresql.yml file on psql13n53:</br>
</br>
<pre class="brush: text">
cat /opt/app/patroni/etc/postgresql.yml
scope: postgres
name: psql13n53
restapi:
listen: 0.0.0.0:8008
connect_address: 192.168.56.53:8008
etcd:
host: psql13n53:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576
postgresql:
use_pg_rewind: true
use_slots: true
parameters:
wal_level: replica
hot_standby: "on"
logging_collector: "on"
max_wal_senders: 5
max_replication_slots: 5
initdb:
- encoding: UTF8
- data-checksums
pg_hba:
- host replication replicator 127.0.0.1/32 trust
- host replication replicator 192.168.56.1/24 md5
- host all all 192.168.56.1/24 md5
- host all all 0.0.0.0/0 md5
users:
admin:
password: admin
options:
- createrole
- createdb
postgresql:
listen: 0.0.0.0:5432
connect_address: 192.168.56.53:5432
data_dir: "/var/lib/pgsql/14/data"
bin_dir: "/usr/pgsql-14/bin"
pgpass: /tmp/pgpass
authentication:
replication:
username: replicator
password: vagrant
superuser:
username: postgres
password: vagrant
parameters:
unix_socket_directories: '/var/run/postgresql'
watchdog:
mode: required
device: /dev/watchdog
safety_margin: 5
tags:
nofailover: false
noloadbalance: false
clonefrom: false
nosync: false
</pre>
Validate configuration:</br>
</br>
<pre class="brush: text">
patroni --validate-config /opt/app/patroni/etc/postgresql.yml
</pre>
Run as postgres user:</br>
</br>
<pre class="brush: text">
# sudo su - postgres
patroni /opt/app/patroni/etc/postgresql.yml
</pre>
<pre class="brush: text">
2021-11-06 07:25:26,664 INFO: Selected new etcd server http://192.168.56.53:2379
2021-11-06 07:25:26,667 INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-11-06 07:25:26,673 INFO: Lock owner: psql13n51; I am psql13n53
2021-11-06 07:25:26,676 INFO: trying to bootstrap from leader 'psql13n51'
2021-11-06 07:25:27,102 INFO: replica has been created using basebackup
2021-11-06 07:25:27,102 INFO: bootstrapped from leader 'psql13n51'
2021-11-06 07:25:27,262 INFO: postmaster pid=1597
localhost:5432 - no response
2021-11-06 07:25:27.299 UTC [1597] LOG: redirecting log output to logging collector process
2021-11-06 07:25:27.299 UTC [1597] HINT: Future log output will appear in directory "log".
localhost:5432 - accepting connections
localhost:5432 - accepting connections
2021-11-06 07:25:28,312 INFO: Lock owner: psql13n51; I am psql13n53
2021-11-06 07:25:28,313 INFO: establishing a new patroni connection to the postgres cluster
2021-11-06 07:25:28,340 INFO: no action. I am a secondary (psql13n53) and following a leader (psql13n51)
2021-11-06 07:25:31,877 INFO: no action. I am a secondary (psql13n53) and following a leader (psql13n51)
</pre>
Check the state of the Patroni cluster:</br>
</br>
<pre class="brush: text">
# patronictl -c /opt/app/patroni/etc/postgresql.yml list
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 1 | |
| psql13n52 | 192.168.56.52 | Replica | running | 1 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 1 | 0 |
+-----------+---------------+---------+---------+----+-----------+
</pre>
Psql13n51 started the Patroni cluster so it was automatically made the leader – and thus the primary/master PostgreSQL server. Nodes psql13n52 and psql13n53 are configured as read replicas (as the hot_standby option was enabled in Patroni’s configuration file).</br>
</br>
Check PostgreSQL configuration parameters:</br>
</br>
<pre class="brush: text">
patronictl -c /opt/app/patroni/etc/postgresql.yml show-config postgres
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
parameters:
hot_standby: 'on'
logging_collector: 'on'
max_replication_slots: 5
max_wal_senders: 5
wal_level: replica
use_pg_rewind: true
use_slots: true
retry_timeout: 10
ttl: 30
</pre>
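These dynamic settings live in the DCS (etcd) and apply cluster-wide. If you later need to change one of them you can use patronictl edit-config, which propagates the change to all nodes, instead of editing each node's YAML file (a hedged example; which parameters you edit will vary):</br>
</br>
<pre class="brush: text">
# Opens the dynamic configuration in an editor;
# Patroni applies the change cluster-wide once saved
patronictl -c /opt/app/patroni/etc/postgresql.yml edit-config
</pre>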
With the configuration file in place, and the etcd cluster already up, all that remains is to run Patroni as a systemd service:</br>
</br>
Configure the Patroni service on every node:</br>
</br>
<pre class="brush: text">
# vi /etc/systemd/system/patroni.service
[Unit]
Description=Runners to orchestrate a high-availability PostgreSQL
After=syslog.target network.target etcd.target
[Service]
Type=simple
User=postgres
Group=postgres
ExecStart=/usr/bin/patroni /opt/app/patroni/etc/postgresql.yml
KillMode=process
TimeoutSec=30
Restart=no
[Install]
WantedBy=multi-user.target
</pre>
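After creating or changing the unit file, reload systemd so it picks up the new service definition:</br>
</br>
<pre class="brush: text">
# systemctl daemon-reload
</pre>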
<pre class="brush: text">
# systemctl status patroni
# systemctl start patroni
# systemctl enable patroni
# systemctl status etcd
# systemctl enable etcd
</pre>
Reboot all 3 nodes:</br>
</br>
<pre class="brush: text">
# reboot
</pre>
Check the status of the Patroni service after the reboot. The service should be up and running.</br>
</br>
<pre class="brush: text">
# systemctl status patroni
● patroni.service - Runners to orchestrate a high-availability PostgreSQL
Loaded: loaded (/etc/systemd/system/patroni.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2021-11-05 23:33:22 UTC; 8h ago
Main PID: 705 (patroni)
Tasks: 14 (limit: 11401)
Memory: 146.1M
CGroup: /system.slice/patroni.service
├─ 705 /usr/bin/python3 /usr/bin/patroni /opt/app/patroni/etc/postgresql.yml
├─1278 /usr/pgsql-14/bin/postgres -D /var/lib/pgsql/14/data --config-file=/var/lib/pgsql/14/data/postgresql.conf --listen_addresses=0.0.0.0 --port=5432 --cluster_name=postgres --wal_level=replica --hot_standby=on --max_conn>
├─1280 postgres: postgres: logger
├─1282 postgres: postgres: checkpointer
├─1283 postgres: postgres: background writer
├─1284 postgres: postgres: stats collector
├─1359 postgres: postgres: postgres postgres 127.0.0.1(37558) idle
├─1371 postgres: postgres: walwriter
├─1372 postgres: postgres: autovacuum launcher
└─1373 postgres: postgres: logical replication launcher
Nov 06 07:43:00 psql13n51 patroni[705]: 2021-11-06 07:43:00,048 INFO: Software Watchdog activated with 25 second timeout, timing slack 15 seconds
Nov 06 07:43:00 psql13n51 patroni[705]: 2021-11-06 07:43:00,058 INFO: promoted self to leader by acquiring session lock
Nov 06 07:43:00 psql13n51 patroni[705]: server promoting
Nov 06 07:43:00 psql13n51 patroni[705]: 2021-11-06 07:43:00,064 INFO: cleared rewind state after becoming the leader
</pre>
Check again the state of the Patroni cluster:</br>
</br>
<pre class="brush: text">
# patronictl -c /opt/app/patroni/etc/postgresql.yml list
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Leader | running | 5 | |
| psql13n52 | 192.168.56.52 | Replica | running | 5 | 0 |
| psql13n53 | 192.168.56.53 | Replica | running | 5 | 0 |
+-----------+---------------+---------+---------+----+-----------+
</pre>
<pre class="brush: text">
$ sudo systemctl enable patroni
$ sudo systemctl start patroni
$ sudo systemctl status patroni
</pre>
PostgreSQL itself is prohibited from auto-starting on boot, because the instance is managed by Patroni.</br>
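To make sure of that, the packaged PostgreSQL unit should stay disabled; with the PGDG packages used here the unit would be named postgresql-14 (adjust to your packaging):</br>
</br>
<pre class="brush: text">
# systemctl disable postgresql-14
</pre>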
</br>
<span style="font-size: x-large;">Keepalived</span></br>
</br>
Keepalived is used for IP failover between multiple servers.</br>
Download the latest Keepalived source from <a href="https://www.keepalived.org/download.html">https://www.keepalived.org/download.html</a>.</br>
</br>
Run this on all 3 nodes:</br>
</br>
<pre class="brush: text">
wget https://www.keepalived.org/software/keepalived-2.2.4.tar.gz
</pre>
Unpack archive and configure:</br>
</br>
<pre class="brush: text">
# tar xvfz keepalived-2.2.4.tar.gz
# cd keepalived-2.2.4
# ./configure
</pre>
Fix any errors before running the make command.</br>
</br>
In my case I needed to install the openssl and libnl3 packages.</br>
</br>
<pre class="brush: text">
configure: error:
!!! OpenSSL is not properly installed on your system. !!!
!!! Can not include OpenSSL headers files. !!!
</pre>
</br>
<pre class="brush: text">
yum -y install openssl openssl-devel
</pre>
</br>
<pre class="brush: text">
# ./configure
*** WARNING - this build will not support IPVS with IPv6. Please install libnl/libnl-3 dev libraries to support IPv6 with IPVS.
</pre>
</br>
<pre class="brush: text">
yum -y install libnl3 libnl3-devel
</pre>
</br>
<pre class="brush: text">
# ./configure
</pre>
When configure completes without errors, run make and make install:</br>
</br>
<pre class="brush: text">
# make && make install
</pre>
For Keepalived startup, the same systemd service is created on all servers.</br>
</br>
<pre class="brush: text">
# cat /etc/systemd/system/keepalived.service
[Unit]
Description=LVS and VRRP High Availability Monitor
After=network-online.target syslog.target
Wants=network-online.target
[Service]
Type=forking
PIDFile=/run/keepalived.pid
KillMode=process
EnvironmentFile=-/usr/local/etc/sysconfig/keepalived
ExecStart=/usr/local/sbin/keepalived $KEEPALIVED_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
[Install]
WantedBy=multi-user.target
</pre>
Before starting Keepalived, we have to create its configuration file on all servers.</br>
</br>
psql13n51:</br>
</br>
<pre class="brush: text">
# cat /etc/keepalived/keepalived.conf
global_defs {
}
vrrp_script chk_haproxy { # Requires keepalived-1.1.13
script "killall -0 haproxy" # widely used idiom
interval 2 # check every 2 seconds
weight 2 # add 2 points of prio if OK
}
vrrp_instance VI_1 {
interface eth0
state MASTER # or "BACKUP" on backup
priority 101 # 101 on master, 100 on backup
virtual_router_id 51
authentication {
auth_type PASS
auth_pass 1234
}
virtual_ipaddress {
192.168.56.100
}
track_script {
chk_haproxy
}
}
</pre>
psql13n52 & psql13n53:</br>
</br>
<pre class="brush: text">
# cat /etc/keepalived/keepalived.conf
global_defs {
}
vrrp_script chk_haproxy { # Requires keepalived-1.1.13
script "killall -0 haproxy" # widely used idiom
interval 2 # check every 2 seconds
weight 2 # add 2 points of prio if OK
}
vrrp_instance VI_1 {
interface eth0
state BACKUP # "MASTER" on the master node
priority 100 # 101 on master, 100 on backup
virtual_router_id 51
authentication {
auth_type PASS
auth_pass 1234
}
virtual_ipaddress {
192.168.56.100
}
track_script {
chk_haproxy
}
}
</pre>
Now start the Keepalived service on all servers:</br>
</br>
<pre class="brush: text">
# systemctl start keepalived
# systemctl status keepalived
# systemctl enable keepalived
</pre>
The VIP <b>192.168.56.100</b> runs on one server at a time and will automatically fail over to another server if there is any issue with the active one.</br>
</br>
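To verify which node currently holds the VIP, list the addresses on the interface; the VIP should appear on exactly one node at a time:</br>
</br>
<pre class="brush: text">
# ip addr show eth0 | grep 192.168.56.100
</pre>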
<span style="font-size: x-large;">HAProxy</span></br>
</br>
Instead of connecting directly to the database server, the application will be connecting to the proxy instead, which will forward the request to PostgreSQL. When HAproxy is used for this, it is also possible to route read requests to one or more replicas, for load balancing. With HAproxy, this is done by providing two different ports for the application to connect. We opted for the following setup:</br>
</br>
Writes → 5000</br>
Reads → 5001</br>
</br>
HAProxy is a lightweight service and it can be installed as an independent server or, as in our case, on the database servers.</br>
</br>
<pre class="brush: text">
sudo yum -y install haproxy
</pre>
Set configuration on all nodes:</br>
</br>
<pre class="brush: text">
$ cat /etc/haproxy/haproxy.cfg
global
maxconn 100
defaults
log global
mode tcp
retries 2
timeout client 30m
timeout connect 4s
timeout server 30m
timeout check 5s
listen stats
mode http
bind *:7000
stats enable
stats uri /
listen primary
bind *:5000
option httpchk OPTIONS /master
http-check expect status 200
default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
server psql13n51 psql13n51:5432 maxconn 100 check port 8008
server psql13n52 psql13n52:5432 maxconn 100 check port 8008
server psql13n53 psql13n53:5432 maxconn 100 check port 8008
listen standbys
balance roundrobin
bind *:5001
option httpchk OPTIONS /replica
http-check expect status 200
default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
server psql13n51 psql13n51:5432 maxconn 100 check port 8008
server psql13n52 psql13n52:5432 maxconn 100 check port 8008
server psql13n53 psql13n53:5432 maxconn 100 check port 8008
</pre>
Note there are two sections: primary, using port 5000, and standbys, using port 5001. All three nodes are included in both sections: that’s because they are all potential candidates to be either primary or secondary. </br>
</br>
For HAProxy to know which role each node currently has, it sends an HTTP request to port 8008 of the node, and Patroni answers. Patroni provides a built-in REST API for health check monitoring that integrates perfectly with HAProxy:</br>
</br>
<pre class="brush: text">
$ curl -s http://psql13n51:8008
{"state": "running", "postmaster_start_time": "2021-11-06 15:01:56.197081+00:00", "role": "replica", "server_version": 140000, "cluster_unlocked": false, "xlog": {"received_location": 83888920, "replayed_location": 83888920, "replayed_timestamp": null, "paused": false}, "timeline": 6, "database_system_identifier": "7027353509639501631", "patroni": {"version": "2.1.1", "scope": "postgres"}}
</pre>
Notice how we received <b>"role": "replica"</b> from Patroni when we sent the HTTP request to a standby server. HAProxy uses this information for query routing.</br>
</br>
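HAProxy's httpchk only cares about the returned HTTP status code. You can emulate the health checks manually; per the Patroni REST API, a replica should answer 200 on /replica and 503 on /master (example against the node above, which is currently a replica):</br>
</br>
<pre class="brush: text">
$ curl -s -o /dev/null -w "%{http_code}\n" http://psql13n51:8008/replica
200
$ curl -s -o /dev/null -w "%{http_code}\n" http://psql13n51:8008/master
503
</pre>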
We configured the standbys group to balance read-requests in a round-robin fashion, so each connection request (or reconnection) will alternate between the available replicas.</br>
</br>
Let’s start HAProxy on all three nodes:</br>
</br>
<pre class="brush: text">
# systemctl enable haproxy.service
# systemctl start haproxy.service
# systemctl status haproxy.service
</pre>
Set up a .pgpass file so we can test both master and standby connections:</br>
</br>
<pre class="brush: text">
sudo su - postgres
echo "localhost:5000:postgres:postgres:vagrant" > ~/.pgpass
echo "localhost:5001:postgres:postgres:vagrant" >> ~/.pgpass
chmod 0600 ~/.pgpass
</pre>
We can then execute two read-requests to verify the round-robin mechanism is working as intended:</br>
</br>
<pre class="brush: text">
$ psql -Upostgres -hlocalhost -p5001 -t -c "select inet_server_addr()"
192.168.56.51
$ psql -Upostgres -hlocalhost -p5001 -t -c "select inet_server_addr()"
192.168.56.52
</pre>
Master (read/write) connection:</br>
</br>
<pre class="brush: text">
$ psql -Upostgres -hlocalhost -p5000 -t -c "select inet_server_addr()"
192.168.56.53
</pre>
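With Keepalived in front of HAProxy, applications do not need to know any node address at all; they can connect through the VIP (add matching .pgpass entries for 192.168.56.100 if you test this way):</br>
</br>
<pre class="brush: text">
$ psql -Upostgres -h192.168.56.100 -p5000 -t -c "select pg_is_in_recovery()"
$ psql -Upostgres -h192.168.56.100 -p5001 -t -c "select pg_is_in_recovery()"
</pre>
The first connection should land on the primary (pg_is_in_recovery() returns f), the second on one of the replicas (returns t).</br>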
You can also check the state of HAproxy by visiting <a href="http://192.168.56.51:7000/">http://192.168.56.51:7000/</a> on your browser.</br>
</br>
<div class="separator" style="clear: both;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBKxrH2IdhrVPsS3tGmQzQXZERjqrEXXNJDkFEuZsjXtH_yuBGy-h_5FAg_zvQiZEannW1XaCr7yIcNG092SASU8FfPCn0C-JP-CRjjyU56EuFnP84oL3KBNoZvr03Rp_i0q-BeXwQXU27/s0/Screenshot+2021-11-07+at+10.28.21.JPG" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" data-original-height="714" data-original-width="1426" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBKxrH2IdhrVPsS3tGmQzQXZERjqrEXXNJDkFEuZsjXtH_yuBGy-h_5FAg_zvQiZEannW1XaCr7yIcNG092SASU8FfPCn0C-JP-CRjjyU56EuFnP84oL3KBNoZvr03Rp_i0q-BeXwQXU27/s0/Screenshot+2021-11-07+at+10.28.21.JPG"/></a></div>
<pre class="brush: text">
# patronictl -c /opt/app/patroni/etc/postgresql.yml list
+ Cluster: postgres (7027353509639501631) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+---------------+---------+---------+----+-----------+
| psql13n51 | 192.168.56.51 | Replica | running | 7 | 0 |
| psql13n52 | 192.168.56.52 | Replica | running | 7 | 0 |
| psql13n53 | 192.168.56.53 | Leader | running | 7 | |
+-----------+---------------+---------+---------+----+-----------+
</pre>
</br>
We have set up a three-node Patroni cluster with no single point of failure (SPOF). In the next part we will test disaster and failover scenarios on this configuration with an active read/write workload.</br>
</br>
</br>
<b>Reference</b>:</br>
<a href="https://github.com/zalando/patroni">https://github.com/zalando/patroni</a></br>
<a href="https://patroni.readthedocs.io/en/latest/">https://patroni.readthedocs.io/en/latest/</a></br>
<a href="https://www.percona.com/blog/2021/06/11/postgresql-ha-with-patroni-your-turn-to-test-failure-scenarios/">https://www.percona.com/blog/2021/06/11/postgresql-ha-with-patroni-your-turn-to-test-failure-scenarios/</a></br>
<a href="https://blog.dbi-services.com/postgresql-high-availabilty-patroni-ectd-haproxy-keepalived/">https://blog.dbi-services.com/postgresql-high-availabilty-patroni-ectd-haproxy-keepalived/</a></br>
<a href="https://digitalis.io/blog/technology/part1-postgresql-ha-patroni-etcd-haproxy/">https://digitalis.io/blog/technology/part1-postgresql-ha-patroni-etcd-haproxy/</a></br>
</br>
</br>
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com8tag:blogger.com,1999:blog-2530682427657016426.post-15047267011344357482020-11-11T13:21:00.003+01:002020-11-11T14:31:54.172+01:00ProxySQL - Throttle for MySQL queriesProxySQL is a great high availability and load balancing solution and it is mostly used for such purposes.
But ProxySQL offers much more.<br />
<br />
One of the nice features is the throttling mechanism for queries to the backends.<br />
<br />
Imagine you have a very active system and applications are executing queries at a very high rate, which is not so unusual nowadays.
If just one of those queries slows down, you could easily end up with many active sessions running the same query.
Just one problematic query could cause high resource usage and general slowness.<br />
<br />
Usually, the DBA is called, but the DBA cannot modify a query, disable the problematic application, or change the database model without a detailed analysis.<br />
<br />
But ProxySQL could help.<br />
Using ProxySQL we can delay execution of the problematic queries.<br />
Yes, the specific application request would still have a problem, but we would avoid a <b>general</b> problem/downtime and "buy" some time for the fix.<br />
<br />
<br />
Let's simulate such a situation in the test environment.<br />
<br />
<span id="fullpost">
Run benchmark test using sysbench.<br />
<br />
<pre class="brush: plain">
NUM_THREADS=1
TEST_DIR=/usr/share/sysbench/tests/include/oltp_legacy
sysbench \
--test=${TEST_DIR}/oltp_simple.lua \
--oltp-table-size=2000000 \
--time=300 \
--max-requests=0 \
--mysql-table-engine=InnoDB \
--mysql-user=sbtest \
--mysql-password=sbtest \
--mysql-port=3307 \
--mysql-host=192.168.56.25 \
--mysql-engine-trx=yes \
--num-threads=$NUM_THREADS \
prepare
sysbench \
--test=${TEST_DIR}/oltp_simple.lua \
--oltp-table-size=2000000 \
--time=180 \
--max-requests=0 \
--mysql-table-engine=InnoDB \
--mysql-user=sbtest \
--mysql-password=sbtest \
--mysql-port=3307 \
--mysql-host=192.168.56.25 \
--mysql-engine-trx=yes \
--num-threads=$NUM_THREADS \
run
</pre>
<br />
Enable the throttling mechanism and delay execution of all queries globally by setting "<i>mysql-default_query_delay=100</i>" (the value is in milliseconds).<br />
<br />
<pre class="brush: plain">
ProxySQLServer> set mysql-default_query_delay=100;
ProxySQLServer> LOAD MYSQL VARIABLES TO RUNTIME;
ProxySQLServer> SAVE MYSQL VARIABLES TO DISK;
</pre>
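To confirm the runtime value, query the global_variables table in the ProxySQL admin interface:<br />
<br />
<pre class="brush: plain">
ProxySQLServer> select variable_name, variable_value from global_variables where variable_name='mysql-default_query_delay';
</pre>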
<br />
Run test again and Check latency(ms).<br />
<br />
<pre class="brush: plain">
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Initializing worker threads...
Threads started!
SQL statistics:
queries performed:
read: 1774
write: 0
other: 0
total: 1774
transactions: 1774 (9.85 per sec.)
queries: 1774 (9.85 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 180.0942s
total number of events: 1774
Latency (ms):
min: 100.76 <<<<<<<<<<<<<<<<<
avg: 101.51 <<<<< Throttling
max: 129.17 <<<<<<<<<<<<<<<<<
95th percentile: 102.97
sum: 180083.66
Threads fairness:
events (avg/stddev): 1774.0000/0.00
execution time (avg/stddev): 180.0837/0.00
</pre>
<br />
Disable throttling and reset ProxySQL counters.<br />
<br />
<pre class="brush: plain">
ProxySQLServer> set mysql-default_query_delay=0;
ProxySQLServer> LOAD MYSQL VARIABLES TO RUNTIME; SAVE MYSQL VARIABLES TO DISK;
ProxySQLServer> select * from stats_mysql_query_digest_reset;
</pre>
<br />
Check latency(ms).<br />
<br />
<pre class="brush: plain">
Initializing worker threads...
Threads started!
SQL statistics:
queries performed:
read: 641413
write: 0
other: 0
total: 641413
transactions: 641413 (3563.38 per sec.)
queries: 641413 (3563.38 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 180.0004s
total number of events: 641413
Latency (ms):
min: 0.19
avg: 0.28
max: 44.45
95th percentile: 0.43
sum: 179252.76
Threads fairness:
events (avg/stddev): 641413.0000/0.00
execution time (avg/stddev): 179.2528/0.00
</pre>
<br />
Enable throttling for just a specific query using ProxySQL mysql query rules.<br />
<br />
<pre class="brush: sql">
-- Find problematic query
ProxySQLServer> select hostgroup,username,count_star,
(count_star/(select Variable_Value from stats_mysql_global where Variable_Name='ProxySQL_Uptime'))
as avg_per_sec, digest, digest_text from stats_mysql_query_digest order by count_star desc limit 10;
+-----------+----------+------------+-------------+--------------------+----------------------------------+
| hostgroup | username | count_star | avg_per_sec | digest | digest_text |
+-----------+----------+------------+-------------+--------------------+----------------------------------+
| 2 | sbtest | 641413 | 78 | 0xBF001A0C13781C1D | SELECT c FROM sbtest1 WHERE id=? |
+-----------+----------+------------+-------------+--------------------+----------------------------------+
1 row in set (0.00 sec)
-- Reset counters
ProxySQLServer> select * from stats_mysql_query_digest_reset;
+-----------+------------+----------+----------------+--------------------+----------------------------------+------------+------------+------------+-----------+----------+----------+-------------------+---------------+
| hostgroup | schemaname | username | client_address | digest | digest_text | count_star | first_seen | last_seen | sum_time | min_time | max_time | sum_rows_affected | sum_rows_sent |
+-----------+------------+----------+----------------+--------------------+----------------------------------+------------+------------+------------+-----------+----------+----------+-------------------+---------------+
| 2 | sbtest | sbtest | | 0xBF001A0C13781C1D | SELECT c FROM sbtest1 WHERE id=? | 641413 | 1601934890 | 1601935070 | 153023170 | 159 | 44349 | 0 | 214399 |
+-----------+------------+----------+----------------+--------------------+----------------------------------+------------+------------+------------+-----------+----------+----------+-------------------+---------------+
1 row in set (0.00 sec)
</pre>
<br />
Insert mysql query rule and enable throttling just for a specific query.<br />
<br />
<pre class="brush: plain">
ProxySQLServer> insert into mysql_query_rules(rule_id,active,digest,delay,apply) values (1,1,'0xBF001A0C13781C1D',100,1);
Query OK, 1 row affected (0.00 sec)
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
</pre>
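To verify the rule is actually matching traffic, check its hit counter in stats_mysql_query_rules:<br />
<br />
<pre class="brush: sql">
ProxySQLServer> select rule_id, hits from stats_mysql_query_rules;
</pre>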
<br />
Compare "min_time" between executions.<br />
<br />
<pre class="brush: plain">
Initializing worker threads...
Threads started!
SQL statistics:
queries performed:
read: 1773
write: 0
other: 0
total: 1773
transactions: 1773 (9.85 per sec.)
queries: 1773 (9.85 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 180.0325s
total number of events: 1773
Latency (ms):
min: 100.78 <<<<<<<<<<<<<<<<<
avg: 101.53 <<<<< Throttling
max: 104.77 <<<<<<<<<<<<<<<<<
95th percentile: 102.97
sum: 180021.34
Threads fairness:
events (avg/stddev): 1773.0000/0.00
execution time (avg/stddev): 180.0213/0.00
</pre>
<br />
<pre class="brush: sql">
ProxySQLServer> select * from stats_mysql_query_digest_reset;
+-----------+------------+----------+----------------+--------------------+----------------------------------+------------+------------+------------+-----------+----------+----------+-------------------+---------------+
| hostgroup | schemaname | username | client_address | digest | digest_text | count_star | first_seen | last_seen | sum_time | min_time | max_time | sum_rows_affected | sum_rows_sent |
+-----------+------------+----------+----------------+--------------------+----------------------------------+------------+------------+------------+-----------+----------+----------+-------------------+---------------+
| 2 | sbtest | sbtest | | 0xBF001A0C13781C1D | SELECT c FROM sbtest1 WHERE id=? | 1773 | 1601935408 | 1601935588 | 179697522 | 100681 | 104195 | 0 | 594 |
+-----------+------------+----------+----------------+--------------------+----------------------------------+------------+------------+------------+-----------+----------+----------+-------------------+---------------+
1 row in set (0.01 sec)
</pre>
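Once the problematic query is fixed, the throttling rule should be removed (or deactivated) the same way it was added:<br />
<br />
<pre class="brush: plain">
ProxySQLServer> delete from mysql_query_rules where rule_id=1;
ProxySQLServer> LOAD MYSQL QUERY RULES TO RUNTIME;
ProxySQLServer> SAVE MYSQL QUERY RULES TO DISK;
</pre>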
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com0tag:blogger.com,1999:blog-2530682427657016426.post-62277863706845646962018-01-24T10:45:00.003+01:002020-11-10T07:47:45.216+01:00Galera Cluster Schema Changes, Row Based Replication and Data InconsistencyGalera Cluster is a virtually synchronous multi-master replication plug-in. When using Galera Cluster an application can write to any node, and transactions are then applied to all servers via row-based replication events.<br />
<br />
This is built-in MySQL row-based replication, which supports replication with differing table definitions between master and slave.<br />
So, when using row-based replication the source and target table do not have to be identical. A table on the master can have more or fewer columns or use different data types.<br />
<br />
But there are limitations you must watch for, depending on the MySQL version you are running:<br />
- The database and table names must be the same on both Master and Slave<br />
- Columns must be in the same order before any additional column<br />
- Each extra column must have default value<br />
- ...<br />
<br />
Newer MySQL versions may tolerate more differences between source and target table - check documentation for your version.<br />
<br />
<br />
I want to show you what could happen to your data if you do not pay attention to these limitations.<br />
<br />
<span id="fullpost">
<br />
Suppose I have a 3-node MariaDB Galera Cluster with a table t.<br />
I want to add several columns to the table while the database is used by an application.<br />
<br />
For this task I will use the built-in Rolling Schema Upgrade (RSU) method, which enables me to perform schema changes on a node without impacting the rest of the cluster.<br />
<br />
Add column c4 to the table t, following the rules above for row-based replication.<br />
<br />
Table t has three columns and one row inserted.<br />
<pre class="brush: sql">
NODE1
MariaDB [testdb]> create table t (c1 varchar(10), c2 varchar(10), c3 varchar(10));
Query OK, 0 rows affected (0.37 sec)
MariaDB [testdb]> insert into t values ('n1-1','n1-1','n1-1');
Query OK, 1 row affected (0.00 sec)
NODE2
MariaDB [testdb]> select * from t;
+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| n1-1 | n1-1 | n1-1 |
+------+------+------+
1 row in set (0.00 sec)
NODE3
MariaDB [testdb]> select * from t;
+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| n1-1 | n1-1 | n1-1 |
+------+------+------+
1 row in set (0.01 sec)
</pre>
<br />
I will enable RSU mode which ensures that this server will not impact the rest of the cluster during ALTER command execution.<br />
<br />
Add column c4 and INSERT a row, simulating application activity.<br />
<pre class="brush: sql">
MariaDB [testdb]> set session wsrep_OSU_method='RSU';
Query OK, 0 rows affected (0.00 sec)
MariaDB [testdb]> alter table t add column c4 varchar(10);
Query OK, 0 rows affected (0.03 sec)
Records: 0 Duplicates: 0 Warnings: 0
MariaDB [testdb]> set session wsrep_OSU_method='TOI';
Query OK, 0 rows affected (0.00 sec)
MariaDB [testdb]> insert into t(c1,c2,c3) values ('n1-1','n1-1','n1-1');
Query OK, 1 row affected (0.13 sec)
</pre>
<br />
While the table definition differs between Node1 and the rest of the cluster, INSERT a few more rows on the other nodes.<br />
<br />
<pre class="brush: sql">
NODE2
insert into t(c1,c2,c3) values ('n2-1','n2-1','n2-1');
NODE3
insert into t(c1,c2,c3) values ('n3-1','n3-1','n3-1');
</pre>
<br />
Check rows from table t.<br />
<pre class="brush: sql">
NODE1
MariaDB [testdb]> select * from t;
+------+------+------+------+
| c1 | c2 | c3 | c4 |
+------+------+------+------+
| n1-1 | n1-1 | n1-1 | NULL |
| n1-1 | n1-1 | n1-1 | NULL |
| n2-1 | n2-1 | n2-1 | NULL |
| n3-1 | n3-1 | n3-1 | NULL |
+------+------+------+------+
4 rows in set (0.00 sec)
NODE2
MariaDB [testdb]> select * from t;
+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| n1-1 | n1-1 | n1-1 |
| n1-1 | n1-1 | n1-1 |
| n2-1 | n2-1 | n2-1 |
| n3-1 | n3-1 | n3-1 |
+------+------+------+
4 rows in set (0.00 sec)
NODE3
MariaDB [testdb]> select * from t;
+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| n1-1 | n1-1 | n1-1 |
| n1-1 | n1-1 | n1-1 |
| n2-1 | n2-1 | n2-1 |
| n3-1 | n3-1 | n3-1 |
+------+------+------+
4 rows in set (0.01 sec)
</pre>
<br />
As you can notice everything is OK with my data.<br />
<br />
Add new column to Node2 and Node3 following the same steps as for Node1.<br />
<br />
<pre class="brush: sql">
NODE2
MariaDB [testdb]> set session wsrep_OSU_method='RSU';
Query OK, 0 rows affected (0.00 sec)
MariaDB [testdb]> alter table t add column c4 varchar(10);
Query OK, 0 rows affected (0.03 sec)
Records: 0 Duplicates: 0 Warnings: 0
MariaDB [testdb]> set session wsrep_OSU_method='TOI';
Query OK, 0 rows affected (0.00 sec)
NODE3
MariaDB [testdb]> set session wsrep_OSU_method='RSU';
Query OK, 0 rows affected (0.00 sec)
MariaDB [testdb]> alter table t add column c4 varchar(10);
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
MariaDB [testdb]> set session wsrep_OSU_method='TOI';
Query OK, 0 rows affected (0.00 sec)
</pre>
<br />
And my task is completed. I have successfully changed the model of the table.<br />
<br />
<br />
But what can happen if I add a new column between existing columns?<br />
Remember, this is not permitted for row-based replication and can cause replication to break, or something <b>even worse</b>.<br />
<br />
Enable RSU mode on Node1 and add a new column c11 after the c1 column.<br />
INSERT a row, simulating an active application during the schema change.<br />
<br />
<pre class="brush: sql">
NODE1
MariaDB [testdb]> set session wsrep_OSU_method='RSU';
Query OK, 0 rows affected (0.00 sec)
MariaDB [testdb]>
MariaDB [testdb]> alter table t add column c11 varchar(10) after c1;
Query OK, 0 rows affected (0.03 sec)
Records: 0 Duplicates: 0 Warnings: 0
MariaDB [testdb]> set session wsrep_OSU_method='TOI';
Query OK, 0 rows affected (0.00 sec)
MariaDB [testdb]> insert into t(c1,c2,c3) values ('n1-1','n1-1','n1-1');
Query OK, 1 row affected (0.01 sec)
MariaDB [testdb]> select * from t;
+------+------+------+------+------+
| c1 | c11 | c2 | c3 | c4 |
+------+------+------+------+------+
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n2-1 | NULL | n2-1 | n2-1 | NULL |
| n3-1 | NULL | n3-1 | n3-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
+------+------+------+------+------+
5 rows in set (0.00 sec)
</pre>
<br />
INSERT a row on the other nodes, because Galera Cluster allows us to write to any node in the cluster configuration.<br />
<br />
<pre class="brush: sql">
NODE2
MariaDB [testdb]> insert into t(c1,c2,c3) values ('n2-1','n2-1','n2-1');
Query OK, 1 row affected (0.01 sec)
MariaDB [testdb]> select * from t;
+------+------+------+------+
| c1 | c2 | c3 | c4 |
+------+------+------+------+
| n1-1 | n1-1 | n1-1 | NULL |
| n1-1 | n1-1 | n1-1 | NULL |
| n2-1 | n2-1 | n2-1 | NULL |
| n3-1 | n3-1 | n3-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 |
| n2-1 | n2-1 | n2-1 | NULL |
+------+------+------+------+
6 rows in set (0.00 sec)
NODE3
MariaDB [testdb]> insert into t(c1,c2,c3) values ('n3-1','n3-1','n3-1');
Query OK, 1 row affected (0.01 sec)
MariaDB [testdb]> select * from t;
+------+------+------+------+
| c1 | c2 | c3 | c4 |
+------+------+------+------+
| n1-1 | n1-1 | n1-1 | NULL |
| n1-1 | n1-1 | n1-1 | NULL |
| n2-1 | n2-1 | n2-1 | NULL |
| n3-1 | n3-1 | n3-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 |
| n2-1 | n2-1 | n2-1 | NULL |
| n3-1 | n3-1 | n3-1 | NULL |
+------+------+------+------+
7 rows in set (0.00 sec)
</pre>
<br />
The INSERT commands were successfully executed and everything seems OK with my replication.<br />
I don't have any errors in error.log that suggest a problem.<br />
<br />
But check the content of table t on the first node, where the new column was added.<br />
<br />
<pre class="brush: sql">
NODE1
MariaDB [testdb]> select * from t;
+------+------+------+------+------+
| c1 | c11 | c2 | c3 | c4 |
+------+------+------+------+------+
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n2-1 | NULL | n2-1 | n2-1 | NULL |
| n3-1 | NULL | n3-1 | n3-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n2-1 | n2-1 | n2-1 | NULL | NULL |
| n3-1 | n3-1 | n3-1 | NULL | NULL |
+------+------+------+------+------+
7 rows in set (0.00 sec)
</pre>
<br />
Notice how the rows differ between the nodes; we should have exactly the same data on all three nodes.<br />
<br />
<br />
Let's complete the schema change on the other two nodes.<br />
<pre class="brush: sql">
NODE2
MariaDB [testdb]> set session wsrep_OSU_method='RSU';
Query OK, 0 rows affected (0.00 sec)
MariaDB [testdb]> alter table t add column c11 varchar(10) after c1;
Query OK, 0 rows affected (0.03 sec)
Records: 0 Duplicates: 0 Warnings: 0
MariaDB [testdb]> set session wsrep_OSU_method='TOI';
Query OK, 0 rows affected (0.00 sec)
NODE3
MariaDB [testdb]> set session wsrep_OSU_method='RSU';
Query OK, 0 rows affected (0.00 sec)
MariaDB [testdb]> alter table t add column c11 varchar(10) after c1;
Query OK, 0 rows affected (0.34 sec)
Records: 0 Duplicates: 0 Warnings: 0
MariaDB [testdb]> set session wsrep_OSU_method='TOI';
Query OK, 0 rows affected (0.00 sec)
</pre>
<br />
I have successfully added the new column without breaking replication, and everything seems OK, but my data is not consistent between the nodes.<br />
<br />
<pre class="brush: sql">
NODE1
MariaDB [testdb]> select * from t;
+------+------+------+------+------+
| c1 | c11 | c2 | c3 | c4 |
+------+------+------+------+------+
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n2-1 | NULL | n2-1 | n2-1 | NULL |
| n3-1 | NULL | n3-1 | n3-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n2-1 | n2-1 | n2-1 | NULL | NULL |
| n3-1 | n3-1 | n3-1 | NULL | NULL |
+------+------+------+------+------+
7 rows in set (0.00 sec)
NODE2
MariaDB [testdb]> select * from t;
+------+------+------+------+------+
| c1 | c11 | c2 | c3 | c4 |
+------+------+------+------+------+
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n2-1 | NULL | n2-1 | n2-1 | NULL |
| n3-1 | NULL | n3-1 | n3-1 | NULL |
| n1-1 | NULL | NULL | n1-1 | n1-1 |
| n2-1 | NULL | n2-1 | n2-1 | NULL |
| n3-1 | NULL | n3-1 | n3-1 | NULL |
+------+------+------+------+------+
7 rows in set (0.00 sec)
NODE3
MariaDB [testdb]> select * from t;
+------+------+------+------+------+
| c1 | c11 | c2 | c3 | c4 |
+------+------+------+------+------+
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n1-1 | NULL | n1-1 | n1-1 | NULL |
| n2-1 | NULL | n2-1 | n2-1 | NULL |
| n3-1 | NULL | n3-1 | n3-1 | NULL |
| n1-1 | NULL | NULL | n1-1 | n1-1 |
| n2-1 | NULL | n2-1 | n2-1 | NULL |
| n3-1 | NULL | n3-1 | n3-1 | NULL |
+------+------+------+------+------+
7 rows in set (0.00 sec)
</pre>
<br />
<br />
<b>Data inconsistency</b> is the worst problem that could happen in a synchronous cluster configuration.<br />
It can happen without any notice, but sooner or later it will stop the replication apply process and the failing node will be excluded from the cluster.<br />
<br />
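A quick way to catch this kind of silent divergence is to compare table checksums on every node; if the results differ, the nodes are inconsistent (a simple manual check - note the checksum is only comparable between servers running the same version and row format, and for whole databases a tool like pt-table-checksum is more practical):<br />
<br />
<pre class="brush: sql">
-- Run on every node and compare the results
MariaDB [testdb]> CHECKSUM TABLE t;
</pre>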
<br />
<br />
<br />
<br />
<b>REFERENCE</b><br />
<a href="https://dev.mysql.com/doc/refman/5.7/en/replication-features-differing-tables.html">https://dev.mysql.com/doc/refman/5.7/en/replication-features-differing-tables.html</a><br />
<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com2tag:blogger.com,1999:blog-2530682427657016426.post-5378508285145483602017-11-13T15:26:00.001+01:002020-11-10T07:48:55.250+01:00HASH GROUP BY not used when using more than 354 aggregate functionsA few days ago we had a performance problem with one of our main application views. This was a complex view that used a lot of aggregate functions. The functions were used to transpose rows into columns.<br />
<br />
When a developer added a few more aggregate functions for new columns, query performance changed significantly and we had a performance problem.<br />
<br />
After a quick analysis I noticed one change in the execution plan.<br />
<br />
The HASH GROUP BY aggregation was replaced with the less performant SORT GROUP BY. I tried to force HASH GROUP BY using hints but nothing helped.<br />
<br />
<span id="fullpost">
<br />
We tried to reproduce the problem using dummy tables, and then a colleague found what was triggering the plan change.<br />
<br />
In this example I have a query with 354 unique aggregate functions which uses HASH GROUP BY.<br />
<br />
<pre class="brush: sql">
SELECT
*
FROM (SELECT LEVEL ID
FROM DUAL CONNECT BY LEVEL < 1000) VANJSKI,
( SELECT
123 UNUTARNJI_ID,
sum(1) kolona0,
sum(1) kolona1,
sum(2) kolona2,
...
...
...
sum(350) kolona350 ,
sum(351) kolona351 ,
sum(352) kolona352 ,
sum(353) kolona353 ,
sum(354) kolona354
FROM DUAL
GROUP BY 123) UNUTARNJI
WHERE VANJSKI.ID = UNUTARNJI.UNUTARNJI_ID(+);
Plan hash value: 2294628051
---------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| A-Rows | A-Time | OMem | 1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 5 (100)| 999 |00:00:00.01 | | | |
|* 1 | HASH JOIN OUTER | | 1 | 1 | 4631 | 5 (20)| 999 |00:00:00.01 | 2293K| 2293K| 1549K (0)|
| 2 | VIEW | | 1 | 1 | 13 | 2 (0)| 999 |00:00:00.01 | | | |
| 3 | CONNECT BY WITHOUT FILTERING| | 1 | | | | 999 |00:00:00.01 | | | |
| 4 | FAST DUAL | | 1 | 1 | | 2 (0)| 1 |00:00:00.01 | | | |
| 5 | VIEW | | 1 | 1 | 4618 | 2 (0)| 1 |00:00:00.01 | | | |
| 6 | HASH GROUP BY | | 1 | 1 | | 2 (0)| 1 |00:00:00.01 | 677K| 677K| 723K (0)|
| 7 | FAST DUAL | | 1 | 1 | | 2 (0)| 1 |00:00:00.01 | | | |
---------------------------------------------------------------------------------------------------------------------------------------
</pre>
<br />
Notice what happens if I change the "sum(1) kolona0" function and add one more unique function, bringing the total to 355.<br />
<br />
<pre class="brush: sql">
SELECT
*
FROM (SELECT LEVEL ID
FROM DUAL CONNECT BY LEVEL < 1000) VANJSKI,
( SELECT
123 UNUTARNJI_ID,
sum(355) kolona0,
sum(1) kolona1,
sum(2) kolona2,
...
...
...
sum(350) kolona350 ,
sum(351) kolona351 ,
sum(352) kolona352 ,
sum(353) kolona353 ,
sum(354) kolona354
FROM DUAL
GROUP BY 123) UNUTARNJI
WHERE VANJSKI.ID = UNUTARNJI.UNUTARNJI_ID(+);
Plan hash value: 2326946862
---------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| A-Rows | A-Time | OMem | 1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 5 (100)| 999 |00:00:00.01 | | | |
|* 1 | HASH JOIN OUTER | | 1 | 1 | 4631 | 5 (20)| 999 |00:00:00.01 | 2293K| 2293K| 1645K (0)|
| 2 | VIEW | | 1 | 1 | 13 | 2 (0)| 999 |00:00:00.01 | | | |
| 3 | CONNECT BY WITHOUT FILTERING| | 1 | | | | 999 |00:00:00.01 | | | |
| 4 | FAST DUAL | | 1 | 1 | | 2 (0)| 1 |00:00:00.01 | | | |
| 5 | VIEW | | 1 | 1 | 4618 | 2 (0)| 1 |00:00:00.01 | | | |
| 6 | SORT GROUP BY | | 1 | 1 | | 2 (0)| 1 |00:00:00.01 | 20480 | 20480 |18432 (0)|
| 7 | FAST DUAL | | 1 | 1 | | 2 (0)| 1 |00:00:00.01 | | | |
---------------------------------------------------------------------------------------------------------------------------------------
</pre>
<br />
The query execution plan changed - HASH GROUP BY was replaced with SORT GROUP BY.<br />
<br />
<br />
This was obviously a limitation of HASH GROUP BY, but I couldn't find more information in the Oracle docs or on Google, so I asked Oracle Support for confirmation.<br />
<br />
From Oracle Support I received the answer that a similar case was a bug closed as "not a bug", without a workaround. With the default DB_BLOCK_SIZE, the limit is 354 aggregate functions.<br />
<br />
To solve the performance problem we changed the view to avoid the HASH GROUP BY limitation.<br />
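For illustration, here is a hedged sketch of the idea on dummy data: instead of one inline view carrying all the aggregates, they are split across two inline views, each below the limit, joined on the grouping key so each can still use HASH GROUP BY (this is not the exact production view):<br />
<br />
<pre class="brush: sql">
SELECT a.grp, a.kolona1, a.kolona2, b.kolona3, b.kolona4
FROM (SELECT 123 grp, sum(1) kolona1, sum(2) kolona2 FROM dual GROUP BY 123) a,
     (SELECT 123 grp, sum(3) kolona3, sum(4) kolona4 FROM dual GROUP BY 123) b
WHERE a.grp = b.grp;
</pre>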
<br />
Testing environment - Oracle Database 12c Enterprise Edition Release 12.1.0.2.0<br />
<br />
<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com2tag:blogger.com,1999:blog-2530682427657016426.post-12473100828258283582017-10-21T21:01:00.001+02:002017-10-23T10:40:49.644+02:00Beware of intensive slow query logging when using - log_queries_not_using_indexesThe MySQL slow query log is great for identifying slow queries that are good candidates for optimisation. Slow query logging is disabled by default, but it is activated by DBAs or developers in most environments.<br />
<br />
You can use the slow query log to record all traffic, but be careful with this action as logging all traffic could be very I/O intensive and could have a negative impact on general performance. It is recommended to record all traffic only for specific time periods.<br />
<br />
This is why slow query logging is controlled with the <i>long_query_time</i> parameter, to log only slow queries.<br />
But there is another parameter to think about - <i>log_queries_not_using_indexes</i>.<br />
<span id="fullpost"><br />
By default <i>log_queries_not_using_indexes</i> is disabled. If you have this parameter turned on you will log queries that don’t use an index, or that perform a full index scan where the index doesn't limit the number of rows - <b>regardless of time taken</b>.<br />
<br />
If you have <i>long_query_time</i> configured to a reasonable time, and still notice that queries are being intensively logged to the slow query log file, then you probably have <i>log_queries_not_using_indexes</i> enabled.<br />
<br />
By enabling this parameter you're practically saying that full scans are "evil" and should be considered for optimisation. But a full scan doesn't always mean that a query is slow. In some situations the query optimizer chooses a full table scan as a better option than an index, or you are querying a very small table.<br />
<br />
<br />
For instance, on several occasions I've noticed slow query logs flooded with queries like this:<br />
<br />
<pre class="brush: bash">
# Time: 171021 17:51:45
# User@Host: monitor[monitor] @ localhost []
# Thread_id: 1492974 Schema: QC_hit: No
# Query_time: 0.000321 Lock_time: 0.000072 Rows_sent: 0 Rows_examined: 1
# Full_scan: Yes Full_join: No Tmp_table: Yes Tmp_table_on_disk: No
# Filesort: No Filesort_on_disk: No Merge_passes: 0 Priority_queue: No
SET timestamp=1508608305;
SELECT
SCHEMA_NAME
FROM information_schema.schemata
WHERE SCHEMA_NAME NOT IN ('mysql', 'performance_schema', 'information_schema');
+------+-------------+----------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | schemata | ALL | NULL | NULL | NULL | NULL | NULL | Using where |
+------+-------------+----------+------+---------------+------+---------+------+------+-------------+
</pre>
Notice, Query_time: <b>0.000321</b>.<br />
<br />
Should I optimize a query that runs in 0.000321 seconds by adding indexes? Probably not. But anyway, my log is flooded with this and similar queries.<br />
<br />
I don't find that parameter very useful and I would leave it at its default value to avoid possible problems with intensive query logging. You can check and change it at runtime, as shown below.
<br />
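To check the current setting and turn it off without a restart:<br />
<br />
<pre class="brush: bash">
MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE 'log_queries_not_using_indexes';
MariaDB [(none)]> SET GLOBAL log_queries_not_using_indexes = OFF;
</pre>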
<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com0tag:blogger.com,1999:blog-2530682427657016426.post-66873039545844370642017-10-17T12:46:00.002+02:002020-11-10T07:49:12.475+01:00Enable SSL-encryption for MariaDB Galera ClusterImagine you have a MariaDB Galera Cluster with nodes running in different data centers, and the data centers are not connected via a secured VPN tunnel.<br />
As database security is very important, you must ensure that the traffic between nodes is fully secured.<br />
<br />
Galera Cluster supports encrypted connections between nodes using the SSL protocol, and in this post I want to show how to encrypt all cluster communication with SSL.<br />
<br />
<br />
<span id="fullpost">
Check current SSL configuration.
<pre class="brush: bash">
MariaDB [(none)]> SHOW VARIABLES LIKE 'have_ssl';
+---------------+----------+
| Variable_name | Value |
+---------------+----------+
| have_ssl | DISABLED | ###==> SSL Disabled
+---------------+----------+
1 row in set (0.01 sec)
MariaDB [(none)]> status
--------------
mysql Ver 15.1 Distrib 10.0.29-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2
Connection id: 56
Current database:
Current user: marko@localhost
SSL: Not in use ###==> SSL is not used
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server: MariaDB
Server version: 10.0.17-MariaDB-1~trusty-wsrep-log mariadb.org binary distribution, wsrep_25.10.r4144
Protocol version: 10
Connection: Localhost via UNIX socket
Server characterset: utf8
Db characterset: utf8
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /var/run/mysqld/mysqld.sock
Uptime: 7 days 42 min 29 sec
Threads: 52 Questions: 10 Slow queries: 0 Opens: 0 Flush tables: 1 Open tables: 63 Queries per second avg: 0.000
--------------
</pre>
SSL is currently disabled.<br />
<br />
To fully secure all cluster communication we must SSL-encrypt the replication traffic within the Galera Cluster, the State Snapshot Transfer (SST), and the traffic between the database server and clients.<br />
<br />
We will create SSL Certificates and Keys using openssl.<br />
<br />
<pre class="brush: bash">
# Create new folder for certificates
mkdir -p /etc/mysql/ssl
cd /etc/mysql/ssl
# Create CA certificate
# Generate CA key
openssl genrsa 2048 > ca-key.pem
# Using the CA key, generate the CA certificate
openssl req -new -x509 -nodes -days 3600 \
> -key ca-key.pem -out ca-cert.pem
-----
Country Name (2 letter code) [AU]:HR
State or Province Name (full name) [Some-State]:Zagreb
Locality Name (eg, city) []:Zagreb
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Dummycorp
Organizational Unit Name (eg, section) []:IT
Common Name (e.g. server FQDN or YOUR name) []:myu1.localdomain
Email Address []:marko@dummycorp.com
# Create server certificate, remove passphrase, and sign it
# Create the server key
openssl req -newkey rsa:2048 -days 3600 \
> -nodes -keyout server-key.pem -out server-req.pem
-----
Country Name (2 letter code) [AU]:HR
State or Province Name (full name) [Some-State]:Zagreb
Locality Name (eg, city) []:Zagreb
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Dummycorp
Organizational Unit Name (eg, section) []:IT
##==> Use the ".localdomain" only on the first certificate.
Common Name (e.g. server FQDN or YOUR name) []:myu1
Email Address []:marko@dummycorp.com
Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:secretpassword
An optional company name []:
# Process the server RSA key
openssl rsa -in server-key.pem -out server-key.pem
# Sign the server certificate
openssl x509 -req -in server-req.pem -days 3600 \
> -CA ca-cert.pem -CAkey ca-key.pem -set_serial 01 -out server-cert.pem
# Create client certificate, remove passphrase, and sign it
# Create the client key
openssl req -newkey rsa:2048 -days 3600 \
> -nodes -keyout client-key.pem -out client-req.pem
-----
Country Name (2 letter code) [AU]:HR
State or Province Name (full name) [Some-State]:Zagreb
Locality Name (eg, city) []:Zagreb
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Dummycorp
Organizational Unit Name (eg, section) []:IT
Common Name (e.g. server FQDN or YOUR name) []:myu1
Email Address []:marko@dummycorp.com
Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:secretpassword
An optional company name []:
# Process client RSA key
openssl rsa -in client-key.pem -out client-key.pem
# Sign the client certificate
openssl x509 -req -in client-req.pem -days 3600 \
> -CA ca-cert.pem -CAkey ca-key.pem -set_serial 01 -out client-cert.pem
# Verify certificates
openssl verify -CAfile ca-cert.pem server-cert.pem client-cert.pem
server-cert.pem: OK
client-cert.pem: OK
</pre>
<br />
If verification succeeds, copy the certificates to all nodes in the cluster.<br />
Set mysql as the owner of the files. <br />
<br />
<pre class="brush: bash">
# Copy
scp -r /etc/mysql/ssl node1:/etc/mysql
scp -r /etc/mysql/ssl node2:/etc/mysql
scp -r /etc/mysql/ssl node3:/etc/mysql
# Change owner
node1: chown -R mysql:mysql /etc/mysql/ssl
node2: chown -R mysql:mysql /etc/mysql/ssl
node3: chown -R mysql:mysql /etc/mysql/ssl
</pre>
<br />
<br />
<b>Secure database and client connections.</b><br />
<br />
Add the following lines to the my.cnf configuration file.<br />
<pre class="brush: text">
# MySQL Server
[mysqld]
ssl-ca=/etc/mysql/ssl/ca-cert.pem
ssl-cert=/etc/mysql/ssl/server-cert.pem
ssl-key=/etc/mysql/ssl/server-key.pem
# MySQL Client
[client]
ssl-ca=/etc/mysql/ssl/ca-cert.pem
ssl-cert=/etc/mysql/ssl/client-cert.pem
ssl-key=/etc/mysql/ssl/client-key.pem
</pre>
<br />
<br />
<b>Secure replication traffic.</b><br />
<br />
Define the paths to the key, certificate, and certificate authority files. Galera Cluster will use these files for encrypting and decrypting the replication traffic. <br />
<br />
<pre class="brush: text">
wsrep_provider_options="socket.ssl_key=/etc/mysql/ssl/server-key.pem;socket.ssl_cert=/etc/mysql/ssl/server-cert.pem;socket.ssl_ca=/etc/mysql/ssl/ca-cert.pem"
</pre>
<br />
<br />
<b>Enable SSL for mysqldump and Xtrabackup.</b><br />
<br />
Create a user which requires SSL to connect.<br />
<br />
<pre class="brush: text">
MariaDB [(none)]> CREATE USER 'sstssl'@'localhost' IDENTIFIED BY 'sstssl';
Query OK, 0 rows affected (0.03 sec)
MariaDB [(none)]> GRANT PROCESS, RELOAD, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sstssl'@'localhost' REQUIRE ssl;
Query OK, 0 rows affected (0.02 sec)
MariaDB [(none)]> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.00 sec)
</pre>
<br />
I will use this user for the state snapshot transfer (SST).<br />
Change wsrep_sst_auth in the my.cnf configuration file.<br />
<br />
<pre class="brush: text">
wsrep_sst_auth="sstssl:sstssl"
</pre>
<br />
<br />
Now we must restart the whole cluster. <br />
If I restart only one node while the others are running, the node won't join the existing cluster.<br />
You can notice these errors in the MySQL error log:<br />
<br />
<pre class="brush: text">
171017 3:20:29 [ERROR] WSREP: handshake with remote endpoint ssl://192.168.56.22:4567 failed: asio.ssl:336031996: 'unknown protocol' ( 336031996: 'error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol')
171017 3:20:29 [ERROR] WSREP: handshake with remote endpoint ssl://192.168.56.23:4567 failed: asio.ssl:336031996: 'unknown protocol' ( 336031996: 'error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol')
</pre>
Shut down the cluster and bootstrap it, as outlined below.<br />
<br />
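The exact commands depend on your distribution and init system; on a sysvinit-based setup like this one, something like the following sequence does it (with systemd and MariaDB 10.1+, galera_new_cluster bootstraps the first node instead):<br />
<br />
<pre class="brush: bash">
# Stop MariaDB on all nodes
service mysql stop
# Bootstrap the cluster on the first node
service mysql start --wsrep-new-cluster
# Start the remaining nodes one at a time
service mysql start
</pre>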
<br />
Check.
<pre class="brush: bash">
MariaDB [(none)]> status
--------------
mysql Ver 15.1 Distrib 10.0.29-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2
Connection id: 87
Current database:
Current user: marko@localhost
SSL: Cipher in use is DHE-RSA-AES256-SHA ###==> SSL is used
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server: MariaDB
Server version: 10.0.17-MariaDB-1~trusty-wsrep-log mariadb.org binary distribution, wsrep_25.10.r4144
Protocol version: 10
Connection: Localhost via UNIX socket
Server characterset: utf8
Db characterset: utf8
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /var/run/mysqld/mysqld.sock
Uptime: 1 min 4 sec
Threads: 52 Questions: 676 Slow queries: 16 Opens: 167 Flush tables: 1 Open tables: 31 Queries per second avg: 10.562
--------------
MariaDB [(none)]> SHOW VARIABLES LIKE 'have_ssl';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| have_ssl | YES |
+---------------+-------+
1 row in set (0.01 sec)
</pre>
<br />
<br />
<br />
<b>REFERENCES</b>
<br />
<a href="https://dev.mysql.com/doc/refman/5.7/en/creating-ssl-files-using-openssl.html">6.4.3.2 Creating SSL Certificates and Keys Using openssl</a><br />
<a href="https://oracle-base.com/articles/mysql/mysql-configure-ssl-connections">MySQL : Configure SSL Connections</a>
<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com3tag:blogger.com,1999:blog-2530682427657016426.post-65974957075506616022017-09-28T12:12:00.000+02:002017-09-28T12:23:03.356+02:00Delete large amounts of data on Galera Cluster using pt-archiverGalera Cluster is an excellent virtually synchronous multi-master database cluster. It has many benefits, which you can check on <a href="http://galeracluster.com">GaleraCluster</a>.<br />
But besides the benefits it has some limitations, and one of them is handling large transactions.<br />
<br />
Large replication data sets could degrade the performance of the whole cluster, causing cluster freezes, increased memory consumption, crashing nodes, etc. To avoid these issues it is recommended to split large transactions into smaller chunks.<br />
<br />
In this post I want to show you how to safely delete large amounts of data on a Galera Cluster. You can perform this task using several tools or by writing custom procedures to split a large transaction into chunks. In this example I will use the <a href="https://www.percona.com/doc/percona-toolkit/LATEST/pt-archiver.html">pt-archiver</a> tool from Percona.<br />
<br />
<br />
<span id="fullpost">
Imagine you have received a task to perform data cleanup in the devices table for several schemas.<br />
It looks like a very simple task - delete rows from the devices table where device_cookie is 0.<br />
<pre class="brush: sql">
delete from devices where device_cookie = 0
</pre>
<br />
But, although the statement looks simple, it could <b>potentially freeze the whole cluster</b>, so before executing the delete statement count how many rows you need to delete.<br />
<br />
In my case I have to delete a few million rows, which is too much for one transaction, so I need to split the transaction into smaller chunks.<br />
<br />
<pre class="brush: sql">
mysql> select count(*) from devices;
+----------+
| count(*) |
+----------+
| 2788504 |
+----------+
mysql> select count(*) - (select count(*) from devices where device_cookie = 0)
from devices;
+----------+
| count(*) |
+----------+
| 208 |
+----------+
</pre>
<br />
I have to delete around 2.7 million rows.<br />
<br />
This is the command I will use:<br />
<pre class="brush: text">
pt-archiver --source h=localhost,u=marko,p="passwd",D=sch_testdb,t=devices \
--purge --where "device_cookie = 0" --sleep-coef 1.0 --txn-size 1000
</pre>
<br />
<b>--purge</b> - <i>delete rows.</i><br />
<b>--where "device_cookie = 0"</b> - <i>filter rows you want to delete.</i><br />
<b>--sleep-coef 1.0</b> - <i>throttle delete process to avoid pause signals from cluster.</i><br />
<b>--txn-size 1000</b> - <i>this is chunk size for every transaction.</i><br />
<br />
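Before running the real purge, it is worth checking which statements pt-archiver will issue; the --dry-run option prints the queries without modifying any data:<br />
<br />
<pre class="brush: text">
pt-archiver --source h=localhost,u=marko,p="passwd",D=sch_testdb,t=devices \
--purge --where "device_cookie = 0" --dry-run
</pre>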
<pre class="brush: text">
# time pt-archiver --source h=localhost,u=marko,p="passwd",D=sch_testdb,t=devices \
--purge --where "device_cookie = 0" --sleep-coef 1.0 --txn-size 1000
real 3m32.532s
user 0m17.268s
sys 0m2.460s
</pre>
<br />
Check after the delete has finished.<br />
<pre class="brush: sql">
mysql> select count(*) from devices;
+----------+
| count(*) |
+----------+
| 208 |
+----------+
1 row in set (0.00 sec)
</pre>
<br />
As I have to perform the delete for several schemas, I have created a simple shell script which iterates through the schema list and executes the pt-archiver command.<br />
<br />
<pre class="brush: bash">
# cat delete_rows.sh
#!/bin/bash
LOGFILE=/opt/skripte/schema/table_delete_rows.log
SCHEMA_LIST=/opt/skripte/schema/schema_list.conf
# Get schema list and populate conf file
mysql -B -u marko -ppasswd --disable-column-names --execute "select schema_name from information_schema.schemata where schema_name like 'sch_%' and schema_name <> 'sch_sys'" > $SCHEMA_LIST
while IFS= read -r schema; do
START=`date +%s`
echo "`date`=> Deleting rows from table in schema: $schema"
pt-archiver --source h=localhost,u=marko,p="passwd",D=$schema,t=devices --purge --where "device_cookie = 0" --sleep-coef 1.0 --txn-size 500
SPENT=$(((`date +%s` - $START) / 60))
echo "`date`=> Finished deleting in schema - spent: $SPENT mins"
echo "*************************************************************************"
done <$SCHEMA_LIST >> $LOGFILE
exit 0
</pre>
<br />
<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com0tag:blogger.com,1999:blog-2530682427657016426.post-50066701630106155272017-09-16T19:19:00.003+02:002020-11-10T07:49:39.844+01:00Beware of ORA-19721 on 12c using Transportable Tablespace (Oracle changed behavior)Almost every big database has its hot data, which is used often, and cold data, which is rarely touched. Since version 9i I have used the transportable tablespace feature to exclude cold (archive) data from the database and keep it on cheap storage or tapes.<br />
<br />
If someone needed to query some of the archive tables it was very easy to plug the tablespace in for a few days, and once the archive data was no longer needed the tablespace could easily be dropped. So I was plugging in the same tablespaces more than once.<br />
<br />
But when I tried the same process on a 12c database I was unpleasantly surprised that Oracle had changed the behaviour and I could not reattach the tablespace. <br />
<br />
Let's demonstrate this in a simple demo case.<br />
<br />
<span id="fullpost">
Create a tablespace and set it to read only.
<pre class="brush: sql">
create tablespace ARCHIVE01 datafile '/oradata1/data/ora12c/archive01.dbf' size 50M;
Tablespace created.
create table archtab tablespace ARCHIVE01 as select * from dba_objects;
Table created.
alter tablespace ARCHIVE01 read only;
Tablespace altered.
create directory export_tts as '/oradata1/export';
Directory created.
</pre>
<br />
Export tablespace metadata.<br />
<pre class="brush: text">
$ expdp '" / as sysdba "' directory=EXPORT_TTS dumpfile=exp_archive01.dmp logfile=exp_archive01.log transport_tablespaces=ARCHIVE01 transport_full_check=Y
Export: Release 12.1.0.2.0 - Production on Sat Sep 16 18:07:27 2017
Copyright (c) 1982, 2014, Oracle and/or its affiliates. All rights reserved.
Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
WARNING: Oracle Data Pump operations are not typically needed when connected to the root or seed of a container database.
Starting "SYS"."SYS_EXPORT_TRANSPORTABLE_01": "/******** AS SYSDBA" directory=EXPORT_TTS dumpfile=exp_archive01.dmp logfile=exp_archive01.log transport_tablespaces=ARCHIVE01 transport_full_check=Y
Processing object type TRANSPORTABLE_EXPORT/PLUGTS_BLK
Processing object type TRANSPORTABLE_EXPORT/TABLE
Processing object type TRANSPORTABLE_EXPORT/TABLE_STATISTICS
Processing object type TRANSPORTABLE_EXPORT/STATISTICS/MARKER
Processing object type TRANSPORTABLE_EXPORT/POST_INSTANCE/PLUGTS_BLK
Master table "SYS"."SYS_EXPORT_TRANSPORTABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for SYS.SYS_EXPORT_TRANSPORTABLE_01 is:
/oradata1/export/exp_archive01.dmp
******************************************************************************
Datafiles required for transportable tablespace ARCHIVE01:
/oradata1/data/ora12c/archive01.dbf
Job "SYS"."SYS_EXPORT_TRANSPORTABLE_01" successfully completed at Sat Sep 16 18:08:06 2017 elapsed 0 00:00:3
</pre>
<br />
Drop tablespace but keep datafile.<br />
<pre class="brush: sql">
SQL> drop tablespace ARCHIVE01 including contents keep datafiles;
Tablespace dropped.
</pre>
<br />
Let’s plug in tablespace.<br />
<pre class="brush: sql">
$ impdp '" /as sysdba "' directory=EXPORT_TTS dumpfile=exp_archive01.dmp logfile=imp_archive01.log transport_datafiles='/oradata1/data/ora12c/archive01.dbf'
Import: Release 12.1.0.2.0 - Production on Sat Sep 16 18:11:32 2017
Copyright (c) 1982, 2014, Oracle and/or its affiliates. All rights reserved.
Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
WARNING: Oracle Data Pump operations are not typically needed when connected to the root or seed of a container database.
Master table "SYS"."SYS_IMPORT_TRANSPORTABLE_01" successfully loaded/unloaded
Starting "SYS"."SYS_IMPORT_TRANSPORTABLE_01": "/******** AS SYSDBA" directory=EXPORT_TTS dumpfile=exp_archive01.dmp logfile=imp_archive01.log transport_datafiles=/oradata1/data/ora12c/archive01.dbf
Processing object type TRANSPORTABLE_EXPORT/PLUGTS_BLK
Processing object type TRANSPORTABLE_EXPORT/TABLE
Processing object type TRANSPORTABLE_EXPORT/TABLE_STATISTICS
Processing object type TRANSPORTABLE_EXPORT/STATISTICS/MARKER
Processing object type TRANSPORTABLE_EXPORT/POST_INSTANCE/PLUGTS_BLK
Job "SYS"."SYS_IMPORT_TRANSPORTABLE_01" successfully completed at Sat Sep 16 18:11:51 2017 elapsed 0 00:00:18
</pre>
Check alert log.<br />
<pre class="brush: text">
Plug in tablespace ARCHIVE01 with datafile
'/oradata1/data/ora12c/archive01.dbf'
TABLE SYS.WRI$_OPTSTAT_HISTHEAD_HISTORY: ADDED INTERVAL PARTITION SYS_P451 (42993) VALUES LESS THAN (TO_DATE(' 2017-09-17 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
ALTER TABLESPACE "ARCHIVE01" READ WRITE
Completed: ALTER TABLESPACE "ARCHIVE01" READ WRITE
ALTER TABLESPACE "ARCHIVE01" READ ONLY
Sat Sep 16 18:11:51 2017
Converting block 0 to version 10 format
Completed: ALTER TABLESPACE "ARCHIVE01" READ ONLY
</pre>
<br />
Notice that Oracle is altering the tablespace (datafile headers) to READ WRITE - <i>Completed: ALTER TABLESPACE "ARCHIVE01" READ WRITE</i>.<br />
<br />
Quote from Oracle Support site:<br />
<blockquote>
Oracle Development declared it as "Expected Behavior"
Starting from 12.1, during the TTS import operation, the tablespaces (datafile headers) are put into read-write mode intermittently in order to fix up TSTZ table columns and clean up unused segments in the datafiles.
This functionality was implemented on many customer's request basis. And, hence, this cannot be reversed. Note that, it intermittently only changes the status to "read-write" and the final status will still be "read-only" only.
</blockquote>
<br />
Now let's see what happens if I drop the tablespace and try to reattach it again.<br />
<br />
Drop the tablespace, keeping the datafile.
<pre class="brush: sql">
SQL> drop tablespace ARCHIVE01 including contents keep datafiles;
Tablespace dropped.
</pre>
Import tablespace metadata.<br />
<pre class="brush: text">
$ impdp '" /as sysdba "' directory=EXPORT_TTS dumpfile=exp_archive01.dmp logfile=imp_archive01.log transport_datafiles='/oradata1/data/ora12c/archive01.dbf'
Import: Release 12.1.0.2.0 - Production on Sat Sep 16 18:13:51 2017
Copyright (c) 1982, 2014, Oracle and/or its affiliates. All rights reserved.
Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
WARNING: Oracle Data Pump operations are not typically needed when connected to the root or seed of a container database.
Master table "SYS"."SYS_IMPORT_TRANSPORTABLE_01" successfully loaded/unloaded
Starting "SYS"."SYS_IMPORT_TRANSPORTABLE_01": "/******** AS SYSDBA" directory=EXPORT_TTS dumpfile=exp_archive01.dmp logfile=imp_archive01.log transport_datafiles=/oradata1/data/ora12c/archive01.dbf
Processing object type TRANSPORTABLE_EXPORT/PLUGTS_BLK
ORA-39123: Data Pump transportable tablespace job aborted
ORA-19721: Cannot find datafile with absolute file number 14 in tablespace ARCHIVE01
Job "SYS"."SYS_IMPORT_TRANSPORTABLE_01" stopped due to fatal error at Sat Sep 16 18:13:55 2017 elapsed 0 00:00:02
</pre>
<br />
I received an error and failed to plug in the tablespace.<br />
<br />
<br />
The workaround for this "expected" behaviour is to <b>change the datafile permissions at OS level to read only</b>.<br />
There is also a workaround if you are using ASM, so check the Oracle Support site.<br />
<br />
Let’s repeat the steps from the demo, this time using the workaround.<br />
<br />
<br />
Create tablespace.<br />
<pre class="brush: sql">
SQL> create tablespace ARCHIVE02 datafile '/oradata1/data/ora12c/archive02.dbf' size 50M;
Tablespace created.
SQL> create table archtab tablespace ARCHIVE02 as select * from dba_objects;
Table created.
SQL> alter tablespace ARCHIVE02 read only;
Tablespace altered.
</pre>
<br />
Export tablespace metadata.<br />
<pre class="brush: text">
$ expdp '" / as sysdba "' directory=EXPORT_TTS dumpfile=exp_archive02.dmp logfile=exp_archive02.log transport_tablespaces=ARCHIVE02 transport_full_check=Y
Export: Release 12.1.0.2.0 - Production on Sat Sep 16 18:18:25 2017
Copyright (c) 1982, 2014, Oracle and/or its affiliates. All rights reserved.
Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
WARNING: Oracle Data Pump operations are not typically needed when connected to the root or seed of a container database.
Starting "SYS"."SYS_EXPORT_TRANSPORTABLE_01": "/******** AS SYSDBA" directory=EXPORT_TTS dumpfile=exp_archive02.dmp logfile=exp_archive02.log transport_tablespaces=ARCHIVE02 transport_full_check=Y
Processing object type TRANSPORTABLE_EXPORT/PLUGTS_BLK
Processing object type TRANSPORTABLE_EXPORT/TABLE
Processing object type TRANSPORTABLE_EXPORT/TABLE_STATISTICS
Processing object type TRANSPORTABLE_EXPORT/STATISTICS/MARKER
Processing object type TRANSPORTABLE_EXPORT/POST_INSTANCE/PLUGTS_BLK
Master table "SYS"."SYS_EXPORT_TRANSPORTABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for SYS.SYS_EXPORT_TRANSPORTABLE_01 is:
/oradata1/export/exp_archive02.dmp
******************************************************************************
Datafiles required for transportable tablespace ARCHIVE02:
/oradata1/data/ora12c/archive02.dbf
Job "SYS"."SYS_EXPORT_TRANSPORTABLE_01" successfully completed at Sat Sep 16 18:18:44 2017 elapsed 0 00:00:18
</pre>
<br />
Drop tablespace and keep datafile.<br />
<pre class="brush: sql">
SQL> drop tablespace ARCHIVE02 including contents keep datafiles;
Tablespace dropped.
</pre>
<br />
<br />
Change the datafile permissions to read only.<br />
<pre class="brush: text">
$ chmod 0440 /oradata1/data/ora12c/archive02.dbf
$ ls -l /oradata1/data/ora12c/archive02.dbf
-r--r-----. 1 oracle oinstall 52436992 Sep 16 18:17 /oradata1/data/ora12c/archive02.dbf
</pre>
<br />
Import tablespace metadata.<br />
<pre class="brush: text">
$ impdp '" /as sysdba "' directory=EXPORT_TTS dumpfile=exp_archive02.dmp logfile=imp_archive02.log transport_datafiles='/oradata1/data/ora12c/archive02.dbf'
Import: Release 12.1.0.2.0 - Production on Sat Sep 16 18:20:23 2017
Copyright (c) 1982, 2014, Oracle and/or its affiliates. All rights reserved.
Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
WARNING: Oracle Data Pump operations are not typically needed when connected to the root or seed of a container database.
Master table "SYS"."SYS_IMPORT_TRANSPORTABLE_01" successfully loaded/unloaded
Starting "SYS"."SYS_IMPORT_TRANSPORTABLE_01": "/******** AS SYSDBA" directory=EXPORT_TTS dumpfile=exp_archive02.dmp logfile=imp_archive02.log transport_datafiles=/oradata1/data/ora12c/archive02.dbf
Processing object type TRANSPORTABLE_EXPORT/PLUGTS_BLK
Processing object type TRANSPORTABLE_EXPORT/TABLE
Processing object type TRANSPORTABLE_EXPORT/TABLE_STATISTICS
Processing object type TRANSPORTABLE_EXPORT/STATISTICS/MARKER
Processing object type TRANSPORTABLE_EXPORT/POST_INSTANCE/PLUGTS_BLK
Job "SYS"."SYS_IMPORT_TRANSPORTABLE_01" successfully completed at Sat Sep 16 18:20:28 2017 elapsed 0 00:00:03
</pre>
<br />
In the alert log you can notice an ORA-1114 IO error, because Oracle cannot modify the datafile.<br />
<pre class="brush: text">
Plug in tablespace ARCHIVE02 with datafile
'/oradata1/data/ora12c/archive02.dbf'
ALTER TABLESPACE "ARCHIVE02" READ WRITE
ORA-1114 signalled during: ALTER TABLESPACE "ARCHIVE02" READ WRITE...
</pre>
<br />
Drop tablespace and reattach it again.<br />
<pre class="brush: sql">
SQL> drop tablespace ARCHIVE02 including contents keep datafiles;
Tablespace dropped.
</pre>
<br />
Plug in tablespace. <br />
<pre class="brush: text">
$ impdp '" /as sysdba "' directory=EXPORT_TTS dumpfile=exp_archive02.dmp logfile=imp_archive02.log transport_datafiles='/oradata1/data/ora12c/archive02.dbf'
Import: Release 12.1.0.2.0 - Production on Sat Sep 16 18:22:01 2017
Copyright (c) 1982, 2014, Oracle and/or its affiliates. All rights reserved.
Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
WARNING: Oracle Data Pump operations are not typically needed when connected to the root or seed of a container database.
Master table "SYS"."SYS_IMPORT_TRANSPORTABLE_01" successfully loaded/unloaded
Starting "SYS"."SYS_IMPORT_TRANSPORTABLE_01": "/******** AS SYSDBA" directory=EXPORT_TTS dumpfile=exp_archive02.dmp logfile=imp_archive02.log transport_datafiles=/oradata1/data/ora12c/archive02.dbf
Processing object type TRANSPORTABLE_EXPORT/PLUGTS_BLK
Processing object type TRANSPORTABLE_EXPORT/TABLE
Processing object type TRANSPORTABLE_EXPORT/TABLE_STATISTICS
Processing object type TRANSPORTABLE_EXPORT/STATISTICS/MARKER
Processing object type TRANSPORTABLE_EXPORT/POST_INSTANCE/PLUGTS_BLK
Job "SYS"."SYS_IMPORT_TRANSPORTABLE_01" successfully completed at Sat Sep 16 18:22:05 2017 elapsed 0 00:00:03
</pre>
<br />
This time I received no error and was able to plug in the tablespace.<br />
I have to remind myself to change datafile permissions before plugging in tablespaces from version 12c onwards.<br />
<br />
<br />
<br />
<b>REFERENCES</b><br />
Doc ID 2094476.1<br />
<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com0tag:blogger.com,1999:blog-2530682427657016426.post-82334658296520869012017-03-06T09:16:00.002+01:002020-11-10T07:49:52.710+01:00Using In-Memory Option with SQL Plan Baselines, SQL Profiles and SQL HintsThe Oracle Database In-Memory option was introduced in the 12.1.0.2 patchset. It is a great feature for improving the performance of analytic queries. In mixed-workload OLTP environments the In-Memory option can improve the performance of analytic queries without a significant negative effect on quick OLTP queries or DML operations.<br />
<br />
So you have decided that the In-Memory option could be great for you, and now you want to implement it on your critical production database.<br />
<br />
But in your code you have many SQL hints hard-coded, SQL profiles implemented or SQL plan baselines created to solve problems with unstable query performance. What will happen to the execution plans if you populate the In-Memory column store with the critical tables in the database?<br />
<br />
<span id="fullpost">
Example:<br />
Version : Oracle 12.1.0.2 <br />
<br />
For the test I will use a query whose plan is fixed using both a SQL profile and a SQL plan baseline. <br />
<br />
<pre class="brush: sql">
select object_type, count(*)
from admin.big_table
group by object_type;
OBJECT_TYPE COUNT(*)
----------------------- ----------
PACKAGE 14858
PACKAGE BODY 13724
PROCEDURE 2254
PROGRAM 110
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'TYPICAL'));
SQL_ID 8g28yt7c1nacr, child number 0
-------------------------------------
select object_type, count(*) from admin.big_table group by object_type
Plan hash value: 1753714399
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 4819 (100)| |
| 1 | HASH GROUP BY | | 39 | 351 | 4819 (1)| 00:00:01 |
| 2 | TABLE ACCESS FULL| BIG_TABLE | 1000K| 8789K| 4795 (1)| 00:00:01 |
--------------------------------------------------------------------------------
DECLARE
my_plans pls_integer;
BEGIN
my_plans := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(
sql_id => '8g28yt7c1nacr');
END;
/
@coe_xfr_sql_profile 8g28yt7c1nacr 1753714399
@coe_xfr_sql_profile_8g28yt7c1nacr_1753714399.sql
select object_type, count(*)
from admin.big_table
group by object_type;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'TYPICAL'));
SQL_ID 8g28yt7c1nacr, child number 0
-------------------------------------
select object_type, count(*) from admin.big_table group by object_type
Plan hash value: 1753714399
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 4819 (100)| |
| 1 | HASH GROUP BY | | 39 | 351 | 4819 (1)| 00:00:01 |
| 2 | TABLE ACCESS FULL| BIG_TABLE | 1000K| 8789K| 4795 (1)| 00:00:01 |
--------------------------------------------------------------------------------
Note
-----
- SQL profile coe_8g28yt7c1nacr_1753714399 used for this statement
- SQL plan baseline SQL_PLAN_1wn92bz7gqvxx73be0962 used for this statement
</pre>
<br />
The Note section in the execution plan output says that I’m using both the SQL profile and the SQL plan baseline for this query.<br />
<br />
I have previously enabled the In-Memory column store, and now I will populate it with the table data.<br />
<br />
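For reference, a minimal sketch of that one-time setup (the 500M size is an arbitrary demo value; inmemory_size is a static parameter, so an instance restart is required):<br />
<pre class="brush: sql">
-- Size the In-Memory column store and restart the instance (sketch)
alter system set inmemory_size = 500M scope=spfile;
shutdown immediate
startup
</pre>
<br />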
<pre class="brush: sql">
alter table admin.big_table inmemory priority critical;
col segment_name for a15
select segment_name,
inmemory_size/1024/1024 im_size_mb,
bytes/1024/1024 size_mb,
bytes_not_populated,
inmemory_compression
from v$im_segments;
SEGMENT_NAME IM_SIZE_MB SIZE_MB BYTES_NOT_POPULATED INMEMORY_COMPRESS
--------------- ---------- ---------- ------------------- -----------------
BIG_TABLE 27.1875 144 0 FOR QUERY LOW
1 row selected.
</pre>
<br />
Run query again.<br />
<br />
<pre class="brush: sql">
select object_type, count(*)
from admin.big_table
group by object_type;
SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'TYPICAL'));
SQL_ID 8g28yt7c1nacr, child number 0
-------------------------------------
select object_type, count(*) from admin.big_table group by object_type
Plan hash value: 1753714399
--------------------------------------------------------------------------------------
|Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 257 (100)| |
| 1 | HASH GROUP BY | | 39 | 351 | 257 (13)| 00:00:01|
| 2 | TABLE ACCESS INMEMORY FULL| BIG_TABLE | 1000K| 8789K| 233 (4)| 00:00:01|
--------------------------------------------------------------------------------------
Note
-----
- SQL profile coe_8g28yt7c1nacr_1753714399 used for this statement
- SQL plan baseline SQL_PLAN_1wn92bz7gqvxx73be0962 used for this statement
</pre>
<br />
Notice "TABLE ACCESS INMEMORY FULL" operation is used instead of "TABLE ACCESS FULL" and both SQL profile and SQL plan baselines are used for this query.<br />
<br />
In this case Oracle used the in-memory column store to read the data without any intervention on the SQL profile or the SQL plan baseline. The plan hash value remained the same in both cases.<br />
<br />
<br />
But what if we have index operations involved in the execution plan?<br />
<br />
<pre class="brush: sql">
-- Temporarily disable the IM column store for query optimization
SQL> alter system set inmemory_query=DISABLE;
-- Force Oracle to use index
SQL> alter session set optimizer_index_caching=100;
SQL> alter session set optimizer_index_cost_adj=1;
select object_type, count(*)
from admin.big_table
where object_type > 'C'
group by object_type;
SQL_ID 8xvfvz3axf5ct, child number 0
-------------------------------------
select object_type, count(*) from admin.big_table where object_type >
'C' group by object_type
Plan hash value: 3149057435
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 28 (100)| |
| 1 | SORT GROUP BY NOSORT| | 39 | 351 | 28 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | IDX_OBJ_TYPE | 1000K| 8789K| 28 (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("OBJECT_TYPE">'C')
-- Create SQL plan baseline
DECLARE
my_plans pls_integer;
BEGIN
my_plans := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(
sql_id => '8xvfvz3axf5ct');
END;
/
-- Create SQL profile
SQL>@coe_xfr_sql_profile 8xvfvz3axf5ct 3149057435
SQL>@coe_xfr_sql_profile_8xvfvz3axf5ct_3149057435.sql
</pre>
<br />
<br />
I have a slightly different query with an "INDEX RANGE SCAN" operation in the execution plan. A SQL plan baseline and a SQL profile are both created for this query.<br />
<br />
<br />
In the Note section you can see that the SQL profile and the SQL plan baseline are both used.<br />
<br />
<pre class="brush: sql">
select object_type, count(*)
from admin.big_table
where object_type > 'C'
group by object_type;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'TYPICAL'));
SQL_ID 8xvfvz3axf5ct, child number 0
-------------------------------------
select object_type, count(*) from admin.big_table where object_type >
'C' group by object_type
Plan hash value: 3149057435
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 28 (100)| |
| 1 | SORT GROUP BY NOSORT| | 39 | 351 | 28 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | IDX_OBJ_TYPE | 1000K| 8789K| 28 (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("OBJECT_TYPE">'C')
Note
-----
- SQL profile coe_8xvfvz3axf5ct_3149057435 used for this statement
- SQL plan baseline SQL_PLAN_76jwvc1sug4k44391ca35 used for this statement
</pre>
<br />
<br />
Enable IM column store to optimise queries.<br />
<br />
<pre class="brush: sql">
SQL> alter system set inmemory_query=ENABLE;
System altered.
</pre>
<br />
<pre class="brush: sql">
select object_type, count(*)
from admin.big_table
where object_type > 'C'
group by object_type;
SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'TYPICAL'));
SQL_ID 8xvfvz3axf5ct, child number 1
-------------------------------------
select object_type, count(*) from admin.big_table where object_type >
'C' group by object_type
Plan hash value: 3149057435
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 28 (100)| |
| 1 | SORT GROUP BY NOSORT| | 39 | 351 | 28 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | IDX_OBJ_TYPE | 1000K| 8789K| 28 (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("OBJECT_TYPE">'C')
Note
-----
- SQL profile coe_8xvfvz3axf5ct_3149057435 used for this statement
- SQL plan baseline SQL_PLAN_76jwvc1sug4k44391ca35 used for this statement
</pre>
<br />
This time the in-memory option is not used to improve the performance of the query.<br />
<br />
Let’s drop the SQL profile and leave the SQL plan baseline enabled.<br />
<br />
<pre class="brush: sql">
exec dbms_sqltune.drop_sql_profile('coe_8xvfvz3axf5ct_3149057435');
select object_type, count(*)
from admin.big_table
where object_type > 'C'
group by object_type;
Plan hash value: 1753714399
--------------------------------------------------------------------------------------
| Id| Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 39 | 351 | 255 (12)| 00:00:01|
| 1 | HASH GROUP BY | | 39 | 351 | 255 (12)| 00:00:01|
|*2 | TABLE ACCESS INMEMORY FULL| BIG_TABLE | 1000K| 8789K| 231 (3)| 00:00:01|
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - inmemory("OBJECT_TYPE">'C')
filter("OBJECT_TYPE">'C')
Note
-----
- SQL plan baseline "SQL_PLAN_76jwvc1sug4k473be0962" used for this statement
</pre>
<br />
The Note section says that a SQL plan baseline is used for this statement, but a different one than before.<br />
I have the "TABLE ACCESS INMEMORY FULL" operation and the plan has changed automatically.<br />
<br />
In Oracle 12cR1 Adaptive SQL Plan Management is enabled by default. Oracle calculated a more efficient plan using the in-memory column store and automatically accepted the new SQL execution plan for this query. As the new SQL plan was added and accepted, Oracle was able to change the execution plan.<br />
<br />
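If you want to confirm that automatic plan evolution is active, the auto evolve task parameter can be checked; a small sketch (ACCEPT_PLANS defaults to TRUE in 12c):<br />
<pre class="brush: sql">
-- Check whether the SPM evolve advisor task accepts plans automatically
select parameter_name, parameter_value
from   dba_advisor_parameters
where  task_name = 'SYS_AUTO_SPM_EVOLVE_TASK'
and    parameter_name = 'ACCEPT_PLANS';
</pre>
<br />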
<pre class="brush: sql">
set lines 200
set pages 999
col plan_name for a30
col sql_text for a50 wrap
select plan_name, sql_text, enabled, accepted
from dba_sql_plan_baselines
where sql_text like '%object_type > %';
PLAN_NAME SQL_TEXT ENA ACC
------------------------------ --------------------------------------- --- ---
SQL_PLAN_76jwvc1sug4k4ebe5b30f select object_type, count(*) YES NO
from admin.big_table
where object_type > 'C'
group by object_type
SQL_PLAN_76jwvc1sug4k473be0962 select object_type, count(*) YES YES
from admin.big_table
where object_type > 'C'
group by object_type
SQL_PLAN_76jwvc1sug4k44391ca35 select object_type, count(*) YES YES
from admin.big_table
where object_type > 'C'
group by object_type
</pre>
<br />
What if I disable Adaptive SQL Plan Management to forbid automatic evolving of existing baselines?<br />
<br />
<pre class="brush: sql">
-- Disable automatic evolving
BEGIN
DBMS_SPM.set_evolve_task_parameter(
task_name => 'SYS_AUTO_SPM_EVOLVE_TASK',
parameter => 'ACCEPT_PLANS',
value => 'FALSE');
END;
/
-- Drop SQL plan baseline used for in-memory full scan
DECLARE
l_plans_dropped PLS_INTEGER;
BEGIN
l_plans_dropped := DBMS_SPM.drop_sql_plan_baseline (
sql_handle => NULL,
plan_name => 'SQL_PLAN_76jwvc1sug4k473be0962');
END;
/
</pre>
<br />
The in-memory full scan is not used, as the index range scan operation was specified in the existing baseline which is used for the query.<br />
<br />
<pre class="brush: sql">
select object_type, count(*)
from admin.big_table
where object_type > 'C'
group by object_type;
SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'TYPICAL'));
SQL_ID 8xvfvz3axf5ct, child number 1
-------------------------------------
select object_type, count(*) from admin.big_table where object_type >
'C' group by object_type
Plan hash value: 3149057435
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 28 (100)| |
| 1 | SORT GROUP BY NOSORT| | 39 | 351 | 28 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | IDX_OBJ_TYPE | 1000K| 8789K| 28 (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("OBJECT_TYPE">'C')
Note
-----
- SQL plan baseline SQL_PLAN_76jwvc1sug4k44391ca35 used for this statement
</pre>
<br />
A new plan was added, but this time it is not accepted automatically and is not taken into consideration by the optimizer. We have to manually validate and accept the new plan to use it for query executions; see the sketch after the listing below.<br />
<br />
<pre class="brush: sql">
set lines 200
set pages 999
col plan_name for a30
col sql_text for a50 wrap
select plan_name, sql_text, enabled, accepted
from dba_sql_plan_baselines
where sql_text like '%object_type > %';
PLAN_NAME SQL_TEXT ENA ACC
------------------------------ ---------------------------------------- --- ---
SQL_PLAN_76jwvc1sug4k4ebe5b30f select object_type, count(*) YES NO
from admin.big_table
where object_type > 'C'
group by object_type
SQL_PLAN_76jwvc1sug4k473be0962 select object_type, count(*) YES NO
from admin.big_table
where object_type > 'C'
group by object_type
SQL_PLAN_76jwvc1sug4k44391ca35 select object_type, count(*) YES YES
from admin.big_table
where object_type > 'C'
group by object_type
</pre>
<br />
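To validate and accept the new plan manually, the SPM evolve API could be used; a minimal sketch, assuming we want to verify and accept the unaccepted in-memory plan from the listing above:<br />
<pre class="brush: sql">
SET SERVEROUTPUT ON
DECLARE
  v_report CLOB;
BEGIN
  -- Verify the unaccepted plan and accept it if it performs better
  v_report := DBMS_SPM.evolve_sql_plan_baseline(
                sql_handle => NULL,
                plan_name  => 'SQL_PLAN_76jwvc1sug4k473be0962',
                verify     => 'YES',
                commit     => 'YES');
  DBMS_OUTPUT.put_line(v_report);
END;
/
</pre>
<br />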
<br />
What will happen if I have a query with a hint?<br />
<br />
<pre class="brush: sql">
select /*+index(t IDX_OBJ_TYPE)*/
object_type, count(*)
from admin.big_table t
where object_type > 'C'
group by object_type;
SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'TYPICAL'));
SQL_ID 8k7fykgphx8ra, child number 0
-------------------------------------
select /*+index(t IDX_OBJ_TYPE)*/ object_type, count(*) from
admin.big_table t where object_type > 'C' group by object_type
Plan hash value: 3149057435
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 2770 (100)| |
| 1 | SORT GROUP BY NOSORT| | 39 | 351 | 2770 (1)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | IDX_OBJ_TYPE | 1000K| 8789K| 2770 (1)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("OBJECT_TYPE">'C')
</pre>
<br />
In-memory data access is ignored, as we have a hint forcing usage of the index.<br />
<br />
<pre class="brush: sql">
select /*+full(t)*/
object_type, count(*)
from admin.big_table t
where object_type > 'C'
group by object_type;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'TYPICAL'));
Plan hash value: 1753714399
--------------------------------------------------------------------------------------
|Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 39 | 351 | 255 (12)| 00:00:01|
| 1 | HASH GROUP BY | | 39 | 351 | 255 (12)| 00:00:01|
|*2 | TABLE ACCESS INMEMORY FULL| BIG_TABLE | 1000K| 8789K| 231 (3)| 00:00:01|
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - inmemory("OBJECT_TYPE">'C')
filter("OBJECT_TYPE">'C')
</pre>
<br />
In case we have a hint forcing a full scan, the query will read data from the in-memory column store, as "TABLE ACCESS INMEMORY FULL" and "TABLE ACCESS FULL" are the same full table scan operation to the optimizer.
<br />
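For completeness, the opposite is also possible: if you want a conventional full scan that bypasses the column store, 12c provides a NO_INMEMORY hint. A sketch (not something I needed here):<br />
<pre class="brush: sql">
-- Force a full scan but disallow the in-memory column store for this query
select /*+ full(t) no_inmemory(t) */
       object_type, count(*)
from   admin.big_table t
where  object_type > 'C'
group  by object_type;
</pre>
<br />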
<br />
<br />
<b>Conclusion</b> <br />
If your production application depends heavily on SQL profiles and SQL hints, it will be hard to exploit the full potential of the in-memory column store option in a short time.<br />
With SQL plan baselines it is slightly easier, because you can use Adaptive SQL Plan Management to alter plans.<br />
<br />
But you must dedicate some time to proper testing, because changing plans and dropping indexes blindly could cause many performance problems.
<br />
<br />
<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com2tag:blogger.com,1999:blog-2530682427657016426.post-65147285371731895372016-11-03T22:19:00.001+01:002020-11-10T07:50:04.500+01:00Reduce Hard Parse time using SQL ProfileA few days ago we had a concurrency problem with the "<b>cursor: pin S wait on X</b>" wait event. This wait event is mostly associated with parsing in some form.<br />
<br />
After a quick diagnosis I found the problematic query. It was a fairly complex query which was executed very often, with an average execution time of 0.20 seconds. As this query was using bind variables, Oracle reused the existing plan and problems with "cursor: pin S wait on X" wait events weren’t appearing.<br />
<br />
But when a hard parse occurred we experienced problems with the specified mutex waits. Query execution with hard parsing jumped from 0.20 seconds to over 2.1 seconds.<br />
<br />
One session would hold the mutex pin in exclusive mode while other sessions were waiting to get the mutex pin in share mode - waiting on the "cursor: pin S wait on X" wait event.<br />
<span id="fullpost">
<br />
Rewriting the query would solve this issue, but we needed a solution quickly.<br />
<br />
<br />
I decided to perform a few tests using SQL plan baselines and SQL profiles and measure the effect on hard parsing. The tested query is intentionally excluded from the post.<br />
<br />
Version : Oracle 12.1.0.2 <br />
<br />
Query execution statistics:<br />
<pre class="brush: text">
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 1.15 2.09 0 10 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.00 0.01 0 177 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 1.16 2.11 0 187 0 1
Statistics
----------------------------------------------------------
1691 recursive calls
0 db block gets
1594 consistent gets
0 physical reads
0 redo size
7266 bytes sent via SQL*Net to client
8393 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
60 sorts (memory)
0 sorts (disk)
1 rows processed
</pre>
<br />
Total query execution is 2.11 seconds, of which parsing took 2.09 seconds - practically the whole query execution time.<br />
<br />
<br />
What will happen if we create a fixed SQL plan baseline for the query?<br />
<br />
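The baseline creation step is not shown here; a minimal sketch of how a fixed baseline could be loaded from the cursor cache (&amp;sql_id stands for the SQL_ID of the tested query):<br />
<pre class="brush: sql">
-- Load the cursor's plan into a baseline and mark it FIXED in one step
DECLARE
  l_plans PLS_INTEGER;
BEGIN
  l_plans := DBMS_SPM.load_plans_from_cursor_cache(
               sql_id => '&sql_id',
               fixed  => 'YES');
END;
/
</pre>
<br />
With the fixed baseline in place, the trace looks like this:<br />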
<pre class="brush: text">
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 1.15 2.09 0 7 0 0
Execute 1 0.00 0.00 0 0 1 0
Fetch 2 0.00 0.01 0 177 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 1.16 2.11 0 184 1 1
Note
-----
- SQL plan baseline "SQL_PLAN_6q3anxq5dfsj4e57c1833" used for this statement
Statistics
----------------------------------------------------------
1691 recursive calls
0 db block gets
1594 consistent gets
0 physical reads
0 redo size
7287 bytes sent via SQL*Net to client
8393 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
60 sorts (memory)
0 sorts (disk)
1 rows processed
</pre>
<br />
I have practically the same results, which means the SQL plan baseline had no effect on parse time.<br />
<br />
<br />
But what will happen if I create a SQL profile instead of a baseline?<br />
<br />
<pre class="brush: text">
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.65 1.21 6 21 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.01 0.01 0 177 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 0.66 1.23 6 198 0 1
Note
-----
- SQL profile "PROFILE_09vf7nstqk7n2" used for this statement
Statistics
----------------------------------------------------------
654 recursive calls
0 db block gets
1300 consistent gets
6 physical reads
0 redo size
7284 bytes sent via SQL*Net to client
8393 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
60 sorts (memory)
0 sorts (disk)
1 rows processed
</pre>
<br />
This is a big improvement.<br />
Notice the elapsed time for the parse - <b>from 2.09 secs to 1.21 secs</b>.<br />
Check the query statistics - almost <b>three times fewer recursive calls</b>.<br />
<br />
<br />
But why?<br />
This is my explanation and I might be wrong, so please leave a comment below if that is the case.<br />
<br />
When we’re using SQL plan baselines for plan management, the first step is always generating execution plans from the optimizer. The cost-based optimizer produces several plans and then compares them with the plans in the SQL plan baseline. Many different plans will be probed as part of the optimizer's calculations; the SQL plan baseline has no effect on the number of calculations. <br />
<br />
With SQL profiles we feed the optimizer with estimations and hints before the calculation starts. The future plan will be influenced by the SQL profile. Basically we point the optimizer "in the right direction" and it will not perform the same amount of calculation as before. As a result we have <b>fewer recursive calls and less time spent on hard parsing</b>.
<br />
<br />
<br />
After "fixing" plan with SQL profile, I’ve tried to reproduce mutex concurrency intentionally forcing hard parse but now Oracle managed to perform hard parse without affecting many sessions. Maybe I’ve solved problem temporarily and bought some time for developers to rewrite problematic query.
<br />
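For completeness, a sketch of how a hard parse can be forced for a single statement without flushing the whole shared pool (&amp;sql_id stands for the cursor being tested):<br />
<pre class="brush: sql">
-- Find the cursor's address and hash value
select address, hash_value from v$sqlarea where sql_id = '&sql_id';
-- Purge just that cursor from the shared pool; the next execution hard parses
exec sys.dbms_shared_pool.purge('&address, &hash_value', 'C');
</pre>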
<br />
<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com1tag:blogger.com,1999:blog-2530682427657016426.post-35050428669872777182016-06-28T10:27:00.001+02:002020-11-10T07:50:14.076+01:00Using Adaptive Cursor Sharing with SQL Plan BaselinesWe have several databases where automatic capturing of SQL plan baselines is enabled for a few schemas.<br />
<br />
The execution plans of some queries depend heavily on bind variable values, and it is not always best to reuse the same execution plan for all executions. For those queries I want to avoid using literals and inefficient execution plans. Also, I want to use SQL plan baselines, as I have automatic capturing enabled.<br />
<br />
The question is: can I make Adaptive Cursor Sharing work with SQL plan baselines without changing the query?<br />
Can I activate bind awareness for every execution to avoid inefficient execution plans?<br />
<br />
I don't want to suffer even one inefficient execution while waiting for ACS to kick in automatically, because that one lousy execution could potentially be a big problem.<br />
<br />
<br />
For the demo case I’m using a 1,000,000-row table with skewed data:<br />
<br />
<span id="fullpost">
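The table setup is not shown in this post; a minimal sketch that would reproduce a similar owner skew (my assumption, not the original script):<br />
<pre class="brush: sql">
-- Build a 1M-row table where two owners dominate and two appear only once
create table big_table as
select case when rownum = 1 then 'MDSYS'
            when rownum = 2 then 'ORDSYS'
            when mod(rownum, 2) = 0 then 'PUBLIC'
            else 'SYS'
       end as owner,
       rownum as object_id,
       o.object_name
from   dba_objects o,
       (select level l from dual connect by level <= 2000)
where  rownum <= 1000000;
</pre>
<br />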
<pre class="brush: sql">
SQL> select * from v$version;
BANNER CON_ID
-------------------------------------------------------------------------------- ----------
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production 0
PL/SQL Release 12.1.0.2.0 - Production 0
CORE 12.1.0.2.0 Production 0
TNS for IBM/AIX RISC System/6000: Version 12.1.0.2.0 - Production 0
NLSRTL Version 12.1.0.2.0 - Production 0
select owner, count(*)
from big_table
group by owner;
OWNER COUNT(*)
---------- ----------
MDSYS 1
PUBLIC 499999
SYS 499999
ORDSYS 1
create index IDX_OWNER on BIG_TABLE(owner);
begin
dbms_stats.gather_table_stats(ownname=>'MSUTIC',tabname=>'BIG_TABLE',cascade=>TRUE, estimate_percent=>100, method_opt=>'for columns size 4 owner');
end;
/
</pre>
<br />
<br />
This is my test query.<br />
<br />
<pre class="brush: sql">
SQL> var own varchar2(10);
SQL> exec :own := 'SYS';
select owner, sum(object_id)
from big_table
where owner = :own
group by owner;
SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'TYPICAL'));
SQL_ID 5cdba9s9mkag7, child number 0
-------------------------------------
select owner, sum(object_id) from big_table where owner = :own group by
owner
Plan hash value: 2943376087
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 3552 (100)| |
| 1 | SORT GROUP BY NOSORT| | 499K| 9277K| 3552 (1)| 00:00:01 |
|* 2 | TABLE ACCESS FULL | BIG_TABLE | 499K| 9277K| 3552 (1)| 00:00:01 |
----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("OWNER"=:OWN)
</pre>
<br />
From the first execution the cursor is bind sensitive, because I have gathered statistics with a histogram.<br />
<br />
<pre class="brush: sql">
select sql_id
, is_bind_aware
, is_bind_sensitive
, is_shareable
, plan_hash_value
from v$sql
where sql_id = '5cdba9s9mkag7';
SQL_ID I I I PLAN_HASH_VALUE
------------- - - - ---------------
5cdba9s9mkag7 N Y Y 2943376087
</pre>
<br />
<br />
To enable bind awareness I want to insert the BIND_AWARE hint without changing the query. <br />
<br />
I will use SQL Patch for this:<br />
<br />
<pre class="brush: sql">
SQL> begin
sys.dbms_sqldiag_internal.i_create_patch(
sql_text => 'select owner, sum(object_id)
from big_table
where owner = :own
group by owner',
hint_text => 'BIND_AWARE',
name => 'bind_aware_patch');
end;
/ 2 3 4 5 6 7 8 9 10
PL/SQL procedure successfully completed.
</pre>
<br />
Now let’s check execution and bind awareness for the query.<br />
<br />
<pre class="brush: sql">
SQL> var own varchar2(10);
SQL> exec :own := 'SYS';
select owner, sum(object_id)
from big_table
where owner = :own
group by owner;
SQL_ID 5cdba9s9mkag7, child number 0
-------------------------------------
select owner, sum(object_id) from big_table where owner = :own group by
owner
Plan hash value: 2943376087
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 3552 (100)| |
| 1 | SORT GROUP BY NOSORT| | 499K| 9277K| 3552 (1)| 00:00:01 |
|* 2 | TABLE ACCESS FULL | BIG_TABLE | 499K| 9277K| 3552 (1)| 00:00:01 |
----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("OWNER"=:OWN)
Note
-----
- SQL patch "bind_aware_patch" used for this statement
select sql_id
, is_bind_aware
, is_bind_sensitive
, is_shareable
, plan_hash_value
from v$sql
where sql_id = '5cdba9s9mkag7';
SQL_ID I I I PLAN_HASH_VALUE
------------- - - - ---------------
5cdba9s9mkag7 Y Y Y 2943376087
</pre>
<br />
<br />
We have a note that the SQL patch is used and bind awareness is enabled. For every query execution, during hard parse, Oracle will peek at the bind variable and calculate an efficient execution plan accordingly. At least, that is what I would expect.<br />
<br />
<br />
Let’s try with another variable value - will Oracle alter the execution plan?<br />
<pre class="brush: sql">
SQL> var own varchar2(10);
SQL> exec :own := 'MDSYS';
select owner, sum(object_id)
from big_table
where owner = :own
group by owner;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'TYPICAL'));
SQL_ID 5cdba9s9mkag7, child number 1
-------------------------------------
select owner, sum(object_id) from big_table where owner = :own group by
owner
Plan hash value: 1772680857
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 4 (100)| |
| 1 | SORT GROUP BY NOSORT | | 1 | 19 | 4 (0)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID| BIG_TABLE | 1 | 19 | 4 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | IDX_OWNER | 1 | | 3 (0)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("OWNER"=:OWN)
Note
-----
- SQL patch "bind_aware_patch" used for this statement
select sql_id
, is_bind_aware
, is_bind_sensitive
, is_shareable
, plan_hash_value
from v$sql
where sql_id = '5cdba9s9mkag7';
SQL_ID I I I PLAN_HASH_VALUE
------------- - - - ---------------
5cdba9s9mkag7 Y Y Y 2943376087
5cdba9s9mkag7 Y Y Y 1772680857
</pre>
<br />
Notice how Oracle changed the execution plan; now we have two plans for the specified SQL text.<br />
<br />
<br />
Capture the SQL plans from the cursor cache to create baselines.<br />
<br />
<pre class="brush: sql">
DECLARE
my_plans pls_integer;
BEGIN
my_plans := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(
sql_id => '5cdba9s9mkag7');
END;
/
</pre>
<br />
We have two ACCEPTED plans saved for this query which Oracle will consider during execution, and a SQL patch forcing bind awareness.<br />
<br />
<pre class="brush: sql">
set lines 200
col sql_handle for a25
col plan_name for a35
select sql_handle, plan_name, enabled, accepted, fixed
from dba_sql_plan_baselines
where sql_handle='SQL_f02626d2f3cad6cc';
SQL_HANDLE PLAN_NAME ENA ACC FIX
------------------------- ----------------------------------- --- --- ---
SQL_f02626d2f3cad6cc SQL_PLAN_g09j6ubtwppqc69a8f699 YES YES NO
SQL_f02626d2f3cad6cc SQL_PLAN_g09j6ubtwppqcaf705ad7 YES YES NO
</pre>
<br />
<br />
Now we will perform a test to check whether Oracle will alter the execution plan based on the variable value.<br />
<br />
<pre class="brush: sql">
SQL> var own varchar2(10);
SQL> exec :own := 'SYS';
select owner, sum(object_id)
from big_table
where owner = :own
group by owner;
OWNER SUM(OBJECT_ID)
-------------------------------- --------------
SYS 7.5387E+10
SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'TYPICAL'));
SQL_ID 5cdba9s9mkag7, child number 0
-------------------------------------
select owner, sum(object_id) from big_table where owner = :own group by
owner
Plan hash value: 2943376087
----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 3552 (100)| |
| 1 | SORT GROUP BY NOSORT| | 499K| 9277K| 3552 (1)| 00:00:01 |
|* 2 | TABLE ACCESS FULL | BIG_TABLE | 499K| 9277K| 3552 (1)| 00:00:01 |
----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("OWNER"=:OWN)
Note
-----
- SQL patch "bind_aware_patch" used for this statement
- SQL plan baseline SQL_PLAN_g09j6ubtwppqcaf705ad7 used for this statement
</pre>
<br />
Oracle used SQL patch and SQL plan baseline.<br />
<br />
What if I change the variable value?<br />
<br />
<pre class="brush: sql">
SQL> var own varchar2(10);
SQL> exec :own := 'MDSYS';
select owner, sum(object_id)
from big_table
where owner = :own
group by owner;
OWNER SUM(OBJECT_ID)
-------------------------------- --------------
MDSYS 182924
SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(format => 'TYPICAL'));
SQL_ID 5cdba9s9mkag7, child number 1
-------------------------------------
select owner, sum(object_id) from big_table where owner = :own group by
owner
Plan hash value: 1772680857
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 4 (100)| |
| 1 | SORT GROUP BY NOSORT | | 1 | 19 | 4 (0)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID| BIG_TABLE | 1 | 19 | 4 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | IDX_OWNER | 1 | | 3 (0)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("OWNER"=:OWN)
Note
-----
- SQL patch "bind_aware_patch" used for this statement
- SQL plan baseline SQL_PLAN_g09j6ubtwppqc69a8f699 used for this statement
</pre>
<br />
Oracle immediately changed the execution plan and used a different SQL plan baseline.<br />
<br />
<br />
In the end I have the original query with bind variables, I have SQL plan baselines captured, and I’m using the powerful ACS feature to get efficient plans for different variable values.<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com2tag:blogger.com,1999:blog-2530682427657016426.post-64130321218410097292016-02-24T14:37:00.002+01:002020-11-10T07:50:29.784+01:00Slow full table scan due to row chainingA few days ago I received a complaint that a simple count on a 2-million-row table was running forever.<br />
<br />
This was the statement:<br />
<pre class="brush: sql">
select count(1)
from CLIENT k
where k.expires is null;
</pre>
<br />
I've used fake names for the table and columns.
<br />
Database version: 11.2.0.4.0<br />
<br />
Indeed, the query was running longer than I would expect. Oracle was using a FULL SCAN of the table with "db file sequential read" wait events. This was a little odd to me, as I would expect "direct path read" or "db file scattered read" waits.<br />
<br />
<br />
<span id="fullpost">
It was a partitioned table with 4 partitions and 294 columns.<br />
<br />
<pre class="brush: sql">
select count(*) from dba_tab_columns where table_name = 'CLIENT';
COUNT(*)
----------
294
select owner, segment_name, partition_name, bytes, blocks
from dba_segments
where segment_name in ('CLIENT');
OWNER SEGMENT_NAME PARTITION_NAME BYTES BLOCKS
---------- --------------- -------------------- ---------- ----------
SCOTT CLIENT CLIENT_OTHER 8388608 1024
SCOTT CLIENT CLIENT_CITY 1643118592 200576
SCOTT CLIENT CLIENT_CNTR 591396864 72192
SCOTT CLIENT CLIENT_STRNG 52428800 6400
select table_name, partition_name, NUM_ROWS, AVG_ROW_LEN
from dba_tab_partitions
where table_name='CLIENT';
TABLE_NAME PARTITION_NAME NUM_ROWS AVG_ROW_LEN
------------------------------ ----------------------- ----------- ---------------
CLIENT CLIENT_OTHER 0 0
CLIENT CLIENT_CITY 1469420 572
CLIENT CLIENT_CNTR 592056 495
CLIENT CLIENT_STRNG 48977 565
select table_name, data_type, count(*)
from dba_tab_cols
where table_name='CLIENT'
group by table_name, data_type
order by 3 desc;
TABLE_NAME DATA_TYPE COUNT(*)
---------- ---------------------------------------- ----------
CLIENT NUMBER 191
CLIENT VARCHAR2 70
CLIENT DATE 32
CLIENT TIMESTAMP(6) 3
CLIENT RAW 2
CLIENT CL_UTR 1
CLIENT O_TIP_KAR 1
CLIENT O_ZA_NA 1
CLIENT O_PO_OSO 1
</pre>
<br />
Some of the columns were collections.<br />
<br />
<pre class="brush: sql">
select type_name, typecode
from dba_types
where type_name in (select data_type
from dba_tab_cols
where table_name='CLIENT'
and data_type not in ('NUMBER','VARCHAR2',
'DATE','TIMESTAMP(6)','RAW'));
TYPE_NAME TYPECODE
------------------------------ ------------------------------
CL_UTR COLLECTION
O_TIP_KAR COLLECTION
O_ZA_NA COLLECTION
O_PO_OSO COLLECTION
</pre>
<br />
These were varrays used to store multivalued attributes.<br />
<br />
<br />
In the trace I saw lots of disk reads and an elapsed time of over 2400 seconds.<br />
<br />
<pre class="brush: text">
select count(1)
from CLIENT k
where k.expires is null
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 203.96 2450.19 5455717 8240323 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 203.97 2450.20 5455717 8240323 0 1
Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 369 (MSUTIC)
Number of plan statistics captured: 1
Rows (1st) Rows (avg) Rows (max) Row Source Operation
---------- ---------- ---------- ---------------------------------------------------
1 1 1 SORT AGGREGATE (cr=8240323 pr=5455717 pw=0 time=1349733885 us)
1905617 1905617 1905617 PARTITION LIST ALL PARTITION: 1 4 (cr=8240323 pr=5455717 pw=0 time=2449532855 us cost=164110 size=3801914 card=1900957)
1905617 1905617 1905617 TABLE ACCESS FULL CLIENT PARTITION: 1 4 (cr=8240323 pr=5455717 pw=0 time=2448530798 us cost=164110 size=3801914 card=1900957)
Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT MODE: ALL_ROWS
1 SORT (AGGREGATE)
1905617 PARTITION LIST (ALL) PARTITION: START=1 STOP=4
1905617 TABLE ACCESS MODE: ANALYZED (FULL) OF 'CLIENT' (TABLE)
PARTITION: START=1 STOP=4
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to CLIENT 2 0.00 0.00
Disk file operations I/O 29 0.00 0.00
direct path read 2048 0.19 9.78
db file sequential read 5178860 0.23 2241.08
resmgr:internal state change 2 0.11 0.21
SQL*Net message from CLIENT 1 0.00 0.00
</pre>
<br />
Object statistics were telling me that all reads were from table partitions.<br />
<br />
<pre class="brush: sql">
Session Objects Statistics
Object/Event % Time Seconds Calls - Time per Call -
Avg Min Max
Obj#(299564)
db file sequential read 78.1% 1,757.0600s 3,677,752 0.0005s 0.0001s 0.2333s
direct path read 0.4% 8.8314s 1,706 0.0052s 0.0004s 0.1953s
resmgr:internal state change 0.0% 0.2162s 2 0.1081s 0.1000s 0.1162s
Disk file operations I/O 0.0% 0.0014s 23 0.0001s 0.0000s 0.0002s
Obj#(299565)
db file sequential read 20.5% 462.5006s 1,416,370 0.0003s 0.0001s 0.1794s
direct path read 0.0% 0.8966s 304 0.0029s 0.0001s 0.0479s
Disk file operations I/O 0.0% 0.0003s 6 0.0000s 0.0000s 0.0000s
Obj#(299566)
db file sequential read 1.0% 21.5203s 84,738 0.0003s 0.0001s 0.0552s
direct path read 0.0% 0.0587s 38 0.0015s 0.0000s 0.0206s
</pre>
<br />
<br />
Hm… why am I having so many db file sequential reads, with direct path reads happening as well?<br />
This is a table with lots of columns, so I might have problems with chained or migrated rows.<br />
Oracle is probably using individual block reads to fetch the pieces of each row.<br />
<br />
As the table has more than 255 columns I would expect intra-block chaining, but that shouldn't cause sequential reads; only a row that doesn't fit into a single block would give regular row chaining.<br />
I’m probably having problems with row migration.<br />
<br />
A chained row is a row that is too large to fit into a block, and if this is the root cause of the problem there isn't much I can do to improve performance. If a row is too big to fit into a block, it would still be too big after rebuilding the table.<br />
<br />
Migration of a row occurs when a row is updated and the amount of free space in its block is not adequate to store all of the row’s data; the row is migrated to another physical block.<br />
This usually happens when PCTFREE is set too low.<br />
<br />
What is important for migrated rows is that you can improve performance by reorganizing the table/partition or simply deleting and re-inserting the migrated rows.<br />
<br />
Tanel Poder wrote a blog post on the subject, "<a href="http://blog.tanelpoder.com/2009/11/04/detect-chained-and-migrated-rows-in-oracle/">Detect chained and migrated rows in Oracle – Part 1</a>", and I decided to use his great Snapper tool to get some diagnostic info.<br />
<br />
<pre class="brush: text">
SQL> @sn 60 6596
@snapper all 60 1 "6596"
Sampling SID 6596 with interval 60 seconds, taking 1 snapshots...
-- Session Snapper v4.06 BETA - by Tanel Poder ( http://blog.tanelpoder.com ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet! :)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SID, USERNAME , TYPE, STATISTIC , DELTA, HDELTA/SEC, %TIME, GRAPH , NUM_WAITS, WAITS/SEC, AVERAGES
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
6596, MSUTIC , STAT, session logical reads , 283813, 4.74k, , , , , ~ per execution
6596, MSUTIC , STAT, user I/O wait time , 5719, 95.46, , , , , ~ per execution
6596, MSUTIC , STAT, non-idle wait time , 5719, 95.46, , , , , ~ per execution
6596, MSUTIC , STAT, non-idle wait count , 193388, 3.23k, , , , , ~ per execution
6596, MSUTIC , STAT, session pga memory , -8400, -140.21, , , , , ~ per execution
6596, MSUTIC , STAT, enqueue requests , 2, .03, , , , , ~ per execution
6596, MSUTIC , STAT, enqueue releases , 2, .03, , , , , ~ per execution
6596, MSUTIC , STAT, physical read total IO requests , 193740, 3.23k, , , , , ~ per execution
6596, MSUTIC , STAT, physical read total multi block requests , 353, 5.89, , , , , ~ per execution
6596, MSUTIC , STAT, physical read total bytes , 1630494720, 27.21M, , , , , ~ per execution
6596, MSUTIC , STAT, cell physical IO interconnect bytes , 1630494720, 27.21M, , , , , ~ per execution
6596, MSUTIC , STAT, consistent gets , 283812, 4.74k, , , , , ~ per execution
6596, MSUTIC , STAT, consistent gets direct , 283810, 4.74k, , , , , ~ per execution
6596, MSUTIC , STAT, physical reads , 199034, 3.32k, , , , , ~ per execution
6596, MSUTIC , STAT, physical reads direct , 199034, 3.32k, , , , , ~ per execution
6596, MSUTIC , STAT, physical read IO requests , 193739, 3.23k, , , , , ~ per execution
6596, MSUTIC , STAT, physical read bytes , 1630486528, 27.21M, , , , , ~ per execution
6596, MSUTIC , STAT, file io wait time , 57195780, 954.66k, , , , , ~ per execution
6596, MSUTIC , STAT, Number of read IOs issued , 353, 5.89, , , , , ~ per execution
6596, MSUTIC , STAT, no work - consistent read gets , 283808, 4.74k, , , , , ~ per execution
6596, MSUTIC , STAT, table scan rows gotten , 2881106, 48.09k, , , , , ~ per execution
6596, MSUTIC , STAT, table scan blocks gotten , 83578, 1.4k, , , , , ~ per execution
6596, MSUTIC , STAT, table fetch continued row , 200188, 3.34k, , , , , ~ per execution
6596, MSUTIC , STAT, buffer is not pinned count , 200226, 3.34k, , , , , ~ per execution
6596, MSUTIC , TIME, DB CPU , 5620720, 93.82ms, 9.4%, [@ ], , ,
6596, MSUTIC , TIME, sql execute elapsed time , 60270147, 1.01s, 100.6%, [##########], , ,
6596, MSUTIC , TIME, DB time , 60270147, 1.01s, 100.6%, [##########], , , ~ unaccounted time
6596, MSUTIC , WAIT, Disk file operations I/O , 123, 2.05us, .0%, [ ], 2, .03, 61.5us average wait
6596, MSUTIC , WAIT, db file sequential read , 57234629, 955.31ms, 95.5%, [WWWWWWWWWW], 192888, 3.22k, 296.72us average wait
-- End of Stats snap 1, end=2016-02-23 13:23:19, seconds=59.9
----------------------------------------------------------------------------------------------------
Active% | INST | SQL_ID | SQL_CHILD | EVENT | WAIT_CLASS
----------------------------------------------------------------------------------------------------
97% | 1 | 2q92xdvxjj712 | 0 | db file sequential read | User I/O
3% | 1 | 2q92xdvxjj712 | 0 | ON CPU | ON CPU
-- End of ASH snap 1, end=2016-02-23 13:23:19, seconds=60, samples_taken=99
PL/SQL procedure successfully completed.
</pre>
<br />
Notice "<b>table fetch continued row</b>" statistic. Tanel wrote that this counter is usually increasing when rows are accessed with index access paths. <br />
In my case I have full scan that is increasing the value. This count is number of chained pieces Oracle had to go through in order to find the individual pieces of the rows.<br />
I won’t go any further in detail - just check Tanel’s blog post.<br />
<br />
<br />
Let’s identify the chained rows by running the ANALYZE command with the LIST CHAINED ROWS option. This command collects information about each migrated or chained row.<br />
<br />
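The output goes into the CHAINED_ROWS table, which must exist first; it is created by the utlchain.sql script shipped with the database (sketch):<br />
<pre class="brush: sql">
-- Create the default CHAINED_ROWS output table in the current schema
SQL> @?/rdbms/admin/utlchain.sql
</pre>
<br />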
<pre class="brush: sql">
SQL> analyze table SCOTT.CLIENT list chained rows;
Table analyzed.
SQL> select count(*) from chained_rows;
COUNT(*)
----------
2007045
SQL> select partition_name, count(*) from chained_rows group by partition_name;
PARTITION_NAME COUNT(*)
------------------------------ ----------
CLIENT_CITY 1411813
CLIENT_CNTR 552873
CLIENT_STRNG 42359
</pre>
<br />
A table with 2097647 rows has <b>2007045 chained/migrated rows</b>. This is what caused so many reads for a simple full scan of a small table.<br />
<br />
<br />
I decided to rebuild the table partitions, without changing the PCTFREE parameter, to fit the migrated rows back into a single block. A sketch of the commands is shown below.<br />
<br />
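The exact rebuild commands are not in my notes, but the general pattern (partition names taken from the CHAINED_ROWS output above; storage clauses omitted) would look like this:<br />
<pre class="brush: sql">
-- moving a partition rewrites its rows, which un-migrates migrated rows
alter table SCOTT.CLIENT move partition CLIENT_CITY;
alter table SCOTT.CLIENT move partition CLIENT_CNTR;
alter table SCOTT.CLIENT move partition CLIENT_STRNG;
-- a move marks local index partitions UNUSABLE, so rebuild them
alter table SCOTT.CLIENT modify partition CLIENT_CITY rebuild unusable local indexes;
alter table SCOTT.CLIENT modify partition CLIENT_CNTR rebuild unusable local indexes;
alter table SCOTT.CLIENT modify partition CLIENT_STRNG rebuild unusable local indexes;
-- clear previous ANALYZE results before re-checking
truncate table chained_rows;
</pre>
<br />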
<br />
After the rebuild the number of chained rows decreased.<br />
<br />
<pre class="brush: sql">
SQL> analyze table SCOTT.CLIENT list chained rows;
Table analyzed.
SQL> select count(*) from chained_rows;
COUNT(*)
----------
37883
</pre>
<br />
Now the query finished in 14 secs, without the single-block sequential reads.<br />
<pre class="brush: text">
select count(1)
from CLIENT k
where k.expires is null
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 2.34 13.96 185802 185809 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 2.34 13.96 185802 185809 0 1
Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 369 (MSUTIC)
Number of plan statistics captured: 1
Rows (1st) Rows (avg) Rows (max) Row Source Operation
---------- ---------- ---------- ---------------------------------------------------
1 1 1 SORT AGGREGATE (cr=185809 pr=185802 pw=0 time=13965941 us)
1905617 1905617 1905617 PARTITION LIST ALL PARTITION: 1 4 (cr=185809 pr=185802 pw=0 time=13560379 us cost=109526 size=3811234 card=1905617)
1905617 1905617 1905617 TABLE ACCESS FULL CLIENT PARTITION: 1 4 (cr=185809 pr=185802 pw=0 time=12848619 us cost=109526 size=3811234 card=1905617)
Rows Execution Plan
------- ---------------------------------------------------
0 SELECT STATEMENT MODE: ALL_ROWS
1 SORT (AGGREGATE)
1905617 PARTITION LIST (ALL) PARTITION: START=1 STOP=4
1905617 TABLE ACCESS MODE: ANALYZED (FULL) OF 'CLIENT' (TABLE)
PARTITION: START=1 STOP=4
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to CLIENT 2 0.00 0.00
direct path read 3569 0.11 8.99
SQL*Net message from CLIENT 2 0.00 0.01
</pre>
<br />
<br />
Snapper also showed that I no longer have a problem with row chaining.<br />
<pre class="brush: text">
SQL> @sn 15 6601
@snapper all 15 1 "6601"
Sampling SID 6601 with interval 15 seconds, taking 1 snapshots...
-- Session Snapper v4.06 BETA - by Tanel Poder ( http://blog.tanelpoder.com ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet! :)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SID, USERNAME , TYPE, STATISTIC , DELTA, HDELTA/SEC, %TIME, GRAPH , NUM_WAITS, WAITS/SEC, AVERAGES
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
6601, MSUTIC , STAT, Requests to/from CLIENT , 1, .07, , , , , ~ per execution
6601, MSUTIC , STAT, user calls , 1, .07, , , , , ~ per execution
6601, MSUTIC , STAT, pinned cursors current , -1, -.07, , , , , ~ per execution
6601, MSUTIC , STAT, session logical reads , 149590, 9.9k, , , , , ~ per execution
6601, MSUTIC , STAT, CPU used when call started , 227, 15.02, , , , , ~ per execution
6601, MSUTIC , STAT, CPU used by this session , 227, 15.02, , , , , ~ per execution
6601, MSUTIC , STAT, DB time , 1047, 69.29, , , , , ~ per execution
6601, MSUTIC , STAT, user I/O wait time , 424, 28.06, , , , , ~ per execution
6601, MSUTIC , STAT, non-idle wait time , 424, 28.06, , , , , ~ per execution
6601, MSUTIC , STAT, non-idle wait count , 3216, 212.84, , , , , ~ per execution
6601, MSUTIC , STAT, session uga memory , 135248, 8.95k, , , , , ~ per execution
6601, MSUTIC , STAT, physical read total IO requests , 9354, 619.07, , , , , ~ per execution
6601, MSUTIC , STAT, physical read total multi block requests , 9333, 617.68, , , , , ~ per execution
6601, MSUTIC , STAT, physical read total bytes , 1225228288, 81.09M, , , , , ~ per execution
6601, MSUTIC , STAT, cell physical IO interconnect bytes , 1225228288, 81.09M, , , , , ~ per execution
6601, MSUTIC , STAT, consistent gets , 149578, 9.9k, , , , , ~ per execution
6601, MSUTIC , STAT, consistent gets from cache , 5, .33, , , , , ~ per execution
6601, MSUTIC , STAT, consistent gets from cache (fastpath) , 5, .33, , , , , ~ per execution
6601, MSUTIC , STAT, consistent gets direct , 149572, 9.9k, , , , , ~ per execution
6601, MSUTIC , STAT, logical read bytes from cache , 40960, 2.71k, , , , , ~ per execution
6601, MSUTIC , STAT, physical reads , 149548, 9.9k, , , , , ~ per execution
6601, MSUTIC , STAT, physical reads direct , 149548, 9.9k, , , , , ~ per execution
6601, MSUTIC , STAT, physical read IO requests , 9353, 619.01, , , , , ~ per execution
6601, MSUTIC , STAT, physical read bytes , 1225097216, 81.08M, , , , , ~ per execution
6601, MSUTIC , STAT, calls to kcmgcs , 5, .33, , , , , ~ per execution
6601, MSUTIC , STAT, file io wait time , 304, 20.12, , , , , ~ per execution
6601, MSUTIC , STAT, total number of slots , -2, -.13, , , , , ~ per execution
6601, MSUTIC , STAT, Effective IO time , 4239980, 280.61k, , , , , ~ per execution
6601, MSUTIC , STAT, Number of read IOs issued , 9354, 619.07, , , , , ~ per execution
6601, MSUTIC , STAT, no work - consistent read gets , 149564, 9.9k, , , , , ~ per execution
6601, MSUTIC , STAT, Cached Commit SCN referenced , 149132, 9.87k, , , , , ~ per execution
6601, MSUTIC , STAT, table scans (cache partitions) , 3, .2, , , , , ~ per execution
6601, MSUTIC , STAT, table scans (direct read) , 3, .2, , , , , ~ per execution
6601, MSUTIC , STAT, table scan rows gotten , 3518684, 232.88k, , , , , ~ per execution
6601, MSUTIC , STAT, table scan blocks gotten , 149559, 9.9k, , , , , ~ per execution
6601, MSUTIC , STAT, bytes sent via SQL*Net to CLIENT , 211, 13.96, , , , , 105.5 bytes per roundtrip
6601, MSUTIC , STAT, bytes received via SQL*Net from CLIENT , 8, .53, , , , , ~ per execution
6601, MSUTIC , STAT, SQL*Net roundtrips to/from CLIENT , 2, .13, , , , , ~ per execution
6601, MSUTIC , TIME, DB CPU , 2000964, 132.43ms, 13.2%, [@@ ], , ,
6601, MSUTIC , TIME, sql execute elapsed time , 8500210, 562.57ms, 56.3%, [###### ], , ,
6601, MSUTIC , TIME, DB time , 8500269, 562.57ms, 56.3%, [###### ], , , 14.62s unaccounted time
6601, MSUTIC , WAIT, direct path read , 4059380, 268.66ms, 26.9%, [WWW ], 3064, 202.78, 1.32ms average wait
6601, MSUTIC , WAIT, SQL*Net message to CLIENT , 4, .26us, .0%, [ ], 1, .07, 4us average wait
6601, MSUTIC , WAIT, SQL*Net message from CLIENT , 8006127, 529.87ms, 53.0%, [WWWWWW ], 1, .07, 8.01s average wait
-- End of Stats snap 1, end=2016-02-24 08:23:59, seconds=15.1
----------------------------------------------------------------------------------------------------
Active% | INST | SQL_ID | SQL_CHILD | EVENT | WAIT_CLASS
----------------------------------------------------------------------------------------------------
29% | 1 | gg54c4j6b9jb0 | 0 | direct path read | User I/O
21% | 1 | gg54c4j6b9jb0 | 0 | ON CPU | ON CPU
-- End of ASH snap 1, end=2016-02-24 08:23:59, seconds=15, samples_taken=96
</pre>
<br />
<br />
Reorganizing the table solved my problem - full scans on the table now run much faster.<br />
<br />
There is an interesting support note "Doc ID 238519.1" which states that trailing NULLs do not take space in the row piece: initially the row fits in one row piece.<br />
If a column beyond 255 is then populated, all the NULL columns between the last populated column and this new column start taking up space.<br />
The row has to be split into two row pieces, and the new row piece is migrated to a new block - the <b>row becomes chained</b>.<br />
<br />
Our table has trailing NULL columns, so this probably caused the migration.<br />
<br />
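A quick way to reproduce the effect described in the note (hypothetical table T_WIDE, not the actual table from this case; assumes the CHAINED_ROWS table exists):<br />
<pre class="brush: sql">
-- build a 300-column table dynamically
declare
  l_sql varchar2(32767) := 'create table t_wide (c1 number';
begin
  for i in 2 .. 300 loop
    l_sql := l_sql || ', c' || i || ' number';
  end loop;
  execute immediate l_sql || ')';
end;
/
-- rows with only C1 populated fit in one row piece
insert into t_wide (c1) select rownum from dual connect by level &lt;= 1000;
commit;
-- populating a column beyond 255 forces a second row piece per row
update t_wide set c300 = 1;
commit;
truncate table chained_rows;
analyze table t_wide list chained rows;
select count(*) from chained_rows;
</pre>
<br />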
<br />
Unfortunately I don’t have time to perform a more detailed investigation.
<br />
<br />
<br />
<br />
<br />
<b>REFERENCES</b><br />
<a href="http://blog.tanelpoder.com/2009/11/04/detect-chained-and-migrated-rows-in-oracle/">http://blog.tanelpoder.com/2009/11/04/detect-chained-and-migrated-rows-in-oracle/</a><br />
Updating a Row with More Than 255 Columns Causes Row Chaining (Doc ID 238519.1)<br />
<br />
</span>
Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com5tag:blogger.com,1999:blog-2530682427657016426.post-55217765283270640402016-02-20T10:22:00.003+01:002020-11-10T07:50:39.002+01:00Detecting Soft Corruption in 12c - V$NONLOGGED_BLOCK, ORA-01578/ORA-26040Last week we created a standby database in our dev environment and performed some ETL actions on the primary side. Data loading and index rebuilds were done with the NOLOGGING option. After a few days we noticed lots of ORA-01578/ORA-26040 errors.<br />
The corruption happened because we forgot to enable force logging.<br />
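The fix for the root cause itself is a one-liner on the primary (shown here for completeness):<br />
<pre class="brush: sql">
SQL> alter database force logging;
Database altered.
</pre>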
<br />
As this was a new dev database there was no backup, but maybe not everything was lost. If the only corrupted segments were indexes, we could easily rebuild them.<br />
<br />
Then I learnt something new.<br />
After performing VALIDATE CHECK LOGICAL we noticed lots of corrupted blocks, but I was puzzled why the “v$database_block_corruption” view was empty. Then my colleague told me that Oracle changed the behaviour of reporting soft corrupted blocks in the 12c version (we were using 12.1.0.2). A different view is now populated - <b>V$NONLOGGED_BLOCK</b>.<br />
<br />
So I created a little demo of how to detect (and repair) soft corrupted blocks in a 12c database.<br />
<br />
<br />
<br />
<span id="fullpost">
Create a tablespace and a small table.<br />
<pre class="brush: sql">
SQL> create tablespace DEMO1 datafile '/oradata1/data/ora12c/demo01.dbf' size 50M;
Tablespace created.
SQL> create table objects tablespace DEMO1 as select * from dba_objects;
Table created.
SQL> alter table objects add constraint pk_obj primary key (object_id);
Table altered.
SQL> create index idx_obj_name on objects(object_name) tablespace demo1;
Index created.
</pre>
<br />
Back up the tablespace.<br />
<pre class="brush: sql">
RMAN> backup tablespace DEMO1;
Starting backup at 23-AUG-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=50 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00002 name=/oradata1/data/ora12c/demo01.dbf
channel ORA_DISK_1: starting piece 1 at 23-AUG-15
channel ORA_DISK_1: finished piece 1 at 23-AUG-15
piece handle=/oradata1/fra/ORA12C/backupset/2015_08_23/o1_mf_nnndf_TAG20150823T060639_bxlkpj3j_.bkp tag=TAG20150823T060639 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 23-AUG-15
Starting Control File and SPFILE Autobackup at 23-AUG-15
piece handle=/oradata1/fra/ORA12C/autobackup/2015_08_23/o1_mf_s_888473201_bxlkpktg_.bkp comment=NONE
Finished Control File and SPFILE Autobackup at 23-AUG-15
</pre>
<br />
Rebuild the index with the NOLOGGING option so we can simulate soft corruption later.<br />
<pre class="brush: sql">
RMAN> alter index idx_obj_name rebuild nologging;
Statement processed
</pre>
<br />
Confirm that we have datafiles that require a backup because they have been affected by a NOLOGGING operation.<br />
<pre class="brush: sql">
RMAN> report unrecoverable;
Report of files that need backup due to unrecoverable operations
File Type of Backup Required Name
---- ----------------------- -----------------------------------
2 full or incremental /oradata1/data/ora12c/demo01.dbf
5 full or incremental /oradata1/data/ora12c/example01.dbf
</pre>
<br />
Simulate the corruption by restoring the datafile from the backup taken before the NOLOGGING rebuild - recovery cannot reproduce the unlogged changes.<br />
<pre class="brush: sql">
RMAN> alter database datafile 2 offline;
Statement processed
RMAN> restore datafile 2;
Starting restore at 23-AUG-15
using channel ORA_DISK_1
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00002 to /oradata1/data/ora12c/demo01.dbf
channel ORA_DISK_1: reading from backup piece /oradata1/fra/ORA12C/backupset/2015_08_23/o1_mf_nnndf_TAG20150823T060639_bxlkpj3j_.bkp
channel ORA_DISK_1: piece handle=/oradata1/fra/ORA12C/backupset/2015_08_23/o1_mf_nnndf_TAG20150823T060639_bxlkpj3j_.bkp tag=TAG20150823T060639
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:00:03
Finished restore at 23-AUG-15
RMAN> recover datafile 2;
Starting recover at 23-AUG-15
using channel ORA_DISK_1
starting media recovery
media recovery complete, elapsed time: 00:00:01
Finished recover at 23-AUG-15
RMAN> alter database datafile 2 online;
Statement processed
</pre>
<br />
Query the table through the corrupted index and notice the error.<br />
<pre class="brush: sql">
SQL> select count(*) from objects where object_name like 'A%';
select count(*) from objects where object_name like 'A%'
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 2, block # 2617)
ORA-01110: data file 2: '/oradata1/data/ora12c/demo01.dbf'
ORA-26040: Data block was loaded using the NOLOGGING option
</pre>
<br />
Let’s validate the datafile to check for block corruption.<br />
<pre class="brush: sql">
RMAN> backup validate check logical datafile 2;
Starting backup at 23-AUG-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=40 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00002 name=/oradata1/data/ora12c/demo01.dbf
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
2 OK 460 129 6401 1776280
File Name: /oradata1/data/ora12c/demo01.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 0 1537
Index 0 462
Other 0 4272
Finished backup at 23-AUG-15
</pre>
<br />
Notice that we have 460 blocks marked corrupt, yet the v$database_block_corruption view is empty.<br />
<pre class="brush: sql">
SQL> select count(*) from v$database_block_corruption;
COUNT(*)
----------
0
</pre>
<br />
Let’s query the v$nonlogged_block view.<br />
<pre class="brush: sql">
SQL> set lines 200
SQL> set pages 999
SQL> select file#, block#, blocks,object#,reason from v$nonlogged_block;
FILE# BLOCK# BLOCKS OBJECT# REASON
---------- ---------- ---------- ---------------------------------------- -------
2 2308 12 UNKNOWN
2 2321 15 UNKNOWN
2 2337 15 UNKNOWN
2 2353 15 UNKNOWN
2 2369 15 UNKNOWN
2 2385 15 UNKNOWN
2 2401 15 UNKNOWN
2 2417 15 UNKNOWN
2 2434 126 UNKNOWN
2 2562 126 UNKNOWN
2 2690 91 UNKNOWN
11 rows selected.
</pre>
<br />
<br />
Will RMAN detect that we have corrupted blocks?<br />
<pre class="brush: sql">
RMAN> backup datafile 2;
Starting backup at 23-AUG-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=54 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00002 name=/oradata1/data/ora12c/demo01.dbf
channel ORA_DISK_1: starting piece 1 at 23-AUG-15
channel ORA_DISK_1: finished piece 1 at 23-AUG-15
piece handle=/oradata1/fra/ORA12C/backupset/2015_08_23/o1_mf_nnndf_TAG20150823T061602_bxll8275_.bkp tag=TAG20150823T061602 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
Finished backup at 23-AUG-15
</pre>
The RMAN backup won’t fail due to NOLOGGING corrupt blocks, so our backup will contain the soft corrupted blocks.<br />
<br />
Let’s identify the corrupt segments using the v$nonlogged_block view.<br />
<pre class="brush: sql">
set lines 2000
set pages 9999
col owner for a20
col partition_name for a10
col segment_name for a20
SELECT e.owner, e.segment_type, e.segment_name, e.partition_name, c.file#
, greatest(e.block_id, c.block#) corr_start_block#
, least(e.block_id+e.blocks-1, c.block#+c.blocks-1) corr_end_block#
, least(e.block_id+e.blocks-1, c.block#+c.blocks-1)
- greatest(e.block_id, c.block#) + 1 blocks_corrupted
FROM dba_extents e, V$NONLOGGED_BLOCK c
WHERE e.file_id = c.file#
AND e.block_id <= c.block# + c.blocks - 1
AND e.block_id + e.blocks - 1 >= c.block#
UNION
SELECT s.owner, s.segment_type, s.segment_name, s.partition_name, c.file#
, header_block corr_start_block#
, header_block corr_end_block#
, 1 blocks_corrupted
FROM dba_segments s, V$NONLOGGED_BLOCK c
WHERE s.header_file = c.file#
AND s.header_block between c.block# and c.block# + c.blocks - 1
UNION
SELECT null owner, null segment_type, null segment_name, null partition_name, c.file#
, greatest(f.block_id, c.block#) corr_start_block#
, least(f.block_id+f.blocks-1, c.block#+c.blocks-1) corr_end_block#
, least(f.block_id+f.blocks-1, c.block#+c.blocks-1)
- greatest(f.block_id, c.block#) + 1 blocks_corrupted
FROM dba_free_space f, V$NONLOGGED_BLOCK c
WHERE f.file_id = c.file#
AND f.block_id <= c.block# + c.blocks - 1
AND f.block_id + f.blocks - 1 >= c.block#
order by file#, corr_start_block#
/
OWNER SEGMENT_TYPE SEGMENT_NAME PARTITION_ FILE# CORR_START_BLOCK# CORR_END_BLOCK# BLOCKS_CORRUPTED
-------------------- ------------------ -------------------- ---------- ---------- ----------------- --------------- ----------------
SYS INDEX IDX_OBJ_NAME 2 2308 2311 4
SYS INDEX IDX_OBJ_NAME 2 2312 2319 8
SYS INDEX IDX_OBJ_NAME 2 2321 2327 7
SYS INDEX IDX_OBJ_NAME 2 2328 2335 8
SYS INDEX IDX_OBJ_NAME 2 2337 2343 7
SYS INDEX IDX_OBJ_NAME 2 2344 2351 8
SYS INDEX IDX_OBJ_NAME 2 2353 2359 7
SYS INDEX IDX_OBJ_NAME 2 2360 2367 8
SYS INDEX IDX_OBJ_NAME 2 2369 2375 7
SYS INDEX IDX_OBJ_NAME 2 2376 2383 8
SYS INDEX IDX_OBJ_NAME 2 2385 2391 7
SYS INDEX IDX_OBJ_NAME 2 2392 2399 8
SYS INDEX IDX_OBJ_NAME 2 2401 2407 7
SYS INDEX IDX_OBJ_NAME 2 2408 2415 8
SYS INDEX IDX_OBJ_NAME 2 2417 2423 7
SYS INDEX IDX_OBJ_NAME 2 2424 2431 8
SYS INDEX IDX_OBJ_NAME 2 2434 2559 126
SYS INDEX IDX_OBJ_NAME 2 2562 2687 126
SYS INDEX IDX_OBJ_NAME 2 2690 2780 91
19 rows selected.
</pre>
<br />
This is the best possible outcome when you notice corruption errors. All the errors relate to index corruption, so we can fix the problem by rebuilding the index.<br />
<br />
<pre class="brush: sql">
SQL> alter index idx_obj_name rebuild;
alter index idx_obj_name rebuild
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 2, block # 2308)
ORA-01110: data file 2: '/oradata1/data/ora12c/demo01.dbf'
ORA-26040: Data block was loaded using the NOLOGGING option
</pre>
<br />
Simply issuing "alter index rebuild" command won't work.<br />
We should mark index unusable to drop segment before rebuilding it or just rebuild index with online option.<br />
<br />
It is better choice to mark index unusable because you don't need additional space then, but I will simply rebuild index with online option and see what will happen.<br />
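For reference, the unusable-first variant would look like this (a sketch - I am not running it here):<br />
<pre class="brush: sql">
-- dropping the segment first means the rebuild reads the table, not the corrupt index
SQL> alter index idx_obj_name unusable;
SQL> alter index idx_obj_name rebuild;
</pre>
<br />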
<pre class="brush: sql">
SQL> alter index idx_obj_name rebuild online;
Index altered.
SQL> select count(*) from objects where object_name like 'A%';
COUNT(*)
----------
2079
</pre>
<br />
No errors... but let's validate the datafile for corruption.<br />
<pre class="brush: sql">
RMAN> backup validate check logical datafile 2;
Starting backup at 23-AUG-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=40 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00002 name=/oradata1/data/ora12c/demo01.dbf
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
2 OK 460 94 6402 1779294
File Name: /oradata1/data/ora12c/demo01.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 0 1537
Index 0 587
Other 0 4182
Finished backup at 23-AUG-15
</pre>
Notice "Marked Corrupt" column. Hm... 460 like before.<br />
<br />
Don't worry, this is not new corruption. These are FREE blocks which will be reused and Oracle will automatically re-format those blocks.<br />
<pre class="brush: sql">
set lines 2000
set pages 9999
col owner for a20
col partition_name for a10
col segment_name for a20
SELECT e.owner, e.segment_type, e.segment_name, e.partition_name, c.file#
, greatest(e.block_id, c.block#) corr_start_block#
, least(e.block_id+e.blocks-1, c.block#+c.blocks-1) corr_end_block#
, least(e.block_id+e.blocks-1, c.block#+c.blocks-1)
- greatest(e.block_id, c.block#) + 1 blocks_corrupted
FROM dba_extents e, V$NONLOGGED_BLOCK c
WHERE e.file_id = c.file#
AND e.block_id <= c.block# + c.blocks - 1
AND e.block_id + e.blocks - 1 >= c.block#
UNION
SELECT s.owner, s.segment_type, s.segment_name, s.partition_name, c.file#
, header_block corr_start_block#
, header_block corr_end_block#
, 1 blocks_corrupted
FROM dba_segments s, V$NONLOGGED_BLOCK c
WHERE s.header_file = c.file#
AND s.header_block between c.block# and c.block# + c.blocks - 1
UNION
SELECT null owner, null segment_type, null segment_name, null partition_name, c.file#
, greatest(f.block_id, c.block#) corr_start_block#
, least(f.block_id+f.blocks-1, c.block#+c.blocks-1) corr_end_block#
, least(f.block_id+f.blocks-1, c.block#+c.blocks-1)
- greatest(f.block_id, c.block#) + 1 blocks_corrupted
FROM dba_free_space f, V$NONLOGGED_BLOCK c
WHERE f.file_id = c.file#
AND f.block_id <= c.block# + c.blocks - 1
AND f.block_id + f.blocks - 1 >= c.block#
order by file#, corr_start_block#
/
OWNER SEGMENT_TYPE SEGMENT_NAME PARTITION_ FILE# CORR_START_BLOCK# CORR_END_BLOCK# BLOCKS_CORRUPTED
-------------------- ------------------ -------------------- ---------- ---------- ----------------- --------------- ----------------
2 2308 2319 12
2 2321 2335 15
2 2337 2351 15
2 2353 2367 15
2 2369 2383 15
2 2385 2399 15
2 2401 2415 15
2 2417 2431 15
2 2434 2559 126
2 2562 2687 126
2 2690 2780 91
11 rows selected.
</pre>
<br />
We can force the re-formatting by creating a dummy table and inserting data into it until the free space is consumed.<br />
Check Doc ID 336133.1.<br />
<pre class="brush: sql">
create table s (
n number,
c varchar2(4000)
) nologging tablespace DEMO1;
SQL> BEGIN
FOR i IN 1..1000000 LOOP
INSERT /*+ APPEND */ INTO sys.s select i, lpad('REFORMAT',3092, 'R') from dual;
commit ;
END LOOP;
END;
/ 2 3 4 5 6 7
BEGIN
*
ERROR at line 1:
ORA-01653: unable to extend table SYS.S by 128 in tablespace DEMO1
ORA-06512: at line 3
SQL> drop table sys.s purge;
Table dropped.
</pre>
<br />
Notice that we don't have any corrupted blocks anymore.<br />
<pre class="brush: sql">
RMAN> backup validate check logical datafile 2;
Starting backup at 23-AUG-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=67 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00002 name=/oradata1/data/ora12c/demo01.dbf
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
2 OK 0 3929 14593 1818933
File Name: /oradata1/data/ora12c/demo01.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data 0 9851
Index 0 461
Other 0 351
Finished backup at 23-AUG-15
</pre>
<br />
<br />
<br />
Recovering a corrupted index is easy, but recovering data blocks can be difficult or sometimes impossible.<br />
Perform validation and backups regularly, because corruption will hit you when you least expect it ;)<br />
<br />
<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com2tag:blogger.com,1999:blog-2530682427657016426.post-25174451513592965132015-12-17T10:27:00.003+01:002020-11-10T07:50:50.733+01:00Unindexed Foreign Keys on empty/unused table and locksIt is widely known that unindexed foreign keys can be a performance issue. Unindexed foreign keys on child tables can cause table locks or general performance problems.<br />
There are many articles on this subject, so I won't go into detail.<br />
<br />
My plan is to show a simple demo where an empty child table with an unindexed foreign key column can cause big problems.<br />
<br />
<br />
Imagine that you have a highly active table (supplier) with lots of DML operations from many sessions.<br />
In the meantime someone created a new child table (product) in a relationship with the parent table (supplier). This table is empty and unused - so why would you bother indexing foreign key columns on an empty table?<br />
<br />
I will show you a case where this empty table can cause lock contention and serious performance issues.<br />
<br />
<span id="fullpost">
<pre class="brush: sql">
Oracle version - 11.2.0.4.0
CREATE TABLE supplier
( id number(10) not null,
supplier_id number(10) not null,
supplier_name varchar2(50) not null,
contact_name varchar2(50),
CONSTRAINT id_pk PRIMARY KEY (id),
CONSTRAINT supplier_uk UNIQUE(supplier_id)
);
INSERT INTO supplier VALUES (1,100, 'Supplier 1', 'Contact 1');
INSERT INTO supplier VALUES (2,200, 'Supplier 2', 'Contact 2');
COMMIT;
CREATE TABLE product
( product_id number(10) not null,
product_name varchar2(50) not null,
supplier_id number(10) not null,
CONSTRAINT fk_supplier
FOREIGN KEY (supplier_id)
REFERENCES supplier(supplier_id)
);
SQL> select id, supplier_id, supplier_name, contact_name from supplier;
ID SUPPLIER_ID SUPPLIER_NAME CONTACT_NAME
---------- ----------- -------------------------------------------------- ------------
1 100 Supplier 1 Contact 1
2 200 Supplier 2 Contact 2
-- Product table is empty and unused
SQL> select product_id, product_name, supplier_id from product;
no rows selected
</pre>
<br />
A user in SESSION 1 inserts a row and leaves the transaction open for some time.<br />
<pre class="brush: sql">
--SESSION 1:
INSERT INTO supplier VALUES (3,300, 'Supplier 3', 'Contact 3'); --(Without COMMIT)
1 row created.
</pre>
<br />
At the same time many sessions are trying to update records using the column involved in the foreign-key relationship.
All these sessions hang, and you have a big problem.
<pre class="brush: sql">
--SESSION 2:
UPDATE supplier SET supplier_id=200 WHERE supplier_id = 200; --(HANG)
</pre>
<br />
Let's try another INSERT in the next session:
<pre class="brush: sql">
--SESSION 3:
INSERT INTO supplier VALUES (4,400, 'Supplier 4', 'Contact 4'); --(HANG)
</pre>
<br />
Now we have inserts hanging, which can lead to major problems on a very active table.<br />
<br />
Check locks:<br />
<br />
<pre class="brush: sql">
SELECT l.sid, s.blocking_session blocker, s.event, l.type, l.lmode,
l.request, o.object_name, o.object_type
FROM v$lock l, dba_objects o, v$session s
WHERE UPPER(s.username) = UPPER('MSUTIC')
AND l.id1 = o.object_id (+)
AND l.sid = s.sid
ORDER BY sid, type;
SID BLOCKER EVENT TY LMODE REQUEST OBJECT_NAME OBJECT_TYPE
---------- ---------- -------------------------------------- -- ---------- ---------- -------------------------- ------------
63 1641 enq: TM - contention AE 4 0 ORA$BASE EDITION
63 1641 enq: TM - contention TM 3 0 SUPPLIER TABLE
63 1641 enq: TM - contention TM 0 4 PRODUCT TABLE
1390 SQL*Net message to client AE 4 0 ORA$BASE EDITION
1641 SQL*Net message from client AE 4 0 ORA$BASE EDITION
1641 SQL*Net message from client TM 3 0 SUPPLIER TABLE
1641 SQL*Net message from client TM 3 0 PRODUCT TABLE
1641 SQL*Net message from client TX 6 0 TPT SYNONYM
2159 SQL*Net message from client AE 4 0 ORA$BASE EDITION
2729 63 enq: TM - contention AE 4 0 ORA$BASE EDITION
2729 63 enq: TM - contention TM 0 3 PRODUCT TABLE
2729 63 enq: TM - contention TM 3 0 SUPPLIER TABLE
</pre>
<br />
<br />
The unused and empty product table is the culprit for the performance issues.<br />
<br />
<br />
Create an index on the foreign key column and check the behaviour.<br />
<br />
<pre class="brush: sql">
CREATE INDEX fk_supplier ON product (supplier_id);
</pre>
<br />
<pre class="brush: sql">
--SESSION 1:
INSERT INTO supplier VALUES (3,300, 'Supplier 3', 'Contact 3');
1 row created.
--SESSION 2:
UPDATE supplier SET supplier_id=200 WHERE supplier_id = 200;
1 row updated.
</pre>
<br />
Now everything worked without locking problems.<br />
<br />
<br />
<br />
Notice that the behaviour is different in the 12c version.<br />
<br />
<pre class="brush: sql">
Oracle version - 12.1.0.2.0
CREATE TABLE supplier
( supplier_id number(10) not null,
supplier_name varchar2(50) not null,
contact_name varchar2(50),
CONSTRAINT supplier_pk PRIMARY KEY (supplier_id)
);
INSERT INTO supplier VALUES (1, 'Supplier 1', 'Contact 1');
INSERT INTO supplier VALUES (2, 'Supplier 2', 'Contact 2');
COMMIT;
CREATE TABLE product
( product_id number(10) not null,
product_name varchar2(50) not null,
supplier_id number(10) not null,
CONSTRAINT fk_supplier
FOREIGN KEY (supplier_id)
REFERENCES supplier(supplier_id)
);
--SESSION 1:
INSERT INTO supplier VALUES (3, 'Supplier 3', 'Contact 3'); --(Without COMMIT)
1 row created.
--SESSION 2:
UPDATE supplier SET supplier_id=2 WHERE supplier_id = 2; -- (No HANG)
1 row updated.
</pre>
<br />
Check locks:<br />
<br />
<pre class="brush: sql">
SELECT l.sid, s.blocking_session blocker, s.event, l.type, l.lmode,
l.request, o.object_name, o.object_type
FROM v$lock l, dba_objects o, v$session s
WHERE UPPER(s.username) = UPPER('MSUTIC')
AND l.id1 = o.object_id (+)
AND l.sid = s.sid
ORDER BY sid, type;
SID BLOCKER EVENT TY LMODE REQUEST OBJECT_NAME
------ ---------- ------------------------------ -- ---------- ---------- ------------
4500 SQL*Net message from client AE 4 0 ORA$BASE
4500 SQL*Net message from client TM 3 0 SUPPLIER
4500 SQL*Net message from client TX 6 0
6139 SQL*Net message to client AE 4 0 ORA$BASE
6144 SQL*Net message from client AE 4 0 ORA$BASE
6144 SQL*Net message from client TM 3 0 SUPPLIER
6144 SQL*Net message from client TM 2 0 PRODUCT
6144 SQL*Net message from client TX 6 0
</pre>
<br />
<br />
<br />
I don't think you should always index every foreign key. Sometimes it is simply not needed and becomes overhead:
unnecessary indexes on foreign keys waste storage space and slow down DML operations on the table.<br />
<br />
Think about the application and how the parent/child tables will be used before creating indexes, and check Tom Kyte's articles on this subject. A simple query to spot candidate unindexed foreign keys is sketched below.<br />
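<br />
A simplified check (a sketch - it only verifies that each FK column is covered by an index column at the same position, so multi-column edge cases need extra care):<br />
<pre class="brush: sql">
-- list referential constraints with no matching index on the FK columns
SELECT c.table_name, c.constraint_name
  FROM user_constraints c
 WHERE c.constraint_type = 'R'
   AND NOT EXISTS (
         SELECT 1
           FROM user_cons_columns cc
           JOIN user_ind_columns ic
             ON ic.table_name      = cc.table_name
            AND ic.column_name     = cc.column_name
            AND ic.column_position = cc.position
          WHERE cc.constraint_name = c.constraint_name );
</pre>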
<br />
<br />
<br />
<br />
<br />
<b>Update 2016-07-08:</b><br />
<br />
<br />
Oracle version - 11.2.0.4.0<br />
<br />
What if we index the column in descending order?<br />
<br />
<pre class="brush: sql">
CREATE INDEX fk_supplier ON product (SUPPLIER_ID DESC);
Index created.
</pre>
<br />
<pre class="brush: sql">
--SESSION 1:
INSERT INTO supplier VALUES (3,300, 'Supplier 3', 'Contact 3'); --(Without COMMIT)
--SESSION 2:
UPDATE supplier SET supplier_id=200 WHERE supplier_id = 200; --(HANG)
--Try another INSERT in next session:
--SESSION 3:
INSERT INTO supplier VALUES (4,400, 'Supplier 4', 'Contact 4'); --(HANG)
</pre>
<br />
Check locks:<br />
<br />
<pre class="brush: sql">
SELECT l.sid, s.blocking_session blocker, s.event, l.type, l.lmode,
l.request, o.object_name, o.object_type
FROM v$lock l, dba_objects o, v$session s
WHERE UPPER(s.username) = UPPER('MSUTIC')
AND l.id1 = o.object_id (+)
AND l.sid = s.sid
ORDER BY sid, type;
SID BLOCKER EVENT TY LMODE REQUEST OBJECT_NAME OBJECT_TYPE
------ ---------- ------------------------------ -- ---------- ---------- ------------- -----------
192 1137 enq: TM - contention AE 4 0 ORA$BASE EDITION
192 1137 enq: TM - contention TM 3 0 SUPPLIER TABLE
192 1137 enq: TM - contention TM 0 3 PRODUCT TABLE
382 SQL*Net message from client AE 4 0 ORA$BASE EDITION
949 SQL*Net message from client AE 4 0 ORA$BASE EDITION
949 SQL*Net message from client TM 3 0 SUPPLIER TABLE
949 SQL*Net message from client TM 3 0 PRODUCT TABLE
949 SQL*Net message from client TX 6 0
1137 949 enq: TM - contention AE 4 0 ORA$BASE EDITION
1137 949 enq: TM - contention TM 3 0 SUPPLIER TABLE
1137 949 enq: TM - contention TM 0 4 PRODUCT TABLE
1516 SQL*Net message to client AE 4 0 ORA$BASE EDITION
2459 SQL*Net message from client AE 4 0 ORA$BASE EDITION
</pre>
<br />
<br />
Keep in mind - creating the index on the column in descending order will not solve the concurrency problem. A DESC index is implemented as a function-based index, and Oracle does not consider function-based indexes when deciding whether a foreign key is indexed for locking purposes.<br />
<br />
</span>
Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com1tag:blogger.com,1999:blog-2530682427657016426.post-85898601308023151812015-10-07T22:52:00.000+02:002015-10-07T23:04:48.002+02:00Confusion and problems with lost+found directory in MySQL/Galera cluster configurationThe <i>lost+found</i> directory is a filesystem directory created at the root level of a mounted drive on ext file systems. It is used by file system check tools (fsck) for file recovery.<br />
<br />
In the MySQL world it can cause confusion or synchronisation problems in a Galera cluster configuration.<br />
<br />
<span id="fullpost">
Let’s check some examples.<br />
<br />
I have a MySQL database with <i>datadir=/data</i> in the configuration file. I deleted the <i>lost+found</i> directory and restarted the MySQL service.<br />
<br />
When I list my databases, this is the result:<br />
<pre class="brush: text">
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| employees |
| mysql |
| performance_schema |
| pitrdb |
| sbtest |
| sys |
| test |
+--------------------+
8 rows in set (0.34 sec)
</pre>
<br />
I will stop MySQL service and recreate <i>lost+found</i> directory.<br />
<pre class="brush: text">
$ sudo service mysql stop
$ cd /data
$ sudo mklost+found
mklost+found 1.42.9 (4-Feb-2014)
</pre>
<br />
Restart service and show databases.<br />
<pre class="brush: text">
$ sudo service mysql start
mysql> show databases;
+---------------------+
| Database |
+---------------------+
| information_schema |
| employees |
| #mysql50#lost+found |
| mysql |
| performance_schema |
| pitrdb |
| sbtest |
| sys |
| test |
+---------------------+
9 rows in set (0.01 sec)
</pre>
<br />
Notice database : <b>#mysql50#lost+found</b><br />
<br />
If you have dedicated an entire filesystem as the MySQL datadir, MySQL will interpret all files under that directory as db-related files.<br />
SHOW DATABASES lists a database <i>lost+found</i>, which is not a real database. <br />
<br />
If you check error log you can notice this message:<br />
<pre class="brush: text">
[ERROR] Invalid (old?) table or database name 'lost+found'
</pre>
<br />
For a single server configuration, issues with the <i>lost+found</i> directory only cause confusion - I’m not aware of any negative effects on the database.<br />
To avoid the confusion you should move the databases into a sub-directory below the root level of the filesystem. Also remove all directories that are not MySQL db-related from the datadir location.<br />
<br />
<br />
Stop the MySQL service on the database server.<br />
<pre class="brush: text">
$ sudo service mysql stop
</pre>
<br />
Make the sub-directory and move the existing data into it.<br />
<pre class="brush: text">
$ sudo su -
root@galera1:~# cd /data
root@galera1:/data# shopt -s extglob    # the !(mydata) pattern needs bash extglob
root@galera1:/data# mkdir mydata && mv !(mydata) mydata
root@galera1:/data# chown -R mysql:mysql /data
</pre>
<br />
Update the configuration file with the new datadir location.<br />
<pre class="brush: text">
# vi /etc/mysql/my.cnf
...
datadir=/data/mydata
...
</pre>
<br />
Remove the non-database directories from the new datadir; fsck's lost+found now lives at the filesystem root, outside the datadir.<br />
<pre class="brush: text">
# rm -rf mydata/lost+found
# mklost+found
mklost+found 1.42.9 (4-Feb-2014)
# pwd
/data
# ls -l
total 56
drwx------ 2 root root 49152 Oct 4 16:48 lost+found
drwxr-xr-x 9 mysql mysql 4096 Oct 4 16:48 mydata
</pre>
<br />
Restart the service.<br />
<pre class="brush: text">
$ sudo service mysql start
</pre>
<br />
<br />
From version 5.6 you can tell the server to ignore non-database directories using the <b>ignore-db-dir</b> option (restart the service after changing the configuration).<br />
<pre class="brush: text">
$ sudo vi /etc/mysql/my.cnf
...
ignore-db-dir=lost+found
...
</pre>
<br />
<br />
<br />
Let’s test how the <i>lost+found</i> directory affects a Galera cluster configuration.<br />
For this test I’m using Percona XtraDB Cluster 5.6 with 3 nodes.<br />
<br />
<pre class="brush: text">
# dpkg -l | grep percona-xtradb-cluster-server
ii percona-xtradb-cluster-server-5.6 5.6.25-25.12-1.trusty amd64 Percona XtraDB Cluster database server binaries
mysql> select version();
+--------------------+
| version() |
+--------------------+
| 5.6.25-73.1-56-log |
+--------------------+
1 row in set (0.00 sec)
mysql> show global status like 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name | Value |
+--------------------+-------+
| wsrep_cluster_size | 3 |
+--------------------+-------+
1 row in set (0.01 sec)
</pre>
<br />
In this configuration the datadir is set to /data, which contains a <i>lost+found</i> directory. <br />
As this is version 5.6, I included the <i>ignore-db-dir</i> option in the configuration file.<br />
<br />
In the SHOW DATABASES list and the error log I don't see any issues.<br />
<pre class="brush: text">
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| employees |
| mysql |
| performance_schema |
| pitrdb |
| sbtest |
| sys |
| test |
+--------------------+
8 rows in set (0.00 sec)
</pre>
<br />
For the SST method I’m using the default and recommended xtrabackup-v2.<br />
So, what happens if I initiate SST for one of the nodes in the cluster?<br />
<br />
<pre class="brush: text">
$ sudo service mysql stop
* Stopping MySQL (Percona XtraDB Cluster) mysqld [OK]
$ sudo rm /data/grastate.dat
$ sudo service mysql start
[sudo] password for marko:
* Starting MySQL (Percona XtraDB Cluster) database server mysqld
* State transfer in progress, setting sleep higher mysqld
* The server quit without updating PID file (/data/galera2.pid).
</pre>
<br />
It appears that SST failed with errors:<br />
<br />
<pre class="brush: text">
WSREP_SST: [ERROR] Cleanup after exit with status:1 (20151004 12:01:00.936)
2015-10-04 12:01:02 16136 [Note] WSREP: (cf98f684, 'tcp://0.0.0.0:4567') turning message relay requesting off
2015-10-04 12:01:12 16136 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.56.102' --datadir '/data/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '16136' --binlog 'percona-bin' : 1 (Operation not permitted)
2015-10-04 12:01:12 16136 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2015-10-04 12:01:12 16136 [ERROR] WSREP: SST script aborted with error 1 (Operation not permitted)
2015-10-04 12:01:12 16136 [ERROR] WSREP: SST failed: 1 (Operation not permitted)
2015-10-04 12:01:12 16136 [ERROR] Aborting
2015-10-04 12:01:12 16136 [Warning] WSREP: 0.0 (galera3): State transfer to 1.0 (galera2) failed: -22 (Invalid argument)
2015-10-04 12:01:12 16136 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():731: Will never receive state. Need to abort.
</pre>
<br />
<br />
The cause of the SST failure is the <i>lost+found</i> directory, but the error log does not mention <i>lost+found</i> at all. <br />
<br />
SST fails because xtrabackup ignores the <i>ignore-db-dir</i> option and tries to synchronise the <i>lost+found</i> directory, which is owned by the root user. <br />
<br />
<br />
What happens if I (as a test) change the ownership of the lost+found directory on the donor nodes?<br />
<br />
<pre class="brush: text">
drwx------ 2 root root 49152 Oct 4 11:50 lost+found
marko@galera3:/data# sudo chown -R mysql:mysql /data/lost+found
marko@galera1:/data$ sudo chown -R mysql:mysql /data/lost+found
marko@galera2:/data$ sudo service mysql start
* Stale sst_in_progress file in datadir mysqld
* Starting MySQL (Percona XtraDB Cluster) database server mysqld
* State transfer in progress, setting sleep higher mysqld [OK]
NODE2
...
drwxrwx--x 2 mysql mysql 4096 Oct 4 12:07 lost+found
...
</pre>
<br />
SST succeeded and the node successfully joined/synced to the cluster.<br />
<br />
<br />
To avoid these inconveniences, just move the databases off the filesystem root directory.<br />
Some of you will simply delete the lost+found directory but be aware - fsck may recreate it, and your cluster synchronisation will fail when you least expect it ;)<br />
<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com0tag:blogger.com,1999:blog-2530682427657016426.post-26635349090334625392015-05-10T21:38:00.001+02:002020-11-10T07:51:06.157+01:00How to Pass Arguments to OS Shell Script from Oracle DatabaseImagine you have several Oracle databases on the same host running under the same OS user. <br />
<br />
In a scripts directory you have a shell script that kills OS processes.<br />
The idea is to call the OS script from a database procedure and kill the problematic process using the shell script.<br />
<br />
The script runs a simple query to get the process id and kills that process.<br />
<br />
But how do you ensure that this script executes in the correct environment for the correct database?<br />
<br />
One way is to create one script per database and set the environment inside each script; another is to create just one script which dynamically sets the correct environment for the instance that calls it.<br />
<br />
<span id="fullpost">
For the demo I created a simple script that spools query output to a file.<br />
<br />
<pre class="brush: bash">
#!/bin/bash
# Avoid oraenv asking
ORAENV_ASK="NO"; export ORAENV_ASK
ORACLE_SID=$1; export ORACLE_SID
. oraenv ${ORACLE_SID}
$ORACLE_HOME/bin/sqlplus -s "/ as sysdba" <<EOF > /tmp/my_environment.txt
set heading off feedback off verify off
col instance_name for a10
col host_name for a10
col status for a10
select instance_name, host_name, status
from v\$instance;
exit
EOF
$ chmod u+x simple_script.sh
</pre>
<br />
<br />
What happens when we execute the script?<br />
<br />
<pre class="brush: text">
$ ./simple_script.sh testdb
The Oracle base for ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/dbhome_1 is /u01/app/oracle
$
$ cat /tmp/my_environment.txt
testdb asterix OPEN
</pre>
<br />
<pre class="brush: text">
$ ./simple_script.sh ora11gr2
The Oracle base for ORACLE_HOME=/u01/app/oracle/product/11.2.0.4/dbhome_1 is /u01/app/oracle
$
$ cat /tmp/my_environment.txt
ora11gr2 asterix OPEN
</pre>
<br />
Notice how I specified the ORACLE_SID as a command line argument. The script sets the environment from the ORATAB file according to the specified SID and spools the output to the my_environment.txt file.<br />
<br />
Now I will demonstrate how to pass the argument from the database layer.<br />
<br />
<br />
To execute an external job I have to create a credential in both databases.<br />
<br />
<pre class="brush: sql">
-- Session 1
system@ORA11GR2> begin
2 dbms_scheduler.create_credential(
3 credential_name => 'ORACLE_CRED',
4 username => 'oracle',
5 password => 'password');
6 end;
7 /
PL/SQL procedure successfully completed.
-- Session 2
system@TESTDB> begin
2 dbms_scheduler.create_credential(
3 credential_name => 'ORACLE_CRED',
4 username => 'oracle',
5 password => 'password');
6 end;
7 /
PL/SQL procedure successfully completed.
</pre>
<br />
<br />
Use SYS_CONTEXT function to get instance name and execute script for specified instance.<br />
<br />
<pre class="brush: sql">
-- Session 1
system@ORA11GR2> DECLARE
2 l_oracle_sid varchar2(20);
3 BEGIN
4 select sys_context('userenv','instance_name') into l_oracle_sid
5 from dual;
6 DBMS_SCHEDULER.CREATE_JOB (
7 job_name => 'J_SIMPLE_SCRIPT',
8 job_type => 'EXECUTABLE',
9 job_action => '/home/oracle/skripte/simple_script.sh',
10 number_of_arguments => 1,
11 start_date => NULL,
12 repeat_interval => NULL,
13 end_date => NULL,
14 enabled => FALSE,
15 auto_drop => TRUE,
16 comments => 'Set environment and execute query on v$instance view');
17 dbms_scheduler.set_attribute('J_SIMPLE_SCRIPT','credential_name','ORACLE_CRED');
18 DBMS_SCHEDULER.set_job_argument_value('J_SIMPLE_SCRIPT',1,l_oracle_sid);
19 DBMS_SCHEDULER.enable('J_SIMPLE_SCRIPT');
20 DBMS_SCHEDULER.run_job (job_name=> 'J_SIMPLE_SCRIPT', use_current_session => FALSE);
21 END;
22 /
PL/SQL procedure successfully completed.
system@ORA11GR2> host cat /tmp/my_environment.txt
ora11gr2 asterix OPEN
</pre>
<br />
<br />
I’ve called script from "ora11gr2" database and OS script was executed for specified database. DBMS_SCHEDULER job was used for passing argument to external OS script and for script execution.<br />
<br />
And the same from the other database.<br />
<br />
<pre class="brush: sql">
-- Session 2
system@TESTDB> DECLARE
2 l_oracle_sid varchar2(20);
3 BEGIN
4 select sys_context('userenv','instance_name') into l_oracle_sid
5 from dual;
6 DBMS_SCHEDULER.CREATE_JOB (
7 job_name => 'J_SIMPLE_SCRIPT',
8 job_type => 'EXECUTABLE',
9 job_action => '/home/oracle/skripte/simple_script.sh',
10 number_of_arguments => 1,
11 start_date => NULL,
12 repeat_interval => NULL,
13 end_date => NULL,
14 enabled => FALSE,
15 auto_drop => TRUE,
16 comments => 'Set environment and execute query on v$instance view');
17 dbms_scheduler.set_attribute('J_SIMPLE_SCRIPT','credential_name','ORACLE_CRED');
18 DBMS_SCHEDULER.set_job_argument_value('J_SIMPLE_SCRIPT',1,l_oracle_sid);
19 DBMS_SCHEDULER.enable('J_SIMPLE_SCRIPT');
20 DBMS_SCHEDULER.run_job (job_name=> 'J_SIMPLE_SCRIPT', use_current_session => FALSE);
21 END;
22 /
PL/SQL procedure successfully completed.
SQL> host cat /tmp/my_environment.txt
testdb asterix OPEN
</pre>
<br />
Notice how "/tmp/my_environment.txt" file changed according to specified database.<br />
<br />
<br />
Using this method you can easily reuse OS scripts across multiple databases.<br />
<br />
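To close the loop on the original motivation (killing a problematic OS process), the same pattern extends naturally to more arguments. A hypothetical sketch - the script name kill_process.sh, the job name J_KILL_SCRIPT and the PID value are all made up:<br />
<pre class="brush: sql">
-- hypothetical: pass ORACLE_SID plus a process id to one generic script
DECLARE
  l_oracle_sid varchar2(20);
  l_os_pid     varchar2(20) := '12345';  -- made-up PID, normally fetched by a query
BEGIN
  select sys_context('userenv','instance_name') into l_oracle_sid from dual;
  DBMS_SCHEDULER.CREATE_JOB (
    job_name            => 'J_KILL_SCRIPT',
    job_type            => 'EXECUTABLE',
    job_action          => '/home/oracle/skripte/kill_process.sh',
    number_of_arguments => 2,
    enabled             => FALSE,
    auto_drop           => TRUE);
  dbms_scheduler.set_attribute('J_KILL_SCRIPT','credential_name','ORACLE_CRED');
  DBMS_SCHEDULER.set_job_argument_value('J_KILL_SCRIPT',1,l_oracle_sid);
  DBMS_SCHEDULER.set_job_argument_value('J_KILL_SCRIPT',2,l_os_pid);
  DBMS_SCHEDULER.enable('J_KILL_SCRIPT');
  DBMS_SCHEDULER.run_job (job_name => 'J_KILL_SCRIPT', use_current_session => FALSE);
END;
/
</pre>
Inside the script, $1 would then be the ORACLE_SID and $2 the process id to kill.<br />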
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com2tag:blogger.com,1999:blog-2530682427657016426.post-27411269632832427192015-05-09T09:03:00.003+02:002020-11-10T07:51:19.078+01:00ASM not starting with ORA-00845 - how to fix ASM parameter fileA few days ago I saw a great post from <a href="http://qdosmsq.dunbar-it.co.uk/blog/2015/04/how-to-fix-a-broken-asm-spfile-held-within-asm">Norman Dunbar</a> on how to fix a broken ASM spfile.<br />
<br />
Since version 11gR2 the ASM spfile can be stored in an ASM diskgroup, and by default the Oracle Installer will put it there. So if you want to create a pfile from the spfile, your ASM instance should be up and running.<br />
<br />
If you have an incorrect parameter in the ASM spfile which is preventing ASM from starting, then you have a slight problem. You cannot simply create a pfile from the spfile, correct the parameter in the pfile and recreate the spfile, as you would for a database.<br />
<br />
But don't worry, there are several well-explained options available on the net. I would recommend practicing all the scenarios in your test environment if you want to avoid big stress in production later.<br />
<br />
<br />
Whenever I had problems with a broken ASM parameter file (mostly in test/dev environments), I would end up searching my notes or blog posts on how to solve the problem.<br />
<br />
I knew that the parameters are also written in the ASM disk headers and I could extract them from there, or check the parameters in the ASM alert log, but in the back of my mind I was always thinking that there must be a simpler way.<br />
<br />
<span id="fullpost">
Thanks to Norman, now I know how to quickly change an incorrect parameter and keep the other parameters intact.<br />
<br />
<br />
I used this trick a few days ago and it worked perfectly. This blog post is just a reminder which I know will be useful to me in the future.<br />
<br />
<br />
<br />
In my environment I have Oracle Restart with Oracle Database <b>12.1.0.2.0</b>.<br />
<br />
After starting my test server I noticed that something was wrong because ASM was unable to start.<br />
<pre class="brush: text">
$ ./srvctl status asm
ASM is not running.
</pre>
<br />
When I tried to start ASM manually I received an error:<br />
<pre class="brush: text">
$ ./srvctl start asm
PRCR-1079 : Failed to start resource ora.asm
CRS-5017: The resource action "ora.asm start" encountered the following error:
ORA-00845: MEMORY_TARGET not supported on this system
. For details refer to "(:CLSN00107:)" in "/u01/app/grid/diag/crs/obelix/crs/trace/ohasd_oraagent_grid.trc".
CRS-2674: Start of 'ora.asm' on 'obelix' failed
</pre>
<br />
<br />
Let's check the alert log.<br />
<pre class="brush: text">
alert+ASM.log
Fri May 01 19:40:16 2015
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Starting ORACLE instance (normal) (OS id: 4136)
Fri May 01 19:40:16 2015
CLI notifier numLatches:3 maxDescs:222
Fri May 01 19:40:16 2015
WARNING: You are trying to use the MEMORY_TARGET feature. This feature requires the /dev/shm file system to be mounted for at least 1140850688 bytes. /dev/shm is either not mounted or is mounted with available space less than this size. Please fix this so that MEMORY_TARGET can work as expected. Current available is 1051975680 and used is 208896 bytes. Ensure that the mount point is /dev/shm for this directory.
</pre>
<br />
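The message means /dev/shm is smaller than the default MEMORY_TARGET. A quick way to confirm, plus an alternative fix of enlarging the tmpfs (the 2G size here is just an example):<br />
<pre class="brush: text">
$ df -h /dev/shm
$ sudo mount -o remount,size=2G /dev/shm
</pre>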
<br />
So there is a problem with the MEMORY_TARGET parameter - but how can I disable AMM when my ASM instance is down?<br />
<br />
First I had to find the location of the ASM parameter file. I don’t have a GPnP profile as this is a single instance setup, so I extracted the ASM parameter file location from the "ora.asm" resource information.<br />
<pre class="brush: text">
$ crsctl stat res ora.asm -p | egrep "ASM_DISKSTRING|SPFILE"
ASM_DISKSTRING=
SPFILE=+DATA/ASM/ASMPARAMETERFILE/registry.253.822856169
</pre>
<br />
<br />
Create a new pfile that points to the spfile and overrides the offending MEMORY_TARGET parameter.<br />
<pre class="brush: text">
$ vi /tmp/initASM.ora
spfile="+DATA/asm/asmparameterfile/registry.253.862145335"
MEMORY_TARGET=0
</pre>
<br />
<br />
Start the ASM instance using the new parameter file.<br />
<br />
<pre class="brush: sql">
$ sqlplus / as sysasm
SQL*Plus: Release 12.1.0.2.0 Production on Fri May 1 20:04:39 2015
Copyright (c) 1982, 2014, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup pfile=/tmp/initASM.ora
ASM instance started
Total System Global Area 197132288 bytes
Fixed Size 2922520 bytes
Variable Size 169043944 bytes
ASM Cache 25165824 bytes
ASM diskgroups mounted
</pre>
<br />
And voilà!<br />
The override was applied and I was able to start the ASM instance.<br />
<br />
<br />
Now permanently fix the parameter in the ASM spfile.<br />
<pre class="brush: sql">
SQL> alter system set memory_target=0 scope=spfile;
System altered.
</pre>
<br />
Restart ASM.<br />
<pre class="brush: sql">
SQL> shutdown immediate;
ASM diskgroups dismounted
ASM instance shutdown
</pre>
<br />
<pre class="brush: sql">
[grid@obelix bin]$ ./srvctl start asm
[grid@obelix bin]$ ./srvctl status asm
ASM is running on obelix
</pre>
<br />
<br />
The ASM instance successfully started with the corrected parameter file.<br />
<br />
<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com2tag:blogger.com,1999:blog-2530682427657016426.post-48394819993212094962015-02-28T11:22:00.001+01:002020-11-10T07:51:29.316+01:00Restore to Restore Point on Standard Edition (no Flashback technology)Restore points and Flashback Database are nice features introduced in the 10g database that provide efficient point-in-time recovery to reverse unwanted data changes. <br />
<br />
But what if you have a Standard Edition database:<br />
<br />
<pre class="brush: sql">
SQL> shutdown immediate;
SQL> startup mount;
SQL> alter database flashback on;
alter database flashback on
*
ERROR at line 1:
ORA-00439: feature not enabled: Flashback Database
</pre>
<br />
In Standard Edition you don’t have the Flashback Database feature, but you can still create restore points and perform incomplete recovery <b>to a restore point</b>.<br />
<span id="fullpost">
<br />
<br />
Create a test table and insert a status row.<br />
<br />
<pre class="brush: text">
SQL> create table admin.test_restore (datum date, komentar varchar2(100));
Table created.
SQL> insert into admin.test_restore values (sysdate, 'Before Restore Point');
1 row created.
SQL> commit;
Commit complete.
</pre>
<br />
<br />
Create a restore point here.<br />
<br />
<pre class="brush: text">
SQL> create restore point RP_UPGRADE;
Restore point created.
SQL> select scn, to_char(time,'dd.mm.yyyy hh24:mi:ss') time, name
2 from v$restore_point;
SCN TIME NAME
---------- ------------------- ---------------------
580752 27.02.2015 10:31:19 RP_UPGRADE
</pre>
<br />
Notice how the restore point name is associated with an SCN of the database. <br />
<br />
<br />
Now you can perform potentially dangerous operations like database upgrades, table modifications, truncating data and the like.<br />
<br />
I will enter some status data for later checks.<br />
<br />
<pre class="brush: text">
SQL> insert into admin.test_restore values (sysdate, 'After Restore Point');
1 row created.
SQL> insert into admin.test_restore values (sysdate, 'Upgrade actions performed');
1 row created.
SQL> commit;
Commit complete.
</pre>
<br />
<br />
Check table.<br />
<br />
<pre class="brush: text">
SQL> alter session set nls_date_format='dd.mm.yyyy hh24:mi:ss';
Session altered.
SQL> select datum, komentar from admin.test_restore order by datum;
DATUM KOMENTAR
------------------- ------------------------------
27.02.2015 10:30:39 Before Restore Point
27.02.2015 10:31:45 After Restore Point
27.02.2015 10:31:55 Upgrade actions performed
</pre>
<br />
<br />
Suppose we hit some problems and want to "rewind" the database to the restore point. In EE we would perform FLASHBACK DATABASE to the restore point, but in SE we will use a different approach.<br />
<br />
<br />
Shut down the database and start it up in mount mode.<br />
<br />
<pre class="brush: text">
RMAN> shutdown immediate;
using target database control file instead of recovery catalog
database closed
database dismounted
Oracle instance shut down
RMAN> startup mount;
connected to target database (not started)
Oracle instance started
database mounted
Total System Global Area 471830528 bytes
Fixed Size 2254344 bytes
Variable Size 247466488 bytes
Database Buffers 213909504 bytes
Redo Buffers 8200192 bytes
</pre>
<br />
<br />
Restore and recover the database until the restore point RP_UPGRADE.<br />
<br />
<pre class="brush: text">
RMAN> restore database until restore point RP_UPGRADE;
Starting restore at 27.02.2015 10:36:26
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=247 device type=DISK
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00001 to +DATA1/ora11gr2/datafile/system.291.872722799
channel ORA_DISK_1: restoring datafile 00002 to +DATA1/ora11gr2/datafile/sysaux.292.872722847
channel ORA_DISK_1: restoring datafile 00003 to +DATA1/ora11gr2/datafile/undotbs1.278.872722879
channel ORA_DISK_1: restoring datafile 00004 to +DATA1/ora11gr2/datafile/users.296.872722925
channel ORA_DISK_1: reading from backup piece +FRA1/ora11gr2/backupset/2015_02_27/nnndf0_tag20150227t102559_0.1164.872763961
channel ORA_DISK_1: piece handle=+FRA1/ora11gr2/backupset/2015_02_27/nnndf0_tag20150227t102559_0.1164.872763961 tag=TAG20150227T102559
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:01:35
Finished restore at 27.02.2015 10:38:02
RMAN> recover database until restore point RP_UPGRADE;
Starting recover at 27.02.2015 10:38:45
using channel ORA_DISK_1
starting media recovery
media recovery complete, elapsed time: 00:00:01
Finished recover at 27.02.2015 10:38:49
</pre>
<br />
<br />
Open the database with the RESETLOGS option.<br />
<br />
<pre class="brush: sql">
RMAN> sql 'alter database open resetlogs';
sql statement: alter database open resetlogs
</pre>
<br />
<br />
Final check.<br />
<br />
<pre class="brush: text">
SQL> alter session set nls_date_format='dd.mm.yyyy hh24:mi:ss';
Session altered.
SQL> select datum, komentar
2 from admin.test_restore
3 order by datum;
DATUM KOMENTAR
------------------- --------------------------------------------------
27.02.2015 10:30:39 Before Restore Point
</pre>
<br />
<br />
We "rewound" database to state that existed before RP_UPGRADE restore point is created.<br />
This was incomplete recovery and RP_UPGRADE restore point was used just to mark location in time.<br />
<br />
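One cleanup note: the restore point stays in the control file until it ages out or is dropped, so once it is no longer needed you can remove it.<br />
<pre class="brush: sql">
SQL> drop restore point RP_UPGRADE;
Restore point dropped.
</pre>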
<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com4tag:blogger.com,1999:blog-2530682427657016426.post-67326475090391910322015-02-05T10:33:00.001+01:002020-11-10T07:51:39.039+01:00MariaDB - Measure Replication Lag and Check / Fix Replication Inconsistencies using Percona toolsPercona Toolkit is a collection of command-line tools for performing many MySQL tasks like creating backups, finding duplicate indexes, managing replication, etc.<br />
<br />
In this post I will talk about how to measure replication lag and check/fix replication inconsistencies with these tools: <br />
<i>pt-heartbeat</i><br />
<i>pt-table-checksum</i><br />
<i>pt-table-sync</i><br />
<br />
<br />
I am using the environment from the previous blog post.<br />
Master-master replication with a MariaDB 10.0.16 database on Debian 7.<br />
<br />
<br />
Install <b>Percona Toolkit</b> on both nodes:<br />
<br />
<pre class="brush: text">
$ sudo wget percona.com/get/percona-toolkit.deb
$ sudo apt-get install libterm-readkey-perl
$ sudo dpkg -i percona-toolkit.deb
</pre>
<br />
<br />
I will create a <i>percona</i> database where I will store the tables needed for various checks. I will also create a <i>percona</i> user which will be used with the Percona tools.<br />
<br />
<br />
<span id="fullpost">
<br />
<b>MASTER1</b><br />
<br />
<pre class="brush: text">
MariaDB [(none)]> create database percona;
MariaDB [(none)]> grant all privileges on *.* to 'percona'@'master1.localdomain' identified by 'percona';
MariaDB [(none)]> grant all privileges on *.* to 'percona'@'localhost' identified by 'percona';
MariaDB [(none)]> flush privileges;
</pre>
<br />
<br />
<b>MASTER2</b><br />
<br />
<pre class="brush: text">
MariaDB [(none)]> grant all privileges on *.* to 'percona'@'master2.localdomain' identified by 'percona';
MariaDB [(none)]> grant all privileges on *.* to 'percona'@'localhost' identified by 'percona';
MariaDB [(none)]> flush privileges;
</pre>
<br />
<br />
<br />
<br />
<font color="blue"><b>MONITOR REPLICATION LAG</b></font><br />
<br />
<br />
So, I have replication running and I want to be sure that everything is working fine.<br />
The typical method to monitor replication lag would be to run <i>SHOW SLAVE STATUS</i> and look at <i>Seconds_Behind_Master</i>. But <i>Seconds_Behind_Master</i> is not always accurate.<br />
<br />
Percona Toolkit has a tool to monitor replication delay called <i>pt-heartbeat</i>.<br />
<br />
We must create the heartbeat table on the master, either manually or using the <i>--create-table</i> option, and the heartbeat table must contain one heartbeat row. <i>pt-heartbeat</i> will update this table at the interval we specify. The slave will actively check the table and calculate the time delay.<br />
<br />
<br />
Create the heartbeat table and start a daemonized process to update the <i>percona.heartbeat</i> table.<br />
<br />
<br />
<b>MASTER1</b><br />
<br />
<pre class="brush: text">
$ pt-heartbeat -upercona -ppercona -D percona --update master1 --daemonize --create-table
</pre>
<br />
<br />
<b>MASTER2</b><br />
Start <i>pt-heartbeat</i>.<br />
<br />
<pre class="brush: text">
$ pt-heartbeat -upercona -ppercona --update --database percona
</pre>
<br />
<br />
<b>MASTER1</b>
<br />
Monitor replication slave lag.<br />
<br />
<pre class="brush: text">
$ pt-heartbeat -upercona -ppercona -D percona --monitor -h master2
0.00s [ 0.00s, 0.00s, 0.00s ]
0.00s [ 0.00s, 0.00s, 0.00s ]
0.00s [ 0.00s, 0.00s, 0.00s ]
0.00s [ 0.00s, 0.00s, 0.00s ]
0.00s [ 0.00s, 0.00s, 0.00s ]
0.00s [ 0.00s, 0.00s, 0.00s ]
0.00s [ 0.00s, 0.00s, 0.00s ]
0.00s [ 0.00s, 0.00s, 0.00s ]
0.00s [ 0.00s, 0.00s, 0.00s ]
0.00s [ 0.00s, 0.00s, 0.00s ]
0.00s [ 0.00s, 0.00s, 0.00s ]
0.00s [ 0.00s, 0.00s, 0.00s ]
0.00s [ 0.00s, 0.00s, 0.00s ]
0.01s [ 0.00s, 0.00s, 0.00s ]
0.00s [ 0.00s, 0.00s, 0.00s ]
0.00s [ 0.00s, 0.00s, 0.00s ]
</pre>
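Under the hood, the reported delay is simply the difference between the current time and the last heartbeat timestamp replicated from the master. A rough manual check on the slave could look like this (a sketch only - pt-heartbeat parses the high-precision <i>ts</i> column and handles time zones more carefully than this):<br />
<br />
<pre class="brush: text">
MariaDB [(none)]> select timestampdiff(second, ts, now()) as lag_seconds
    -> from percona.heartbeat;
</pre>
<br />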
<br />
<br />
<br />
<br />
<br />
<font color="blue"><b>CHECK REPLICATION INCONSISTENCIES</b></font><br />
<br />
<br />
If we want to check replication integrity we can use the <i>pt-table-checksum</i> tool.<br />
<br />
Run the tool on the master server. It will automatically detect slave servers and connect to them to do some safety checks. After that it runs checksums on the tables of the master database and records the results in the checksum table. These results are then compared with the results on the slave to determine whether the data differs.<br />
You can inspect that table anytime - in this example the <i>percona.checksums</i> table - as shown after the first run below.<br />
<br />
If there are no differing rows between the master and slave databases, the <i>DIFFS </i>column will show 0.<br />
<br />
<br />
<pre class="brush: text">
$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master2
TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
02-02T20:58:15 0 0 5 1 0 1.134 testdb.users
</pre>
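To inspect the results yourself, you can query the checksums table on the replica. A sketch based on the default table layout - it lists chunks whose row counts or checksums differ from the master:<br />
<br />
<pre class="brush: text">
MariaDB [(none)]> select db, tbl, chunk, this_cnt, master_cnt
    -> from percona.checksums
    -> where master_cnt <> this_cnt or master_crc <> this_crc
    -> or isnull(master_crc) <> isnull(this_crc);
</pre>
<br />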
<br />
<br />
<b>MASTER2</b><br />
<br />
<pre class="brush: text">
MariaDB [testdb]> create table address (id int auto_increment primary key, city varchar(30));
Query OK, 0 rows affected (0.06 sec)
MariaDB [testdb]> insert into address (city) values ('New York');
Query OK, 1 row affected (0.07 sec)
MariaDB [testdb]> insert into address (city) values ('LA');
Query OK, 1 row affected (0.06 sec)
MariaDB [testdb]> insert into address (city) values ('Zagreb');
Query OK, 1 row affected (0.13 sec)
</pre>
<br />
<br />
<b>MASTER1</b><br />
<br />
<pre class="brush: text">
$ pt-table-checksum -upercona -ppercona --replicate percona.checksums --databases testdb -h master2
TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
02-02T20:59:16 0 0 3 1 0 1.032 testdb.address
02-02T20:59:17 0 0 5 1 0 1.120 testdb.users
</pre>
<br />
<br />
<pre class="brush: text">
$ pt-table-checksum -upercona -ppercona --replicate=percona.checksums --replicate-check-only --databases=testdb master1
</pre>
<br />
<br />
Nothing was returned in the output, which means that the testdb database is in sync with the slave.<br />
<br />
<br />
Insert some test data:<br />
<br />
<pre class="brush: text">
MariaDB [testdb]> create table animals (id int not null auto_increment,
-> name char(30) not null,
-> primary key(id));
Query OK, 0 rows affected (0.04 sec)
MariaDB [testdb]> insert into animals (name) values ('dog'),('cat'),('whale');
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0
MariaDB [testdb]> create table countries (id int not null auto_increment,
-> name varchar(30),
-> primary key(id));
Query OK, 0 rows affected (0.09 sec)
MariaDB [testdb]> insert into countries(name) values ('Croatia'),('England'),('USA'),('Island');
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
MariaDB [testdb]> select * from animals;
+----+-------+
| id | name |
+----+-------+
| 1 | dog |
| 2 | cat |
| 3 | whale |
+----+-------+
3 rows in set (0.00 sec)
MariaDB [testdb]> select * from countries;
+----+---------+
| id | name |
+----+---------+
| 1 | Croatia |
| 2 | England |
| 3 | USA |
| 4 | Island |
+----+---------+
4 rows in set (0.00 sec)
</pre>
<br />
<br />
Check if database is in sync:<br />
<br />
<pre class="brush: text">
$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master1
TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
02-02T21:03:49 0 0 3 1 0 0.177 testdb.address
02-02T21:03:49 0 0 3 1 0 0.045 testdb.animals
02-02T21:03:49 0 0 4 1 0 0.049 testdb.countries
02-02T21:03:49 0 0 5 1 0 0.037 testdb.users
</pre>
<br />
<br />
<br />
<br />
<br />
<font color="blue"><b>RESYNC REPLICA FROM THE MASTER</b></font><br />
<br />
<br />
Let's make the database on <i>MASTER2 </i>out of sync and create some differences between the databases.<br />
<br />
<br />
<b>MASTER2</b><br />
<br />
Instead of stopping the replication process, I will temporarily disable binary logging on the <i>MASTER2 </i>server.<br />
<br />
<pre class="brush: text">
MariaDB [testdb]> SET SQL_LOG_BIN=0;
Query OK, 0 rows affected (0.00 sec)
</pre>
<br />
<br />
Make some data modifications.<br />
<br />
<pre class="brush: text">
MariaDB [testdb]> insert into animals (name) values ('Ostrich'),('Penguin');
Query OK, 2 rows affected (0.04 sec)
Records: 2 Duplicates: 0 Warnings: 0
MariaDB [testdb]> delete from countries where id=2;
Query OK, 1 row affected (0.01 sec)
MariaDB [testdb]> create table colors (name varchar(30));
Query OK, 0 rows affected (0.10 sec)
MariaDB [testdb]> insert into colors(name) values ('Red'),('Blue');
Query OK, 2 rows affected (0.02 sec)
Records: 2 Duplicates: 0 Warnings: 0
</pre>
<br />
<br />
Enable binary logging again.<br />
<br />
<pre class="brush: text">
MariaDB [testdb]> SET SQL_LOG_BIN=1;
Query OK, 0 rows affected (0.00 sec)
</pre>
<br />
<br />
<br />
<b>MASTER1</b><br />
<br />
<pre class="brush: text">
MariaDB [testdb]> select * from animals;
+----+-------+
| id | name |
+----+-------+
| 1 | dog |
| 2 | cat |
| 3 | whale |
+----+-------+
3 rows in set (0.00 sec)
MariaDB [testdb]> select * from countries;
+----+---------+
| id | name |
+----+---------+
| 1 | Croatia |
| 2 | England |
| 3 | USA |
| 4 | Island |
+----+---------+
4 rows in set (0.00 sec)
MariaDB [testdb]> show tables;
+------------------+
| Tables_in_testdb |
+------------------+
| address |
| animals |
| countries |
| users |
+------------------+
4 rows in set (0.00 sec)
</pre>
<br />
<br />
<b>MASTER2</b><br />
<br />
<pre class="brush: text">
MariaDB [testdb]> select * from animals;
+----+---------+
| id | name |
+----+---------+
| 1 | dog |
| 2 | cat |
| 3 | whale |
| 4 | Ostrich |
| 5 | Penguin |
+----+---------+
5 rows in set (0.00 sec)
MariaDB [testdb]> select * from countries;
+----+---------+
| id | name |
+----+---------+
| 1 | Croatia |
| 3 | USA |
| 4 | Island |
+----+---------+
3 rows in set (0.00 sec)
MariaDB [testdb]> show tables;
+------------------+
| Tables_in_testdb |
+------------------+
| address |
| animals |
| colors |
| countries |
| users |
+------------------+
5 rows in set (0.00 sec)
</pre>
<br />
<br />
Notice that there are some inconsistencies between the databases, and there isn't any built-in tool that will notify us about that. Replication is working fine, even though the replica has different data than the master.<br />
<br />
With <i>pt-table-checksum</i> we will check for data differences between the databases.<br />
<br />
<br />
<b>MASTER1</b><br />
<br />
<pre class="brush: text">
$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master1
TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
02-02T21:11:23 0 0 3 1 0 0.106 testdb.address
02-02T21:11:23 0 1 3 1 0 0.053 testdb.animals
02-02T21:11:24 0 1 4 1 0 0.046 testdb.countries
02-02T21:11:24 0 0 5 1 0 0.042 testdb.users
$ pt-table-checksum -upercona -ppercona --replicate=percona.checksums --replicate-check-only --databases=testdb master1
Differences on master2
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
testdb.animals 1 2 1
testdb.countries 1 -1 1
</pre>
<br />
Notice how the tool reported the differences in the <i>DIFFS </i>column.
<br />
<br />
<br />
Synchronizing data between servers in a master-master configuration is not a trivial task. You have to think about which process is changing data where, and be very careful to avoid data corruption.<br />
<br />
In a master-master configuration data changes are replicated between nodes, and statements executed on the "slave" node are replicated back to the master.<br />
<br />
Maybe the best approach would be to stop replication, restore the replica from a backup or reclone the whole server, and then start replication again. You can also dump only the affected data with <i>mysqldump </i>and reload it.<br />
<br />
<br />
As this is my testing environment, I will try to resolve the differences using the <i>pt-table-sync</i> tool from the <i>Percona Toolkit</i>.<br />
<br />
<br />
First I will use the tool with the <i>--print</i> option, which will only display the queries that would resolve the differences. I will inspect those queries before executing them on the slave server. <br />
These queries could also be executed manually.<br />
<br />
<pre class="brush: text">
$ pt-table-sync -upercona -ppercona --sync-to-master --databases testdb --transaction --lock=1 --verbose master2 --print
# Syncing h=master2,p=...,u=percona
# DELETE REPLACE INSERT UPDATE ALGORITHM START END EXIT DATABASE.TABLE
# 0 0 0 0 Chunk 22:13:17 22:13:17 0 testdb.address
DELETE FROM `testdb`.`animals` WHERE `id`='4' LIMIT 1 /*percona-toolkit src_db:testdb src_tbl:animals src_dsn:P=3306,h=master1,p=...,u=percona dst_db:testdb dst_tbl:animals dst_dsn:h=master2,p=...,u=percona lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:7723 user:msutic host:master1*/;
DELETE FROM `testdb`.`animals` WHERE `id`='5' LIMIT 1 /*percona-toolkit src_db:testdb src_tbl:animals src_dsn:P=3306,h=master1,p=...,u=percona dst_db:testdb dst_tbl:animals dst_dsn:h=master2,p=...,u=percona lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:7723 user:msutic host:master1*/;
# 2 0 0 0 Chunk 22:13:17 22:13:17 2 testdb.animals
REPLACE INTO `testdb`.`countries`(`id`, `name`) VALUES ('2', 'England') /*percona-toolkit src_db:testdb src_tbl:countries src_dsn:P=3306,h=master1,p=...,u=percona dst_db:testdb dst_tbl:countries dst_dsn:h=master2,p=...,u=percona lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:7723 user:msutic host:master1*/;
# 0 1 0 0 Chunk 22:13:17 22:13:17 2 testdb.countries
# 0 0 0 0 Chunk 22:13:17 22:13:17 0 testdb.users
</pre>
<br />
<br />
Set the <i>--execute</i> option to execute those queries.<br />
With the <i>--sync-to-master</i> option we will treat the <i>MASTER2 </i>server as a slave.<br />
<br />
<br />
<pre class="brush: text">
$ pt-table-sync -upercona -ppercona --sync-to-master --databases testdb --transaction --lock=1 --verbose master2 --execute
# Syncing h=master2,p=...,u=percona
# DELETE REPLACE INSERT UPDATE ALGORITHM START END EXIT DATABASE.TABLE
# 0 0 0 0 Chunk 22:19:51 22:19:51 0 testdb.address
# 2 0 0 0 Chunk 22:19:51 22:19:51 2 testdb.animals
# 0 1 0 0 Chunk 22:19:51 22:19:51 2 testdb.countries
# 0 0 0 0 Chunk 22:19:51 22:19:51 0 testdb.users
</pre>
<br />
<br />
The output shows that the differences were successfully resolved with two <i>DELETE </i>operations and one <i>REPLACE </i>operation on the specified tables.<br />
<br />
Let's run another check to verify whether the differences still exist.<br />
<br />
<pre class="brush: text">
$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master1
TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
02-02T22:21:30 0 0 3 1 0 0.549 testdb.address
02-02T22:21:30 0 0 3 1 0 0.048 testdb.animals
02-02T22:21:30 0 0 4 1 0 0.043 testdb.countries
02-02T22:21:30 0 0 5 1 0 0.049 testdb.users
</pre>
<br />
The <i>DIFFS </i>column shows only 0, which means that the tables are in sync.<br />
<br />
<br />
<br />
<br />
What happens if I run the checksums on the <i>MASTER2 </i>server?<br />
<br />
<br />
<b>MASTER2</b><br />
<br />
<pre class="brush: text">
$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master2
TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
02-02T22:24:16 0 0 3 1 0 0.072 testdb.address
02-02T22:24:16 0 0 3 1 0 0.048 testdb.animals
02-02T22:24:16 Skipping table testdb.colors because it has problems on these replicas:
Table testdb.colors does not exist on replica master1
This can break replication. If you understand the risks, specify --no-check-slave-tables to disable this check.
02-02T22:24:16 Error checksumming table testdb.colors: DBD::mysql::db selectrow_hashref failed: Table 'testdb.colors' doesn't exist [for Statement "EXPLAIN SELECT * FROM `testdb`.`colors` WHERE 1=1"] at /usr/bin/pt-table-checksum line 6595.
02-02T22:24:16 1 0 0 0 0 0.003 testdb.colors
02-02T22:24:16 0 0 4 1 0 0.044 testdb.countries
02-02T22:24:16 0 0 5 1 0 0.043 testdb.users
</pre>
<br />
<br />
The output shows an error because the table <i>testdb.colors</i> exists on <i>MASTER2 </i>but not on <i>MASTER1</i>.<br />
<br />
I know that <i>MASTER1 </i>has the "correct" data, so I will just drop the <i>testdb.colors </i>table on the <i>MASTER2 </i>node.<br />
<br />
<pre class="brush: text">
MariaDB [testdb]> drop table if exists testdb.colors;
Query OK, 0 rows affected (0.05 sec)
</pre>
<br />
<br />
Run check again:<br />
<br />
<pre class="brush: text">
$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master2
TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
02-02T22:26:43 0 0 3 1 0 0.322 testdb.address
02-02T22:26:43 0 0 3 1 0 0.056 testdb.animals
02-02T22:26:43 0 0 4 1 0 0.050 testdb.countries
02-02T22:26:43 0 0 5 1 0 0.045 testdb.users
</pre>
<br />
<br />
Now the databases are in sync.<br />
<br />
<br />
<br />
If we use the <i>--quiet</i> option the tool will print a row per table only if there are differences. This is a nice way to run the tool from a cron job and send mail only when there is a non-zero exit status - see the sketch after the output below.<br />
<br />
<pre class="brush: text">
$ pt-table-checksum -upercona -ppercona --create-replicate-table --replicate percona.checksums --databases testdb -h master1 --quiet
(no rows)
</pre>
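For example, a crontab entry could rely on that exit status and send mail only when something is wrong. A hypothetical sketch - adjust the schedule, paths and mail command to your environment:<br />
<br />
<pre class="brush: text">
# hypothetical crontab entry on master1
0 3 * * * /usr/bin/pt-table-checksum -upercona -ppercona --replicate percona.checksums --databases testdb -h master1 --quiet || echo "replication inconsistencies found" | mail -s "pt-table-checksum" dba@example.com
</pre>
<br />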
<br />
<br />
<br />
<br />
<b>REFERENCES</b><br />
<a href="http://www.percona.com/doc/percona-toolkit/2.2/pt-table-sync.html">http://www.percona.com/doc/percona-toolkit/2.2/pt-table-sync.html</a><br />
<a href="http://www.percona.com/doc/percona-toolkit/2.2/pt-table-checksum.html">http://www.percona.com/doc/percona-toolkit/2.2/pt-table-checksum.html</a><br />
<a href="http://www.percona.com/software/percona-toolkit">http://www.percona.com/software/percona-toolkit</a><br />
<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com32tag:blogger.com,1999:blog-2530682427657016426.post-52490769771305265882015-02-01T13:36:00.001+01:002020-11-10T07:51:46.338+01:00MariaDB(MySQL) Master-Master ReplicationThe simplest and probably most common replication method is <i>master-slave</i> replication. Basically, data is replicated from the master database to the slave. In case of a master database failure you must get the slave database up to date before failover and then promote the slave to be the new master.<br />
<br />
Another method is to set up replication in both directions, called <i>master-master</i> replication. But you must be aware that this setup brings some potential issues, as data changes are happening on both nodes. It can be a problem if you have tables with auto_increment fields. If both servers are inserting or updating the same table, replication will break on one server due to a “duplicate entry” error. To resolve this issue you have the "<i>auto_increment_increment</i>" and "<i>auto_increment_offset</i>" settings, illustrated below.<br />
<br />
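For illustration, with the settings used below (increment 5, offsets 1 and 2) each node draws IDs from its own non-overlapping sequence, so simultaneous inserts into the same table can never generate the same value:<br />
<br />
<pre class="brush: text">
# auto_increment_increment = 5
# MASTER1 (auto_increment_offset = 1) generates ids 1, 6, 11, 16, ...
# MASTER2 (auto_increment_offset = 2) generates ids 2, 7, 12, 17, ...
</pre>
<br />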
In my case it's best to use the master-master setup as active-passive replication. If we know that only one node is performing data modifications we can avoid many possible problems. In case of a failover the "<i>slave</i>" can easily be promoted to a new master. Data modifications are automatically replicated to the failed node when it comes back up.<br />
<br />
Of course, this simple setup is not suitable for all situations and it has its drawbacks, but luckily you have several other options at your disposal, like <a href="https://mariadb.com/kb/en/mariadb/what-is-mariadb-galera-cluster/">MariaDB Galera Cluster</a>.<br />
<br />
<br />
<br />
<span id="fullpost">
Servers setup:<br />
OS: Debian 7.8<br />
DB: MariaDB 10.0.16<br />
</span><br />
<div class="separator" style="clear: both; text-align: center;">
<span id="fullpost"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJTRYGetGX8C98UrdAikaNM8QpBbBUB_2bEhOrFMK1vhcF-pCsv-yyHWxyQ6dZ0AyP5AdJKQK2pFV6AE6-AhHmMUqRYcgjGHECXtddbeYtq7v1fh1CchGPq9qtQTe_TUspZU885K2n97J2/s1600/ReplicationP.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJTRYGetGX8C98UrdAikaNM8QpBbBUB_2bEhOrFMK1vhcF-pCsv-yyHWxyQ6dZ0AyP5AdJKQK2pFV6AE6-AhHmMUqRYcgjGHECXtddbeYtq7v1fh1CchGPq9qtQTe_TUspZU885K2n97J2/s1600/ReplicationP.jpg" /></a></span></div>
<span id="fullpost">
<br />
<br />
<br />
Install MariaDB 10 (both nodes).<br />
<br />
<pre class="brush: text">$ sudo apt-get install python-software-properties
$ sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 0xcbcb082a1bb943db
$ sudo add-apt-repository 'deb http://mirror3.layerjet.com/mariadb/repo/10.0/debian wheezy main'
$ sudo apt-get update
$ sudo apt-get install mariadb-server
</pre>
<br />
<br />
Stop MariaDB on both nodes:<br />
<pre class="brush: text">$ sudo service mysql stop
</pre>
<br />
<br />
<b>MASTER1</b><br />
<br />
Edit <i>/etc/mysql/my.cnf</i> parameter file.<br />
<br />
<pre class="brush: text"># bind-address = 127.0.0.1
server-id = 61
report_host = master1
log_bin = /var/log/mysql/mariadb-bin
log_bin_index = /var/log/mysql/mariadb-bin.index
relay_log = /var/log/mysql/relay-bin
relay_log_index = /var/log/mysql/relay-bin.index
# replicate-do-db = testdb
auto_increment_increment = 5
auto_increment_offset = 1
</pre>
<br />
<br />
<span style="color: blue;"># bind-address = 127.0.0.1</span><br />
By default MySQL will accept connections only from the local host. We comment out this line to enable connections from other hosts. This is important for replication to work.<br />
<br />
<span style="color: blue;">server-id = 61</span><br />
<span style="color: blue;">report_host = master1</span><br />
Choose an ID that will uniquely identify your host. I will use the last two digits of my IP address. Optionally you can set the report_host parameter so the servers report their hostnames to each other.<br />
<br />
<span style="color: blue;">log_bin = /var/log/mysql/mariadb-bin</span><br />
<span style="color: blue;">log_bin_index = /var/log/mysql/mariadb-bin.index</span><br />
Enable binary logging.<br />
<br />
<span style="color: blue;">relay_log = /var/log/mysql/relay-bin</span><br />
<span style="color: blue;">relay_log_index = /var/log/mysql/relay-bin.index</span><br />
Enable creating relay log files. Events that are read from the master's binary log are written to the slave's relay log.<br />
<br />
<span style="color: green;">replicate-do-db = testdb</span><br />
With this parameter we are telling MariaDB which databases to replicate. This parameter is <span style="color: red;">optional</span>.<br />
<br />
<br />
<br />
<br />
Now we can start MariaDB server.<br />
<br />
<pre class="brush: text">$ sudo service mysql start
</pre>
<br />
<br />
Log in as root and create the user that will be used for replicating data between our servers. Grant the appropriate privileges to the user.<br />
<br />
<pre class="brush: text">$ sudo mysql -uroot -p
MariaDB [(none)]> create user 'replusr'@'%' identified by 'replusr';
MariaDB [(none)]> grant replication slave on *.* to 'replusr'@'%';
</pre>
<br />
<br />
As the last step, check the status information about the binary log files, as we will use this information to start replication on the other node.<br />
<br />
<pre class="brush: text">MariaDB [(none)]> show master status;
+--------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+--------------------+----------+--------------+------------------+
| mariadb-bin.000009 | 634 | | |
+--------------------+----------+--------------+------------------+
</pre>
<br />
<br />
<br />
<b>MASTER2</b><br />
<br />
Edit <i>/etc/mysql/my.cnf</i> parameter file.<br />
<br />
<pre class="brush: text"># bind-address = 127.0.0.1
server-id = 62
report_host = master2
log_bin = /var/log/mysql/mariadb-bin
log_bin_index = /var/log/mysql/mariadb-bin.index
relay_log = /var/log/mysql/relay-bin
relay_log_index = /var/log/mysql/relay-bin.index
# replicate-do-db = testdb
auto_increment_increment = 5
auto_increment_offset = 2
</pre>
<br />
<br />
Start MariaDB server.<br />
<br />
<pre class="brush: text">$ sudo service mysql start
</pre>
<br />
<br />
Create user which will be used for replication and grant privileges to the user.<br />
<br />
<pre class="brush: text">$ sudo mysql -uroot -p
MariaDB [(none)]> create user 'replusr'@'%' identified by 'replusr';
MariaDB [(none)]> grant replication slave on *.* to 'replusr'@'%';
</pre>
<br />
<br />
To start replication enter the following commands.<br />
<br />
<pre class="brush: text">MariaDB [(none)]> STOP SLAVE;
MariaDB [(none)]> CHANGE MASTER TO MASTER_HOST='master1', MASTER_USER='replusr',
-> MASTER_PASSWORD='replusr', MASTER_LOG_FILE='mariadb-bin.000009', MASTER_LOG_POS=634;
MariaDB [(none)]> START SLAVE;
</pre>
<br />
For <i>MASTER_LOG_FILE </i>and <i>MASTER_LOG_POS </i>I used the information from "<i>show master status</i>" on the first node.<br />
<br />
Check the status information of the slave threads.<br />
<br />
<pre class="brush: text">MariaDB [(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: master1
Master_User: replusr
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mariadb-bin.000009
Read_Master_Log_Pos: 634
Relay_Log_File: relay-bin.000002
Relay_Log_Pos: 537
Relay_Master_Log_File: mariadb-bin.000009
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: testdb
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 634
Relay_Log_Space: 828
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
</pre>
<br />
<br />
Notice that <i>Read_Master_Log_Pos</i> and <i>Exec_Master_Log_Pos</i> are in sync, which is a good indicator that our databases are in sync.<br />
<br />
<br />
Check the status information about the binary log files of the <i>MASTER2 </i>node. We will need this information to start replication on the <i>MASTER1 </i>node.<br />
<br />
<pre class="brush: text">MariaDB [(none)]> show master status;
+--------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+--------------------+----------+--------------+------------------+
| mariadb-bin.000009 | 759 | | |
+--------------------+----------+--------------+------------------+
</pre>
<br />
<br />
<br />
<b>MASTER1</b><br />
<br />
Start replicating data from the <i>MASTER2 </i>node to the <i>MASTER1 </i>node.<br />
<br />
<pre class="brush: text">MariaDB [(none)]> STOP SLAVE;
MariaDB [(none)]> CHANGE MASTER TO MASTER_HOST='master2', MASTER_USER='replusr',
-> MASTER_PASSWORD='replusr', MASTER_LOG_FILE='mariadb-bin.000009', MASTER_LOG_POS=759;
MariaDB [(none)]> START SLAVE;
</pre>
<br />
<br />
<pre class="brush: text">MariaDB [(none)]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: master2
Master_User: replusr
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mariadb-bin.000009
Read_Master_Log_Pos: 759
Relay_Log_File: relay-bin.000002
Relay_Log_Pos: 537
Relay_Master_Log_File: mariadb-bin.000009
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: testdb
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 759
Relay_Log_Space: 828
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 62
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
</pre>
<br />
<br />
Everything seems to be OK.<br />
<br />
<br />
<br />
Let's create a test table and insert some rows to test our replication.<br />
<br />
<br />
<b>MASTER1</b><br />
<br />
<pre class="brush: text">MariaDB [(none)]> create database testdb;
MariaDB [(none)]> use testdb;
Database changed
MariaDB [testdb]> CREATE TABLE users (id INT AUTO_INCREMENT,
-> name VARCHAR(30),
-> datum TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-> PRIMARY KEY(id));
Query OK, 0 rows affected (0.50 sec)
MariaDB [testdb]> INSERT INTO users(name) VALUES ('Marko');
Query OK, 1 row affected (0.06 sec)
MariaDB [testdb]> select * from users;
+----+-------+---------------------+
| id | name | datum |
+----+-------+---------------------+
| 1 | Marko | 2015-02-01 00:41:41 |
+----+-------+---------------------+
1 row in set (0.00 sec)
</pre>
<br />
<br />
<b>MASTER2</b><br />
<br />
<pre class="brush: text">MariaDB [testdb]> use testdb
Database changed
MariaDB [testdb]> select * from users;
+----+-------+---------------------+
| id | name | datum |
+----+-------+---------------------+
| 1 | Marko | 2015-02-01 00:41:41 |
+----+-------+---------------------+
1 row in set (0.00 sec)
MariaDB [testdb]> INSERT INTO users(name) VALUES('John');
Query OK, 1 row affected (0.39 sec)
MariaDB [testdb]> select * from users;
+----+-------+---------------------+
| id | name | datum |
+----+-------+---------------------+
| 1 | Marko | 2015-02-01 00:41:41 |
| 2 | John | 2015-01-31 16:17:55 |
+----+-------+---------------------+
2 rows in set (0.00 sec)
</pre>
<br />
<br />
<b>MASTER1</b><br />
<br />
<pre class="brush: text">MariaDB [testdb]> select * from users;
+----+-------+---------------------+
| id | name | datum |
+----+-------+---------------------+
| 1 | Marko | 2015-02-01 00:41:41 |
| 2 | John | 2015-01-31 16:17:55 |
+----+-------+---------------------+
2 rows in set (0.00 sec)
</pre>
<br />
As we can see, our table and rows were replicated successfully.<br />
<br />
<br />
<br />
Let's simulate a crash of the <i>MASTER1 </i>node and power off the server.<br />
<br />
<pre class="brush: text">$ sudo shutdown -h now
</pre>
<br />
While the server is down, insert some rows on the <i>MASTER2 </i>node.<br />
<br />
<b>MASTER2</b><br />
<br />
<pre class="brush: text">MariaDB [testdb]> INSERT INTO users(name) VALUES ('Eric');
Query OK, 1 row affected (0.41 sec)
MariaDB [testdb]> INSERT INTO users(name) VALUES ('Clive');
Query OK, 1 row affected (0.08 sec)
MariaDB [testdb]> INSERT INTO users(name) VALUES ('Maria');
Query OK, 1 row affected (0.09 sec)
MariaDB [testdb]> select * from users;
+----+-------+---------------------+
| id | name | datum |
+----+-------+---------------------+
| 1 | Marko | 2015-02-01 00:41:41 |
| 2 | John | 2015-01-31 16:17:55 |
| 3 | Eric | 2015-01-31 16:19:49 |
| 4 | Clive | 2015-01-31 16:19:55 |
| 5 | Maria | 2015-01-31 16:20:01 |
+----+-------+---------------------+
5 rows in set (0.00 sec)
</pre>
<br />
<br />
<pre class="brush: text">MariaDB [testdb]> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Reconnecting after a failed master event read
Master_Host: master1
Master_User: replusr
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mariadb-bin.000010
Read_Master_Log_Pos: 1828
Relay_Log_File: relay-bin.000012
Relay_Log_Pos: 1083
Relay_Master_Log_File: mariadb-bin.000010
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
Replicate_Do_DB: testdb
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1828
Relay_Log_Space: 1663
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 2003
Last_IO_Error: error reconnecting to master 'replusr@master1:3306' - retry-time:
60 retries: 86400 message: Can't connect to MySQL server
on 'master1' (111 "Connection refused")
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
</pre>
<br />
Check the <i>Last_IO_Error </i>message while <i>MASTER1 </i>is down.<br />
<br />
<br />
Now turn on the <i>MASTER1 </i>node again.<br />
The MariaDB server and replication will start automatically, and <i>MASTER1</i> should catch up with <i>MASTER2</i>.<br />
<br />
<br />
<b>MASTER1</b><br />
<br />
Check "<i>users</i>" table - it's synchronised again.<br />
<br />
<pre class="brush: text">$ mysql -u root -p -D testdb
MariaDB [testdb]> select * from users;
+----+-------+---------------------+
| id | name | datum |
+----+-------+---------------------+
| 1 | Marko | 2015-02-01 00:41:41 |
| 2 | John | 2015-01-31 16:17:55 |
| 3 | Eric | 2015-01-31 16:19:49 |
| 4 | Clive | 2015-01-31 16:19:55 |
| 5 | Maria | 2015-01-31 16:20:01 |
+----+-------+---------------------+
5 rows in set (0.00 sec)
</pre>
<br />
<br />
<br />
Please let me know if you see possible problems in this configuration. I will gladly update the post.
Thanks for reading!
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com0tag:blogger.com,1999:blog-2530682427657016426.post-71600432750472128122014-12-22T14:12:00.001+01:002020-11-10T07:51:57.862+01:00ORA-19599 block corruption when filesystemio_options=SETALL on ext4 file system using LinuxA few days ago I experienced a strange issue in my development environment running on OEL 5.8 with an EXT4 filesystem. Note - the EXT4 filesystem has been supported since OEL 5.6.<br />
<br />
This was a virtual machine running an oldish 10.2.0.5.0 Oracle database.<br />
<br />
I noticed that the backup of my database was failing because of archive log corruption. As this is a development database, I simply deleted the corrupted archive logs and initiated a full backup again. But the backup failed because the new archive logs were corrupted.<br />
<br />
Weird issue...<br />
<br />
I forced a log file switch a few times and validated the new archive logs - everything was OK. The redo logs were multiplexed and everything was fine with them. I validated the database for physical and logical corruption - everything was OK.<br />
<br />
Then I initiated the backup again and it failed.
This is an excerpt from the RMAN log (I've changed the log slightly):<br />
<span id="fullpost">
<br />
<br />
<pre class="brush: text">
RMAN> connect target *
2> run
3> {
7>
8> ALLOCATE CHANNEL d1 DEVICE TYPE DISK;
9> BACKUP INCREMENTAL LEVEL 0 FORMAT '/u01/backup_db/QAS/fullbkp_dir/FULL_%d_%u' DATABASE TAG "weekly_full";
10> RELEASE CHANNEL d1;
11> sql 'alter system archive log current';
12> ALLOCATE CHANNEL d1 DEVICE TYPE DISK;
13> BACKUP (ARCHIVELOG ALL FORMAT '/u01/backup_db/QAS/fullbkp_dir/ARCH_%d_%T_%u_s%s_p%p' DELETE INPUT TAG "archivelogs");
14> RELEASE CHANNEL d1;
15>
16> DELETE OBSOLETE;
17>
18> BACKUP CURRENT CONTROLFILE FORMAT '/u01/backup_db/QAS/fullbkp_dir/controlf_%d_%u_%s_%T';
19> }
20>
connected to target database: QAS (DBID=2203246509)
using target database control file instead of recovery catalog
allocated channel: d1
channel d1: sid=43 devtype=DISK
Starting backup at 17.12.2014 08:17:02
channel d1: starting compressed incremental level 0 datafile backupset
channel d1: specifying datafile(s) in backupset
input datafile fno=00035 name=/u01/oradata/qas700.data1
input datafile fno=00036 name=/u01/oradata/qas700.data2
input datafile fno=00037 name=/u01/oradata/qas700.data3
input datafile fno=00002 name=/u01/oradata/undo.data1
...
...
...
channel d1: starting piece 1 at 17.12.2014 08:17:03
channel d1: finished piece 1 at 17.12.2014 09:45:48
piece handle=/u01/backup_db/QAS/fullbkp_dir/FULL_QAS_26pqchvu tag=WEEKLY_FULL comment=NONE
channel d1: backup set complete, elapsed time: 01:28:46
Finished backup at 17.12.2014 09:45:48
Starting Control File and SPFILE Autobackup at 17.12.2014 09:45:48
piece handle=/u01/app/oracle10/product/10.2.0/db_1/dbs/c-2203246509-20141217-13 comment=NONE
Finished Control File and SPFILE Autobackup at 17.12.2014 09:45:53
released channel: d1
sql statement: alter system archive log current
allocated channel: d1
channel d1: sid=43 devtype=DISK
Starting backup at 17.12.2014 09:45:54
current log archived
channel d1: starting compressed archive log backupset
channel d1: specifying archive log(s) in backup set
input archive log thread=1 sequence=11350 recid=39 stamp=866540753
input archive log thread=1 sequence=11351 recid=40 stamp=866540754
channel d1: starting piece 1 at 17.12.2014 09:45:55
released channel: d1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on d1 channel at 12/17/2014 09:45:56
ORA-19599: block number 6144 is corrupt in archived log /u01/oradata/QAS/QASarch/1_11350_826737654.dbf
Recovery Manager complete.
</pre>
<br />
<br />
Notice that the full backup finished successfully, and when RMAN tried to back up the new archive logs it failed due to corruption. <br />
<br />
I mentioned this issue on Twitter and got responses from Ronald Rood (@Ik_zelf) and Philippe Fierens (@pfierens) who helped me find the resolution.<br />
Thanks guys!<br />
<br />
<br />
Check this note:<br />
<blockquote>
ORA-1578 ORA-353 ORA-19599 Corrupt blocks with zeros when filesystemio_options=SETALL on ext4 file system using Linux (Doc ID 1487957.1)
</blockquote>
<br />
I had <b>filesystemio_options</b> configured as <b>SETALL</b>, and resetting this parameter to its default value solved my corruption problem.<br />
<br />
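For reference, this is roughly how the parameter can be checked and reset (a sketch - the parameter is static, so a restart is required, and the default value is platform-dependent):<br />
<br />
<pre class="brush: text">
SQL> show parameter filesystemio_options;
SQL> alter system reset filesystemio_options scope=spfile sid='*';
SQL> shutdown immediate
SQL> startup
</pre>
<br />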
<br />
As this was a development machine I wasn't thinking much about the filesystem, but next time it will be ASM or XFS - probably not EXT4 :-)<br />
<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com2tag:blogger.com,1999:blog-2530682427657016426.post-78355260416780454712014-10-29T15:46:00.003+01:002020-11-10T07:52:09.223+01:00Mount ASM diskgroups with new ASM instanceImagine you have an 11gR2 Oracle Restart configuration with database files located in ASM. <br />
<br />
After a server crash you realize that the local disks are corrupted, and with the local disks you lost all Oracle installations. Even though this is an important system, you don't have a database backup (always take backups!).<br />
<br />
But you managed to save all ASM disks as they were located on separate storage.<br />
<br />
<br />
This will be a small beginner guide on how to help yourself in such a situation.<br />
<br />
<span id="fullpost">
<br />
As the old server crashed, you must create a new server configuration identical to the old one. A nice thing about ASM is that it keeps its metadata in the disk headers. If the disks are intact and the headers are not damaged, you should be able to mount the diskgroups with a new ASM instance. But this new instance must be compatible with your diskgroups.<br />
<br />
<br />
The Grid Infrastructure and database software were version 11.2.0.1, and this is the version I will install on the new server.<br />
<br />
To keep this post short enough, steps like creating users, installing ASMLib and other packages, configuring kernel parameters, etc. are excluded.<br />
<br />
<br />
List the Oracle ASM disks mounted to the new server.<br />
With the "scandisks" command I will find the devices which have been labeled as ASM disks.<br />
<br />
<pre class="brush: text">
# oracleasm scandisks
Reloading disk partitions: done
Cleaning any stale ASM disks...
Scanning system for ASM disks...
# oracleasm listdisks
DISK1
DISK2
DISK3
DISK4
DISK5
FRA1
</pre>
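Optionally, at this point you can sanity-check a disk header with the kfed utility that ships with the Grid Infrastructure home (a sketch - field names as printed by kfed's header dump):<br />
<br />
<pre class="brush: text">
$ kfed read /dev/oracleasm/disks/DISK1
# look for kfdhdb.dskname and kfdhdb.grpname in the output -
# they should show the expected ASM disk and diskgroup names
</pre>
<br />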
<br />
Install "Oracle Grid Infrastructure software only" option to avoid automatic Oracle Restart and ASM configuration. This configuration will be performed later manually.<br />
<br />
After the installation finished, run the noted perl script as root to configure Grid Infrastructure for a stand-alone server.<br />
For my configuration the script looks like this:<br />
<pre class="brush: text">
To configure Grid Infrastructure for a Stand-Alone Server run the following command as the root user:
/u01/app/11.2.0.1/grid/perl/bin/perl -I/u01/app/11.2.0.1/grid/perl/lib -I/u01/app/11.2.0.1/grid/crs/install /u01/app/11.2.0.1/grid/crs/install/roothas.pl
</pre>
<br />
<br />
Start cssd if it’s not running.<br />
<br />
<pre class="brush: text">
# ./crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
1 OFFLINE OFFLINE
ora.diskmon
1 OFFLINE OFFLINE
# ./crs_start ora.cssd
Attempting to start `ora.cssd` on member `asterix`
Attempting to stop `ora.diskmon` on member `asterix`
Stop of `ora.diskmon` on member `asterix` succeeded.
Attempting to start `ora.diskmon` on member `asterix`
Start of `ora.diskmon` on member `asterix` succeeded.
Start of `ora.cssd` on member `asterix` succeeded.
</pre>
<br />
<br />
Create a parameter file for the ASM instance in the $ORACLE_HOME/dbs directory of the Grid Infrastructure home.<br />
<br />
<pre class="brush: text">
init+ASM.ora
*.asm_diskstring='/dev/oracleasm/disks'
*.asm_power_limit=1
*.diagnostic_dest='/u01/app/grid'
*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'
</pre>
<br />
<br />
Register and start the ASM instance.<br />
<br />
<pre class="brush: text">
$ export ORACLE_SID=+ASM
$ export ORACLE_HOME=/u01/app/11.2.0.1/grid
$ srvctl add asm -p $ORACLE_HOME/dbs/init+ASM.ora
$ srvctl start asm
$ srvctl status asm
ASM is running on asterix
</pre>
<br />
<br />
Now notice what I see when I start the ASM Configuration Assistant.<br />
<br />
<pre class="brush: text">
$ ./asmca
</pre>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuciiwrIrg8TIcVIeFYsNncvoyqh82LOUSN-sxN2-qvT9y_6hWExJDkzhpkL2Cgqbz6JwRih3NzEQhfLno7ojkZHfYDAilJlYit7cyVPiPtnlwUWWv4QDoGfXupeQf3Lbw9GHc0MhbhNYA/s1600/ScreenShot759.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuciiwrIrg8TIcVIeFYsNncvoyqh82LOUSN-sxN2-qvT9y_6hWExJDkzhpkL2Cgqbz6JwRih3NzEQhfLno7ojkZHfYDAilJlYit7cyVPiPtnlwUWWv4QDoGfXupeQf3Lbw9GHc0MhbhNYA/s400/ScreenShot759.jpg" /></a></div>
<br />
These are the diskgroups with my database and recovery files.<br />
Click "Mount all" to mount them all.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4A-s37bQ0XTpCH05ei7ZIEAAYmVsn1rRGHW74Yd99RHD835A7iNZp81O-j3ZfdJXriL-5PoQMyKPfgliM2uYQaEd4FJz-z8_vc9LH2c5iVe6-ikkHI8tByaJkQFqmrADLX5NLQjMgUZjK/s1600/ScreenShot760.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4A-s37bQ0XTpCH05ei7ZIEAAYmVsn1rRGHW74Yd99RHD835A7iNZp81O-j3ZfdJXriL-5PoQMyKPfgliM2uYQaEd4FJz-z8_vc9LH2c5iVe6-ikkHI8tByaJkQFqmrADLX5NLQjMgUZjK/s400/ScreenShot760.jpg" /></a></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5MIA3V24MgEGuQxLU0o3d1MTeQZTihlxETv_59GnZ4uyg6xaejkM-KJlTGxjz5k2SecHbVT0JOdHqncFsDgKOANvh2wuZaL4lO_qJMeo3fiX1YJJ0h-u7rjqZ7gDYRrSVtRQSo5vFS2vz/s1600/ScreenShot761.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5MIA3V24MgEGuQxLU0o3d1MTeQZTihlxETv_59GnZ4uyg6xaejkM-KJlTGxjz5k2SecHbVT0JOdHqncFsDgKOANvh2wuZaL4lO_qJMeo3fiX1YJJ0h-u7rjqZ7gDYRrSVtRQSo5vFS2vz/s400/ScreenShot761.jpg" /></a></div>
<br />
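If you prefer the command line over the GUI, the same mount can be done from SQL*Plus connected to the ASM instance (a sketch; connect as SYSASM):<br />
<br />
<pre class="brush: sql">
$ sqlplus / as sysasm
SQL> alter diskgroup all mount;
SQL> select name, state from v$asm_diskgroup;
</pre>
<br />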
<br />
<br />
Install the Oracle database software and create a parameter file in "$ORACLE_HOME/dbs" to start the database.<br />
<br />
<pre class="brush: sql">
$ export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1
$ export ORACLE_SID=ora11gr2
$ cd $ORACLE_HOME/dbs
$ cat initora11gr2.ora
*.spfile='+DATA1/ora11gr2/spfileora11gr2.ora'
$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.1.0 Production on Wed Oct 29 14:29:37 2014
Copyright (c) 1982, 2009, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Total System Global Area 668082176 bytes
Fixed Size 2216344 bytes
Variable Size 222301800 bytes
Database Buffers 436207616 bytes
Redo Buffers 7356416 bytes
Database mounted.
Database opened.
SQL>
SQL>
SQL> select name from v$datafile;
NAME
--------------------------------------------------------------------------------
+DATA1/ora11gr2/datafile/system.297.844627929
+DATA1/ora11gr2/datafile/sysaux.265.844627967
+DATA1/ora11gr2/datafile/undotbs1.266.844627991
+DATA1/ora11gr2/datafile/users.267.844628031
+DATA2/ora11gr2/datafile/marko.261.859213577
</pre>
<br />
<br />
The database opened successfully, and you can register the instance using the SRVCTL command.<br />
<br />
<pre class="brush: text">
$ srvctl add database -d $ORACLE_SID -o $ORACLE_HOME -p $ORACLE_HOME/dbs/initora11gr2.ora
$ srvctl start database -d $ORACLE_SID
</pre>
<br />
<br />
Final status.<br />
<br />
<pre class="brush: text">
$ ./crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA1.dg
ONLINE ONLINE asterix
ora.DATA2.dg
ONLINE ONLINE asterix
ora.FRA1.dg
ONLINE ONLINE asterix
ora.asm
ONLINE ONLINE asterix Started
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
1 ONLINE ONLINE asterix
ora.diskmon
1 ONLINE ONLINE asterix
ora.ora11gr2.db
1 ONLINE ONLINE asterix Open
</pre>
<br />
<br />
Be aware that this demo was performed in a virtual environment on my notebook.<br />
<br />
</span>Marko Sutichttp://www.blogger.com/profile/08926232581329666732noreply@blogger.com2tag:blogger.com,1999:blog-2530682427657016426.post-31270572258309653892014-10-24T09:35:00.001+02:002014-10-24T09:50:21.067+02:00Increase disk space for VM running LinuxWhen I create virtual machines on my notebook I always create too small a disk for the root partition or the partition where I put Oracle binaries. After a while, when I want to perform an upgrade or install more Oracle software, there is not enough space. This time I want to note the steps for increasing disk free space.<br />
<br />
I can easily extend or shrink my logical volumes because I am using LVM in my virtual machines. Consider using LVM in production also, because it gives you more flexibility than using normal hard drive partitions.<br />
<br />
In this demo I'm using Oracle Linux 6.4.<br />
<br />
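Before touching anything it is worth getting an overview of the current LVM layout. These are the inspection commands used throughout this post:<br />
<br />
<pre class="brush: text">
# pvs     (physical volumes - disks/partitions handed over to LVM)
# vgs     (volume groups built from those physical volumes)
# lvs     (logical volumes carved out of the volume groups)
</pre>
<br />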
<br />
Check disk free space after OS installation.<br />
<br />
<pre class="brush: text">
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_linuxtest-lv_root
4.9G 2.8G 2.0G 59% /
tmpfs 770M 100K 770M 1% /dev/shm
/dev/sda1 485M 55M 405M 12% /boot
</pre>
<br />
<span id="fullpost">
<br />
Add "/u01" mount and assign some disk space for Oracle installation files.<br />
<br />
<br />
Shut down the VM and add a disk.<br />
<br />
<br />
Partition the new disk "/dev/sdb" using the fdisk command.<br />
<br />
<pre class="brush: text">
# fdisk /dev/sdb
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0xa07249dd.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
switch off the mode (command 'c') and change display units to
sectors (command 'u').
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-391, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-391, default 391):
Using default value 391
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): L
0 Empty 24 NEC DOS 81 Minix / old Lin bf Solaris
1 FAT12 39 Plan 9 82 Linux swap / So c1 DRDOS/sec (FAT-
2 XENIX root 3c PartitionMagic 83 Linux c4 DRDOS/sec (FAT-
3 XENIX usr 40 Venix 80286 84 OS/2 hidden C: c6 DRDOS/sec (FAT-
4 FAT16 <32M 41 PPC PReP Boot 85 Linux extended c7 Syrinx
5 Extended 42 SFS 86 NTFS volume set da Non-FS data
6 FAT16 4d QNX4.x 87 NTFS volume set db CP/M / CTOS / .
7 HPFS/NTFS 4e QNX4.x 2nd part 88 Linux plaintext de Dell Utility
8 AIX 4f QNX4.x 3rd part 8e Linux LVM df BootIt
9 AIX bootable 50 OnTrack DM 93 Amoeba e1 DOS access
a OS/2 Boot Manag 51 OnTrack DM6 Aux 94 Amoeba BBT e3 DOS R/O
b W95 FAT32 52 CP/M 9f BSD/OS e4 SpeedStor
c W95 FAT32 (LBA) 53 OnTrack DM6 Aux a0 IBM Thinkpad hi eb BeOS fs
e W95 FAT16 (LBA) 54 OnTrackDM6 a5 FreeBSD ee GPT
f W95 Ext'd (LBA) 55 EZ-Drive a6 OpenBSD ef EFI (FAT-12/16/
10 OPUS 56 Golden Bow a7 NeXTSTEP f0 Linux/PA-RISC b
11 Hidden FAT12 5c Priam Edisk a8 Darwin UFS f1 SpeedStor
12 Compaq diagnost 61 SpeedStor a9 NetBSD f4 SpeedStor
14 Hidden FAT16 <3 63 GNU HURD or Sys ab Darwin boot f2 DOS secondary
16 Hidden FAT16 64 Novell Netware af HFS / HFS+ fb VMware VMFS
17 Hidden HPFS/NTF 65 Novell Netware b7 BSDI fs fc VMware VMKCORE
18 AST SmartSleep 70 DiskSecure Mult b8 BSDI swap fd Linux raid auto
1b Hidden W95 FAT3 75 PC/IX bb Boot Wizard hid fe LANstep
1c Hidden W95 FAT3 80 Old Minix be Solaris boot ff BBT
1e Hidden W95 FAT1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
</pre>
<br />
<br />
Notice that I identified the partition as "Linux LVM" by choosing the "8e" hex code. <br />
<br />
<br />
Using the pvcreate command, create a physical volume for later use by LVM.<br />
<br />
<pre class="brush: text">
# pvcreate /dev/sdb1
Physical volume "/dev/sdb1" successfully created
</pre>
<br />
Create a new volume group "vg_orabin". Later I can add or remove disks from this volume group.<br />
<br />
<pre class="brush: text">
# vgcreate vg_orabin /dev/sdb1
Volume group "vg_orabin" successfully created
</pre>
<br />
<br />
Information about volume group.<br />
<br />
<pre class="brush: text">
# vgdisplay vg_orabin
--- Volume group ---
VG Name vg_orabin
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 1
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 0
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size 2.99 GiB
PE Size 4.00 MiB
Total PE 766
Alloc PE / Size 0 / 0
Free PE / Size 766 / 2.99 GiB
VG UUID h3N1o5-AlYF-9nkL-PXiB-P8HK-tGAa-GlXPa5
</pre>
<br />
<br />
Create a logical volume using disk space from the volume group.<br />
<br />
<pre class="brush: text">
# lvcreate --extents 766 -n lv_orabin vg_orabin
Logical volume "lv_orabin" created
</pre>
<br />
<br />
Create and mount filesystem.<br />
<br />
<pre class="brush: text">
# mkfs.ext4 /dev/mapper/vg_orabin-lv_orabin
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
196224 inodes, 784384 blocks
39219 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=805306368
24 block groups
32768 blocks per group, 32768 fragments per group
8176 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 25 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
# mkdir /u01
# mount /dev/mapper/vg_orabin-lv_orabin /u01
</pre>
<br />
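To make the mount persistent across reboots, an entry would normally be added to /etc/fstab as well (a sketch - adjust the options to your needs):<br />
<br />
<pre class="brush: text">
# /etc/fstab
/dev/mapper/vg_orabin-lv_orabin  /u01  ext4  defaults  1 2
</pre>
<br />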
Check disk space available.<br />
<br />
<br />
<pre class="brush: text">
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_linuxtest-lv_root
4.9G 2.8G 2.0G 59% /
tmpfs 770M 88K 770M 1% /dev/shm
/dev/sda1 485M 55M 405M 12% /boot
/dev/mapper/vg_orabin-lv_orabin
3.0G 69M 2.8G 3% /u01
</pre>
<br />
<br />
Hm, 2.8G is not enough free space for me. Let's extend this mount by adding another disk.<br />
<br />
<br />
<br />
Shut down the VM and add a disk.<br />
<br />
<br />
Partition the new disk and create a physical volume for LVM.<br />
<br />
<pre class="brush: text">
# fdisk /dev/sdc
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x16953397.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
switch off the mode (command 'c') and change display units to
sectors (command 'u').
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-652, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-652, default 652):
Using default value 652
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
# pvcreate /dev/sdc1
Physical volume "/dev/sdc1" successfully created
</pre>
<br />
Check the current status of the volume group "vg_orabin".<br />
<br />
<pre class="brush: text">
# vgdisplay vg_orabin
--- Volume group ---
VG Name vg_orabin
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 2
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 1
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size 2.99 GiB
PE Size 4.00 MiB
Total PE 766
Alloc PE / Size 766 / 2.99 GiB
Free PE / Size 0 / 0
VG UUID h3N1o5-AlYF-9nkL-PXiB-P8HK-tGAa-GlXPa5
</pre>
<br />
Extend the volume group by adding the physical volume "/dev/sdc1" using the vgextend command.<br />
<br />
<pre class="brush: text">
# vgextend vg_orabin /dev/sdc1
Volume group "vg_orabin" successfully extended
</pre>
<br />
<br />
Check the volume group size - it was extended from 2.99 GiB to 7.98 GiB.<br />
<br />
<pre class="brush: text">
# vgdisplay vg_orabin
--- Volume group ---
VG Name vg_orabin
System ID
Format lvm2
Metadata Areas 2
Metadata Sequence No 3
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 1
Open LV 0
Max PV 0
Cur PV 2
Act PV 2
VG Size 7.98 GiB
PE Size 4.00 MiB
Total PE 2044
Alloc PE / Size 766 / 2.99 GiB
Free PE / Size 1278 / 4.99 GiB
VG UUID h3N1o5-AlYF-9nkL-PXiB-P8HK-tGAa-GlXPa5
</pre>
<br />
<br />
Using the pvscan command, scan all disks and notice the physical volumes with free space.<br />
<br />
<pre class="brush: text">
# pvscan
PV /dev/sdb1 VG vg_orabin lvm2 [2.99 GiB / 0 free]
PV /dev/sdc1 VG vg_orabin lvm2 [4.99 GiB / 4.99 GiB free]
PV /dev/sda2 VG vg_linuxtest lvm2 [6.51 GiB / 0 free]
Total: 3 [14.49 GiB] / in use: 3 [14.49 GiB] / in no VG: 0 [0 ]
</pre>
<br />
<br />
With the lvdisplay command, display the logical volume properties.<br />
Notice LV Size = 2.99 GiB.<br />
<br />
<pre class="brush: text">
# lvdisplay /dev/vg_orabin/lv_orabin
--- Logical volume ---
LV Path /dev/vg_orabin/lv_orabin
LV Name lv_orabin
VG Name vg_orabin
LV UUID ypw9X1-vIsM-4rVF-NtVB-ACrf-f5nh-25p2sn
LV Write Access read/write
LV Creation host, time linuxtest.localdomain, 2014-10-23 13:19:56 +0200
LV Status available
# open 0
LV Size 2.99 GiB
Current LE 766
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 252:2
</pre>
<br />
<br />
I will add only 2G (of the 5G) using the lvextend command.<br />
<br />
<pre class="brush: text">
# lvextend -L +2G /dev/mapper/vg_orabin-lv_orabin /dev/sdc1
Extending logical volume lv_orabin to 4.99 GiB
Logical volume lv_orabin successfully resized
</pre>
<br />
<br />
Mount volume and check for free space.<br />
<br />
<pre class="brush: text">
# mount /dev/mapper/vg_orabin-lv_orabin /u01
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_linuxtest-lv_root
4.9G 2.8G 2.0G 59% /
tmpfs 770M 88K 770M 1% /dev/shm
/dev/sda1 485M 55M 405M 12% /boot
/dev/mapper/vg_orabin-lv_orabin
3.0G 69M 2.8G 3% /u01
</pre>
<br />
<br />
Resize filesystem using resize2fs command:<br />
<pre class="brush: text">
# resize2fs /dev/mapper/vg_orabin-lv_orabin
resize2fs 1.41.12 (17-May-2010)
Filesystem at /dev/mapper/vg_orabin-lv_orabin is mounted on /u01; on-line resizing required
old desc_blocks = 1, new_desc_blocks = 1
Performing an on-line resize of /dev/mapper/vg_orabin-lv_orabin to 1308672 (4k) blocks.
The filesystem on /dev/mapper/vg_orabin-lv_orabin is now 1308672 blocks long.
</pre>
<br />
<br />
Now I have 4.6G of free space on the "/u01" mount.<br />
<br />
<pre class="brush: text">
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_linuxtest-lv_root
4.9G 2.8G 2.0G 59% /
tmpfs 770M 88K 770M 1% /dev/shm
/dev/sda1 485M 55M 405M 12% /boot
/dev/mapper/vg_orabin-lv_orabin
5.0G 70M 4.6G 2% /u01
</pre>
<br />
<br />
<br />
===========================================
<br />
<br />
Now I will try to extend the root partition.<br />
<br />
Newer Oracle Linux releases use LVM by default during installation.<br />
Let’s see whether I can increase my root partition using the commands above.<br />
<br />
<br />
Display information about the logical volumes using the lvs command.<br />
<br />
<pre class="brush: text">
# lvs
LV VG Attr LSize Pool Origin Data% Move Log Cpy%Sync Convert
lv_root vg_linuxtest -wi-ao--- 4.97g
lv_swap vg_linuxtest -wi-ao--- 1.54g
lv_orabin vg_orabin -wi-a---- 4.99g
</pre>
<br />
Check free space.<br />
<pre class="brush: text">
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_linuxtest-lv_root
4.9G 2.8G 2.0G 59% /
tmpfs 770M 88K 770M 1% /dev/shm
/dev/sda1 485M 55M 405M 12% /boot
</pre>
<br />
<br />
Shut down the VM and add a new disk for extending the root partition.<br />
<br />
<br />
Partition the new disk and create a physical volume for LVM.<br />
<br />
<pre class="brush: text">
# fdisk /dev/sdd
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0xf0608435.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
switch off the mode (command 'c') and change display units to
sectors (command 'u').
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-652, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-652, default 652):
Using default value 652
Command (m for help): t
Selected partition 1
Hex code (type L to list codes): 8e
Changed system type of partition 1 to 8e (Linux LVM)
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
# pvcreate /dev/sdd1
Physical volume "/dev/sdd1" successfully created
</pre>
<br />
<br />
Check information about the volume group.<br />
<br />
<pre class="brush: text">
# vgdisplay vg_linuxtest
--- Volume group ---
VG Name vg_linuxtest
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 3
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 2
Open LV 2
Max PV 0
Cur PV 1
Act PV 1
VG Size 6.51 GiB
PE Size 4.00 MiB
Total PE 1666
Alloc PE / Size 1666 / 6.51 GiB
Free PE / Size 0 / 0
VG UUID TXkKYl-PIxu-s2xk-LsEB-sgTZ-TdcO-8wapCV
</pre>
<br />
Extend the volume group using the new physical volume.<br />
<br />
<pre class="brush: text">
# vgextend vg_linuxtest /dev/sdd1
Volume group "vg_linuxtest" successfully extended
</pre>
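<br />
A quick sanity check (sketch): vgs prints a one-line summary per volume group, including the newly added free space.<br />
<br />
<pre class="brush: text">
# vgs vg_linuxtest
</pre>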
<br />
Logical volume status.<br />
<pre class="brush: text">
# lvdisplay /dev/vg_linuxtest/lv_root
--- Logical volume ---
LV Path /dev/vg_linuxtest/lv_root
LV Name lv_root
VG Name vg_linuxtest
LV UUID VNgeT7-4yhd-XqRi-2da1-XTqT-qTvm-oVK2pz
LV Write Access read/write
LV Creation host, time linuxtest.localdomain, 2014-10-23 10:30:21 +0200
LV Status available
# open 1
LV Size 4.97 GiB
Current LE 1272
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 252:0
</pre>
<br />
<br />
Extend the logical volume.<br />
<pre class="brush: text">
# lvextend /dev/mapper/vg_linuxtest-lv_root /dev/sdd1
Extending logical volume lv_root to 9.96 GiB
Logical volume lv_root successfully resized
</pre>
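<br />
Note that lvextend was called without an explicit size here, so it grew lv_root by the entire new physical volume (4.97 GiB + 4.99 GiB, roughly the 9.96 GiB shown above). Some lvm2 versions insist on a size argument; in that case a sketch like the following, which allocates all remaining free space in the volume group, should give the same result:<br />
<br />
<pre class="brush: text">
# lvextend -l +100%FREE /dev/mapper/vg_linuxtest-lv_root
</pre>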
<br />
Resize filesystem.<br />
<br />
<pre class="brush: text">
# resize2fs /dev/mapper/vg_linuxtest-lv_root
resize2fs 1.41.12 (17-May-2010)
Filesystem at /dev/mapper/vg_linuxtest-lv_root is mounted on /; on-line resizing required
old desc_blocks = 1, new_desc_blocks = 1
Performing an on-line resize of /dev/mapper/vg_linuxtest-lv_root to 2611200 (4k) blocks.
The filesystem on /dev/mapper/vg_linuxtest-lv_root is now 2611200 blocks long.
</pre>
<br />
<br />
Check disk free space. Notice that I now have 6.6G of free space on the root filesystem.<br />
<br />
<pre class="brush: text">
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_linuxtest-lv_root
9.9G 2.8G 6.6G 30% /
tmpfs 770M 88K 770M 1% /dev/shm
/dev/sda1 485M 55M 405M 12% /boot
</pre>
<br />
<br />
<br />
<b>WARNING!</b>
Be very careful when using the commands from this blog post on a production system. These are dangerous commands that can cause data loss and many other problems. I’ve used these commands in my test environment for educational purposes, and it is possible that I have made mistakes in this demo. After all, I am only a simple Oracle DBA, not a Linux SA :-)
<br />
<br />
<br />
REFERENCES<br />
<a href="http://www.linuxuser.co.uk/features/resize-your-disks-on-the-fly-with-lvm">http://www.linuxuser.co.uk/features/resize-your-disks-on-the-fly-with-lvm</a><br />
<a href="http://www.rootusers.com/how-to-increase-the-size-of-a-linux-lvm-by-adding-a-new-disk/">http://www.rootusers.com/how-to-increase-the-size-of-a-linux-lvm-by-adding-a-new-disk/</a><br />
<a href="https://wiki.archlinux.org/index.php/LVM">https://wiki.archlinux.org/index.php/LVM</a><br />
<br />
</span>
<br />
<br />
<b>Using Oracle Flex ASM with single instance database</b><br />
<br />
Oracle Flex ASM was introduced in the 12c release. In my opinion, this is one of the best features of the new version.<br />
<br />
I won’t go into detail about Flex ASM because you can find more information in the documentation. In this post I will concentrate on how Flex ASM handles a crash of an ASM instance.<br />
<br />
For this test I’ve created a 2-node cluster - 12c Grid Infrastructure with Flex ASM enabled.<br />
<br />
<pre class="brush: text">
$ asmcmd showclustermode
ASM cluster : Flex mode enabled
<br />
$ srvctl config asm
ASM home: /u01/app/12.1.0/grid_1
Password file: +OCRVOTE/ASM/PASSWORD/pwdasm.256.853771307
ASM listener: LISTENER
ASM instance count: ALL
Cluster ASM listener: ASMNET1LSNR_ASM
<br />
$ srvctl status asm
ASM is running on cluster1,cluster2
</pre>
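<br />
With Flex ASM the number of ASM instances in the cluster (the ASM cardinality) is configurable. As a sketch, assuming 12c srvctl syntax, something like this would change it from ALL to a fixed count of three instances:<br />
<br />
<pre class="brush: text">
$ srvctl modify asm -count 3
</pre>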
<br />
<br />
<span id="fullpost">
Install a single instance database on one of the nodes.<br />
<br />
<pre class="brush: text">
$ ./dbca -silent \
> -createDatabase \
> -templateName General_Purpose.dbc \
> -gdbName singl12 \
> -sid singl12 \
> -sysPassword oracle \
> -SystemPassword oracle \
> -emConfiguration none \
> -recoveryAreaDestination FRA \
> -storageType ASM \
> -asmSysPassword oracle \
> -diskGroupName DATA \
> -characterSet AL32UTF8 \
> -nationalCharacterSet AL16UTF16 \
> -totalMemory 768
<br />
Copying database files
1% complete
3% complete
10% complete
17% complete
24% complete
31% complete
35% complete
Creating and starting Oracle instance
37% complete
42% complete
47% complete
52% complete
53% complete
56% complete
58% complete
Registering database with Oracle Restart
64% complete
Completing Database Creation
68% complete
71% complete
75% complete
85% complete
96% complete
100% complete
Look at the log file "/u01/app/orcl12/cfgtoollogs/dbca/singl12/singl12.log" for further details.
</pre>
<br />
<br />
The single instance database is registered in the OCR.<br />
<br />
<pre class="brush: text">
$ srvctl config database -d singl12
Database unique name: singl12
Database name: singl12
Oracle home: /u01/app/orcl12/product/12.1.0/dbhome_1
Oracle user: orcl12
Spfile: +DATA/singl12/spfilesingl12.ora
Password file:
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: singl12
Database instance: singl12
Disk Groups: DATA
Mount point paths:
Services:
Type: SINGLE <<<<<-------
Database is administrator managed
</pre>
<br />
V$ASM_CLIENT shows that my database is connected to an Oracle ASM instance.<br />
<br />
<pre class="brush: sql">
SQL> select instance_name, db_name, status
2 from v$asm_client
3 where db_name='singl12';
INSTANCE_NAME DB_NAME STATUS
-------------------- -------- ------------
singl12 singl12 CONNECTED
</pre>
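<br />
The same client information is also available from asmcmd; lsct lists the clients connected through a given disk group (a sketch, run from the grid environment):<br />
<br />
<pre class="brush: text">
$ asmcmd lsct DATA
</pre>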
<br />
<br />
Check that ASM instances are running on both nodes.<br />
<br />
<pre class="brush: text">
$ ./crsctl status resource ora.asm
NAME=ora.asm
TYPE=ora.asm.type
TARGET=ONLINE , ONLINE
STATE=ONLINE on cluster2, ONLINE on cluster1
</pre>
<br />
<br />
My database is running on the cluster1 node.<br />
<br />
<pre class="brush: sql">
$ srvctl status database -d singl12
Instance singl12 is running on node cluster1
<br />
SQL> select instance_name, host_name from v$instance;
INSTANCE_NAME HOST_NAME
--------------- --------------------
singl12 cluster1.localdomain
</pre>
<br />
Now I will simulate a crash of the ASM instance on the cluster1 node where my database is running.<br />
<br />
<pre class="brush: text">
# ps -ef|grep asm_pmon|grep -v grep
oracle 3072 1 0 10:12 ? 00:00:01 asm_pmon_+ASM1
# kill -9 3072
</pre>
<br />
Without Flex ASM I would expect a crash of the ASM instance to take down the database instance as well, but with Flex ASM my database stays up and running.<br />
<br />
Check the alert log of the database instance:<br />
<pre class="brush: text">
...
NOTE: ASMB registering with ASM instance as client 0x10005 (reg:2156157897)
NOTE: ASMB connected to ASM instance +ASM2 (Flex mode; client id 0x10005)
NOTE: ASMB rebuilding ASM server state
NOTE: ASMB rebuilt 1 (of 1) groups
NOTE: ASMB rebuilt 13 (of 13) allocated files
NOTE: fetching new locked extents from server
NOTE: 0 locks established; 0 pending writes sent to server
SUCCESS: ASMB reconnected & completed ASM server state
</pre>
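<br />
To watch the reconnection happen live, one can tail the database alert log while killing the ASM instance. A sketch, with the path assumed from the default ADR layout:<br />
<br />
<pre class="brush: text">
$ tail -f $ORACLE_BASE/diag/rdbms/singl12/singl12/trace/alert_singl12.log
</pre>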
<br />
Note the line - "NOTE: ASMB connected to ASM instance +ASM2 (Flex mode; client id 0x10005)"<br />
<br />
As the +ASM1 instance crashed, ASMB connected to the +ASM2 instance.<br />
<br />
<br />
Check status:<br />
<pre class="brush: sql">
# ./crsctl status resource ora.asm
NAME=ora.asm
TYPE=ora.asm.type
TARGET=ONLINE , ONLINE
STATE=ONLINE on cluster2, INTERMEDIATE on cluster1
<br />
SQL> select instance_name, host_name from v$instance;
INSTANCE_NAME HOST_NAME
--------------- --------------------
singl12 cluster1.localdomain
</pre>
<br />
Oracle Clusterware restarted the crashed ASM instance and both instances were back up within a minute.<br />
<br />
<pre class="brush: text">
# ./crsctl status resource ora.asm
NAME=ora.asm
TYPE=ora.asm.type
TARGET=ONLINE , ONLINE
STATE=ONLINE on cluster2, ONLINE on cluster1
</pre>
<br />
Now to test a crash of the ASM instance on the second node.<br />
<br />
<pre class="brush: sql">
SQL> select instance_name from v$instance;
INSTANCE_NAME
----------------
+ASM2
SQL> shutdown abort;
ASM instance shutdown
</pre>
<br />
Excerpt from the alert log:<br />
<br />
<pre class="brush: text">
...
Fri Jul 25 12:44:33 2014
NOTE: ASMB registering with ASM instance as client 0x10005 (reg:4169355750)
NOTE: ASMB connected to ASM instance +ASM1 (Flex mode; client id 0x10005)
NOTE: ASMB rebuilding ASM server state
NOTE: ASMB rebuilt 1 (of 1) groups
NOTE: ASMB rebuilt 13 (of 13) allocated files
NOTE: fetching new locked extents from server
NOTE: 0 locks established; 0 pending writes sent to server
SUCCESS: ASMB reconnected & completed ASM server state
</pre>
<br />
<br />
Again, a user connected to the database instance didn’t even notice that something was happening with ASM.<br />
<br />
Flex ASM enables ASM instances to run on separate nodes from the database servers. If an ASM instance fails, the database fails over to another available ASM instance.<br />
<br />
If you are running pre-12c databases on your cluster you can still configure Flex ASM, but you are required to run local ASM instances on those nodes. ASM instance failover won’t work for 10g or 11g databases.<br />
<br />
Good reason to move towards 12c? ;-)<br />
<br />
<br />
<br />
</span>