Wednesday, June 25, 2014

RAC One Node 12cR1 Instance placement after failover

RAC One Node is Oracle's active-passive solution, very similar to RAC, which is an active-active solution.
Running an active-passive setup on Oracle Clusterware was possible even before RAC One Node was introduced, but RAC One Node brings the configuration and management of an active-passive cluster to another level.

In my test environment I've created an active-passive cluster using 12cR1 Grid Infrastructure and a 12cR1 RAC One Node database.

I've tested failover of the database and noticed interesting behavior regarding instance placement.


Configuration of my database:
$ srvctl config database -d one12
Database unique name: one12
Database name: one12
Oracle home: /u01/app/orcl12/product/12.1.0/dbhome_1
Oracle user: orcl12
Spfile: +DATA/one12/spfileone12.ora
Password file: +DATA/one12/orapwone12
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: one12
Database instances:
Disk Groups: DATA,FRA
Mount point paths:
Services: one12app
Type: RACOneNode
Online relocation timeout: 30
Instance name prefix: one12
Candidate servers: cluster1,cluster2
Database is administrator managed

Notice that the type is RACOneNode, which is what identifies "one12" as a RAC One Node database.
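
As an aside, an administrator-managed RAC database can be converted to (and back from) the RACOneNode type with srvctl convert database. A minimal sketch, assuming the database is already registered in the cluster as "one12" (option syntax can differ between versions, so check srvctl convert database -help first):

$ srvctl convert database -db one12 -dbtype RACONENODE
$ srvctl convert database -db one12 -dbtype RAC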


Check status of database.
$ srvctl status database -d one12
Instance one12_1 is running on node cluster1
Online relocation: INACTIVE

SQL> col instance_name for a20
SQL> col host_name for a20
SQL> select host_name, instance_name from v$instance;

HOST_NAME            INSTANCE_NAME
-------------------- --------------------
cluster1.localdomain one12_1

Instance one12_1 is running on node cluster1.


Let's relocate the database to the other node. The -w 5 option sets the online relocation timeout to 5 minutes for this relocation.
$ srvctl relocate database -d one12 -n cluster2 -w 5 -v
Configuration updated to two instances
Instance one12_2 started
Services relocated
Waiting for up to 5 minutes for instance one12_1 to stop ...
Instance one12_1 stopped
Configuration updated to one instance

Check status.
$ srvctl status database -d one12
Instance one12_2 is running on node cluster2
Online relocation: INACTIVE

SQL> col instance_name for a20
SQL> col host_name for a20
SQL> select host_name, instance_name from v$instance;

HOST_NAME            INSTANCE_NAME
-------------------- --------------------
cluster2.localdomain one12_2

After the relocation the active node became cluster2. Instance one12_2 was brought up on the cluster2 node while one12_1 was shut down on the cluster1 node.



Now I will simulate an instance crash by killing the PMON process on the cluster2 node. Clusterware will restart the failed instance a few times on the same node and then relocate the database to the cluster1 node.

You can check in the resource profile how many restart attempts Clusterware will perform before failing over to the other node.

$ ./crsctl stat res ora.one12.db -p|grep "RESTART_ATTEMPTS"

RESTART_ATTEMPTS=2

In my case Clusterware will try twice to restart the failed instance before relocating it to the other node.
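
RESTART_ATTEMPTS is not the only attribute that matters here; as I understand the Clusterware resource attributes, FAILURE_INTERVAL, FAILURE_THRESHOLD and UPTIME_THRESHOLD also influence how failures are counted and when RESTART_COUNT and FAILURE_COUNT are reset. A quick way to list them all from the resource profile (the values you get depend on your configuration):

$ ./crsctl stat res ora.one12.db -p | \
> grep -E "RESTART_ATTEMPTS|FAILURE_INTERVAL|FAILURE_THRESHOLD|UPTIME_THRESHOLD"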

Check status of resource.
$ ./crsctl stat res ora.one12.db -v |\
> grep -w 'STATE\|RESTART_COUNT\|FAILURE_COUNT'

STATE=ONLINE on cluster2
RESTART_COUNT=0
FAILURE_COUNT=0

Kill PMON on cluster2 node.
$ ps -ef|grep pmon_one12|grep -v grep
orcl12    4503     1  0 12:19 ?        00:00:00 ora_pmon_one12_2

$ kill -9 4503

Wait a short time for Clusterware to restart the instance and check the status.
$ ./crsctl stat res ora.one12.db -v |\
> grep -w 'STATE\|RESTART_COUNT\|FAILURE_COUNT'

STATE=ONLINE on cluster2
RESTART_COUNT=1
FAILURE_COUNT=0

Clusterware restarted the instance on the same node. Notice that RESTART_COUNT is now 1.

Kill PMON again.
$ ps -ef|grep pmon_one12|grep -v grep
orcl12    5262     1  0 12:37 ?        00:00:00 ora_pmon_one12_2

$ kill -9 5262

Check status.
$ ./crsctl stat res ora.one12.db -v |\
> grep -w 'STATE\|RESTART_COUNT\|FAILURE_COUNT'

STATE=ONLINE on cluster2
RESTART_COUNT=2
FAILURE_COUNT=0

RESTART_COUNT is now 2. The next time I kill the PMON process, Clusterware should relocate the instance to the cluster1 node.

$ ps -ef|grep pmon_one12|grep -v grep
orcl12    5588     1  0 12:40 ?        00:00:00 ora_pmon_one12_2

$ kill -9 5588

Check status.
$ ./crsctl stat res ora.one12.db -v |\
> grep -w 'STATE\|RESTART_COUNT\|FAILURE_COUNT'

STATE=ONLINE on cluster1
RESTART_COUNT=0
FAILURE_COUNT=1

Clusterware started the instance on the cluster1 node as expected. STATE is now ONLINE on cluster1, RESTART_COUNT was reset to 0 and FAILURE_COUNT is now 1.


Now the interesting part.
$ srvctl status database -d one12
Instance one12_2 is running on node cluster1
Online relocation: INACTIVE

SQL> col instance_name for a20
SQL> col host_name for a20
SQL> select host_name, instance_name from v$instance;

HOST_NAME            INSTANCE_NAME
-------------------- --------------------
cluster1.localdomain one12_2

Notice the instance placement: instance one12_2 is now running on the cluster1 node. Clusterware kept the instance name one12_2 after the failover to cluster1.
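
You can also confirm the placement from inside the database, since V$INSTANCE exposes the instance number and redo thread alongside the instance name. A quick check (in my environment I would expect one12_2 to keep instance number and thread 2 even while it runs on cluster1, but verify this in your own setup):

SQL> select host_name, instance_name, instance_number, thread# from v$instance;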


Can we start instance one12_1 on the cluster1 node manually?
$ srvctl stop database -d one12

$ srvctl start instance -db one12 -n cluster1 -instance one12_1
PRKO-2136 : 'srvctl start/stop/enable/disable/modify/status/setenv/getenv/unsetenv instance' commands are not supported with RAC One Node databases

This operation is not supported with RAC One Node.


Start database on cluster2 node.
$ srvctl start database -d one12 -n cluster2

$ srvctl status database -d one12
Instance one12_2 is running on node cluster2
Online relocation: INACTIVE

SQL> col instance_name for a20
SQL> col host_name for a20
SQL> select host_name, instance_name from v$instance;

HOST_NAME            INSTANCE_NAME
-------------------- --------------------
cluster2.localdomain one12_2

Instance one12_2 is started on the cluster2 node.

Relocate database to cluster1 node.
$ srvctl relocate database -d one12 -n cluster1 -w 5 -v
Configuration updated to two instances
Instance one12_1 started
Services relocated
Waiting for up to 5 minutes for instance one12_2 to stop ...
Instance one12_2 stopped
Configuration updated to one instance

Check status.
$ srvctl status database -d one12
Instance one12_1 is running on node cluster1
Online relocation: INACTIVE

SQL> col instance_name for a20
SQL> col host_name for a20
SQL> select host_name, instance_name from v$instance;

HOST_NAME            INSTANCE_NAME
-------------------- --------------------
cluster1.localdomain one12_1

I have "one12_1" instance running on "cluster1" node again.


From the documentation:
Using the Single Client Access Name (SCAN) to connect to the database, clients can locate the service independently of the node on which it is running.

Since clients should connect to the database using the service name (as Oracle recommends), instance placement shouldn't really matter; see the connection sketch below.
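
For illustration, a client alias for the one12app service would point at the SCAN rather than at any particular node, so the connection follows the service wherever it runs, regardless of whether the underlying instance is one12_1 or one12_2. A minimal tnsnames.ora sketch (the SCAN name cluster-scan.localdomain and port 1521 are assumptions for this example):

ONE12APP =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = cluster-scan.localdomain)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = one12app)
    )
  )

The equivalent EZConnect string would be something like: sqlplus user@//cluster-scan.localdomain:1521/one12app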



Comments:

  1. "As clients should connect to database using service name (Oracle recommendation) instance placement shouldn't be important."

    ...except in the case where you are monitoring via Cloud Control, where an instance that was discovered is suddenly not running. Cloud Control still doesn't seem to understand that in the context of One Node, a "missing" instance is not necessarily a sign of fault.
