Troubleshooting memory usage

Tuesday, January 10, 2012 4 comments
Last few days we had significant load problems with one of our database servers. Server was choking under minor database load which caused applications to be non-responsive.

This is 8-core 32-bit OEL5 (2.6.18-92.el5PAE) server with 4GB RAM running two Oracle 10g (10.2.0.4) SE databases. SGA's of those databases where 1,2 GB and 700MB with pga_aggregate_target 64M and 128M.

As databases were small (under 50G) and not very active I didn't expected any performance problems.


Using vmstat, mpstat, top and sar tools I've noticed that we were having slight issues with swapping, despite that there was enough physical memory.

So I've started troubleshooting...

First I've found that one of top running processes was "kswapd0" process.
This is Kernel Swap Deamon responsible for scanning memory to locate free pages, and scheduling dirty pages to be written to disk. It is OK to occasionally notice kswapd running, but it shouldn't continuously appear in top processes. In some situations you can resolve kswapd issues using huge pages but this wasn't the case here.

I've decided to check several OS parameters related to swapping to find possible cause of our issues.

Parameters below were specified in "/etc/sysctl.conf":
...
vm.overcommit_memory=1
vm.swappiness=0
vm.min_free_kbytes=262144
vm.page-cluster=5
...

Considering that I didn't installed this server, this parameters were little odd to me. All of them were custom parameters not required from install guide.

The swappiness variable bothered me the most. It tells Linux kernel how fast it can move inactive pages to swap area. Default value is 60, but in this case value was changed to 0 which tells the Linux kernel to avoid swapping as much as possible.

As this was production server which was still choking we had to react quickly without prior testing. I haven't seen the purpose of vm custom parameters on this server so I've decided to comment them and apply only default parameters required from install guide. It's easy to add custom parameters later if there is a need.

After I applied parameters running "sysctl -p" load average suddenly started to fall. Several minutes later it fell from 13 to 1. From vmstat I've noticed significant drop in swapping and kswapd0 wasn't in top processes anymore.

With altering swappiness to 60 (default) I've told kernel to find inactive pages more aggressively and swap them out to disk.
Probably that helped OS to use memory more effectively.

I will closely monitor performance of this server in next few days but so far it is working very good.

Considerations for future:
- enable Huge Pages
- migrate os/db to 64bit

When I'm installing database servers I tend to use parameters specified in install guide or Oracle support site. Eventually I will add some custom parameters but only if I have proper reason for that. Even then it is wise to test changes before applying them in production.

So my conclusion for this case would be - do not add custom parameters without proper reason.
Google - "Compulsive Tuning Disorder" :)


Read More...

IMPDP creates indexes with parallel degree 1 during import

Wednesday, August 31, 2011 4 comments
Last few days I’ve been playing with export and import of 750 GB database. During this whole trial and error process I’ve learned several things that were unfamiliar to me. In this blog post I want to share some of the stuff that I learnt.

I have Oracle 10.2.0.4 database on Solaris 64bit which I want to migrate to Oracle 11.1.0.7 on Linux 64bit. It is 750GB large database with mostly partitioned tables (compressed partitions).


Let's get to the subject of the post.

I won't talk much about export operation - just to mention that it is important to specify this bolded line in parameter file:
dumpfile=exportdb_%U.dmp
filesize=30G


This means that multiple 30GB sized dump files will be generated during export. It is important to specify %U expression or to manually define multiple dump files if you want to get the most from parallel processing during export and import.

In my case - after performing full export of the database I've created 20 dump files which is about 600GB of export data.


Now comes the tricky part - importing database to the new environment.

Data import was very fast as it finished in below 2,5 hours. During import server was fully utilized using all available resources in parallel processing. When index creation started utilization of the server dropped to about 10%.

Only one worker performed the entire work, while others were waiting - even with parallel option specified.

Then I found that this was the result of the very nasty bug:
“Bug 8604502 - IMPDP creates indexes with parallel degree 1 during import”

This issue is fixed in:
12.1 (Future Release)
11.2.0.2 (Server Patch Set)
11.2.0.1 Bundle Patch 7 for Exadata Database
11.1.0.7 Patch 24 on Windows Platforms

Great, I don’t have fix for this bug.
With only one worker creating indexes this import could last for days.


For a quick solution I’ve decided to perform import in two steps. Firstly I will import everything but indexes and in next step I will import just indexes.

Data import again lasted below 2,5 hours and it finished without any problems fully utilizing the server.

For index creation I’ve generated CREATE INDEX script using SQLFILE option in IMPDP.

*parameter file*
userid=xxx/yyy
sqlfile=createidx.sql
dumpfile=exportdb_%U.dmp
include=INDEX
directory=export_dir
schemas=(...)


As SQLFILE generates CREATE INDEX scripts with parallel degree set to 1 I’ve decided to manually edit script and replace parallel 1 to parallel 16.

Besides parallel 16 option, I’ve specified NOLOGGING option to restrict generation of large redo log. As I expected creation of large indexes I’ve allocated large enough temporary tablespace and specified higher pga_aggregate_target parameter.

Then executed script from sqlplus and monitored server utilization and waits.

Script finished in about 3 hours which was nice improvement.
Server was fully utilized - mostly using 16 processes for index creation.

Main waits were:
- WAITING direct path read
- WAITING direct path read temp
- WAITING direct path write temp

This waits were expected as temporary space was extensively used.


Later I’ve found nice article from great Oracle expert Randolf Geist - Concurrent Index Creation. As a solution for the same problem he wrote tool for concurrent index creation.
I plan to test this approach in my environment.


Read More...

Wrong results with 'hash group by' aggregation on 10.2.0.2

Thursday, July 7, 2011 0 comments
If you are running 10.2.0.2 in your production environment check document [ID 4604970.8] on Oracle support site.


This week colleague noticed that query, he wanted to use for some reports, returns different results than he expected. He made some additional checks using Microsof Excel and confirmed that something is wrong. He was getting incorrect results.

My first suspicion was that probably something is wrong with query or maybe logical corruption happened.

Example of the queries (I’ve changed names of the table/columns):

-- partitioned_table (date range partitioned table)

create table temp_tab_2011
as
select col, date_to,  sum (col1) col1, 
       sum (col2) f_col1, sum (col3) col3, 
    sum (col4) col4, 
    sum (col5) col5, count(*) cnt
from partitioned_table t
where t.date_to >= to_date ('01.02.2011 00:00:00', 'dd.mm.yyyy HH24:MI:SS')
    and t.date_to < to_date ('01.07.2011 00:00:00', 'dd.mm.yyyy HH24:MI:SS')
group by col, date_to
/

create table temp_tab_2010_2011
as
select col, date_to, sum (col1) col1, 
       sum (f_col1) f_col1, sum (col3) col3, 
    sum (col4) col4, 
    sum (col5) col5, count(*) cnt
from partitioned_table t
where t.date_to >= to_date ('01.02.2010 00:00:00', 'dd.mm.yyyy HH24:MI:SS')
    and t.date_to < to_date ('01.07.2011 00:00:00', 'dd.mm.yyyy HH24:MI:SS')
group by col, date_to
/

-- COMPARASION QUERY
select j.col, j.col1, j.col2, j.col3, j.cnt, d.col, d.col1, d.col2, d.col3, d.cnt, j.col1 - d.col1
from
temp_tab_2011 j, temp_tab_2010_2011 d
where 1=1
and j.col = d.col
and j.date_to = d.date_to
and j.date_to = to_date ('01.06.2011 00:00:00', 'dd.mm.yyyy HH24:MI:SS')
and d.date_to = to_date ('01.06.2011 00:00:00', 'dd.mm.yyyy HH24:MI:SS')
and (j.col1 - d.col1) <> 0
/


We expected 0 rows returned for comparison query but we received some rows - which was serious issue.



I’ve restored that database on another machine. Checked files for logical corruption, rebuilded indexes into another tablespace, rebuilded table into another tablespace and re-created temporary tablespace - but still we were receiving wrong results.


Luckily for us I’ve noticed this blog post during my searches for answer on Google:
http://www.oracloid.com/2006/05/hash-group-by-can-give-wrong-result-in-oracle-102/

Queries were using this plan:
(click "view source" to get better output)
----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name                          | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | CREATE TABLE STATEMENT               |                               |       |       |       | 43069 (100)|          |       |       |
|   1 |  LOAD AS SELECT                      |                               |       |       |       |            |          |       |       |
|   2 |   PARTITION RANGE ITERATOR           |                               |   631K|    50M|       | 39729   (1)| 00:13:08 |    57 |    74 |
|   3 |    HASH GROUP BY                     |                               |   631K|    50M|   219M| 39729   (1)| 00:13:08 |       |       |
|   4 |     TABLE ACCESS BY LOCAL INDEX ROWID| PARTITIONED_TABLE             |  1557K|   124M|       | 20999   (1)| 00:06:56 |    57 |    74 |
|*  5 |      INDEX RANGE SCAN                | IX_DATE                       |  1557K|       |       |  3874   (1)| 00:01:17 |    57 |    74 |
----------------------------------------------------------------------------------------------------------------------------------------------

In our version 10.2.0.2 wrong results are possible from the result of a non-distinct aggregation with a group by when HASH GROUB BY is used. We were hitting that bug.


In session level we changed undocumented parameter "_gby_hash_aggregation_enabled" to FALSE and executed queries again. Result was as expected - 0 rows returned.

It is noted in Metalink note that this issue is fixed in 10.2.0.3 version so I’ve upgraded test instance to 10.2.0.4 version and that solved problems with incorrect results.


This is nasty Oracle bug and if you are running 10.2.0.1 or 10.2.0.2 check your results :)


Read More...

Recover lost datafile without valid backup

Thursday, June 9, 2011 4 comments
I’ve decided to perform this test case and publish blog post after reading this topic on OTN forums - http://bit.ly/mLwaA9

How to recover lost datafile when you don’t have valid backup of your database. Your database is running in archivelog mode and you have all necessary archive logs.

In situation when you don’t have valid backup and your datatabase is running in noarchivelog mode - your datafile is lost forever.

So take regular backups of your database and enable archivelog mode ;-)


Environment for this test:
OS: OEL 5
DB: Oracle EE 10.2.0.4

To check am I running my database in archivelog mode.
SQL> archive log list;
Database log mode              Archive Mode
Automatic archival             Enabled
Archive destination            USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence     70
Next log sequence to archive   72
Current log sequence           72


I will create new tablespace for my test and name it - TS_TEST.

SQL> create tablespace ts_test datafile '/u01/oradata/orcl/ts_test01.dbf' size 256M;

Tablespace created.


Now to create table in TS_TEST tablespace. I will fill this table with 1000000 rows using Tom Kyte's script.

SQL> create table bigtab tablespace ts_test
  2  as
  3  select rownum id, a.*
  from all_objects a
 where 1=0;  4    5
 
 
 
SQL> select tablespace_name, segment_name from 
2         dba_segments where segment_name = 'BIGTAB';

TABLESPACE_NAME                SEGMENT_NAME
--------------- ---------------------------------------
TS_TEST                        BIGTAB
 
--
-- I'VE EXCLUDED SCRIPT LINES CAUSE
-- IT'S NOT IMPORTANT FOR THIS TEST CASE
--
SQL> select count(*) from bigtab;

  COUNT(*)
----------
   1000000



I can see in alert log that there were 5 log switches during insert.

Starting control autobackup
Control autobackup written to DISK device
        handle '/u01/app/oracle/flash_recovery_area/ORCL/autobackup/2011_06_09/o1_mf_s_753372242_6z1folfy_.bkp'
Completed: create tablespace ts_test datafile '/u01/oradata/orcl/ts_test01.dbf' size 256M
Thu Jun  9 14:07:42 2011
Thread 1 advanced to log sequence 73 (LGWR switch)
  Current log# 1 seq# 73 mem# 0: /u01/oradata/orcl/redo01.log
Thu Jun  9 14:07:45 2011
Thread 1 advanced to log sequence 74 (LGWR switch)
  Current log# 2 seq# 74 mem# 0: /u01/oradata/orcl/redo02.log
Thu Jun  9 14:10:55 2011
Thread 1 cannot allocate new log, sequence 75
Checkpoint not complete
  Current log# 2 seq# 74 mem# 0: /u01/oradata/orcl/redo02.log
Thu Jun  9 14:10:56 2011
Thread 1 advanced to log sequence 75 (LGWR switch)
  Current log# 3 seq# 75 mem# 0: /u01/oradata/orcl/redo03.log
Thu Jun  9 14:10:58 2011
Thread 1 advanced to log sequence 76 (LGWR switch)
  Current log# 1 seq# 76 mem# 0: /u01/oradata/orcl/redo01.log
Thread 1 cannot allocate new log, sequence 77
Checkpoint not complete
  Current log# 1 seq# 76 mem# 0: /u01/oradata/orcl/redo01.log
Thu Jun  9 14:11:00 2011
Thread 1 advanced to log sequence 77 (LGWR switch)
  Current log# 2 seq# 77 mem# 0: /u01/oradata/orcl/redo02.log



To simulate disaster I will delete “ts_test01.dbf” datafile using OS command and flush buffer cache several times.

SQL> !rm /u01/oradata/orcl/ts_test01.dbf

SQL> alter system flush buffer_cache;

System altered.

SQL> /

System altered.

SQL> /

System altered.


Now my query throws an error because datafile 8 is missing.

SQL> select count(*) from bigtab;
select count(*) from bigtab
*
ERROR at line 1:
ORA-01116: error in opening database file 8
ORA-01110: data file 8: '/u01/oradata/orcl/ts_test01.dbf'
ORA-27041: unable to open file
Linux Error: 2: No such file or directory
Additional information: 3


How to recover this datafile when you don’t have valid backup to restore it?

Trick is to create new datafile based on the file “ts_test01.dbf” using command:
ALTER DATABASE CREATE DATAFILE 'xxxx.dbf' AS 'xxxx.dbf' and fill it with data from archive logs.


SQL> alter database create datafile '/u01/oradata/orcl/ts_test01.dbf' as '/u01/oradata/orcl/ts_test01.dbf';
alter database create datafile '/u01/oradata/orcl/ts_test01.dbf' as '/u01/oradata/orcl/ts_test01.dbf'
*
ERROR at line 1:
ORA-01182: cannot create database file 8 - file is in use or recovery
ORA-01110: data file 8: '/u01/oradata/orcl/ts_test01.dbf'

To avoid error - before creating new datafile, you must take the existing datafile (or the tablespace) offline.


SQL> alter database datafile 8 offline drop;

Database altered.


SQL> alter database create datafile '/u01/oradata/orcl/ts_test01.dbf' as '/u01/oradata/orcl/ts_test01.dbf';

Database altered.



Now I have new empty datafile which I will fill with records applying archive logs over it during recovery.

SQL> recover datafile 8;
ORA-00279: change 92069000 generated at 06/09/2011 14:03:59 needed for thread 1
ORA-00289: suggestion :
/u01/app/oracle/flash_recovery_area/ORCL/archivelog/2011_06_09/o1_mf_1_72_%u_.arc
ORA-00280: change 92069000 for thread 1 is in sequence #72


Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 92069587 generated at 06/09/2011 14:07:42 needed for thread 1
ORA-00289: suggestion :
/u01/app/oracle/flash_recovery_area/ORCL/archivelog/2011_06_09/o1_mf_1_73_%u_.arc
ORA-00280: change 92069587 for thread 1 is in sequence #73
ORA-00278: log file
'/u01/app/oracle/flash_recovery_area/ORCL/archivelog/2011_06_09/o1_mf_1_72_6z1fw
gvn_.arc' no longer needed for this recovery


Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 92069713 generated at 06/09/2011 14:07:45 needed for thread 1
ORA-00289: suggestion :
/u01/app/oracle/flash_recovery_area/ORCL/archivelog/2011_06_09/o1_mf_1_74_%u_.arc
ORA-00280: change 92069713 for thread 1 is in sequence #74
ORA-00278: log file
'/u01/app/oracle/flash_recovery_area/ORCL/archivelog/2011_06_09/o1_mf_1_73_6z1fw
kyj_.arc' no longer needed for this recovery


Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 92070100 generated at 06/09/2011 14:10:56 needed for thread 1
ORA-00289: suggestion :
/u01/app/oracle/flash_recovery_area/ORCL/archivelog/2011_06_09/o1_mf_1_75_%u_.arc
ORA-00280: change 92070100 for thread 1 is in sequence #75
ORA-00278: log file
'/u01/app/oracle/flash_recovery_area/ORCL/archivelog/2011_06_09/o1_mf_1_74_6z1g2
jh1_.arc' no longer needed for this recovery


Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 92071034 generated at 06/09/2011 14:10:58 needed for thread 1
ORA-00289: suggestion :
/u01/app/oracle/flash_recovery_area/ORCL/archivelog/2011_06_09/o1_mf_1_76_%u_.arc
ORA-00280: change 92071034 for thread 1 is in sequence #76
ORA-00278: log file
'/u01/app/oracle/flash_recovery_area/ORCL/archivelog/2011_06_09/o1_mf_1_75_6z1g2
lbg_.arc' no longer needed for this recovery


Specify log: {=suggested | filename | AUTO | CANCEL}

ORA-00279: change 92071751 generated at 06/09/2011 14:11:00 needed for thread 1
ORA-00289: suggestion :
/u01/app/oracle/flash_recovery_area/ORCL/archivelog/2011_06_09/o1_mf_1_77_%u_.arc
ORA-00280: change 92071751 for thread 1 is in sequence #77
ORA-00278: log file
'/u01/app/oracle/flash_recovery_area/ORCL/archivelog/2011_06_09/o1_mf_1_76_6z1g2
nmz_.arc' no longer needed for this recovery


Specify log: {=suggested | filename | AUTO | CANCEL}

Log applied.
Media recovery complete.

Recovery is successfully completed and now to alter datafile 8 to be online again.



SQL> alter database datafile 8 online;

Database altered.


-- CHECK
SQL> select count(*) from bigtab;

  COUNT(*)
----------
   1000000


-- CLEANUP

SQL> drop tablespace ts_test including contents and datafiles;

Tablespace dropped.

Read More...

Oracle Listener crashes with core dump in Linux 64bit

Wednesday, June 8, 2011 3 comments
We had serious problems with Oracle TNS Listener last few days. Problems were happening on very critical system - most active OLTP database.

Every few hours/minutes, in irregular intervals, TNS listener suddenly crashes which forces our Red Hat Cluster Manager to restart instance and even switch database to another node. Business systems that have relied on this database were experiencing great difficulties.

It is worth to mention that everything worked without any problems for about 2 years without any intervention on OS or database configuration.

Environment:
OS: Red Hat Enterprise Linux ES release 4 (64bit)
DB: Oracle EE 10.2.0.4.0



From Listener log:
…
…
…
07-LIP-2011 11:19:21 * (CONNECT_DATA=(SERVICE_NAME=odb )(CID=(PROGRAM=httpd)(HOST=rad2)(USER=apache))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.3.13.49)(PORT=58405)) * establish * odb * 12518
TNS-12518: TNS:listener could not hand off client connection
TNS-12547: TNS:lost contact
 TNS-12560: TNS:protocol adapter error
  TNS-00517: Lost contact
    Linux Error: 32: Broken pipe
07-LIP-2011 11:19:21 * service_update * odb * 0
07-LIP-2011 11:19:21 * service_update * odb * 0
07-LIP-2011 11:19:21 * (CONNECT_DATA=(SERVICE_NAME=otp)(CID=(PROGRAM=httpd)(HOST=rad2)(USER=apache))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.3.13.49)(PORT=58402)) * establish * odb * 0

TNSLSNR for Linux: Version 10.2.0.4.0 - Production on 07-LIP-2011 11:21:23

Copyright (c) 1991, 2007, Oracle.  All rights reserved.

System parameter file is /u01/app/oracle/product/10.2/db_1/network/admin/listener.ora
Log messages written to /u01/app/oracle/product/10.2/db_1/network/log/listener_odb.log
Trace information written to /u01/app/oracle/product/10.2/db_1/network/trace/listener_odb.trc
Trace level is currently 0

Started with pid=9607
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.2.10.40)(PORT=1521)))
Listener completed notification to CRS on start
…
…
...

….
….
….
07-LIP-2011 11:10:16 * (CONNECT_DATA=(SERVICE_NAME=UDB)(CID=(PROGRAM=)(HOST=POP1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.3.12.71)(PORT=44557)) * establish * UDB * 12518
TNS-12518: TNS:listener could not hand off client connection
TNS-12571: TNS:packet writer failure
 TNS-12560: TNS:protocol adapter error
  TNS-00530: Protocol adapter error
    Linux Error: 104: Connection reset by peer

TNSLSNR for Linux: Version 10.2.0.4.0 - Production on 07-LIP-2011 11:11:01

Copyright (c) 1991, 2007, Oracle.  All rights reserved.

System parameter file is /u01/app/oracle/product/10.2/db_1/network/admin/listener.ora
Log messages written to /u01/app/oracle/product/10.2/db_1/network/log/listener_udb.log
Trace information written to /u01/app/oracle/product/10.2/db_1/network/trace/listener_udb.trc
Trace level is currently 0

Started with pid=21862
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.2.10.39)(PORT=1521)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC0)))
Listener completed notification to CRS on start

TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE
WARNING: Subscription for node down event still pending
…
…



From /var/log/messages:
tnslsnr[31928]: segfault at 0000000000000018 rip 0000003d34268593 rsp 0000007fbfff9ba0 error 4
tnslsnr[21862]: segfault at 0000000000000018 rip 0000003c00367f2d rsp 0000007fbfff9220 error 4




My first destinations to search for answers were Oracle support site, OTN forums and Oracle related technical blogs.

I’ve found interesting note on Oracle support site: [ID 549932.1]
Symptoms mentioned there:
- There may be heavy load on the CPU shooting up to 100%.
- The number of sessions in the database is well below the upper or maximum limit defined in the parameter file.
- The listener crashes suddenly during this heavy CPU load generating the core.
- (Optional) Listener.Ora has SUBSCRIBE_FOR_NODE_DOWN_EVENT_LISTENER=OFF.

Cause:
Extensive paging/swapping activity is a clear indication that the system is running out of the physical memory.



So I’ve decided to run vmstat for hour or two to check for swapping activity.
$vmstat 5

Columns si and so were 0 all the time so I thought that our problems weren’t associated with paging/swapping activity. We continued to search problems in other areas.


To cut a long story short - after few days searching for answers we found nothing so I’ve decided to apply patch for unpublished Bug 6139856 despite the fact that I haven't found problem with swapping.

Before applying patch I like to take backup of Oracle Home so I’ve decided to make compressed copy of that directory. Soon after compression began CPU load significantly raised and vmstat started to display swapping activity.

It was pretty clear now that paging/swapping activity is probably the cause of our problems so I applied the patch 6139856.
That fixed our problems and system was stable again :-)


Conclusion
Due to some recent network/firewall problems CPU load on our OLTP database server occasionally raises, causes extensive swapping which crashes TNS listener.


Read More...

Datapump fails with ORA-27086 on NFS

Tuesday, May 31, 2011 2 comments
I've received following error while trying to perform full database export using datapump to NFS:

Export: Release 10.2.0.4.0 - Production on Tuesday, 31 May, 2011 11:27:50

Copyright (c) 2003, 2007, Oracle.  All rights reserved.

Connected to: Oracle Database 10g Release 10.2.0.4.0 - Production
ORA-39001: invalid argument value
ORA-39000: bad dump file specification
ORA-31641: unable to create dump file "/backup-nfs/testdb/export/exportfull.dmp"
ORA-27086: unable to lock file - already in use
Linux Error: 37: No locks available
Additional information: 10

In this case problem was lying in NFS locking deamons - on most Unix-like systems also known as rpc.lockd and the rpc.statd.



Checking if 'nfslock' service is running:
# service nfslock status
rpc.statd is stopped

Start 'nfslock' service:
# /etc/init.d/nfslock start
Starting NFS statd:                                        [  OK  ]

Status:
# service nfslock status
rpc.statd (pid  4943) is running...


Starting 'nfslock' service resolved my problem with performing export to NFS.


Read More...

Replacing old database servers (project check list)

Tuesday, March 22, 2011 0 comments
We are currently in process of replacing our old database servers with new ones. During the evaluation process we’re checking various technical solutions to find the best option in terms of price/performance.

For me, real life experience is the most important when we are talking about database server performance so I’ve posted question on Oracle-L lists to get some answers about noted servers and performance.

DBA’s from Oracle-L lists helped me a lot. Thanks guys!

As replay to my question David Robillard posted answer that is worth publishing on this blog. It is nicely summarized project check list that can be used as a template when evaluating various technical solutions in or outside world of Oracle.

We will surely use this check list in our projects :-)



What to look for when buying (database) servers.
Hello Marko,

> we're considering to replace our SunFire v490 servers with newer generation
> of Oracle SUN or IBM servers.
>
> Our current configuration is:
> 2x SunFire v490 , 4GB RAM , 2 (1x dual-port HBA) x 2Gb FC
> 2x SunFire v490 , 8GB RAM , 2 (1x dual-port HBA) x 2Gb FC
>
> Do you have any experience with Oracle M-Series SPARC servers or x86 Sun
> servers running as database servers?

Yes indeed :)

> We were checking Intel 7500 X4470 and Sparc VII+ M4000 servers as replacement
> for our SunFire v490's.

Both CPU architectures can run your type of workload, that's for sure.
What you need to look out for is (but not limited to)

a) Budget.
b) Server's RAS & Virtualization features.
c) Server's technical specs.
d) Employee technical knowledge in hardware and software.
e) Data center system's homogenization.

So if we go into more detail for each bullet point, we have...

-- a) Budget.

This is a very important starting point in which you need to check: 1.
Oracle software licenses, 2. Oracle hardware cost of purchase, 3.
Oracle software and hardware maintenance costs and 4. electrical and
HVAC costs.

1. Oracle software licenses.

Make sure to check with your Oracle database friendly salesman. I
don't know if this is still the case, but the Oracle license cost for
an x86 CPU was not the same as for the SPARC achitecture. Count the
number of cores and run the numbers with your Oracle software
representative. Unless things have changed since the aquisition of Sun
by Oracle, you might be in for a little surprise.

2. Oracle hardware cost of purchase.

The X4470 and M4000 machines are probably not the same price up front.
And you might not be able to get the same discount on both
architectures. Make sure to talk with your IHV or Oracle hardware
technical representative to see what configuration of each system is
adequate for your workload and then get a price for both system types.
There could be a big difference.

3. Oracle software and hardware maintenance costs.

Depending on the type of hardware, your Oracle software licenses might
be different. So is the maintenance contract on the hardware itself.
Make sure to check those two over a period of 3 to 5 years and compare
the TCO at the end of the system's life cycle. Again, that can be
quite significant. Considering your pay around 20 % annually of the
total cost of purchase on the Oracle software licenses, if they're
more expensive up front, they're going to be a lot more expensive
after 5 years. I'm not sure about the 20 %, so check this number with
your Oracle sales rep.

4. electrical and HVAC costs.

Check the electrical specs for both systems. Keep in mind that you
need to feed these machines power to make the run and more power to
cool them. Since they probably won't have the same requirements, make
sure to check this with your data center specialist to see how much
money it will cost to run those systems? Again, run those numbers up
to 3 or 5 years and compare them.

-- b) Server's RAS and Virtualization features.

The M-series is packed with RAS features. Especially if you run
Solaris on them. Compare those features with what the X-series
platform offers you. The M-series also has built in hardware
virtualization with logical domains and such. In an Oracle database
scenario, are those important? For example, you could decide to assign
a single CPU/memory board for OLTP and perform a dynamic domain change
to add the other CPU/memory board for backup or batch jobs. Will you
be using RAC? Don't forget to ask your Oracle software and hardware
reps to know if you can run RAC in logical domains and/or Solaris
containers?

Make a side-by-side comparision and see what is an added-value to your
organisation and what is simly a nice to have. See if the price
difference is worth the features?

-- c) Server's technical specs.

The M4000 machine can use up to 256 GB while the X4470 is limited to
64 GB. Is that interesting to you? Of course most of us would you
benifit from a very big SGA, but does it makes sense to pay for the
M4000 and 256 GB if your workload is happy with a 10 GB SGA that fits
into both hardware platforms?

The Intel 7500 CPU is most probably faster then the SPARC64 VII+ for
single threaded applications. But the SPARC architecture is probably
better wth multi-threaded applications. Check your SQL code and see if
you would benefit from one or the other. Ask your Oracle hardware rep
for Oracle benchmarks on both architectures and try to compare apples
to apples.

The X4470 is a 3 U machine while the M4000 requires double that. Do
you have enough space in your data center racks to have the new
machines online while the old ones are still there? You will need it
in order to have a smooth transition of your old production machines
to the new ones.

-- d) Employee technical knowledge in hardware and software.

Your team of sysadmins and DBAs have obviously been working with
Solaris + SPARC for a few years now. Are they familiar with Linux +
x86? Or Solaris + x86? They are not the same. Even Solaris SPARC vs
Solaris x86 is not the same. If you have all your operating practices
documented on Solaris + SPARC, are you ready to update all those to
another OS? Will you need to send sysadmins to Oracle university to
learn about Solaris x86, the new M4000 or the X4470? Don't forget to
factor these into the overall price of the solution too. You talked
about IBM machines. Will they be running AIX or Linux? Consider that
the IBM POWER systems running AIX are *very* different then the Oracle
SPARC systems running Solaris. Both are very good products, but be
ready for a big learning curve if you switch from one to another.

IMHO you should stick to what you know and comfortable with. In this
case this is Solaris SPARC. But the price might be prohibitive. If you
need to switch to Solaris x86, consider training your staff to Linux
maybe?

-- e) Data center system's homogenization.

What do you have in your data center? Only Sun Oracle hardware or a
mix of many vendors? It's a lot easier to handle a single vendor then
three to four different ones. Plus the spare parts are easier to stock
and share between the machines of an identical vendor. You also have
an easier time to manage the service contracts from one or two vendors
then three of four. If your team is trained on Sun and Oracle
procedures and hardware, be wary of moving to IBM. It's a lot
different and the learning curve is steep. Keep in mind that it's also
more expensive and more difficult to keep and find trained sysadmins
on different OSes and platforms. Think of what your job offer would
look like if you need to hire another sysadmin. The more product you
have, the more difficult it will be to find qualified sysadmins. Not
to mention that they might ask for a higher salary :)

> Do you think that Intel 7500 X4470 is strong enough to handle workload of two
> SunFire v490 servers?
> Same question related to Sparc VII+ M4000 server - can this server handle
> workload of two SunFire v490's?

Again, ask your Oracle rep these questions. I'm pretty sure the answer
is yes, both machines can support your workload. But at what speed?
And at what cost?

Finally, make sure you have fun! That's a very nice project you have there :o)

Cheers,

David

Read More...

"SQL*Net message to/from dblink" - How to improve query performance over dblink?

Tuesday, February 22, 2011 4 comments
Colleague reported to me that execution of his query is taking too long. After quick look it was obvious that this was ideal tunning candidate.
I will note my tuning process in this blog post.

Table and column names are changed.

DB: Oracle 10.2.0.2.0
OS: Solaris 9 64-bit

Query:
SELECT
       ROUND (
          SUM (poz_traj) / 60,2)
          mins,
       SUM (amount) iznos,
       COUNT (*) pozivi
  FROM 
       (SELECT co.con_nr
          FROM usr.ugov@remote_db co,
               usr.ugov_item@remote_db ci,
               usr.np_zah@remote_db np
         WHERE     1 = 1
               AND co.ugov_id = ci.ugov_id
               AND ci.ugov_item_id = np.ugov_item_id
               AND NP.kol_id IN
                      (355375, 360996, 424469, 466193, 466194, 480438, 492748)
               AND np.u_po IS NOT NULL
               AND status_cd = 'PRI'
               AND co.status <> 'CC') c,
             table1 x
 WHERE 1 = 1 
       AND x.sub_cd = c.con_nr
       AND record_date >=
              TO_DATE ('24.01.2011 08:40:01', 'dd.mm.yyyy hh24:mi:ss')
       AND record_date <
              TO_DATE ('24.01.2011 18:00:01', 'dd.mm.yyyy hh24:mi:ss')
       AND orgn = 0                                     
       AND item_cd = 1                      
       AND corr_status > -1
       AND parameter_sg = 'glas';



This query is using database link to access tables on remote database - "REMOTE_DB".

Explain plan showed:

NOTE!
Click "view source" to get explain plan better formatted.
----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name          | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop | Inst   |IN-OUT|
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |               |       |       |   105 (100)|          |       |       |        |      |
|   1 |  SORT AGGREGATE                        |               |     1 |   275 |            |          |       |       |        |      |
|   2 |   NESTED LOOPS                         |               |     1 |   275 |   105   (0)| 00:00:03 |       |       |        |      |
|   3 |    NESTED LOOPS                        |               |    32 |  3264 |   103   (0)| 00:00:03 |       |       |        |      |
|   4 |     NESTED LOOPS                       |               |    16 |  1216 |    55   (0)| 00:00:02 |       |       |        |      |
|   5 |      PARTITION RANGE SINGLE            |               |    16 |   704 |    23   (0)| 00:00:01 |    37 |    37 |        |      |
|*  6 |       TABLE ACCESS BY LOCAL INDEX ROWID| TABLE1        |    16 |   704 |    23   (0)| 00:00:01 |    37 |    37 |        |      |
|*  7 |        INDEX RANGE SCAN                | IX_TABLE1_RD  |    21 |       |     3   (0)| 00:00:01 |    37 |    37 |        |      |
|   8 |      REMOTE                            | UGOV          |     1 |    32 |     2   (0)| 00:00:01 |       |       | REMO_~ | R->S |
|   9 |     REMOTE                             | UGOV_ITEM     |     2 |    52 |     3   (0)| 00:00:01 |       |       | REMO_~ | R->S |
|  10 |    REMOTE                              | NP_ZAH        |     1 |   173 |     1   (0)| 00:00:01 |       |       | REMO_~ | R->S |
----------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   6 - filter(("item_cd"=1 AND "parameter_sg"='glas' AND "orgn"=0 AND "corr_status">(-1)))
   7 - access("RECORD_DATE">=TO_DATE('2011-01-24 08:40:01', 'yyyy-mm-dd hh24:mi:ss') AND "RECORD_DATE"<TO_DATE('2011-01-24
              18:00:01', 'yyyy-mm-dd hh24:mi:ss'))


To help myself a little I’ve checked how many rows there are in tables queried in inline view.

ugov - 777802 rows
ugov_item - 1581553 rows
np_zah - 155802 rows

SQL>  SELECT co.con_nr
  2            FROM usr.ugov@remote_db co,
  3                 usr.ugov_item@remote_db ci,
  4                 usr.np_zah@remote_db np
  5           WHERE     1 = 1
  6                 AND co.ugov_id = ci.ugov_id
  7                 AND ci.ugov_item_id = np.ugov_item_id
  8                 AND NP.kol_id IN
  9                        (355375, 360996, 424469, 466193, 466194, 480438, 492748)
 10                 AND np.u_po IS NOT NULL
 11                 AND status_cd = 'PRI'
 12                 AND co.status <> 'CC';

4599 rows selected.

Elapsed: 00:00:23.60

Execution of inline view was satisfying with 4599 rows received.


After this initial checks it was obvious where problem lies. Oracle probably transfers whole tables over network from remote database and then resolves the query.

To confirm my thoughts I’ve decided to trace execution of the current query and check wait events.

Results were:
SQL> SELECT
  2         ROUND (
  3            SUM (poz_traj) / 60,
  4            2)
  5            mins,
  6         SUM (amount) iznos,
  7         COUNT (*) pozivi
  8    FROM
  9         (SELECT co.con_nr
 10            FROM usr.ugov@remote_db co,
 11                 usr.ugov_item@remote_db ci,
 12                 usr.np_zah@remote_db np
 13           WHERE     1 = 1
 14                 AND co.ugov_id = ci.ugov_id
 15                 AND ci.ugov_item_id = np.ugov_item_id
 16                 AND NP.kol_id IN
 17                        (355375, 360996, 424469, 466193, 466194, 480438, 492748)
 18                 AND np.u_po IS NOT NULL
 19                 AND status_cd = 'PRI'
 20                 AND co.status <> 'CC') c,
 21               table1 x
 22   WHERE 1 = 1
 23         AND x.sub_cd = c.con_nr
 24         AND record_date >=
 25                TO_DATE ('24.01.2011 08:40:01', 'dd.mm.yyyy hh24:mi:ss')
 26         AND record_date <
 27                TO_DATE ('24.01.2011 18:00:01', 'dd.mm.yyyy hh24:mi:ss')
 28         AND orgn = 0
 29         AND item_cd = 1
 30         AND corr_status > -1
 31         AND parameter_sg = 'glas';

1 row selected.

Elapsed: 01:22:23.17

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to dblink                 3055416        0.00          1.55
  SQL*Net message from dblink               3055416        2.73       3835.51
  db file sequential read                     34051        0.51        226.57
  SQL*Net message from client                     2      323.35        323.35
  SQL*Net message to client                       1        0.00          0.00


Time of query execution was 1:22:23.

Top wait events were "SQL*Net message to dblink" and "SQL*Net message from dblink".

I’ve had network bottleneck because query produced lot of traffic between databases.


To resolve this problem I have to minimize data transfer from remote database.
Inline view should be executed on remote database to cut down number of rows that are transferred over dblink.


So I’ve created view on remote database and checked execution again.

REMOTE DATABASE

create view temp_view_test as
  SELECT co.con_nr
          FROM usr.ugov co,
               usr.ugov_item ci,
               usr.np_zah np
         WHERE     1 = 1
               AND co.ugov_id = ci.ugov_id
               AND ci.ugov_item_id = np.ugov_item_id
               AND NP.kol_id IN
                      (355375, 360996, 424469, 466193, 466194, 480438, 492748)
               AND np.u_po IS NOT NULL
               AND status_cd = 'PRI'
               AND co.status <> 'CC';


What will happen now:

SQL > SELECT
  2         ROUND (
  3            SUM (poz_traj) / 60,
  4            2)
  5            mins,
  6         SUM (amount) iznos,
  7         COUNT (*) pozivi
  8    FROM
  9         usr.temp_view_test@remote_db c,
 10               table1 x
 11   WHERE 1 = 1
 12         AND x.sub_cd = c.con_nr
 13         AND record_date >=
 14                TO_DATE ('25.01.2011 08:40:01', 'dd.mm.yyyy hh24:mi:ss')
 15         AND record_date <
 16                TO_DATE ('25.01.2011 18:00:01', 'dd.mm.yyyy hh24:mi:ss')
 17         AND orgn = 0
 18         AND item_cd = 1
 19         AND corr_status > -1
 20         AND parameter_sg = 'glas';

1 row selected.

Elapsed: 00:02:55.53

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name            | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop | Inst   |IN-OUT|
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                 |       |       |   625 (100)|          |       |       |        |      |
|   1 |  SORT AGGREGATE                      |                 |     1 |    56 |            |          |       |       |        |      |
|*  2 |   HASH JOIN                          |                 |     8 |   448 |   625   (1)| 00:00:13 |       |       |        |      |
|   3 |    PARTITION RANGE SINGLE            |                 |    16 |   704 |    23   (0)| 00:00:01 |    37 |    37 |        |      |
|*  4 |     TABLE ACCESS BY LOCAL INDEX ROWID| TABLE1          |    16 |   704 |    23   (0)| 00:00:01 |    37 |    37 |        |      |
|*  5 |      INDEX RANGE SCAN                | IX_TABLE1_RD    |    21 |       |     3   (0)| 00:00:01 |    37 |    37 |        |      |
|   6 |    REMOTE                            | TEMP_VIEW_TEST  | 51418 |   602K|   601   (1)| 00:00:12 |       |       | REMO_~ | R->S |
----------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("X"."sub_cd"="C"."con_nr")
   4 - filter(("item_cd"=1 AND "parameter_sg"='glas' AND "orgn"=0 AND "corr_status">(-1)))
   5 - access("RECORD_DATE">=TO_DATE('2011-01-24 08:40:01', 'yyyy-mm-dd hh24:mi:ss') AND "RECORD_DATE"<TO_DATE('2011-01-24
              18:00:01', 'yyyy-mm-dd hh24:mi:ss'))



This was significant performance improvement. From 1 hour 22 mins to 2 mins 55 secs !

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  db file sequential read                     27932        0.27         41.32
  SQL*Net message to dblink                       2        0.00          0.00
  SQL*Net message from dblink                     2       42.79         44.46
  SQL*Net more data from dblink                  36        3.24         65.83
  SQL*Net message from client                     2        0.01          0.01
  SQL*Net message to client      

Top wait events weren't related to network bottlenecks.

So the trick was to create view on remote database which reduced number of rows transferred over network.

Colleague was happy with new execution time :)


Read More...

Invalid objects after Revoking Execute On DBMS_JOB, DBMS_LOB, UTL_TCP, UTL_HTTP, UTL_SMTP, UTL_FILE from PUBLIC

Friday, November 19, 2010 3 comments
You can use Oracle Grid Control for checking policy violations, to keep compliance with company security or configuration standards. Grid Control comes with pre-installed set of policies and recommendations of best practices for databases.

With having that in mind I checked Critical Policy Violations for my database and noticed this critical violations:
...
Execute Privileges on DBMS_JOB to PUBLIC 
Restricted Privilege to Execute UTL_TCP
Restricted Privilege to Execute UTL_HTTP 
Restricted Privilege to Execute UTL_SMTP 
Execute Privileges on DBMS_LOB to PUBLIC 
Execute Privileges on UTL_FILE To PUBLIC
...

It is recommended to remove excessive privileges from some users to prevent possible attacks for happening in the first place. So based on recommendations from Grid Control I’ve decided to revoke noted privileges from PUBLIC and check what will happen after.


As SYS user I executed:
REVOKE EXECUTE ON utl_tcp FROM public;
REVOKE EXECUTE ON utl_smtp FROM public;
REVOKE EXECUTE ON utl_http FROM public;
REVOKE EXECUTE ON utl_file FROM public;
REVOKE EXECUTE ON dbms_job FROM public;
REVOKE EXECUTE ON dbms_lob FROM public;

Immediately after revoking those grants I checked database status noticed invalid objects in sys schema, mdsys, ordsys, wmsys, etc.

My next move was rollback:
GRANT EXECUTE ON dbms_job TO public;
GRANT EXECUTE ON utl_file TO public;
GRANT EXECUTE ON utl_tcp TO public;
GRANT EXECUTE ON utl_http TO public;
GRANT EXECUTE ON utl_smtp TO public;
GRANT EXECUTE ON dbms_lob TO public;

-- Compile
SQL> shutdown immediate;
SQL> startup restrict;
SQL> @$ORACLE_HOME/rdbms/admin/@utlrp.sql
SQL> shutdown immediate;
SQL> startup;

After granting privileges back and recompiling invalid objects everything was valid again.
This test was very educational for me.

My next step was searching for Metalink notes that explain this situation and I’ve found very useful notes - [ID 247093.1] [ID 1165830.1] [ID 797706.1]

Important lesson learned in this experiment was - do not blindly follow EM/GC policy recommendations without consulting Oracle support or (at least) skimming through Metalink notes.

From Metalink note:
“Oracle highly recommends keeping DBMS_SQL, DBMS_JOB, DBMS_LOB granted to PUBLIC just to keep user’s database running. Otherwise lots of things break including some of the utilities like Import/Export, Datapump expdp/impdp, SQL*Loader, etc.”


PUBLIC is pseudo user/schema used by database for its internal functionality. Privileges are granted to PUBLIC with CATALOG and CATPROC components for 10g database instance.

If you still want to revoke privileges from PUBLIC make sure that you re-grant privileges to user that require them and recompile possible invalidated objects.


Read More...

RMAN restore fails with RMAN-06023 but there are backups available

Thursday, November 18, 2010 7 comments
I must admit - today I learned new thing about RMAN restore and this is good enough reason for a blog post.

Part of my daily tasks is to perform RMAN backup and recovery scenarios in my test environment (as I said beore), to be familiar with every possible situation that can happen.

So today, very simple task of restoring database to another machine took me over 2 hours...


Demo case:
Environment : Oracle 10.2.0.4 on Linux 64bit


I have backup files of the database:
$ ls -l /oracle/backup/*
ARCH_OTP_20101117_0tlt6lum_s8221_p1
controlf_OTP_0ult6mop_8222_20101117
FULL_OTP_0rlt6ip4
spfile_OTP_0vlt6mos_8223_20101117

Restore controlfile from backup and mount database:
$ rman target /
Recovery Manager: Release 10.2.0.4.0 - Production on Thu Nov 18 08:32:09 2010
Copyright (c) 1982, 2007, Oracle.  All rights reserved.
connected to target database (not started)

RMAN> startup nomount;
Oracle instance started
Total System Global Area     838860800 bytes
Fixed Size                     2087672 bytes
Variable Size                750781704 bytes
Database Buffers              67108864 bytes
Redo Buffers                  18882560 bytes

RMAN> run
2> {
3> allocate channel c1 device type disk;
4> restore controlfile from '/oracle/backup/controlf_OTP_0ult6mop_8222_20101117';
5> }

Starting restore at 18.11.2010 08:40:51

channel c1: restoring control file
channel c1: restore complete, elapsed time: 00:00:01
output filename=/oracle/product/10.2.0/db_1/dbs/control01.ctl
output filename=/oracle/product/10.2.0/db_1/dbs/control02.ctl
Finished restore at 18.11.2010 08:40:52
released channel: c1

RMAN> alter database mount;
database mounted

I will delete all EXPIRED backups from RMAN repository:

RMAN> crosscheck backup;
RMAN> delete noprompt expired backup;

RMAN> crosscheck archivelog all;
RMAN> delete noprompt expired archivelog all;

For this case it is very important to mention that I’ve had enabled controlfile autobackup (CONFIGURE CONTROLFILE AUTOBACKUP ON) in older incarnations. With this parameter enabled RMAN automatically takes backup of controlfile and server parameter file whenever the database structure metadata in the control file changes or whenever a backup or copy operation is performed using RMAN.

List backup command shows that there are two old autobackup files in flash recovery area.
RMAN> list backup;


List of Backup Sets
===================

BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ -------------------
8199    Full    7.89M      DISK        00:00:00     16.06.2010 14:50:28
        BP Key: 8199   Status: AVAILABLE  Compressed: NO  Tag: TAG20100616T145028
        Piece Name: /oracle/oradata/archive/flash_recovery_area/OTP/autobackup/2010_06_16/o1_mf_s_721839028_61kl4o2y_.bkp
  Control File Included: Ckp SCN: 43471045788   Ckp time: 16.06.2010 14:50:28
  SPFILE Included: Modification time: 16.06.2010 13:15:03

BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ -------------------
8200    Full    7.89M      DISK        00:00:00     16.06.2010 14:49:49
        BP Key: 8200   Status: AVAILABLE  Compressed: NO  Tag: TAG20100616T144949
        Piece Name: /oracle/oradata/archive/flash_recovery_area/OTP/autobackup/2010_06_16/o1_mf_s_721838989_61kl3hhq_.bkp
  Control File Included: Ckp SCN: 43471045571   Ckp time: 16.06.2010 14:49:49
  SPFILE Included: Modification time: 16.06.2010 13:15:03


To continue with my restore process I will catalog my backup files noted before.
RMAN> catalog start with '/oracle/backup/';

searching for all files that match the pattern /oracle/backup/

List of Files Unknown to the Database
=====================================
File Name: /oracle/backup/FULL_OTP_0rlt6ip4
File Name: /oracle/backup/ARCH_OTP_20101117_0tlt6lum_s8221_p1
File Name: /oracle/backup/controlf_OTP_0ult6mop_8222_20101117

Do you really want to catalog the above files (enter YES or NO)? YES
cataloging files...
cataloging done

List of Cataloged Files
=======================
File Name: /oracle/backup/FULL_OTP_0rlt6ip4
File Name: /oracle/backup/ARCH_OTP_20101117_0tlt6lum_s8221_p1
File Name: /oracle/backup/controlf_OTP_0ult6mop_8222_20101117

Check backup of database.
RMAN> list backup of database;


List of Backup Sets
===================

BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ -------------------
8196    Incr 0  11.18G     DISK        00:53:38     17.11.2010 01:53:58
        BP Key: 8201   Status: AVAILABLE  Compressed: YES  Tag: WEEKLY_FULL
        Piece Name: /oracle/backup/FULL_OTP_0rlt6ip4
  List of Datafiles in backup set 8196
  File LV Type Ckp SCN    Ckp Time            Name
  ---- -- ---- ---------- ------------------- ----
  1    0  Incr 48852276752 17.11.2010 01:00:20 /u02/oradata/system/otp/system01.dbf
  2    0  Incr 48852276752 17.11.2010 01:00:20 /u02/oradata/data02/otp/users02.dbf
  3    0  Incr 48852276752 17.11.2010 01:00:20 /u02/oradata/system/otp/sysaux01.dbf
  4    0  Incr 48852276752 17.11.2010 01:00:20 /u02/oradata/data02/otp/users01.dbf
  5    0  Incr 48852276752 17.11.2010 01:00:20 /u02/oradata/index/otp/tools01.dbf
  6    0  Incr 48852276752 17.11.2010 01:00:20 /u02/oradata/data02/otp/users03.dbf
  7    0  Incr 48852276752 17.11.2010 01:00:20 /u02/oradata/system/otp/undotbs01.dbf
  8    0  Incr 48852276752 17.11.2010 01:00:20 /u02/oradata/system/otp/cluser_tbs01.dbf

Restore database files from backup to another location.
RMAN> run
2> {
3> allocate channel c1 device type disk;
4> SET NEWNAME FOR DATAFILE 1 TO '/oracle/u02/oradata/system/otp/system01.dbf';
5> SET NEWNAME FOR DATAFILE 2 TO '/oracle/u02/oradata/data02/otp/users02.dbf';
6> SET NEWNAME FOR DATAFILE 3 TO '/oracle/u02/oradata/system/otp/sysaux01.dbf';
7> SET NEWNAME FOR DATAFILE 4 TO '/oracle/u02/oradata/data02/otp/users01.dbf';
8> SET NEWNAME FOR DATAFILE 5 TO '/oracle/u02/oradata/index/otp/tools01.dbf';
9> SET NEWNAME FOR DATAFILE 6 TO '/oracle/u02/oradata/data02/otp/users03.dbf';
10> SET NEWNAME FOR DATAFILE 7 TO '/oracle/u02/oradata/system/otp/undotbs01.dbf';
11> SET NEWNAME FOR DATAFILE 8 TO '/oracle/u02/oradata/system/otp/cluser_tbs01.dbf';
12>
13> RESTORE DATABASE;
14> SWITCH DATAFILE ALL;
15> RECOVER DATABASE;
16> }

using target database control file instead of recovery catalog
allocated channel: c1
channel c1: sid=321 devtype=DISK

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

Starting restore at 18.11.2010 08:46:21

released channel: c1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 11/18/2010 08:46:22
RMAN-06026: some targets not found - aborting restore
RMAN-06023: no backup or copy of datafile 4 found to restore
RMAN-06023: no backup or copy of datafile 3 found to restore
RMAN-06023: no backup or copy of datafile 1 found to restore

Hm...
For me this was unexpected error because I had database backup available.

RMAN> list backup of datafile 4;


List of Backup Sets
===================

BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ -------------------
8196    Incr 0  11.18G     DISK        00:53:38     17.11.2010 01:53:58
        BP Key: 8201   Status: AVAILABLE  Compressed: YES  Tag: WEEKLY_FULL
        Piece Name: /oracle/backup/FULL_OTP_0rlt6ip4
  List of Datafiles in backup set 8196
  File LV Type Ckp SCN    Ckp Time            Name
  ---- -- ---- ---------- ------------------- ----
  4    0  Incr 48852276752 17.11.2010 01:00:20 /u02/oradata/data02/otp/users01.dbf

RMAN> list backup of datafile 3;


List of Backup Sets
===================

BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ -------------------
8196    Incr 0  11.18G     DISK        00:53:38     17.11.2010 01:53:58
        BP Key: 8201   Status: AVAILABLE  Compressed: YES  Tag: WEEKLY_FULL
        Piece Name: /oracle/backup/FULL_OTP_0rlt6ip4
  List of Datafiles in backup set 8196
  File LV Type Ckp SCN    Ckp Time            Name
  ---- -- ---- ---------- ------------------- ----
  3    0  Incr 48852276752 17.11.2010 01:00:20 /u02/oradata/system/otp/sysaux01.dbf

…
…
…

After two hours of searching for answers and performing more tests I ran to the Metalink document 965122.1.

The problem were those autobackups in FRA that I showed before. That files belonged to different incarnation than the available backups current incarnation.

[From Metalink document]
If we start a RESTORE database with a BACKUP controlfile and Flash Recovery Area is defined, RMAN execute and implicit crosscheck and catalog of all the objects in the Flash Recovery Area.

RMAN will catalog any objects in the Flash Recovery Area that will not be registered in the controlfile and if any of this files belongs to an incarnation different from CURRENT incarnation in the controlfile then changes controlfile CURRENT incarnation to the one found in the file that is being cataloged.

This prevents database from restoring backups that belong to old CURRENT incarnation.
RMAN considers backup availble for being restored if the backup incarnation and CURRENT incarnation in controlfile are the same.



So I’ve decided to temporary disable FRA during recovery process commenting db_recovery_file_dest and db_recovery_file_dest_size parameters in parameter file.

RMAN> shutdown abort;

Oracle instance shut down

$ vi $ORACLE_HOME/dbs/initotp.ora
...
#*.db_recovery_file_dest='/oracle/oradata/archive/flash_recovery_area'
#*.db_recovery_file_dest_size=2147483648
…
...

$ sqlplus "/as sysdba"

SQL> startup nomount pfile='$ORACLE_HOME/dbs/initotp.ora';
ORACLE instance started.

Total System Global Area  838860800 bytes
Fixed Size                  2087672 bytes
Variable Size             750781704 bytes
Database Buffers           67108864 bytes
Redo Buffers               18882560 bytes

Restore and recover database.
RMAN> run
2> {
3> allocate channel c1 device type disk;
4> restore controlfile from '/oracle/backup/controlf_OTP_0ult6mop_8222_20101117';
5> }
…
…
…
RMAN> alter database mount;
database mounted

RMAN> catalog start with '/oracle/backup';
…
…

RMAN> run
2> {
3> allocate channel c1 device type disk;
4> SET NEWNAME FOR DATAFILE 1 TO '/oracle/u02/oradata/system/otp/system01.dbf';
5> SET NEWNAME FOR DATAFILE 2 TO '/oracle/u02/oradata/data02/otp/users02.dbf';
6> SET NEWNAME FOR DATAFILE 3 TO '/oracle/u02/oradata/system/otp/sysaux01.dbf';
7> SET NEWNAME FOR DATAFILE 4 TO '/oracle/u02/oradata/data02/otp/users01.dbf';
8> SET NEWNAME FOR DATAFILE 5 TO '/oracle/u02/oradata/index/otp/tools01.dbf';
9> SET NEWNAME FOR DATAFILE 6 TO '/oracle/u02/oradata/data02/otp/users03.dbf';
10> SET NEWNAME FOR DATAFILE 7 TO '/oracle/u02/oradata/system/otp/undotbs01.dbf';
11> SET NEWNAME FOR DATAFILE 8 TO '/oracle/u02/oradata/system/otp/cluser_tbs01.dbf';
12>
13> RESTORE DATABASE;
14> SWITCH DATAFILE ALL;
15> RECOVER DATABASE;
16> }

using target database control file instead of recovery catalog
allocated channel: c1
channel c1: sid=321 devtype=DISK

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

executing command: SET NEWNAME

Starting restore at 18.11.2010 09:01:48

channel c1: starting datafile backupset restore
channel c1: specifying datafile(s) to restore from backup set
restoring datafile 00001 to /oracle/u02/oradata/system/otp/system01.dbf
restoring datafile 00002 to /oracle/u02/oradata/data02/otp/users02.dbf
restoring datafile 00003 to /oracle/u02/oradata/system/otp/sysaux01.dbf
restoring datafile 00004 to /oracle/u02/oradata/data02/otp/users01.dbf
restoring datafile 00005 to /oracle/u02/oradata/index/otp/tools01.dbf
restoring datafile 00006 to /oracle/u02/oradata/data02/otp/users03.dbf
restoring datafile 00007 to /oracle/u02/oradata/system/otp/undotbs01.dbf
restoring datafile 00008 to /oracle/u02/oradata/system/otp/cluser_tbs01.dbf
channel c1: reading from backup piece /var/umoracle/otp/fullbkp_dir/FULL_OTP_0rlt6ip4
channel c1: restored backup piece 1
failover to piece handle=/mnt/l01dbdev-s01storage1/oracle/backup/FULL_OTP_0rlt6ip4 tag=WEEKLY_FULL
channel c1: restore complete, elapsed time: 01:28:51
Finished restore at 18.11.2010 10:30:40

datafile 1 switched to datafile copy
input datafile copy recid=9 stamp=735388241 filename=/oracle/u02/oradata/system/otp/system01.dbf
datafile 2 switched to datafile copy
input datafile copy recid=10 stamp=735388241 filename=/oracle/u02/oradata/data02/otp/users02.dbf
datafile 3 switched to datafile copy
input datafile copy recid=11 stamp=735388241 filename=/oracle/u02/oradata/system/otp/sysaux01.dbf
datafile 4 switched to datafile copy
input datafile copy recid=12 stamp=735388241 filename=/oracle/u02/oradata/data02/otp/users01.dbf
datafile 5 switched to datafile copy
input datafile copy recid=13 stamp=735388241 filename=/oracle/u02/oradata/index/otp/tools01.dbf
datafile 6 switched to datafile copy
input datafile copy recid=14 stamp=735388242 filename=/oracle/u02/oradata/data02/otp/users03.dbf
datafile 7 switched to datafile copy
input datafile copy recid=15 stamp=735388242 filename=/oracle/u02/oradata/system/otp/undotbs01.dbf
datafile 8 switched to datafile copy
input datafile copy recid=16 stamp=735388242 filename=/oracle/u02/oradata/system/otp/cluser_tbs01.dbf

Starting recover at 18.11.2010 10:30:43

starting media recovery

channel c1: starting archive log restore to default destination
channel c1: restoring archive log
archive log thread=1 sequence=33610
channel c1: reading from backup piece /var/otp/fullbkp_dir/ARCH_OTP_20101117_0tlt6lum_s8221_p1
channel c1: restored backup piece 1
failover to piece handle=/oracle/backup/ARCH_OTP_20101117_0tlt6lum_s8221_p1 tag=ARCHIVELOGS
channel c1: restore complete, elapsed time: 00:11:44
archive log filename=/oracle/oradata/archive/arch_1_33610_586538926.arc thread=1 sequence=33610
unable to find archive log
archive log thread=1 sequence=33611
released channel: c1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 11/18/2010 10:50:16
RMAN-06054: media recovery requesting unknown log: thread 1 seq 33611 lowscn 48852472202
RMAN> exit


$ sqlplus "/as sysdba"

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 48852472202 generated at 11/17/2010 01:54:16 needed for
thread 1
ORA-00289: suggestion :
/oracle/oradata/archive/arch_1_33611_586538926.arc
ORA-00280: change 48852472202 for thread 1 is in sequence #33611

Specify log: {=suggested | filename | AUTO | CANCEL}
CANCEL
Media recovery cancelled.

SQL> alter database open resetlogs;
Database altered.

I'm glad that this problem occurred now in test environment because those two hours spent on searching for answers could be critical in recovery of a production database.

You never know :)


REFERENCES
Metalink - [ID 965122.1]

Read More...

Recover From Incrementally Updated Backup (test case)

Thursday, November 4, 2010 0 comments
In this test I will play with incrementally updated backup. Using this feature we can recover datafile image copies applying incremental backups to them.

NOTE!
You should use 'flash recovery area' as primary location for your backups whenever you can. To complicate this test case a little I won’t use 'flash recovery area' for my backups.

Oracle 10.2.0.4.0
Linux OEL5

For taking backup I will use this RMAN script:

connect target /
run
{
sql 'alter system checkpoint';
allocate channel c1 device type disk format '/home/ora10g/backup/%U.dbf';
backup as compressed backupset
check logical incremental level 1
for recover of copy with tag DAILY_FULL database
filesperset 1;

recover copy of database with tag DAILY_FULL;
release channel c1;

allocate channel c1 device type disk;
backup (archivelog all format '/home/ora10g/backup/ARCH_%d_%T_%u_s%s_p%p' delete input TAG "archivelogs");
backup current controlfile format '/home/ora10g/backup/controlf_%d_%u_%s_%T';
release channel c1;
}
(I've borrowed first part of the script from great DBA Husnu Sensoy - Thanks!)

After first execution of this script RMAN will create initial image copy of the database, backup archived redo logs and current controlfile to directory ‘/home/ora10g/backup’.

Now I will create small table just for later checks and run script above again.

$ sqlplus admin

SQL> create table test123 as select * from dba_objects;
Table created.

SQL> alter system switch logfile;
System altered.

Next time RMAN will find image copies of data files and make just incremental backups. Additional step we will be updating image copies of datafiles applying the incremental backups to them.

RMAN script in usual circumstances works perfectly but this time I will interfere and kill RMAN process during recovery using ‘kill -9 [rman pid]’ command to simulate disaster.

To simulate disk crash I will drop original database using rm command.

So to summarize:
- I’ve lost my original database due to the 'disk crash'
- I have:
Image copies of datafiles partially recovered
Backups of archived redo logs
Multiplexed redo logs
Backup of controlfile


Can I fully recover my original database using this files? I can try... :)

I have already created empty directories where I will put database files, controlfiles, etc.


Restore controlfile and mount database:

RMAN> run
2> {
3> allocate channel c1 device type disk;
4> restore controlfile from '/home/ora10g/backup/controlf_RASAPP_khls5prr_3729_20101104';
5> }

using target database control file instead of recovery catalog
allocated channel: c1
channel c1: sid=157 devtype=DISK

Starting restore at 04-NOV-10

channel c1: restoring control file
channel c1: restore complete, elapsed time: 00:00:02
output filename=/u01/oradata/rasapp/control01.ctl
output filename=/u01/oradata/rasapp/control02.ctl
output filename=/u02/oradata/rasapp/control03.ctl
Finished restore at 04-NOV-10
released channel: c1

RMAN> alter database mount;

database mounted



Catalog backup pieces from backup directory ‘/home/ora10g/backup’:
RMAN> catalog start with '/home/ora10g/backup';
...
...

-- Check copy of database:

RMAN> list copy of database;
...
...


Now I will try to switch database to the latest image copy.

RMAN> switch database to copy;

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of switch to copy command at 11/04/2010 15:30:15
RMAN-06571: datafile 9 does not have recoverable copy


As I’ve killed RMAN process during recovery operation my image copy of database is not valid.

How to overcome this error...


I will run ‘RECOVER COPY OF DATABASE WITH TAG ‘DAILY_FULL’’ to apply incremental backups to datafile copies with the same tag.

RMAN> recover copy of database with tag 'DAILY_FULL';

Starting recover at 04-NOV-10
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=157 devtype=DISK
no copy of datafile 9 found to recover
no copy of datafile 12 found to recover
no copy of datafile 13 found to recover
no copy of datafile 17 found to recover
channel ORA_DISK_1: starting incremental datafile backupset restore
channel ORA_DISK_1: specifying datafile copies to recover
recovering datafile copy fno=00018 name=/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-UNDOTBS2_FNO-18_k4ls5pee.dbf
channel ORA_DISK_1: reading from backup piece /home/ora10g/backup/krls5qrn_1_1.dbf
channel ORA_DISK_1: restored backup piece 1
piece handle=/home/ora10g/backup/krls5qrn_1_1.dbf tag=TAG20101104T145544
channel ORA_DISK_1: restore complete, elapsed time: 00:00:01
channel ORA_DISK_1: starting incremental datafile backupset restore
channel ORA_DISK_1: specifying datafile copies to recover
recovering datafile copy fno=00008 name=/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-SAP_MERGE_FNO-8_k3ls5p99.dbf
channel ORA_DISK_1: reading from backup piece /home/ora10g/backup/kqls5qrm_1_1.dbf
channel ORA_DISK_1: restored backup piece 1
piece handle=/home/ora10g/backup/kqls5qrm_1_1.dbf tag=TAG20101104T145544
channel ORA_DISK_1: restore complete, elapsed time: 00:00:01
channel ORA_DISK_1: starting incremental datafile backupset restore
channel ORA_DISK_1: specifying datafile copies to recover
recovering datafile copy fno=00006 name=/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-BILL_DATA_FNO-6_k8ls5pnh.dbf
channel ORA_DISK_1: reading from backup piece /home/ora10g/backup/kvls5qrs_1_1.dbf
channel ORA_DISK_1: restored backup piece 1
piece handle=/home/ora10g/backup/kvls5qrs_1_1.dbf tag=TAG20101104T145544
channel ORA_DISK_1: restore complete, elapsed time: 00:00:01
...
...
Finished recover at 04-NOV-10

Notice this part of the log:
no copy of datafile 9 found to recover
no copy of datafile 12 found to recover
no copy of datafile 13 found to recover
no copy of datafile 17 found to recover

From alertlog:
...
Thu Nov  4 15:26:51 2010
datafilecopy header validation failure for file /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-LNP_FNO-9_k1ls5orq.dbf
datafilecopy header validation failure for file /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-LNP_FNO-9_k1ls5orq.dbf
datafilecopy header validation failure for file /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-PSUB_DATA_FNO-12_jvls5nuo.dbf
datafilecopy header validation failure for file /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-PSUB_DATA_FNO-12_jvls5nuo.dbf
datafilecopy header validation failure for file /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-DSLAM_STATS_FNO-13_juls5nco.dbf
datafilecopy header validation failure for file /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-DSLAM_STATS_FNO-13_juls5nco.dbf
datafilecopy header validation failure for file /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-TS_DATA_FNO-17_k0ls5od9.dbf
datafilecopy header validation failure for file /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-TS_DATA_FNO-17_k0ls5od9.dbf
Thu Nov  4 15:27:06 2010
datafilecopy header validation failure for file /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-LNP_FNO-9_k1ls5orq.dbf
datafilecopy header validation failure for file /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-PSUB_DATA_FNO-12_jvls5nuo.dbf
datafilecopy header validation failure for file /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-DSLAM_STATS_FNO-13_juls5nco.dbf
datafilecopy header validation failure for file /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-TS_DATA_FNO-17_k0ls5od9.dbf
...
...

ORA-19563:
%s header validation failed for file string

Cause: When opening the file to be placed in a copy or backup set, to be inspected, or used as the target for an incremental restore, its header was not recognized as a valid file header for a file of the indicated type (data file, archived log, or control file) belonging to the current database.

Action: Ensure that the correct files are being specified for the copy or backup operation.


When I run ‘switch database to copy’ command it fails again with same error as before:

RMAN> switch database to copy;

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of switch to copy command at 11/04/2010 18:29:03
RMAN-06571: datafile 9 does not have recoverable copy


To overcome this error I will uncatalog noted datafiles:

RMAN> change copy of datafile 9 uncatalog;

using target database control file instead of recovery catalog
uncataloged datafile copy
datafile copy filename=/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-LNP_FNO-9_k1ls5orq.dbf recid=95 stamp=734192788
Uncataloged 1 objects


RMAN> change copy of datafile 12 uncatalog;

uncataloged datafile copy
datafile copy filename=/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-PSUB_DATA_FNO-12_jvls5nuo.dbf recid=93 stamp=734192037
Uncataloged 1 objects


RMAN> change copy of datafile 13 uncatalog;

uncataloged datafile copy
datafile copy filename=/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-DSLAM_STATS_FNO-13_juls5nco.dbf recid=92 stamp=734191568
Uncataloged 1 objects


RMAN> change copy of datafile 17 uncatalog;

uncataloged datafile copy
datafile copy filename=/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-TS_DATA_FNO-17_k0ls5od9.dbf recid=94 stamp=734192505
Uncataloged 1 objects


Catalog them again:

RMAN> catalog start with '/home/ora10g/backup';

searching for all files that match the pattern /home/ora10g/backup

List of Files Unknown to the Database
=====================================
File Name: /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-PSUB_DATA_FNO-12_jvls5nuo.dbf
File Name: /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-LNP_FNO-9_k1ls5orq.dbf
File Name: /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-DSLAM_STATS_FNO-13_juls5nco.dbf
File Name: /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-TS_DATA_FNO-17_k0ls5od9.dbf

Do you really want to catalog the above files (enter YES or NO)? yes
cataloging files...
cataloging done

List of Cataloged Files
=======================
File Name: /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-PSUB_DATA_FNO-12_jvls5nuo.dbf
File Name: /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-LNP_FNO-9_k1ls5orq.dbf
File Name: /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-DSLAM_STATS_FNO-13_juls5nco.dbf
File Name: /home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-TS_DATA_FNO-17_k0ls5od9.dbf


Let’s try switch database to copy:

RMAN> switch database to copy;

datafile 1 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-SYSTEM_FNO-1_k6ls5plb.dbf"
datafile 2 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-SAPBI_DATA_FNO-2_k7ls5pmo.dbf"
datafile 3 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-SYSAUX_FNO-3_k5ls5piv.dbf"
datafile 4 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-APP_DATA_FNO-4_k2ls5p4o.dbf"
datafile 5 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-USERS_FNO-5_kbls5ppj.dbf"
datafile 6 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-BILL_DATA_FNO-6_k8ls5pnh.dbf"
datafile 7 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-OT_DATA_FNO-7_kcls5pq2.dbf"
datafile 8 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-SAP_MERGE_FNO-8_k3ls5p99.dbf"
datafile 9 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-LNP_FNO-9_k1ls5orq.dbf"
datafile 10 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-PHONEBOOK_FNO-10_kals5pp4.dbf"
datafile 11 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-IVR_DATA_FNO-11_kdls5pqh.dbf"
datafile 12 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-PSUB_DATA_FNO-12_jvls5nuo.dbf"
datafile 13 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-DSLAM_STATS_FNO-13_juls5nco.dbf"
datafile 14 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-SIPDIR_DATA_FNO-14_k9ls5pob.dbf"
datafile 16 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-INV_RN_INTERFACE_FNO-16_kels5pr0.dbf"
datafile 17 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-TS_DATA_FNO-17_k0ls5od9.dbf"
datafile 18 switched to datafile copy "/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-UNDOTBS2_FNO-18_k4ls5pee.dbf"

It worked...


Recover database now:

RMAN> recover database;

Starting recover at 04-NOV-10
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=157 devtype=DISK

starting media recovery

channel ORA_DISK_1: starting archive log restore to default destination
channel ORA_DISK_1: restoring archive log
archive log thread=1 sequence=10
channel ORA_DISK_1: restoring archive log
archive log thread=1 sequence=11
channel ORA_DISK_1: reading from backup piece /home/ora10g/backup/ARCH_RASAPP_20101104_kjls5q8e_s3731_p1
channel ORA_DISK_1: restored backup piece 1
piece handle=/home/ora10g/backup/ARCH_RASAPP_20101104_kjls5q8e_s3731_p1 tag=ARCHIVELOGS
channel ORA_DISK_1: restore complete, elapsed time: 00:00:02
archive log filename=/u01/oradata/rasapp/archlogs/1_10_734173548.dbf thread=1 sequence=10
archive log filename=/u01/oradata/rasapp/archlogs/1_11_734173548.dbf thread=1 sequence=11
unable to find archive log
archive log thread=1 sequence=12
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 11/04/2010 18:31:26
RMAN-06054: media recovery requesting unknown log: thread 1 seq 12 lowscn 48367782959

I’m missing some redo entries - this is because I forgot to copy multiplexed redo log files to the correct location.

Copy multiplexed redo logs to specified locations:

SQL> select member from v$logfile;

MEMBER
--------------------------------------------------------------------------------
/u01/oradata/rasapp/redo01a.rdo
/u02/oradata/rasapp/redo01b.rdo
/u01/oradata/rasapp/redo02a.rdo
/u02/oradata/rasapp/redo02b.rdo
/u01/oradata/rasapp/redo03a.rdo
/u02/oradata/rasapp/redo03b.rdo

6 rows selected.


$ cp redo01b.rdo /u02/oradata/rasapp/
...
...


Everything should be OK now during recovery:

$ rman target /

Recovery Manager: Release 10.2.0.4.0 - Production on Thu Nov 4 18:50:25 2010

Copyright (c) 1982, 2007, Oracle.  All rights reserved.

connected to target database: RASAPP (DBID=1650089695, not open)

RMAN> recover database;

Starting recover at 04-NOV-10
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=157 devtype=DISK

starting media recovery

archive log thread 1 sequence 12 is already on disk as file /u01/oradata/rasapp/redo03b.rdo
archive log filename=/u02/oradata/rasapp/redo03b.rdo thread=1 sequence=12
media recovery complete, elapsed time: 00:00:06
Finished recover at 04-NOV-10

RMAN> alter database open resetlogs;

database opened


Let’s check data:

SQL> col file_name for a100
SQL> set lines 200
SQL> set pages 999
SQL> select file_name from dba_data_files;

FILE_NAME
----------------------------------------------------------------------------------------------------
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-SYSTEM_FNO-1_k6ls5plb.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-SAPBI_DATA_FNO-2_k7ls5pmo.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-SYSAUX_FNO-3_k5ls5piv.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-OT_DATA_FNO-7_kcls5pq2.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-APP_DATA_FNO-4_k2ls5p4o.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-USERS_FNO-5_kbls5ppj.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-BILL_DATA_FNO-6_k8ls5pnh.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-SAP_MERGE_FNO-8_k3ls5p99.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-LNP_FNO-9_k1ls5orq.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-PHONEBOOK_FNO-10_kals5pp4.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-IVR_DATA_FNO-11_kdls5pqh.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-PSUB_DATA_FNO-12_jvls5nuo.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-DSLAM_STATS_FNO-13_juls5nco.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-SIPDIR_DATA_FNO-14_k9ls5pob.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-INV_RN_INTERFACE_FNO-16_kels5pr0.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-TS_DATA_FNO-17_k0ls5od9.dbf
/home/ora10g/backup/data_D-RASAPP_I-1650089695_TS-UNDOTBS2_FNO-18_k4ls5pee.dbf

17 rows selected.

SQL> select count(*) from dba_objects where status<>'VALID';

  COUNT(*)
----------
         0

SQL> select count(*) from admin.test123;

  COUNT(*)
----------
     50118


In your own test environment, you can try out many different recovery scenarios without doing any harm. Performing test cases is very important for DBA’s because you must be well prepared when production database encounters disaster ;)

Read More...

Finally, I’m Oracle 10g Certified Professional now

Tuesday, October 19, 2010 5 comments
I’ve been studying last couple of months for the exam “1Z0-043 Oracle Database 10g: Administration II” and today I finally passed that exam. So I’m officially OCP 10g now :)


As I was mostly learning in my spare time whole that process lasted longer then I’ve expected. Material was familiar to me before I began studying for the exam but still I’ve become more proficient with resource management, using scheduler, handling block corruption, securing listener, etc.

For learning material I used Official Oracle Documentation, several different books, blog articles and OTN forum posts. But the most important was testing environment where I installed 10g database and played with it.

I don’t know what will be my next certification but I’m sure that I will find something interesting eventually :)

Read More...

Using Flashback Database to perform PITR

Tuesday, August 24, 2010 6 comments
Flashback database is great alternative for incomplete recovery because it is much less time consuming in some cases. You can find in documentation nice explanations how flashback database works so I won’t concentrate much on theory.

In this blog post I will perform simple demo case of point in time recovery using flashback database feature.

Database must be in ARCHIVELOG mode and FLASHBACK should be enabled.

SQL> select * from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.1.0.4.0 - Prod
PL/SQL Release 10.1.0.4.0 - Production
CORE    10.1.0.4.0      Production
TNS for Linux: Version 10.1.0.4.0 - Production
NLSRTL Version 10.1.0.4.0 - Production





SQL> select log_mode,flashback_on from v$database;

LOG_MODE     FLASHBACK_ON
------------ ------------------
ARCHIVELOG   YES


For testing purposes I will use SCOTT schema where I will create three tables in different time periods with my “sensitive data”.

Created first table at 21:35.
SQL> conn scott/tiger
Connected.

SQL> select to_char(sysdate,'dd.mm.yyyy hh24:mi:ss') from dual;

TO_CHAR(SYSDATE,'DD
--------
23.08.2010 21:35:19

SQL> create table t1 (a number);

Table created.

SQL> insert into t1 values (1);

1 row created.

SQL> commit;

Commit complete.

Second table at 21:45.
SQL> select to_char(sysdate,'dd.mm.yyyy hh24:mi:ss') from dual;

TO_CHAR(SYSDATE,'DD
--------
23.08.2010 21:45:05

SQL> create table t2 (b number);

Table created.

SQL> insert into t2 values (2);

1 row created.

SQL> commit;

Commit complete.

Third table at 22:05.
SQL> select to_char(sysdate,'dd.mm.yyyy hh24:mi:ss') from dual;

TO_CHAR(SYSDATE,'DD
--------
23.08.2010 22:05:48

SQL> create table t3 (c number);

Table created.

SQL> insert into t3 values (3);

1 row created.

SQL> commit;

Commit complete.


Lets assume that SCOTT is important production schema and DBA drops that schema by mistake. It is very important to recover lost schema as quickly as possible. DBA isn’t sure about correct time when he/she dropped schema.

Drop SCOTT schema:
$ sqlplus "/as sysdba"

SQL*Plus: Release 10.2.0.1.0 - Production on Mon Aug 23 22:13:13 2010

Copyright (c) 1982, 2005, Oracle.  All rights reserved.


Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options

SQL> drop user scott cascade;

User dropped.


RMAN uses flashback logs to undo changes to point before target time and then applies redo logs that are needed to make database consistent.

I will shutdown database and start in mount mode. As you can see I will use “shutdown abort” opinion - simply to minimize downtime.
SQL> shutdown abort;
ORACLE instance shut down.


SQL> startup mount;
ORACLE instance started.

Total System Global Area  297795584 bytes
Fixed Size                  1219064 bytes
Variable Size              83887624 bytes
Database Buffers          209715200 bytes
Redo Buffers                2973696 bytes
Database mounted.
SQL>

Flashback database to 21:40 because DBA said that he/she dropped schema somewhere around that time.
Before opening database with resetlogs opinion I want to check for my sensitive data. So I will temporary open flashbacked database in read only mode and do some checks.
SQL> flashback database to timestamp
 2  to_timestamp('23.08.2010 21:40:00','dd.mm.yyyy hh24:mi:ss');

Flashback complete.


SQL> alter database open read only;

Database altered.

My SCOTT schema is here but some data is missing.
SQL> select * from scott.t1;

        A
----------
        1

SQL> select * from scott.t2;
select * from scott.t2
                   *
ERROR at line 1:
ORA-00942: table or view does not exist


SQL> select * from scott.t3;
select * from scott.t3
                   *
ERROR at line 1:
ORA-00942: table or view does not exist

I will recover database to 21:50 and do some checks in read only mode:
SQL> shutdown abort;
ORACLE instance shut down.
SQL>
SQL>
SQL> startup mount;
ORACLE instance started.

Total System Global Area  297795584 bytes
Fixed Size                  1219064 bytes
Variable Size              83887624 bytes
Database Buffers          209715200 bytes
Redo Buffers                2973696 bytes
Database mounted.
SQL>


SQL> recover database until time '2010-08-23:21:50:00';

ORA-00279: change 609683 generated at 08/23/2010 21:40:13 needed for thread 1
ORA-00289: suggestion :
/u01/app/oracle/flash_
recovery_area/ORA10G/archivelog/2010_08_23/o1_mf_1_10_%u_.
arc
ORA-00280: change 609683 for thread 1 is in sequence #10


Specify log: {=suggested | filename | AUTO | CANCEL}

Log applied.
Media recovery complete.

This is better situation but still some data is missing.
SQL> alter database open read only;

Database altered.

SQL> select * from scott.t1;

        A
----------
        1

SQL> select * from scott.t2;

        B
----------
        2

SQL> select * from scott.t3;
select * from scott.t3
                   *
ERROR at line 1:
ORA-00942: table or view does not exist

This time I will recover to 22:07.
SQL> shutdown abort;
ORACLE instance shut down.

SQL> startup mount;
ORACLE instance started.

Total System Global Area  297795584 bytes
Fixed Size                  1219064 bytes
Variable Size              83887624 bytes
Database Buffers          209715200 bytes
Redo Buffers                2973696 bytes
Database mounted.
SQL>
SQL>
SQL> recover database until time '2010-08-23:22:07:00';
Media recovery complete.

Now everything is here.
SQL> alter database open read only;

Database altered.

SQL> select * from scott.t1;

        A
----------
        1

SQL> select * from scott.t2;

        B
----------
        2

SQL> select * from scott.t3;

        C
----------
        3


As I recovered all my sensitive data it is safe to open database with RESETLOGS opinion.

SQL> shutdown abort;
ORACLE instance shut down.

SQL> startup mount;
ORACLE instance started.

Total System Global Area  608174080 bytes
Fixed Size                  1268896 bytes
Variable Size             209716064 bytes
Database Buffers          390070272 bytes
Redo Buffers                7118848 bytes
Database mounted.

SQL> alter database open resetlogs;
Database altered.

Point in time recovery with flashback database lasted only several minutes with minimal loss of data.




REFERENCE
http://www.comp.dit.ie/btierney/oracle11gdoc/backup.111/b28273/rcmsynta023.htm

Read More...

Recovering From A Lost Controlfile

Sunday, August 22, 2010 0 comments
I decided to play a little with control file recovery. It is always wise to have controlfiles multiplexed on separate disks but for testing purposes my controlfiles will be placed in the same directory.

What will happen when I delete one of my three controlfiles?

From documentation:
Damage to any control file, whether it is multiplexed or not, halts database operation when the database attempts to read or write to the damaged control file (which happens frequently, for example at every checkpoint and log switch).


Little demonstration...


Environment:
Linux 32 bit
Oracle 10gR2


Delete one controlfile:
SQL> select name from v$controlfile;

NAME
--------------------------------------------------------------------------------
/u01/app/oracle/oradata/ora10g/control01.ctl
/u01/app/oracle/oradata/ora10g/control02.ctl
/u01/app/oracle/oradata/ora10g/control03.ctl


SQL> !rm /u01/app/oracle/oradata/ora10g/control03.ctl

First I will try shutdown immediate:
(Operation failed but database is still up and active.)
SQL> shutdown immediate;
ORA-00210: cannot open the specified control file
ORA-00202: control file: '/u01/app/oracle/oradata/ora10g/control03.ctl'
ORA-27041: unable to open file
Linux Error: 2: No such file or directory
Additional information: 3

-- from alert.log:

Sat Aug 21 23:52:00 2010
Errors in file /u01/app/oracle/admin/ora10g/udump/ora10g_ora_16200.trc:
ORA-00210: cannot open the specified control file
ORA-00202: control file: '/u01/app/oracle/oradata/ora10g/control01.ctl'
ORA-27041: unable to open file
Linux Error: 2: No such file or directory
Additional information: 3
Sat Aug 21 23:52:00 2010
Errors in file /u01/app/oracle/admin/ora10g/udump/ora10g_ora_16200.trc:
ORA-00210: cannot open the specified control file
ORA-00202: control file: '/u01/app/oracle/oradata/ora10g/control01.ctl'
ORA-27041: unable to open file
Linux Error: 2: No such file or directory
Additional information: 3
Sat Aug 21 23:52:01 2010
Errors in file /u01/app/oracle/admin/ora10g/udump/ora10g_ora_16200.trc:
ORA-00210: cannot open the specified control file
ORA-00202: control file: '/u01/app/oracle/oradata/ora10g/control01.ctl'
ORA-27041: unable to open file
Linux Error: 2: No such file or directory
Additional information: 3


What will happen if I copy good controlfile to location of missing one and then try to alter tablespace.
SQL> !cp /u01/app/oracle/oradata/ora10g/control01.ctl -
> /u01/app/oracle/oradata/ora10g/control03.ctl

SQL> alter system checkpoint;

System altered.



SQL> alter tablespace users read only;
alter tablespace users read only
     *
ERROR at line 1:
ORA-03113: end-of-file on communication channel



-- from alert.log:
Sat Aug 21 18:54:29 2010
********************* ATTENTION: ********************
The controlfile header block returned by the OS
has a sequence number that is too old.
The controlfile might be corrupted.
PLEASE DO NOT ATTEMPT TO START UP THE INSTANCE
without following the steps below.
RE-STARTING THE INSTANCE CAN CAUSE SERIOUS DAMAGE
TO THE DATABASE, if the controlfile is truly corrupted.
In order to re-start the instance safely,
please do the following:
(1) Save all copies of the controlfile for later
    analysis and contact your OS vendor and Oracle support.
(2) Mount the instance and issue:
    ALTER DATABASE BACKUP CONTROLFILE TO TRACE;
(3) Unmount the instance.
(4) Use the script in the trace file to
    RE-CREATE THE CONTROLFILE and open the database.
*****************************************************
Instance terminated by DBW0, pid = 7444
Instance was immediately terminated.


What if I execute startup now:
(Operation failed with error below.)
SQL> startup
ORACLE instance started.

Total System Global Area  297795584 bytes
Fixed Size                  1219064 bytes
Variable Size              71304712 bytes
Database Buffers          222298112 bytes
Redo Buffers                2973696 bytes
Database mounted.
ORA-01110: data file 4: '/u01/app/oracle/oradata/ora10g/users01.dbf'
ORA-00372: file 4 cannot be modified at this time
ORA-01110: data file 4: '/u01/app/oracle/oradata/ora10g/users01.dbf'




So what is the proper way to handle this controlfile recovery.

Delete one controlfile to simulate disk or controller failure.
SQL> select name from v$controlfile;

NAME
--------------------------------------------------------------------------------
/u01/app/oracle/oradata/ora10g/control01.ctl
/u01/app/oracle/oradata/ora10g/control02.ctl
/u01/app/oracle/oradata/ora10g/control03.ctl

SQL>

SQL> !rm /u01/app/oracle/oradata/ora10g/control03.ctl

Shutdown abort database and copy good copy of controlfile to the location of the lost one. You can also modify parameter file (CONTROL_FILES parameter) if you have to change location of lost controlfile.
SQL> shutdown abort;
ORACLE instance shut down.

SQL> !cp /u01/app/oracle/oradata/ora10g/control01.ctl -
> /u01/app/oracle/oradata/ora10g/control03.ctl

SQL> startup
ORACLE instance started.

Total System Global Area  297795584 bytes
Fixed Size                  1219064 bytes
Variable Size              71304712 bytes
Database Buffers          222298112 bytes
Redo Buffers                2973696 bytes
Database mounted.
Database opened.

SQL>

Database is up and fully functional.

REFERENCE
http://download.oracle.com/docs/cd/B19306_01/backup.102/b14192/intro008.htm#sthref96

Read More...