Wednesday, June 8, 2011

Oracle Listener crashes with core dump in Linux 64bit

We had serious problems with Oracle TNS Listener last few days. Problems were happening on very critical system - most active OLTP database.

Every few hours/minutes, in irregular intervals, TNS listener suddenly crashes which forces our Red Hat Cluster Manager to restart instance and even switch database to another node. Business systems that have relied on this database were experiencing great difficulties.

It is worth to mention that everything worked without any problems for about 2 years without any intervention on OS or database configuration.

Environment:
OS: Red Hat Enterprise Linux ES release 4 (64bit)
DB: Oracle EE 10.2.0.4.0



From Listener log:
…
…
…
07-LIP-2011 11:19:21 * (CONNECT_DATA=(SERVICE_NAME=odb )(CID=(PROGRAM=httpd)(HOST=rad2)(USER=apache))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.3.13.49)(PORT=58405)) * establish * odb * 12518
TNS-12518: TNS:listener could not hand off client connection
TNS-12547: TNS:lost contact
 TNS-12560: TNS:protocol adapter error
  TNS-00517: Lost contact
    Linux Error: 32: Broken pipe
07-LIP-2011 11:19:21 * service_update * odb * 0
07-LIP-2011 11:19:21 * service_update * odb * 0
07-LIP-2011 11:19:21 * (CONNECT_DATA=(SERVICE_NAME=otp)(CID=(PROGRAM=httpd)(HOST=rad2)(USER=apache))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.3.13.49)(PORT=58402)) * establish * odb * 0

TNSLSNR for Linux: Version 10.2.0.4.0 - Production on 07-LIP-2011 11:21:23

Copyright (c) 1991, 2007, Oracle.  All rights reserved.

System parameter file is /u01/app/oracle/product/10.2/db_1/network/admin/listener.ora
Log messages written to /u01/app/oracle/product/10.2/db_1/network/log/listener_odb.log
Trace information written to /u01/app/oracle/product/10.2/db_1/network/trace/listener_odb.trc
Trace level is currently 0

Started with pid=9607
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.2.10.40)(PORT=1521)))
Listener completed notification to CRS on start
…
…
...

….
….
….
07-LIP-2011 11:10:16 * (CONNECT_DATA=(SERVICE_NAME=UDB)(CID=(PROGRAM=)(HOST=POP1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.3.12.71)(PORT=44557)) * establish * UDB * 12518
TNS-12518: TNS:listener could not hand off client connection
TNS-12571: TNS:packet writer failure
 TNS-12560: TNS:protocol adapter error
  TNS-00530: Protocol adapter error
    Linux Error: 104: Connection reset by peer

TNSLSNR for Linux: Version 10.2.0.4.0 - Production on 07-LIP-2011 11:11:01

Copyright (c) 1991, 2007, Oracle.  All rights reserved.

System parameter file is /u01/app/oracle/product/10.2/db_1/network/admin/listener.ora
Log messages written to /u01/app/oracle/product/10.2/db_1/network/log/listener_udb.log
Trace information written to /u01/app/oracle/product/10.2/db_1/network/trace/listener_udb.trc
Trace level is currently 0

Started with pid=21862
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.2.10.39)(PORT=1521)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC0)))
Listener completed notification to CRS on start

TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE
WARNING: Subscription for node down event still pending
…
…



From /var/log/messages:
tnslsnr[31928]: segfault at 0000000000000018 rip 0000003d34268593 rsp 0000007fbfff9ba0 error 4
tnslsnr[21862]: segfault at 0000000000000018 rip 0000003c00367f2d rsp 0000007fbfff9220 error 4




My first destinations to search for answers were Oracle support site, OTN forums and Oracle related technical blogs.

I’ve found interesting note on Oracle support site: [ID 549932.1]
Symptoms mentioned there:
- There may be heavy load on the CPU shooting up to 100%.
- The number of sessions in the database is well below the upper or maximum limit defined in the parameter file.
- The listener crashes suddenly during this heavy CPU load generating the core.
- (Optional) Listener.Ora has SUBSCRIBE_FOR_NODE_DOWN_EVENT_LISTENER=OFF.

Cause:
Extensive paging/swapping activity is a clear indication that the system is running out of the physical memory.



So I’ve decided to run vmstat for hour or two to check for swapping activity.
$vmstat 5

Columns si and so were 0 all the time so I thought that our problems weren’t associated with paging/swapping activity. We continued to search problems in other areas.


To cut a long story short - after few days searching for answers we found nothing so I’ve decided to apply patch for unpublished Bug 6139856 despite the fact that I haven't found problem with swapping.

Before applying patch I like to take backup of Oracle Home so I’ve decided to make compressed copy of that directory. Soon after compression began CPU load significantly raised and vmstat started to display swapping activity.

It was pretty clear now that paging/swapping activity is probably the cause of our problems so I applied the patch 6139856.
That fixed our problems and system was stable again :-)


Conclusion
Due to some recent network/firewall problems CPU load on our OLTP database server occasionally raises, causes extensive swapping which crashes TNS listener.


5 comments:

  1. Ups, that was a nice one :) Thanks for sharing

    ReplyDelete
  2. Just courious ... have you tried to give listener process real time priority (like explained on http://linux.die.net/man/1/chrt)?

    I have seen (esspecially on RAC) that oracle like to have important processes in high priority.

    This is set throught it's ora_parameter:

    SQL> show parameter high

    NAME TYPE VALUE
    ------------------------------------ ----------- ------------------------------
    _high_priority_processes string LMS*|LGWR|PMON

    here this is not a case but setting from linux might IMHO, help.

    ReplyDelete
  3. Nope. I've never manually raised priority for any Oracle process and I haven't found any information that this could help in my case - thanks for info.

    After I applied patch listener hasn't crashed since then so this was obviously bug.

    Regards,
    Marko

    ReplyDelete
  4. dear all,

    I have same error.plz help.one thing that is i checked listner.log file.there is no entry of last 12days....need help

    ReplyDelete
  5. Hello Sam,

    what OS and Oracle version you are talking about?

    Can you check size of the listener.log file? Is it 4GB?

    Regards,
    Marko

    ReplyDelete