Every few hours/minutes, in irregular intervals, TNS listener suddenly crashes which forces our Red Hat Cluster Manager to restart instance and even switch database to another node. Business systems that have relied on this database were experiencing great difficulties.
It is worth to mention that everything worked without any problems for about 2 years without any intervention on OS or database configuration.
Environment:
OS: Red Hat Enterprise Linux ES release 4 (64bit)
DB: Oracle EE 10.2.0.4.0
From Listener log:
… … … 07-LIP-2011 11:19:21 * (CONNECT_DATA=(SERVICE_NAME=odb )(CID=(PROGRAM=httpd)(HOST=rad2)(USER=apache))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.3.13.49)(PORT=58405)) * establish * odb * 12518 TNS-12518: TNS:listener could not hand off client connection TNS-12547: TNS:lost contact TNS-12560: TNS:protocol adapter error TNS-00517: Lost contact Linux Error: 32: Broken pipe 07-LIP-2011 11:19:21 * service_update * odb * 0 07-LIP-2011 11:19:21 * service_update * odb * 0 07-LIP-2011 11:19:21 * (CONNECT_DATA=(SERVICE_NAME=otp)(CID=(PROGRAM=httpd)(HOST=rad2)(USER=apache))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.3.13.49)(PORT=58402)) * establish * odb * 0 TNSLSNR for Linux: Version 10.2.0.4.0 - Production on 07-LIP-2011 11:21:23 Copyright (c) 1991, 2007, Oracle. All rights reserved. System parameter file is /u01/app/oracle/product/10.2/db_1/network/admin/listener.ora Log messages written to /u01/app/oracle/product/10.2/db_1/network/log/listener_odb.log Trace information written to /u01/app/oracle/product/10.2/db_1/network/trace/listener_odb.trc Trace level is currently 0 Started with pid=9607 Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.2.10.40)(PORT=1521))) Listener completed notification to CRS on start … … ...
…. …. …. 07-LIP-2011 11:10:16 * (CONNECT_DATA=(SERVICE_NAME=UDB)(CID=(PROGRAM=)(HOST=POP1)(USER=root))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.3.12.71)(PORT=44557)) * establish * UDB * 12518 TNS-12518: TNS:listener could not hand off client connection TNS-12571: TNS:packet writer failure TNS-12560: TNS:protocol adapter error TNS-00530: Protocol adapter error Linux Error: 104: Connection reset by peer TNSLSNR for Linux: Version 10.2.0.4.0 - Production on 07-LIP-2011 11:11:01 Copyright (c) 1991, 2007, Oracle. All rights reserved. System parameter file is /u01/app/oracle/product/10.2/db_1/network/admin/listener.ora Log messages written to /u01/app/oracle/product/10.2/db_1/network/log/listener_udb.log Trace information written to /u01/app/oracle/product/10.2/db_1/network/trace/listener_udb.trc Trace level is currently 0 Started with pid=21862 Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.2.10.39)(PORT=1521))) Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC0))) Listener completed notification to CRS on start TIMESTAMP * CONNECT DATA [* PROTOCOL INFO] * EVENT [* SID] * RETURN CODE WARNING: Subscription for node down event still pending … …
From /var/log/messages:
tnslsnr[31928]: segfault at 0000000000000018 rip 0000003d34268593 rsp 0000007fbfff9ba0 error 4 tnslsnr[21862]: segfault at 0000000000000018 rip 0000003c00367f2d rsp 0000007fbfff9220 error 4
My first destinations to search for answers were Oracle support site, OTN forums and Oracle related technical blogs.
I’ve found interesting note on Oracle support site: [ID 549932.1]
Symptoms mentioned there:
- There may be heavy load on the CPU shooting up to 100%.
- The number of sessions in the database is well below the upper or maximum limit defined in the parameter file.
- The listener crashes suddenly during this heavy CPU load generating the core.
- (Optional) Listener.Ora has SUBSCRIBE_FOR_NODE_DOWN_EVENT_LISTENER=OFF.
Cause:
Extensive paging/swapping activity is a clear indication that the system is running out of the physical memory.
So I’ve decided to run vmstat for hour or two to check for swapping activity.
$vmstat 5
Columns si and so were 0 all the time so I thought that our problems weren’t associated with paging/swapping activity. We continued to search problems in other areas.
To cut a long story short - after few days searching for answers we found nothing so I’ve decided to apply patch for unpublished Bug 6139856 despite the fact that I haven't found problem with swapping.
Before applying patch I like to take backup of Oracle Home so I’ve decided to make compressed copy of that directory. Soon after compression began CPU load significantly raised and vmstat started to display swapping activity.
It was pretty clear now that paging/swapping activity is probably the cause of our problems so I applied the patch 6139856.
That fixed our problems and system was stable again :-)
Conclusion
Due to some recent network/firewall problems CPU load on our OLTP database server occasionally raises, causes extensive swapping which crashes TNS listener.
Ups, that was a nice one :) Thanks for sharing
ReplyDeleteJust courious ... have you tried to give listener process real time priority (like explained on http://linux.die.net/man/1/chrt)?
ReplyDeleteI have seen (esspecially on RAC) that oracle like to have important processes in high priority.
This is set throught it's ora_parameter:
SQL> show parameter high
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
_high_priority_processes string LMS*|LGWR|PMON
here this is not a case but setting from linux might IMHO, help.
Nope. I've never manually raised priority for any Oracle process and I haven't found any information that this could help in my case - thanks for info.
ReplyDeleteAfter I applied patch listener hasn't crashed since then so this was obviously bug.
Regards,
Marko
dear all,
ReplyDeleteI have same error.plz help.one thing that is i checked listner.log file.there is no entry of last 12days....need help
Hello Sam,
ReplyDeletewhat OS and Oracle version you are talking about?
Can you check size of the listener.log file? Is it 4GB?
Regards,
Marko