Thursday, February 7, 2013

Kill stuck Oracle process from OS using DBMS_SCHEDULER

Last month we had annoying problems caused by stuck Oracle process holding locks and not releasing them. Every few days one process hangs with status “ACTIVE” and “SQL*Net message from client” wait event.

In that state it holds locks and never releases them. As this is active OLTP database, new sessions arrive wanting those locked resources but they could not get them. So after some time you get lots of waiters and if you don’t react quickly maximum number of processes could be reached. Then your database refuses new connections and you're system becomes unresponsive.

To complicate a problem little more, when you kill session from Oracle using “disconnect session immediate” or “kill session immediate” - session is only marked as “KILLED”. It is not cleaned and locks are not released even if you wait for several hours. Only opinion is to kill process from OS side and then PMON manages to perform cleaning and release locks.

So we created temporary solution until “stuck process” problem is solved. As this is active OLTP system we are expecting most queries to complete under a minute. Every session active for over 30 minutes is probably stuck process and needs to be killed from OS side.

I will publish scripts we used to temporary solve this problem. You should change them according to your needs.

kill_stuck_process.sh
Main script used to perform OS kill. This script needs to be executable file.
#!/bin/bash

# SET ORACLE ENVIRONMENT #
export ORACLE_BASE=/u01/app/oracle
export ORACLE_HOME=$ORACLE_BASE/product/11.1.0/db_1
PATH=$ORACLE_HOME/bin:$PATH; export PATH
export ORACLE_SID=orcl

# File which will be used to keep spid
if [ -f /tmp/spid.txt ]; then
/bin/rm -f /tmp/spid.txt
fi

# Find SPID of the hanging process you want to kill
$ORACLE_HOME/bin/sqlplus -s system/password @/oracle/scripts/get_spid.sql > /tmp/spid.txt

spid=`cat /tmp/spid.txt`

# If spid exists kill specified process
if [ -s /tmp/spid.txt ]; then
# Send mail from Oracle about killed process
$ORACLE_HOME/bin/sqlplus -s system/password @/oracle/scripts/send_mail.sql
# Kill process
/bin/kill -9 $spid
fi

get_spid.sql
Query used to retrieve SPID of the stuck process. Modify this query based on your needs.
set serverout on head off ver off feed off
SELECT spid from v$process
where addr = ( SELECT paddr
                        FROM
                        (
                            SELECT paddr, rownum rn
                                from (
                                    SELECT paddr, last_call_et 
                                            FROM v$session
                                            where username='OLTP_USER'
                                            and last_call_et >= 1800
                                            and status='ACTIVE'
                                            order by last_call_et desc
                                       )
                        )
                        where rn=1);
exit

send_mail.sql
Procedure sends detailed information about killed process for better diagnosis later. This is just additional info and it is not essential for the script.
begin
  "ADMINUSR"."P_SEND_PROCESS_DETAILS";
end;
/
exit

Now to create DBMS_SCHEDULER job which will be executed every 30 minutes and kill stuck process.
BEGIN
DBMS_SCHEDULER.CREATE_JOB (
   job_name             => 'KILL_STUCK_PROCESS',
   job_type             => 'EXECUTABLE',
   job_action           => '/oracle/script/kill_stuck_process.sh',
   start_date           => SYSDATE,
   repeat_interval      => 'FREQ=MINUTELY;INTERVAL=30',
   enabled              =>  FALSE,
   comments             =>'Kill long running stuck OLTP_USER session!');
END;
/

Now to test the job:
BEGIN
  DBMS_SCHEDULER.run_job (job_name=> 'KILL_STUCK_PROCESS', use_current_session => TRUE);
END;
/

You will probably receive error:
BEGIN
*
ERROR at line 1:
ORA-27369: job of type EXECUTABLE failed with exit code: Permission denied
ORA-06512: at "SYS.DBMS_ISCHED", line 154
ORA-06512: at "SYS.DBMS_SCHEDULER", line 450
ORA-06512: at line 2

When dbms_scheduler is executing external (operating system) jobs it is using lowly privileged user and group for your platform. In my environment (OEL 5.8) this is “nobody” and “nobody”. As you must kill OS process you must specify more privileged user. There is “externaljob.ora” configuration file in “$ORACLE_HOME/rdbms/admin”. Edit this file as root and specify privileged user and group. In my case:
run_user = orcl11
run_group = oinstall

Everything is working fine now:
BEGIN
  DBMS_SCHEDULER.run_job (job_name=> 'KILL_STUCK_PROCESS', use_current_session => TRUE);
END;
/

PL/SQL procedure successfully completed.


Thoroughly test this scripts in your environment before going to production with them.


5 comments:

  1. Nice workaround!
    I've become a fan of "alter system disconnect session 'sid, serial#' immediate;" for these situations.
    It's usually as good as killing the os process, but does not require to deal with scripts and reduces the complexity.

    Maris

    ReplyDelete
  2. Hello Maris,

    I'm using "alter system disconnect session 'sid, serial#' immediate;" almost all the time. As I noted in blog post, I've tried with "disconnect session immediate" but session only becomes marked "KILLED" and keeps locks for hours.

    I agree with you that this is complex solution and it is not very nice if you ask me. But I had to create some workaround. Or just wait for database to become unresponsive :) For temporary solution it serves its purpose.

    Thanks for comment.

    Regards,
    Marko

    ReplyDelete
  3. Hi Marko,

    Just want to know.

    Is "ADMINUSR"."P_SEND_PROCESS_DETAILS"; your personal email PL/SQL or can I install take from somewhere?

    ReplyDelete
  4. Hello,

    this is simple procedure which sends info about the session that will be disconnected (sql text,wait event, wait class, program, etc.) for later diagnosis.

    I've wrote that procedure according to my needs. You can even exclude this procedure from the job as it is only used to send additional info.

    It is not available for download/install.

    Regards,
    Marko

    ReplyDelete
  5. Really a very nice oracle
    post.Thanks for sharing.

    ReplyDelete