Database users may complain that something isn’t right with the application and request that the DBAs have a look in the database for problems. Oracle doesn’t record every error reported to database clients in the alert log, just server related messages, (internal errors, snapshot too old, segment cannot extend, etc). It is the application that should capture and/or report the full error stack returned to the database client, but unfortunately not all of them do.
The following is an example of this situation, in which I had to track down the cause of a problem where the application should have made that clear from its error log.
Application is Stuck – Example Case
An application owner asked me to look at the database, because an job on the application server was stuck and appeared to be looping. I asked for the error messages and while waiting for them, I fired up SQL*Plus.
The predominant and most unusual wait event for the recent and active sessions was “undo segment recovery”. This happens when the session is waiting for the PMON to roll back a dead transaction.
V$SESSION showed that the SQL being executed just before the wait was:
select STATE from DUMMY where ID = :1 FOR UPDATE
By this time I had received the application error logs. They referred to a different select statement on the same table, but the only clue was “This may be due to a transient race condition. Rolling back and retrying the transaction.”
The “rolling back” error text fitted with the “undo segment recovery” wait event I was seeing, even if the SQL didn’t match. The application was trying to do the same thing with the same result repeatedly.
I noticed that the number of “SQL*Net break/reset to client” wait events were steadily increasing. This indicates error messages were being generated for some database sessions.
At this point I could either try to trace from the client side, or from the server side. I only had access to the database server, so I enabled SQL tracing for a couple of likely database sessions and soon had a trace file of a session that had been waiting on “undo segment recovery”:
===================== PARSING IN CURSOR #4 len=52 dep=0 uid=30 oct=3 lid=30 tim=52668306764175 hv=3327540542 ad='914b1fb8' select STATE from DUMMY where ID = :1 FOR UPDATE END OF STMT PARSE #4:c=0,e=941,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,tim=52668306764164 BINDS #4: kkscoacd Bind#0 oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00 oacflg=03 fl2=1000000 frm=01 csi=871 siz=24 off=0 kxsbbbfp=ffffffff7ae74f60 bln=22 avl=04 flg=05 value=376059 WAIT #4: nam='undo segment recovery' ela= 2929424 segment#=5 tx flags=183 p3=0 obj#=27086 tim=52668309703725 EXEC #4:c=10000,e=2939555,p=0,cr=7,cu=2,mis=1,r=0,dep=0,og=1,tim=52668309703911 ERROR #4:err=1591 tim=1098267617 WAIT #4: nam='SQL*Net break/reset to client' ela= 9 driver id=1952673792 break?=1 p3=0 obj#=27086 tim=52668309704412 WAIT #4: nam='SQL*Net break/reset to client' ela= 973 driver id=1952673792 break?=0 p3=0 obj#=27086 tim=52668309705441 WAIT #4: nam='SQL*Net message to client' ela= 5 driver id=1952673792 #bytes=1 p3=0 obj#=27086 tim=52668309705502 WAIT #4: nam='SQL*Net message from client' ela= 2441 driver id=1952673792 #bytes=1 p3=0 obj#=27086 tim=52668309708092 WAIT #28: nam='SQL*Net message to client' ela= 3 driver id=1952673792 #bytes=1 p3=0 obj#=27086 tim=52668309708548 WAIT #28: nam='SQL*Net message from client' ela= 2608 driver id=1952673792 #bytes=1 p3=0 obj#=27086 tim=52668309711219 XCTEND rlbk=1, rd_only=1 WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1952673792 #bytes=1 p3=0 obj#=27086 tim=52668309711650
And there is the error message! ORA-1591 means “lock held by in-doubt distributed transaction”.
Querying DBA_2PC_PENDING showed a couple of pending two-phase-commit transactions since the previous day. I informed the application owner, who then asked me to ‘commit force’ those transactions. Problem solved.
If the users can’t provide the DBA with an ORA-NNNN error message, then the DBA can try to capture the error message by enabling SQL tracing and searching through the trace files for “err=“.