A colleague was asked to run a data fix script in a production Oracle database:

update services s
set s.service_ref = (select si.service_reference
                    from SERVICES_INFO si
                    where si.SERVICE_ID = S.SERVICE_ID)
where s.service_ref is null;

This was the explain plan:

| Operation                                        |  Name                    |  Rows | Bytes|  Cost  | I/O|PQ Dist| Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------------
|  0 UPDATE STATEMENT                              |                          |       |      |  39885 |    |       |       |       |
|  1  UPDATE                                       |SERVICES                  |       |      |        |    |       |       |       |
|  2   TABLE ACCESS FULL                           |SERVICES                  |     6M|  128M|  39885 |    |       |       |       |
|  3   TABLE ACCESS BY INDEX ROWID                 |SERVICES_INFO             |     1 |   18 |      3 |    |       |       |       |
|  4    INDEX UNIQUE SCAN                          |SERVICES_INFO_PK          |     1 |      |      2 |    |       |       |       |
------------------------------------------------------------------------------------------------------------------------------------

From v$session_longops I could see that a full tablescan of SERVICES was estimated to take about two minutes.  After waiting a few minutes the update was still running and sessions blocked by the transaction were piling up.
V$session_longops showed that the tablescan of SERVICES had started again!  (Note that this was not a partitioned table).
This pattern repeated, so we stopped the update to allow the blocked sessions to continue.

To get around the problem, we listed generated a list of rows where service_ref was null using a select statement, and then performed a quick update on just those rows, which completed immediately.

I believe that this update was never completing because it was restarting due to the inability to perform a consistent write. (Tom Kyte has written about this, so I’ll be brief).
I guessed that after the data fix update started, other sessions had updated at least one row where service_ref was null, and then committed. The data fix update session could not complete what it had started, because some of the target rows had changed already.  It had to rollback and try the update again in order to update a consistent set of rows.

A quick check of V$SQL provided more supporting evidence. There were twenty thousand recent update statements that set the value of every NOT NULL column in SERVICES for a row identified by SERVICE_ID.  It appears that the row is inserted with nulls and then the values updated later.  This explains why the data fix update session couldn’t scan through SERVICES before one of its targeted rows had been been changed and committed.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s