In a recent post, I talked about the most common way to prove that a performance problem is not inside the Oracle database: paying attention to the SQL*Net waits that are so often dismissed as idle events.

This post is about another case where a performance problem was passed to us and we found that the real issue lay outside the database.

A number of applications using databases on the same server had been reporting various performance issues and application timeouts.  At first there was no reason to think they were related, but while investigating a very simple case the real cause became clear.

One database was rebuilding indexes every day.  Most days this took minutes, but on other days it ran for hours.  Using StatsPack, I identified the create index statements and compared the “fast” days with the “slow” days.  I half expected to see enqueue wait events, but didn’t find any.  While the indexes were being rebuilt, the only significant wait event was db file scattered read (which matched the work and elapsed time needed to recreate the indexes).

An example of a “slow rebuild”:

                                                                   Avg
                                                     Total Wait   wait    Waits
Event                               Waits   Timeouts   Time (s)   (ms)     /txn
---------------------------- ------------ ---------- ---------- ------ --------
db file scattered read          2,199,645          0      8,520      4  3,034.0

An example of a “fast rebuild”:

                                                                   Avg
                                                     Total Wait   wait    Waits
Event                               Waits   Timeouts   Time (s)   (ms)     /txn
---------------------------- ------------ ---------- ---------- ------ --------
db file scattered read          2,419,071          0        553      0  3,516.1
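
If you want to trend this across many snapshots rather than eyeballing individual reports, a query along these lines against the StatsPack repository will do it.  This is only a sketch, assuming a 9i-or-later PERFSTAT schema (stats$system_event and stats$snapshot, with waits and time_waited_micro cumulative since instance startup), so the deltas between consecutive snapshots give the per-interval figures:

-- Sketch only: per-snapshot average wait for db file scattered read.
-- Assumes a single instance; deltas spanning an instance restart come out
-- negative and are filtered by the outer WHERE clause.
select snap_time,
       waits,
       round(wait_ms / nullif(waits, 0), 1) as avg_wait_ms
from (
    select sn.snap_time,
           e.total_waits
             - lag(e.total_waits) over (order by e.snap_id)               as waits,
           (e.time_waited_micro
             - lag(e.time_waited_micro) over (order by e.snap_id)) / 1000 as wait_ms
    from   perfstat.stats$system_event e,
           perfstat.stats$snapshot     sn
    where  sn.snap_id         = e.snap_id
    and    sn.dbid            = e.dbid
    and    sn.instance_number = e.instance_number
    and    e.event            = 'db file scattered read'
)
where waits > 0
order by snap_time;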

The average response time for the scattered reads varied greatly from one day to the next!  This database is not bypassing the filesystem cache with direct I/O, so at this point I thought it was possible that the problem was limited to the server.  To learn more, I used the sar history.  I identified the disk devices for the database (tracing filesystem to volume to LUN to the multipathed devices on a SAN), and then took the average for each device for each day.

7 Jul
           device        %busy   avque   r+w/s  blks/s  avwait  avserv
           ssd272            8     0.2       5     400     0.1    36.3
           ssd273            9     0.2       5     411     0.2    39.7
           ssd274           10     0.2       5     429     0.3    40.0
           ssd284           11     0.3       6     512     1.5    39.8
           ssd285           11     0.2       6     505     0.2    39.8
           ssd286           11     0.2       7     499     0.5    37.5
8 Jul
           device        %busy   avque   r+w/s  blks/s  avwait  avserv
           ssd272            2     0.0       5     338     0.0     5.0
           ssd273            2     0.0       5     339     0.0     4.7
           ssd274            2     0.0       5     342     0.0     4.5
           ssd284            2     0.0       5     338     0.0     4.5
           ssd285            2     0.0       5     338     0.0     4.3
           ssd286            2     0.0       5     342     0.0     4.1

Although the busy percentage doesn’t get above 11%, the average service time changes by a factor of nine from one day to the next.  (Sure, service times can vary simply because I/O request sizes vary, but I was averaging over an entire work day to mitigate this.)  So the filesystem cache isn’t a big factor; the delay is at a lower level.
Similar results were seen for the other databases’ LUNs, even on different servers using the same SAN, despite the workloads on most of the databases and servers not changing significantly.

This was enough evidence to ask for statistics from the SAN.  The following graph shows that the problem clearly lies with the SAN.  (Apologies for the background pattern.)

[Graph: SAN response time]

Looking at a month’s worth of sar averages, it becomes obvious that ‘something happens’ between 7am Monday and 6pm Tuesday every week.

Because the SAN write times are affected, I believe that the SAN cache is full of dirty blocks and cannot accept new blocks immediately.  These LUNs are both RAID-5.

Over to the SAN team to investigate further.
