The issue I’m writing about today remains a bit of a mystery.

My monitoring scripts had picked up that an LGWR process had started using 100% CPU and that other jobs were running for longer than usual.
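
For context, spotting it on Solaris only needs something like the following (a sketch rather than the actual monitoring script; the instance name PROD1 is made up):

# Sketch only: find the LGWR background process for the instance and take one CPU sample.
pid=$(pgrep -f ora_lgwr_PROD1)
prstat -p "$pid" 5 1     # one 5-second sample of CPU usage for that pid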

The 11gR2 RAC database was on Solaris and used ASM. Only one instance of the two-node cluster was affected (although the other did suffer from more GCS-related waits).

The most prominent wait was ‘log file sync’, but a comparison of AWR data showed that at 10pm many I/O operations had suddenly slowed down (a query sketch for reproducing the comparison follows the lists below).

Slower I/Os

Flashback log writes, log file parallel write, db file scattered read and db file sequential read all showed higher average wait times.

Not Slower

RMAN backups, direct path reads.
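
The comparison itself came from AWR data, but the same per-snapshot averages can be pulled from the AWR history views with something along these lines (a sketch only, not the query I actually used; the event names are the waits listed above, the flashback one being my best guess at its wait-event name, and the counters reset when an instance restarts):

# Sketch: average wait (ms) per AWR snapshot for the events in question.
sqlplus -s "/ as sysdba" <<'EOF'
set pages 100 lines 200
with ev as (
  select sn.begin_interval_time, se.instance_number, se.event_name,
         se.total_waits
           - lag(se.total_waits) over
               (partition by se.dbid, se.instance_number, se.event_name
                order by se.snap_id) as waits,
         se.time_waited_micro
           - lag(se.time_waited_micro) over
               (partition by se.dbid, se.instance_number, se.event_name
                order by se.snap_id) as waited_micro
  from   dba_hist_system_event se
         join dba_hist_snapshot sn
           on  sn.dbid            = se.dbid
           and sn.instance_number = se.instance_number
           and sn.snap_id         = se.snap_id
  where  se.event_name in ('log file sync', 'log file parallel write',
                           'db file sequential read', 'db file scattered read',
                           'flashback log file write')
)
select begin_interval_time, instance_number, event_name,
       round(waited_micro / nullif(waits, 0) / 1000, 2) as avg_wait_ms
from   ev
order  by event_name, instance_number, begin_interval_time;
EOF
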
Unfortunately I didn’t have sufficient privileges to diagnose the problem at the OS level, so a kind UNIX admin ran truss on the LGWR process for me.

syscall               seconds   calls  errors
 times                    .000      10
 semctl                   .000      30
 semop                    .000       3
 semtimedop               .000       2
 pread                    .001      14
 kaio                    6.043      80
 kaio                    4.581      78
 pollsys                  .000       9
 recvmsg                  .000      18       8
 sendmsg                  .001      15
 --------               ------    ----
 sys totals:            10.628     259       8
 usr time:                .022
 elapsed:               11.030
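
For reference, the summary above and the per-call listings further down correspond to roughly these invocations (my reconstruction of the flags, since I don’t know the admin’s exact command line; PROD1 is again a made-up instance name):

# Sketch of the likely truss invocations, not the admin's exact commands.
pid=$(pgrep -f ora_lgwr_PROD1)
truss -c -p "$pid"    # per-syscall counts and times; summary prints on Ctrl-C
truss -D -p "$pid"    # each call prefixed with the time delta since the last event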

So KAIO system calls were taking all the CPU time. Let’s look at the detail.
Before the instance was restarted:

 /1:     0.0897    kaio(AIOWRITE, 258, 0x380DEF600, 28160, 0x501DFE007A072930) = 0
 /1:     0.0893    kaio(AIOWRITE, 256, 0x380DEF600, 28160, 0x0D2DFE007A074DC0) = 0
 /1:     0.0893    kaio(AIOWRITE, 261, 0x380DEF600, 28160, 0x0A0DFE007A074B50) = 0
 /1:     0.0005    sendmsg(17, 0xFFFFFFFF7FFFB660, 0)        = 224
 /1:     0.0706    kaio(AIOWAIT, 0xFFFFFFFF7FFF8ED0)        = -2247677648
 /1:     0.0708    kaio(AIOWAIT, 0xFFFFFFFF7FFF8ED0)        = -2247668288
 /1:     0.0706    kaio(AIOWAIT, 0xFFFFFFFF7FFF8ED0)        = -2247668912

After the instance was restarted, the KAIO calls were back to normal:

 /1:     0.0006    kaio(AIOWRITE, 262, 0x380A7A600, 20480, 0xE2C382007A072BA0) = 0
 /1:     0.0002    kaio(AIOWRITE, 258, 0x380A7A600, 20480, 0x5D2382007A0732F0) = 0
 /1:     0.0001    kaio(AIOWRITE, 256, 0x380A7A600, 20480, 0x13A382007A073080) = 0
 /1:     0.0003    sendmsg(17, 0xFFFFFFFF7FFFB670, 0)        = 224
 /1:     0.0005    kaio(AIOWAIT, 0xFFFFFFFF7FFF8EE0)        = -2247677024
 /1:     0.0001    kaio(AIOWAIT, 0xFFFFFFFF7FFF8EE0)        = -2247675152
 /1:     0.0001    kaio(AIOWAIT, 0xFFFFFFFF7FFF8EE0)        = -2247675776

I wish I had truss output from the sessions performing the fast direct path reads, to see why they weren’t affected; I would expect them to use KAIO too. The scattered and sequential reads, on the other hand, wouldn’t have used KAIO, yet they were slower!
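
Picking a candidate session to truss would have been simple enough; a sketch like this, joining v$session to v$process for the OS process id, is all it would have taken:

# Sketch: find the OS pid of a session currently waiting on 'direct path read',
# then hand that pid to truss as above.
sqlplus -s "/ as sysdba" <<'EOF'
select s.sid, s.sql_id, p.spid
from   v$session s
       join v$process p
         on p.addr = s.paddr
where  s.event = 'direct path read'
and    s.state = 'WAITING';
EOF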

I managed to find a match: MOS Bug 13356146 “PERIODIC SLOW DOWN IN SYSTEM”, but this was closed due to insufficient data.

So… unless I see it again, it will remain a mystery.

2 thoughts on “Kaio oww”

    1. Hi Thomas,

      There was no RAID-5 involved, just ASM on an Oracle (Sun) SAN. This client didn’t use RAID-5 for any of their Oracle database environments.
      Only one instance/server was affected, and only some types of I/O, so I don’t think it was the SAN.
      I don’t think I’ll ever know what the issue was because it didn’t happen again.

      Regards
      Ari
