During a build of a new Oracle 11g RAC environment I found that ASM performance during tablespace creation was terrible. I used iostat to learn that it took a number of minutes before I/O started. Once the I/O started, the transfer rate was satisfactory. The delay before starting to write was often almost as long as than the I/O time. (Averaged about 40%, even when only one session in the cluster was creating datafiles).
If I were writing to a conventional volume manager / filesystem, I’d expect the I/O to start immediately.
RMAN backups to ASM were affected by the same issue. The archive log writer performance was slower than expected when writing to ASM diskgroups.
The environment was:
- Solaris 10
- Oracle 126.96.36.199.3 RAC – two nodes
- ASM (compatible.asm=11.2, 1MB AUs)
I had the assistance of Oracle Support and a globally known Oracle expert during this investigation. (He may well publish a more detailed and technical account of what was learned during this investigation, so I’ll try to summarise here).
The database session stalled on “KSV master wait” before starting the disk I/O, which I believe means the database session is waiting for work to complete in the ASM instance.
In the ASM instance, a slave process was switching between “ges inquiry response” and “ASM file metadata operation” during the “KSV master wait” delay. Most of the time was spent waiting for “ges inquiry response“.
Oracle Support gave me this description:
Wait Event: “ges inquiry response”
Definition: The problem is related to fast rcfg where resource was only replayed on remastered resources.
The resource cleanup will only clean up the state (including inquiry state) on remastered resources and caused discrepancies in inquires state for non-remastered resources.
I didn’t know what to make of that; it didn’t seem to apply to our situation. I still wanted to know why there were so many waits, and so I was keen to find out what the id1 and id2 of the wait event meant. I guessed it was requesting permission to change some ASM metadata from the master of that resource on the other ASM instance, and using truss on the ASM LMD process confirmed this.
I used oradebug to trace the ASM slave process during datafile to gather more information about the “ges inquiry response” event. It showed that the pairs of id1 and id2 were repeated several times each.
My RAC expert colleague discovered that id1 = disk number and id2 = allocation unit (AU) number. He used X$KFFXP (extent mapping for files) to prove this.
select DISK_KFFXP disk_num , DISK_KFFXP+131072 id1 , AU_KFFXP id2 from x$kffxp where group_kffxp = 2 -- disk group number and number_kffxp = 613 -- file number of datafile /
From the trace file, we learned that not all the AUs that were allocated to the datafile had a corresponding “ges inquiry response” wait. I believe that is because the local instance was the master for some of them. I don’t know why each AU had to be locked several times – it seems very inefficient.
Each “ges inquiry response” wait was around 0.6ms and the truss on the ASM LMD process indicated to me that the packets were small. The network was performant, it was just the high number of round trips that caused the delay. Eg, for a 1024M datafile in a normal redundancy, (2050 x 1MB AUs for the file), there were about 7300 ges inquiries for about 1500 AUs. Imagine what happens when creating a 200G tablespace.
Concurrent allocation of ASM space is even worse. An eight-channel database image backup via RMAN resulted in one ASM session waiting for “ges inquiry response“, and the other seven waiting on “buffer busy”. Once again, no disk I/O happened until one of the sessions had acquired locks for each AU needed for the copy of the datafile.
Creating tablespaces/datafiles concurrently had the same symptoms. If concurrent ASM space allocations were requested from both nodes, then GCS wait events started showing up too, eg “GCS lock cvt X”.
Tuning / Workarounds
- There is an obvious RAC element to this issue, so as expected, only having one ASM instance up when allocating large amounts of space in ASM drastically reduced the delay time before disk I/O starts.
- Using the diskgroup attribute COMPATIBLE.ASM = 10.1 reduced the “KSV master wait” time by 66% in our tests. However, this robs us of some features, such as the fast mirror resync capability, so this was not a preferred option.
- Changing from normal to external redundancy reduces the number AUs by half and so reduces the delay.
- Increasing the AU_SIZE reduces the number of AUs (and therefore “ges inquiry response” waits) and so reduces the delay too. Note that despite the default AU being 1MB, at the time of writing, Oracle recommend 4MB for the AU_SIZE.
For us, option 4 was the most acceptable. We just had to size the redo logs so that archive logs do not waste too much space in ASM. (Sized just under a multiple of the AU_SIZE).
The choice of AU_SIZE for each disk group would depend on how much an organisation values performance over capacity waste, and what kind of files and activity are associated with that disk group.
The “knee” of the trendline in the chart below is between the 4MB and 8MB AU_SIZE. Apparently some systems support 4MB maximum I/Os now, so perhaps these are the reasons why Oracle now recommend AUs sized at 4MB.
The actual disk I/O time (“Data file init write”) didn’t vary. Our maximum I/O size was 1MB for all of these tests, so it makes sense that the I/O time wouldn’t change with AUs between 1MB and 32MB.
Creating a diskgroup with 64MB AUs hung, so we didn’t experiment further with that size.
I summarised our findings for Oracle Support and requested that the developers look into the reason multiple GES operations had to occur for each AU when using COMPATIBLE.ASM>10.1. It might be a bug that can be fixed. If not, perhaps Oracle could make GES operations that lock multiple AUs at a time to reduce the number of round trips, so that we don’t have to increase the AU_SIZE to reduce the GES overhead.
Oracle Support pointed me to the document below, saying it was “valid for non-exadata environments”.
Oracle Sun Database Machine Performance Best Practices [ID 1067520.1] says:
If creating multiple files in parallel in the same diskgroup is a requirement (eg: tablespaces, backups, restores), run the file creations on one node in the RAC cluster. Doing so optimizes lock acquisition and minimizes buffer pins associated with file allocation. File allocation intra and inter node scalability will be optimized in Oracle 188.8.131.52.
So the poor performance I’ve seen is after the file allocation optimisations?!