A Kiwi DBA in London

May 3, 2012

Kaio oww

The issue I’m writing about today remains a bit of a mystery.

My monitoring scripts had picked up that a LGWR process had started using 100% CPU and other jobs were running for longer than usual.

The 11gR2 RAC database was on Solaris and used ASM.  Only one instance of the two node cluster was affected. (Although the other did suffer from more GCS related waits).

The most prominent wait was ‘log file sync’, but a comparison of AWR data showed that at 10pm, many I/O operations suddenly slowed down.

Slower I/Os

Higher flashback log writes, log file parallel write, db file scattered read and db file sequential read.

Not Slower

RMAN backups, direct path reads.
Unfortunately I didn’t have sufficient privileges to diagnose the problem.  A friend UNIX admin ran truss on the LGWR process for me.

syscall               seconds   calls  errors
 times                    .000      10
 semctl                   .000      30
 semop                    .000       3
 semtimedop               .000       2
 pread                    .001      14
 kaio                    6.043      80
 kaio                    4.581      78
 pollsys                  .000       9
 recvmsg                  .000      18       8
 sendmsg                  .001      15
 --------               ------    ----
 sys totals:            10.628     259       8
 usr time:                .022
 elapsed:               11.030

So KAIO system calls were taking all the CPU time. Lets see the detail.
Before the instance was restarted:

 /1:     0.0897    kaio(AIOWRITE, 258, 0x380DEF600, 28160, 0x501DFE007A072930) = 0
 /1:     0.0893    kaio(AIOWRITE, 256, 0x380DEF600, 28160, 0x0D2DFE007A074DC0) = 0
 /1:     0.0893    kaio(AIOWRITE, 261, 0x380DEF600, 28160, 0x0A0DFE007A074B50) = 0
 /1:     0.0005    sendmsg(17, 0xFFFFFFFF7FFFB660, 0)        = 224
 /1:     0.0706    kaio(AIOWAIT, 0xFFFFFFFF7FFF8ED0)        = -2247677648
 /1:     0.0708    kaio(AIOWAIT, 0xFFFFFFFF7FFF8ED0)        = -2247668288
 /1:     0.0706    kaio(AIOWAIT, 0xFFFFFFFF7FFF8ED0)        = -2247668912

After the instance was restarted the KAIO calls were back to normal:

 /1:     0.0006    kaio(AIOWRITE, 262, 0x380A7A600, 20480, 0xE2C382007A072BA0) = 0
 /1:     0.0002    kaio(AIOWRITE, 258, 0x380A7A600, 20480, 0x5D2382007A0732F0) = 0
 /1:     0.0001    kaio(AIOWRITE, 256, 0x380A7A600, 20480, 0x13A382007A073080) = 0
 /1:     0.0003    sendmsg(17, 0xFFFFFFFF7FFFB670, 0)        = 224
 /1:     0.0005    kaio(AIOWAIT, 0xFFFFFFFF7FFF8EE0)        = -2247677024
 /1:     0.0001    kaio(AIOWAIT, 0xFFFFFFFF7FFF8EE0)        = -2247675152
 /1:     0.0001    kaio(AIOWAIT, 0xFFFFFFFF7FFF8EE0)        = -2247675776

I wish I had truss information of the sessions performing fast direct path reads to see why they weren’t affected, because I expect them to use KAIO too. The scattered and sequential reads wouldn’t have used KAIO, but were slower!

I managed to find a match: MOS Bug 13356146 “PERIODIC SLOW DOWN IN SYSTEM”, but this was closed due to insufficient data.

So… unless I see it again, it will remain a mystery.

April 22, 2012

Modern Age Database Reorganisation vs LOBs

Filed under: LOBs,Oracle Database — Ari @ 7:17 am
Tags: ,

Database re-organisations exist only as fond and distant memories of old Oracle DBAs.  Well… almost.  I was asked to purge, partition and reclaim space from audit tables that had grown to be 100s of gigabytes each.  The real problem was that they contained LOBs.  This post lists a few of the tweaks and tricks I used to bring the ancient ritual of the reorg into the modern age (11gR2).

In a nutshell, we wanted to move a schema from its current tablespace into a new one, leaving behind any data older than 12 months, and partitioning the largest objects at the same time.

  • The UFS filesystems, Solaris OS and Oracle database were configured to allow concurrent direct 1MB I/Os (most efficient for our environment).
  • The destination tablespaces were new and empty, so they could be configured without flashback or redo logging.  This dramatically reduced overhead when copying data into the tablespace, while allowing us to use a guaranteed restore point to flashback the rest of the database if the reorg failed.  (The new tablespaces would have to be dropped, but they were only populated during the reorg exercise, and the old tablespaces would be retained until after the reorg had completed, so it doesn’t matter that we couldn’t recover the new tablespaces).
  • The destination tablespaces would be created with evenly sized and pre-allocated datafiles.  The number of datafiles would be a multiple of the degree of parallelism used for the compressed backup off the tablespace.  (See next point).  The pre-allocation avoids auto-extension delays during the reorg, and the size and number of datafiles ensures the load will be shared evenly between the parallel backup workers.
  • After the reorg, redo logging would be re-enabled for the objects in the new tablespaces, then a parallel RMAN backup of the new tablespaces would be taken with low compression (fastest in this environment).  This required less capacity and time than logging full redo during the re-org.  It made a recovery possible if needed between the reorg and the next full database backup.
  • The tables to be partitioned would be pre-created with temporary names, parallel degrees, and nologging attributes for the LOBS.
  • Data would be transferred with parallel direct inserts or alter table move commands.
  • Tracking of unrecoverable actions in the datafiles would be disabled using event 10359.  (the DB_UNRECOVERABLE_SCN_TRACKING parameter isn’t dynamic until 11.2.0.3).  This prevents the bottleneck which would form around access to the controlfile when relocating data in nologging mode.
  • Direct writes were increased to 1MB with event 10351, eg:
    alter system set events ’10359 trace name context forever,level 1:10351 trace name context forever, level 128′;
  • Indexes would be created in parallel using large manual work areas, nologging and non-sparse temp files.
  • Constraints would be created with ENABLE NOVALIDATE then altered to ENABLE VALIDATE to complete in seconds.  (A big time saving).

LOBs are Anchors

The biggest obstacle to moving the data quickly was that the largest tables used LOBs.  How ironic that the datatype invented for holding large amounts of data is also the most awkward to manage.

The large amount of LOB data negated any benefit from using datapump in parallel.  (See bug 5943346: PRODUCT ENHANCEMENT: PARALLELISM OF DATAPUMP JOB ON TABLE WITH LOB COLUMN).

Luckily, most of the LOB data was small enough to be stored in-row, (so it would be handled like other datatypes), but some was still stored in LOB segments.  This meant a lot of single block I/Os would be required during the reorg and that parallel inserts were limited to one PQ process per destination partition.  (No intra-partition parallel support for LOBs).

It would have been much worse without disabling the unrecoverable scn tracking mentioned above.  Enabling caching of the LOBs would allow some multi-block I/O, but it would have increased redo and undo generation, so would not be a useful option.

March 7, 2012

Sparse UFS Tempfiles vs Direct IO

In a previous post I found that direct and concurrent I/O was disabled when restoring a certain type of backup with RMAN on UFS.

A similar issue disables direct and concurrent I/O when writing to sparse tempfiles.  Although there isn’t much advice about it, this issue must be common because tempfiles on Solaris UFS are created sparse, (not all space is allocated).  Solaris won’t let synchronous direct writes happen to sparse files (or when extending files).  Long average wait times for direct path write temp are an indication of this problem.  Another is a high number of 8KB writes shown by iostat, while truss shows larger (eg 128KB or 1MB) writes.

For this post, I shall refer to files that are not sparse as ‘dense’ files.

One way to compare the performance difference yourself is:

  1. Create a new temporary tablespace on a forcedirectio UFS filesystem.
  2. Time the parallel creation of a very large index, (so that the temporary tablespace is written to concurrently).
  3. After this, the tempfile/s might still be sparse.  Check with the du command or use ls -ls which shows the number of allocated 512 byte sectors for data+metadata.  Compare this to the file’s length.  If there is any doubt, copy the tempfile and copy/rename it back again to convert it to a normal file.  (Or you can do this online by dropping the tempfile and adding the dense one back with the reuse clause).
  4. Restart the instance or flush the cache.
  5. Time the parallel creation of the large index again.  This time it should be quicker, with less time spent waiting for direct writes.

A dense tempfile can become sparse again as soon as it auto-extends.  Even when the sort write size = the temporary tablespace extent size = the auto-extension next size = 1MB, some operations (eg hash joins) may not touch all blocks in the new sort segment extents, leaving the file sparse.  Over time, the extents will be reused and all the blocks touched.

The OS uses bmap_has_holes() to compare the data+metadata to the length of the file.  Just one unallocated block is enough for the entire file to be treated as a sparse file.

The tempfiles will be treated as dense when all the blocks have been touched, or when the tempfile is resized down far enough to remove all of the unallocated blocks.  (Shrinking a used tempfile causes the existing sort segment to be completely shrunk, one block at a time, which can take a while).

If possible, pre-allocate sufficient temporary tablespace capacity and use an OS tool like cp to get rid of the holes in the tempfiles.  This will allow concurrent direct I/O when sorting, hash joining, using global temporary tables, etc.

If using auto-extension, set the next increment to match the tablespace extent size, (eg 1MB).  However, I have seen examples of files that have auto-extended when sparse, and even hit the auto-extension limit (cannot extend error) and remained sparse!  So, I recommend pre-allocation of dense tempfiles and not allowing auto-extension for temporary tablespaces.

January 8, 2012

December 21, 2011

Phantom FRA vs RMAN Duplicate

I tried my first 11gR2 RMAN duplicate without connecting the target or catalogue databases yesterday.  (Although this can affect other types of backup based 11g duplicate too).  I ran into a gotcha that I couldn’t find an Oracle MOS note to explain, so I thought I’d briefly describe it here.

It all seemed straight forward at first.  I had to use a time instead of my preference of an SCN as the recovery point, but other than that it all went well until the recovery phase when the duplication failed with:

ORA-19870: error while restoring backup piece /ora/DEST_SID/fra01/SOURCE_SID/backupset/2011_12_11/o1_mf_annnn_SUN_FULL_7g8v67md_.bkp
ORA-19809: limit exceeded for recovery files ORA-19804: cannot reclaim 100627456 bytes disk space from 429496729600 limit
ORA-06512: at "SYS.X$DBMS_BACKUP_RESTORE", line 2338
RMAN-10031: RPC Error: ORA-19583  occurred during call to DBMS_BACKUP_RESTORE.RESTOREBACKUPPIECE
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 12/20/2011 13:48:39
RMAN-05501: aborting duplication of target database

RMAN was trying to restore archive logs into the FRA, but stopped because it thought the FRA was full.  In reality, only 320GB was used out of 400GB.  (The location of the backup files could have been in a different filesystem and I’d still have the same problem).

V$recovery_file_dest showed RMAN thought there was 756MB used in the FRA.  This would have been the amount used in the source system when the backup was taken.  (Two level 0 backups and a week’s worth of level 1 incremental and archive log backups).

The destination test environment only needed to hold one backup, so the FRA was intentionally smaller than the source production database to avoid wasting disk space.

My first thought was to run an RMAN crosscheck, but this didn’t help.  I decided to use one lie to defeat another:

alter system set db_recovery_file_dest_size=1000g;

Then I continued the duplicate by setting the DB_NAME to the same as the source and then:

run {
   set UNTIL time '11-DEC-2011 08:49:59';
   recover clone database;
   alter clone database open resetlogs;
}

Finally, I created a new controlfile and changed the DBID with NID.

December 18, 2011

Hollow Redo Logs

Redo logs are switched before being full, which complicates checkpoint tuning and choosing the size of redo logs.

One of the causes is when space reserved by redo strand buffers is not used.  When one strand is full and cannot reserve more space in the redo log, then a switch is triggered.  When the other strand buffers have little in them, the reserved redo log space is wasted.

When there aren’t many concurrent transactions, only one public strand may be active.  (Related parameters: _log_parallelism_dynamic, _log_parallelism_max).

If flashback database is enabled, then private redo strand buffers may be allocated, but not used.  (The SGA is allocated before the instance is open, so the memory structures must be available in case flashback is disabled).  Guaranteed restore points also disable private redo strands.  In these cases, each redo log reserves space that will never be used.
To see the size of each public strand buffer, look at the first and largest sizes in X$KCRFSTRAND, eg:

select strand_size_kcrfa from x$kcrfstrand ;

For the memory reserved for private strand buffers, check this:

select bytes from v$sgastat where name = 'private strands';

Note that if the total redo buffer size is larger than each redo log, then the redo log is divided by the number of public strand buffers.  The result is that the redo logs may switch even before a public redo buffer is full.  Eg, 8 public redo strand buffers of 16MB each with 50MB redo log files could result in archive logs being only 6.25MB.

Refer to MOS note 1356604.1 and bugs 9272059 & 10354739.  (The bug has a patch to improve diagnostic information, and the diagnostic information will be part of 12.1).

Real Example

I recently investigated a performance problem on an Enterprise Edition instance with default values for the checkpoint related parameters.  The redo logs switched when only an eighth full (12.5MB out of 100MB).  The low number of redo blocks used in each log bypassed the incremental checkpoint threshold based on redo log size (see this post).  Whenever all of the redo logs filled in less than five minutes (the self-tune checkpoint threshold), then no incremental checkpoints were triggered and all sessions had to wait for a thread checkpoint to complete.  The thread checkpoint wrote all dirty buffers to disk, eliminating the checkpoint lag, so no matter how long this high rate of redo generation continued, the time based and volume based thresholds were never reached.  Every cycle of the redo log estate resulted in delays waiting for the lazy DBWR to do all its work at once.

The contributing factors were CPU_COUNT defaulting to 128 and a large SGA which resulting in the total public redo strand buffer size (140MB) being larger than each redo log.  Since the total redo buffer size was bigger than the redo logs, the redo log size was divided by the eight public redo strand buffers, giving the approximate size expected for the archive logs.  (Assuming a low concurrency of sessions changing data).  This database was in flashback mode and most of the changes were generated by a single session.

Possible workarounds/tunables:

  1. Larger and/or more redo logs
  2. Set CPU_COUNT to a lower number, such as the number of cores (rather than virtual CPUs).
  3. Set the LOG_BUFFER parameter to force a smaller size.
  4. Set fast_start_mttr_target (or log_checkpoint_interval in Std Ed) to a low value to trigger incremental checkpoints.
  5. Set _log_parallelism_dynamic to false to spread activity across the public strands, filling redo logs more and triggering size based incremental checkpoints.

Checkpoint Tuning and the Evolution of the Lazy Database Writer

“cannot allocate new log”, “checkpoint not complete”

“log file switch (checkpoint incomplete)”

The event “log file switch (checkpoint incomplete)” is a foreground wait; it delays one or more sessions.

More and/or larger redo log groups may improve performance when you see these messages/waits.  If you can’t or you don’t want to grow your redo log estate, then learning more about checkpointing and redo population might provide an alternative solution.

In this post I’ll briefly cover the basics, then introduce some rare information about checkpoints.

Checkpoint Basics

Delaying writing dirty blocks to disk can be more efficient because

  1. the same blocks may be changed more than once before being written, and
  2. the DBWR will sort blocks so that it can write them in order.  (The more time that passes, the more chance of contiguous blocks).

However, redo for dirty blocks can’t be overwritten, so if the redo logs fill up, changes in the database stop until the dirty blocks have been written to disk.

Checkpoint tuning is about finding the best rate of activity for the DBWR by calibrating thresholds for CKPT triggers.  Oracle has evolved to make the DBWR lazier.

Threshold / Trigger
Type Threshold Metric
V$Instance_Recovery Counter
Default Std Edition
log_checkpoint_timeout Incremental Temporal WRITES_LOG_CHECKPOINT_SETTINGS 1800 Yes
log_checkpoint_interval Incremental Volume WRITES_LOG_CHECKPOINT_SETTINGS unset Yes
fast_start_mttr_target Incremental Volume WRITES_MTTR unset No
Self / auto tune
_selftune_checkpoint_write_pct
_selftune_checkpointing_lag
Incremental Temporal WRITES_AUTOTUNE 3%
300sec
No (?)
Redo logsize
_target_rba_max_lag_percentage
Incremental Volume WRITES_LOGFILE_SIZE 81 (11.2)
90 (10.2)
Yes
Redo thread full / active Thread Volume WRITES_FULL_THREAD_CKPT n/a Yes

Temporal thresholds smooth total the peaks of write activity (DBWR spikes lag behind the LGWR spikes) but can’t handle sudden bursts of redo generation that fill all of the logs in a short amount of time.

Volume thresholds can be quicker to scale up DBWR activity in reaction to bursts of redo generation, but may increase I/O peaks and total writes (where the same data blocks are modified repeatedly).

Rare Information or Rash Assertion

While there is a lot written about this topic, I found that even Oracle’s documentation didn’t match what I have observed.  Here are my claims that are contradicted in Oracle documentation, forums and blogs:

  1. Thread checkpoints are not triggered by log switches, despite what the Oracle 11.2 manual says.
  2. Self-tune checkpoints lag behind the redo log tail by five minutes and writes are restricted to 3% of physical I/O.
  3. Checkpoints are triggered when they lag behind the redo log tail by 81% of [ the sum of the redo logs except the largest one ].  (90% in 10.2).

Log Switch Checkpoints

Most of Oracle’s documentation, (at the time of writing this), is very outdated, or at best misleading.

However, John Watson’s book “OCA Oracle Database 11g: Administration I Exam Guide” says:

There is no checkpoint following a log switch. This has been the case since release 8i, though to this day many DBAs do not realize this.

If one enables checkpoint beginning/completed messages in the alert log, then it will look like log switch checkpoints are occurring:

ALTER SYSTEM SET LOG_CHECKPOINTS_TO_ALERT=TRUE SCOPE=MEMORY;
ALTER SYSTEM SWITCH LOGFILE;
....
Beginning log switch checkpoint up to RBA ...

However, when testing in an 11.2 database with no other users, although I saw the message stating that a log switch checkpoint had started, nothing further happened.  The DBWR was not progressing the checkpoint by writing dirty blocks to disk.  Checkpoint activity only occurred when one of the other thresholds was reached.  (See table above).

My theory is that the “log switch checkpoints” are no longer real thread checkpoints, but just RBA markers.  They do not complete without something else triggering a real checkpoint.  When an incremental checkpoint takes the checkpoint position in the control file up to the log switch marker, then the checkpoint position is written to the datafile headers and a message is posted to the alert log:

Completed checkpoint up to RBA ....

This is part of the evolution of the DBWR, becoming lazier (more efficient), but it relies on incremental checkpoints.  If incremental checkpoint thresholds aren’t being reached for some reason, then the only thread checkpoints will happen when the redo logs are all full or active, resulting in delays and the checkpoint position in the datafiles lagging more than expected.

Self / Auto Tune Checkpoints

These incremental checkpoints are enabled when fast_start_mttr_target is unset or set to 1 or more.  If the checkpoint position lags behind the redo log tail by five minutes, then incremental checkpoints will start, subject to some vague load criteria.  All I can add is that the parameters below seem to be related:

Parameter                            Default Description
_disable_selftune_checkpointing      FALSE    Disable self-tune checkpointing
_selftune_checkpoint_write_pct       3        Percentage of total physical i/os for self-tune ckpt
_selftune_checkpointing_lag          300      Self-tune checkpointing lag the tail of the redo log

I’ve found that self-tune checkpoints don’t happen in Oracle 11.2 Express Edition.  I haven’t tested in Standard Edition yet, but I guess that they won’t happen either.

Log Size Checkpoints

One of the challenges with having worked with Oracle databases for so many years is the amount of “unlearning” that is necessary with each new version.  It doesn’t help that Oracle’s staff face this challenge too, when maintaining documentation.

Oracle’s documentation (eg 11.2 Reference manual and MOS Note 147468.1 Checkpoint Tuning and Troubleshooting Guide), claim that checkpoints are influenced by the size of the smallest redo log:

A checkpoint occurs at every log switch.  If a previous checkpoint is already in progress, the checkpoint forced by the log switch will override the current checkpoint. This necessitates well-sized redo logs to avoid unnecessary checkpoints as a result of frequent log switches.
The lag between the incremental checkpoint target and the log tail is also limited by 90% of the smallest online log file size. This makes sure that in most cases log switch would not need to wait for checkpoint.

Recovery I/O can also be limited by setting the LOG_CHECKPOINT_TIMEOUT parameter or by the size specified for the smallest redo log.

LOG_FILE_SIZE_REDO_BLKS    Maximum number of redo blocks required to guarantee that a log switch does not occur before the checkpoint completes.

I believe these are outdated statements.

An incremental checkpoint is triggered when the checkpoint position lags behind the redo log tail by V$INSTANCE_RECOVERY.LOG_FILE_SIZE_REDO_BLKS.  This mechanism has the objective to avoid waiting for thread checkpoints when all the redo logs are full or active, but it is no longer based on the size of the smallest redo log.  In the databases I’ve examined, the threshold is calculated by adding the size of all the redo logs except the largest one, then multiplying by 0.81 (in 11.2, or 0.9 in 10.2).  The parameter controlling the multiplier is _target_rba_max_lag_percentage.

This change in behaviour is consistent with the evolution of the lazy DBWR.

Unfortunately, some environments may fill as little as an eighth of each redo log, which circumvents this safety mechanism.  (See this post).

November 28, 2011

RMAN’s Section Size and UFS Don’t Mix

When you backup your Oracle database using RMAN to create compressed backup sets, it could take a long time if you don’t use multiple channels to take advantage of all of the server’s CPU power.

If the datafiles are vastly different in size, then the backup completion could be delayed while one channel finishes compressing one of the largest files.

A solution in 11g is to use RMAN’s section size option so that each large datafile could be compressed by multiple cores.  This feature was probably created to allow bigfile tablespaces in ASM to be backed up quickly, but it is also useful for speeding up backups on UFS.

Or is it?

When restoring from an RMAN backup that was created with the section size option, the Oracle server process opens each datafile with the O_DSYNC flag!  (Data writes must complete before signalling the server process).

This disables direct I/O (because RMAN is filling holes in a sparse file when restoring a datafile) and so a single writer lock is enforced.  Blocks are processed one at a time through the OS cache (longer code path), there are more and smaller writes, and channels working on each section of the same file have to wait for each other to obtain a file lock.

For the system I was working on, this increased the restore time by a factor of eight or nine.

I have created a service request and a bug (13566617) with Oracle, who have confirmed this is an issue on Solaris.
The SR notes say it can’t be replicated on Linux because the datafile is opened with the O_SYNC flag instead.  (I’ve asked them to review that statement because only the flag name is different.  I am not in a position to test performance on Linux, but it looks like the same situation).

November 27, 2011

UFS Fragmentation vs Oracle Datafiles

This post might interest you if your organisation runs Oracle databases on UFS.

What Does UFS File Fragmentation Mean?

The blocks allocated to a datafile are not contiguous in the filesystem.

Why Should You Care?

You may have configured your system to allow 1MB I/Os from database through to the OS and devices (direct I/O, maxphys, md_maxphys, maxcontig, _db_file_exec_read_count/db_file_multiblock_read_count, extent sizes, etc), so you want to make the most of this tuning.

If your files are fragmented, then the number of I/O requests sent to the storage devices will be higher than necessary for large sequential operations, such as full scans, RMAN backup/restore/duplications, tempfile I/O, direct inserts, etc.  When an Oracle database process executes a 1MB read/write, this can be broken into many smaller requests to the device if the file’s blocks aren’t contiguous in UFS.

Even if the device is a LUN presented by an array with cache and striping, the number of I/O operations per second (IOPS) could limit the throughput achieved.

On one system I tested, the average throughput for datafile reads during an RMAN backup was 270MB/s when fragmented, but about 780MB/s when defragmented.

Why Is There so Little Information About it?

My guess is that the lack of tools to easily detect and correct the problem of UFS fragmentation is the reason it isn’t a well known/documented Oracle database performance issue.

I started researching UFS fragmentation after noticing a dramatic reduction in the performance of a backup after a database was restored.  I used iostat to see thousands of small reads were happening each second, when truss showed 1MB preads.  I found a tool called filestat on the Solaris Internals site which gave me visibility of a file’s block allocation / fragmentation.

How Does it Happen?

Whenever multiple requests to allocate space occur concurrently for the same filesystem there is a chance that free blocks will be assigned to different files in an alternating fashion.  With asynchronous I/O, (simulated via LWP), sections of the same file may be allocated blocks out of order because the requests are received out of order.

The UFS allocation policies result in datafiles (or backup pieces) competing for space in the same cylinder group.

Scenarios to consider are:

  1. RMAN restore and duplication (production recovery, standby database creation, test database refreshes).
    For speed, backups are often done asynchronously, multiplexed and multi-channelled, resulting in many competing block allocation requests.
    The resulting database files will be heavily fragmented, affecting performance.
  2. RMAN backup pieces on UFS.
    Similar to scenario #1, except that there is less scope for a performance problem while reading from fragmented backup files.  (Write speeds or decompression during restores are more likely to be bottlenecks, and if backup files are transferred to other servers, or media then they may be defragmented in the process).
  3. Tablespace creation.
    If you are tempted to create two or more tablespaces concurrently to minimise creation time, you may want to reconsider.
  4. Datafile auto-extension.
    Some DBAs manage growth by leaving datafiles at 99.99% full and rely on frequent auto-extension.  Not only does this introduce risk and cause users to wait for capacity, if the allocation sizes are too small, then it will result in fragmented datafiles.
  5. Mixed filesystem contents.
    If many small files (eg trace and log files) are on the same filesystem as datafiles, then after time it would be reasonable to assume there will small fragments of free space and more concurrent allocation requests.  (Untested, because I don’t mix datafiles with other types of files).
  6. Datafile relocation.
    Not as common, but if you are going to transfer files from one filesystem/server to another, do so one at a time.

How Can You Prevent It?

Basically:  allocate space in large chunks and in multiples of your maximum I/O size (commonly 1MB), and don’t make concurrent requests for space in the same filesystem.

For the corresponding scenarios above:

  1. Don’t use multiplexing for disk backups.  (Filesperset=1 allows faster single file restoration anyway).
    Using one channel per filesystem is most effective, but this may not be acceptable for capacity management or restore performance reasons.  It is worth consideration when building standby databases or performance test environments.  For emergency production restorations, getting the database operational ASAP is usually the highest priority.  Analyzing fragmentation can be done later, and if performance is suffering, an arranged outage could be scheduled later to defragment the files.
    The parameters disk_asynch_io_backup_file_bufcnt (number of buffers for asynchronous writes) and _backup_file_bufsz (write buffer size) can be used to influence the degree of fragmentation during restores, but on the system I used for testing, any significant reduction in fragmentation was matched by a reduction in restore performance.  Note that the backup and restore should use the same buffer sizes or else the restore may fail.
  2. Similar to scenario #1, but _db_file_direct_io_count is used for write sizes.
  3. I found that setting disk_asynch_io to false didn’t reduce tablespace creation speed, but did reduce fragmentation.  Worthwhile to try when creating new databases on empty UFS filesystems.
    When using asynchronous I/O, it may help to have  _db_file_direct_io_count set to a multiple of your maximum I/O size.
  4. Set auto-extending datafiles’ initial and next sizes to a multiple of 16MB (or the value of maxbpg) which is the largest any section of a file can be.  (That is, the most blocks in a cylinder group that can be allocated to each file).  The file sections won’t be uniformly 16MB due to indirect blocks, an initial section size of 48kb, and previously used space, but we are aiming to eliminate needless fragmentation from many small allocation requests.
  5. Use separate filesystems.
  6. Recent versions of cp and mv use 8MB reads / writes, so they can be used efficiently on filesystems mounted with the forcedirectio option.  If all the files are moved from one filesystem to another (empty filesystem), then they are defragmented in the process.

Quick Specs

Quick specs of one the systems I used for testing:

  • OS:   Solaris 10
  • DB:  Oracle 11.2
  • CPU: 2x SPARC T2 (128 virtual CPUs)
  • RAM: 32GB
  • Storage: 2x FC SANs, 16 disks on each forming RAID 1+0 LUNs, which are then mirrored via SVM
  • UFS with forcedirectio, noatime, maxcontig<=>1MB and maxphys<=>1MB

October 28, 2011

Perpetual Update

A colleague was asked to run a data fix script in a production Oracle database:

update services s
set s.service_ref = (select si.service_reference
                    from SERVICES_INFO si
                    where si.SERVICE_ID = S.SERVICE_ID)
where s.service_ref is null;

After a few minutes it was still running and sessions blocked by the transaction were piling up.

This was the explain plan:

| Operation                                        |  Name                    |  Rows | Bytes|  Cost  | I/O|PQ Dist| Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------------
|  0 UPDATE STATEMENT                              |                          |       |      |  39885 |    |       |       |       |
|  1  UPDATE                                       |SERVICES                  |       |      |        |    |       |       |       |
|  2   TABLE ACCESS FULL                           |SERVICES                  |     6M|  128M|  39885 |    |       |       |       |
|  3   TABLE ACCESS BY INDEX ROWID                 |SERVICES_INFO             |     1 |   18 |      3 |    |       |       |       |
|  4    INDEX UNIQUE SCAN                          |SERVICES_INFO_PK          |     1 |      |      2 |    |       |       |       |
------------------------------------------------------------------------------------------------------------------------------------

From v$session_longops I could see that a full tablescan of SERVICES was taking about two minutes. However, the update kept running after it was estimated to complete.
V$session_longops showed that the tablescan of SERVICES had started again!
This pattern repeated, so we stopped the update to allow the blocked sessions to continue.

To get around the problem, we listed generated a list of rows where service_ref was null using a select statement, and then performed a quick update on just those rows, which completed immediately.

I believe that this update was never completing because it was restarting due to the inability to perform a consistent write. (Tom Kyte has written about this, so I’ll be brief).
I guessed that after the data fix update started, other sessions had updated at least one row where service_ref was null, and then committed. The data fix update session could not complete what it had started, because some of the target rows had changed already.  It had to rollback and try the update again in order to update a consistent set of rows.

A quick check of V$SQL provided more supporting evidence. There were twenty thousand recent update statements that set the value of every NOT NULL column in SERVICES for a row identified by SERVICE_ID.  It appears that the row is inserted with nulls and then the values updated later.  This explains why the data fix update session couldn’t scan through SERVICES before one of it’s targeted rows had been been changed and committed.

Next Page »

Theme: Rubric. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.