This post covers two RMAN issues on 11.2.0.2: the timing of control file backups for RMAN duplicate, and a bug that causes problems when RMAN scans directories to catalogue backup pieces (RMAN-06023, or “bad file number” during autobackups).

RMAN Duplicate & Control File Backup Timing

After upgrading to Oracle 11.2, RMAN duplication from a set of level 0 backup files became awkward.
In Oracle 10.2 we could simply copy the files from a level 0 backup to a test server and duplicate until the highest NEXT SCN in the backed up archive logs.
In 11.2, RMAN starts the duplicate by restoring a controlfile, but it wants one backed up with a checkpoint SCN equal to or lower than the “UNTIL SCN”, so the autobackup taken after the run {} block is of no use.  (If a suitable controlfile backup can’t be found, RMAN-06026 and RMAN-06024 are raised.)

Using “backup database including current controlfile plus archivelog…” with multiple channels may produce a controlfile backup that finishes before some of the datafile backups.  The duplicate will then fail, even if the RMAN repository is used!  (The repository knows about the datafile backups, so RMAN should be able to complete the duplicate.)

To work around this issue, I altered the script to use two backup commands, e.g.:

run
{
backup incremental level 0 database ;
backup current controlfile plus archivelog force delete all input;
}

The idea was to create a backup set that had a controlfile backup after the datafile backups complete, and before the last backed up archive log.  This set of backup files would then be sufficient for an 11.2 duplicate until scn.

Bad File Number RMAN Bug (11.2.0.2.3)

Unfortunately, there was a side effect.  The message below kept appearing during the autobackup:

Non critical error ORA-00001 caught while writing to trace file "/ora/diag/rdbms/dbuname/SID/trace/SID_ora_1.trc"
Error message: SVR4 Error: 9: Bad file number
Additional information: 1
Writing to the above trace file is disabled for now on...

ORA-00001 means ‘unique constraint violated’, so the wrong error code is being shown.  It is not a code I want to filter out of the alert log monitoring, which makes the message really annoying.  (The patch for bug 8367518 will remove the ORA-00001 in 11.2.0.3, but the OS error is still something that I want to be notified about, so the problem remains.)

I used truss to find the cause.  The Oracle code opens the trace file and writes the first lines into it.  After the autobackup, the FRA is scanned (to catalogue the files or manage capacity?).  During this scan, there are repeated erroneous close calls, which close the trace file before its final lines can be written.  Instead of the write succeeding, the error is returned (Err#9 EBADF).
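The pattern described above (a descriptor is closed, and a later call on the same descriptor fails with EBADF) can be searched for mechanically in a saved truss log. The following is only a rough sketch, assuming lines of the simple form `call(fd, ...) = ret` or `call(fd, ...) Err#9 EBADF`; real truss output may carry extra prefixes such as process or thread ids, and the function name `find_premature_closes` is made up for illustration.

```python
import re

# Assumed line shapes (hypothetical simplification of truss output):
#   open("/some/path", O_RDONLY) = 0
#   close(0) = 0
#   getdents(0, 0xFFFFFFFF7A304000, 8192) Err#9 EBADF
LINE_RE = re.compile(
    r'^(?P<call>\w+)\((?P<args>.*)\)\s+'
    r'(?:=\s*(?P<ret>-?\d+)|Err#(?P<errno>\d+)\s+(?P<errname>\w+))'
)


def find_premature_closes(lines):
    """Return (line_no, call, fd) for each EBADF on a descriptor closed earlier."""
    closed = set()
    hits = []
    for line_no, line in enumerate(lines, start=1):
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        call, args = m.group("call"), m.group("args")
        if call == "open" and m.group("ret") is not None:
            # A successful open returns a descriptor that is valid again.
            closed.discard(int(m.group("ret")))
            continue
        first_arg = args.split(",")[0].strip()
        if not first_arg.isdigit():
            continue  # first argument is not a plain descriptor (e.g. access)
        fd = int(first_arg)
        if m.group("errname") == "EBADF" and fd in closed:
            hits.append((line_no, call, fd))
        elif call == "close" and m.group("ret") == "0":
            closed.add(fd)
    return hits
```

Feeding it the lines of a truss log, for example `find_premature_closes(open("truss.out"))` with a hypothetical file name, lists the calls that hit a descriptor after it had already been closed.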

I tried to work around this bug by adjusting the script to have a single backup command inside the run {} block and one afterwards:

run
{
backup incremental level 0 database plus archivelog force delete all input;
}
backup archivelog all force delete all input;

Update:

This workaround seems to work most of the time, but the error is still reported sometimes.  So… I’ve changed the scripts to disable autobackups temporarily during the backup and to include the current controlfile in the explicit backup commands.  (The spfile still gets backed up with datafile 1).

I logged a service request with Oracle Support to see if we can turn off the tracing for autobackups and to register the bug.  They have referred me to bug 9315802 which says that there is no way to disable these unnecessary trace files.

Oracle have created bug 13609024 for the issue I discovered.

Worse File Number RMAN Bug

The problem above affects more than just trace files.

I found that this bug also affects our ability to duplicate databases.  When an RMAN duplicate is performed without a connection to the target or repository, by specifying a backup location, RMAN restores the controlfile and then clears it before scanning the backup location to catalogue the backup pieces.

This time it isn’t the trace file that is closed prematurely; it is the directory containing the backup pieces. Only the first 145 or so backup pieces per directory are found, because the directory is closed before all of its entries have been read, and so the duplicate fails with “RMAN-06023 no backup or copy of datafile 1 found to restore”.
If I move the backup pieces into directories with only a few backup pieces in each, then the catalog & duplicate will succeed.

Truss output showed these relevant lines:
Open directory as file handle 0

open("/dummypath/backupset/2012_03_04", O_RDONLY) = 0

Get the first directory entries

getdents(0, 0xFFFFFFFF7A304000, 8192) = 8168
access("/dummypath/backupset/2012_03_04/o1_mf_nnnd0_SUN_FULL_7o5qmoyh_.bkp", R_OK) = 0....

Erroneously closes the directory after first backup piece is processed. (144 more are cached).

close(0) = 0
access("/dummypath/backupset/2012_03_04/o1_mf_nnnd0_SUN_FULL_7o66ft2o_.bkp", R_OK) = 0
...

Erroneously closes file handle 0, but it is already closed.

close(0) Err#9 EBADF
.....

Try and fail to read more directory entries:

getdents(0, 0xFFFFFFFF7A304000, 8192) Err#9 EBADF

Then the next directory is processed, leaving 200 backup pieces un-catalogued!

This is also attributed to bug 13609024, which is with Oracle development.
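Until a fix is available, the workaround mentioned earlier (moving the backup pieces into directories that each hold only a few pieces) could be scripted on the test server. This is a minimal sketch, not something taken from the original scripts: the function name `split_into_subdirs`, the `part_NNN` subdirectory naming and the default limit of 100 files are all assumptions, with the limit simply chosen to stay below the 145 or so pieces that were found per directory. The `*.bkp` pattern is based on the piece names shown in the truss output.

```python
import shutil
from pathlib import Path


def split_into_subdirs(src_dir, max_per_dir=100, pattern="*.bkp"):
    """Move matching files into part_001, part_002, ... subdirectories.

    Each subdirectory receives at most max_per_dir files. Returns the
    list of new paths. Names and the default limit are illustrative only.
    """
    src = Path(src_dir)
    # Snapshot and sort the pieces first so the grouping is deterministic.
    pieces = sorted(p for p in src.glob(pattern) if p.is_file())
    moved = []
    for index, piece in enumerate(pieces):
        sub = src / f"part_{index // max_per_dir + 1:03d}"
        sub.mkdir(exist_ok=True)
        target = sub / piece.name
        shutil.move(str(piece), str(target))
        moved.append(target)
    return moved
```

This is meant for a copied set of backup pieces that is about to be duplicated from. Moving pieces whose original paths are recorded in a controlfile or repository would leave those recorded paths stale.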
