NZ DBA: Oracle Database Tuning & Tips

September 9, 2012

Diagnostic Directory Disaster

Have you thought about where your database takes a dump?

It is common to install the Oracle software on internal disks and place the database files on external storage.  The DIAGNOSTIC_DEST (and Automatic Diagnostic Repository) is stored under $ORACLE_BASE or $ORACLE_HOME by default, so without further consideration this ends up on local disk too.

Often there are only a couple of internal disks, which are mirrored.  This should be adequate for the OS and Oracle binaries, but not for operations with sustained high IOPS.

In some situations, many Oracle processes may write a lot of data to different files in the ADR concurrently, which may overwhelm a single internal disk.

I’ve seen an extreme example on a Solaris server with multiple test databases sharing a single (mirrored) internal disk and UFS filesystems for their ADR, Oracle binaries, swap space and OS  files.  When one highly parallel statement started hitting ORA-0600 errors, each parallel slave process started writing trace and core files to the ADR.  The server became very sluggish; many OS commands were affected.  A shutdown abort of the affected instance left behind the parallel slave processes, and even kill -9 wouldn’t terminate them.  The internal disks were swamped; the service times for I/Os were reaching 40 seconds while the disk heads thrashed back and forward!  This situation continued for 30 minutes before all the processes finished writing and finally terminated.

Enabling SQL tracing for an instance could also put pressure on the internal disk/s to cope.

If the ADR / DIAGNOSTIC_DEST was on SAN storage, (multiple spindles, large cache), then the spike in I/O activity would be better accommodated.

For Oracle clusters there is another reason to avoid internal/local disk for the ADR: it easier to manage RAC instances with logs (alert, listener, etc) on shared (and highly available) storage, to avoid having to log into every server.

The issue could be mitigated by setting a low value for MAX_DUMP_FILE_SIZE, but in production environments, you may not want to truncate trace files so that the entire trace file can be sent to Oracle Support without having to replicate the issue, (eg internal errors causing an instance crash or data corruption).

Pointing DIAGNOSTIC_DEST at a ZFS volume might mitigate the use of internal disks because of the sequential nature of the ZFS writes.

August 2, 2012

Solaris Cluster vs Network Speed

Filed under: Solaris Cluster,UNIX / Linux — Ari @ 1:56 pm
Tags:

While investigating rsync/ssh performance, I discovered that our Solaris Clusters had just over half the expected throughput on their gigabit NICs.

This was caused by the cluster/streams related module clhbsndr being automatically pushed to each public NIC via /etc/iu.ap .

Symptoms:

  • Round trip time via ping was higher than non-cluster servers on the same network
  • The throughput measured with iperf (and other tools) was around 400-600MBit/s instead of 940Mbit/s
  • The clhbsndr module is loaded for the public interfaces, eg
ifconfig e1000g1 modlist

0 arp
1 ip
2 clhbsndr
3 e1000g

Status

We have confirmed that bug 6625886 is the cause and are now waiting for Oracle to say if the module can safely be removed from the public interfaces of a Solaris 10 cluster, eg:

ifconfig e1000g1 modremove clhbsndr@2

Encryption vs Transfer Speed

Filed under: ssh — Ari @ 1:46 pm
Tags:

Do you want to make scp and rsync go faster?

LAN

One easy method to squeeze a little more speed out of file transfer tools using ssh by trying a different encryption method.

For example, we saved 15 seconds per gigabyte sent across our LAN on our SPARC Solaris servers, by using:

rsync --rsh="ssh -c arcfour" ...

or

scp -c arcfour source dest

WAN

If your problem is latency across a WAN, then HPN-SSH might be the solution.

Also, if you want secure authentication but don’t need the data encrypted, then you could use SSH with no encryption.  HPN-SSH allows the use of a ‘none” cipher.

For example, on a network with 32.5 ms RTT (ping) and Gigabit NICs, scp transfer rates increased from 1.35MB/s to around 100MB/s with HPN-SSH, 4MB buffers (TCP and SSH), and the “NONE CIPHER” enabled.

HPN_SSH Settings
  • HPNDisabled no
  • NoneEnabled=yes (optional)
Kernel Settings

All these were set to 4MB from their defaults:

  • tcp_cwnd_max
  • tcp_max_buf
  • tcp_recv_hiwat
  • tcp_xmit_hiwat

(The send and receive buffers could be set per route instead to differentiate between LAN and WAN connections).

Example of syntax (remove TcpRcvBufPoll=no on Linux and Windows):

/opt/openssh-5.9p1-hpn/bin/scp -oTcpRcvBufPoll=no -oNoneSwitch=yes -oNoneEnabled=yes sourcefile remoteserver:/destpath/destfile

If the NONE cipher is not appropriate, then HPN-SSH’s multi-threaded AES cipher might be the fastest solution for your server if it has multiple cores.  Alternatively, I’ve found arcfour used less CPU so it was better for several concurrent rsync and scp transfers, because it offered greater throughput on busy processors.

July 24, 2012

Datapump vs Application Upgrade

Filed under: Datapump — Ari @ 6:51 am
Tags: , , ,

If you have used Datapump to migrate, duplicate or re-organise a schema, then you may have inadvertently introduced an application upgrade time bomb.

A primary key index (or unique key index) that was created by an “add constraint” statement, will be dropped by an “alter table xxx drop primary key” statement.  If the developers rely on this “lazy” method, then the SQL in their application upgrades may fail if the index was created in another way.  (Eg Datapump import, or manual DBA maintenance).

For example, the table, index and primary key constraint (with user specified names rather than system generated names) are created by the application scripts like this:

create table TEST as select * from dual;
alter table TEST add constraint TEST_PK primary key (dummy);

Later, Datapump (11.2.0.2.3) is used to move the schema to a new database.  Datapump creates the tables, then the indexes, before adding the constraints separately.  (Except when reference partitioning is used).

The Datapump SQL is equivalent to:

create table TEST as select * from dual;
create index TEST_PK on TEST(dummy);
alter table TEST add constraint TEST_PK primary key (dummy);

This works fine until the next application upgrade, when a column is to be added to the primary key:

SQL> alter table TEST add (dummy2 varchar2(100) default 'a' not null);

Table altered.

SQL> alter table TEST drop primary key;

Table altered.

SQL> alter table TEST add constraint TEST_PK primary key (dummy,dummy2);
alter table TEST add constraint TEST_PK primary key (dummy,dummy2)
*
ERROR at line 1:
ORA-00955: name is already used by an existing object 

SQL> -- The application upgrade script failed
SQL> select index_name from user_indexes where table_name='TEST';

INDEX_NAME                                                                                                                                                              
----------                                                                                                                                          
TEST_PK

The index wasn’t dropped, but it can no longer support the primary key, so the “add constraint” statement can’t create an appropriately named index to enforce the new primary key.

Prevention

  1. Developers could create objects using commands similar to Datapump, ie run “create index” explicitly
  2. Use the full syntax when dropping constraints and intending the index to be dropped too, eg
    alter table TEST drop primary key drop index;
  3. Manually recreate affected indexes and constraints as required after a Datapump import.  (Not always a quick task).
  4. After a Datapump import, manually set the index properties to make it seem that the index was created by an “add constraint” statement.  (This involves updating ind$, so this may not be considered appropriate for production environments).

Ind$.Property

These are the index property flags for a Oracle 11.2:

 /* unique : 0x01 */
 /* partitioned : 0x02 */
 /* reverse : 0x04 */
 /* compressed : 0x08 */
 /* functional : 0x10 */
 /* temporary table index: 0x20 */
 /* session-specific temporary table index: 0x40 */
 /* index on embedded adt: 0x80 */
 /* user said to check max length at runtime: 0x0100 */
 /* domain index on IOT: 0x0200 */
 /* join index : 0x0400 */
 /* system managed domain index : 0x0800 */
 /* The index was created by a constraint : 0x1000 */
 /* The index was created by create MV : 0x2000 */
 /* composite domain index : 0x8000 */

An “alter table xxx drop primary key” statement will drop the index too if the bits for 0×1001 are set (4097 in decimal; meaning ‘unique and created by a constraint’).  A Datapump import will lose the 0×1000 flag, so a DBA could generate statements to reinstate it by running this in the source database:

-- For Oracle 11.2
set linesize 300
spool ind_property_hack.sql
select 'update sys.ind$ set property=property+4096 
 where bitand(property,4096)=0
 and obj#= (
  select object_id
  from dba_objects do2
  where owner='''||do.owner||'''
  and object_name='''||do.object_name||''');'
from sys.ind$ i, dba_objects do
where bitand(property,4096)=4096
and i.obj# = do.object_id
and i.dataobj# = do.data_object_id
and do.owner not in (select distinct user_name from SYS.default_pwd$)
order by 1
/
spool off

June 6, 2012

Danger: Sparse Datafiles Ahead

You may have read my advice not to use sparse tempfiles on Solaris UFS, but did you know that datafiles can be sparse too?

Direct and concurrent I/O is very important for Oracle database performance, so there can be a dramatic impact when database operations switch to buffered and blocking I/O.  (Eg RMAN section size bug).

On Solaris, concurrent direct I/O is disabled to UFS files that are sparse (not all blocks have been allocated in the filesystem).
All direct I/O is disabled when the files are opened with the O_DSYNC flag and the files are sparse.

When datafiles are resized, (either manually or by auto-extension), Oracle grows the file using fcntl to seek past the last allocated block of the file.  At this point, the datafile is considered sparse; it has holes because not all the blocks have been allocated.  The process resizing the datafile will now start filling that hole, and can do so with direct and asynchronous I/O.  Until it has completed the task, other processes’ I/O from/to that datafile is limited by blocking file locks (aka POSIX/inode/single writer locks).  Processes opening the datafile with the O_DSYNC flag are also forced to use OS buffering (large I/Os are broken into single block I/Os).

This situation may not cause significant problems for quiet databases, but for a database that has many concurrent sessions that access the most recent data, (eg right hand indexes and time-based partitions), it can cause grid lock.  For small auto-extension increases, the symptoms may just show as frequent blips.  At the other end of the scale, resizing a 1GB datafile to 20GB would block all sessions wanting to read a single block from that datafile and bring multiple database writers to a halt until the resizing process finishes writing 19GB to the datafile (so that it has 20GB allocated in the filesystem).

The symptoms are varied, but you might see one session waiting on Data file init write while others get stuck on a single db file sequential read, DBWRs on db file parallel write, etc.

Consequent peaks of concurrent activity (dam bursting effect) can have lasting effects, such as index bloating from ITL growth and an increase in 50/50 block splits in what would otherwise be right hand indexes.  In extreme examples the application may experience errors when trying to connect due to hitting the maximum number of processes or concurrent listener/TCP connections.

To avoid this issue, simply pre-allocate space; create datafiles at the size at which you would expect them to eventually grow if left to auto-extend.

Don’t rely on auto-extension to drip feed capacity.  (An auto-extend next size of 8KB or 16KB would also avoid this problem, but could increase overhead and UFS fragmentation).

Be cautious when resizing a used datafile by a large amount in a busy database.

May 31, 2012

Read Only Users Can Lock Tables

Filed under: Oracle Database,Security — Ari @ 5:20 am
Tags: , , ,

Can a user with only select privileges block transactions?  Yes!

Although this topic has been covered on the interweb already, I thought it was worth a mention because this fact had evaded me (and a team of experienced DBAs) until recently.

A session using an application support account, with resource limits and a “READ_ONLY” role, was found to be blocking another session’s transaction.  The user had selected a row with a popular GUI tool then accidentally clicked on a field in that row, which resulted in a select for update.

I was shocked, so I tried a simple experiment with the lock table command:

17:09:17 SYS@MYDB SQL> create user RO_USER identified by RO_USER;
User created.
17:09:56 SYS@MYDB SQL> grant create session to RO_USER;
Grant succeeded.
17:10:03 SYS@MYDB SQL> grant select on APP.key_table to RO_USER;
Grant succeeded.
17:10:12 SYS@MYDB SQL> conn RO_USER/RO_USER
Connected.
17:10:24 RO_USER@MYDB SQL> lock table APP.key_table in exclusive mode;
Table(s) Locked.

Oracle has a few bugs logged for this, including the enhancement request Bug 6823286.
Hopefully they will fix this one day soon, so that privileges are checked before allowing lock table and select for update commands.

May 3, 2012

Kaio oww

The issue I’m writing about today remains a bit of a mystery.

My monitoring scripts had picked up that a LGWR process had started using 100% CPU and other jobs were running for longer than usual.

The 11gR2 RAC database was on Solaris and used ASM.  Only one instance of the two node cluster was affected. (Although the other did suffer from more GCS related waits).

The most prominent wait was ‘log file sync’, but a comparison of AWR data showed that at 10pm, many I/O operations suddenly slowed down.

Slower I/Os

Higher flashback log writes, log file parallel write, db file scattered read and db file sequential read.

Not Slower

RMAN backups, direct path reads.
Unfortunately I didn’t have sufficient privileges to diagnose the problem.  A kind UNIX admin ran truss on the LGWR process for me.

syscall               seconds   calls  errors
 times                    .000      10
 semctl                   .000      30
 semop                    .000       3
 semtimedop               .000       2
 pread                    .001      14
 kaio                    6.043      80
 kaio                    4.581      78
 pollsys                  .000       9
 recvmsg                  .000      18       8
 sendmsg                  .001      15
 --------               ------    ----
 sys totals:            10.628     259       8
 usr time:                .022
 elapsed:               11.030

So KAIO system calls were taking all the CPU time. Lets see the detail.
Before the instance was restarted:

 /1:     0.0897    kaio(AIOWRITE, 258, 0x380DEF600, 28160, 0x501DFE007A072930) = 0
 /1:     0.0893    kaio(AIOWRITE, 256, 0x380DEF600, 28160, 0x0D2DFE007A074DC0) = 0
 /1:     0.0893    kaio(AIOWRITE, 261, 0x380DEF600, 28160, 0x0A0DFE007A074B50) = 0
 /1:     0.0005    sendmsg(17, 0xFFFFFFFF7FFFB660, 0)        = 224
 /1:     0.0706    kaio(AIOWAIT, 0xFFFFFFFF7FFF8ED0)        = -2247677648
 /1:     0.0708    kaio(AIOWAIT, 0xFFFFFFFF7FFF8ED0)        = -2247668288
 /1:     0.0706    kaio(AIOWAIT, 0xFFFFFFFF7FFF8ED0)        = -2247668912

After the instance was restarted the KAIO calls were back to normal:

 /1:     0.0006    kaio(AIOWRITE, 262, 0x380A7A600, 20480, 0xE2C382007A072BA0) = 0
 /1:     0.0002    kaio(AIOWRITE, 258, 0x380A7A600, 20480, 0x5D2382007A0732F0) = 0
 /1:     0.0001    kaio(AIOWRITE, 256, 0x380A7A600, 20480, 0x13A382007A073080) = 0
 /1:     0.0003    sendmsg(17, 0xFFFFFFFF7FFFB670, 0)        = 224
 /1:     0.0005    kaio(AIOWAIT, 0xFFFFFFFF7FFF8EE0)        = -2247677024
 /1:     0.0001    kaio(AIOWAIT, 0xFFFFFFFF7FFF8EE0)        = -2247675152
 /1:     0.0001    kaio(AIOWAIT, 0xFFFFFFFF7FFF8EE0)        = -2247675776

I wish I had truss information of the sessions performing fast direct path reads to see why they weren’t affected, because I expect them to use KAIO too. The scattered and sequential reads wouldn’t have used KAIO, but were slower!

I managed to find a match: MOS Bug 13356146 “PERIODIC SLOW DOWN IN SYSTEM”, but this was closed due to insufficient data.

So… unless I see it again, it will remain a mystery.

April 22, 2012

Modern Age Database Reorganisation vs LOBs

Filed under: LOBs,Oracle Database — Ari @ 7:17 am
Tags: ,

Database re-organisations exist only as fond and distant memories of old Oracle DBAs.  Well… almost.  I was asked to purge, partition and reclaim space from audit tables that had grown to be 100s of gigabytes each.  The real problem was that they contained LOBs (BasicFiles, not SecureFiles).  This post lists a few of the tweaks and tricks I used to bring the ancient ritual of the reorg into the modern age (11gR2).

In a nutshell, we wanted to move a schema from its current tablespace into a new one, leaving behind any data older than 12 months, and partitioning the largest objects at the same time.

  • The UFS filesystems, Solaris OS and Oracle database were configured to allow concurrent direct 1MB I/Os (most efficient for our environment).
  • The destination tablespaces were new and empty, so they could be configured without flashback or redo logging.  This dramatically reduced overhead when copying data into the tablespace, while allowing us to use a guaranteed restore point to flashback the rest of the database if the reorg failed.  (The new tablespaces would have to be dropped before flashing back the database, but they were only populated during the reorg exercise, and the old tablespaces would be retained until after the reorg had completed, so it doesn’t matter that we couldn’t recover the new tablespaces).
  • The destination tablespaces was created with evenly sized and pre-allocated datafiles.  The number of datafiles was a multiple of the degree of parallelism used for the compressed backup off the tablespace.  (See next point).  The pre-allocation avoids auto-extension delays during the reorg, and the size and number of datafiles ensures the load will be shared evenly between the parallel backup workers.
  • After the reorg, redo logging was re-enabled for the objects in the new tablespaces, then a parallel RMAN backup of the new tablespaces was taken with low compression (fastest in this environment).  This required less capacity and time than logging full redo during the re-org.  It made a recovery possible if needed between the reorg and the next full database backup.
  • The tables to be partitioned were pre-created with temporary names, parallel degrees, and nologging attributes for the LOBS.
  • Data was transferred with parallel direct inserts or alter table move commands.
  • Tracking of unrecoverable actions in the datafiles was disabled using event 10359.  (the DB_UNRECOVERABLE_SCN_TRACKING parameter isn’t dynamic until 11.2.0.3).  This prevents the bottleneck which would form around access to the controlfile when relocating data in nologging mode.
  • Direct writes were increased to 1MB with event 10351, eg:
    alter system set events ’10359 trace name context forever,level 1:10351 trace name context forever, level 128′;
  • Indexes were created in parallel using large manual work areas, nologging and non-sparse temp files.
  • Constraints were created with ENABLE NOVALIDATE then altered to ENABLE VALIDATE to complete in seconds.  (A big time saving).

LOBs are Anchors

The biggest obstacle to moving the data quickly was that the largest tables used LOBs (basicfiles).  How ironic that the datatype invented for holding large amounts of data is also the most awkward to manage.

The large amount of LOB data negated any benefit from using datapump in parallel.  (See bug 5943346: PRODUCT ENHANCEMENT: PARALLELISM OF DATAPUMP JOB ON TABLE WITH LOB COLUMN).

Luckily, most of the LOB data was small enough to be stored in-row, (so it would be handled like other datatypes), but some was still stored in LOB segments.  Having data in LOB segments meant a lot of single block I/Os would be required during the reorg and that parallel inserts would be limited to one PQ process per destination partition.  (No intra-partition parallel support for LOBs).

It would have been much worse without disabling the unrecoverable scn tracking mentioned above.  Enabling caching of the LOBs would allow some multi-block I/O, but it would have increased redo and undo generation, so would not be a useful option.

March 7, 2012

Sparse UFS Tempfiles vs Direct IO

In a previous post I found that direct and concurrent I/O was disabled when restoring a certain type of backup with RMAN on UFS.

A similar issue disables direct and concurrent I/O when writing to sparse tempfiles.  I haven’t seen any advice on this specific issue, but it must be common because tempfiles on Solaris UFS are created sparse, (not all space is allocated; the file has “holes” in it).  Solaris won’t allow direct I/O for synchronous writes to sparse UFS files.  Long average wait times for direct path write temp are an indication of this problem.  Another is a high number of 8KB writes shown by iostat, while truss shows larger (eg 128KB or 1MB) writes.

For this post, I shall refer to files that are not sparse as ‘dense’ files.

One way to compare the performance difference yourself is:

  1. Create a new temporary tablespace on a forcedirectio UFS filesystem.
  2. Time the parallel creation of a very large index, (so that the temporary tablespace is written to concurrently).
  3. After this, the tempfile/s might still be sparse.  Check with the du command or use ls -ls which shows the number of allocated 512 byte sectors for data+metadata.  Compare this to the file’s length.  If there is any doubt, copy the tempfile and copy/rename it back again to convert it to a normal file.  (Or you can do this online by dropping the tempfile and adding the dense one back with the reuse clause).
  4. Restart the instance or flush the cache.
  5. Time the parallel creation of the large index again.  This time it should be quicker, with less time spent waiting for direct writes.

A dense tempfile can become sparse again as soon as it auto-extends.  Even when the sort write size = the temporary tablespace extent size = the auto-extension next size = 1MB, some operations (eg hash joins) may not touch all blocks in the new sort segment extents, leaving the file sparse.  Over time, the extents will be reused and all the blocks touched.

The OS uses bmap_has_holes() to compare the data+metadata to the length of the file.  Just one unallocated block is enough for the entire file to be treated as a sparse file.

The tempfiles will be treated as dense when all the blocks have been touched, or when the tempfile is resized down far enough to remove all of the unallocated blocks.  (Shrinking a used tempfile causes the existing sort segment to be completely shrunk, one block at a time, which can take a while).

If possible, pre-allocate sufficient temporary tablespace capacity and use an OS tool like cp to get rid of the holes in the tempfiles.  This will allow concurrent direct I/O when sorting, hash joining, using global temporary tables, etc.

If using auto-extension, set the next increment to match the tablespace extent size, (eg 1MB).  However, I have seen examples of files that were sparse when they auto-extended, and even hit the auto-extension limit (cannot extend error) and remained sparse!  So, I recommend pre-allocation of dense tempfiles and not allowing auto-extension for temporary tablespaces.

Why use synchronous writes to tempfiles?

I assume that Oracle has chosen to open tempfiles with the O_DSYNC flag because many processes could be using the tempfiles, and the alternative could result in lots of fdsync calls.  But do we need to know if any writes to tempfiles have made it to disk?  If the  server lost power with dirty tempfile blocks in the cache, it wouldn’t affect the integrity of the database.  I can’t explain why Oracle have chosen to use O_DSYNC for tempfiles.

Even without the O_DSYNC flag, a single write lock would be enforced while the tempfiles are sparse.

January 8, 2012

Next Page »

Theme: Rubric. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.