This post might interest you if your organisation has Oracle databases running on UFS.
What Does UFS File Fragmentation Mean?
The blocks allocated to a datafile are not contiguous in the filesystem.
Why Should You Care?
You may have configured your system to allow 1MB I/Os from the database through to the OS and devices (direct I/O, maxphys, md_maxphys, maxcontig, _db_file_exec_read_count/db_file_multiblock_read_count, extent sizes, etc.), so you want to make the most of that tuning.
If your files are fragmented, then the number of I/O requests sent to the storage devices will be higher than necessary for large sequential operations such as full scans, RMAN backup/restore/duplication, tempfile I/O, and direct inserts. When an Oracle database process executes a 1MB read or write, it can be broken into many smaller requests to the device if the file’s blocks aren’t contiguous in UFS.
Even if the device is a LUN presented by an array with cache and striping, the number of I/O operations per second (IOPS) could limit the throughput achieved.
On one system I tested, the average throughput for datafile reads during an RMAN backup was 270MB/s when fragmented, but about 780MB/s when defragmented.
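The arithmetic behind this can be sketched as follows. The 8KB fragment size and the IOPS figure are assumptions for illustration, not measurements from the system above:

```python
# Back-of-the-envelope arithmetic (illustrative assumptions, not
# measurements): why a per-device IOPS limit caps throughput when a
# 1MB read is split into small pieces by a fragmented file layout.

MB = 1024 * 1024

def requests_per_read(read_size, contiguous_run):
    """Device requests needed for one logical read when the file is
    laid out in contiguous runs of `contiguous_run` bytes."""
    return -(-read_size // contiguous_run)  # ceiling division

print(requests_per_read(1 * MB, 1 * MB))    # contiguous file: 1 request
print(requests_per_read(1 * MB, 8 * 1024))  # 8KB fragments: 128 requests

# If the device sustains roughly 1,000 IOPS (assumed), the throughput
# ceiling depends entirely on the request size:
iops = 1_000
print(iops * 1 * MB // MB, "MB/s with 1MB requests")   # 1000 MB/s
print(iops * 8 * 1024 // MB, "MB/s with 8KB requests") # 7 MB/s
```

The same logical read rate from Oracle therefore generates over a hundred times the device requests once fragmentation breaks the layout into small runs, which matches the order-of-magnitude throughput drop observed above.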
Why Is There So Little Information About It?
My guess is that the lack of tools to easily detect and correct UFS fragmentation is why it isn’t a well-known or well-documented Oracle database performance issue.
I started researching UFS fragmentation after noticing a dramatic reduction in backup performance after a database was restored. iostat showed thousands of small reads happening each second, while truss showed 1MB preads being issued. I found a tool called filestat on the Solaris Internals site which gave me visibility of a file’s block allocation and fragmentation. (It is also available on the Performance Tools CD.)
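Once you can see a file’s block allocation, quantifying fragmentation is simple counting. The sketch below assumes the tool’s output has already been reduced to a list of filesystem block numbers in file-offset order; filestat’s actual output format would need to be parsed into such a list first:

```python
# Hedged sketch: given the filesystem block numbers allocated to a file
# (in file-offset order), count how many contiguous runs ("extents")
# the file occupies. The input format is an assumption -- a real
# fragmentation report would need parsing into this shape first.

def count_extents(blocks):
    """Return the number of contiguous runs in a list of block numbers."""
    if not blocks:
        return 0
    extents = 1
    for prev, cur in zip(blocks, blocks[1:]):
        if cur != prev + 1:     # a gap starts a new run
            extents += 1
    return extents

# A contiguous file is a single extent:
print(count_extents([100, 101, 102, 103]))   # 1
# Two files that interleaved during allocation: every block is its own run:
print(count_extents([100, 102, 104, 106]))   # 4
```

A contiguous file large enough for 1MB I/O shows few extents; a heavily fragmented one shows an extent count approaching its block count.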
How Does it Happen?
Whenever multiple requests to allocate space occur concurrently on the same filesystem, there is a chance that free blocks will be assigned to different files in an alternating fashion. With asynchronous I/O (simulated via LWPs on Solaris), sections of the same file may be allocated blocks out of order because the requests arrive out of order.
The UFS allocation policies result in datafiles (or backup pieces) competing for space in the same cylinder group.
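A toy model of this allocation race, under the assumption that free blocks are handed out in ascending order as requests arrive:

```python
# Toy model of the allocation race: two files growing concurrently in
# the same cylinder group take free blocks in an alternating fashion,
# so neither file ends up contiguous. Purely illustrative -- real UFS
# allocation policy is more involved.

from itertools import count

free_blocks = count(0)      # an endless supply of ascending free blocks
file_a, file_b = [], []

for _ in range(4):          # each pass = one allocation request per file
    file_a.append(next(free_blocks))
    file_b.append(next(free_blocks))

print(file_a)   # [0, 2, 4, 6] -- every other block
print(file_b)   # [1, 3, 5, 7]
```

Neither file gets two adjacent blocks, so every block later becomes a separate device request for large sequential I/O against either file.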
Scenarios to consider are:
1. RMAN restore and duplication (production recovery, standby database creation, test database refreshes).
For speed, backups are often done asynchronously, multiplexed and multi-channelled, resulting in many competing block allocation requests.
The resulting database files will be heavily fragmented, affecting performance.
2. RMAN backup pieces on UFS.
Similar to scenario #1, except that there is less scope for a performance problem when reading from fragmented backup files. (Write speeds or decompression during restores are more likely to be bottlenecks, and if backup files are transferred to other servers or media, they may be defragmented in the process.)
3. Tablespace creation.
If you are tempted to create two or more tablespaces concurrently to minimise creation time, you may want to reconsider.
4. Datafile auto-extension.
Some DBAs manage growth by leaving datafiles at 99.99% full and relying on frequent auto-extension. Not only does this introduce risk, reduce concurrency and cause users to wait for capacity, but if the allocation sizes are too small, auto-extension will result in fragmented datafiles.
5. Mixed filesystem contents.
If many small files (e.g. trace and log files) share a filesystem with datafiles, then over time it is reasonable to expect small fragments of free space and more concurrent allocation requests. (Untested, because I don’t mix datafiles with other types of files.)
6. Datafile relocation.
Not as common, but if you are going to transfer files from one filesystem/server to another, do so one at a time.
7. Sparse tempfiles.
Space will be allocated as the blocks are used, so concurrent tempfile I/O produces competing allocation requests.
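To put a number on the auto-extension scenario: the arithmetic below contrasts growing a datafile via many small NEXT extensions against fewer large ones. The growth amount and extension sizes are illustrative assumptions:

```python
# Illustration for the auto-extension scenario: the same 10GB of
# datafile growth generates very different numbers of allocation
# requests depending on the NEXT size. Sizes are assumptions chosen
# for illustration.

MB = 1024 * 1024
growth = 10 * 1024 * MB     # 10GB of total growth

for next_size in (1 * MB, 16 * MB):
    requests = growth // next_size
    print(f"NEXT={next_size // MB}MB -> {requests} allocation requests")
```

Each allocation request is another opportunity to interleave with concurrent allocations on the same filesystem, so the 1MB NEXT size gives fragmentation sixteen times as many chances to occur.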
How Can You Prevent It?
Basically: allocate space in large chunks and in multiples of your maximum I/O size (commonly 1MB), and don’t make concurrent requests for space in the same filesystem.
For the corresponding scenarios above:
1. Don’t use multiplexing for disk backups. (filesperset=1 allows faster single-file restoration anyway.)
Using one channel per filesystem is most effective, but this may not be acceptable for capacity management or restore performance reasons. It is worth considering when building standby databases or performance test environments. For emergency production restorations, getting the database operational ASAP is usually the highest priority; fragmentation can be analysed later, and if performance is suffering, an outage can be arranged to defragment the files.
The parameters disk_asynch_io, _backup_file_bufcnt (the number of buffers for asynchronous writes) and _backup_file_bufsz (the write buffer size) can be used to influence the degree of fragmentation during restores, but on the system I used for testing, any significant reduction in fragmentation was matched by a reduction in restore performance. Note that the backup and restore should use the same buffer sizes or the restore may fail.
2. Similar to scenario #1, except that _db_file_direct_io_count is used for the write sizes.
3. I found that setting disk_asynch_io to false didn’t reduce tablespace creation speed, but did reduce fragmentation. Worth trying when creating new databases on empty UFS filesystems.
When using asynchronous I/O, it may help to have _db_file_direct_io_count set to a multiple of your maximum I/O size.
4. Set auto-extending datafiles’ initial and next sizes to a multiple of 16MB (or the value of maxbpg), which is the largest any section of a file can be (that is, the most blocks in a cylinder group that can be allocated to each file). The file sections won’t be uniformly 16MB due to indirect blocks, an initial section size of 48KB, and previously used space, but the aim is to eliminate needless fragmentation from many small allocation requests.
5. Use separate filesystems.
6. Recent versions of cp and mv use 8MB reads/writes, so they can be used efficiently on filesystems mounted with the forcedirectio option. If all the files are moved from one filesystem to another (empty) filesystem, they are defragmented in the process.
7. Don’t use sparse tempfiles.
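The sizing advice for auto-extending datafiles can be captured in a small helper. The 16MB chunk is taken from the maxbpg-derived limit discussed above; treat it as an assumption to be verified for your own filesystem:

```python
# Helper sketch for the sizing advice above: round a requested datafile
# size up to a multiple of the cylinder-group allocation limit (16MB
# here, per the maxbpg discussion above -- verify the figure for your
# own filesystem before relying on it).

MB = 1024 * 1024
CHUNK = 16 * MB

def round_up(size, chunk=CHUNK):
    """Round `size` bytes up to the next multiple of `chunk`."""
    return -(-size // chunk) * chunk   # ceiling division, then scale

print(round_up(100 * MB) // MB)   # 112 -- next multiple of 16MB
print(round_up(96 * MB) // MB)    # 96  -- already aligned
```

Applying the same rounding to INITIAL and NEXT sizes keeps each allocation request a whole number of maximum-size file sections.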
Quick specs of one of the systems I used for testing:
- OS: Solaris 10
- DB: Oracle 11.2
- CPU: 2x SPARC T2 (128 virtual CPUs)
- RAM: 32GB
- Storage: 2x FC SANs, 16 disks on each forming RAID 1+0 LUNs, which are then mirrored via SVM
- UFS with forcedirectio, noatime, and maxcontig and maxphys each set to the equivalent of 1MB