In the old days (i.e. the early ’90s, when I was still a student), the most popular tuning tool was the database buffer cache hit ratio.  All tuning advice seemed to drag up the same old story.  It took a long time for the masses to move away from this metric and to use response time breakdowns (wait events and CPU time) as their main source of performance information.  Oracle’s recent tuning tools force that approach upon any stragglers.

However, such is the frustration of the early adopters that the backlash against the buffer cache hit rate has gone too far, with some DBAs decrying it as useless.  It now seems popular to dismiss and scoff at any mention of a buffer cache hit rate.  That is not a position I agree with.  Sure, some SQL runs more efficiently with full scans or direct I/O, so hit rates can be low for good plans.  Sure, with mixed workloads there is no rule of thumb for the optimal hit rate for the whole database.

But… the total system buffer cache hit rate percentage can be used to detect changes in database workload, and can be drilled down to the session, SQL, segment and period levels.  It’s not the only tool in the box, and not the first one to use, but it certainly helps to build a model of a database’s workload.

There are a couple of gotchas:

  1. Calculate it correctly.  Not too difficult in recent versions, but some people still forget to exclude direct I/O (e.g. parallel processing).
  2. Some of the database metrics used to calculate the hit rate wrap around, leading to disturbing fluctuations in the resulting percentage score.
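On the first point, one way to calculate a direct-I/O-aware hit rate from v$sysstat looks something like this (a sketch, using the statistic names as they appear in recent versions):

```sql
-- Buffer cache hit rate from v$sysstat, excluding direct path reads,
-- which bypass the buffer cache and should not count as misses.
SELECT ROUND(100 * (1 - (pr.value - prd.value - prdl.value)
                        / NULLIF(dbg.value + cg.value, 0)), 2) AS hit_rate_pct
FROM   v$sysstat pr, v$sysstat prd, v$sysstat prdl,
       v$sysstat dbg, v$sysstat cg
WHERE  pr.name   = 'physical reads'
AND    prd.name  = 'physical reads direct'
AND    prdl.name = 'physical reads direct (lob)'
AND    dbg.name  = 'db block gets'
AND    cg.name   = 'consistent gets';
```

Leave out the two “direct” subtractions and you get the naive figure that punishes parallel query and direct path loads.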

In this post I’m going to give a couple of recent examples of when total hit rates didn’t matter and did matter.  Both times, drilling down made all the difference.

Case 1

First up, I was asked whether a DB buffer cache should be increased by 50% because the monitoring software was alerting due to an 86% hit rate.
The SGA advisory was supplied:

SGA Target SGA Size   Est DB      Est DB   Est Physical
  Size (M)   Factor Time (s) Time Factor          Reads
---------- -------- -------- ----------- --------------
       500       .3   32,140         2.3     63,228,627
     1,000       .5   14,091         1.0      3,327,207
     1,500       .8   13,940         1.0      2,822,972
     2,000      1.0   13,890         1.0      2,656,665  (CURRENT)
     2,500      1.3   13,858         1.0      2,551,195
     3,000      1.5   13,851         1.0      2,528,082  (PROPOSED)
     3,500      1.8   13,851         1.0      2,528,082
     4,000      2.0   13,863         1.0      2,528,082

This indicated to me that the DB buffer cache did not need to be increased, and that doing so would yield no significant benefit.

I thought spending a few more minutes to gather more evidence would make my assertion more convincing.

First I checked the overall hit rate myself using v$buffer_pool_statistics and calculated 95%.  Perhaps the monitoring tool was using v$sysstat and including direct I/O in its figures (i.e. outdated SQL).
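The per-pool check is along these lines (a sketch; direct path reads bypass the buffer cache, so this view excludes them without any extra arithmetic):

```sql
SELECT name,
       ROUND(100 * (1 - physical_reads /
             NULLIF(db_block_gets + consistent_gets, 0)), 2) AS hit_rate_pct
FROM   v$buffer_pool_statistics;
```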

Next I looked through V$SQL for statements with a lot of physical I/O and low hit rates.  The top ones were related to statistics gathering and the monitoring tool itself.  To get a broader view, I generated a STATSPACK report spanning 24 hourly snapshots and saw that 85% of disk reads during that period came from statistics gathering, with the second biggest contributor, at 2%, being the monitoring software itself.
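Finding the heavy hitters is a query along these lines (a sketch — the sql_id column exists from 10g onward; on older versions hash_value serves the same purpose, and the threshold here is arbitrary):

```sql
SELECT sql_id, executions, disk_reads, buffer_gets,
       ROUND(100 * (1 - disk_reads / NULLIF(buffer_gets, 0)), 2) AS hit_rate_pct
FROM   v$sql
WHERE  disk_reads > 10000     -- illustrative cut-off, tune to taste
ORDER  BY disk_reads DESC;
```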

We don’t care about the hit rates of statistics gathering jobs that use parallel processing and direct I/O (whose naively calculated hit rates can even go negative), or overnight batch jobs (if any) that do full scans.  The online users’ hit rates are more relevant.  Using V$SESS_IO I could see hit rates from 95-100%:

       SID    PHYS_READS   HIT_RATE
---------- ------------- ----------
        84             6        100
       267         29511      99.48
       189         29558      99.56
        93        215517      98.91
        92        270283      94.77
       296        418464      99.53
        95        424808      94.99
        80        437150      99.63
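Per-session figures like those can be produced with something like this (a sketch; the HIT_RATE column alias is my own, and sessions with no logical I/O are filtered out to avoid a divide-by-zero):

```sql
SELECT sid, physical_reads,
       ROUND(100 * (1 - physical_reads /
             NULLIF(block_gets + consistent_gets, 0)), 2) AS hit_rate
FROM   v$sess_io
WHERE  block_gets + consistent_gets > 0
ORDER  BY physical_reads;
```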

So, the only actions I would advise in this case are to upgrade/correct/calibrate the monitoring software, and to reconsider how the statistics are being gathered (overkill!).

Case 2

The total hit rate percentage for a 9i database that we support suddenly dropped from the mid-90s (where it had sat for the last year) to the mid-30s.  The database hadn’t been restarted recently, and a quick check of the metrics in v$buffer_pool_statistics showed that the counters hadn’t wrapped around.  This was something real to investigate.

We were lucky enough to find a statement still in V$SQL (not yet flushed) that had a low hit rate and a high number of physical reads.  If we hadn’t caught the statement in V$SQL, we could still have got clues from v$segstat and v$session_longops, which would have identified the table being scanned repeatedly.
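The segment-level drill-down is straightforward with v$segment_statistics (available from 9iR2), along these lines as a sketch:

```sql
-- Which segments are taking the physical reads?
SELECT owner, object_name, value AS physical_reads
FROM   v$segment_statistics
WHERE  statistic_name = 'physical reads'
ORDER  BY value DESC;
```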

When we notified the client of what we’d found, they mentioned that they’d attempted to archive a lot of data recently, but that the process had taken far too long and had been abandoned.

The explain plan of the statement showed a full tablescan, and a CBO trace showed that the plan using an appropriate index was losing out by only a small difference in cost.  Temporarily changing an Oracle parameter or two was an easy way to tip the balance and allow the client to complete the archiving procedure (which required an outage anyway).
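I haven’t recorded here exactly which parameters we changed, but the usual 9i levers for this sort of cost tie-break look like the following (a hypothetical sketch — the values shown are illustrative, not what we actually used):

```sql
-- Hypothetical example only: bias the CBO toward index access paths
-- for the duration of the archiving session.
ALTER SESSION SET optimizer_index_cost_adj = 20;
ALTER SESSION SET optimizer_index_caching  = 90;
```

Session-level changes like these are attractive precisely because they expire with the session, leaving the rest of the workload untouched.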


The problem and solution took minutes to find, and we were alerted to it by the change in hit rate.  Admittedly, an unusual rise in scattered read waits might also have got our attention, but the hit rate is a simpler metric to monitor, graph, compare and explain to the client in a report.

