This post follows on from a previous one in which I used a Statspack report to show the value of the time model statistics. In this post I use some guess work to look outside the DB Time metric, to get a feel for what else might affect the response time for the application, and to gauge how important tuning the database (reducing DB Time) is from a holistic point of view.

I should point out that I’m not as familiar with AIX (the OS in this case) as I am with Solaris, Linux, HP-UX or even Windows.  I had no access to the server in question during the period for the Statspack report.  These are theories.  Make up your own mind and leave a comment if you’d like to share.

The system was described to me as “CPU constrained”.  The Statspack report shows that the database was directly responsible for only 24% of the 78% CPU utilisation (in other words increased server CPU utilisation by 19%).

Host CPU  (CPUs: 2  Cores: 1  Sockets: 0)
~~~~~~~~              Load Average
                      Begin     End      User  System    Idle     WIO     WCPU
                    ------- -------   ------- ------- ------- ------- --------
                       5.90    5.92     44.49   34.03   21.47    1.47  309.01

Instance CPU
~~~~~~~~~~~~                                       % Time (seconds)
                                            -------- --------------
                     Host: Total time (s):                  7,199.4
                  Host: Busy CPU time (s):                  5,653.5
                   % of time Host is Busy:      78.5
             Instance: Total CPU time (s):                  1,364.2
          % of Busy CPU used for Instance:      24.1
        Instance: Total Database time (s):                  4,059.1
  %DB time waiting for CPU (Resource Mgr):       0.0

OS Statistics
-> ordered by statistic type (CPU use, Virtual Memory, Hardware Config), Name

Statistic                                  Total
------------------------- ----------------------
BUSY_TIME                                565,349
IDLE_TIME                                154,590
IOWAIT_TIME                               10,579
SYS_TIME                                 245,021
USER_TIME                                320,328
OS_CPU_WAIT_TIME                       2,224,700
RSRC_MGR_CPU_WAIT_TIME                         0
PHYSICAL_MEMORY_BYTES              8,589,934,592
NUM_CPUS                                       2
NUM_CPU_CORES                                  1
GLOBAL_RECEIVE_SIZE_MAX                1,048,576
GLOBAL_SEND_SIZE_MAX                   1,048,576
TCP_RECEIVE_SIZE_DEFAULT                  16,384
TCP_RECEIVE_SIZE_MAX      ######################
TCP_RECEIVE_SIZE_MIN                       4,096
TCP_SEND_SIZE_DEFAULT                     16,384
TCP_SEND_SIZE_MAX         ######################
TCP_SEND_SIZE_MIN                          4,096

OS Statistics - detail

  Snap Snapshot
    Id Day Time          Load  %Busy  %User %System   %WIO  %WCPU
------ --------------- ------ ------ ------ ------- ------ ------
  1654 Tue 18 11:00:04    5.9
  1655 Tue 18 11:31:02    7.5   82.5   46.5    36.0    1.9
  1656 Tue 18 12:00:04    5.9   74.3   42.4    31.9    1.0

This server is dedicated to the one database, and has no application component running on it.  So what was using the rest of the CPU?  Perhaps the database CPU measurements were not quite correct, (eg due to LOB OCI activity?), or perhaps the listener was using some CPU (tracing enabled?), but it seems to me there is more to it than that.  I see that 44% of the CPU busy time happened in system mode.  This could be related to I/O or network traffic (among other things).  It was subsequently discovered that the I/O was being cached by the OS, so perhaps that contributed to some degree.

Finding the cause might offer some opportunity for tuning, greater than just trying to reduce background and DB CPU time.

How about network traffic?

I already know that this application uses LOBS, (direct I/O to LOBs showed up in the Statspack report), and  LOB access can result in a high number of round trips.

                                                              Avg          %Total
                                          %Tim Total Wait   wait    Waits   Call
Event                               Waits  out   Time (s)   (ms)     /txn   Time
---------------------------- ------------ ---- ---------- ------ -------- ------
SQL*Net message from client    16,327,445    0    471,448     29     69.9

Statistic                                      Total     per Second    per Trans
--------------------------------- ------------------ -------------- ------------
bytes received via SQL*Net from c      5,432,266,525    1,508,962.9     23,262.5
bytes sent via SQL*Net to client       9,743,180,327    2,706,439.0     41,723.1
SQL*Net roundtrips to/from client         16,329,496        4,536.0         69.9

There were 16 million round trips and about 15GB of data transferred across the network during the hour.

4,536 round trips per second means 9,072 SQL*Net packets (up to 8000bytes) arriving or departing per second, which may then be broken into multiple network packets (up to 1500bytes), each potentially causing CPU interrupts which have to be handled by the OS. (“Potentially” because of interrupt coalescing).  Surely this would contribute to the CPU time system in system mode.

Further, it might show that the active database time is less significant to overall application performance than we might otherwise assume.

The total wait time was 471,448 seconds, most of which will be idle application time, but not all of it.  Some of this time is network overhead, and some will be application think time.  It is not possible to calculate the network + application time from the data in the Statspack report, but we can make a safe guess at a lower limit using the wait event histogram information.

Total ----------------- % of Waits ------------------
Event                      Waits  <1ms  <2ms  <4ms  <8ms <16ms <32ms  <=1s   >1s
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
SQL*Net more data to clien  203K  99.1    .2    .3    .3    .0    .0    .0
SQL*Net message from clien   16M  16.1  29.5  25.4  19.3   6.5   1.6   1.4    .2
SQL*Net message to client    16M 100.0    .0    .0    .0    .0    .0    .0
SQL*Net more data from cli   60K  58.5    .2    .2    .2   2.3  38.1    .4

If we decide to treat any SQL*Net wait event longer than one second as idle (application and DB doing nothing), and treat the rest as time during which the user would be waiting, then we can get a lower limit from the highlighted row above.  (I’ll ignore the less significant more data waits for this example).

0ms <= 16.1% of waits < 1ms     0.161 * 0ms * 16,327,445 = 0s
1ms <= 29.5% of waits < 2ms     0.295 * 1ms * 16,327,445 = 4,816s

Taking the lower bound of each I calculate at that 45,701 sec is the lower limit.  Compare this to the total DB Time of 3,393 sec, and we can see that the time spent in the database is no more than 6.9% of the total response time of the app server + database server components.

Note that this doesn’t include any time used for communication between the application server and the user, browser processing, etc.

This extra information puts into perspective the scope of any benefit from reducing DB Time.  This information could be useful to the overall tuning effort by moving the focus to options that might help to reduce the number of round trips or make them more efficient.  (Eg bulk processing, pre-fetching, using varchar2 instead of LOB, increasing SDU, removing firewalls between app and DB, tuning interrupt coalescing parameters, application server tuning, etc).


One thought on “Beyond DB Time

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s