I have often been asked to investigate a performance problem, only to find to my disappointment that the main component of response time is outside the Oracle database. (This robs me of a challenging tuning exercise, and may limit how useful I can be to the customer.) The problem still needs to be identified and fixed, but that will only happen if the appropriate area is examined, so the DBA may have to convince the relevant technical staff to take a closer look.
The “idle” wait event SQL*Net message from client is often ignored, but it is relevant during the period when the user is waiting for the application to respond. It can be used to show that the issue is not one of database or SQL tuning, but elsewhere, eg networking devices or application server performance. (Admittedly, the DBA may still be able to help in some cases, eg tuning the connection method, SQL*Net configuration and pre-fetch settings).
It doesn’t take long to trace a user’s session (with wait events), isolate the period during which the user is waiting for the application to respond, and then use tkprof to generate a report showing how much time was spent in the database working and how much can be attributed to the network/application server (SQL*Net message from client).
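As an illustration, splitting a tkprof-style summary into database time versus time spent waiting on the client is simple arithmetic. The figures below are hypothetical, not from a real trace:

```python
# Hypothetical wait/CPU totals (seconds) of the kind tkprof reports
# for a traced session -- illustrative, not real trace data.
times = {
    "CPU (parse/execute/fetch)": 0.8,
    "db file sequential read": 1.2,
    "SQL*Net message from client": 54.3,  # time spent outside the database
}

total = sum(times.values())
outside_db = times["SQL*Net message from client"]
print(f"total elapsed: {total:.1f}s")
print(f"outside the database: {outside_db / total:.0%}")
```

In this sketch almost all of the elapsed time is SQL*Net message from client, which is exactly the pattern that points away from the database.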
Sometimes this is enough for the DBA to stop investigating, but other times the people responsible for the network or application server need more evidence.
Monitoring the load (eg CPU) on the application server isn’t always going to be conclusive, especially when there are lots of round trips between the application server and the database.
In some cases the application server has been found to be the main cause of the delay, eg JVM garbage collections halting processing. In the example I use below, it was proved to be a network problem, despite initial claims that it was not.
When the Oracle database has been ruled out as the cause of the slow performance, a useful next step is to narrow things down further by measuring network packet arrivals and departures from the perspectives of both the application server and the database server. In one particular case, I used snoop and then fed the data into the free program Wireshark to analyse the results.
The snoop measurements from the Oracle database server side showed fast responses to incoming packets, which matched the SQL trace data.
The snoop measurements from the application server side also showed fast responses to incoming packets.
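Reading those response times out of a capture amounts to measuring the gap between a request packet arriving at the server and the reply leaving it. A minimal sketch, with hypothetical timestamps of the kind snoop or Wireshark display:

```python
# Sketch: server-side "think time" is the gap between a request packet
# arriving and the reply departing. Timestamps (seconds) are hypothetical.
packets = [
    (10.000, "in",  "request"),
    (10.002, "out", "reply"),    # answered in ~2 ms -> this server is fast
    (10.950, "in",  "request"),
    (10.953, "out", "reply"),
]

prev_in = None
for ts, direction, _ in packets:
    if direction == "in":
        prev_in = ts
    elif prev_in is not None:
        print(f"server response time: {(ts - prev_in) * 1000:.0f} ms")
```

When both capture points show gaps this small, the time is being lost somewhere between the two servers.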
These results indicated that the ‘network’ (TCP stack, NIC, cable, switch, port, etc) was the cause of most of the delay. I just needed a little more low-level evidence, so I tried pings from both servers, which showed RTTs of >10ms. The servers were on different subnets, so there had to be at least one switch or router in between. We needed more granularity.
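Back-of-the-envelope arithmetic shows why a >10ms RTT matters once a chatty application is involved. The round-trip count below is illustrative, not measured:

```python
# Illustrative only: the round-trip count is an assumption, and the RTT
# is simply in line with the >10ms pings observed between the servers.
round_trips = 5000   # SQL*Net round trips for one user action
rtt_ms = 11          # per-trip network latency

network_delay_s = round_trips * rtt_ms / 1000
print(f"network contribution to response time: {network_delay_s:.0f}s")
```

At sub-millisecond RTTs the same 5000 round trips would cost under five seconds, which is why per-trip latency dominates the user's experience here.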
A traceroute from each server showed this:
[dbserver:root]# traceroute appserver
traceroute to appserver (nn.nn.40.49), 30 hops max, 40 byte packets
 1  nn.nn.38.2 (nn.nn.38.2)  0.521 ms  0.396 ms  0.322 ms
 2  appserver (nn.nn.40.49)  11.968 ms  8.708 ms  13.582 ms

[appserver:root]# traceroute dbserver
traceroute to dbserver (nn.nn.38.166), 30 hops max, 40 byte packets
 1  nn.nn.40.2 (nn.nn.40.2)  1.149 ms  0.997 ms  0.740 ms
 2  dbserver (nn.nn.38.166)  11.134 ms  12.107 ms  13.027 ms
Hop 1 is to a switch/router and was consistently fast, proving that the cause wasn’t local to one of the servers.
Hop 2 was always slow, between the switch/router and the remote server, in both directions. It must be the switch causing the problem!
These results were repeatable, so we had enough reason to ask the network engineer to take another look.
The network engineer found that the switch/router suffered from a “known problem”. Something about other ports on the same line card running at slower speeds but sharing the same buffer… I take that to mean the output queue could back up when sending large amounts of data towards a slower network.