Putting firewalls between application servers and database servers seems to be becoming more popular.

One negative side effect of this is when the firewall decides to end a connection due to inactivity.  Some networking devices have a timeout period so idle connections are cut off.  This can have unpleasant results for the users of an application.

I first came across this problem while working for a new customer a long time ago.  I was frequently asked to kill database sessions that held row level locks and were blocking other transactions.  After a while, I noticed a pattern.  The locking problems always started after the blocking session had held the lock and been idle for at least 900 seconds.  When this was described to the users, they added that they have to log in again after lunch, or coffee breaks, etc.  I referred this issue to the network administrator, (who sat next to me and so couldn’t ignore me), who soon realised that the cause was a NAT’ing router.  The router cut off the (fat client) application’s session after the user had been idle for 15min.  The timeout was increased and users were asked to log out of the application when finished for the day.  Problem solved.

The SQLNET.LOG on a database server can show a DBA that idle connections are being cut off, eg a recent example I saw on Solaris:

***********************************************************************
Fatal NI connect error 12170.

 VERSION INFORMATION:
 TNS for Solaris: Version 10.2.0.3.0 - Production
 Oracle Bequeath NT Protocol Adapter for Solaris: Version 10.2.0.3.0 - Production
 TCP/IP NT Protocol Adapter for Solaris: Version 10.2.0.3.0 - Production
 Time: 01-JUL-2009 03:31:04
 Tracing not turned on.
 Tns error struct:
 ns main err code: 12535
 TNS-12535: TNS:operation timed out
 ns secondary err code: 12560
 nt main err code: 505
 TNS-00505: Operation timed out
 nt secondary err code: 145
 nt OS err code: 0
 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=nn.nn.nn.nn)(PORT=57922))

In this example, the Solaris error 145 shows that the disconnection was not due to an Oracle setting or database problem, (as some parties were insisting), but was actually from a connection timeout at the OS level.  (Look up error number 145 in /usr/include/sys/errno.h ).
In this case I needed more evidence to prove my explanation to others, so I recorded the connected database sessions and their idle times, then matched these errors to idle sessions that had disappeared at the same time.  Recording this data for a while showed that only sessions that were idle for a certain amount of time were cut off.  Later, the network support staff were able to confirm the idle connection timeout setting.

Over the years, my attempts to persuade network administrators to increase the timeout period have often met resistance, because the timeout exists to free up resources.  An alternative is to use some sort of keep alive packets.  There are various places keep alives can be enabled, eg at the Windows OS level or in PuTTy.  For Oracle databases, SQLNET.EXPIRE_TIME can be used.  Although SQLNET.EXPIRE_TIME was intended as a dead connection detection (DCD) method, a side effect of sending probe packets between the database server and client/application server is that the connection remains active.  Set the SQLNET.EXPIRE_TIME to less than the firewall timeout and the problem is solved.

Another work around for applications that use connection pools is to keep each connection active by checking running a simple SQL statement at regular intervals.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s