Connections to Oracle RAC timing out

A few weeks back, I was alerted by a remote team that they were getting frequent connection timeout errors when connecting to the Oracle RAC database I had installed at their location. The errors seemed to occur for a few minutes at a time, then things looked fine until the next few-minute period of timeouts. They experienced these periods 1-3 times per day.

The alert file had the following errors:

WARNING: inbound connection timed out (ORA-3136)
WARNING: inbound connection timed out (ORA-3136)
WARNING: inbound connection timed out (ORA-3136)
WARNING: inbound connection timed out (ORA-3136)
WARNING: inbound connection timed out (ORA-3136)
... (repeated many times)
WARNING: inbound connection timed out (ORA-3136)

The Oracle RAC database is version 10.2.0.4 and uses Oracle Clusterware 11.2.0.1. The OS is Linux (CentOS 5.4). The cluster has 2 nodes and is configured to use a dedicated 1 GbE network interface for the interconnect. The Oracle Clusterware and the database software are owned by different Linux users.

After digging some more in the alert files, I noticed a notification in the instance start-up entries that a dedicated interconnect network interface could not be found:


WARNING: No cluster interconnect has been specified. Depending on the communication driver configured Oracle cluster traffic may be directed to the public interface of this machine. Oracle recommends that RAC clustered databases be configured with a private interconnect for enhanced security and performance.
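
If you want to confirm which network an instance is actually using for cluster communication, one option (in 10.2 and later, as far as I know) is to query gv$cluster_interconnects from SQL*Plus. This is a minimal sketch; the column list is from memory, so treat it as an assumption:

-- per instance: the interface and IP used for the interconnect, and where
-- that choice came from (OCR, the cluster_interconnects parameter, or the OS)
select inst_id, name, ip_address, source
from gv$cluster_interconnects
order by inst_id;

On a properly configured cluster, the private interconnect address should show up here for every instance.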

Also, after the connection timeout errors, there was the following error:

Global Enqueue Services Deadlock detected.
More info in file /u01/app/oracle/admin/xxxxxx/bdump/xxxxxx_lmd0_11155.trc.

The latter contained entries similar to the ones below:


Single node DRM(1) - transfer pkey 4294950913 to 0 done
Single node DRM(1) - transfer pkey 4294950914 to 0 done
Single node DRM(1) - transfer pkey 4294950915 to 0 done
...

I found a great post about Oracle’s dynamic object remastering (DRM) feature on Riyaj’s blog. I figured the reason for the timeouts was that Oracle was using the public network for inter-node communication: during normal dynamic object remastering, combined with high traffic on the public network, the RAC got into a state that caused connection timeouts for short periods of time.
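
As a side note, if you want a rough measure of how much remastering activity the instances are doing, there is a DRM statistics view you can query. A minimal sketch, assuming gv$dynamic_remaster_stats is available in your version and that I remember the column names correctly:

-- per-instance DRM activity: number of remaster operations, number of
-- objects remastered, and the time spent on them
select inst_id, remaster_ops, remastered_objects, remaster_time
from gv$dynamic_remaster_stats;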

The way I resolved this was by setting the cluster_interconnects instance parameter in the spfile of each instance and restarting them:

alter system set cluster_interconnects = '10.10.10.1' scope=spfile sid='node1';
alter system set cluster_interconnects = '10.10.10.2' scope=spfile sid='node2';

The IP address is the one used by the dedicated interconnect interface on each node. You can see what it is by logging in to the OS as the Oracle Clusterware owner and executing:


[grid@node1 ~]$ oifcfg getif
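
The output typically looks something like the lines below (the interface names and subnets are made up for illustration, not taken from the actual system):

eth0  192.168.1.0  global  public
eth1  10.10.10.0  global  cluster_interconnect

After setting cluster_interconnects and bouncing the instances, the gv$cluster_interconnects query shown earlier should report the private address, with the source column indicating that it came from the parameter.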

Summary

Using a dedicated interconnect in Oracle RAC is very important for the operation of the database. When installing different versions of the database and Clusterware software, there can be dependency issues between the components. In my case, one of the ways this made itself evident was that the Oracle instance was not able to retrieve the interconnect information from the OCR. Check the alert log for any warnings about the interconnect and do not ignore them.
