Saturday, January 4, 2014

ActiveMQ Connection Factory Timeout for Failover

I recently encountered an issue in our Apache ServiceMix environment: if all the nodes participating in an ActiveMQ failover connection are down, and no timeout is set on the connection factory, ActiveMQ will wait forever for a node to become available.

This sounds reasonable, but in our usage, other processes unrelated to the ActiveMQ integration messages were being blocked and rolled back. At a high level, the sending system batch-sends messages to integration points A and B, which are of different types. Type A might be a file system write, while type B is an ActiveMQ message send.

If the messages were ordered as described above, a single run would have been fine even with all the ActiveMQ failover nodes down. The file for integration A would be written, B would wait forever to send, and nothing comes after it, so no harm done. The problem is that the batch send process for the mixed message types runs again while the thread sending integration B (the ActiveMQ send) is still waiting. The backend of the messaging system is a simple database table, which stores the messages to send and clears them afterward. So a second process thread attempts to send the integration B message again, but rolls back because the message is locked by the initial attempt, and we begin to see a symptom of the original problem: the blocking.

The original failover string in the ActiveMQ connection factory looked like the following:

Notice there is no timeout, which results in blocking forever. In the case described above, we don't want the connection factory to wait forever. We want it to time out, so that the next batch message sending process can attempt the send. We referenced the following ActiveMQ documentation:

We then added the additional parameter to the failover string, that being "&timeout=<some number>". As a result, the connection attempt stops looking for an ActiveMQ node to take the messages once the timeout elapses. With the parameter in place, the sending thread fails on the ActiveMQ message rather than blocking.
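As a rough illustration of the change, here is a minimal Java sketch that appends a timeout option to a failover string. The broker host names, the existing "randomize=false" option, and the 3000 millisecond value are assumptions made up for the example, not our actual configuration, and the FailoverUrl helper is hypothetical rather than part of ActiveMQ.

```java
import java.util.Objects;

public class FailoverUrl {

    /**
     * Returns the given failover broker URL with a "timeout" option appended.
     * The value is in milliseconds. Hypothetical helper for illustration only.
     */
    public static String withTimeout(String failoverUrl, long timeoutMillis) {
        Objects.requireNonNull(failoverUrl, "failoverUrl");
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        // Only a '?' after the closing parenthesis starts the failover options;
        // a '?' inside the parentheses belongs to a nested broker URI.
        int close = failoverUrl.lastIndexOf(')');
        boolean hasOptions = failoverUrl.indexOf('?', close + 1) >= 0;
        String separator = hasOptions ? "&" : "?";
        return failoverUrl + separator + "timeout=" + timeoutMillis;
    }

    public static void main(String[] args) {
        // Assumed example values: two made-up brokers and one existing option.
        String original =
            "failover:(tcp://broker-a:61616,tcp://broker-b:61616)?randomize=false";
        System.out.println(withTimeout(original, 3000));
    }
}
```

The resulting string is what would be set as the broker URL on the connection factory, for example through the ActiveMQConnectionFactory constructor that takes a broker URL string. The right timeout value depends on how often the batch process runs, so treat the number here as a placeholder.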

Also note that including the JMS header "JMSExpiration" on the message had no effect, since the issue was in establishing the connection, not in the actual sending of the message.

Obviously you might think, "why would all the nodes in the ActiveMQ failover be down?" The answer is, they were down.
