Funny Concurrency Story

I've recently have been working with a client using a very well-known Web-tier library (which will go anonymous for certain reasons). Under heavy load, their system started responding terribly. They asked me to take a look at things.

To make a long story short, the library has a race condition in the assignment of a Web session's state to the activated code-handling block. Under heavy concurrent load, a handler may be activated with a null session (and, delightfully, a few microseconds later another handler might find its underlying session state assigned to). Not surprisingly, my client's code can't do anything logical with a null session and generates an exception. Worse, though, the library thread, which is pooled, becomes permanently corrupted.

So once the first failure happens and the thread corrupted, it increased the load on the remaining threads, making their failure more likely. However, as the thread pool corruption grows, the time that the over-all application spends in states away from the race condition increases, the requests start to queue, and the system limps along (kind of).

One of the reasons why the company couldn't figure out what was going on was that when they turned Debug logging on, the serialization of debugging statements to the logfile inadvertently throttled the system so that the deadly race condition didn't occur.

So, after figuring out al this, my first recommendation was to change the use of the Web-tier library. In the meantime, though, I added a very few logging calls as an "Info" log-level, gave them a graphical "dashboard" showing the system at that level, and tested the system to data-tier saturation without triggering the race condition! (Oh, and I gave them a very detailed explanation of why they shouldn't view the "synchronization by logging statement" trick as a fix!)