Recently I was surprised to find out that a process that I expected to to complete in about 8 hours was still running after 20. Everything appeared to be operating normally. The load on the server was what we expected, IO was minimal, and the external services it was using were responding with latencies that were normal.
After tracing one of the sub-processes we noticed that reads from
/dev/urandom were not what we expected. It was taking 80-200ms to read 4k from
this device. At first I thought that this was an issue with entropy but
/dev/urandom is non-blocking so that probably wasn't the issue.
I didn't think that 80-200ms was typical so I tried a
dd on the system
in question and another similar system in production. The system in question
took about 3 minutes to write 10Mb while the reference system took about 3s.
The only difference between the systems with respect to
/dev/urandom was the
rate and number of processes reading from the device. The reads were on
the order of hundreds per second.
The number of processes reading from
/dev/urandom made me wonder if maybe
there was a spinlock in the kernel in the read code. After looking at the code
I found one. You can see the spinlock here
in the Linux kernel source code. The author mentions the potential for
contention in a thread
on the mailing list from December 2004.
Fast forward 10 years and contention on this device is a real issue. Our
curl from within PHP to fetch data from a
cache. The application has to process 10s of
millions of text objects and we don't want to wait days for that processing
to complete so we split the work of processing each object over N threads.
The read from
/dev/urandom appears to be coming from
which is being called from
curl_easy_init in our version of PHP+curl. Removing
AsynchDNS feature from curl causes the problem to go away
(tracing confirms that the read from
/dev/urandom is no longer there). You can
remove this feature by compiling with
So why is this an issue? I wrote a python script
to measure the read times from
/dev/urandom as you increase the contention by
adding more threads. Here is a plot with the results.
Running the same script with a user-land file is more or less linear out to 16 threads. A simple spinlock can have a big impact in the multi-core world of 2014!