Lock-free

Scalability isn’t everything

In the beginning, you write threaded apps blissfully ignorant of scalability. That’s usually okay; most apps don’t need it. But sooner or later you will come across a problem that demands it. With enough searching, you will come across lock-free algorithms: tricky to get right, but promising fantastic scalability if you do.

Even trickier, though, is knowing when not to use them. Lock-free algorithms come with a price: although they are indeed very scalable, their raw performance can be much worse than that of a well-designed single-threaded algorithm. Do a little benchmarking and you might find something surprising: the performance hit can be so large that a simple single-threaded algorithm protected by a lock, with no scalability at all, gives better overall performance than a 100% scalable lock-free version.
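
For concreteness, here is roughly what I mean by a simple locked queue. This is just an illustrative sketch of the baseline, not code from my application; the class and member names are mine:

```cpp
#include <mutex>
#include <queue>

// The "simple locked" baseline: a plain single-threaded queue guarded
// by one mutex. Zero scalability, since every operation serializes on
// the lock, but almost zero overhead beyond the queue itself.
template<typename T>
class locked_queue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> guard(m_mutex);
        m_queue.push(std::move(value));
    }

    bool try_pop(T &out) {
        std::lock_guard<std::mutex> guard(m_mutex);
        if (m_queue.empty())
            return false;
        out = std::move(m_queue.front());
        m_queue.pop();
        return true;
    }

private:
    std::mutex m_mutex;
    std::queue<T> m_queue;
};
```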

This is more common than you might think. Take a queue. A single-threaded version will typically have very minimal memory overhead: maybe a pointer for every n objects. A lock-free version will need two pointers for every object (or one, if you use a GC). How much that overhead matters depends greatly on what your object is. If your object is large, a lock-free queue will probably be the better choice. But if your object is small, say one or two pointers, the overhead can be enough that cache misses will significantly affect your application.
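
To see where that overhead comes from, compare typical node layouts. The structs below are my own illustration, assuming a chunked layout for the single-threaded queue and a Michael-Scott-style node for the lock-free one; the names and the exact reclamation field are hypothetical:

```cpp
#include <atomic>
#include <cstddef>

// Single-threaded chunked queue: one next pointer is amortized over
// N items, so the overhead is roughly one pointer per n objects.
template<typename T, std::size_t N = 64>
struct chunk_node {
    T items[N];
    std::size_t count;   // how many slots are in use
    chunk_node *next;    // one pointer shared by N items
};

// Lock-free queue: every item gets its own node with an atomic next
// pointer, and without a GC you typically pay a second word per node
// for safe memory reclamation (an ABA tag, hazard-pointer bookkeeping,
// or a reference count).
template<typename T>
struct lockfree_node {
    T item;
    std::atomic<lockfree_node *> next;
    std::size_t reclaim_tag;  // goes away if a GC reclaims nodes
};
```

For an item that is itself only a pointer or two, the lock-free layout doubles or triples the memory touched per operation, and that difference shows up directly as cache misses.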

I recently had to tackle this problem. My application needed a queue of small objects, and on a modern quad-core CPU the cache misses were hurting performance so much that although a lock-free queue did have near 100% scalability, the overall operation was completing 165% faster with a locked queue with zero scalability.

The next best thing is to combine the best of both worlds: design a queue with low overhead and medium scalability. Using a reader-writer lock with a combination of lock-free operations, I came up with a queue that only needs to take a full lock once every 32 or 64 operations. The result? Scalability 5% lower than a lock-free queue, with overall performance 210% better.
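
The sketch below shows the general shape of the idea, assuming C++17’s std::shared_mutex. It is my reconstruction of the technique rather than the actual code: producers claim slots in the current block lock-free (a fetch_add) while holding the lock in shared mode, and only a thread that finds the block full takes the exclusive lock to link in a fresh one, so a full lock happens roughly once per block of 32 operations. The pop path and block reclamation are omitted to keep it short:

```cpp
#include <atomic>
#include <cstddef>
#include <new>
#include <shared_mutex>

// Hypothetical hybrid queue: lock-free slot claims under a shared
// lock, with a full (exclusive) lock only once per BLOCK_SIZE pushes.
template<typename T, std::size_t BLOCK_SIZE = 32>
class hybrid_queue {
    struct block {
        std::atomic<std::size_t> claimed{0};  // next free slot index
        std::atomic<bool> ready[BLOCK_SIZE];  // slot i is published
        alignas(T) unsigned char slots[BLOCK_SIZE][sizeof(T)];
        block *next = nullptr;

        block() {
            for (auto &r : ready)
                r.store(false, std::memory_order_relaxed);
        }
    };

public:
    hybrid_queue() : m_tail(new block) {}

    void push(T value) {
        for (;;) {
            {
                // Fast path: shared lock, so producers run concurrently
                // and only contend on the fetch_add below.
                std::shared_lock<std::shared_mutex> shared(m_lock);
                block *b = m_tail;
                std::size_t i = b->claimed.fetch_add(1, std::memory_order_relaxed);
                if (i < BLOCK_SIZE) {
                    new (b->slots[i]) T(std::move(value));
                    // Publish the slot to consumers.
                    b->ready[i].store(true, std::memory_order_release);
                    return;
                }
            }
            // Slow path: the block is full. Take the exclusive lock, link
            // in a fresh block (unless someone beat us to it), and retry.
            // This is the "full lock", paid once per BLOCK_SIZE pushes.
            std::unique_lock<std::shared_mutex> exclusive(m_lock);
            if (m_tail->claimed.load(std::memory_order_relaxed) >= BLOCK_SIZE) {
                block *fresh = new block;
                m_tail->next = fresh;
                m_tail = fresh;
            }
        }
    }

    // The pop path mirrors push: claim read indices under the shared
    // lock, retire exhausted blocks under the exclusive one.

private:
    std::shared_mutex m_lock;
    block *m_tail;   // swapped only under the exclusive lock
};
```

The shared lock is what keeps the fast path scalable: shared acquisitions don’t block each other, so in steady state producers only fight over the fetch_add, and the exclusive lock, the expensive serializing part, is amortized over BLOCK_SIZE operations.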

OK, I’ll admit it: I cheated, somewhat. Lock-free algorithms are good for more than just scalability. They also offer immunity to nasty effects like deadlock, livelock, and priority inversion. In my case I wasn’t in a situation to worry about these, but you might be. The lesson here is to know your situation and decide carefully, and don’t trust what others tell you: always try things yourself and profile.