
Optimizing IP range searching in PeerGuardian

I was working on something completely different last night when an elegant idea came to mind for significantly speeding up PeerGuardian's IP searching. It's funny how an idea can just pop into your head about a problem you haven't thought about in a long time.

Right now PeerGuardian uses a binary search to match IPs. This is already pretty efficient, running in ⌈log₂ N⌉ comparisons: for 100,000 IP ranges, about 17 tests need to be done. This has the additional advantage of having no memory overhead.
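
In code, the current approach looks something like this (an illustrative C++ sketch of a binary search over ranges sorted by start address, not the actual PeerGuardian source):

#include <algorithm>
#include <cstdint>
#include <vector>

struct IpRange { uint32_t start, end; };   // inclusive range, sorted by start

// Returns true if ip falls inside any of the sorted, non-overlapping ranges.
bool is_blocked(const std::vector<IpRange> &ranges, uint32_t ip)
{
   // Find the first range starting after ip, then step back one.
   auto it = std::upper_bound(ranges.begin(), ranges.end(), ip,
      [](uint32_t v, const IpRange &r) { return v < r.start; });
   if (it == ranges.begin())
      return false;
   --it;
   return ip >= it->start && ip <= it->end;
}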

My idea is to use a structure similar to a B+tree, packing as many IPs and branch pointers into a cache line as possible. On today's architectures a cache line is typically 64 bytes, so 8 IPs and 9 branch pointers would fit in each node, meaning only about ⌈log₉ N⌉ nodes need to be read to find a match. So to find a match in 100,000 IP ranges, only about 6 nodes would need to be read.
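
The node layout I have in mind is roughly this (illustrative only; the field names are mine and nothing here is from the actual PeerGuardian source):

#include <cstdint>

struct Node {
   uint32_t keys[8];       // sorted range-start IPs
   uint32_t children[9];   // child node indices; leaves point into the range list
};
// As written this is 8*4 + 9*4 = 68 bytes; trimming to 7 keys and 8 children
// would fit a 64-byte line exactly.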

CPUs always read and cache data in blocks (a cache line), so an algorithm that keeps this in mind to minimize memory reads and maximize cache usage should be incredibly fast. Even though this introduces significant overhead for branch pointers (about 2x the storage would be required), it should be far more efficient overall.

But this algorithm improves things in another way too: branching. I'm talking about branch instructions here, not the branch pointers mentioned above. The fewer branches code takes, the faster a superscalar or pipelined CPU can run it. With this algorithm, an entire node could be processed (that is, comparing the IPs and determining which branch to go into) with zero branches, using integer SSE2 (PCMPGTD, PMOVMSKB) and bit-scan forward (BSF).
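
The per-node step could look something like this (a sketch using GCC/Clang intrinsics; it assumes the keys are stored pre-biased by 0x80000000 so that the signed PCMPGTD behaves like an unsigned compare):

#include <emmintrin.h>   // SSE2: _mm_cmpgt_epi32 (PCMPGTD), _mm_movemask_epi8 (PMOVMSKB)
#include <cstdint>

// Returns which of the 9 children to descend into: the index of the first
// key greater than ip, or 8 if no key is greater.
int find_branch(const uint32_t keys[8], uint32_t ip)
{
   __m128i q  = _mm_set1_epi32((int)(ip ^ 0x80000000u));
   __m128i k0 = _mm_loadu_si128((const __m128i *)&keys[0]);
   __m128i k1 = _mm_loadu_si128((const __m128i *)&keys[4]);

   // Lanes where key > ip become all-ones (PCMPGTD).
   __m128i gt0 = _mm_cmpgt_epi32(k0, q);
   __m128i gt1 = _mm_cmpgt_epi32(k1, q);

   // One mask bit per byte (PMOVMSKB): 16 bits per register, 4 bits per key.
   unsigned mask = (unsigned)_mm_movemask_epi8(gt0)
                 | ((unsigned)_mm_movemask_epi8(gt1) << 16);

   // BSF (__builtin_ctz) finds the first set bit; the final select usually
   // compiles to a conditional move rather than a branch.
   return mask ? (int)(__builtin_ctz(mask) / 4) : 8;
}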

I can’t be sure how much of a speed difference this would make until I code it up, but I bet it would be at least 200% faster. I’ve been too busy to work on PeerGuardian for quite a while, so I don’t know if this will ever make it into PG. We’re looking for a new coder with more time on their hands.

Optimizing exceptions

You might often hear about exceptions being slow. For this reason they are usually shunned in the embedded space, and sometimes even in regular desktop/server programming. What makes them slow? When one is thrown, the runtime needs to search up the call stack for exception handlers.

I guess I don’t understand this line of thinking. For one, exceptions are meant for exceptional situations: things you don’t expect to happen under normal operation. Code that uses exceptions will run just as fast as code without them (or maybe even faster), until you throw one. These exceptional situations are truly rare, so I usually don’t care if they do happen to run slower.

A compiler can actually use exceptions to optimize your code. Consider this inefficient (but typical) pseudo-C:

int dosomething(void) {
   /* do something A */
   if(err) return -1;

   /* do something B */
   if(err) {
      /* cleanup previous work A */
      return -1;
   }

   /* do something C */
   if(err) {
      /* cleanup previous work B */
      /* cleanup previous work A */
      return -1;
   }

   return 0;
}

Or even this more efficient (yes boys and girls, goto actually has a good use case in C, get over it) pseudo-C:

int dosomething(void) {
   /* do something A */
   if(err) return -1;

   /* do something B */
   if(err) goto err1;

   /* do something C */
   if(err) goto err2;

   return 0;

   err2:
   /* cleanup previous work B */

   err1:
   /* cleanup previous work A */

   return -1;
}

Why are these bad? Cache locality. In the first example, you have error-handling code inline with your regular code. In the second it’s slightly better, pushed off to the end of the function. Ideally the code you actually run should be packed into as few cache lines as possible, and handling errors this way wastes significant space on cleanup code that in the large majority of cases won’t be run.

But with exceptions, the compiler is free to take all the cleanup code in your entire app and shove it into a single separate area of code. All the normal code you expect to run can be compact and closer together. Of course, this will make throwing an exception slower. If your code is heavy on throwing exceptions (which would probably be an abuse of them), it will probably cause a significant overall slowdown. But if they are used correctly, for exceptional situations, then the common case gets improved cache usage and therefore faster code.
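
To make that concrete, here is the same flow written with exceptions (a minimal C++ sketch: the do_a/do_b/do_c steps are hypothetical, and the tiny scope guard stands in for real RAII types):

#include <cstdio>
#include <stdexcept>

static bool err = false;   // stand-in for a real failure condition

static void do_a() { /* do something A */ if (err) throw std::runtime_error("A failed"); }
static void do_b() { /* do something B */ if (err) throw std::runtime_error("B failed"); }
static void do_c() { /* do something C */ if (err) throw std::runtime_error("C failed"); }

static void undo_a() { std::puts("cleanup A"); }
static void undo_b() { std::puts("cleanup B"); }

// Runs its cleanup only if the scope is left by an exception before dismiss().
struct guard {
   void (*undo)();
   bool armed;
   explicit guard(void (*f)()) : undo(f), armed(true) {}
   ~guard() { if (armed) undo(); }
   void dismiss() { armed = false; }
   guard(const guard &) = delete;
   guard &operator=(const guard &) = delete;
};

void dosomething()
{
   do_a();
   guard ga(undo_a);   // if B or C throws, A gets cleaned up during unwinding

   do_b();
   guard gb(undo_b);   // if C throws, B (then A) gets cleaned up

   do_c();

   gb.dismiss();       // success: keep the work, no cleanup code runs
   ga.dismiss();
}

All the cleanup now lives on the unwind path, which the compiler is free to keep away from the code that normally runs.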

Scalability isn’t everything

In the beginning, you write threaded apps blissfully ignorant of scalability. That’s usually okay; most apps don’t need it, but sooner or later you will come across a problem that demands it. With enough searching, you will come across lock-free algorithms. Tricky to get right, but promising fantastic scalability if you do.

Even trickier, though, is knowing when not to use them. Lock-free algorithms come with a price: although they are indeed very scalable, their performance can be much worse than that of a well-designed single-threaded algorithm. Do a little benchmarking and you might find something surprising: the performance hit can actually be so large that a simple locked single-threaded algorithm with no scalability gives better overall performance than a 100% scalable lock-free version.

This is more common than you might think. Take a queue. A single-threaded version will typically have very minimal memory overhead: maybe a pointer for every n objects. A lock-free version will need two pointers for every object (or one, if you use a GC). How much that overhead matters depends greatly on what your object is. If your object is large, a lock-free queue will probably be the better choice. But if your object is small, say one or two pointers, the overhead can be enough that cache misses will significantly affect your application.
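
To put rough shapes on that, here are illustrative layouts (my own guesses at typical designs, not taken from any real application): a chunked queue amortizes its bookkeeping over many items, while a classic lock-free queue pays a heap node and a link per item.

#include <atomic>
#include <cstddef>

struct Item { void *a; void *b; };   // a small two-pointer payload

// Chunked single-threaded (or simply locked) queue: the next pointer and
// bookkeeping are amortized over 64 items per chunk.
struct Chunk {
   Item    items[64];
   size_t  head, tail;
   Chunk  *next;
};

// Typical lock-free queue: one heap-allocated node and one atomic link per
// item; without a GC, ABA protection usually adds a counter or a second
// pointer per link on top of that.
struct LockFreeNode {
   Item item;
   std::atomic<LockFreeNode *> next;
};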

I recently had to tackle this problem. My application needed a queue of small objects, and on a modern quad-core CPU the cache misses were hurting performance so much that, although a lock-free queue did have near 100% scalability, the overall operation completed 165% faster with a locked queue that had zero scalability.

The next best thing is to combine the best of both worlds: design a queue with low overhead and medium scalability. Using a reader-writer lock with a combination of lock-free operations, I came up with a queue that only needs to take a full lock once every 32 or 64 operations. The result? Scalability 5% lower than a lock-free queue, with overall performance 210% better.

OK, I’ll admit it: I cheated, somewhat. Lock-free algorithms are good for more than just scalability. They also offer immunity to nasty effects like deadlock, livelock, and priority inversion. In my case I wasn’t in a situation where I had to worry about these, but you might be. The lesson here is to know your situation and decide carefully, and don’t trust what others tell you: always try things yourself and profile.