
Optimizing IP range searching in PeerGuardian

I was working on something completely different last night when an elegant idea came to mind for significantly speeding up PeerGuardian’s IP searching. It’s funny how an idea can just pop into your head about a problem you haven’t thought about in a long time.

Right now PeerGuardian uses a binary search to match IPs. This is already pretty efficient, needing at most ⌈log₂ N⌉ comparisons, so for 100,000 IP ranges, about 17 tests. It also has the advantage of having no memory overhead.
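
This isn’t PeerGuardian’s actual code, but the approach looks roughly like this: keep the ranges sorted by starting address and binary-search for the one range that could contain the IP:

#include <cstdint>
#include <vector>

struct ip_range {
   uint32_t start;   /* inclusive */
   uint32_t end;     /* inclusive */
};

/* Returns true if ip falls inside any range.  Ranges must be sorted by
   start and non-overlapping.  Worst case is about log2(N)+1 comparisons. */
bool is_blocked(const std::vector<ip_range> &ranges, uint32_t ip)
{
   std::size_t lo = 0, hi = ranges.size();
   while(lo < hi) {
      std::size_t mid = lo + (hi - lo) / 2;
      if(ranges[mid].end < ip) lo = mid + 1;      /* range entirely below ip */
      else if(ranges[mid].start > ip) hi = mid;   /* range entirely above ip */
      else return true;                           /* start <= ip <= end */
   }
   return false;
}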

My idea is to use a structure similar to a B+tree, packing as many IPs and branch pointers into a cache line as possible. On today’s architectures a cache line is typically 64 bytes, so about 8 IPs and 9 branch pointers would fit in each node, meaning only about ⌈log₉ N⌉ nodes need to be read to find a match. So to find a match in 100,000 IP ranges, only 5 or 6 nodes would need to be read.
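
One possible node layout (a sketch, not anything from PeerGuardian’s code) keeps the 8 sorted range-start keys contiguous, so a single cache-line fetch covers them, with the 9 child links giving the branching factor of 9:

#include <cstdint>

struct search_node {
   uint32_t key[8];     /* sorted range starts (stored sign-flipped, see below) */
   uint32_t child[9];   /* indices of child nodes, or of leaf ranges */
};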

CPUs always read and cache data in blocks (a cache line), so an algorithm that keeps this in mind to minimize memory reads and maximize cache usage should be incredibly fast. Even though this introduces significant overhead for branch pointers (about 2x the storage would be required), it should be far more efficient overall.

But this algorithm improves on plain binary search in another way too: branching. I’m talking about branch instructions here, not the branch pointers mentioned above. The fewer branches your code takes, the faster a superscalar or pipelined CPU can run it. With this approach, an entire node can be processed (that is, the IPs compared and the branch to descend into chosen) with zero branch instructions, using integer SSE2 (PCMPGTD, PMOVMSKB) and bit-scan forward (BSF).
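
Here is a hedged sketch of what processing one node could look like using those instructions through compiler intrinsics. It assumes the keys are stored with their sign bit flipped (key ^ 0x80000000) so the signed PCMPGTD compare orders them like unsigned IPs, and a GCC/Clang-style compiler where __builtin_ctzll maps to BSF/TZCNT:

#include <emmintrin.h>   /* SSE2: _mm_cmpgt_epi32 = PCMPGTD, _mm_movemask_epi8 = PMOVMSKB */
#include <cstdint>

/* Pick which of the 9 children to descend into, with no branch instructions. */
int select_branch(const search_node *n, uint32_t ip)
{
   /* Broadcast the (sign-flipped) IP and compare it against all 8 keys. */
   __m128i needle = _mm_set1_epi32((int32_t)(ip ^ 0x80000000u));
   __m128i lo = _mm_loadu_si128((const __m128i *)&n->key[0]);
   __m128i hi = _mm_loadu_si128((const __m128i *)&n->key[4]);
   unsigned gt_lo = (unsigned)_mm_movemask_epi8(_mm_cmpgt_epi32(lo, needle));
   unsigned gt_hi = (unsigned)_mm_movemask_epi8(_mm_cmpgt_epi32(hi, needle));

   /* Each key greater than ip sets 4 bits.  Bit 32 is a sentinel so that an
      all-zero mask (no key greater) selects the last child, index 8. */
   uint64_t mask = (uint64_t)gt_lo | ((uint64_t)gt_hi << 16) | (1ull << 32);
   return (int)(__builtin_ctzll(mask) / 4);   /* BSF: index of first key > ip */
}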

I can’t be sure how much of a speed difference this would make until I code it up, but I bet it would be at least 200% faster. I’ve been too busy to work on PeerGuardian for quite a while, so I don’t know if this will ever make it into PG. We’re looking for a new coder with more time on their hands.

Optimizing exceptions

You might often hear about exceptions being slow. For this reason they are usually shunned in the embedded space, and sometimes even in regular desktop/server programming. What makes them slow? When one is thrown, the runtime has to search up the call stack for a matching handler.

I guess I don’t understand this line of thinking. For one, exceptions are meant for exceptional situations: things you don’t expect to happen under normal operation. Code that uses exceptions will run just as fast as (or maybe even faster than) code without them, right up until you throw one. These exceptional situations are truly rare, so I usually don’t care if they do happen to run slower.

A compiler can actually use exceptions to optimize your code. Consider this inefficient (but typical) pseudo-C:

int dosomething(void) {
   /* do something A */
   if(err) return -1;

   /* do something B */
   if(err) {
      /* cleanup previous work A */
      return -1;
   }

   /* do something C */
   if(err) {
      /* cleanup previous work B */
      /* cleanup previous work A */
      return -1;
   }

   return 0;
}

Or even this more efficient (yes boys and girls, goto actually has a good use case in C, get over it) pseudo-C:

int dosomething(void) {
   /* do something A */
   if(err) return -1;

   /* do something B */
   if(err) goto err1;

   /* do something C */
   if(err) goto err2;

   return 0;

   err2:
   /* cleanup previous work B */

   err1:
   /* cleanup previous work A */

   return -1;
}

Why are these bad? Cache locality. In the first example, you have error handling code inline with your regular code. In the second it’s slightly better, pushed off to the end of the function. Ideally the code you actually run should be packed into as few cache lines as possible, and error handling this way wastes significant space on cleanup code that in the large majority of cases won’t be run.

But with exceptions, the compiler is free to take all the cleanup code in your entire app and shove it into a single separate area of code. All the normal code you expect to run can be compact and closer together. Of course, this will make exceptions run slower. If your code is heavy on throwing exceptions (which would probably be an abuse) it will probably cause a significant overall slowdown. But if they are used correctly, for exceptional situations, then the common case gets improved cache usage and therefore faster code.
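
For comparison, here is a hedged sketch of how the same function might look in C++ with exceptions and RAII; the step guard, its printf calls, and the commit() idea are made-up stand-ins for “do something A/B/C”. The success path has no error checks and no inline cleanup, and all the undo work sits on the unwind path, which the compiler can emit out of line:

#include <cstdio>
#include <stdexcept>

/* Hypothetical RAII guard: the constructor does the work (throwing on
   failure) and the destructor undoes it unless commit() was called. */
struct step {
   explicit step(const char *name) : name(name), committed(false) {
      std::printf("doing %s\n", name);
      /* on failure: throw std::runtime_error(name); */
   }
   ~step() {
      if(!committed) std::printf("cleaning up %s\n", name);   /* runs only on unwind */
   }
   void commit() { committed = true; }
   const char *name;
   bool committed;
};

void dosomething()
{
   step a("A");   /* nothing to clean up if this throws */
   step b("B");   /* if this throws, a's destructor cleans up A */
   step c("C");   /* if this throws, b then a clean up */

   /* success: keep the work instead of undoing it */
   c.commit(); b.commit(); a.commit();
}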

Scalability isn’t everything

In the beginning, you write threaded apps in blissful ignorance of scalability. That’s usually okay; most apps don’t need it, but sooner or later you will come across a problem that demands it. With enough searching, you will come across lock-free algorithms. Tricky to get right, but promising fantastic scalability if you do.

Even trickier, though, is knowing when not to use them. Lock-free algorithms come with a price: although they are indeed very scalable, their performance can be much worse than that of a well-designed single-threaded algorithm. Do a little benchmarking and you might find something surprising: the performance hit can be so large that a simple locked single-threaded algorithm with no scalability at all gives better overall performance than a 100% scalable lock-free version.

This is more common than you might think. Take a queue. A single-threaded version will typically have very little memory overhead: maybe a pointer for every n objects. A lock-free version needs two pointers for every object (or one, if you use a GC). How much that overhead matters depends greatly on what your object is. If your object is large, a lock-free queue will probably be the better choice. But if your object is small, say one or two pointers, the overhead can be enough that cache misses will significantly affect your application.
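
To make that overhead concrete, here is a rough illustration (assuming 64-bit pointers; the layouts are typical examples, not any particular queue implementation):

/* Node-based lock-free queue: every element gets its own heap node with a
   next pointer plus the value (or a pointer to it), so a payload of one or
   two pointers can be dwarfed by overhead, and the nodes end up scattered
   across cache lines. */
struct lockfree_node {
   lockfree_node *next;    /* 8 bytes of per-element overhead */
   void          *value;   /* 8-byte payload */
};

/* Locked / single-threaded queue: elements packed into blocks, with one
   next pointer shared by all 64 elements in the block and the payloads
   laid out contiguously, which is much friendlier to the cache. */
struct block {
   struct block *next;         /* 8 bytes shared by 64 elements */
   void         *values[64];   /* contiguous payloads */
};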

I recently had to tackle this problem. My application needed a queue of small objects, and on a modern quad-core CPU the cache misses were hurting performance so much that, although a lock-free queue did have near 100% scalability, the overall operation completed 165% faster with a locked queue that had zero scalability.

The next best thing is to combine the best of both worlds: design a queue with low overhead and medium scalability. Using a reader-writer lock combined with lock-free operations, I came up with a queue that only needs to take a full lock once every 32 or 64 operations. The result? Scalability 5% lower than a lock-free queue, with overall performance 210% better.
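
I won’t reproduce that queue here, but a much simpler way to get the same “full lock only once per batch” effect is to batch operations around a plain mutex. The sketch below does that; all the names are made up, and it’s a cruder scheme than the reader-writer-lock design described above:

#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

template<typename T, std::size_t Batch = 32>
class batched_queue {
public:
   /* Producer side: stage items in a caller-owned local buffer and take
      the lock only when a full batch has accumulated. */
   void push(std::vector<T> &local, T value) {
      local.push_back(std::move(value));
      if(local.size() >= Batch) flush(local);
   }

   void flush(std::vector<T> &local) {
      std::lock_guard<std::mutex> guard(lock);
      for(T &v : local) items.push_back(std::move(v));
      local.clear();
   }

   /* Consumer side: drain up to a whole batch per lock acquisition. */
   std::size_t pop_batch(std::vector<T> &out) {
      std::lock_guard<std::mutex> guard(lock);
      std::size_t n = 0;
      for(; !items.empty() && n < Batch; ++n) {
         out.push_back(std::move(items.front()));
         items.pop_front();
      }
      return n;
   }

private:
   std::mutex lock;
   std::deque<T> items;
};

The queue described above goes further, replacing the mutex with a reader-writer lock and doing the per-item work lock-free, but the principle is the same: pay for a full lock only once every 32 or 64 operations.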

OK, I’ll admit it: I cheated, somewhat. Lock-free algorithms are good for more than just scalability. They also offer immunity to nasty effects like deadlock, livelock, and priority inversion. In my case I wasn’t in a situation to worry about these, but you might be. The lesson here is to know your situation and decide carefully, and don’t trust what others tell you: always try things yourself and profile.