I/O completion ports made easy

I described the basics of I/O completion ports in my last post, but there is still the question of the easiest way to use them. Here I’ll show a callback-based application design that I’ve found makes writing a fully asynchronous program really simple.

I touched briefly on attaching our own context data to the OVERLAPPED structure we pass along with I/O operations. It’s this same idea that I’ll expand on here. This time we define a generic structure to use with all our operations, along with how our threads will handle them while dequeuing packets:

struct io_context
{
  OVERLAPPED ovl;
  void (*on_completion)(DWORD error, DWORD transferred,
                        struct io_context *ctx);
};

OVERLAPPED *ovl;
ULONG_PTR completionkey;
DWORD transferred;

BOOL ret = GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

if(ret)
{
  // the I/O succeeded; run its callback.
  struct io_context *ctx = (struct io_context*)ovl;
  ctx->on_completion(ERROR_SUCCESS, transferred, ctx);
}
else if(ovl)
{
  // a packet was dequeued, but its I/O failed.
  DWORD err = GetLastError();

  struct io_context *ctx = (struct io_context*)ovl;
  ctx->on_completion(err, transferred, ctx);
}
else
{
  // the dequeue itself failed; error out.
}

With this, all our I/O operations will have a callback associated with them. When a completion packet is dequeued, it gets the error information, if any, and runs the callback. Having every I/O operation use a single callback mechanism greatly simplifies the design of the entire program.

Let’s say our app is reading a file and sending out its contents. We also want it to prefetch the next buffer, so we can start sending right away. Here’s our connection context:

struct connection_context
{
  HANDLE file;
  SOCKET sock;

  // overlapped reads don't advance any file pointer,
  // so we track our read offset into the file ourselves.
  LARGE_INTEGER fileoffset;

  WSABUF readbuf;
  WSABUF sendbuf;

  struct io_context readctx;
  struct io_context sendctx;
};

A structure like this is nice because initiating an I/O operation requires no allocations. Note that we need two io_context members because we’re doing a read and a send concurrently.

Now the code to use it:

#define BUFFER_SIZE 4096

void begin_read(struct connection_context *ctx)
{
  if(ctx->readbuf.buf)
  {
    // only begin a read if one isn't already running.
    return;
  }

  ctx->readbuf.buf = malloc(BUFFER_SIZE);
  ctx->readbuf.len = 0;

  // zero out io_context structure.
  memset(&ctx->readctx, 0, sizeof(ctx->readctx));

  // pass the offset to read from in the OVERLAPPED;
  // overlapped file handles have no implicit position.
  ctx->readctx.ovl.Offset = ctx->fileoffset.LowPart;
  ctx->readctx.ovl.OffsetHigh = ctx->fileoffset.HighPart;

  // set completion callback.
  ctx->readctx.on_completion = read_finished;

  ReadFile(ctx->file, ctx->readbuf.buf, BUFFER_SIZE,
           NULL, &ctx->readctx.ovl);
}

void read_finished(DWORD error, DWORD transferred,
                   struct io_context *ioctx)
{
  // get our connection context.
  struct connection_context *ctx =
     (struct connection_context*)((char*)ioctx -
        offsetof(struct connection_context, readctx));

  if(error != ERROR_SUCCESS)
  {
    // handle error.
    return;
  }

  if(!transferred)
  {
    // reached end of file, close out connection.
    free(ctx->readbuf.buf);
    ctx->readbuf.buf = NULL;
    return;
  }

  // advance the file offset, then send out however
  // much we read from the file.

  ctx->fileoffset.QuadPart += transferred;
  ctx->readbuf.len = transferred;

  begin_send(ctx);
}

This gives us a very obvious chain of events: read_finished is called when a read completes. Since we only get an io_context structure in our callback, we need to adjust the pointer to get our full connection_context.
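
Windows even provides a CONTAINING_RECORD macro in winnt.h that performs this same pointer adjustment:

struct connection_context *ctx =
   CONTAINING_RECORD(ioctx, struct connection_context, readctx);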

Sending is easy too:

void begin_send(struct connection_context *ctx)
{
  if(ctx->sendbuf.buf)
  {
    // only begin a send if one isn't already running.
    return;
  }

  if(!ctx->readbuf.len)
  {
    // only begin a send if the read buffer has something.
    return;
  }

  // switch buffers.
  ctx->sendbuf = ctx->readbuf;

  // clear read buffer.
  ctx->readbuf.buf = NULL;
  ctx->readbuf.len = 0;

  // zero out io_context structure.
  memset(&ctx->sendctx, 0, sizeof(ctx->sendctx));

  // set completion callback.
  ctx->sendctx.on_completion = send_finished;

  WSASend(ctx->sock, &ctx->sendbuf, 1, NULL, 0,
          &ctx->sendctx.ovl, NULL);

  // start reading next buffer.
  begin_read(ctx);
}

void send_finished(DWORD error, DWORD transferred,
                   struct io_context *ioctx)
{
  // get our connection context.
  struct connection_context *ctx =
     (struct connection_context*)((char*)ioctx -
        offsetof(struct connection_context, sendctx));

  if(error != ERROR_SUCCESS)
  {
    // handle error.
    return;
  }

  // success, clear send buffer and start next send.

  free(ctx->sendbuf.buf);
  ctx->sendbuf.buf = NULL;

  begin_send(ctx);
}

Pretty much more of the same. Again, for brevity I’m leaving out some error checking code and assuming the buffer gets sent out in full. I’m also assuming a single-threaded design—the socket and file functions are themselves thread-safe, but the buffer management code here would need some extra locking, since it could be run concurrently. But the idea should be clear.
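
As a minimal sketch of that locking, assuming a CRITICAL_SECTION member named lock were added to connection_context and initialized at connection setup, each entry point that touches the buffers would be wrapped like this:

void begin_send_locked(struct connection_context *ctx)
{
  // serialize all buffer management for this connection.
  EnterCriticalSection(&ctx->lock);
  begin_send(ctx);
  LeaveCriticalSection(&ctx->lock);
}

The completion callbacks would take the same lock before touching the buffers.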

Update: this subject is continued in Tips for efficient I/O.

Visual Studio 2010 Beta 1 in four days

Visual Studio 2010 Beta 1 is being launched for MSDN subscribers in four days, on the 18th, and for the general public on the 20th!

High Performance I/O on Windows

I/O completion ports were first introduced in Windows NT 4.0 as a highly scalable, multi-processor capable alternative to the then-typical practices of using select, WSAWaitForMultipleEvents, WSAAsyncSelect, or even running a single thread per client. The optimal way to perform I/O on Windows—short of writing a kernel-mode driver—is to use I/O completion ports.

A recent post on Slashdot claimed sockets have run their course, which I think is absolutely not true! The author seems to believe the Berkeley sockets API is the only way to perform socket I/O, which is nonsense. Much more modern, scalable, and high performance APIs exist today, such as I/O completion ports on Windows, epoll on Linux, and kqueue on FreeBSD. In light of this I thought I’d write a little about completion ports here.

The old ways of multiplexing I/O still work pretty well for scenarios with a low number of concurrent connections, but when writing a server daemon to handle a thousand or even tens of thousands of concurrent clients at once, we need to use something different. In this sense the old de facto standard Berkeley sockets API has run its course, because the overhead of managing so many connections is simply too great and makes using multiple processors hard.

An I/O completion port is a multi-processor aware queue. You create a completion port, bind file or socket handles to it, and start asynchronous I/O operations. When they complete—either successfully or with an error—a completion packet is queued up on the completion port, which the application can dequeue from multiple threads. The completion port uses some special voodoo to make sure only a specific number of threads can run at once—if a running thread blocks in kernel-mode, it will automatically start up another one.

First you need to create a completion port with CreateIoCompletionPort:

HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE,
   NULL, 0, 0);

It’s important to note that NumberOfConcurrentThreads is not the total number of threads that can access the completion port—you can have as many as you want. It instead controls the number of threads it will allow to run concurrently. Once you’ve reached this number, it will block all other threads. If one of the running threads blocks for any reason in kernel-mode, the completion port will automatically start up one of the waiting threads. Specifying 0 for this is equivalent to the number of logical processors in the system.

Next is creating and associating a file or socket handle, using CreateFile, WSASocket, and CreateIoCompletionPort:

#define OPERATION_KEY 1

HANDLE file = CreateFile(L"file.txt", GENERIC_READ,
   FILE_SHARE_READ, NULL, OPEN_EXISTING,
   FILE_FLAG_OVERLAPPED, NULL);

SOCKET sock = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP,
   NULL, 0, WSA_FLAG_OVERLAPPED);

CreateIoCompletionPort(file, iocp, OPERATION_KEY, 0);
CreateIoCompletionPort((HANDLE)sock, iocp, OPERATION_KEY, 0);

Files and sockets must be opened with the FILE_FLAG_OVERLAPPED and WSA_FLAG_OVERLAPPED flags before they can be used asynchronously.

Reusing CreateIoCompletionPort to associate file and socket handles is a weird design choice from Microsoft, but that’s how it’s done. The CompletionKey parameter can be anything you want; it is a value provided back to us when packets are dequeued. I define an OPERATION_KEY here to use as the CompletionKey, the significance of which I’ll get to later.

Next we have to start up some I/O operations. I’ll skip setting up the socket and go right to sending data. We start these operations using ReadFile and WSASend:

OVERLAPPED readop, sendop;
WSABUF sendwbuf;
char readbuf[256], sendbuf[256];

memset(&readop, 0, sizeof(readop));
memset(&sendop, 0, sizeof(sendop));

sendwbuf.len = sizeof(sendbuf);
sendwbuf.buf = sendbuf;

BOOL readstatus = ReadFile(file, readbuf,
   sizeof(readbuf), NULL, &readop);

if(!readstatus)
{
  DWORD readerr = GetLastError();

  if(readerr != ERROR_IO_PENDING)
  {
    // error reading.
  }
}

int writestatus = WSASend(sock, &sendwbuf, 1, NULL,
   0, &sendop, NULL);

if(writestatus)
{
  int writeerr = WSAGetLastError();

  if(writeerr != WSA_IO_PENDING)
  {
    // error sending.
  }
}

Every I/O operation using a completion port has an OVERLAPPED structure associated with it. Windows uses this internally in unspecified ways, only saying we need to zero them out before starting an operation. The OVERLAPPED structure will be given back to us when we dequeue the completion packets, and can be used to pass along our own context data.

I have left out the standard error checking up until now for brevity’s sake, but it doesn’t work quite like one might expect, so it is important here. When starting an I/O operation, an error might not really be an error. If the function succeeds, all is well, but if it fails, it is important to check the error code with GetLastError or WSAGetLastError.

If these functions return ERROR_IO_PENDING or WSA_IO_PENDING, the call has actually still succeeded. All these error codes mean is that an asynchronous operation has been started and its completion is pending, as opposed to completing immediately. A completion packet will be queued up regardless of whether the operation completed asynchronously or not.

Dequeuing packets from a completion port is handled by the GetQueuedCompletionStatus function:
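
OVERLAPPED *ovl;
ULONG_PTR completionkey;
DWORD transferred;

BOOL ret = GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);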

This function dequeues completion packets, consisting of the completion key we specified in CreateIoCompletionPort and the OVERLAPPED structure we gave while starting the I/O. If the I/O transferred any data, it will retrieve how many bytes were transferred, too. Again the error handling is a bit weird on this one, having three error states.

This can be run from as many threads as you like, even a single one. It is common practice to run a pool of twice the number of threads as there are logical processors available, to keep the CPU active with the aforementioned functionality of starting a new thread if a running one blocks.

Unless you are going for a single-threaded design, I recommend starting two threads per logical CPU. Even if your app is designed to be 100% asynchronous, you will still run into blocking when locking shared data, and there are even unavoidable hidden blocking I/Os, like reading in paged-out memory. Keeping two threads per logical CPU will keep the processor busy without overloading the OS with too much context switching.
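
As a minimal sketch, assuming a worker_main thread function (a name I’m making up here) that runs a GetQueuedCompletionStatus loop like the one above, starting such a pool might look like:

SYSTEM_INFO si;
GetSystemInfo(&si);

// two threads per logical processor.
DWORD threadcount = si.dwNumberOfProcessors * 2;

for(DWORD i = 0; i < threadcount; i++)
{
  // worker_main must match LPTHREAD_START_ROUTINE:
  // DWORD WINAPI worker_main(void *param);
  HANDLE thread = CreateThread(NULL, 0, worker_main,
     iocp, 0, NULL);
  CloseHandle(thread);
}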

This is all well and good, but two I/O operations were initiated—a file read and a socket send. We need a way to tell their completion packets apart. This is why we need to attach some state to the OVERLAPPED structure when we call those functions:

struct my_context
{
  OVERLAPPED ovl;
  int isread;
};

struct my_context readop, sendop;

memset(&readop.ovl, 0, sizeof(readop.ovl));
memset(&sendop.ovl, 0, sizeof(sendop.ovl));

readop.isread = 1;
sendop.isread = 0;

ReadFile(file, readbuf, sizeof(readbuf), NULL, &readop.ovl);
WSASend(sock, &sendwbuf, 1, NULL, 0, &sendop.ovl, NULL);

Now we can tell the operations apart when we dequeue them:

OVERLAPPED *ovl;
ULONG_PTR completionkey;
DWORD transferred;

GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

struct my_context *ctx = (struct my_context*)ovl;

if(ctx->isread)
{
  // read completed.
}
else
{
  // send completed.
}

The last important thing to know is how to queue up your own completion packets. This is useful if you want to split an operation up to be run on the thread pool, or if you want to exit a thread waiting on a call to GetQueuedCompletionStatus. To do this, we use the PostQueuedCompletionStatus function:

#define EXIT_KEY 0

struct my_context ctx;

PostQueuedCompletionStatus(iocp, 0, OPERATION_KEY, &ctx.ovl);
PostQueuedCompletionStatus(iocp, 0, EXIT_KEY, NULL);

Here we post two things onto the queue. First, we post our own structure, to be processed by our thread pool. Second, we give a new completion key: EXIT_KEY. The thread that processes this packet can test if the completion key is EXIT_KEY to know when it needs to stop dequeuing packets and shut down.
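
Inside a worker thread’s dequeuing loop, the check might look like this (a sketch):

GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

if(completionkey == EXIT_KEY)
{
  // stop dequeuing; let the thread function return.
  return 0;
}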

Other than the completion port handle, Windows does not use any of the parameters given to PostQueuedCompletionStatus. They are entirely for our own use, to be dequeued with GetQueuedCompletionStatus.

That’s all I have to write for now, and it should be everything one needs to get started learning these high performance APIs! I will make another post shortly detailing some good patterns for completion port usage, and some optimization tips to ensure efficient usage of these I/O APIs.

Update: this subject is continued in I/O completion ports made easy.

User Mode Scheduling in Windows 7

Don’t use threads. Or more precisely, don’t over-use them. It’s one of the first things fledgling programmers learn after they start using threads, because threading involves a lot of overhead. In short, using more threads may improve concurrency, but it will give you less overall throughput, as more processing is put into simply managing the threads instead of letting them run. So programmers learn to use threads sparingly.

When normal threads run out of time, or block on something like a mutex or I/O, they hand off control to the operating system kernel. The kernel then finds a new thread to run, and switches back to user-mode to run it. This context switching is what User Mode Scheduling looks to alleviate.

User Mode Scheduling can be thought of as a cross between threads and thread pools. An application creates one or more UMS scheduler threads—typically one for each processor. It then creates several UMS worker threads for each scheduler thread. The worker threads are the ones that run your actual code. Whenever a worker thread runs out of time, it is put on the end of its scheduler thread’s queue. If a worker thread blocks, it is put on a waiting list to be re-queued by the kernel when whatever it was waiting on finishes. The scheduler thread then takes the worker thread from the top of the queue and starts running it. As the name suggests, this happens entirely in user-mode, avoiding the expensive user->kernel->user-mode transitions. Letting each thread run for exactly as long as it needs helps to solve the throughput problem: work is only put into managing threads when absolutely necessary, instead of in ever smaller time slices, leaving more time to run your actual code.
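
For a taste of the API, here is a rough sketch of turning a thread into a scheduler with the Windows 7 UMS functions; scheduler_entry stands in for your scheduler callback, and error checking is omitted:

PUMS_COMPLETION_LIST completionlist;
CreateUmsCompletionList(&completionlist);

UMS_SCHEDULER_STARTUP_INFO startupinfo = {0};
startupinfo.UmsVersion = UMS_VERSION;
startupinfo.CompletionList = completionlist;
startupinfo.SchedulerProc = scheduler_entry;
startupinfo.SchedulerParam = NULL;

// the calling thread becomes a UMS scheduler thread and
// starts running the scheduler callback.
EnterUmsSchedulingMode(&startupinfo);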

A good side effect of this is that UMS threads also help to alleviate the cache thrashing problems typical in heavily-threaded applications. Setting aside your data sharing patterns, each thread still needs its own storage for stack space, processor context, and thread-local storage. Every time a context switch happens, some data may need to be pushed out of caches in order to load some kernel-mode code and the next thread’s data. By switching between threads less often, cache can be put to better use for the task at hand.

If you have ever had a chance to use some of the more esoteric APIs included with Windows, you might be wondering why we need UMS threads when we have fibers, which offer similar co-operative multitasking. Fibers have a lot of special exceptions—there are things that aren’t safe to do with them. Libraries that rely on thread-local storage, for instance, will likely walk all over themselves if used from within fibers. A UMS thread, on the other hand, is a full-fledged thread—it supports TLS and has no real special cases to keep in mind while using it.

I still wouldn’t count out thread pools just yet. UMS threads are still more expensive than a thread pool, and the large memory requirements of a thread still apply here, so things like per-client threads in internet daemons are still out of the question if you want to be massively scalable. More likely, UMS threads will be most useful for building thread pools. Most thread pools launch two or three threads per CPU to help stay busy when given blocking tasks, and UMS threads will at least help keep their time slice usage optimal.

From what I understand, the team behind Microsoft’s Concurrency Runtime, to be included with Visual C++ 2010, was one of the primary forces behind UMS threads. They worked very closely with the kernel folks to find the most scalable way to enable the super-parallel code that will be possible with the CR.

Optimizing IP range searching in PeerGuardian

I was working on something completely different last night when an elegant idea came to mind on how to significantly speed up PeerGuardian’s IP searching. It’s funny how an idea can just pop into the mind about a problem that hasn’t been thought of in a long time.

Right now PeerGuardian uses a binary search to match IPs. This is already pretty efficient, running in ⌈log₂ N⌉ time—so for 100,000 IP ranges, about 17 tests need to be done. It has the additional advantage of having no memory overhead.
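
Roughly, that search looks like this (a sketch, assuming the list is kept sorted and the ranges don’t overlap):

#include <stddef.h>
#include <stdint.h>

struct ip_range { uint32_t start, end; };

const struct ip_range *find_range(const struct ip_range *ranges,
                                  size_t count, uint32_t ip)
{
  size_t lo = 0, hi = count;

  while(lo < hi)
  {
    size_t mid = lo + (hi - lo) / 2;

    if(ip < ranges[mid].start)
      hi = mid;
    else if(ip > ranges[mid].end)
      lo = mid + 1;
    else
      return &ranges[mid];
  }

  // ip is not in any range.
  return NULL;
}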

My idea is to use a structure similar to a B+tree, packing as many IPs and branch pointers into a cache line as possible. On today’s architectures a cache line is typically 64 bytes, so 8 IPs and 9 branch pointers would fit on each node, meaning only about ⌈log₉ N⌉ nodes need to be read to find a match. So to find a match in 100,000 IP ranges, only about 6 nodes would need to be read.

CPUs always read and cache data in blocks (a cache line), so an algorithm that keeps this in mind to minimize memory reads and maximize cache usage should be incredibly fast. Even though this introduces significant overhead for branch pointers (about 2x the storage would be required), it should be far more efficient overall.

But this algorithm improves another thing too: branching. I’m talking in terms of branch instructions, not the branch pointers mentioned above. The fewer branches code takes, the faster a superscalar or pipelined CPU will be able to run it. For this algorithm, an entire node could be processed (that is, comparing the IPs and determining which branch node to go into) with zero branches, using integer SSE2 (PCMPGTD, PMOVMSKB) and bit-scan forward (BSF).
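
Here is a sketch of that node test with MSVC intrinsics. The layout is one I’m inventing for illustration: each node holds 8 range-start IPs, sorted ascending and pre-biased by 0x80000000 so the signed PCMPGTD compare gives the unsigned ordering. (_BitScanForward64 requires an x64 build; 32-bit code would combine two _BitScanForward calls.)

#include <emmintrin.h>
#include <intrin.h>
#include <stdint.h>

// returns which of the node's 9 branches (0-8) to descend into.
static int node_branch(const __m128i keys[2], uint32_t ip)
{
  // bias the IP the same way the stored keys are biased.
  __m128i needle = _mm_set1_epi32((int)(ip ^ 0x80000000u));

  // PCMPGTD: lanes holding keys greater than ip become all ones.
  __m128i gt0 = _mm_cmpgt_epi32(keys[0], needle);
  __m128i gt1 = _mm_cmpgt_epi32(keys[1], needle);

  // PMOVMSKB packs the lane results into a bitmask; the sentinel
  // bit at position 32 yields branch 8 when no key is greater.
  uint64_t mask = (uint64_t)(unsigned)_mm_movemask_epi8(gt0) |
                  ((uint64_t)(unsigned)_mm_movemask_epi8(gt1) << 16) |
                  (1ull << 32);

  // BSF: find the first key greater than ip; each 32-bit
  // lane contributes 4 mask bits.
  unsigned long bit;
  _BitScanForward64(&bit, mask);
  return (int)(bit / 4);
}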

I can’t be sure how much of a speed difference this would make until I code it up, but I bet it would be at least 200% faster. I’ve been too busy to work on PeerGuardian for quite a while, so I don’t know if this will ever make it into PG. We’re looking for a new coder with more time on their hands.

Alex Filo has the right idea, UniformWrapPanel

WPF’s WrapPanel is missing a key feature: the ability to create vertical columns that still fill the maximum amount of width available. One less annoyance in WPF, thanks to Alex Filo’s UniformWrapPanel.

Qt 4.5 released, still using three year old GCC

Qt 4.5 is out, along with Qt Creator. It’s still using GCC 3.4.5, from a January 2006 codebase. Sigh.

Databinding TextBlocks with XAML

I’ve become frustrated lately with trying to databind a list of strings to a TextBlock. I.e., if I have:

string[] mytext = new string[] { "a", "b", "c" };

And I want the effective end result to be:

<TextBlock>a, b, c</TextBlock>

Or maybe I want something more complex, like hyperlinks:

<TextBlock>
   <Hyperlink>a</Hyperlink>,
   <Hyperlink>b</Hyperlink>,
   <Hyperlink>c</Hyperlink>
</TextBlock>

Basically what I’m looking for is an ItemsControl that works on TextBlocks (or Spans, etc.), with a separator template to insert those “, ” in between the items.

Your first instinct might be to use a StackPanel instead, but StackPanel wasn’t made for displaying text. It might fool you initially, but you’ll quickly notice it doesn’t behave anything like text: there’s no proper kerning, RTL, wrapping, or anything else the text classes were made to support.

I’m pretty surprised that WPF doesn’t have anything like this, as displaying lists as text seems like a common enough thing for any app. Unfortunately I’m still not well versed enough in WPF to create such a custom control myself, and haven’t had a whole lot of time to investigate it.

Using HSL colors in WPF

One thing that has always irritated me about WPF is that you’re still stuck specifying colors in RGB. HSL just feels so much more natural from a design standpoint. Well, we’re not completely out of luck—we’ve got markup extensions.

So I created my own HslColor and HslBrush extensions, which are fairly simple to use:

<Window xmlns:e="clr-namespace:WpfExtensions"
        Background="{e:HslBrush H=300,S=50,L=75,A=80}"/>

Hue is specified in degrees from 0 to 360, while Saturation, Lightness, and Alpha are from 0 to 100. The parameters are all doubles, and it converts to scRGB behind the scenes, which means you actually get much higher color precision than if you had just used the equivalent RGB hex. With Windows 7 having native support for scRGB, this will future-proof your application to make good use of upcoming monitors with Deep Color support.
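
For reference, the conversion behind the scenes follows the usual HSL-to-RGB math. Here is a sketch in C with the outputs as 0-1 floats (the helper names are mine, not the extension’s):

static double hue_to_channel(double p, double q, double t)
{
  if(t < 0.0) t += 1.0;
  if(t > 1.0) t -= 1.0;
  if(t < 1.0/6.0) return p + (q - p) * 6.0 * t;
  if(t < 1.0/2.0) return q;
  if(t < 2.0/3.0) return p + (q - p) * (2.0/3.0 - t) * 6.0;
  return p;
}

void hsl_to_rgb(double h, double s, double l,
                double *r, double *g, double *b)
{
  // h in degrees 0-360; s and l in 0-100, as above.
  h /= 360.0; s /= 100.0; l /= 100.0;

  double q = l < 0.5 ? l * (1.0 + s) : l + s - l * s;
  double p = 2.0 * l - q;

  *r = hue_to_channel(p, q, h + 1.0/3.0);
  *g = hue_to_channel(p, q, h);
  *b = hue_to_channel(p, q, h - 1.0/3.0);
}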

My Windows Vista/7/8 Wishlist

These are some changes I’ve been trying to get made since Vista entered beta. Now 7’s beta has begun, and chances still look bleak. Maybe I’ll have more luck in 8?