High Performance I/O on Windows

I/O com­ple­tion ports were first in­tro­duced in Win­dows NT 4.0 as a highly scal­able, multi-proces­sor ca­pa­ble al­ter­na­tive to the then-typ­i­cal prac­tices of using se­lect, WSAWait­For­Mul­ti­pleEvents, WSAA­sync­S­e­lect, or even run­ning a sin­gle thread per client. The most op­ti­mal way to per­form I/O on Win­dows—short of writ­ing a ker­nel-mode dri­ver—is to use I/O com­ple­tion ports.

A re­cent post on Slash­dot claimed sock­ets have run their course, which I think is ab­solutely not true! The au­thor seems to be­lieve the Berke­ley sock­ets API is the only way to per­form socket I/O, which is non­sense. Much more mod­ern, scal­able and high per­for­mance APIs exist today such as I/O com­ple­tion ports on Win­dows, epoll on Linux, and kqueue on FreeBSD. In light of this I thought I’d write a lit­tle about com­ple­tion ports here.

The old ways of mul­ti­plex­ing I/O still work pretty well for sce­nar­ios with a low num­ber of con­cur­rent con­nec­tions, but when writ­ing a server dae­mon to han­dle a thou­sand or even tens of thou­sands of con­cur­rent clients at once, we need to use some­thing dif­fer­ent. In this sense the old de facto stan­dard Berke­ley sock­ets API has run its course, be­cause the over­head of man­ag­ing so many con­nec­tions is sim­ply too great and makes using mul­ti­ple proces­sors hard.

An I/O com­ple­tion port is a multi-proces­sor aware queue. You cre­ate a com­ple­tion port, bind file or socket han­dles to it, and start asyn­chro­nous I/O op­er­a­tions. When they com­plete—ei­ther suc­cess­fully or with an error—a com­ple­tion packet is queued up on the com­ple­tion port, which the ap­pli­ca­tion can de­queue from mul­ti­ple threads. The com­ple­tion port uses some spe­cial voodoo to make sure only a spe­cific num­ber of threads can run at once—if one thread blocks in ker­nel-mode, it will au­to­mat­i­cally start up an­other one.

First you need to cre­ate a com­ple­tion port with Cre­ateIo­Com­ple­tion­Port:

HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE,
   NULL, 0, 0);

It’s im­por­tant to note that Num­berOf­Con­cur­rent­Threads is not the total num­ber of threads that can ac­cess the com­ple­tion port—you can have as many as you want. This in­stead con­trols the num­ber of threads it will allow to run con­cur­rently. Once you’ve reached this num­ber, it will block all other threads. If one of the run­ning threads blocks for any rea­son in ker­nel-mode, the com­ple­tion port will au­to­mat­i­cally start up one of the wait­ing threads. Spec­i­fy­ing 0 for this is equiv­a­lent to the num­ber of log­i­cal proces­sors in the sys­tem.

Next is cre­at­ing and as­so­ci­at­ing a file or socket han­dle, using Cre­ate­File, WSASocket, and Cre­ateIo­Com­ple­tion­Port:

#define OPERATION_KEY 1

HANDLE file = CreateFile(L"file.txt", GENERIC_READ,
   FILE_SHARE_READ, NULL, OPEN_EXISTING,
   FILE_FLAG_OVERLAPPED, NULL);

SOCKET sock = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP,
   NULL, 0, WSA_FLAG_OVERLAPPED);

CreateIoCompletionPort(file, iocp, OPERATION_KEY, 0);
CreateIoCompletionPort((HANDLE)sock, iocp, OPERATION_KEY, 0);

Files and sock­ets must be opened with the FILE_FLAG_OVERLAPPED and WSA_FLAG_OVERLAPPED flags be­fore they are used asyn­chro­nously.

Reusing CreateIoCompletionPort for as­so­ci­at­ing file/socket han­dles is weird de­sign choice from Mi­crosoft but that’s how it’s done. The CompletionKey pa­ra­me­ter can be any­thing you want, it is a value pro­vided when pack­ets are de­queued. I de­fine a OPERATION_KEY here to use as the CompletionKey, the sig­nif­i­cance of which I’ll get to later.

Next we have to start up some I/O op­er­a­tions. I’ll skip set­ting up the socket and go right to send­ing data. We start these op­er­a­tions using Read­File and WSASend:

OVERLAPPED readop, sendop;
WSABUF sendwbuf;
char readbuf[256], sendbuf[256];

memset(&readop, 0, sizeof(readop));
memset(&sendop, 0, sizeof(sendop));

sendwbuf.len = sizeof(sendbuf);
sendwbuf.buf = sendbuf;

BOOL readstatus = ReadFile(file, readbuf,
   sizeof(readbuf), NULL, &readop);

if(!readstatus)
{
  DWORD readerr = GetLastError();

  if(readerr != ERROR_IO_PENDING)
  {
    // error reading.
  }
}

int writestatus = WSASend(sock, &sendwbuf, 1, NULL,
   0, &sendop, NULL);

if(writestatus)
{
  int writeerr = WSAGetLastError();

  if(writeerr != WSA_IO_PENDING)
  {
    // error sending.
  }
}

Every I/O op­er­a­tion using a com­ple­tion port has an OVERLAPPED struc­ture as­so­ci­ated with it. Win­dows uses this in­ter­nally in un­spec­i­fied ways, only say­ing we need to zero them out be­fore start­ing an op­er­a­tion. The OVERLAPPED struc­ture will be given back to us when we de­queue the com­ple­tion pack­ets, and can be used to pass along our own con­text data.

I have left out the stan­dard error check­ing up until now for brevity’s sake, but this one doesn’t work quite like one might ex­pect so it is im­por­tant here. When start­ing an I/O op­er­a­tion, an error might not re­ally be an error. If the func­tion suc­ceeds all is well, but if the func­tion fails, it is im­por­tant to check the error code with Get­LastEr­ror or WSAGet­LastEr­ror.

If these func­tions re­turn ERROR_IO_PENDING or WSA_IO_PENDING, the func­tion ac­tu­ally still com­pleted suc­cess­fully. All these error codes mean is an asyn­chro­nous op­er­a­tion has been started and com­ple­tion is pend­ing, as op­posed to com­plet­ing im­me­di­ately. A com­ple­tion packet will be queued up re­gard­less of the op­er­a­tion com­plet­ing asyn­chro­nously or not.

De­queu­ing pack­ets from a com­ple­tion port is han­dled by the GetQueuedCompletionStatus func­tion:

This func­tion de­queues com­ple­tion pack­ets, con­sist­ing of the com­ple­tion key we spec­i­fied in CreateIoCompletionPort and the OVERLAPPED struc­ture we gave while start­ing the I/O. If the I/O trans­ferred any data, it will re­trieve how many bytes were trans­ferred too. Again the error han­dling is a bit weird on this one, hav­ing three error states.

This can be run from as many threads as you like, even a sin­gle one. It is com­mon prac­tice to run a pool of twice the num­ber of threads as there are log­i­cal proces­sors avail­able, to keep the CPU ac­tive with the afore­men­tioned func­tion­al­ity of start­ing a new thread if a run­ning one blocks.

Un­less you are going for a sin­gle-threaded de­sign, I rec­om­mend start­ing two threads per log­i­cal CPU. Even if your app is de­signed to be 100% asyn­chro­nous, you will still run into block­ing when lock­ing shared data and even get un­avoid­able hid­den block­ing I/Os like read­ing in paged out mem­ory. Keep­ing two threads per log­i­cal CPU will keep the proces­sor busy with­out over­load­ing the OS with too much con­text switch­ing.

This is all well and good, but two I/O op­er­a­tions were ini­ti­ated—a file read and a socket send. We need a way to tell their com­ple­tion pack­ets apart. This is why we need to at­tach some state to the OVERLAPPED struc­ture when we call those func­tions:

struct my_context
{
  OVERLAPPED ovl;
  int isread;
};

struct my_context readop, sendop;

memset(&readop.ovl, 0, sizeof(readop.ovl));
memset(&sendop.ovl, 0, sizeof(sendop.ovl));

readop.isread = 1;
sendop.isread = 0;

ReadFile(file, readbuf, sizeof(readbuf), NULL, &readop.ovl);
WSASend(sock, &sendwbuf, 1, NULL, 0, &sendop.ovl, NULL);

Now we can tell the op­er­a­tions apart when we de­queue them:

OVERLAPPED *ovl;
ULONG_PTR completionkey;
DWORD transferred;

GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

struct my_context *ctx = (struct my_context*)ovl;

if(ctx->isread)
{
  // read completed.
}
else
{
  // send completed.
}

The last im­por­tant thing to know is how to queue up your own com­ple­tion pack­ets. This is use­ful if you want to split an op­er­a­tion up to be run on the thread pool, or if you want to exit a thread wait­ing on a call to GetQueuedCompletionStatus. To do this, we use the PostQueuedCompletionStatus func­tion:

#define EXIT_KEY 0

struct my_context ctx;

PostQueuedCompletionStatus(iocp, 0, OPERATION_KEY, &ctx.ovl);
PostQueuedCompletionStatus(iocp, 0, EXIT_KEY, NULL);

Here we post two things onto the queue. The first, we post our own struc­ture onto the queue, to be processed by our thread pool. The sec­ond, we give a new com­ple­tion key: EXIT_KEY. The thread which processes this packet can test if the com­ple­tion key is EXIT_KEY to know when it needs to stop de­queu­ing pack­ets and shut down.

Other than the com­ple­tion port han­dle, Win­dows does not use any of the pa­ra­me­ters given to PostQueuedCompletionStatus. They are en­tirely for our use, to be de­queued with GetQueuedCompletionStatus.

That’s all I have to write for now, and should be every­thing one would need to get started learn­ing these high per­for­mance APIs! I will make an­other post shortly de­tail­ing some good pat­terns for com­ple­tion port usage, and some op­ti­miza­tion tips to en­sure ef­fi­cient usage of these I/O APIs.

Up­date: this sub­ject con­tin­ued in I/O com­ple­tion ports made easy.

Related Posts