
The art of resampling

Years ago when I was a fledgling still learning to code, one of the first things I tried creating was an image resizer. I preferred (and given time, still do) to just have a go at things without research, so while I succeeded in making a resizer, the results were (predictably) poor. I soon got sidetracked and left the code to rot, but the idea remained.

It’s taken me over 10 years to find a reason to revisit the subject, but I finally have: gamma compression.

In HTML, the color #777 will have roughly half the intensity of #FFF. This matches our perception and makes working with color fairly easy, but the way light really works is much different. Our eyes perceive light on a logarithmic scale—twice the photons won’t appear twice as bright to us. Transforming the actual linear intensity into our familiar representation is called gamma compression.

When we blend two gamma-compressed colors, the result is not the same as if we blended two linear colors. To correctly blend colors, we must first uncompress them into their linear values. Take the gamma-compressed values 0.1 and 0.9. If we just add 0.1 to 0.9, we’ll of course get 1.0: an 11% change in value. Doing it the correct way, we first decompress them into the approximate linear values 0.01 and 0.79. Add 0.01 to 0.79, re-compress, and the result will be 0.905: a 0.5% change in value. Gamma-ignorant processing gave us a way-off result!
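
To make the arithmetic concrete, here’s a minimal sketch in C, assuming a simple 2.2 power-law gamma (real sRGB uses a slightly different piecewise curve); the function names are my own:

#include <math.h>

// decompress a gamma-compressed value into linear light.
double to_linear(double v) { return pow(v, 2.2); }

// re-compress a linear value into gamma space.
double to_gamma(double v) { return pow(v, 1.0 / 2.2); }

// add two gamma-compressed values the correct way: decompress,
// add in linear light, re-compress. blend_add(0.1, 0.9) comes
// out near 0.9, not the 1.0 a gamma-ignorant add would give.
double blend_add(double a, double b)
{
  return to_gamma(to_linear(a) + to_linear(b));
}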

As an example, let’s take a look at NASA’s “Earth’s City Lights”:

NASA’s “Earth’s City Lights”

Downsizing this 4800×2400 image presents a worst-case scenario for a gamma-ignorant resizer. The sharp contrast between the lights and surrounding darkness makes the blending error very prominent, and the massive downsizing gives it a chance to do a lot of blending.

Gamma-corrected “Earth’s City Lights”

At the top we see the result of gamma-correct resizing. This is how it’s supposed to look—you can still see the lights along the African and Australian coasts. The western USA is clearly still very awake, as are Europe and parts of Asia.

Gamma-ignorant “Earth’s City Lights”

On the bottom we see the result of gamma-ignorant resizing. The fainter lights have been completely drowned out. Australia and Africa now barely register, and all the other continents look far darker overall. Big difference! The unfortunate thing is that the majority of resizers will give you the image on the bottom. The incorrect result is often good enough that their authors either don’t notice or don’t care.

One of these incorrect resizers is in Avisynth, a scripted video processor used for everything from simple DVD deinterlacing all the way to heavy restoration of old 8mm film. A while ago I was using it to resize a Quake video, and discovered that, similar to the lights of the image above, all the starry teleporters lost their stars: a clear sign of gamma-ignorant processing.

I decided to make a high-quality resizing plugin for Avisynth that would use a fully linear, high bit depth pipeline. No shortcuts. Working on it has been a lot of fun and a good challenge, so I’ll be writing about it here over the next few days.

Justifying the Nook: A case for PDF

I’ve had my Nook for about two months now, and have read about 8 books on it so far. It’s a great little device that I’ve been unable to put down in my spare time. My only real gripe is that, like almost every other e-reader out there, it has such poor typesetting that you have to wonder if it was designed by software engineers who aren’t big readers. It certainly provides the paper look, but it still has a long way to go to provide a reading experience equal to a dead tree book.

To give an example, there are currently two types of books you can buy on the nook: those that are left-aligned, and those that are justified.

Left aligned

Here’s the left alignment. It’s readable, but it wastes a lot of space on the right edge and isn’t very aesthetic. Next we have the justified version:

Bad justify

This kind of very basic justification is pretty poor looking—in attempting to create a nice straight margin, it adds a bunch of ugly space between words. This is the same kind of justification web browsers do, and almost nobody uses it because it looks so bad. I knew about this going in. Science fiction author Robert J. Sawyer wrote a very informative post about it, and I set about finding a way to solve this problem before my nook even arrived in the mail.

One great feature of the nook is that it supports viewing PDFs, with and without re-flowing text. With re-flowing turned off, you’re free to manually typeset the document any way you want, and it will appear exactly how you want on the screen. This is what I was looking for. With this you can use Microsoft Word to create a great looking PDF. If you’re feeling brave and want even better looking text, professional typesetting software like Adobe InDesign provides more advanced features that will give fantastic results.

Good justify

Here we can see proper hyphenation and justification. A good justification algorithm won’t just add more space between words, but will also be willing to shrink it just slightly if it will make things look better. You can see this in the first line of text, where it fits “adipiscing” in to remove the whitespace that plagued the text in the previous image. It also evenly spaces the entire paragraph at once instead of just a single line, and fits more text into each line by hyphenating words on line breaks.

It’s looking pretty good now, and is almost like a real book. But there’s a little more that can be done:

Best justify

Can you spot the difference? I admit, without a keen eye and a little knowledge of typography, I don’t expect most people to. Here’s an animation to help show the differences better:

Best justify + good justify comparison animation

There are two things happening here, one more subtle than the other. The first is optical margin alignment. This improves aesthetics by adjusting each line’s margins ever so slightly to reduce empty space, giving the edges a flatter look. You can see it on the fifth line, where it compensates for the empty space under the V’s left edge by moving it out a little bit. It’s even more noticeable with punctuation on the right edge, where it pushes the periods and hyphens out into the margin.

The second thing happening is ligature substitution. Certain combinations of characters have similar fine details in the same spots and can look a little awkward together; ligatures make them look better by combining them into a single specialized glyph. You can see this in the middle left of the text, where “officae” is slightly altered. Look closely at the “ffi” and you will see the three letters merged together, with the first f becoming a little smaller and the dot over the i merging with the second f to create a slightly larger overhang. Look in your favorite dead tree book and you’ll probably find it in “ff” and “fi” combinations—it’s pretty hard to notice without looking for it, but it is used to subtly improve legibility.

There is nothing about EPUBs that prevents e-readers from performing this kind of typesetting automatically; the fault is solely within the e-readers’ software. With any luck, one day we’ll get this nice look in all the e-books we can download. Until they start to take typesetting seriously, though, the only way we’ll get it is with PDFs. Unfortunately, most e-books aren’t legally available without DRM, making this kind of dramatic alteration impossible with most of the stuff you can buy.

It’s easy to say this isn’t very important. After all, it doesn’t affect functionality, right? You can still read the first picture! But when you’re doing a lot of reading, it is important. Proper text flow reduces eye movement and takes less work for your brain to process, letting you read longer and faster. It also happens to fit significantly more text onto each page, which means fewer delays from page turns—in my experience, it reduces a book’s page count by around 15%.

There are a lot of compelling reasons to get an e-reader. They can store thousands of books at once and remember your place in each one of them. You can look up unknown words in a dictionary, and bookmark stuff for later reference without drawing all over a page with a highlighter. You can browse through massive book stores and read bought books instantly without ever leaving your home. It baffles me that they don’t bother to improve the single most important part of the device—reading!

Tips for efficient I/O

There are a few things to keep in mind for I/O that can have pretty incredible effects on performance and scalability. It’s not really about any new APIs, but about how you use them.

Reduce blocking

The whole point of I/O completion ports is to perform operations asynchronously, keeping the CPU busy with work while waiting for completions. Blocking defeats this by stalling the thread, so it should be avoided whenever possible. Completion ports have a mechanism built in to make blocking less hurtful, by starting up a waiting thread when another one blocks, but it is still better to avoid blocking altogether.

This includes memory allocation. Standard system allocators usually have several points where they need to lock to allow concurrent use, so applications should make use of custom allocators like arenas and pools where possible.

This means I/O should always be performed asynchronously, lock-free algorithms should be used when available, and any remaining locks should be placed as optimally as possible. Careful application design goes a long way here. The toughest area I’ve discovered is database access—as far as I can tell, there are zero well-designed database client libraries. Every one I’ve seen forces you to block at some point.

Ideally, the only place the application would block is when retrieving completion packets.

Buffer size and alignment

I/O routines like to lock the pages of the buffers you pass in. That is, they will pin them in physical memory so that they can’t be paged out to a swap file.

The result is that if you pass in a 20 byte buffer, on most systems it will lock a full 4096 byte page in memory. Even worse, if the 20 byte buffer has 10 bytes in one page and 10 bytes in another, it will lock both pages—8192 bytes of memory for a 20 byte I/O. If you have a large number of concurrent operations, this can waste a lot of memory!

Because of this, I/O buffers should get special treatment. Functions like malloc() and operator new() should not be used, because they have no obligation to provide optimal alignment for I/O. I like to use VirtualAlloc to allocate large blocks of 1MiB, and divide each one up into smaller fixed-sized (usually page-sized, or 4KiB) blocks to be put into a free list.

If buffers are not a multiple of the system page size, extra care should be taken to allocate them in a way that keeps them in separate pages from non-buffer data. This will make sure page locking incurs the least amount of overhead while performing I/O.
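
Here’s a minimal sketch of that scheme; the names are my own, and for simplicity it isn’t thread-safe and never returns blocks to the OS:

#include <windows.h>

#define BLOCK_SIZE  (1024 * 1024) // 1MiB blocks from the OS.
#define BUFFER_SIZE 4096          // one page per I/O buffer.

struct free_buffer
{
  struct free_buffer *next;
};

static struct free_buffer *freelist = NULL;

// carve a fresh 1MiB block into page-sized buffers and push
// them all onto the free list. VirtualAlloc conveniently
// returns page-aligned memory.
static BOOL add_block(void)
{
  char *block = VirtualAlloc(NULL, BLOCK_SIZE,
     MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

  if(!block)
    return FALSE;

  for(size_t i = 0; i < BLOCK_SIZE; i += BUFFER_SIZE)
  {
    struct free_buffer *buf = (struct free_buffer*)(block + i);
    buf->next = freelist;
    freelist = buf;
  }

  return TRUE;
}

void *get_buffer(void)
{
  if(!freelist && !add_block())
    return NULL;

  struct free_buffer *buf = freelist;
  freelist = buf->next;
  return buf;
}

void put_buffer(void *p)
{
  struct free_buffer *buf = p;
  buf->next = freelist;
  freelist = buf;
}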

Limit the number of I/Os

System calls and completion ports have some expense, so limiting the number of I/O calls you make can help to increase throughput. We can use scatter/gather operations to chain several of those page-sized blocks into a single I/O.

A scatter operation takes data from one source, like a socket, and scatters it into multiple buffers. Inversely, a gather operation takes data from multiple buffers and gathers it into one destination.

For sockets, this is easy—we just use the normal WSASend and WSARecv functions, passing in multiple WSABUF structures.
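
For example, here’s a sketch of gathering three of those page-sized blocks into a single send; block1 through block3 are assumed to come from a pool like the one above, and sendop is an OVERLAPPED that has already been zeroed:

WSABUF bufs[3];

bufs[0].buf = block1; bufs[0].len = BUFFER_SIZE;
bufs[1].buf = block2; bufs[1].len = BUFFER_SIZE;
bufs[2].buf = block3; bufs[2].len = BUFFER_SIZE;

// one system call and one completion packet for all
// three buffers.
WSASend(sock, bufs, 3, NULL, 0, &sendop, NULL);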

For files it is a little more complex. Here the WriteFileGather and ReadFileScatter functions are used, but some special care must be taken. These functions require page-aligned and page-sized buffers, and the number of bytes read/written must be a multiple of the disk’s sector size. The sector size can be obtained with GetDiskFreeSpace.
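
A sketch of a scatter read, assuming the file was opened with both FILE_FLAG_OVERLAPPED and FILE_FLAG_NO_BUFFERING (which these functions require), and that page1 and page2 are page-aligned, page-sized buffers from the free list above:

DWORD sectors_per_cluster, bytes_per_sector;
DWORD free_clusters, total_clusters;

// the transfer size must be a multiple of the sector size.
GetDiskFreeSpace(L"C:\\", &sectors_per_cluster, &bytes_per_sector,
   &free_clusters, &total_clusters);

// one element per page, plus a zeroed element to terminate
// the array as a precaution.
FILE_SEGMENT_ELEMENT segments[3] = {0};
OVERLAPPED readop = {0};

segments[0].Buffer = PtrToPtr64(page1);
segments[1].Buffer = PtrToPtr64(page2);

// read two pages' worth of data in a single I/O, scattered
// into the two buffers.
ReadFileScatter(file, segments, 4096 * 2, NULL, &readop);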

Non-blocking I/O

Asynchronous operations are key here, but non-blocking I/O still has a place in the world of high performance.

Once an asynchronous operation completes, we can issue non-blocking requests to process any remaining data. Following this pattern reduces the amount of strain on the completion port and helps to keep your operation context hot in the cache for as long as possible.

Care should be taken not to let non-blocking operations continue for too long, though. If data is received on a socket fast enough, and we take so long to process it that there is always more, we could starve other completion notifications waiting to be dequeued.
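
A sketch of this pattern, assuming the socket was put into non-blocking mode with ioctlsocket and FIONBIO; process_data and the iteration cap of 4 are stand-ins:

char buf[4096];

// after dequeuing a completion for this socket, drain whatever
// is already buffered before posting the next asynchronous
// receive. the iteration cap keeps one busy socket from
// starving the others.
for(int i = 0; i < 4; i++)
{
  int got = recv(sock, buf, sizeof(buf), 0);

  if(got > 0)
  {
    process_data(buf, got); // hypothetical handler.
    continue;
  }

  if(got == 0)
  {
    // connection closed.
  }
  else if(WSAGetLastError() != WSAEWOULDBLOCK)
  {
    // handle error. WSAEWOULDBLOCK just means we drained it.
  }

  break;
}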

Throughput or concurrency

A kernel has a limited amount of non-paged memory available to it, and locking one or more pages for each I/O call is a real easy way to use it all up. Sometimes, if we expect an extreme number of concurrent connections or want to limit memory usage, it can be beneficial to delay locking these pages until absolutely required.

An undocumented feature of WSARecv is that you can request a 0-byte receive, which will complete when data has arrived. You can then issue another WSARecv, or use non-blocking I/O, to pull out whatever is available. This lets us get notified when data can be received without actually locking memory.
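
A sketch of the zero-byte receive; recvop is assumed to be an already-zeroed OVERLAPPED, and flags must point at a real DWORD:

WSABUF zerobuf;
DWORD flags = 0;

zerobuf.buf = NULL;
zerobuf.len = 0;

// completes when data arrives, without locking any buffer
// pages in the meantime.
WSARecv(sock, &zerobuf, 1, NULL, &flags, &recvop, NULL);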

Doing this is very much a choice between throughput and concurrency—it will use more CPU, reducing throughput, but it will allow your application to handle a larger number of concurrent operations. It is most beneficial on a low memory system, or on 32-bit Windows when an extreme number of concurrent operations will be used. 64-bit Windows has a much larger non-paged pool, so it shouldn’t present a problem if you feed it enough physical memory.

Unbuffered I/O

If you are using the file system a lot, your application might be waiting on the synchronous operating system cache. In this case, enabling unbuffered I/O will let file operations bypass the cache and complete more asynchronously.

This is done by calling CreateFile with the FILE_FLAG_NO_BUFFERING flag. Note that with this flag, all file access must be sector-aligned—read/write offsets and sizes. Buffers must also be sector-aligned.
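
A sketch; size is an assumed request size, bytes_per_sector is assumed to have come from GetDiskFreeSpace as earlier, and the rounding trick assumes the sector size is a power of two, which is true of any disk I’m aware of:

HANDLE file = CreateFile(L"file.txt", GENERIC_READ,
   FILE_SHARE_READ, NULL, OPEN_EXISTING,
   FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING, NULL);

// round a requested read size up to a sector multiple.
DWORD readsize = (size + bytes_per_sector - 1)
   & ~(bytes_per_sector - 1);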

Operating system caching can have good effects, so this isn’t always advantageous. But if you’ve got a specific I/O pattern or you do your own caching, it can give a significant performance boost. This is an easy thing to turn on and off, so take some benchmarks.

Update: this subject is continued in I/O Improvements in Windows Vista.

I/O completion ports made easy

I described the basics of I/O completion ports in my last post, but there is still the question of the easiest way to use them. Here I’ll show a callback-based application design that I’ve found makes a fully asynchronous program really simple to write.

I touched briefly on attaching our own context data to the OVERLAPPED structure we pass along with I/O operations. It’s this same idea that I’ll expand on here. This time, we define a generic structure to use with all our operations, along with how our threads will handle them while dequeuing packets:

struct io_context
{
  OVERLAPPED ovl;
  void (*on_completion)(DWORD error, DWORD transferred,
                        struct io_context *ctx);
};

OVERLAPPED *ovl;
ULONG_PTR completionkey;
DWORD transferred;

BOOL ret = GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

if(ret)
{
  struct io_context *ctx = (struct io_context*)ovl;
  ctx->on_completion(ERROR_SUCCESS, transferred, ctx);
}
else if(ovl)
{
  DWORD err = GetLastError();

  struct io_context *ctx = (struct io_context*)ovl;
  ctx->on_completion(err, transferred, ctx);
}
else
{
  // error out
}

With this, all our I/O operations will have a callback associated with them. When a completion packet is dequeued, it gets the error information, if any, and runs the callback. Having every I/O operation use a single callback mechanism greatly simplifies the design of the entire program.

Let’s say our app is reading a file and sending out its contents. We also want it to prefetch the next buffer so we can start sending right away. Here’s our connection context:

struct connection_context
{
  HANDLE file;
  SOCKET sock;

  WSABUF readbuf;
  WSABUF sendbuf;

  struct io_context readctx;
  struct io_context sendctx;
};

A structure like this is nice because initiating an I/O operation requires no allocations. Note that we need two io_context members because we’re doing a read and a send concurrently.

Now the code to use it:

#define BUFFER_SIZE 4096

void begin_read(struct connection_context *ctx)
{
  if(ctx->readbuf.buf)
  {
    // only begin a read if one isn't already running.
    return;
  }

  ctx->readbuf.buf = malloc(BUFFER_SIZE);
  ctx->readbuf.len = 0;

  // zero out io_context structure.
  memset(&ctx->readctx, 0, sizeof(ctx->readctx));

  // set completion callback.
  ctx->readctx.on_completion = read_finished;

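  // note: the file read offset comes from ovl.Offset and
  // ovl.OffsetHigh, which the memset above leaves at zero; a
  // real app would advance them after each completed read.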
  ReadFile(ctx->file, ctx->readbuf.buf, BUFFER_SIZE,
           NULL, &ctx->readctx.ovl);
}

void read_finished(DWORD error, DWORD transferred,
                   struct io_context *ioctx)
{
  // get our connection context.
  struct connection_context *ctx =
     (struct connection_context*)((char*)ioctx -
        offsetof(struct connection_context, readctx));

  if(error != ERROR_SUCCESS)
  {
    // handle error.
    return;
  }

  if(!transferred)
  {
    // reached end of file, close out connection.
    free(ctx->readbuf.buf);
    ctx->readbuf.buf = 0;
    return;
  }

  // send out however much we read from the file.

  ctx->readbuf.len = transferred;

  begin_send(ctx);
}

This gives us a very obvious chain of events: read_finished is called when a read completes. Since we only get an io_context structure in our callback, we need to adjust the pointer to get our full connection_context.

Sending is easy too:

void begin_send(struct connection_context *ctx)
{
  if(ctx->sendbuf.buf)
  {
    // only begin a send if one
    // isn't already running.
    return;
  }

  if(!ctx->readbuf.len)
  {
    // only begin a send if the
    // read buffer has something.
    return;
  }

  // switch buffers.
  ctx->sendbuf = ctx->readbuf;

  // clear read buffer.
  ctx->readbuf.buf = NULL;
  ctx->readbuf.len = 0;

  // zero out io_context structure.
  memset(&ctx->sendctx, 0, sizeof(ctx->sendctx));

  // set completion callback.
  ctx->sendctx.on_completion = send_finished;

  WSASend(ctx->sock, &ctx->sendbuf, 1, NULL, 0,
          &ctx->sendctx.ovl, NULL);

  // start reading next buffer.
  begin_read(ctx);
}

void send_finished(DWORD error, DWORD transferred,
                   struct io_context *ioctx)
{
  // get our connection context.
  struct connection_context *ctx =
     (struct connection_context*)((char*)ioctx -
        offsetof(struct connection_context, sendctx));

  if(error != ERROR_SUCCESS)
  {
    // handle error.
    return;
  }

  // success, clear send buffer and start next send.

  free(ctx->sendbuf.buf);
  ctx->sendbuf.buf = NULL;

  begin_send(ctx);
}

Pretty much more of the same. Again, for brevity I’m leaving out some error checking code and assuming the buffer gets sent out in full. I’m also assuming a single-threaded design—the socket and file functions themselves are thread-safe, so there is nothing to worry about there, but the buffer management code here would need some extra locking since it could be run concurrently. But the idea should be clear.

Update: this subject is continued in Tips for efficient I/O.

High Performance I/O on Windows

I/O completion ports were first introduced in Windows NT 4.0 as a highly scalable, multi-processor capable alternative to the then-typical practices of using select, WSAWaitForMultipleEvents, WSAAsyncSelect, or even running a single thread per client. The optimal way to perform I/O on Windows—short of writing a kernel-mode driver—is to use I/O completion ports.

A recent post on Slashdot claimed sockets have run their course, which I think is absolutely not true! The author seems to believe the Berkeley sockets API is the only way to perform socket I/O, which is nonsense. Much more modern, scalable, and high performance APIs exist today, such as I/O completion ports on Windows, epoll on Linux, and kqueue on FreeBSD. In light of this, I thought I’d write a little about completion ports here.

The old ways of multiplexing I/O still work pretty well for scenarios with a low number of concurrent connections, but when writing a server daemon to handle a thousand or even tens of thousands of concurrent clients at once, we need to use something different. In this sense the old de facto standard Berkeley sockets API has run its course, because the overhead of managing so many connections is simply too great and makes using multiple processors hard.

An I/O completion port is a multi-processor aware queue. You create a completion port, bind file or socket handles to it, and start asynchronous I/O operations. When they complete—either successfully or with an error—a completion packet is queued up on the completion port, which the application can dequeue from multiple threads. The completion port uses some special voodoo to make sure only a specific number of threads can run at once—if a running thread blocks in kernel-mode, it will automatically start up another one.

First you need to create a completion port with CreateIoCompletionPort:

HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE,
   NULL, 0, 0);

It’s important to note that NumberOfConcurrentThreads is not the total number of threads that can access the completion port—you can have as many as you want. This instead controls the number of threads it will allow to run concurrently. Once you’ve reached this number, it will block all other threads. If one of the running threads blocks for any reason in kernel-mode, the completion port will automatically start up one of the waiting threads. Specifying 0 for this is equivalent to the number of logical processors in the system.

Next is creating and associating a file or socket handle, using CreateFile, WSASocket, and CreateIoCompletionPort:

#define OPERATION_KEY 1

HANDLE file = CreateFile(L"file.txt", GENERIC_READ,
   FILE_SHARE_READ, NULL, OPEN_EXISTING,
   FILE_FLAG_OVERLAPPED, NULL);

SOCKET sock = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP,
   NULL, 0, WSA_FLAG_OVERLAPPED);

CreateIoCompletionPort(file, iocp, OPERATION_KEY, 0);
CreateIoCompletionPort((HANDLE)sock, iocp, OPERATION_KEY, 0);

Files and sockets must be opened with the FILE_FLAG_OVERLAPPED and WSA_FLAG_OVERLAPPED flags before they are used asynchronously.

Reusing CreateIoCompletionPort for associating file/socket handles is a weird design choice from Microsoft, but that’s how it’s done. The CompletionKey parameter can be anything you want; it is a value provided when packets are dequeued. I define an OPERATION_KEY here to use as the CompletionKey, the significance of which I’ll get to later.

Next we have to start up some I/O operations. I’ll skip setting up the socket and go right to sending data. We start these operations using ReadFile and WSASend:

OVERLAPPED readop, sendop;
WSABUF sendwbuf;
char readbuf[256], sendbuf[256];

memset(&readop, 0, sizeof(readop));
memset(&sendop, 0, sizeof(sendop));

sendwbuf.len = sizeof(sendbuf);
sendwbuf.buf = sendbuf;

BOOL readstatus = ReadFile(file, readbuf,
   sizeof(readbuf), NULL, &readop);

if(!readstatus)
{
  DWORD readerr = GetLastError();

  if(readerr != ERROR_IO_PENDING)
  {
    // error reading.
  }
}

int writestatus = WSASend(sock, &sendwbuf, 1, NULL,
   0, &sendop, NULL);

if(writestatus)
{
  int writeerr = WSAGetLastError();

  if(writeerr != WSA_IO_PENDING)
  {
    // error sending.
  }
}

Every I/O operation using a completion port has an OVERLAPPED structure associated with it. Windows uses this internally in unspecified ways, only saying we need to zero them out before starting an operation. The OVERLAPPED structure will be given back to us when we dequeue the completion packets, and can be used to pass along our own context data.

I have left out the standard error checking up until now for brevity’s sake, but this one doesn’t work quite like one might expect, so it is important here. When starting an I/O operation, an error might not really be an error. If the function succeeds, all is well, but if the function fails, it is important to check the error code with GetLastError or WSAGetLastError.

If these functions return ERROR_IO_PENDING or WSA_IO_PENDING, the call actually still succeeded. All these error codes mean is that an asynchronous operation has been started and completion is pending, as opposed to completing immediately. A completion packet will be queued up regardless of whether the operation completed asynchronously or not.

Dequeuing packets from a completion port is handled by the GetQueuedCompletionStatus function:
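
BOOL GetQueuedCompletionStatus(HANDLE CompletionPort,
   LPDWORD lpNumberOfBytes, PULONG_PTR lpCompletionKey,
   LPOVERLAPPED *lpOverlapped, DWORD dwMilliseconds);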

This function dequeues completion packets, consisting of the completion key we specified in CreateIoCompletionPort and the OVERLAPPED structure we gave when starting the I/O. If the I/O transferred any data, it will retrieve how many bytes were transferred too. Again, the error handling is a bit weird on this one, having three error states.

This can be run from as many threads as you like, even a single one. It is common practice to run a pool of twice the number of threads as there are logical processors available, to keep the CPU active with the aforementioned functionality of starting a new thread if a running one blocks.

Unless you are going for a single-threaded design, I recommend starting two threads per logical CPU. Even if your app is designed to be 100% asynchronous, you will still run into blocking when locking shared data, and even get unavoidable hidden blocking I/Os like reading in paged out memory. Keeping two threads per logical CPU will keep the processor busy without overloading the OS with too much context switching.
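
Here’s a minimal sketch of such a pool; worker is a hypothetical thread function that would run the dequeuing code, and GetSystemInfo supplies the logical processor count:

DWORD WINAPI worker(void *param)
{
  HANDLE iocp = param;

  for(;;)
  {
    // call GetQueuedCompletionStatus and dispatch
    // completion packets here.
  }

  return 0;
}

SYSTEM_INFO si;
GetSystemInfo(&si);

// two threads per logical processor.
for(DWORD i = 0; i < si.dwNumberOfProcessors * 2; i++)
{
  HANDLE thread = CreateThread(NULL, 0, worker, iocp, 0, NULL);
  CloseHandle(thread);
}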

This is all well and good, but two I/O operations were initiated—a file read and a socket send. We need a way to tell their completion packets apart. This is why we need to attach some state to the OVERLAPPED structure when we call those functions:

struct my_context
{
  OVERLAPPED ovl;
  int isread;
};

struct my_context readop, sendop;

memset(&readop.ovl, 0, sizeof(readop.ovl));
memset(&sendop.ovl, 0, sizeof(sendop.ovl));

readop.isread = 1;
sendop.isread = 0;

ReadFile(file, readbuf, sizeof(readbuf), NULL, &readop.ovl);
WSASend(sock, &sendwbuf, 1, NULL, 0, &sendop.ovl, NULL);

Now we can tell the operations apart when we dequeue them:

OVERLAPPED *ovl;
ULONG_PTR completionkey;
DWORD transferred;

GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

struct my_context *ctx = (struct my_context*)ovl;

if(ctx->isread)
{
  // read completed.
}
else
{
  // send completed.
}

The last important thing to know is how to queue up your own completion packets. This is useful if you want to split an operation up to be run on the thread pool, or if you want to exit a thread waiting on a call to GetQueuedCompletionStatus. To do this, we use the PostQueuedCompletionStatus function:

#define EXIT_KEY 0

struct my_context ctx;

PostQueuedCompletionStatus(iocp, 0, OPERATION_KEY, &ctx.ovl);
PostQueuedCompletionStatus(iocp, 0, EXIT_KEY, NULL);

Here we post two things onto the queue. First, we post our own structure, to be processed by our thread pool. Second, we give a new completion key: EXIT_KEY. The thread which processes this packet can test if the completion key is EXIT_KEY to know when it needs to stop dequeuing packets and shut down.

Other than the completion port handle, Windows does not use any of the parameters given to PostQueuedCompletionStatus. They are entirely for our use, to be dequeued with GetQueuedCompletionStatus.

That’s all I have to write for now, and it should be everything one would need to get started learning these high performance APIs! I will make another post shortly detailing some good patterns for completion port usage, and some optimization tips to ensure efficient usage of these I/O APIs.

Update: this subject is continued in I/O completion ports made easy.