Feature Article

The art of resampling

Years ago when I was a fledgling still learning to code, one of the first things I tried creating was an image resizer. I preferred (and given time, still do) to just have a go at things without research so while I succeeded in making a resizer, the results were (predictably) poor. I soon got sidetracked and left the code to rot, but the idea remained.

It’s taken me over 10 years to find a reason to revisit the subject, but I finally have: gamma compression.

In HTML, the color #777 will have roughly half the intensity of #FFF. This matches our perception and makes working with color fairly easy, but the way light really works is much different. Our eyes perceive light on a logarithmic scale—twice the photons won’t appear twice as bright to us. Transforming the actual linear intensity into our familiar representation is called gamma compression.

When we blend two gamma-​compressed colors, the result is not the same as if we blended two linear colors. To correctly blend colors, we must first uncompress them into their linear values. Take the gamma-​compressed values 0.1 and 0.9. If we just add 0.1 to 0.9, we’ll of course get 1.0: an 11% change of value. Doing it the correct way, we first decompress them into the linear values 0.​01 and 0.​79. Add 0.​01 to 0.​79, re-​compress, and the result will be 0.​905:​ an 0.5% change of value. Gamma-​ignorant processing gave us a way off result!

For an example, lets take a look at NASA’s “Earth’s City Lights”:

NASA’s “Earth’s City Lights”

Downsizing this 4800×2400 image presents a worst-case scenario for a gamma-ignorant resizer. The sharp contrast between the lights and surrounding darkness makes the blending error very prominent, and the massive downsizing gives it a chance to do a lot of blending.

Gamma-corrected “Earth’s City Lights”

At the top we see the result of gamma-correct resizing. This is how it's supposed to look—you can still see the lights along the African and Australian coasts. Western USA is clearly still very awake, as well as Europe and parts of Asia.

Gamma-ignorant “Earth’s City Lights”

On the bottom we see the result of gamma-ignorant resizing. The fainter lights have been completely drowned out. Australia and Africa now barely register, and all the other continents look far darker overall. Big difference! The unfortunate thing is that a majority of resizers will look like the image on the bottom. The incorrect result is often good enough that they either don’t notice or don’t care.

One of these incorrect resizers is in Avisynth, a scripted video processor used for everything from simple DVD deinterlacing all the way to heavy restoration of old 8mm film. A while ago I was using it to resize a Quake video and discovered that similar to the lights of the image above, all the starry teleporters lost their stars: a clear sign of gamma-​ignorant processing.

I decided to make a high-​quality resizing plugin for Avisynth that would use a fully linear, high bit depth pipeline. No shortcuts. Working on it has been a lot of fun and challenge, so I’ll be writing about it here over the next few days.

Justifying the Nook: A case for PDF

I’ve had my Nook for about two months now, and have read about 8 books on it so far. It’s a great little device that I’ve been unable to put down in my spare time. My only real gripe is that, like almost every other e-​reader out there, it has such poor typesetting that you have to wonder if it was designed by software engineers who aren’t big readers—it certainly provides the paper look, but it’s still got a long way to go to provide a reading experience equal to a dead tree book.

To give an example, there are currently two types of books you can buy on the nook: those that are left-​aligned, and those that are justified.

Left aligned

Here’s the left alignment. It’s readable, but it wastes a lot of space on the right edge and isn’t very aesthetic. Next we have the justified version:

Bad justify

This kind of very basic justification is pretty poor looking—in attempting to create a nice straight margin, it adds a bunch of ugly space between words. This is the same kind of justification web browsers do, and almost nobody uses it because it looks so bad. I knew about this going in. Science fiction author Robert J. Sawyer wrote a very informative post about it, and I set about finding a way to solve this problem before my nook even arrived in the mail.

One great feature about the nook is that it supports viewing PDFs, with and without re-​flowing text. With re-​flowing turned off, you’re free to manually typeset the document any way you want, and it will appear exactly how you want on the screen. This is what I was looking for. With this you can use Microsoft Word to create a great looking PDF. If you’re feeling brave and want even better looking text, professional typesetting software like Adobe InDesign provides more advanced features that will give fantastic results.

Good justify

Here we can see proper hyphenation and justification. A good justification algorithm won’t just add more space between words, but will also be willing to shrink it just slightly if it will make things look better. You can see this in the first line of text, where it fits “adipiscing” in to remove the whitespace that plagued the text in the previous image. It also evenly spaces the entire paragraph at once instead of just a single line, and fits more text into each line by hyphenating words on line breaks.

It’s looking pretty good now, and is almost like a real book. But there’s a little more that can be done:

Best justify

Can you spot the difference? I admit, without a keen eye and a little knowledge of typography, I won’t expect most people to. Here’s an animation to help show the differences better:

Best justify + good justify comparison animation

There are two things happening here, one more subtle than the other. The first is optical margin adjustment. This improves aesthetics by adjusting each line’s margins ever so slightly to reduce empty space, giving a more flat look to edges. You can see it on the fifth line, where it compensates for the empty space under the V’s left edge by moving it out a little bit. It’s even more noticeable with punctuation on the right edge, where it pushes the periods and hyphens out into the margin.

The second thing happening is ligature substitution. Certain combinations of characters have similar fine details in the same spots and can look a little awkward together, and ligatures can make them look better by combining them into a single specialized glyph. You can see this in the middle left of the text, where “officae” is slightly altered—look closely at the “ffi” and you will see the three letters merged together with the first F becoming a little smaller and the dot over the I merging with the second F to create a slightly larger overhang. Look in your favorite dead tree book, and you’ll probably find it in “ff” and “fi” combinations—it’s pretty hard to notice without looking for it, but it is used to subtly improve legibility.

There is nothing about EPUBs that prevent e-​readers from performing this kind of typesetting automatically. With any luck, one day we’ll get this nice look in all the e-​books we can download. The fault is solely within the e-​reader’s software. Until they start to take typesetting seriously, the only way we’ll get this is with PDFs. Unfortunately, most e-​books aren’t legally available without DRM, making this kind of dramatic alteration impossible with most of the stuff you can buy.

It’s easy to say this isn’t very important. After all, it doesn’t affect functionality, right? You can still read the first picture! When you’re doing a lot of reading, it is important. Proper text flow reduces eye movement and takes less work for your brain to process, letting you read longer and faster. It also happens to get significantly more text onto each page, which means fewer delays from page turns—in my experience, it reduces a book’s page count by around 15%.

There are a lot of compelling reasons to get an e-​reader. They can store thousands of books at once and remember your place in each one of them. You can look up unknown words in a dictionary, and bookmark stuff for later reference without drawing all over a page with a highlighter. You can browse through massive book stores and read bought books instantly without ever leaving your home. It baffles me that they don’t bother to improve the single most important part of the device—reading!

Tips for efficient I/O

There are a few things to keep in mind for I/O that can have pretty incredible effects on performance and scalability. It’s not really any new API, but how you use it.

Reduce blocking

The whole point of I/O completion ports is to do operations asynchronously, to keep the CPU busy doing work while waiting for completion. Blocking defeats this by stalling the thread, so it should be avoided whenever possible. Completion ports have a mechanism built in to make blocking less hurtful by starting up a waiting thread when another one blocks, but it is still better to avoid it all together.

This includes memory allocation. Standard system allocators usually have several points where it needs to lock to allow concurrent use, so applications should make use of custom allocators like arenas and pools where possible.

This means I/O should always be performed asynchronously, lock-free algorithms used when available, and any remaining locks should be as optimally placed as possible. Careful application design planning goes a long way here. The toughest area I’ve discovered is database access—as far as I have seen, there have been zero well designed database client libraries. Every one that I’ve seen has forced you to block at some point.

Ideally, the only place the application would block is when retrieving completion packets.

Buffer size and alignment

I/O routines like to lock the pages of the buffers you pass in. That is, it will pin them in physical memory so that they can’t be paged out to a swap file.

The result is if you pass in a 20 byte buffer, on most systems it will lock a full 4096 byte page in memory. Even worse, if the 20 byte buffer has 10 bytes in one page and 10 bytes in another, it will lock both pages—8192 bytes of memory for a 20 byte I/O. If you have a large number of concurrent operations this can waste a lot of memory!

Because of this, I/O buffers should get special treatment. Functions like malloc() and operator new() should not be used because they have no obligation to provide such optimal alignment for I/O. I like to use VirtualAlloc to allocate large blocks of 1MiB, and divide this up into smaller fixed-sized (usually page-sized, or 4KiB) blocks to be put into a free list.

If buffers are not a multiple of the system page size, extra care should be taken to allocate buffers in a way that keeps them in separate pages from non-buffer data. This will make sure page locking will incur the least amount of overhead while performing I/O.

Limit the number of I/Os

System calls and completion ports have some expense, so limiting the number of I/O calls you do can help to increase throughput. We can use scatter/gather operations to chain several of those page-sized blocks into one single I/O.

A scatter operation is taking data from one source, like a socket, and scattering it into multiple buffers. Inversely a gather operation takes data from multiple buffers and gathers it into one destination.

For sockets, this is easy—we just use the normal WSASend and WSARecv functions, passing in multiple WSABUF structures.

For files it is a little more complex. Here the WriteFileGather and ReadFileScatter functions are used, but some special care must be taken. These functions require page-aligned and -sized buffers to be used, and the number of bytes read/written must be a multiple of the disk’s sector size. The sector size can be obtained with GetDiskFreeSpace.

Non-blocking I/O

Asynchronous operations are key here, but non-blocking I/O still has a place in the world of high performance.

Once an asynchronous operation completes, we can issue non-blocking requests to process any remaining data. Following this pattern reduces the amount of strain on the completion port and helps to keep your operation context hot in the cache for as long as possible.

Care should be taken to not let non-blocking operations continue on for too long, though. If data is received on a socket fast enough and we take so long to process it that there is always more, it could starve other completion notifications waiting to be dequeued.

Throughput or concurrency

A kernel has a limited amount of non-paged memory available to it, and locking one or more pages for each I/O call is a real easy way use it all up. Sometimes if we expect an extreme number of concurrent connections or if we want to limit memory usage, it can be beneficial to delay locking these pages until absolutely required.

An undocumented feature of WSARecv is that you can request a 0-byte receive, which will complete when data has arrived. Then you can issue another WSARecv or use non-blocking I/O to pull out whatever is available. This lets us get notified when data can be received without actually locking memory.

Doing this is very much a choice of throughput or concurrency—it will use more CPU, reducing throughput. But it will allow your application to handle a larger number of concurrent operations. It is most beneficial in a low memory system, or on 32-bit Windows when an extreme number of concurrent operations will be used. 64-bit Windows has a much larger non-paged pool, so it shouldn’t present a problem if you feed it enough physical memory.

Unbuffered I/O

If you are using the file system a lot, your application might be waiting on the synchronous operating system cache. In this case, enabling unbuffered I/O will let file operations bypass the cache and complete more asynchronously.

This is done by calling CreateFile with the FILE_FLAG_NO_BUFFERING flag. Note that with this flag, all file access must be sector aligned—read/write offsets and sizes. Buffers must also be sector aligned.

Operating system caching can have good effects, so this isn’t always advantageous. But if you’ve got a specific I/O pattern or if you do your own caching, it can give a significant performance boost. This is an easy thing to turn on and off, so take some benchmarks.

Update: this subject continued in I/O Improvements in Windows Vista.

I/O completion ports made easy

I described the basics of I/O completion ports in my last post, but there is still the question of what the easiest way to use them. Here I’ll show a callback-based application design that I’ve found makes a fully asynchronous program really simple to do.

I touched briefly on attaching our own context data to the OVERLAPPED structure we pass along with I/O operations. It’s this same idea that I’ll expand on here. This time, we define a generic structure to use with all our operations, and how our threads will handle them while dequeuing packets:

struct io_context
  void (*on_completion)(DWORD error, DWORD transferred,
                        struct io_context *ctx);

ULONG_PTR completionkey;
DWORD transferred;

BOOL ret = GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

  struct io_context *ctx = (struct io_context*)ovl;
  ctx->on_completion(ERROR_SUCCESS, transferred, ctx);
else if(ovl)
  DWORD err = GetLastError();

  struct io_context *ctx = (struct io_context*)ovl;
  ctx->on_completion(err, transferred, ctx);
  // error out

With this, all our I/O operations will have a callback associated with them. When a completion packet is dequeued it gets the error information, if any, and runs the callback. Having every I/O operation use a single callback mechanism greatly simplifies the design of the entire program.

Lets say our app was reading a file and sending out it’s contents. We also want it to prefetch the next buffer so we can start sending right away. Here’s our connection context:

struct connection_context
  HANDLE file;
  SOCKET sock;

  WSABUF readbuf;
  WSABUF sendbuf;

  struct io_context readctx;
  struct io_context sendctx;

A structure like this is nice because initiating an I/O operation will need no allocations. Note that we need two io_context members because we’re doing a read and send concurrently.

Now the code to use it:

#define BUFFER_SIZE 4096

void begin_read(struct connection_context *ctx)
    // only begin a read if one isn't already running.

  ctx->readbuf.buf = malloc(BUFFER_SIZE);
  ctx->readbuf.len = 0;

  // zero out io_context structure.
  memset(&ctx->readctx, 0, sizeof(ctx->readctx));

  // set completion callback.
  ctx->readctx.on_completion = read_finished;

  ReadFile(ctx->file, ctx->readbuf.buf, BUFFER_SIZE,
           NULL, &ctx->readctx.ovl);

void read_finished(DWORD error, DWORD transferred,
                   struct io_context *ioctx)
  // get our connection context.
  struct connection_context *ctx =
     (struct connection_context*)((char*)ioctx -
        offsetof(struct connection_context, readctx));

  if(error != ERROR_SUCCESS)
    // handle error.

    // reached end of file, close out connection.
    ctx->readbuf.buf = 0;

  // send out however much we read from the file.

  ctx->readbuf.len = transferred;


This gives us a very obvious chain of events: read_finished is called when a read completes. Since we only get an io_context structure in our callback, we need to adjust the pointer to get our full connection_context.

Sending is easy too:

void begin_send(struct connection_context *ctx)
    // only begin a send if one
    // isn't already running.

    // only begin a send if the
    // read buffer has something.

  // switch buffers.
  ctx->sendbuf = ctx->readbuf;

  // clear read buffer.
  ctx->readbuf.buf = NULL;
  ctx->readbuf.len = 0;

  // zero out io_context structure.
  memset(&ctx->sendctx, 0, sizeof(ctx->sendctx));

  // set completion callback.
  ctx->sendctx.on_completion = send_finished;

  WSASend(ctx->sock, &ctx->sendbuf, 1, NULL, 0,
          &ctx->sendctx.ovl, NULL);

  // start reading next buffer.

void send_finished(DWORD error, DWORD transferred,
                   struct io_context *ioctx)
  // get our connection context.
  struct connection_context *ctx =
     (struct connection_context*)((char*)ioctx -
        offsetof(struct connection_context, sendctx));

  if(error != ERROR_SUCCESS)
    // handle error.

  // success, clear send buffer and start next send.

  ctx->sendbuf.buf = NULL;


Pretty much more of the same. Again for brevity I’m leaving out some error checking code and assuming the buffer gets sent out in full. I’m also assuming a single-threaded design—socket and file functions themselves are thread-safe and have nothing to worry about, but the buffer management code here would need some extra locking since it could be run concurrently. But the idea should be clear.

Update: this subject continued in Tips for efficient I/O.

High Performance I/O on Windows

I/O completion ports were first introduced in Windows NT 4.0 as a highly scalable, multi-processor capable alternative to the then-typical practices of using select, WSAWaitForMultipleEvents, WSAAsyncSelect, or even running a single thread per client. The most optimal way to perform I/O on Windows—short of writing a kernel-mode driver—is to use I/O completion ports.

A recent post on Slashdot claimed sockets have run their course, which I think is absolutely not true! The author seems to believe the Berkeley sockets API is the only way to perform socket I/O, which is nonsense. Much more modern, scalable and high performance APIs exist today such as I/O completion ports on Windows, epoll on Linux, and kqueue on FreeBSD. In light of this I thought I’d write a little about completion ports here.

The old ways of multiplexing I/O still work pretty well for scenarios with a low number of concurrent connections, but when writing a server daemon to handle a thousand or even tens of thousands of concurrent clients at once, we need to use something different. In this sense the old de facto standard Berkeley sockets API has run its course, because the overhead of managing so many connections is simply too great and makes using multiple processors hard.

An I/O completion port is a multi-processor aware queue. You create a completion port, bind file or socket handles to it, and start asynchronous I/O operations. When they complete—either successfully or with an error—a completion packet is queued up on the completion port, which the application can dequeue from multiple threads. The completion port uses some special voodoo to make sure only a specific number of threads can run at once—if one thread blocks in kernel-mode, it will automatically start up another one.

First you need to create a completion port with CreateIoCompletionPort:

HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE,
   NULL, 0, 0);

It’s important to note that NumberOfConcurrentThreads is not the total number of threads that can access the completion port—you can have as many as you want. This instead controls the number of threads it will allow to run concurrently. Once you’ve reached this number, it will block all other threads. If one of the running threads blocks for any reason in kernel-mode, the completion port will automatically start up one of the waiting threads. Specifying 0 for this is equivalent to the number of logical processors in the system.

Next is creating and associating a file or socket handle, using CreateFile, WSASocket, and CreateIoCompletionPort:


HANDLE file = CreateFile(L"file.txt", GENERIC_READ,


CreateIoCompletionPort(file, iocp, OPERATION_KEY, 0);
CreateIoCompletionPort((HANDLE)sock, iocp, OPERATION_KEY, 0);

Files and sockets must be opened with the FILE_FLAG_OVERLAPPED and WSA_FLAG_OVERLAPPED flags before they are used asynchronously.

Reusing CreateIoCompletionPort for associating file/socket handles is weird design choice from Microsoft but that’s how it’s done. The CompletionKey parameter can be anything you want, it is a value provided when packets are dequeued. I define a OPERATION_KEY here to use as the CompletionKey, the significance of which I’ll get to later.

Next we have to start up some I/O operations. I’ll skip setting up the socket and go right to sending data. We start these operations using ReadFile and WSASend:

OVERLAPPED readop, sendop;
WSABUF sendwbuf;
char readbuf[256], sendbuf[256];

memset(&readop, 0, sizeof(readop));
memset(&sendop, 0, sizeof(sendop));

sendwbuf.len = sizeof(sendbuf);
sendwbuf.buf = sendbuf;

BOOL readstatus = ReadFile(file, readbuf,
   sizeof(readbuf), NULL, &readop);

  DWORD readerr = GetLastError();

  if(readerr != ERROR_IO_PENDING)
    // error reading.

int writestatus = WSASend(sock, &sendwbuf, 1, NULL,
   0, &sendop, NULL);

  int writeerr = WSAGetLastError();

  if(writeerr != WSA_IO_PENDING)
    // error sending.

Every I/O operation using a completion port has an OVERLAPPED structure associated with it. Windows uses this internally in unspecified ways, only saying we need to zero them out before starting an operation. The OVERLAPPED structure will be given back to us when we dequeue the completion packets, and can be used to pass along our own context data.

I have left out the standard error checking up until now for brevity’s sake, but this one doesn’t work quite like one might expect so it is important here. When starting an I/O operation, an error might not really be an error. If the function succeeds all is well, but if the function fails, it is important to check the error code with GetLastError or WSAGetLastError.

If these functions return ERROR_IO_PENDING or WSA_IO_PENDING, the function actually still completed successfully. All these error codes mean is an asynchronous operation has been started and completion is pending, as opposed to completing immediately. A completion packet will be queued up regardless of the operation completing asynchronously or not.

Dequeuing packets from a completion port is handled by the GetQueuedCompletionStatus function:

This function dequeues completion packets, consisting of the completion key we specified in CreateIoCompletionPort and the OVERLAPPED structure we gave while starting the I/O. If the I/O transferred any data, it will retrieve how many bytes were transferred too. Again the error handling is a bit weird on this one, having three error states.

This can be run from as many threads as you like, even a single one. It is common practice to run a pool of twice the number of threads as there are logical processors available, to keep the CPU active with the aforementioned functionality of starting a new thread if a running one blocks.

Unless you are going for a single-threaded design, I recommend starting two threads per logical CPU. Even if your app is designed to be 100% asynchronous, you will still run into blocking when locking shared data and even get unavoidable hidden blocking I/Os like reading in paged out memory. Keeping two threads per logical CPU will keep the processor busy without overloading the OS with too much context switching.

This is all well and good, but two I/O operations were initiated—a file read and a socket send. We need a way to tell their completion packets apart. This is why we need to attach some state to the OVERLAPPED structure when we call those functions:

struct my_context
  int isread;

struct my_context readop, sendop;

memset(&readop.ovl, 0, sizeof(readop.ovl));
memset(&sendop.ovl, 0, sizeof(sendop.ovl));

readop.isread = 1;
sendop.isread = 0;

ReadFile(file, readbuf, sizeof(readbuf), NULL, &readop.ovl);
WSASend(sock, &sendwbuf, 1, NULL, 0, &sendop.ovl, NULL);

Now we can tell the operations apart when we dequeue them:

ULONG_PTR completionkey;
DWORD transferred;

GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

struct my_context *ctx = (struct my_context*)ovl;

  // read completed.
  // send completed.

The last important thing to know is how to queue up your own completion packets. This is useful if you want to split an operation up to be run on the thread pool, or if you want to exit a thread waiting on a call to GetQueuedCompletionStatus. To do this, we use the PostQueuedCompletionStatus function:

#define EXIT_KEY 0

struct my_context ctx;

PostQueuedCompletionStatus(iocp, 0, OPERATION_KEY, &ctx.ovl);
PostQueuedCompletionStatus(iocp, 0, EXIT_KEY, NULL);

Here we post two things onto the queue. The first, we post our own structure onto the queue, to be processed by our thread pool. The second, we give a new completion key: EXIT_KEY. The thread which processes this packet can test if the completion key is EXIT_KEY to know when it needs to stop dequeuing packets and shut down.

Other than the completion port handle, Windows does not use any of the parameters given to PostQueuedCompletionStatus. They are entirely for our use, to be dequeued with GetQueuedCompletionStatus.

That’s all I have to write for now, and should be everything one would need to get started learning these high performance APIs! I will make another post shortly detailing some good patterns for completion port usage, and some optimization tips to ensure efficient usage of these I/O APIs.

Update: this subject continued in I/O completion ports made easy.