Scalability

Limiting asynchrony with task parallelism

When I started rewriting this website, I wanted to make good use of my multi-core CPU. Generating hundreds of pages using XSL transforms and plenty of pre-processing in C#, there's a lot of parallelism to be had.

I began by using the TPL's data parallelism features: mainly Parallel.ForEach and Parallel.Invoke. These are super easy to use, and made an immediate huge difference.

Then the Visual Studio 11 developer preview came out, and I felt compelled to make use of its new async features. This meant ditching the Parallel methods altogether and writing for task parallelism.

There are still parts of the .NET Framework which don't support async, and XML is one of them. Because I'm reading relatively small documents, I was able to work around these limitations by asynchronously filling a MemoryStream from a file and feeding the MemoryStream to the XML classes:

Task<FileStream> OpenReadAsync(string fileName)
{
    return Task.Factory.StartNew(state =>
        new FileStream((string)state, FileMode.Open, FileAccess.Read,
                       FileShare.Read, 4096, true), fileName);
}

async Task<XmlReader> CreateXmlReader(string fileName,
                                      XmlReaderSettings settings = null)
{
    MemoryStream ms = new MemoryStream();
    
    using (FileStream fs = await OpenReadAsync(fileName))
    {
        await fs.CopyToAsync(ms);
    }

    ms.Position = 0;
    return XmlReader.Create(ms, settings, fileName);
}

But I had one more problem to solve. For efficiency, Parallel.ForEach partitions its items into ranges which will be operated on concurrently. A side effect of this that I was relying on was that only so many I/O operations would be able to happen at once. In my new code I'm simply launching all these tasks at once rather than partitioning—this absolutely killed performance, as potentially hundreds of concurrent I/Os caused my disk to seek like crazy.

What I ended up doing here was creating a ticket system which allows only a limited number of I/Os to happen concurrently: essentially a safe task-based semaphore.

sealed class AsyncLimiter
{
    public AsyncLimiter(int max);
    public Task<IDisposable> Lock();
}

The full implementation is available in Subversion and under a 2-clause BSD license. Using it is very simple:

AsyncLimiter limiter = new AsyncLimiter(4);

async Task<FileStream> OpenReadAsync(string fileName)
{
    using (IDisposable limiterlock = await limiter.Lock())
    {
        return await Task.Factory.StartNew(state =>
            new FileStream((string)state, FileMode.Open, FileAccess.Read,
                           FileShare.Read, 4096, true), fileName);
    }
}

async Task<XmlReader> CreateXmlReader(string fileName,
                                      XmlReaderSettings settings = null)
{
    MemoryStream ms = new MemoryStream();

    using (FileStream fs = await OpenReadAsync(fileName))
    using (IDisposable limiterlock = await limiter.Lock())
    {
        await fs.CopyToAsync(ms);
    }

    ms.Position = 0;
    return XmlReader.Create(ms, settings, fileName);
}

When the lock gets disposed, it'll let the next operation in line progress. This was simple to implement efficiently using Interlocked methods and a ConcurrentQueue.

Some operations—file opening and existence testing, directory creation, etc.—have no asynchronous analog. For these there is no good solution, so I simply wrapped them in a task as in the OpenReadAsync example above. They're rare enough that it hasn't been a problem.

The end result? Actually about 50% better performance than using the Parallel methods. When all the files are in cache, I'm able to generate this entire website from scratch in about 0.7 seconds.

Asynchronous page faults

With I/O, we’ve got some choices:

  1. Synchronous, copying from OS cache (fread). This is the simplest form of I/O, but isn't very scalable.
  2. Synchronous, reading directly from OS cache (memory mapping). This is wicked fast and efficient once memory is filled, but aside from some cases with read-ahead, your threads will still block with page faults.
  3. Asynchronous, copying from OS cache (ReadFile). Much more scalable than fread, but each read still involves duplicating data from the OS cache into your buffer. Fine if you're reading some data only to modify the buffer in place, but still not very great when you're treating it as read-only (such as to send over a socket).
  4. Asynchronous, maintaining your own cache (FILE_FLAG_NO_BUFFERING). More scalable still than ReadFile, but you need to do your own caching and it's not shared with other processes.

Note that there's one important choice missing: memory mapping with asynchronous page faults. As far as I know there are no operating systems that actually offer this—it's kind of a dream feature of mine. Here are two imaginary APIs that would help support it:

HANDLE CreateMemoryManager();
BOOL MakeResident(HANDLE, LPVOID, SIZE_T, LPOVERLAPPED);

CreateMemoryManager opens a handle to the Windows memory manager, and MakeResident will fill the pages you specify (returning true for synchronous completion, false for error/async like everything else). The best of both worlds: fast, easy access through memory, a full asynchronous workflow, and shared cache usage. This would be especially useful on modern CPUs that offer gigantic address spaces.
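
To make the dream a little more concrete, here's a sketch of how these imaginary functions might be used together with a completion port. The mapping, iocp, viewsize, and MAPPING_KEY variables are assumed to already exist; only the last two calls are the made-up API.

HANDLE mm = CreateMemoryManager();
CreateIoCompletionPort(mm, iocp, MAPPING_KEY, 0);

void *view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, viewsize);

OVERLAPPED ovl;
memset(&ovl, 0, sizeof(ovl));

if(MakeResident(mm, view, viewsize, &ovl))
{
  // pages are already resident; read through view immediately.
}
else if(GetLastError() == ERROR_IO_PENDING)
{
  // a completion packet will arrive once the pages are filled in;
  // only then do we touch view, so no thread blocks on a page fault.
}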

The memory manager already has similar functionality in there somewhere, so it might not be difficult to pull into user-mode. Just an educated guess. Maybe it'd be terribly difficult. Dream feature!

Efficient stream parsing in C++

A while ago I wrote about creating a good parser, and while the non-blocking idea was spot-on, the rest of it really isn't very good in C++, where we have the power of templates to help us.

I'm currently finishing up an HTTP library and have been revising my views on stream parsing because of it. I'm still not entirely set on my overall implementation, but I'm nearing completion and am ready to share my ideas. First, I'll list my requirements:

To accomplish this I broke the problem into three layers: a core parser, a buffer, and a buffer parser.

The core parser

Designing the core parser was simple. I believe I already have a solid C++ parser design in my XML library, so I'm reusing that concept. This is a fully in-situ pull parser that operates on a range of bidirectional iterators and returns a sub-range of those iterators. The pull function returns ok when it parses a new element, done when it has reached a point that could be considered an end of the stream, and need_more when an element can't be extracted from the passed-in iterator range. Using this parser is pretty simple:

typedef std::deque<char> buffer_type;
typedef http::parser<buffer_type::iterator> parser_type;

buffer_type buffer;

parser_type p;
parser_type::node_type n;
parser_type::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  std::deque<char>::iterator first = buffer.begin();

  while((r = p.parse(first, buffer.end(), n)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
        break; // handle the element here.
    }
  }

  buffer.erase(buffer.begin(), first); // remove all the used
                                       // data from the buffer.
} while(r == http::result_types::need_more);

By letting the user pass in a new range of iterators to parse each time, we have the option of updating the stream with more data when need_more is returned. The parse() function also updates the first iterator so that we can pop any data prior to it from the data stream.

By default the parser will throw an exception when it encounters an error. This can be changed by calling an overload and handling the error result type:

typedef std::deque<char> buffer_type;
typedef http::parser<buffer_type::iterator> parser_type;

buffer_type buffer;

parser_type p;
parser_type::node_type n;
parser_type::error_type err;
parser_type::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  std::deque<char>::iterator first = buffer.begin();

  while((r = p.parse(first, buffer.end(), n, err)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
        break; // handle the element here.
    }
  }

  buffer.erase(buffer.begin(), first); // remove all the used
                                       // data from the buffer.
} while(r == http::result_types::need_more);

if(r == http::result_types::error)
{
  std::cerr
    << "an error occured at "
    << std::distance(buffer.begin(), err.position())
    << ": "
    << err.what()
    << std::endl;
}

The buffer

Initially I was testing my parser with a deque<char> like above. This let me test the iterator-based parser very easily by incrementally pushing data on, parsing some of it, and popping off what was used. Unfortunately, using a deque means we always have an extra copy, from an I/O buffer into the deque. Iterating a deque is also a lot slower than iterating a range of pointers because of the way deque is usually implemented. This inefficiency is acceptable for testing, but just won't work in a live app.

My buffer class is I/O- and parsing-optimized, operating on pages of data. It allows pages to be inserted directly from I/O without copying. Ones that weren't filled entirely can still be filled later, allowing the user to commit more bytes of a page as readable. One can use scatter/gather I/O to make operations span multiple pages contained in a buffer.

The buffer exposes two types of iterators. The first type is what we are used to in deque: just a general byte stream iterator. But this incurs the same cost as deque: each increment to the iterator must check if it's at the end of the current page and move to the next. A protocol like HTTP can fit a lot of elements into a single 4KiB page, so it doesn't make sense to have this cost. This is where the second iterator comes in: the page iterator. A page can be thought of as a Range representing a subset of the data in the full buffer. Overall the buffer class looks something like this:

struct page
{
  const char *first;    // the first byte of the page.
  const char *last;     // one past the last byte of the page.
  const char *readpos;  // the first readable byte of the page.
  const char *writepos; // the first writable byte of the page,
                        // one past the last readable byte.
};

class buffer
{
public:
  typedef ... size_type;
  typedef ... iterator;
  typedef ... page_iterator;

  void push(page *p); // pushes a page into the buffer.  might
                      // be empty, semi-full, or full.

  page* pop(); // pops the first fully read page from the buffer.

  void commit_write(size_type numbytes); // merely moves writepos
                                         // by some number of bytes.

  void commit_read(size_type numbytes); // moves readpos by
                                        // some number of bytes.

  iterator begin() const;
  iterator end() const;

  page_iterator pages_begin() const;
  page_iterator pages_end() const;
};

One thing you may notice is it expects you to push() and pop() pages directly onto it, instead of allocating its own. I really hate classes that allocate memory – in terms of scalability the fewer places that allocate memory, the easier it will be to optimize. Because of this I always try to design my code to – if it makes sense – have the next layer up do allocations. When it doesn't make sense, I document it. Hidden allocations are the root of evil.
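
To give a feel for how the pieces fit together, here's a rough usage sketch. The allocation scheme and the assumption that the page iterator dereferences to a page are mine for illustration, not part of the real interface:

const std::size_t page_size = 4096;

char *data = new char[page_size];

page *pg = new page;
pg->first = pg->readpos = pg->writepos = data;
pg->last = data + page_size;

buffer buf;
buf.push(pg);

// suppose an I/O call (e.g. a recv) just wrote n bytes into data:
std::size_t n = 512;
buf.commit_write(n);

// fast path: walk the readable data one contiguous page at a time.
for(buffer::page_iterator it = buf.pages_begin();
    it != buf.pages_end(); ++it)
{
  const char *first = it->readpos;
  const char *last = it->writepos;
  // parse [first, last) with a raw-pointer parser here.
}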

The buffer parser

Unlike the core parser, the buffer parser isn't a template class. The buffer parser exposes the same functionality as a core parser, but using a buffer instead of iterator ranges.

This is where C++ gives me a big advantage. The buffer parser is actually implemented with two core parsers. The first is a very fast http::parser<const char*>. It uses this to parse as much of a single page as possible, stopping when it encounters need_more and no more data can be added to the page. The second is an http::parser<buffer::iterator>. This gets used when the first parser stops, which will happen very infrequently – only when an HTTP element spans multiple pages.

This is fairly easy to implement, but it required a small change to my core parser concept. Because each parser has separate internal state, I needed to make it so I could move the state between two parsers that use different iterators. The amount of state is actually very small, making this a fast operation.

The buffer parser works with two different iterator types internally, so I chose to always return a buffer::iterator range. The choice was either that or silently copy elements spanning multiple pages, and this way lets the user of the code decide how they want to handle it.

Using the buffer parser is just as easy as before:

http::buffer buffer;
http::buffer_parser p;
http::buffer_parser::node_type n;
http::buffer_parser::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  while((r = p.parse(buffer, n)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
        break; // handle the element here.
    }
  }

  pop_used(buffer); // remove all the used
                    // data from the buffer.
} while(r == http::result_types::need_more);

The I/O layer

I'm leaving out an I/O layer for now. I will probably write several small I/O systems for it once I'm satisfied with the parser. Perhaps one using asio, one using I/O completion ports, and one using epoll. I've designed this from the start to be I/O agnostic but with optimizations that facilitate efficient forms of all I/O, so I think it could be a good benchmark of the various I/O subsystems that different platforms provide.

One idea I've got is to use Winsock Kernel to implement a kernel-mode HTTPd. Not a very good idea from a security standpoint, but it would still be interesting to see the effects on performance. Because the parser performs no allocation, makes no I/O calls, and doesn't force the use of exceptions, it should actually be very simple to use in kernel-mode.

I/O Improvements in Windows Vista

My tips for efficient I/O are relevant all the way back to coding for Windows 2000. A lot of time has passed since then though, and for all the criticism it got, Windows Vista actually brought in a few new ways to make I/O even more performant than before.

This will probably be my last post on user-mode I/O until something new and interesting comes along, completing what started a couple weeks ago with High Performance I/O on Windows.

Synchronous completion

In the past, non-blocking I/O was a great way to reduce the stress on a completion port. An unfortunate side effect of this was an increased number of syscalls -- the last non-blocking call you make will do nothing, only returning WSAEWOULDBLOCK. You would still need to call an asynchronous version to wait for more data.

Windows Vista solved this elegantly with SetFileCompletionNotificationModes. This function lets you tell a file or socket that you don't want a completion packet queued up when an operation completes synchronously (that is, a function returned success immediately and not ERROR_IO_PENDING). Using this, the last I/O call will always be of some use -- either it completes immediately and you can continue processing, or it starts an asynchronous operation and a completion packet will be queued up when it finishes.
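
As a sketch, using it on a socket that's already bound to a completion port might look like this; buf is just a receive buffer, and error handling is trimmed down:

// tell the socket not to queue a completion packet when an
// operation completes synchronously.
SetFileCompletionNotificationModes((HANDLE)sock,
   FILE_SKIP_COMPLETION_PORT_ON_SUCCESS);

char buf[4096];
WSABUF wbuf = { sizeof(buf), buf };

OVERLAPPED ovl;
DWORD received, flags = 0;

memset(&ovl, 0, sizeof(ovl));

if(!WSARecv(sock, &wbuf, 1, &received, &flags, &ovl, NULL))
{
  // completed synchronously: no packet will be queued, and received
  // is already valid, so process the data right here.
}
else if(WSAGetLastError() == WSA_IO_PENDING)
{
  // completion is pending: a packet will be queued when it's done.
}
else
{
  // error receiving.
}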

Like the non-blocking I/O trick, continually calling this can starve other operations in a completion port if a socket or file feeds data faster than you can process it. Care should be taken to limit the number of times you continue processing synchronous completions.

Reuse memory with file handles

If you want to optimize even more for throughput, you can associate a range of memory with an unbuffered file handle using SetFileIoOverlappedRange. This tells the OS that a block of memory will be re-used, and should be kept locked in memory until the file handle is closed. Of course if you won't be performing I/O with a handle very often, it might just waste memory.
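
A rough sketch of what that looks like, assuming file is an unbuffered, overlapped handle and the 1MiB block is the kind of I/O buffer pool described in the earlier tips post:

unsigned char *iopool = (unsigned char*)VirtualAlloc(NULL, 1024 * 1024,
   MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

// tell the OS this block will be reused for I/O on this handle, so it
// can stay locked instead of being locked and unlocked per operation.
if(!SetFileIoOverlappedRange(file, iopool, 1024 * 1024))
{
  // not fatal: I/O still works, pages just get locked per operation.
}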

Dequeue multiple completion packets at once

A new feature to further reduce the stress on a completion port is GetQueuedCompletionStatusEx, which lets you dequeue multiple completion packets in one call.

If you read the docs for it, you'll eventually realize it only returns error information if the function itself fails—not if an async operation fails. Unfortunately this important information is missing from all the official docs I could find, and searching gives nothing. So how do you get error information out of GetQueuedCompletionStatusEx? Well, after playing around a bit I discovered that you can call GetOverlappedResult or WSAGetOverlappedResult to get it, so it's not a problem.
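
Here's roughly what that looks like; file stands in for whichever handle the operation was started on, which in practice you'd recover from the context attached to the OVERLAPPED:

OVERLAPPED_ENTRY entries[64];
ULONG count = 0;

if(GetQueuedCompletionStatusEx(iocp, entries, 64, &count,
                               INFINITE, FALSE))
{
  for(ULONG i = 0; i < count; ++i)
  {
    OVERLAPPED *ovl = entries[i].lpOverlapped;
    DWORD transferred = 0;

    // the entry doesn't carry the operation's error code, so ask
    // for it explicitly, without waiting.
    if(GetOverlappedResult(file, ovl, &transferred, FALSE))
    {
      // success: transferred bytes were moved.
    }
    else
    {
      DWORD err = GetLastError(); // the failed operation's error.
    }
  }
}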

This function should only be used if your application has a single thread or handles a high number of concurrent I/O operations, or you might end up defeating the multithreading baked in to completion ports by not letting it spread completion notifications around. If you can use it, it's a nice and easy way to boost the performance of your code by lowering contention on a completion port.

Bandwidth reservation

One large change in Windows Vista was I/O scheduling and prioritization. If you have I/O that is dependent on steady streaming like audio or video, you can now use SetFileBandwidthReservation to help ensure it will never be interrupted by something else hogging a disk.
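
A sketch of what a reservation might look like for a stream that needs 2 MiB every second; the numbers are made up:

DWORD transfersize = 0, outstanding = 0;

if(SetFileBandwidthReservation(file, 1000, 2 * 1024 * 1024, FALSE,
                               &transfersize, &outstanding))
{
  // the OS reports back the request size and the number of requests
  // to keep in flight in order to meet the reservation.
}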

Cancel specific I/O requests

A big pain pre-Vista was the inability to cancel individual I/O operations. The only option was to cancel all operations for a handle, and only from the thread which initiated them.

If it turns out some I/O operation is no longer required, it is now possible to cancel individual I/Os using CancelIoEx. This much needed function replaces the almost useless CancelIo, and opens the doors to sharing file handles between separate operations.
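
Usage is as simple as passing the same OVERLAPPED you started the operation with; readop here is assumed to be one that's still pending:

// cancel just this one operation; other I/O on the handle continues.
if(!CancelIoEx(file, &readop))
{
  if(GetLastError() == ERROR_NOT_FOUND)
  {
    // nothing to cancel: the operation already completed.
  }
}

// a completion packet is still queued for the canceled operation,
// carrying ERROR_OPERATION_ABORTED as its error.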

Tips for efficient I/O

There are a few things to keep in mind for I/O that can have pretty incredible effects on performance and scalability. It's not really any new API, but how you use it.

Reduce blocking

The whole point of I/O completion ports is to do operations asynchronously, to keep the CPU busy doing work while waiting for completion. Blocking defeats this by stalling the thread, so it should be avoided whenever possible. Completion ports have a mechanism built in to make blocking less hurtful by starting up a waiting thread when another one blocks, but it is still better to avoid it altogether.

This includes memory allocation. Standard system allocators usually have several points where they need to lock to allow concurrent use, so applications should make use of custom allocators like arenas and pools where possible.

This means I/O should always be performed asynchronously, lock-free algorithms used when available, and any remaining locks should be as optimally placed as possible. Careful application design planning goes a long way here. The toughest area I've discovered is database access—as far as I have seen, there have been zero well-designed database client libraries. Every one that I've seen has forced you to block at some point.

Ideally, the only place the application would block is when retrieving completion packets.

Buffer size and alignment

I/O routines like to lock the pages of the buffers you pass in. That is, they will pin them in physical memory so that they can't be paged out to a swap file.

The result is if you pass in a 20 byte buffer, on most systems it will lock a full 4096 byte page in memory. Even worse, if the 20 byte buffer has 10 bytes in one page and 10 bytes in another, it will lock both pages—8192 bytes of memory for a 20 byte I/O. If you have a large number of concurrent operations this can waste a lot of memory!

Because of this, I/O buffers should get special treatment. Functions like malloc() and operator new() should not be used because they have no obligation to provide such optimal alignment for I/O. I like to use VirtualAlloc to allocate large blocks of 1MiB, and divide this up into smaller fixed-sized (usually page-sized, or 4KiB) blocks to be put into a free list.

If buffers are not a multiple of the system page size, extra care should be taken to allocate buffers in a way that keeps them in separate pages from non-buffer data. This will make sure page locking will incur the least amount of overhead while performing I/O.
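
A minimal sketch of that allocation scheme; the free list is a singly linked list threaded through the blocks themselves, and any locking (or a lock-free stack) needed for concurrent use is left out:

#define POOL_BYTES  (1024 * 1024)
#define BLOCK_BYTES 4096

struct io_block { struct io_block *next; };

static struct io_block *freelist = NULL;

// carve a 1MiB page-aligned reservation into page-sized I/O buffers.
static void pool_grow(void)
{
  char *base = (char*)VirtualAlloc(NULL, POOL_BYTES,
     MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

  if(!base)
    return; // allocation failed; pool_alloc will return NULL.

  for(size_t off = 0; off < POOL_BYTES; off += BLOCK_BYTES)
  {
    struct io_block *blk = (struct io_block*)(base + off);
    blk->next = freelist;
    freelist = blk;
  }
}

static void* pool_alloc(void)
{
  if(!freelist)
    pool_grow();

  if(!freelist)
    return NULL;

  struct io_block *blk = freelist;
  freelist = blk->next;
  return blk;
}

static void pool_free(void *p)
{
  struct io_block *blk = (struct io_block*)p;
  blk->next = freelist;
  freelist = blk;
}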

Limit the number of I/Os

System calls and completion ports have some expense, so limiting the number of I/O calls you do can help to increase throughput. We can use scatter/gather operations to chain several of those page-sized blocks into one single I/O.

A scatter operation takes data from one source, like a socket, and scatters it into multiple buffers. Conversely, a gather operation takes data from multiple buffers and gathers it into one destination.

For sockets, this is easy—we just use the normal WSASend and WSARecv functions, passing in multiple WSABUF structures.

For files it is a little more complex. Here the WriteFileGather and ReadFileScatter functions are used, but some special care must be taken. These functions require page-aligned and -sized buffers to be used, and the number of bytes read/written must be a multiple of the disk's sector size. The sector size can be obtained with GetDiskFreeSpace.
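
For the socket case, gathering several of those page-sized blocks into one send might look like this; block0 through block2 are assumed to be filled buffers from the free list:

WSABUF bufs[3];

bufs[0].len = 4096; bufs[0].buf = block0;
bufs[1].len = 4096; bufs[1].buf = block1;
bufs[2].len = 1200; bufs[2].buf = block2; // last block partially full.

OVERLAPPED sendop;
memset(&sendop, 0, sizeof(sendop));

// one syscall and one completion packet for all three buffers.
if(WSASend(sock, bufs, 3, NULL, 0, &sendop, NULL))
{
  if(WSAGetLastError() != WSA_IO_PENDING)
  {
    // error sending.
  }
}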

Non-blocking I/O

Asynchronous operations are key here, but non-blocking I/O still has a place in the world of high performance.

Once an asynchronous operation completes, we can issue non-blocking requests to process any remaining data. Following this pattern reduces the amount of strain on the completion port and helps to keep your operation context hot in the cache for as long as possible.

Care should be taken to not let non-blocking operations continue on for too long, though. If data is received on a socket fast enough and we take so long to process it that there is always more, it could starve other completion notifications waiting to be dequeued.

Throughput or concurrency

A kernel has a limited amount of non-paged memory available to it, and locking one or more pages for each I/O call is a real easy way to use it all up. Sometimes if we expect an extreme number of concurrent connections or if we want to limit memory usage, it can be beneficial to delay locking these pages until absolutely required.

An undocumented feature of WSARecv is that you can request a 0-byte receive, which will complete when data has arrived. Then you can issue another WSARecv or use non-blocking I/O to pull out whatever is available. This lets us get notified when data can be received without actually locking memory.
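
The zero-byte receive itself looks like this; once its completion packet is dequeued we know data is waiting, and only then post a real receive:

WSABUF zerobuf;
OVERLAPPED recvop;
DWORD flags = 0;

zerobuf.len = 0;
zerobuf.buf = NULL;

memset(&recvop, 0, sizeof(recvop));

// completes when data arrives, without locking any buffer pages.
if(WSARecv(sock, &zerobuf, 1, NULL, &flags, &recvop, NULL))
{
  if(WSAGetLastError() != WSA_IO_PENDING)
  {
    // error receiving.
  }
}

// when the completion packet for recvop is dequeued, issue a real
// WSARecv (or a non-blocking recv) to pull out the waiting data.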

Doing this is very much a choice of throughput or concurrency—it will use more CPU, reducing throughput. But it will allow your application to handle a larger number of concurrent operations. It is most beneficial in a low memory system, or on 32-bit Windows when an extreme number of concurrent operations will be used. 64-bit Windows has a much larger non-paged pool, so it shouldn't present a problem if you feed it enough physical memory.

Unbuffered I/O

If you are using the file system a lot, your application might be waiting on the synchronous operating system cache. In this case, enabling unbuffered I/O will let file operations bypass the cache and complete more asynchronously.

This is done by calling CreateFile with the FILE_FLAG_NO_BUFFERING flag. Note that with this flag, all file access must be sector aligned—read/write offsets and sizes. Buffers must also be sector aligned.

Operating system caching can have good effects, so this isn't always advantageous. But if you've got a specific I/O pattern or if you do your own caching, it can give a significant performance boost. This is an easy thing to turn on and off, so take some benchmarks.
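
As a sketch, opening a file for unbuffered overlapped reads looks like this; VirtualAlloc conveniently returns page-aligned memory, which satisfies the sector alignment requirement on typical disks:

HANDLE file = CreateFile(L"file.txt", GENERIC_READ, FILE_SHARE_READ,
   NULL, OPEN_EXISTING,
   FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);

// page-aligned buffer, so it is also sector aligned.
char *buf = (char*)VirtualAlloc(NULL, 4096,
   MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

OVERLAPPED readop;
memset(&readop, 0, sizeof(readop));
readop.Offset = 0; // offsets must also be sector aligned.

if(!ReadFile(file, buf, 4096, NULL, &readop))
{
  if(GetLastError() != ERROR_IO_PENDING)
  {
    // error reading.
  }
}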

Update: this subject continued in I/O Improvements in Windows Vista.

I/O completion ports made easy

I described the basics of I/O completion ports in my last post, but there is still the question of the easiest way to use them. Here I'll show a callback-based application design that I've found makes a fully asynchronous program really simple to do.

I touched briefly on attaching our own context data to the OVERLAPPED structure we pass along with I/O operations. It's this same idea that I'll expand on here. This time, we define a generic structure to use with all our operations, and how our threads will handle them while dequeuing packets:

struct io_context
{
  OVERLAPPED ovl;
  void (*on_completion)(DWORD error, DWORD transferred,
                        struct io_context *ctx);
};

OVERLAPPED *ovl;
ULONG_PTR completionkey;
DWORD transferred;

BOOL ret = GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

if(ret)
{
  struct io_context *ctx = (struct io_context*)ovl;
  ctx->on_completion(ERROR_SUCCESS, transferred, ctx);
}
else if(ovl)
{
  DWORD err = GetLastError();

  struct io_context *ctx = (struct io_context*)ovl;
  ctx->on_completion(err, transferred, ctx);
}
else
{
  // error out
}

With this, all our I/O operations will have a callback associated with them. When a completion packet is dequeued it gets the error information, if any, and runs the callback. Having every I/O operation use a single callback mechanism greatly simplifies the design of the entire program.

Let's say our app was reading a file and sending out its contents. We also want it to prefetch the next buffer so we can start sending right away. Here's our connection context:

struct connection_context
{
  HANDLE file;
  SOCKET sock;

  WSABUF readbuf;
  WSABUF sendbuf;

  struct io_context readctx;
  struct io_context sendctx;
};

A structure like this is nice because initiating an I/O operation will need no allocations. Note that we need two io_context members because we're doing a read and send concurrently.

Now the code to use it:

#define BUFFER_SIZE 4096

void begin_read(struct connection_context *ctx)
{
  if(ctx->readbuf.buf)
  {
    // only begin a read if one isn't already running.
    return;
  }

  ctx->readbuf.buf = malloc(BUFFER_SIZE);
  ctx->readbuf.len = 0;

  // zero out io_context structure.
  memset(&ctx->readctx, 0, sizeof(ctx->readctx));

  // set completion callback.
  ctx->readctx.on_completion = read_finished;

  // note: the OVERLAPPED offset is left at 0 here; a real app would
  // advance it as reads complete.
  ReadFile(ctx->file, ctx->readbuf.buf, BUFFER_SIZE,
           NULL, &ctx->readctx.ovl);
}

void read_finished(DWORD error, DWORD transferred,
                   struct io_context *ioctx)
{
  // get our connection context.
  struct connection_context *ctx =
     (struct connection_context*)((char*)ioctx -
        offsetof(struct connection_context, readctx));

  if(error != ERROR_SUCCESS)
  {
    // handle error.
    return;
  }

  if(!transferred)
  {
    // reached end of file, close out connection.
    free(ctx->readbuf.buf);
    ctx->readbuf.buf = 0;
    return;
  }

  // send out however much we read from the file.

  ctx->readbuf.len = transferred;

  begin_send(ctx);
}

This gives us a very obvious chain of events: read_finished is called when a read completes. Since we only get an io_context structure in our callback, we need to adjust the pointer to get our full connection_context.

Sending is easy too:

void begin_send(struct connection_context *ctx)
{
  if(ctx->sendbuf.buf)
  {
    // only begin a send if one
    // isn't already running.
    return;
  }

  if(!ctx->readbuf.len)
  {
    // only begin a send if the
    // read buffer has something.
    return;
  }

  // switch buffers.
  ctx->sendbuf = ctx->readbuf;

  // clear read buffer.
  ctx->readbuf.buf = NULL;
  ctx->readbuf.len = 0;

  // zero out io_context structure.
  memset(&ctx->sendctx, 0, sizeof(ctx->sendctx));

  // set completion callback.
  ctx->sendctx.on_completion = send_finished;

  WSASend(ctx->sock, &ctx->sendbuf, 1, NULL, 0,
          &ctx->sendctx.ovl, NULL);

  // start reading next buffer.
  begin_read(ctx);
}

void send_finished(DWORD error, DWORD transferred,
                   struct io_context *ioctx)
{
  // get our connection context.
  struct connection_context *ctx =
     (struct connection_context*)((char*)ioctx -
        offsetof(struct connection_context, sendctx));

  if(error != ERROR_SUCCESS)
  {
    // handle error.
    return;
  }

  // success, clear send buffer and start next send.

  free(ctx->sendbuf.buf);
  ctx->sendbuf.buf = NULL;

  begin_send(ctx);
}

Pretty much more of the same. Again for brevity I'm leaving out some error checking code and assuming the buffer gets sent out in full. I'm also assuming a single-threaded design—socket and file functions themselves are thread-safe and have nothing to worry about, but the buffer management code here would need some extra locking since it could be run concurrently. But the idea should be clear.

Update: this subject continued in Tips for efficient I/O.

High Performance I/O on Windows

I/O completion ports were first introduced in Windows NT 4.0 as a highly scalable, multi-processor capable alternative to the then-typical practices of using select, WSAWaitForMultipleEvents, WSAAsyncSelect, or even running a single thread per client. The most optimal way to perform I/O on Windows—short of writing a kernel-mode driver—is to use I/O completion ports.

A recent post on Slashdot claimed sockets have run their course, which I think is absolutely not true! The author seems to believe the Berkeley sockets API is the only way to perform socket I/O, which is nonsense. Much more modern, scalable and high performance APIs exist today such as I/O completion ports on Windows, epoll on Linux, and kqueue on FreeBSD. In light of this I thought I'd write a little about completion ports here.

The old ways of multiplexing I/O still work pretty well for scenarios with a low number of concurrent connections, but when writing a server daemon to handle a thousand or even tens of thousands of concurrent clients at once, we need to use something different. In this sense the old de facto standard Berkeley sockets API has run its course, because the overhead of managing so many connections is simply too great and makes using multiple processors hard.

An I/O completion port is a multi-processor aware queue. You create a completion port, bind file or socket handles to it, and start asynchronous I/O operations. When they complete—either successfully or with an error—a completion packet is queued up on the completion port, which the application can dequeue from multiple threads. The completion port uses some special voodoo to make sure only a specific number of threads can run at once—if one thread blocks in kernel-mode, it will automatically start up another one.

First you need to create a completion port with CreateIoCompletionPort:

HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE,
   NULL, 0, 0);

It's important to note that NumberOfConcurrentThreads is not the total number of threads that can access the completion port—you can have as many as you want. This instead controls the number of threads it will allow to run concurrently. Once you've reached this number, it will block all other threads. If one of the running threads blocks for any reason in kernel-mode, the completion port will automatically start up one of the waiting threads. Specifying 0 for this is equivalent to the number of logical processors in the system.

Next is creating and associating a file or socket handle, using CreateFile, WSASocket, and CreateIoCompletionPort:

#define OPERATION_KEY 1

HANDLE file = CreateFile(L"file.txt", GENERIC_READ,
   FILE_SHARE_READ, NULL, OPEN_EXISTING,
   FILE_FLAG_OVERLAPPED, NULL);

SOCKET sock = WSASocket(AF_INET, SOCK_STREAM, IPPROTO_TCP,
   NULL, 0, WSA_FLAG_OVERLAPPED);

CreateIoCompletionPort(file, iocp, OPERATION_KEY, 0);
CreateIoCompletionPort((HANDLE)sock, iocp, OPERATION_KEY, 0);

Files and sockets must be opened with the FILE_FLAG_OVERLAPPED and WSA_FLAG_OVERLAPPED flags before they are used asynchronously.

Reusing CreateIoCompletionPort for associating file/socket handles is a weird design choice from Microsoft, but that's how it's done. The CompletionKey parameter can be anything you want; it is a value provided when packets are dequeued. I define an OPERATION_KEY here to use as the CompletionKey, the significance of which I'll get to later.

Next we have to start up some I/O operations. I'll skip setting up the socket and go right to sending data. We start these operations using ReadFile and WSASend:

OVERLAPPED readop, sendop;
WSABUF sendwbuf;
char readbuf[256], sendbuf[256];

memset(&readop, 0, sizeof(readop));
memset(&sendop, 0, sizeof(sendop));

sendwbuf.len = sizeof(sendbuf);
sendwbuf.buf = sendbuf;

BOOL readstatus = ReadFile(file, readbuf,
   sizeof(readbuf), NULL, &readop);

if(!readstatus)
{
  DWORD readerr = GetLastError();

  if(readerr != ERROR_IO_PENDING)
  {
    // error reading.
  }
}

int writestatus = WSASend(sock, &sendwbuf, 1, NULL,
   0, &sendop, NULL);

if(writestatus)
{
  int writeerr = WSAGetLastError();

  if(writeerr != WSA_IO_PENDING)
  {
    // error sending.
  }
}

Every I/O operation using a completion port has an OVERLAPPED structure associated with it. Windows uses this internally in unspecified ways, only saying we need to zero them out before starting an operation. The OVERLAPPED structure will be given back to us when we dequeue the completion packets, and can be used to pass along our own context data.

I have left out the standard error checking up until now for brevity's sake, but this one doesn't work quite like one might expect, so it is important here. When starting an I/O operation, an error might not really be an error. If the function succeeds all is well, but if the function fails, it is important to check the error code with GetLastError or WSAGetLastError.

If these functions return ERROR_IO_PENDING or WSA_IO_PENDING, the function actually still completed successfully. All these error codes mean is an asynchronous operation has been started and completion is pending, as opposed to completing immediately. A completion packet will be queued up regardless of the operation completing asynchronously or not.

Dequeuing packets from a completion port is handled by the GetQueuedCompletionStatus function:
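
BOOL WINAPI GetQueuedCompletionStatus(HANDLE CompletionPort,
   LPDWORD lpNumberOfBytes, PULONG_PTR lpCompletionKey,
   LPOVERLAPPED *lpOverlapped, DWORD dwMilliseconds);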

This function dequeues completion packets, consisting of the completion key we specified in CreateIoCompletionPort and the OVERLAPPED structure we gave while starting the I/O. If the I/O transferred any data, it will retrieve how many bytes were transferred too. Again the error handling is a bit weird on this one, having three error states.

This can be run from as many threads as you like, even a single one. It is common practice to run a pool of twice the number of threads as there are logical processors available, to keep the CPU active with the aforementioned functionality of starting a new thread if a running one blocks.

Unless you are going for a single-threaded design, I recommend starting two threads per logical CPU. Even if your app is designed to be 100% asynchronous, you will still run into blocking when locking shared data, and even get unavoidable hidden blocking I/Os like reading in paged out memory. Keeping two threads per logical CPU will keep the processor busy without overloading the OS with too much context switching.

This is all well and good, but two I/O operations were initiated—a file read and a socket send. We need a way to tell their completion packets apart. This is why we need to attach some state to the OVERLAPPED structure when we call those functions:

struct my_context
{
  OVERLAPPED ovl;
  int isread;
};

struct my_context readop, sendop;

memset(&readop.ovl, 0, sizeof(readop.ovl));
memset(&sendop.ovl, 0, sizeof(sendop.ovl));

readop.isread = 1;
sendop.isread = 0;

ReadFile(file, readbuf, sizeof(readbuf), NULL, &readop.ovl);
WSASend(sock, &sendwbuf, 1, NULL, 0, &sendop.ovl, NULL);

Now we can tell the operations apart when we dequeue them:

OVERLAPPED *ovl;
ULONG_PTR completionkey;
DWORD transferred;

GetQueuedCompletionStatus(iocp, &transferred,
   &completionkey, &ovl, INFINITE);

struct my_context *ctx = (struct my_context*)ovl;

if(ctx->isread)
{
  // read completed.
}
else
{
  // send completed.
}

The last important thing to know is how to queue up your own completion packets. This is useful if you want to split an operation up to be run on the thread pool, or if you want to exit a thread waiting on a call to GetQueuedCompletionStatus. To do this, we use the PostQueuedCompletionStatus function:

#define EXIT_KEY 0

struct my_context ctx;

PostQueuedCompletionStatus(iocp, 0, OPERATION_KEY, &ctx.ovl);
PostQueuedCompletionStatus(iocp, 0, EXIT_KEY, NULL);

Here we post two things onto the queue. First, we post our own structure onto the queue, to be processed by our thread pool. Second, we give a new completion key: EXIT_KEY. The thread which processes this packet can test if the completion key is EXIT_KEY to know when it needs to stop dequeuing packets and shut down.

Other than the completion port handle, Windows does not use any of the parameters given to PostQueuedCompletionStatus. They are entirely for our use, to be dequeued with GetQueuedCompletionStatus.

That's all I have to write for now, and this should be everything one would need to get started learning these high performance APIs! I will make another post shortly detailing some good patterns for completion port usage, and some optimization tips to ensure efficient usage of these I/O APIs.

Update: this subject continued in I/O completion ports made easy.

User Mode Scheduling in Windows 7

Don't use threads. Or more precisely, don't over-use them. It's one of the first things fledgling programmers learn after they start using threads. This is because threading involves a lot of overhead. In short, using more threads may improve concurrency, but it will give you less overall throughput as more processing is put into simply managing the threads instead of letting them run. So programmers learn to use threads sparingly.

When normal threads run out of time, or block on something like a mutex or I/O, they hand off control to the operating system kernel. The kernel then finds a new thread to run, and switches back to user-mode to run the thread. This context switching is what User Mode Scheduling looks to alleviate.

User Mode Scheduling can be thought of as a cross between threads and thread pools. An application creates one or more UMS scheduler threads—typically one for each processor. It then creates several UMS worker threads for each scheduler thread. The worker threads are the ones that run your actual code. Whenever a worker thread runs out of time, it is put on the end of its scheduler thread's queue. If a worker thread blocks, it is put on a waiting list to be re-queued by the kernel when whatever it was waiting on finishes. The scheduler thread then takes the worker thread from the top of the queue and starts running it. Like the name suggests, this happens entirely in user-mode, avoiding the expensive user->kernel->user-mode transitions. Letting each thread run for exactly as long as it needs helps to solve the throughput problem. Work is only put into managing threads when absolutely necessary instead of in ever smaller time slices, leaving more time to run your actual code.

A good side effect of this is UMS threads also help to alleviate the cache thrashing problems typical in heavily-threaded applications. Regardless of your data sharing patterns, each thread still needs its own storage for stack space, processor context, and thread-local storage. Every time a context switch happens, some data may need to be pushed out of caches in order to load some kernel-mode code and the next thread's data. By switching between threads less often, cache can be put to better use for the task at hand.

If you have ever had a chance to use some of the more esoteric APIs included with Windows, you might be wondering why we need UMS threads when we have fibers, which offer similar co-operative multitasking. Fibers have a lot of special exceptions. There are things that aren't safe to do with them. Libraries that rely on thread-local storage, for instance, will likely walk all over themselves if used from within fibers. A UMS thread on the other hand is a full fledged thread—they support TLS and have no real special considerations to keep in mind while using them.

I still wouldn't count out thread pools just yet. UMS threads are still more expensive than a thread pool, and the large memory requirements of a thread still apply here, so things like per-client threads in internet daemons are still out of the question if you want to be massively scalable. More likely, UMS threads will be most useful for building thread pools. Most thread pools launch two or three threads per CPU to help stay busy when given blocking tasks, and UMS threads will at least help keep their time slice usage optimal.

From what I understand the team behind Microsoft's Concurrency Runtime, to be included with Visual C++ 2010, was one of the primary forces behind UMS threads. They worked very closely with the kernel folks to find the most scalable way to enable the super-parallel code that will be possible with the CR.

WCF is pretty neat

I haven't worked with .NET extensively since a little bit after 2.0 was released, so when I took on a new job developing with it, I had some catching up to do. WCF was the easy part. In fact, I'm really enjoying using it. I can tell they put a lot of thought into making it scalable.

For those that don't know, WCF is Microsoft's new web services framework, meant to replace the old Remoting stuff in .NET 2.0. It lets you worry about writing code—classes and methods etc.—and it manages transforming it into SOAP and WSDL in the background.

The coolest thing about WCF is the support for completely async design. You start a database query, put the method call into the background, and resume it when the database query is done. This allows the server to run thousands of clients in only a couple threads, improving cache and memory usage greatly.

One funny thing I learned from this is that ASP.NET has full async support too, it just doesn't get a lot of advertising for some reason. The one thing that annoys me about all modern web development frameworks is the lack of async support, making you pay for 20 servers when you should only need one, and here it was under my nose all the time. Imagine that!

Scalability isn’t everything

In the beginning, you write threaded apps with great ignorance to scalability. That's usually okay — most apps don't need it, but sooner or later you will come across a problem that demands it. With enough searching, you will come across lock–free algorithms. Tricky to get right, but promising fantastic scalability if you do.

Even trickier, though, is knowing when to not use them. Lock–free algorithms come with a price: although they are indeed very scalable, their performance can be much worse than a well designed algorithm for single–threaded applications. Do a little benchmarking and you might find something surprising: the performance hit can actually be so large that a simple locked single–threaded algorithm with no scalability will give better overall performance than a 100% scalable lock–free version.

This is more common than you might think. Take a queue. A single–threaded version will typically have very minimal memory overhead: maybe a pointer for every n objects. A lock–free version will need two pointers for every object (or one, if you use a GC). Now the amount of overhead greatly depends on what your object is. If your object is large, a lock–free queue will probably be a better choice. But if your object is small—say one or two pointers—the overhead can be enough that cache misses will significantly affect your application.

I recently had to tackle this problem. My application needed a queue of small objects, and on a modern quad–core CPU the cache misses were hurting performance so much that although a lock–free queue did have near 100% scalability, the overall operation was completing 165% faster with a locked queue with zero scalability.

The next best thing is to combine the best of both worlds: design a queue with low overhead and medium scalability. Using a reader–writer lock with a combination of lock–free operations, I came up with a queue that only needs to do a full lock once every 32 or 64 operations. The result? Scalability 5% lower than a lock–free queue, with overall performance 210% better.

OK, I'll admit it: I cheated, somewhat. Lock–free algorithms are good for more than just scalability. They also offer immunity to nasty effects like deadlock, livelock, and priority inversion. In my case I wasn't in a situation to worry about these, but you might be. The lesson here is to know your situation and decide carefully, and don't trust what others tell you: always try things yourself and profile.