Tips for efficient I/O

There are a few things to keep in mind for I/O that can have a dramatic effect on performance and scalability. None of it involves a new API; it is all about how you use the existing ones.

Reduce blocking

The whole point of I/O completion ports is to perform operations asynchronously, keeping the CPU busy with work while waiting for completions. Blocking defeats this by stalling the thread, so it should be avoided whenever possible. Completion ports have a built-in mechanism that makes blocking hurt less by waking a waiting thread when another one blocks, but it is still better to avoid blocking altogether.

This includes memory allocation. Standard system allocators usually have several points where they need to take a lock to allow concurrent use, so applications should use custom allocators such as arenas and pools where possible.

This means I/O should always be performed asynchronously, lock-free algorithms should be used when available, and any remaining locks should be placed as carefully as possible. Careful application design goes a long way here. The toughest area I've found is database access: I have yet to see a well-designed database client library. Every one that I've seen has forced you to block at some point.

Ideally, the only place the application would block is when retrieving completion packets.

Buffer size and alignment

I/O routines like to lock the pages of the buffers you pass in. That is, they pin them in physical memory so that they can't be paged out to a swap file.

The result is that if you pass in a 20-byte buffer, on most systems it will lock a full 4096-byte page in memory. Even worse, if the 20-byte buffer has 10 bytes in one page and 10 bytes in another, it will lock both pages: 8192 bytes of memory for a 20-byte I/O. With a large number of concurrent operations, this can waste a lot of memory!
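The arithmetic is easy to check for any buffer. The helper below is only an illustration (the name pages_spanned is mine, and the 4096-byte default is an assumption about the page size); it counts how many pages a buffer touches, which is how many pages end up pinned:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of pages that must be locked to cover `length` bytes starting at
// address `addr`. `page_size` must be a power of two (4096 is an assumption;
// query the real value from the system in production code).
inline std::size_t pages_spanned(std::uintptr_t addr, std::size_t length,
                                 std::size_t page_size = 4096) {
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    if (length == 0) return 0;
    const std::uintptr_t first = addr / page_size;                 // page of first byte
    const std::uintptr_t last = (addr + length - 1) / page_size;   // page of last byte
    return static_cast<std::size_t>(last - first + 1);
}
```

A 20-byte buffer that starts 10 bytes before a page boundary gives 2 here, while the same buffer at the start of a page gives 1.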

Because of this, I/O buffers should get special treatment. Functions like malloc() and operator new() should not be used because they have no obligation to provide page alignment for I/O. I like to use VirtualAlloc to allocate large blocks of 1 MiB and divide them into smaller fixed-size blocks (usually page-sized, or 4 KiB) that go on a free list.
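A minimal sketch of such a pool might look like the following. The BufferPool class and its member names are hypothetical, the 4 KiB page size is assumed rather than queried, and the pool is not thread-safe as written. The non-Windows branch uses posix_memalign purely as a stand-in so the sketch can be compiled elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>
#ifdef _WIN32
#  include <windows.h>
#else
#  include <stdlib.h>  // posix_memalign, used only as a non-Windows stand-in
#endif

// Hypothetical pool: carves 1 MiB chunks into page-sized blocks on a free list.
class BufferPool {
public:
    static constexpr std::size_t kBlockSize = 4096;         // assumed page size
    static constexpr std::size_t kChunkSize = 1024 * 1024;  // 1 MiB per chunk

    BufferPool() = default;
    BufferPool(const BufferPool&) = delete;
    BufferPool& operator=(const BufferPool&) = delete;
    ~BufferPool() {
        for (void* chunk : chunks_) free_chunk(chunk);
    }

    // Returns a page-aligned block of kBlockSize bytes, or nullptr on failure.
    void* acquire() {
        if (free_ == nullptr && !grow()) return nullptr;
        Node* node = free_;
        free_ = node->next;
        return node;
    }

    // Puts a block obtained from acquire() back on the free list.
    void release(void* block) {
        assert(reinterpret_cast<std::uintptr_t>(block) % kBlockSize == 0);
        free_ = new (block) Node{free_};  // the free block stores the list link
    }

    std::size_t chunk_count() const { return chunks_.size(); }

private:
    struct Node { Node* next; };

    bool grow() {
        chunks_.reserve(chunks_.size() + 1);  // so push_back below cannot throw
        void* chunk = alloc_chunk();
        if (chunk == nullptr) return false;
        chunks_.push_back(chunk);
        char* base = static_cast<char*>(chunk);
        for (std::size_t off = 0; off < kChunkSize; off += kBlockSize)
            release(base + off);
        return true;
    }

    static void* alloc_chunk() {
#ifdef _WIN32
        return VirtualAlloc(nullptr, kChunkSize, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
#else
        void* p = nullptr;
        return posix_memalign(&p, kBlockSize, kChunkSize) == 0 ? p : nullptr;
#endif
    }

    static void free_chunk(void* chunk) {
#ifdef _WIN32
        VirtualFree(chunk, 0, MEM_RELEASE);
#else
        free(chunk);
#endif
    }

    std::vector<void*> chunks_;
    Node* free_ = nullptr;
};
```

Because every block starts on a page boundary and is exactly one page long, an I/O on one block never pins more than one page. A real server would want a lock-free or per-thread free list in place of the plain pointer used here.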

If buffers are not a multiple of the system page size, extra care should be taken to allocate buffers in a way that keeps them in separate pages from non-buffer data. This will make sure page locking will incur the least amount of overhead while performing I/O.

Limit the number of I/Os

System calls and completion ports have some expense, so limiting the number of I/O calls you make can help increase throughput. We can use scatter/gather operations to chain several of those page-sized blocks into a single I/O.

A scatter operation takes data from one source, like a socket, and scatters it into multiple buffers. Inversely, a gather operation takes data from multiple buffers and gathers it into one destination.

For sockets, this is easy: we just use the normal WSASend and WSARecv functions, passing in multiple WSABUF structures.
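As a rough sketch of the socket case, the helper below describes a payload spread over several fixed-size blocks as one WSABUF array, and a Windows-only wrapper hands that array to a single overlapped WSASend. The names make_wsabufs and send_gather are hypothetical. The WSABUF definition in the non-Windows branch is a stand-in that mirrors the Winsock layout so the helper can be compiled and tested elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _WIN32
#  include <winsock2.h>
#else
// Stand-in mirroring Winsock's WSABUF; not part of any non-Windows API.
struct WSABUF { unsigned long len; char* buf; };
#endif

// Describe `total` bytes of payload stored in consecutive fixed-size blocks.
// Every entry is a full block except possibly the last one.
inline std::vector<WSABUF> make_wsabufs(const std::vector<char*>& blocks,
                                        std::size_t block_size, std::size_t total) {
    assert(total <= blocks.size() * block_size);
    std::vector<WSABUF> out;
    for (std::size_t i = 0; total > 0; ++i) {
        const std::size_t n = total < block_size ? total : block_size;
        WSABUF b;
        b.len = static_cast<unsigned long>(n);
        b.buf = blocks[i];
        out.push_back(b);
        total -= n;
    }
    return out;
}

#ifdef _WIN32
// Queue one overlapped send covering every block. Returns true when the send
// completed immediately or is pending; the completion arrives on the port
// either way. Keep the blocks (and, to be safe, the array) alive until then.
inline bool send_gather(SOCKET s, std::vector<WSABUF>& bufs, WSAOVERLAPPED* ov) {
    const int rc = WSASend(s, bufs.data(), static_cast<DWORD>(bufs.size()),
                           nullptr, 0, ov, nullptr);
    return rc == 0 || WSAGetLastError() == WSA_IO_PENDING;
}
#endif
```

With 4 KiB blocks, a 10000-byte payload becomes three entries of 4096, 4096, and 1808 bytes, sent with one system call instead of three.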

For files it is a little more complex. Here the WriteFileGather and ReadFileScatter functions are used, but some special care must be taken. The file must be opened with FILE_FLAG_OVERLAPPED and FILE_FLAG_NO_BUFFERING, the buffers must be page-aligned and page-sized, and the number of bytes read or written must be a multiple of the disk's sector size. The sector size can be obtained with GetDiskFreeSpace.
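The size requirement comes down to a round-up. In the sketch below, round_up_to_sector and sector_size_of are hypothetical helper names. The second one only compiles on Windows and wraps GetDiskFreeSpaceW for a volume root such as L"C:\\".

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#  include <windows.h>
#endif

// Round `bytes` up to the next multiple of `sector_size`.
inline std::uint64_t round_up_to_sector(std::uint64_t bytes, std::uint32_t sector_size) {
    assert(sector_size != 0);
    return (bytes + sector_size - 1) / sector_size * sector_size;
}

#ifdef _WIN32
// Bytes per sector for the volume rooted at `root`, or 0 if the query fails.
inline std::uint32_t sector_size_of(const wchar_t* root) {
    DWORD sectors_per_cluster = 0, bytes_per_sector = 0;
    DWORD free_clusters = 0, total_clusters = 0;
    if (!GetDiskFreeSpaceW(root, &sectors_per_cluster, &bytes_per_sector,
                           &free_clusters, &total_clusters))
        return 0;
    return bytes_per_sector;
}
#endif
```

With 512-byte sectors, a 513-byte request has to be issued as 1024 bytes. The application then needs to track the logical length separately.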

Non-blocking I/O

Asynchronous operations are key here, but non-blocking I/O still has a place in the world of high performance.

Once an asynchronous operation completes, we can issue non-blocking requests to process any remaining data. Following this pattern reduces the strain on the completion port and helps keep your operation context hot in the cache for as long as possible.

Care should be taken not to let non-blocking operations run for too long, though. If data arrives on a socket fast enough, and we take long enough to process it that there is always more waiting, the loop can starve other completion notifications waiting to be dequeued.
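One way to keep such a loop bounded is a byte budget. The sketch below is an illustration, not a fixed recipe. The function name drain and the try_read callable are hypothetical, and try_read is assumed to wrap a non-blocking read that returns the number of bytes it read and processed, or 0 when the call would block. Handling a closed connection is left out.

```cpp
#include <cassert>
#include <cstddef>

// Keep issuing non-blocking reads until the source would block or roughly
// `byte_budget` bytes have been handled, whichever comes first. Returns the
// number of bytes consumed; the last read may overshoot the budget.
template <typename TryRead>
std::size_t drain(TryRead&& try_read, std::size_t byte_budget) {
    std::size_t total = 0;
    while (total < byte_budget) {
        const std::size_t n = try_read();
        if (n == 0) break;  // would block: nothing more to take right now
        total += n;
    }
    return total;
}
```

Whichever way the loop ends, the caller posts a fresh asynchronous receive and goes back to dequeuing completions, so a busy socket waits its turn like everything else.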

Throughput or concurrency

A kernel has a limited amount of non-paged memory available to it, and locking one or more pages for each I/O call is an easy way to use it all up. If we expect an extreme number of concurrent connections, or if we want to limit memory usage, it can be beneficial to delay locking these pages until absolutely required.

An undocumented feature of WSARecv is that you can request a 0-byte receive, which will complete when data has arrived. Then you can issue another WSARecv or use non-blocking I/O to pull out whatever is available. This lets us get notified when data can be received without actually locking memory.
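Posting such a receive might look like the sketch below. RecvOp, zero_byte_wsabuf, and post_zero_byte_recv are hypothetical names, and the WSABUF definition in the non-Windows branch is only a stand-in mirroring the Winsock layout so the small helper compiles elsewhere. The WSARecv call itself is Windows-only.

```cpp
#include <cassert>
#ifdef _WIN32
#  include <winsock2.h>
#else
// Stand-in mirroring Winsock's WSABUF; not part of any non-Windows API.
struct WSABUF { unsigned long len; char* buf; };
#endif

// A buffer descriptor with no memory behind it, so there is nothing to lock.
inline WSABUF zero_byte_wsabuf() {
    WSABUF b;
    b.len = 0;
    b.buf = nullptr;
    return b;
}

#ifdef _WIN32
// Hypothetical per-operation context; it must outlive the pending receive.
struct RecvOp {
    WSAOVERLAPPED ov;
    WSABUF buf;
};

// Post a 0-byte overlapped receive. Returns true when it completed
// immediately or is pending; the completion is delivered to the port once
// data has arrived on the socket.
inline bool post_zero_byte_recv(SOCKET s, RecvOp* op) {
    assert(op != nullptr);
    op->ov = WSAOVERLAPPED{};
    op->buf = zero_byte_wsabuf();
    DWORD flags = 0;
    const int rc = WSARecv(s, &op->buf, 1, nullptr, &flags, &op->ov, nullptr);
    return rc == 0 || WSAGetLastError() == WSA_IO_PENDING;
}
#endif
```

When the completion for this operation is dequeued, the handler takes a real buffer from the pool and reads into it.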

Doing this is very much a choice between throughput and concurrency: it will use more CPU, reducing throughput, but it will allow your application to handle a larger number of concurrent operations. It is most beneficial on a low-memory system, or on 32-bit Windows when an extreme number of concurrent operations will be in flight. 64-bit Windows has a much larger non-paged pool, so it shouldn't present a problem if you feed it enough physical memory.

Unbuffered I/O

If you are using the file system a lot, your application might be waiting on the synchronous operating system cache. In this case, enabling unbuffered I/O will let file operations bypass the cache and complete more asynchronously.

This is done by calling CreateFile with the FILE_FLAG_NO_BUFFERING flag. Note that with this flag, all file access must be sector-aligned: read/write offsets, sizes, and buffer addresses alike.
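A small sketch of both halves follows. The first function is a check that a request meets the alignment rules, and the second is a Windows-only open call. The names is_sector_aligned and open_unbuffered are hypothetical, and combining the flag with FILE_FLAG_OVERLAPPED and read/write access is my assumption about how the file will be used with a completion port.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#  include <windows.h>
#endif

// True when the offset, the size, and the buffer address are all multiples
// of the sector size, as unbuffered access requires.
inline bool is_sector_aligned(std::uint64_t offset, std::uint64_t size,
                              const void* buffer, std::uint32_t sector_size) {
    assert(sector_size != 0);
    return offset % sector_size == 0 && size % sector_size == 0 &&
           reinterpret_cast<std::uintptr_t>(buffer) % sector_size == 0;
}

#ifdef _WIN32
// Open an existing file for unbuffered, overlapped access.
// Returns INVALID_HANDLE_VALUE on failure.
inline HANDLE open_unbuffered(const wchar_t* path) {
    return CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, nullptr, OPEN_EXISTING,
                       FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, nullptr);
}
#endif
```

Page-aligned, page-sized buffers satisfy the buffer rule whenever the page size is a multiple of the sector size, which is the usual case with 512-byte or 4096-byte sectors.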

Operating system caching can have good effects, so this isn't always advantageous. But if you've got a specific I/O pattern or if you do your own caching, it can give a significant performance boost. This is an easy thing to turn on and off, so take some benchmarks.

Update: this subject is continued in I/O Improvements in Windows Vista.
