Efficient stream parsing in C++

A while ago I wrote about cre­at­ing a good parser and while the non-block­ing idea was spot-on, the rest of it re­ally isn’t very good in C++ where we have the power of tem­plates around to help us.

I’m cur­rently fin­ish­ing up a HTTP li­brary and have been re­vis­ing my views on stream pars­ing be­cause of it. I’m still not en­tirely set on my over­all im­ple­men­ta­tion, but I’m near­ing com­ple­tion and am ready to share my ideas. First, I’ll list my re­quire­ments:

To ac­com­plish this I broke this out into three lay­ers: a core parser, a buffer, and a buffer parser.

The core parser

De­sign­ing the core parser was sim­ple. I be­lieve I al­ready have a solid C++ parser de­sign in my XML li­brary, so I’m reusing that con­cept. This is fully in-situ pull parser that op­er­ates on a range of bidi­rec­tional it­er­a­tors and re­turns back a sub-range of those it­er­a­tors. The pull func­tion re­turns ok when it parses a new el­e­ment, done when it has reached a point that could be con­sid­ered an end of the stream, and need_more when an el­e­ment can’t be ex­tracted from the passed in it­er­a­tor range. Using this parser is pretty sim­ple:

typedef std::deque<char> buffer_type;
typedef http::parser<buffer_type::iterator> parser_type;

buffer_type buffer;

parser_type p;
parser_type::node_type n;
parser_type::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  std::deque<char>::iterator first = buffer.begin();

  while((r = p.parse(first, buffer.end(), n)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
    }
  }

  buffer.erase(buffer.begin(), first); // remove all the used
                                       // data from the buffer.
} while(r == http::result_types::need_more);

By let­ting the user pass in a new range of it­er­a­tors to parse each time, we have the op­tion of up­dat­ing the stream with more data when need_more is re­turned. The parse() func­tion also up­dates the first it­er­a­tor so that we can pop any data prior to it from the data stream.

By de­fault the parser will throw an ex­cep­tion when it en­coun­ters an error. This can be changed by call­ing an over­load and han­dling the error re­sult type:

typedef std::deque<char> buffer_type;
typedef http::parser<buffer_type::iterator> parser_type;

buffer_type buffer;

parser_type p;
parser_type::node_type n;
parser_type::error_type err;
parser_type::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  std::deque<char>::iterator first = buffer.begin();

  while((r = p.parse(first, buffer.end(), n, err)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
    }
  }

  buffer.erase(buffer.begin(), first); // remove all the used
                                       // data from the buffer.
} while(r == http::result_types::need_more);

if(r == http::result_types::error)
{
  std::cerr
    << "an error occured at "
    << std::distance(buffer.begin(), err.position())
    << ": "
    << err.what()
    << std::endl;
}

The buffer

Ini­tially I was test­ing my parser with a deque<char> like above. This let me test the it­er­a­tor-based parser very eas­ily by in­cre­men­tally push­ing data on, pars­ing some of it, and pop­ping off what was used. Un­for­tu­nately, using a deque means we al­ways have an extra copy, from an I/O buffer into the deque. It­er­at­ing a deque is also a lot slower than it­er­at­ing a range of point­ers be­cause of the way deque is usu­ally im­ple­mented. This in­ef­fi­ciency is ac­cept­able for test­ing, but just won't work in a live app.

My buffer class is I/O- and pars­ing-op­ti­mized, op­er­at­ing on pages of data. It al­lows pages to be in­serted di­rectly from I/O with­out copy­ing. Ones that weren't filled en­tirely can still be filled later, al­low­ing the user to com­mit more bytes of a page as read­able. One can use scat­ter/gather I/O to make op­er­a­tions span mul­ti­ple pages con­tained in a buffer.

The buffer ex­poses two types of it­er­a­tors. The first type is what we are used to in deque: just a gen­eral byte stream it­er­a­tor. But this in­curs the same cost as deque: each in­cre­ment to the it­er­a­tor must check if it's at the end of the cur­rent page and move to the next. A pro­to­col like HTTP can fit a lot of el­e­ments into a sin­gle 4KiB page, so it doesn't make sense to have this cost. This is where the sec­ond it­er­a­tor comes in: the page it­er­a­tor. A page can be thought of as a Range rep­re­sent­ing a sub­set of the data in the full buffer. Over­all the buffer class looks some­thing like this:

struct page
{
  const char *first;    // the first byte of the page.
  const char *last;     // one past the last byte of the page.
  const char *readpos;  // the first readable byte of the page.
  const char *writepos; // the first writable byte of the page,
                        // one past the last readable byte.
};

class buffer
{
public:
  typedef ... size_type;
  typedef ... iterator;
  typedef ... page_iterator;

  void push(page *p); // pushes a page into the buffer.  might
                      // be empty, semi-full, or full.

  page* pop(); // pops the first fully read page from from the buffer.

  void commit_write(size_type numbytes); // merely moves writepos
                                         // by some number of bytes.

  void commit_read(size_type numbytes); // moves readpos by
                                        // some number of bytes.

  iterator begin() const;
  iterator end() const;

  page_iterator pages_begin() const;
  page_iterator pages_end() const;
};

One thing you may no­tice is it ex­pects you to push() and pop() pages di­rectly onto it, in­stead of al­lo­cat­ing its own. I re­ally hate classes that al­lo­cate mem­ory – in terms of scal­a­bil­ity the fewer places that al­lo­cate mem­ory, the eas­ier it will be to op­ti­mize. Be­cause of this I al­ways try to de­sign my code to – if it makes sense – have the next layer up do al­lo­ca­tions. When it doesn't make sense, I doc­u­ment it. Hid­den al­lo­ca­tions are the root of evil.

The buffer parser

Un­like the core parser, the buffer parser isn't a tem­plate class. The buffer parser ex­poses the same func­tion­al­ity as a core parser, but using a buffer in­stead of it­er­a­tor ranges.

This is where C++ gives me a big ad­van­tage. The buffer parser is ac­tu­ally im­ple­mented with two core parsers. The first is a very fast http::parser<const char*>. It uses this to parse as much of a sin­gle page as pos­si­ble, stop­ping when it en­coun­ters need_more and no more data can be added to the page. The sec­ond is a http::parser<buffer::iterator>. This gets used when the first parser stops, which will hap­pen very in­fre­quently – only when a HTTP el­e­ment spans mul­ti­ple pages.

This is fairly easy to im­ple­ment, but re­quired a small change to my core parser con­cept. Be­cause each has sep­a­rate in­ter­nal state, I needed to make it so I could move the state be­tween two parsers that use dif­fer­ent it­er­a­tors. The amount of state is ac­tu­ally very small, mak­ing this a fast op­er­a­tion.

The buffer parser works with two dif­fer­ent it­er­a­tor types in­ter­nally, so I chose to al­ways re­turn a buffer::iterator range. The choice was ei­ther that or silently copy el­e­ments span­ning mul­ti­ple pages, and this way lets the user of the code de­cide how they want to han­dle it.

Using the buffer parser is just as easy as be­fore:

http::buffer buffer;
http::buffer_parser p;
http::buffer_parser::node_type n;
http::buffer_parser::result_type r;

do
{
  push_data(buffer); // add data to buffer from whatever I/O source.

  while((r = p.parse(buffer, n)) == http::result_types::ok)
  {
    switch(n.type)
    {
      case http::node_types::method:
      case http::node_types::uri:
      case http::node_types::version:
    }
  }

  pop_used(buffer); // remove all the used
                    // data from the buffer.
} while(r == http::result_types::need_more);

The I/O layer

I'm leav­ing out an I/O layer for now. I will prob­a­bly write sev­eral small I/O sys­tems for it once I'm sat­is­fied with the parser. Per­haps one using asio, one using I/O com­ple­tion ports, and one using epoll. I've de­signed this from the start to be I/O ag­nos­tic but with op­ti­miza­tions that fa­cil­i­tate ef­fi­cient forms of all I/O, so I think it could be an good bench­mark of the var­i­ous I/O sub­sys­tems that dif­fer­ent plat­forms pro­vide.

One idea I've got is to use Winsock Ker­nel to im­ple­ment a ker­nel-mode HTTPd. Not a very good idea from a se­cu­rity stand­point, but would still be in­ter­est­ing to see the ef­fects on per­for­mance. Be­cause the parser per­forms no al­lo­ca­tion, no I/O calls, and doesn't force the use of ex­cep­tions, it should ac­tu­ally be very sim­ple to use in ker­nel-mode.

Posted on September 01, 2009 in C++, Coding, Parsing, Scalability

Related Posts