Writing a good parser

From a sim­ple bi­nary pro­to­col used over sock­ets to a com­plex XML doc­u­ment, many ap­pli­ca­tions de­pend on pars­ing. Why, then, do the great ma­jor­ity of parsers out there just plain suck?

I’ve come to be­lieve a good parser should have two prop­er­ties:

Sev­eral parsers meet the first re­quire­ment, but al­most none meet the sec­ond: they ex­pect their input to come in with­out a break in ex­e­cu­tion. So how do you ac­com­plish this? Lately I’ve been em­ploy­ing this fairly sim­ple de­sign:

struct buffer {
  struct buffer *next;
  char *buf;
  size_t len;
};

struct parser {
  struct buffer *input;
  struct buffer *lastinput;
  struct buffer *output;
  int (*func)(struct parser*, struct *state);
};

enum {
  CONTINUE,
  NEEDMORE,
  GOTFOO,
  GOTBAR
};

int parse(struct parser *p) {
  int ret;
  while((ret = p->func(p)) == CONTINUE);

  return ret;
}

The idea should be easy to un­der­stand:

  1. Add buffer(s) to input queue.
  2. If parse() re­turns NEEDMORE, add more input to the queue and call it again.
  3. If parse() re­turns GOTFOO or GOTBAR, state is filled with data.
  4. The func­tion pointer is con­tin­u­ally up­dated with a parser spe­cial­ized for the cur­rent data: in this case, a FOO or a BAR, or even just bits and pieces of a FOO or a BAR. It re­turns CONTINUE if parse() should just call a new func­tion pointer.
  5. As the parser func­tion point­ers eat at the input queue, put used buffers into the out­put stack.

Other than meet­ing my two re­quire­ments above, the best thing about this de­sign? It doesn’t sac­ri­fice clean­li­ness, and it won’t cause your code size to in­crease.

Posted on January 02, 2008 in C, Coding, Parsing

Related Posts