Monday, October 29, 2012

Don't Parse with RegEx

At some point, I think all developers have tried to use regular expressions to parse input. It may work at first, but it quickly becomes unwieldy for all but the most trivial of inputs.

Engineers learned early on that it simplifies things drastically to separate the compiler into separate modules – first find the individual tokens, then use a grammar to see if the tokens are arranged correctly, then assign semantic meaning to the statements, and so forth.

Regular expressions are great for getting the individual tokens. This is the scanning, or lexing phase. With the tokens in hand, we're ready to do the actual parsing. To do this properly, we need to specify a formal grammar. This is typically done in BNF. Grammars can deal with nesting and other constructs that regular expressions don't handle well. For example, imagine trying to match up opening and closing sets of parentheses using regular expressions. This is trivial to specify in a grammar.

Separation of concerns is another advantage of doing it this way. If we truly follow the single responsibility principle, we have no justification for trying to lex and parse at the same time.

Regular expressions are powerful tools by themselves, but don't abuse them – especially if you're dealing with a full-fledged domain-specific language.