A lexer (or scanner) is a tool in programming language design that is responsible for breaking an input string into individual tokens. These tokens can then be handed to a parser for further processing. For example, consider the following statement in a hypothetical language:
double := 2 * x //double the number
Here, our lexer might give back the following tokens:
ID (double), ASSIGN, NUMBER, TIMES, ID (x)
Note that the lexer discarded the whitespace and the comment. The rest of the input was matched to a predefined token.
Most lexers use regular expressions to specify the matching rules. Each of these rules, when matched, returns some specified token. Here is how it would look in Token.JS:
var lexer = new TokenJS.Lexer( 'double := 2 * x //double the number', { root: [ [/\/\/.*/, TokenJS.Ignore], //ignore comments [/\s+/, TokenJS.Ignore], //ignore whitespace [/[a-zA-Z]+/, 'ID'], [/[0-9]+/, 'NUMBER'], [/:=/, 'ASSIGN'], [/\*/, 'TIMES'] ] }); var tokens = lexer.tokenize();
Even if you're not writing a compiler, lexing has many other uses. For starters, imagine you wanted to write a syntax highlighter for a language. This would be trivial with a lexer. You would simply specify a regular expression for identifiers, literals, reserved words, etc. in the given language. The 'token' in each case could be a color to highlight the match with.
In general, string processing is a big part of many applications, so it's good to know what tools you have at your disposal.