Dmitry Soshnikov · Apr 21, 2017

Good write-up! That said, implementing tokenizers/parsers manually makes sense mostly when you're learning these topics, or when you have a really complex grammar.

For practical use, tokenizers/parsers can also be generated automatically instead of written by hand. Tokenizers (scanners/lexers) use the formalism of regular grammars, which is fully covered by regular expressions.

For example, using the Syntax tool (a language-agnostic parser generator), you can build a tokenizer automatically just by providing a specification for it:

// file: ~/test.lex

{
  rules: [
    [`\\s+`,      `/* skip whitespace */`],
    [`\\d+`,      `return 'NUMBER'`],
    [`\\(`,       `return 'L_PAREN'`],
    [`\\)`,       `return 'R_PAREN'`],
    [`[*+\\-/]`,  `return 'OPERATOR'`],
  ],
}

That’s all the code you need to implement a tokenizer: recognizers for regular grammars can be generated mechanically, so this boilerplate doesn’t have to be written by humans (similar to how we don’t write binary code by hand, but let compilers generate it for us).
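Under the hood, such a tokenizer is essentially a regex-driven loop. For illustration, here is a minimal hand-rolled sketch of that boilerplate in JavaScript — this is not the code Syntax actually emits, and the `tokenize` helper is just a name for the example:

// A minimal hand-rolled tokenizer for the same rules — a sketch of the
// boilerplate a generator writes for you, NOT the code Syntax emits.
const rules = [
  [/^\s+/,      null],        // skip whitespace
  [/^\d+/,      'NUMBER'],
  [/^\(/,       'L_PAREN'],
  [/^\)/,       'R_PAREN'],
  [/^[*+\-/]/,  'OPERATOR'],
];

function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    const rest = input.slice(pos);
    const rule = rules.find(([re]) => re.test(rest));
    if (!rule) {
      throw new SyntaxError(`Unexpected token at offset ${pos}`);
    }
    const [re, type] = rule;
    const value = rest.match(re)[0];
    if (type) {
      tokens.push({ type, value, startOffset: pos, endOffset: pos + value.length });
    }
    pos += value.length;
  }
  return tokens;
}

console.log(tokenize('2 + 5'));

A generated tokenizer additionally tracks line/column positions, as you can see in the token list below.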

Then feeding this lexical grammar to Syntax, you can easily get a list of tokens:

syntax-cli --lex ~/test.lex --tokenize -p '2 + 5'

which produces:

List of tokens:

[
  {
    "type": "NUMBER",
    "value": "2",
    "startOffset": 0,
    "endOffset": 1,
    "startLine": 1,
    "endLine": 1,
    "startColumn": 0,
    "endColumn": 1
  },
  {
    "type": "OPERATOR",
    "value": "+",
    "startOffset": 2,
    "endOffset": 3,
    "startLine": 1,
    "endLine": 1,
    "startColumn": 2,
    "endColumn": 3
  },
  {
    "type": "NUMBER",
    "value": "5",
    "startOffset": 4,
    "endOffset": 5,
    "startLine": 1,
    "endLine": 1,
    "startColumn": 4,
    "endColumn": 5
  }
]

You can take a look at the calculator grammar example to build an actual parser/interpreter.
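For a taste of what that looks like, here is a rough sketch of a calculator grammar in the same style as the lex spec above. The `operators`/`bnf` field names and the `$$`/`$n` semantic-action notation follow the Yacc/Jison tradition; treat this as an assumption and consult the Syntax repo’s actual calculator example for the authoritative format:

// file: ~/calc.g — a rough sketch, not the repo’s actual example;
// see the Syntax repository’s calculator grammar for the real thing.
{
  lex: {
    rules: [
      [`\\s+`,  `/* skip whitespace */`],
      [`\\d+`,  `return 'NUMBER'`],
    ],
  },

  operators: [[`left`, `+`, `-`], [`left`, `*`, `/`]],

  bnf: {
    E: [
      [`E + E`,   `$$ = $1 + $3`],
      [`E - E`,   `$$ = $1 - $3`],
      [`E * E`,   `$$ = $1 * $3`],
      [`E / E`,   `$$ = $1 / $3`],
      [`( E )`,   `$$ = $2`],
      [`NUMBER`,  `$$ = Number($1)`],
    ],
  },
}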
