Building a RegExp machine. Part 1: Regular grammars
In this series of articles we’re going to study a theory and implementation behind the formalism of regular expressions, — the technique used for text processing, implementation of tokenizers, etc.
Audience: experienced programmers, professionals.
You can also see more details, and enroll to full course.
Note: this is not a series on “how to use regexp”, but rather on “how to implement regexp”. You should already know what regexp is, and use it on practice.
The first part of the series is the most theoretical (so bear with me :)). There will be no any implementation of NFA/DFA in this part, and we’ll mostly be talking about formal grammars (and specifically regular grammars used for regexp). However, that’s the amount of theory we will need for full understanding of how regexp machines work under the hood.
Before reading this article, you can also watch an introduction video for this class:
With that’s being said, let’s get started with the basic definitions of “symbols” and “alphabets”.
Symbols, alphabets, and languages
From the Theory of computation (TOC), and linguistics languages consist of strings, which in turn consist of symbols, which in turn belong to some alphabet.
For example, we can have an alphabet ∑
containing only two symbols, a
and b
:
∑ = {a, b}
The set of strings which we can generate using this alphabet e.g. are:
strings(∑) = {a, aa, b, ab, ba, bba, …}
In this case we have an unbounded set of strings, which is infinite — based on its “length” property: we can have strings of any length, and this set in theory is infinite.
This “set of strings” is called a language.
Usually though languages apply some restrictions (or rules) over the strings they accept.
For example, using the same alphabet we can have a language L
which accepts only “strings of length 2". In this case our set contains only four strings:
L: "strings from ∑ of length 2"L(∑) = {aa, bb, ab, ba}
And the question “whether a string belongs to a language or not” is the subject of TOC, and formal grammars. The tools like regular expressions (which in fact also answer the same exact question — whether a string belongs to a language or not) are based on these grammars. So let’s take a closer look at them.
Formal grammars
Before going to regexp implementation, and regular grammars we shall discuss what a generic formal grammar is.
A formal grammar describes a language (“set of strings”) formally, defining its structure. Usually one may find discussions about formal grammars mostly related to “parsers” (referring to context-free grammars). However, as we will see shortly, in the core of regular expressions there are the same exact formal grammars, which specifically in this case are called regular grammars.
In general, a grammar G
is a tuple of four:
G = (N, T, P, S)
where N
— is a set of non-terminal symbols, T
— is a set of terminal symbols, P
— is a list of production rules, and S
— is the starting symbol.
For example:
S → aX
X → b
In this grammar, the set of non-terminals N
(sometimes called variables) is:
N = {S, X}
Non-terminals are variables, which means they hold some other values (terminal, non-terminal, or mixed), and can be replaced with the value they hold. In the example above we say that non-terminal S
“can be replaced with” aX
, and the non-terminal X
can be replaced with the terminal b
.
In contrast, terminals is something that cannot be further replaced, and appears “as-is” in our strings. The set T
of our terminals include:
T = {a, b}
The productions of a grammar defines exact replacement rules. We call them derivation rules. So the production:
S → aX
can be described as “S
can be replaced with aX
”, or, the same,“S
derives aX
”.
And the starting symbol (S
in our case) is the starting point where we starting our replacement (or derivation) process.
Note: the grammars such as in the example above are usually defined in the BNF notation. For regular grammars though the alternative notation, the regular expressions, is usually used.
That’s basically it: the purpose of a grammar is to define a (possibly recursive) structure of a language — to allow tools like parsers or regexp to recognize (or “accept”) the strings based on this grammar.
There are four types of grammars, which use different techniques at implementation. For example, context-free grammars (used for syntactic analysis, that is “parsers”) use push-down automata (PDA), while regular grammars use finite automata as the implementation machine.
Since in this series we’ll be focusing specifically on regular expressions, let’s closer discuss the regular grammars, and see the alternative syntactic notation for them.
At this point you can address the corresponding video lecture, and then proceed to the Regular grammars:
Regular grammars
In the Chomsky hierarchy of formal grammars, regular grammars stand right below the context-free grammars, and define the weakest possible formal languages, known as regular languages. These languages are denoted by the concept and notation of regular expressions.
A regular grammar might be right- or left-linear, meaning a non-terminal can appear only at the very right (usually) or left conner of the RHS of a production.
Example of a right-linear regular language in the:
S → aS
S → bA
A → ε
A → cA
Note: ε (“Epsilon”) is the special symbol meaning “the empty string”.
As we mentioned, the grammar in BNF-notation from above might not look very familiar when we talk about regular expressions, however in fact it directly corresponds to the following regexp-notation:
a*bc*
Note: syntactically regular grammars usually described by regular expressions notation, and not by BNF notation, although use the same exact formalism under the hood.
Relation to state machines
Running forward to the implementations of state machines (NFAs and DFAs), we see that regular grammars exactly define the state transitions of finite automata. For example, in the:
S → bA
A → ε
the first production says “if we are in the state S
, and receive symbol b
, we should transit to state A
”. The epsilon-rules in this case correspond to the accepting states.
We however defer discussion about NFAs and DFAs to the second chapter.
Ability to use only one non-terminal at the very right/left conner still allows building recursive rules, however, not nesting rules. Let’s see the difference.
Recursive vs. nesting grammars
Regarding regular languages sometimes there is a misconception of “recursive” vs. “nesting” properties of a grammar (resulting to incorrect understanding that regular expressions “cannot handle recursion”).
A classic example which a regular grammar cannot handle is a “check for balanced parenthesis”:
((()))
(try matching it with a regexp without tracking an extra state!)
A context-free grammar in contrast for this language would be pretty trivial:
S → (S)
S → ε
However, as we noted above, regular grammars can only have one non-terminal and only at the very right/left side of a production. And in the S -> (S)
we have the non-terminal S
at the middle of the RHS.
This property — ability to put a non-terminal anywhere, and not only at the very right/left end,— defines nesting of a language and should not be confused with a generic recursion, which is normally supported by regular grammars, as in the S -> aS
rule from the example above.
Non-regular regexp implementations
Notice though, that some modern regexp implementations normally support even nesting.
For example, the regexp for checking balanced parenthesis would be:
^(\((?1)*\))(?1)*$
However, implementation of such a regexp machine goes beyond the regular grammars. It should support the nesting state and therefore be already a context-free language with the (slower) push-down automation rather than a finite state machine (FSM).
We shall therefore distinguish regular grammars (regular languages) from the regular expressions implementations, which for complex cases like nesting cannot use FSM for implementation.
In the basic theory regular grammars are backed by the finite state machines (or finite automata), which we’re going to discuss and implement during the next part of this series.
If you have any questions, I’ll be glad to answer them in comments.