Building a RegExp machine. Part 2: Finite automata — NFA fragments
This is a second chapter of the “Building a RegExp machine” series, where we discuss a theory and implementation behind the regular languages and regular expressions.
Audience: experienced programmers, professionals.
You can also see more details, and enroll to full course.
In the first part we’ve been talking about generic concepts of formal grammars, languages, notations, etc. And in this chapter we are moving to one of the implementation techniques used to build a regular expressions engine— to finite automata.
From RegExp to finite automata
Regular languages are recognized by the formalism of finite state machines (FSM), also known as finite automata (FA). In particular for RegExp — by Nondeterministic finite automata (NFA), and Deterministic finite automata (DFA).
Note: from the general theory of FSM, FAs are divided into two categories: “FAs with output” (Moore machine, and Mealy machine) and “FAs without output”: NFA, ε-NFA, and DFA. In this series we talk about the later class of FAs, and which are used for RegExp implementation.
Unless one needs to support extended RegExp features, such as nesting, etc., FAs provide the fastest O(n) implementation of a regexp-recognizer, and that is why DFAs are still widely used on practice, e.g. in automatically generated lexers, such as Lex.
We shall now discuss in detail how we can convert a regular expression to an appropriate finite automata.
NFA definition
A Nondeterministic finite automata (NFA) is a 5-tuple:
NFA = (Q, Σ, Δ, q0, F)
Let’s take a look at a simple NFA which will help us understanding what each component from the definition means:
Note: this NFA accepts strings of
1
,10
,100
,1000
, etc., corresponding to the/10*/
regular expression.
The Q
component from the definition is a set of all possible states which this machine can be in. At each moment of time, a machine can be only in one of these states.
In our example the Q
set is:
Q = {A, B}
One of these states, specifically the state B
, is double-circled. With this notation we denote accepting states (or final states). This is the F
component from the definition:
F = {B}
Our goal in traversing this graph, is to reach an accepting state (at least one from the set of accepting states). But where do we start from? — from the state q0
, which is known as the starting state (or initial state). On FA diagrams a starting state is the one with an incoming arrow “from nowhere”. In our case it’s the state A
:
q0 = A
The transitions from state to state are managed by the ∆
transition-function, which for an NFA is defined as:
Δ : Q × Σ → 2^Q
This definition simply says that “if we are in one of the states from the set Q
, and consume a symbol from the alphabet ∑
, we transit to another state from the same set Q
”.
Since the NFA is nondeterministic, from each state on the same symbol we can go to all possible states from Q
, that’s why the transition maps to 2^Q
. Notice, that for DFA this would map just to Q
.
Examples of the transitions:
A × 1 → B
B x 0 → B
As we discussed previously in the first chapter, the ∑
defines the alphabet of a grammar (of the machine in this case). We can see the alphabet symbols labeling the transition edges, and in our case they are:
Σ = {0, 1}
That’s basically it. These are all components which allow us describing any NFA.
In the NFA abbreviation the first “N” stands for Nondeterministic. Let’s see what it means, and how it helps us easier translating regexp to NFAs.
Nondeterminism
Let’s take a look at the following NFA:
Note: this NFA accepts only two strings:
1
, and10
., which corresponds to the/10?/
regular expression.
The first nondeterministic aspect of an NFA comes from the ability to transit from a current state on the same symbol to different states. Take a look at the transitions from state A
:
A × 1 → B
A x 1 → C
How can we even decide where to go if we receive symbol 1
in this case? That is a good question! And while technically it is possible — we just need to try all paths one by one or in parallel, — usually it’s not the case, and is the reason why NFAs are not used directly, but converted further to appropriate DFAs (which do not allow such transitions).
Second aspect of the nondeterminism is that an NFA doesn’t require implementing all the transitions from the alphabet: e.g. from the state A
we have only A x 1
transition but not A x 0
. In a fully-specified DFA we would have to provide such A x 0
transition, which would go to a “dead state”. However, even in a DFA we can omit such transitions for brevity.
ε-NFA
Another property which is possible only in an NFA, is the ability to have epsilon-transitions (aka ε-transitions).
Note: Epsilon (ε) — the empty string.
We can see such transitions from states B
and C
in the example above, and this means, that we can transit from these states to the final state D
without consuming a character from the string.
Notice however, that ε-transitions do not extend the power of your NFA. They are usually used for convenience of combining different parts of NFA fragments.
Sometimes it is harder to translate a regular expression to pure NFA, and much easier to an ε-NFA, that’s why usually we use the later, and eventually convert ε-NFA to DFA.
And a DFA just defines these restrictions — do not allow ε-transitions, do not allow nondeterministic moves, etc. Other than that it doesn’t differ much from an NFA — except the transition function. We however defer a detailed discussion about DFAs to the next chapter.
Now let’s see how we can apply the NFA for regular expressions.
In fact, when we have a regular expression such as /xy*|z/
, it is not obvious how we can translate it into an appropriate machine. And usually we do not do this. Instead, we decompose a complex expression into the basic building blocks, and then connect these basic blocks together, forming the resulting machine.
5 basic NFA-fragments for RegExp
In the formal definition of the regular languages, there are only 5 cases, which actually define a regular language. And this is exactly what we need to cover in the basic building NFA-blocks (which we also call NFA-fragments). Let’s consider them.
Character fragment
The simplest possible NFA machine is the machine for a single character, say a
, with the corresponding regular expression /a/
.
An NFA-fragment for it looks as follows:
Note: we specifically omit state labels here, since this is not essential in this case. What essentials is the edge transition labeled with the character
a
.
This NFA-fragments says: “if we are in the initial state, we can transit to the next (final) state only if we consume character a
from the string”.
The fragment accepts only one string, the string "a"
: by consuming the character a
, we move the cursor to the end of the string, and transition to the next (final) state. Since we are in the final state, we accept the string.
In our NFA-fragments we’re going to maintain the “one input, one output” invariant like in the example above. This will make it easier connecting machines to more complex compound NFAs.
Let’s move forward, and take a look at a similar transition, but without consuming a character.
ε-transition fragment
As we mentioned above, the ε-moves allows us to transit to the next state, and keep the string cursor where it is, that is without consuming a character.
An ε-transition fragment looks the same as the character transition, but instead of an actual character, we transit on “the empty character”:
The two discussed above fragments — character and epsilon — are the very basic NFA-fragments. On top of these two, everything else is built in FA regexp implementation. Even the last three from the basic five fragments.
You can also watch the corresponding video lecture for the Single Character and Epsilon machines:
The last three fragments from the basic set are called basic regexp operations. Let’s consider them in detail.
Concatenation fragment
The concatenation NFA-fragment defines the “A followed by B” operation. It is denoted as a sequence AB
, and in the simplest example can be presented as the /ab/
regular expression.
How do we present this operation? Well, we know how to build a machine for A
(the single character in the simplest case), and we know how to build a machine for B
(also the single character machine in the simplest case).
All we need to do else is just to connect output of A
machine to the input of B
machine — on the ε-transition:
As we can see, the actual machines A
and B
are encapsulated. We just concatenate them on the ε-transition, without making any commitment of what is inside of these machines.
That is, the machines are black-box abstractions: they might be single character machines — like /ab/
, or they might be even some complex compound machines, e.g. a repeated character class followed by another character class: /[0-9]*[a-z]/
— in both cases they still obey the same concatenation fragment.
Notice also, that once we concatenate the machines, we reset the accepting state of the first machine. I.e. we still maintain the “one input, one output” invariant. This allows treating this combined machine as a single unit, and connect it further.
Union fragment
The union NFA fragment defines the “OR” operation. It is also known as “disjunction” operation.
Disjunction is denoted as A|B
operation, and in the simplest regexp can look like: /a|b/
.
Again, we’re going to maintain the invariants of the black-box machines, and keeping the “one input, one output” approach:
Taking the string "a"
, and the regexp /a|b/
, let’s see how do we accept it.
The nondeterministic ε-transition from the initial state allows us choosing any from the two paths. Let’s say we took the lower path, and transited to the state which expects us to consume b
character. This is not the case since we’re trying to match "a"
, and we have to backtrack now to the split point, and try another path.
This machine is interesting, and discovers several points. First, it shows that it’s possible to interpret an NFA directly (by traversing this graph). Second, it shows that it actually might be slow — like in this case with backtracking. We of course could do some optimizations, e.g. spawn several parallel threads, and abandon the paths which hit the dead end. However, this is not what usually happens on practice, and like we said previously, we’re going eventually to convert this NFA to an appropriate DFA which is much faster.
The last building block we’re going to consider is the Kleene-closure.
Kleene-closure fragment
The Kleene-closure (aka Kleene-star) operator defines the repetition pattern, and in a more user-friendly way of saying is repeating of a machine “zero or more times”.
The Kleene-star is denoted as A*
, and in the simplest regexp looks as /a*/
. This accepts strings such as: ""
(the empty string), "a"
, "aa"
, "aaa"
, etc.
Here’s its NFA-fragment:
The empty string is accepted by following the very bottom ε-transition, and the “more” case is handled by taking the ε-transition back from the accepting state, and consuming A
as many as needed.
These are the 5 basic fragments used to build everything else in NFA regexp implementation.
Note: the method of building NFAs from the basic blocks we used above is known as Thompson’s construction.
Let’s see an actual regexp example now.
RegExp NFA example
Now when we know that any complex compound regular expressions is built only from these 5 basic NFA fragments, let’s take an example which contains all sub-fragments, and see how we can translate this regexp into an appropriate NFA.
Our example is: /xy*|z/
.
Like we said, we have to decompose this complex expression on the primitive fragments, and then just connect them.
To do so, we have to consider precedence of the regular expressions. The highest precedence is of the symbol itself— A
. Then goes “star” operator — A*
. The next one is the concatenation — AB
. And finally (the lowest precedence) is the union operator — A|B
.
We have to build the machine part by part, starting from the lowest precedence, that is A|B
disjunction pattern. In our case it’s the whole expression, where A
is xy*
, and B
is z
.
Next we’ll have to decompose the xy*
complex pattern, which is the AB
concatenation pattern — where A
is x
, and B
is y*
. And finally we extract the y*
.
The resulting NFA for this expression looks as:
We leave it as an exercise for the reader to draw the actual boxes for the basic fragments on this picture, and also play building NFAs for other regular expressions.
In the next chapter we’re going to discuss conversion of an NFA to an appropriate DFA, will build NFA/DFA transition tables, and eventually implement RegExp match
function based on a DFA table.
As usually if you have any questions of feedback, I’ll be glad to discuss it in comments.