# Building a RegExp machine. Part 2: Finite automata — NFA fragments

--

This is a second chapter of the *“Building a RegExp machine”* series, where we discuss a theory and implementation behind the *regular languages* and *regular expressions*.

**Audience:** experienced programmers, professionals.

You can also **see more details**, and **enroll to full course**.

In the first part we’ve been talking about generic concepts of *formal grammars*, *languages*, *notations*, etc. And in this chapter we are moving to one of the *implementation techniques* used to build a regular expressions engine— to *finite automata*.

# From RegExp to finite automata

*Regular languages* are recognized by the formalism of *finite state machines** **(FSM)*, also known as *finite automata (FA).* In particular for RegExp — by *Nondeterministic finite automata (NFA)*, and *Deterministic finite automata (DFA)*.

Note:from the general theory of FSM, FAs are divided into two categories:“FAs with output”(Moore machine, andMealy machine) and“FAs without output”:NFA,ε-NFA, andDFA. In this series we talk about the later class of FAs, and which are used for RegExp implementation.

Unless one needs to support extended RegExp features, such as *nesting*, etc., FAs provide *the fastest* O(n) implementation of a regexp-recognizer, and that is why DFAs are still widely used on practice, e.g. in automatically generated lexers, such as Lex.

We shall now discuss in detail how we can *convert* a *regular expression* to an appropriate *finite automata*.

# NFA definition

A ** Nondeterministic finite automata (NFA)** is a 5-tuple:

`NFA = (`*Q*, Σ, Δ, *q0*, *F*)

Let’s take a look at a simple NFA which will help us understanding what each component from the definition means:

Note:this NFA accepts strings of`1`

,`10`

,`100`

,`1000`

, etc., corresponding to the

/10*/regular expression.

The `Q`

component from the definition is a *set* *of* *all possible**states** *which this machine can be in. At each moment of time, a machine can be only in *one* of these *states.*

In our example the `Q`

set is:

`Q = {A, B}`

One of these states, specifically the state `B`

, is *double-circled*. With this notation we denote *accepting states**(or final states)*. This is the `F`

component from the definition:

`F = {B}`

Our goal in traversing this graph, is to *reach an accepting state* (at least one from the set of accepting states). But where do we start from? — from the state `q0`

, which is known as the *starting state** (or initial state)*. On FA diagrams a starting state is the one with an *incoming arrow “from nowhere”.* In our case it’s the state `A`

:

`q0 = A`

The *transitions* from state to state are managed by the `∆`

** transition-function**, which for an NFA is defined as:

`Δ : Q × Σ → 2^Q`

This definition simply says that *“if we are in one of the states from the set **Q**, and consume a symbol from the alphabet **∑**, we transit to another state from the same set **Q**”*.

Since the NFA is *nondeterministic*, from each state on the *same symbol* we can go to *all possible states* from `Q`

, that’s why the transition maps to `2^Q`

. Notice, that for DFA this would map *just* to `Q`

.

Examples of the transitions:

`A × 1 → B`

B x 0 → B

As we discussed previously in the first chapter, the `∑`

defines the ** alphabet** of a grammar (of the machine in this case). We can see the alphabet symbols labeling the

*transition edges*, and in our case they are:

`Σ = {0, 1}`

That’s basically it. These are all components which allow us describing any NFA.

In the NFA abbreviation the first *“N”* stands for ** Nondeterministic**. Let’s see what it means, and how it helps us easier translating regexp to NFAs.

# Nondeterminism

Let’s take a look at the following NFA:

Note:this NFA accepts only two strings:

1, and

10., which corresponds to the

/10?/regular expression.

The first ** nondeterministic** aspect of an NFA comes from the ability to transit from a current state

*on the same symbol*to

*different states*. Take a look at the transitions from state

`A`

:`A × 1 → B`

A x 1 → C

How can we even decide where to go if we receive symbol `1`

in this case? That is a good question! And while technically it is possible — we just need to try *all paths* one by one or *in parallel*, — usually it’s not the case, and is the reason why NFAs are *not* used *directly*, but *converted* further to appropriate DFAs (which do not allow such transitions).

Second aspect of the nondeterminism is that an NFA doesn’t require implementing all the transitions from the alphabet: e.g. from the state `A`

we have only `A x 1`

transition but not `A x 0`

. In a *fully-specified DFA* we would have to provide such `A x 0`

transition, which would go to a *“dead state”*. However, even in a DFA we can omit such transitions for brevity.

# ε-NFA

Another property which is possible only in an NFA, is the ability to have *epsilon-transitions**(aka ε-transitions)*.

Note:Epsilon (ε)—the empty string.

We can see such transitions from states `B`

and `C`

in the example above, and this means, that we can transit from these states to the *final* state `D`

*without consuming a character from the string*.

Notice however, that ε-transitions *do not extend* the *power* of your NFA. They are usually used for *convenience* of combining different parts of NFA fragments.

Sometimes it is harder to translate a regular expression to *pure NFA*, and much easier to an *ε-NFA*, that’s why usually we use the later, and eventually convert ε-NFA to DFA.

And a DFA just defines these restrictions — do not allow ε-transitions, do not allow nondeterministic moves, etc. Other than that it doesn’t differ much from an NFA — except the transition function. We however defer a detailed discussion about DFAs to the next chapter.

Now let’s see how we can apply the NFA for regular expressions.

In fact, when we have a regular expression such as `/xy*|z/`

, it is not obvious how we can translate it into an appropriate machine. And usually we do not do this. Instead, we *decompose* a *complex expression* into the ** basic building blocks**, and then

*connect*these basic blocks together, forming the

*resulting machine*.

# 5 basic NFA-fragments for RegExp

In the formal definition of the regular languages, there are only 5 cases, which actually define a regular language. And this is exactly what we need to cover in the ** basic building NFA-blocks** (which we also call

**). Let’s consider them.**

*NFA-fragments*## Character fragment

The *simplest possible* NFA machine is the machine for a ** single character**, say

`a`

, with the corresponding regular expression `/a/`

.An NFA-fragment for it looks as follows:

Note:we specifically omit state labels here, since this is not essential in this case. What essentials is theedge transitionlabeled with the character`a`

.

This NFA-fragments says: *“if we are in the initial state, we can transit to the next (final) state **only if** we consume character **a** from the string”*.

The fragment accepts only one string, the string `"a"`

: by consuming the character `a`

, we move the cursor to the *end of the string*, *and**transition to the next (final) state*. Since we are in the final state, we *accept* the string.

In our NFA-fragments we’re going to maintain the *“one input, one output”**invariant* like in the example above. This will make it easier connecting machines to more complex compound NFAs.

Let’s move forward, and take a look at a similar transition, but *without consuming a character*.

## ε-transition fragment

As we mentioned above, the *ε-moves* allows us to transit to the next state, and *keep the string cursor* where it is, that is *without consuming a character*.

An *ε-transition* fragment looks the same as the character transition, but instead of an actual character, we transit on *“the empty character”*:

The two discussed above fragments — *character* and *epsilon — *are ** the very basic** NFA-fragments. On top of these two, everything else is built in FA regexp implementation. Even the last three from the basic five fragments.

You can also watch the corresponding video lecture for the Single Character and Epsilon machines:

The last three fragments from the basic set are called *basic regexp operations*. Let’s consider them in detail.

## Concatenation fragment

The ** concatenation** NFA-fragment defines the

**operation. It is denoted as a**

*“A followed by B”**sequence*

`AB`

, and in the simplest example can be presented as the `/ab/`

regular expression.How do we present this operation? Well, we know how to build a machine for `A`

(the single character in the simplest case), and we know how to build a machine for `B`

(also the single character machine in the simplest case).

All we need to do else is just to *connect* ** output** of

`A`

machine to the **of**

*input*`B`

machine — on the *ε-transition:*

As we can see, the actual *machines* `A`

and `B`

are *encapsulated.* We just *concatenate* them on the *ε-transition*, without making *any commitment* of *what is* *inside* of these machines.

That is, the machines are *black-box abstractions*: they might be single character machines — like `/ab/`

, ** or** they might be even some

*complex compound machines*, e.g. a repeated character class followed by another character class:

`/[0-9]*[a-z]/`

— in both cases they still obey *the same concatenation fragment*.

Notice also, that once we concatenate the machines, we *reset* the *accepting state* of the *first machine*. I.e. we still maintain the *“one input, one output”* invariant. This allows treating this *combined machine* as a ** single unit**, and connect it further.

## Union fragment

The ** union NFA fragment** defines the

**operation. It is also known as**

*“OR”***operation.**

*“disjunction”*Disjunction is denoted as `A|B`

operation, and in the simplest regexp can look like: `/a|b/`

.

Again, we’re going to maintain the invariants of the black-box machines, and keeping the “one input, one output” approach:

Taking the string `"a"`

, and the regexp `/a|b/`

, let’s see how do we accept it.

The nondeterministic ε-transition from the initial state allows us choosing *any* from the two paths. Let’s say we took the lower path, and transited to the state which expects us to consume `b`

character. This is not the case since we’re trying to match `"a"`

, and we have to ** backtrack** now to the

*split point*, and try

*another path*.

This machine is interesting, and discovers several points. First, it shows that it’s possible to *interpret* an NFA *directly* (by traversing this graph). Second, it shows that it actually might be *slow* — like in this case with backtracking. We of course could do some optimizations, e.g. spawn several *parallel threads*, and *abandon* the paths which hit the *dead end*. However, this is not what usually happens on practice, and like we said previously, we’re going eventually to convert this NFA to an appropriate DFA which is much faster.

The last building block we’re going to consider is the *Kleene-closure*.

## Kleene-closure fragment

The *Kleene-closure**(aka **Kleene-star**)* operator defines the *repetition*** pattern**, and in a more user-friendly way of saying is

*repeating*of a machine

**.**

*“zero or more times”*The Kleene-star is denoted as `A*`

, and in the simplest regexp looks as `/a*/`

. This accepts strings such as: `""`

(the empty string), `"a"`

, `"aa"`

, `"aaa"`

, etc.

Here’s its NFA-fragment:

The empty string is accepted by following the very bottom ε-transition, and the *“more”* case is handled by taking the ε-transition back from the accepting state, and consuming `A`

as many as needed.

These are the 5 basic fragments used to build everything else in NFA regexp implementation.

Note:the method of building NFAs from the basic blocks we used above is known asThompson’s construction.

Let’s see an actual regexp example now.

# RegExp NFA example

Now when we know that *any complex compound* regular expressions is built *only* from these *5 basic NFA fragments*, let’s take an example which contains all sub-fragments, and see how we can translate this regexp into an appropriate NFA.

Our example is: `/xy*|z/`

.

Like we said, we have to *decompose* this complex expression on the primitive fragments, and then just *connect* them.

To do so, we have to consider ** precedence** of the regular expressions. The highest precedence is of the symbol itself—

`A`

. Then goes “star” operator — `A*`

. The next one is the concatenation — `AB`

. And finally (the lowest precedence) is the union operator — `A|B`

.We have to build the machine part by part, *starting from the lowest precedence*, that is `A|B`

*disjunction pattern*. In our case it’s the whole expression, where `A`

is `xy*`

, and `B`

is `z`

.

Next we’ll have to decompose the `xy*`

complex pattern, which is the `AB`

*concatenation pattern — *where `A`

is `x`

, and `B`

is `y*`

. And finally we extract the `y*`

.

The resulting NFA for this expression looks as:

We leave it as an exercise for the reader to draw the actual boxes for the basic fragments on this picture, and also play building NFAs for other regular expressions.

In the next chapter we’re going to discuss *conversion* of an NFA to an appropriate DFA, will build NFA/DFA *transition tables*, and eventually implement RegExp `match`

function based on a DFA table.

As usually if you have any questions of feedback, I’ll be glad to discuss it in comments.