# Building a RegExp machine. Part 1: Regular grammars

--

In this series of articles we’re going to study a theory and implementation behind the formalism of *regular expressions*, — the technique used for text processing, implementation of tokenizers, etc.

**Audience:** experienced programmers, professionals.

You can also **see more details**, and **enroll to full course**.

Note:this isnota series on “how touseregexp”, but rather on “how toimplementregexp”. You should already know what regexp is, and use it on practice.

The first part of the series is the most theoretical (so bear with me :)). There will be no any implementation of NFA/DFA in this part, and we’ll mostly be talking about *formal grammars* (and specifically *regular grammars* used for regexp). However, that’s the amount of theory we will need for full understanding of how regexp machines work under the hood.

Before reading this article, you can also watch an introduction video for this class:

With that’s being said, let’s get started with the basic definitions of *“symbols”* and *“alphabets”*.

# Symbols, alphabets, and languages

From the Theory of computation (TOC), and linguistics *languages* consist of **strings**, which in turn consist of **symbols**, which in turn belong to some **alphabet**.

For example, we can have an *alphabet* `∑`

containing only two symbols, `a`

and `b`

:

`∑ = {a, b}`

The *set of strings* which we can *generate* using this alphabet e.g. are:

`strings(∑) = {a, aa, b, ab, ba, bba, …}`

In this case we have an *unbounded* set of strings, which is ** infinite** — based on its

*“length”*property: we can have strings of

*any length*, and this set in theory is infinite.

This *“set of strings”* is called a **language**.

Usually though languages apply some *restrictions* (or *rules*) over the strings they *accept*.

For example, using the same alphabet we can have a language `L`

which accepts *only**“strings of length 2".* In this case our set contains only four strings:

L: "strings from ∑ of length 2"L(∑) = {aa, bb, ab, ba}

And the question *“whether a string belongs to a language or not”* is the subject of TOC, and *formal grammars*. The tools like *regular expressions* (which in fact also answer the same exact question — whether a string belongs to a language or not) are based on these grammars. So let’s take a closer look at them.

# Formal grammars

Before going to regexp implementation, and regular grammars we shall discuss what a generic ** formal grammar** is.

A formal grammar describes a *language* (“set of strings”) *formally*, defining its *structure*. Usually one may find discussions about formal grammars mostly related to “parsers” (referring to *context-free* grammars). However, as we will see shortly, in the core of regular expressions there are the same exact formal grammars, which specifically in this case are called *regular grammars*.

In general, a **grammar** `G`

is a tuple of four:

`G = (N, T, P, S)`

where `N`

— is a set of *non-terminal* symbols, `T`

— is a set of *terminal symbols*, `P`

— is a list of *production rules*, and `S`

— is the *starting symbol*.

For example:

`S → aX`

X → b

In this grammar, the set of *non-terminals* `N`

(sometimes called *variables*) is:

`N = {S, X}`

**Non-terminals** are *variables*, which means they *hold some other values* (terminal, non-terminal, or mixed), and *can be replaced* with the value they hold. In the example above we say that non-terminal `S`

*“can be replaced with”* `aX`

, and the non-terminal `X`

can be replaced with the terminal `b`

.

In contrast, **terminals** is something that *cannot be further replaced*, and appears “*as-is”* in our strings. The set `T`

of our terminals include:

`T = {a, b}`

The **productions** of a grammar defines exact *replacement rules*. We call them **derivation rules**. So the production:

`S → aX`

can be described as *“**S** can be **replaced with** **aX**”*, or, the same,*“**S** **derives** **aX**”*.

And the **starting symbol** (`S`

in our case) is the starting point where we starting our *replacement* (or *derivation*) process.

Note:the grammars such as in the example above are usually defined in theBNF notation. Forregular grammarsthough the alternative notation,the regular expressions, is usually used.

That’s basically it: the purpose of a grammar is to define a (possibly recursive) structure of a language — to allow tools like parsers or regexp to *recognize* (or *“accept”*) the strings based on this grammar.

There are four types* *of grammars, which use *different techniques* at *implementation*. For example, *context-free grammars* (used for syntactic analysis, that is “parsers”) use *push-down automata (PDA)*, while *regular grammars* use *finite automata* as the implementation machine.

Since in this series we’ll be focusing specifically on regular expressions, let’s closer discuss the *regular grammars*, and see the alternative syntactic notation for them.

At this point you can address the corresponding video lecture, and then proceed to the Regular grammars:

# Regular grammars

In the Chomsky hierarchy of formal grammars, ** regular grammars** stand right below the

*context-free grammars*, and define

*the weakest possible*formal languages, known as

**. These languages are denoted by the concept and notation of**

*regular languages***.**

*regular expressions*A regular grammar might be *right-* or *left-linear*, meaning a *non-terminal* can appear ** only** at the

*very right*(usually) or

*left*conner of the RHS of a production.

Example of a *right-linear* regular language in the:

`S → aS`

S → bA

A → ε

A → cA

Note:ε (“Epsilon”) is the special symbol meaning “the empty string”.

As we mentioned, the grammar in **BNF-notation** from above might *not* look very familiar when we talk about regular expressions, however in fact it directly corresponds to the following **regexp-notation**:

`a*bc*`

Note:syntacticallyregular grammars usually described byregular expressionsnotation, and not by BNF notation, although use the same exact formalism under the hood.

# Relation to state machines

Running forward to the implementations of *state machines (NFAs and DFAs)*, we see that regular grammars exactly define the ** state transitions** of

*finite automata*. For example, in the:

`S → bA`

A → ε

the first production says *“if we are in the state **S**, and receive symbol **b**, we should transit to state **A**”*. The *epsilon-rules* in this case correspond to the *accepting states*.

We however defer discussion about NFAs and DFAs to the second chapter.

Ability to use ** only one non-terminal** at the very right/left conner still allows building

*recursive*rules, however, not

*nesting*rules. Let’s see the difference.

# Recursive vs. nesting grammars

Regarding regular languages sometimes there is a misconception of *“recursive”* vs. *“nesting”* properties of a grammar (resulting to *incorrect* understanding that regular expressions “cannot handle recursion”).

A classic example which a regular grammar *cannot* handle is a *“check for balanced parenthesis”*:

`((()))`

(try matching it with a regexp without tracking an extra state!)

A *context-free grammar* in contrast for this language would be pretty trivial:

`S → (S)`

S → ε

However, as we noted above, regular grammars can ** only** have

*one non-terminal*and

**at the**

*only**very right/left side*of a production. And in the

`S -> (S)`

we have the non-terminal `S`

*at the*

*middle*of the RHS.

This property — ability to put a non-terminal *anywhere*, and not only at the very right/left end,— defines ** nesting** of a language and

*should not be confused*with a

**, which is normally**

*generic recursion**supported*by regular grammars, as in the

`S -> aS`

rule from the example above.# Non-regular regexp implementations

Notice though, that some modern regexp implementations normally support *even nesting*.

For example, the regexp for checking balanced parenthesis would be:

`^(\((?1)*\))(?1)*$`

However, implementation of such a regexp machine goes beyond the regular grammars. It should support the nesting state and therefore be already a context-free language with the (slower) *push-down automation* rather than a *finite state machine (FSM)*.

We shall therefore *distinguish* ** regular grammars** (regular languages) from the regular expressions

**, which for complex cases like nesting cannot use FSM for implementation.**

*implementations*In the basic theory *regular grammars* are backed by the *finite state machines* (or *finite automata*), which we’re going to discuss and implement during the next part of this series.

If you have any questions, I’ll be glad to answer them in comments.