Writing a Parser — Part II: Algorithms and Implementation

Supun Setunga
10 min read · Sep 10, 2020


In my previous article (Writing a Parser — Getting Started), I discussed the basic architecture of a parser, some terminology, and when to go for a handwritten parser and when not to. I also discussed some high-level details of the different components of a parser and their requirements.

In this article, I will discuss each of these components in detail, the algorithms they follow, and some implementation details. This is a continuation of my previous article, so I would strongly recommend reading that before going through the rest of this one.

Character Reader / Input Reader

The main purpose of having an abstraction of a character reader is to have a character provider that is independent of the input type. That way, the lexer can simply rely on the peek()/consume() methods to provide the characters required to produce tokens.

However, the peek() method introduces an additional requirement on the implementation: in order to facilitate peek()/peek(k), the character reader should keep track of the characters that were looked ahead without discarding them, and should be able to provide them again upon a subsequent peek() or consume() call. In other words, the character reader should preserve the peeked characters in memory. There are two ways to achieve this:

  • In-memory reader
  • Stream reader

In-memory reader

Read the entire input at once and store it as a large character array/buffer. Maintain an index pointing to the last consumed character position (say i). When:

  • peek(k) is called, return the (i+k)-th character from the buffer.
  • consume(k) is called, return the (i+k)-th character from the buffer, and set i = i+k.

This is pretty simple to implement. But it consumes a lot of memory as the entire source-code is loaded into memory at once.
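
Below is a minimal sketch of what such an in-memory reader could look like in Java. The class and method names (and the EOF convention) are my own assumptions for illustration; only the single-character consume() is shown.

// A minimal in-memory character reader: the whole source is loaded into a
// char array up front, and peek(k)/consume() simply index into that array.
public class InMemoryCharReader {
    public static final int EOF = -1;

    private final char[] buffer;   // entire source, loaded at once
    private int i = -1;            // index of the last consumed character

    public InMemoryCharReader(String source) {
        this.buffer = source.toCharArray();
    }

    // Return the k-th character after the last consumed one, without consuming it.
    public int peek(int k) {
        int index = i + k;
        return index < buffer.length ? buffer[index] : EOF;
    }

    public int peek() {
        return peek(1);
    }

    // Consume the next character and move the position forward.
    public int consume() {
        int c = peek(1);
        if (c != EOF) {
            i = i + 1;
        }
        return c;
    }
}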

Stream reader

Uses a ring-buffer (of capacity N) to hold only the peeked characters. It does not read the entire source code into memory. It maintains the number of characters currently in the buffer (say n) and the starting position of the buffer (say i).

When peek(k) is called:

  • If k ≤ n, the k-th character is already in the buffer. Return the k-th character from the buffer.
  • If k > n, the k-th character is not yet in the buffer. Read up to the k-th character and add the newly read characters to the buffer. Since n characters are already in the buffer, a total of k−n characters will be read. Then return the k-th character from the buffer.

When consume(k) is called, simply return the k-th element from the i-th position of the buffer, update i to point just past the returned element (wrapping around the ring), and reduce the buffered count n accordingly.

This approach is very memory efficient, as it keeps at most N characters in memory at a given time. However, the capacity N of the buffer needs to be fixed in advance, depending on how many characters the lexer will need to peek. e.g: If the lexer peeks at most 5 characters ahead, then N can be set to 5 (or more). If you are unsure how many characters the lexer will peek, you can set a larger value like 20, 100, or even 500 to be on the safe side, depending on the lexer implementation.
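
A sketch of such a stream-based reader is shown below, assuming the input comes from a java.io.Reader. The names and the exact buffering policy are illustrative assumptions; the key point is that only the peeked-but-not-yet-consumed characters stay in memory.

import java.io.IOException;
import java.io.Reader;

// A stream-based character reader backed by a small ring buffer. The buffer
// capacity must be at least the maximum lookahead used by the lexer.
public class StreamCharReader {
    public static final int EOF = -1;

    private final Reader input;
    private final int[] ringBuffer;
    private int start = 0;   // position of the oldest peeked (un-consumed) character
    private int size = 0;    // number of peeked characters currently buffered

    public StreamCharReader(Reader input, int capacity) {
        this.input = input;
        this.ringBuffer = new int[capacity];
    }

    // Return the k-th character ahead (1-based) without consuming it.
    public int peek(int k) throws IOException {
        // If the k-th character is not buffered yet, read the missing
        // (k - size) characters from the stream into the ring buffer.
        while (size < k) {
            ringBuffer[(start + size) % ringBuffer.length] = input.read();
            size++;
        }
        return ringBuffer[(start + k - 1) % ringBuffer.length];
    }

    public int peek() throws IOException {
        return peek(1);
    }

    // Consume the next character: hand out the oldest buffered character
    // (buffering it first if needed) and advance the start of the ring.
    public int consume() throws IOException {
        int c = peek(1);
        start = (start + 1) % ringBuffer.length;
        size--;
        return c;
    }
}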

The Lexer

A lexer is typically a finite-state machine. It starts in an initial state. Then, depending on the first character it encounters, the lexer changes its state and continues to lex in that state until that state ends. The way tokens are produced differs from state to state.

For example, consider the input “Hello, world!” (including the double quotes). The lexer starts in a “default” mode where a comma (,) is a single token, whitespace ends a token, and the exclamation mark is an operator token. Once the lexer reaches the first double quote, it goes into a “string-literal” state where commas, whitespaces, and all other characters (except double quotes) are considered part of the token. This string-literal mode comes to an end once the lexer reaches the second double quote and completes processing that string literal. The lexer then emits this string literal as a token and goes back to the default mode.

State transition diagram of the lexer (with only a sub-set of states)

Corresponding code for the above lexer would look like:

Sample Java implementation of a lexer with the above state diagram
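
A rough Java sketch of such a state-machine lexer, covering only the default and string-literal states from the diagram, could look like the following. The Token and TokenKind types below, and the use of the in-memory reader sketched earlier, are illustrative assumptions rather than a complete implementation.

// Illustrative token kind and token types used by the lexer sketch.
enum TokenKind { IDENTIFIER, STRING_LITERAL, COMMA, NOT_OP, INVALID, EOF }

class Token {
    final TokenKind kind;
    final String text;
    Token(TokenKind kind, String text) { this.kind = kind; this.text = text; }
}

// A lexer as a finite-state machine: the "default" state dispatches on the
// next character, and the first double quote switches to the string-literal
// state until the closing quote is seen.
class Lexer {
    private final InMemoryCharReader reader;

    Lexer(InMemoryCharReader reader) {
        this.reader = reader;
    }

    Token nextToken() {
        skipWhitespace();
        int c = reader.peek();
        switch (c) {
            case InMemoryCharReader.EOF:
                return new Token(TokenKind.EOF, "");
            case '"':
                return readStringLiteral();              // enter string-literal state
            case ',':
                reader.consume();
                return new Token(TokenKind.COMMA, ",");
            case '!':
                reader.consume();
                return new Token(TokenKind.NOT_OP, "!");
            default:
                if (Character.isLetter(c)) {
                    return readIdentifier();             // stay in the default state
                }
                reader.consume();
                return new Token(TokenKind.INVALID, String.valueOf((char) c));
        }
    }

    // String-literal state: everything up to the closing quote is part of the token.
    private Token readStringLiteral() {
        StringBuilder sb = new StringBuilder();
        reader.consume();                                // opening '"'
        while (reader.peek() != '"' && reader.peek() != InMemoryCharReader.EOF) {
            sb.append((char) reader.consume());
        }
        reader.consume();                                // closing '"' ends the state
        return new Token(TokenKind.STRING_LITERAL, sb.toString());
    }

    private Token readIdentifier() {
        StringBuilder sb = new StringBuilder();
        while (Character.isLetterOrDigit(reader.peek())) {
            sb.append((char) reader.consume());
        }
        return new Token(TokenKind.IDENTIFIER, sb.toString());
    }

    private void skipWhitespace() {
        while (reader.peek() == ' ' || reader.peek() == '\t' || reader.peek() == '\n') {
            reader.consume();
        }
    }
}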

Emitting a Token

The lexer plays two sub-roles during lexing/tokenizing: one is scanning, and the other is evaluating. Scanning is the part discussed in the previous section, where the lexer scans the input characters and determines the matching grammar rule. Then, depending on the content of the scanned characters, it emits the relevant token. A token consists not only of its character content but also of a type/kind associated with it. Generally, the tokens emitted by a lexer fall into four major categories (a sketch of such token kinds follows the list):

  • Keywords — Predefined set of names (identifiers) that are known by the lexer and the parser.
  • Identifiers — Names whose values are dynamic
  • Literals — String literals, numeric literals, etc. Values are dynamic.
  • Syntax tokens — Predefined set of tokens such as separators, arithmetic operators, etc. e.g: { , } , [ , + , - , ( , )
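
A fuller version of the TokenKind enum from the sketch above could group the kinds along these four categories, with the evaluating step promoting a scanned name to a keyword when it matches one of the predefined names. All names below are assumptions made for the example.

import java.util.Map;

// Token kinds grouped by the four categories described above (illustrative).
enum TokenKind {
    // Keywords — predefined names known to the lexer and the parser
    IF_KEYWORD, ELSE_KEYWORD, FUNCTION_KEYWORD,
    // Identifiers — names whose values are dynamic
    IDENTIFIER,
    // Literals — values are dynamic
    STRING_LITERAL, NUMERIC_LITERAL,
    // Syntax tokens — separators, operators, etc.
    OPEN_BRACE, CLOSE_BRACE, OPEN_BRACKET, PLUS, MINUS, OPEN_PAREN, CLOSE_PAREN,
    EOF
}

class Keywords {
    private static final Map<String, TokenKind> KEYWORDS = Map.of(
            "if", TokenKind.IF_KEYWORD,
            "else", TokenKind.ELSE_KEYWORD,
            "function", TokenKind.FUNCTION_KEYWORD);

    // Evaluating: a scanned name becomes a keyword token if it is one of the
    // predefined names; otherwise it is an identifier token.
    static TokenKind kindOfName(String name) {
        return KEYWORDS.getOrDefault(name, TokenKind.IDENTIFIER);
    }
}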

Lookahead

There are situations where the lexer cannot determine which lexer rule to process just by looking at the next character alone. One example in Java is the greater-than operator (>), the signed-right-shift operator (>>), and the unsigned-right-shift operator (>>>). When the lexer reaches the first ‘>’, it cannot immediately determine which of the three tokens it is processing. To determine that, the lexer has to check the 2nd character. If the second character is not ‘>’, then it can stop, concluding that it is the greater-than operator. If the second character is also ‘>’, then it has to check the 3rd character as well.

As you may have realized, the number of characters to look ahead can vary depending on the situation. Thus the lexer should support arbitrary lookahead. There is no hard-and-fast rule for deciding the amount of lookahead to use in each situation; it strictly depends on the lexical structure of the language, i.e: on the grammar.
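
As an example, the ‘>’ case above could be handled with the peek(k) reader from earlier, along these lines (this method could live in the lexer sketch above; the token-kind names here are hypothetical):

// Lexing '>', '>>' and '>>>' needs up to three characters of lookahead.
private Token readGtToken() {
    if (reader.peek(2) != '>') {
        reader.consume();                                    // '>'
        return new Token(TokenKind.GT, ">");
    }
    if (reader.peek(3) != '>') {
        reader.consume(); reader.consume();                  // '>' '>'
        return new Token(TokenKind.SIGNED_RIGHT_SHIFT, ">>");
    }
    reader.consume(); reader.consume(); reader.consume();    // '>' '>' '>'
    return new Token(TokenKind.UNSIGNED_RIGHT_SHIFT, ">>>");
}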

Whitespace handling

As discussed in the previous article, there are many ways to treat whitespaces. However, if you are planning to go for an approach where whitespaces are attached to a token, then the next question would be: what's the best way to do that? Again, there are three approaches that can be followed when attaching whitespace to a token:

  • Attach whitespace to the immediate next token — Every token will have leading whitespaces. Any whitespace at the end of the file will be attached to the EOF token.
  • Attach whitespaces to last processed token — Every token will have only trailing white spaces. Needs a special start-of-file (SOF) token to capture the whitespaces at the start of a file.
  • Split the whitespace and attach them to both the last processed token and the next immediate token — A token may have both leading and trailing whitespaces. Typically, the first newline after a token terminates the trailing whitespace (trailing whitespace cannot contain newlines), whereas leading whitespace can contain newlines (a code sketch of this approach follows the figure below).

Three approaches to capture whitespaces: (from left to right) leading WS only, trailing WS only, both leading and trailing WS.
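
A sketch of the third approach, splitting the whitespace between two tokens at the first newline, could look like this inside the lexer (the helper names are hypothetical):

// Trailing trivia of the previous token: whitespace up to and including the
// first newline.
private String readTrailingTrivia() {
    StringBuilder sb = new StringBuilder();
    while (reader.peek() == ' ' || reader.peek() == '\t') {
        sb.append((char) reader.consume());
    }
    if (reader.peek() == '\n') {
        sb.append((char) reader.consume());    // first newline ends the trailing trivia
    }
    return sb.toString();
}

// Leading trivia of the next token: the remaining whitespace, which may
// contain further newlines.
private String readLeadingTrivia() {
    StringBuilder sb = new StringBuilder();
    while (reader.peek() == ' ' || reader.peek() == '\t' || reader.peek() == '\n') {
        sb.append((char) reader.consume());
    }
    return sb.toString();
}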

The Parser

There are two basic approaches for parsing:

  • Top-down parsing — Parses the input by searching the formal-grammar rules in a top-down manner. In other words, the high-level structure of the input is decided first, before parsing its children. For example, when parsing a top-level element of a file, the parser first decides the kind of element it is going to parse (e.g: whether it is a function definition or a variable definition). Only then does it start parsing the children of that top-level element. In top-down parsing, the left-most derivation of a grammar production is matched first. Thus, the tokens are consumed and matched from left to right. LL (Left-to-right, Leftmost derivation) parsers are examples of top-down parsers.
  • Bottom-up parsing — The parser attempts to parse the leaf nodes first, and then depending on these parsed nodes, it decides the parent and parses it. In bottom-up parsing, the right-most derivation of a grammar production is matched first. LR (Left-to-right, Rightmost derivation) parsers are examples of bottom-up parsers.

Construction of the parse tree can be done in either a top-down or a bottom-up manner with both of these approaches.

Both LL and LR parsers are generally referred to with a number associated with them, like LL(1), LR(1), LL(k), etc. This number (generically ‘k’) is the number of lookahead tokens used by the parser (similar to the lexer). That is, taking top-down parsing as an example, the parser needs to look ahead some tokens to determine which top-level node it should be parsing. In this article, I will focus on LL parsing, as LL parsers tend to be much simpler than bottom-up parsers.
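
For instance, an LL parser could pick the top-level node to parse by peeking at the next token(s), roughly like the following. The token kinds, node types, and the exact amount of lookahead here are assumptions made up for the example; they depend entirely on the grammar at hand.

// Decide which top-level node to parse by looking at the tokens ahead.
Node parseTopLevelNode() {
    switch (peekToken(1).kind) {
        case FUNCTION_KEYWORD:
            return parseFunctionDefinition();
        case INT_KEYWORD:
        case STRING_KEYWORD:
            // Both a function definition and a variable definition may start
            // with a type and a name; peek further to see whether '(' follows.
            if (peekToken(3).kind == TokenKind.OPEN_PAREN) {
                return parseFunctionDefinition();
            }
            return parseVariableDefinition();
        default:
            return parseStatement();
    }
}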

LL(k) Recursive descent parsers

LL parsers need to look at some tokens ahead to decide which grammar rule to process when there are alternative rules. Sometimes this number of lookahead tokens is a constant; in such cases, the number is written as a numeric value (e.g: LL(1)). However, for most parsers this value depends on the grammar rule and the current input (i.e: it depends on the context) and cannot be pre-determined. Such parsers use an arbitrary k, denoted as LL(k) or LR(k).

A parser is said to be recursive-descent in the sense that its implementation is a set of small functions, each of which parses a particular component of the grammar. These functions may depend on each other, thus creating mutual recursion.

e.g: A parseExpression() method may call a parseBinaryExpression() method if the expression currently being processed is a binary expression (e.g: a+b). But since both the LHS and the RHS of the binary expression are again some kind of expression, parseBinaryExpression() will call parseExpression() within it, creating a recursion. See below for sample code:

Expression parseExpression() {
    ...
    if (binaryExpr) {
        parseBinaryExpression();
    }
    ...
}

Expression parseBinaryExpression() {
    lhsExpr = parseExpression();
    binaryOp = parseBinaryOp();
    rhsExpr = parseExpression();
    ...
}

Predictive vs Backtracking

  • Predictive parser — Looks at the tokens ahead in advance to decide which grammar rule to process.
  • Backtracking parser — Tries to parse one alternative rule. If the parser realizes that the selected grammar rule does not match the input, it goes back (backtracks) to the point where parsing started for that rule and starts trying the next alternative rule. Likewise, it tries all the alternatives and proceeds with the best one.

Predictive parsers do not try out all the alternative rules, and hence they are faster. Backtracking parsers, on the other hand, try out all the alternatives, and hence tend to yield better results when there are syntax errors.
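
A backtracking parser typically remembers where it started and rewinds the token stream when an alternative fails. A rough sketch, with mark()/reset() on the token reader and ParseException as assumed helpers:

// Try the alternatives of a rule one by one, backtracking on failure.
Node parseFunction() {
    int marked = tokenReader.mark();        // remember the current position
    try {
        return parseFuncDef();              // first alternative
    } catch (ParseException e) {
        tokenReader.reset(marked);          // backtrack to the marked position
        return parseFuncDecl();             // next alternative
    }
}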

Grammar requirements

When implementing the parsing logic for a grammar-rule in an LL(k) recursive-descent predictive-parser, the grammar must satisfy the following:

  1. Must be non-left recursive

If a grammar rule is left-recursive, it must be rewritten in a way that eliminates the left recursion. e.g:

binary-expr := expression binary-operator expression
expression := binary-expr | basic-literal | . . .

Here the binary-expr rule is left-recursive, as the left side of the production (the LHS of the binary-operator) can again be another binary-expr. So if we try to implement this, our parse functions would end up stuck in an infinite recursion. To avoid that, the left recursion of the grammar rule must be eliminated. This can be done by rewriting the above rule as:

binary-expr := terminal-expression binary-operator expression
terminal-expression := basic-literal | var-ref | func-call | . . .
expression := binary-expr | terminal-expression | . . .

Here a terminal-expression is any expression that has a higher precedence than a binary expression, i.e: any expression that can be a leaf node in the binary tree of a binary expression.
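
With the rewritten rule, the parse function for a binary expression always consumes a terminal expression first, so the mutual recursion makes progress instead of looping forever. A sketch, where the node and helper names are illustrative assumptions:

// binary-expr := terminal-expression binary-operator expression
Expression parseBinaryExpression() {
    Expression lhsExpr = parseTerminalExpression();   // basic-literal, var-ref, func-call, ...
    Token binaryOp = parseBinaryOperator();
    Expression rhsExpr = parseExpression();
    return new BinaryExpression(lhsExpr, binaryOp, rhsExpr);
}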

  2. Must be linear

Consider the below grammar rule.

function := func-def | func-decl
func-def := functionSig ‘{‘ statements ‘}’
func-decl := functionSig ‘;’

Here both func-def and func-decl have a common component: functionSig. The problem with these kinds of rules is that, during parsing, the parser will first try to match the source to func-def, and if that fails, it will try matching it to func-decl. However, during this process, functionSig would be parsed twice. This can be avoided by rewriting the grammar rule in such a way that the common productions are factored out of the alternative rules. e.g:

function := functionSig (func-body | ‘;’)
func-body := ‘{‘ statements ‘}’
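
With the factored rule, the parser can parse functionSig exactly once and then decide between the two alternatives with a single token of lookahead. A sketch, where the node types and helpers are illustrative assumptions:

// function := functionSig (func-body | ';')
Node parseFunction() {
    Node signature = parseFunctionSig();
    if (peekToken(1).kind == TokenKind.SEMICOLON) {
        consumeToken();                     // ';'  -> function declaration
        return new FuncDecl(signature);
    }
    Node body = parseFuncBody();            // '{' statements '}' -> function definition
    return new FuncDef(signature, body);
}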

Parser Table

In theory, you may have heard that LL(k) predictive parsers are based on a parse table, and that we would need to construct one. The parse table simply defines which token to expect and the next rule to follow, depending on the grammar. The parser looks up this table to determine what to parse next.

However, in practice, we don’t need a parser table at all! That may sound a bit controversial, but there are a few reasons for that:

  • A parse table needs to be constructed before parsing can start. This means the startup time of the parser increases.
  • For complex grammars, this parse table can be extremely complex and large. I mean, huge!
  • After every token, the parser has to look up the parse table to figure out what to parse next. This results in a large number of table lookups, which can eventually impact the performance of the parser.

So, the alternative is to embed this parse-table information into the implementation itself. Taking the same earlier example of parsing a binary expression:

Expression parseBinaryExpression() {
    lhsExpr = parseExpression();
    binaryOp = parseBinaryOp();
    rhsExpr = parseExpression();
    ...
}

Here, the expected tokens and their order are embedded in the logic of the parseBinaryExpression() method itself. It no longer needs the help of a parsing table.

In this article, I discussed the algorithms and some implementation details of the lexer and the parser. In my next article (Writing a Parser — Syntax Error Handling), I will discuss the details of handling syntax errors and the error recovery mechanisms.
