Writing a Parser — Part I: Getting Started
This article discusses a simple approach to implementing a handwritten parser from scratch and some fundamentals associated with it. It focuses more on explaining the practical aspects of the implementation than on the formal definitions of parsers.
Introduction
A parser is the very first thing that comes to mind when we speak of compiler development/compiler construction. Rightly so, a parser plays a key role in a compiler architecture and can also be considered the entry point to a compiler. Before we get into the details of how to write a parser, let's see what parsing actually means.
What is parsing?
Parsing essentially means converting source code into a tree-like object representation, which is called the ‘parse tree’ (also sometimes called the ‘syntax tree’). Often, an abstract syntax tree (AST) is confused with a parse/syntax tree. A parse tree is a concrete representation of the source code. It preserves all the information of the source code, including trivial information such as separators, whitespace, and comments. An AST, on the other hand, is an abstract representation of the source code, and may not contain some of the information that is in the source.
In a parse tree, each element is called a ‘node’. Leaf nodes, or terminal nodes, are treated as a special kind of node called a ‘token’. The non-terminal nodes are generally referred to simply as ‘nodes’.
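To make the distinction concrete, consider a hypothetical statement and a simplified view of how a parse tree and an AST might each represent it (the node names below are illustrative, not taken from any particular parser):

```
Source:  total = 10 ;  // set total

Parse tree (concrete)                 AST (abstract)

AssignmentStatement                   Assignment
├── Identifier     "total"            ├── name:  total
├── Whitespace     " "                └── value: 10
├── EqualsToken    "="
├── Whitespace     " "
├── NumberLiteral  "10"
├── Whitespace     " "
├── SemicolonToken ";"
└── Comment        "// set total"
```

Note how the parse tree keeps every character of the source, including the whitespace and the trailing comment, while the AST keeps only what is needed to understand the meaning of the statement.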
Why a handwritten parser?
If you look around enough, you will see that there are quite a few parser generators available, such as ANTLR, Bison, and Yacc. With these parser generators, we can simply define a grammar and automatically generate a parser according to that grammar. That sounds pretty easy! If so, why bother writing a parser from scratch?
A common mistake in compiler construction is thinking that we need to write a parser from scratch, or thinking that we don’t need our own parser. Well, that sounds contradictory! The catch is, each approach has its own pros and cons. So it is important to know when to write a parser by hand and when to use a parser generator:
A generated parser:
- Easy to implement — Define the grammar in the required format, and generate the parser. e.g., For ANTLR, all we need is to define the grammar in a .g4 file. Then, generating the parser is as simple as running a single command.
- Easy to maintain — Updating a grammar rule and regenerating the parser is all you need to do.
- Can be compact in size.
- However, it lacks the advantages of a handwritten parser (see below).
A handwritten parser:
- Writing a parser by hand is a moderately difficult task, and the complexity may increase if the language's grammar is complex. However, it has the following advantages:
- Can produce better and more meaningful error messages. Auto-generated parsers can sometimes produce totally unhelpful errors.
- Can support resilient parsing. In other words, it can produce a valid parse tree even upon a syntax error. This also means a handwritten parser can detect and handle multiple syntax errors at the same time. In generated parsers, this can be achieved to a certain extent with extensive customizations, but they might not be able to fully support resilient parsing.
- Can support incremental parsing — Parse only a portion of the code upon an update to the source.
- Usually better in terms of performance.
- Easy to customize. You own the code and have full control over it. e.g., In ANTLR4, if you want to customize the parsing logic, you’ll either have to extend the generated parser and do a bit of hacking, or write some custom logic in the grammar file itself, in another language. This can be messy at times, and the level of customization possible is very limited.
- Can easily handle context-aware grammars. Not all languages are 100% context-free. There can be situations where you want to tokenize the input or construct the parse tree differently depending on the context. This is a very difficult, or near-impossible, task when it comes to generated parsers.
So all in all, if you want a production-grade, highly optimized parser that is resilient, and if you have ample time, then a handwritten parser is the way to go. On the other hand, if what you need is a decent enough parser in very little time, and performance or resiliency is not among your requirements, a generated parser will do the trick.
Grammar
Regardless of whether we implement a handwritten parser or use a generated one, there is one thing that will always be needed: a well-defined grammar (a formal grammar) for the language we are going to implement. A grammar defines the lexical and syntactic structure of a program in that language. A very popular and simple format for defining context-free grammars is the Backus-Naur Form (BNF), or one of its variants such as the Extended Backus-Naur Form (EBNF).
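For illustration, here is what a small grammar might look like in EBNF. The language itself is a made-up toy (assignment statements such as total = 10;), and the rule names are illustrative:

```ebnf
(* A toy EBNF grammar for a hypothetical language of assignment statements. *)
program    = { statement } ;
statement  = identifier , "=" , expression , ";" ;
expression = term , { ( "+" | "-" ) , term } ;
term       = identifier | number ;
identifier = letter , { letter | digit } ;
number     = digit , { digit } ;
```

We will refer back to this toy grammar in the sketches that follow.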
Components of a Parser
Even though a parser is often referred to as a single component within a compiler architecture, it consists of several components, including but not limited to a lexer, a parser, and several other abstractions such as the input/character reader(s) and an error handler. The diagram below shows these components and how they are connected to each other in our parser implementation.
Character Reader / Input Reader
The character reader, also called the input reader, reads the source code and provides characters/code points to the lexer upon request. The source code can be many things: a file, an input stream, or even a string.
It is also possible to embed the input-reader capabilities into the lexer itself. However, the advantage of abstracting the reader out of the lexer is that, depending on the input, we can plug different readers into the same lexer, and the lexer doesn’t have to worry about handling different types of input.
An input reader consists of three sets of important methods (a minimal sketch follows the list):
- peek()/peek(k) — Get the next character/the k-th character ahead from the input. This is used to look ahead at characters without consuming/removing them from the input stream. Calling the peek() method more than once will return the same character.
- consume()/consume(k) — Get the next character/the k-th character ahead from the input, and remove it from the input. This means calling the consume() method multiple times will return a new character at each invocation. Sometimes this consume() method is also referred to as read() or next().
- isEOF() — Checks whether the reader has reached the end of the input.
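To make this concrete, here is a minimal sketch of such a reader over an in-memory string, in Java. The class name and the EOF sentinel are illustrative choices; a real implementation might stream from a file and handle code points beyond a single char:

```java
// A minimal character reader over a String input (illustrative sketch).
public class CharacterReader {

    public static final char EOF = (char) -1; // sentinel for end of input

    private final String input;
    private int position = 0;

    public CharacterReader(String input) {
        this.input = input;
    }

    // Look at the k-th character ahead without consuming it.
    public char peek(int k) {
        int index = position + k - 1;
        return index < input.length() ? input.charAt(index) : EOF;
    }

    public char peek() {
        return peek(1);
    }

    // Return the next character and advance past it.
    public char consume() {
        char c = peek();
        if (!isEOF()) {
            position++;
        }
        return c;
    }

    public boolean isEOF() {
        return position >= input.length();
    }
}
```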
The Lexer
The lexer reads characters from the input/character reader and produces tokens. In other words, it converts a character stream into a token stream; hence it is sometimes also called the tokenizer. These tokens are produced according to the defined grammar. Usually, the implementation of the lexer is slightly more complex than that of the character reader, but much simpler than that of the parser.
An important aspect of the lexer is handling whitespace and comments. In most languages, the language semantics are independent of whitespace. Whitespace is only required to mark the end of a token, and hence is also called ‘trivia’ or ‘minutiae’, as it has little value for the AST. However, this is not the case for every language: whitespace can have semantic meaning in languages such as Python. Different lexers handle whitespace and comments differently:
- Discard them at the lexer — The disadvantage of this approach is that it will not be possible to reproduce the source from the syntax/parse tree. This can become an issue if you are planning to use the parse tree for purposes like code formatting.
- Emit whitespace as separate tokens, but to a different stream/channel than the normal tokens. This is a good approach for languages where whitespace has semantic meaning.
- Persist them in the parse tree, by attaching them to the nearest token. This is the approach we will be using in our implementation, as sketched below.
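As a sketch of this third approach, a token can simply carry its surrounding trivia as extra fields. The field and type names here are illustrative; a real token would also record its position in the source:

```java
import java.util.List;

// A token that carries its surrounding trivia (illustrative sketch).
public class Token {
    public final TokenKind kind;              // e.g. IDENTIFIER, NUMBER
    public final String text;                 // exact source text of the token
    public final List<String> leadingTrivia;  // whitespace/comments before it
    public final List<String> trailingTrivia; // whitespace/comments after it

    public Token(TokenKind kind, String text,
                 List<String> leadingTrivia, List<String> trailingTrivia) {
        this.kind = kind;
        this.text = text;
        this.leadingTrivia = leadingTrivia;
        this.trailingTrivia = trailingTrivia;
    }
}

enum TokenKind { IDENTIFIER, NUMBER, PLUS, ASSIGN, SEMICOLON, EOF }
```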
Similar to the character reader, the lexer consists of two sets of methods (sketched below):
- peek()/peek(k) — Get the next token/the k-th token ahead. This is used to look ahead at tokens without consuming/removing them from the token stream. Calling the peek() method more than once will return the same token.
- consume()/consume(k) — Get the next token/the k-th token ahead, and remove it from the token stream. This means calling the consume() method multiple times will return a new token at each invocation. Sometimes this consume() method is also referred to as read() or next().
Once the lexer reaches the end of the input from the character reader, it emits a special token called the ‘EOFToken’ (end of file token). The parser uses this EOFToken to terminate the parsing.
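Here is a minimal sketch of such a lexer in Java, built on top of the CharacterReader and Token types sketched earlier. It only recognizes numbers and caches a single lookahead token; a real lexer would also support peek(k), collect trivia, and handle identifiers, operators, and so on:

```java
import java.util.List;

// A minimal lexer exposing peek()/consume() (illustrative sketch).
public class Lexer {

    private final CharacterReader reader;
    private Token lookahead; // cached token backing peek()

    public Lexer(CharacterReader reader) {
        this.reader = reader;
    }

    // Look at the next token without consuming it.
    public Token peek() {
        if (lookahead == null) {
            lookahead = readToken();
        }
        return lookahead;
    }

    // Return the next token and advance past it.
    public Token consume() {
        Token token = peek();
        lookahead = null;
        return token;
    }

    private Token readToken() {
        if (reader.isEOF()) {
            // Emit the special EOFToken once the input is exhausted.
            return new Token(TokenKind.EOF, "", List.of(), List.of());
        }
        char c = reader.peek();
        if (Character.isDigit(c)) {
            return readNumber();
        }
        // ... identifiers, operators, trivia collection, etc. go here.
        throw new IllegalStateException("unexpected character: " + c);
    }

    private Token readNumber() {
        StringBuilder sb = new StringBuilder();
        while (!reader.isEOF() && Character.isDigit(reader.peek())) {
            sb.append(reader.consume());
        }
        return new Token(TokenKind.NUMBER, sb.toString(), List.of(), List.of());
    }
}
```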
The Parser
The parser is responsible for reading the tokens from the lexer and producing the parse tree. It gets the next token from the lexer, analyzes it, and compares it against the defined grammar; then it decides which grammar rule should be applied and continues to parse accordingly. However, this is not always straightforward, as sometimes it is not possible to determine which path to take by looking at the next token alone. Thus, the parser may have to look a few tokens ahead as well, to decide the path or the grammar rule to apply. We will discuss this in detail in the next article. Due to complications like these, the parser is also the most complex component to implement in the parser architecture.
Generally, a parser only needs one method — a parse() method that does all the parsing and returns the parse tree, as sketched below.
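To illustrate, here is a sketch of a recursive-descent parser for the toy assignment grammar from the EBNF example above. It decides which rule to apply by peeking at the next token. The node types (Statement, Assignment) and the expect() helper are illustrative, and on a syntax error this sketch simply throws, whereas a resilient parser would instead delegate to the error handler discussed below:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative parse-tree node types for the toy grammar.
interface Statement {}
class Assignment implements Statement {
    final Token name;
    final Token value;
    Assignment(Token name, Token value) { this.name = name; this.value = value; }
}

// A recursive-descent parser driven by single-token lookahead (sketch).
public class Parser {

    private final Lexer lexer;

    public Parser(Lexer lexer) {
        this.lexer = lexer;
    }

    // Entry point: parse statements until the EOFToken is reached.
    public List<Statement> parse() {
        List<Statement> statements = new ArrayList<>();
        while (lexer.peek().kind != TokenKind.EOF) {
            statements.add(parseStatement());
        }
        return statements;
    }

    private Statement parseStatement() {
        // Decide which grammar rule applies by looking at the next token.
        Token next = lexer.peek();
        switch (next.kind) {
            case IDENTIFIER:
                return parseAssignment(); // statement = identifier "=" expr ";"
            default:
                // Syntax error: a resilient parser would recover here instead.
                throw new IllegalStateException("unexpected token: " + next.text);
        }
    }

    private Statement parseAssignment() {
        Token name = expect(TokenKind.IDENTIFIER);
        expect(TokenKind.ASSIGN);
        Token value = expect(TokenKind.NUMBER); // expressions simplified to a number
        expect(TokenKind.SEMICOLON);
        return new Assignment(name, value);
    }

    // Consume a token of the expected kind, or fail with a message.
    private Token expect(TokenKind kind) {
        Token token = lexer.peek();
        if (token.kind != kind) {
            throw new IllegalStateException(
                "expected " + kind + " but found " + token.kind);
        }
        return lexer.consume();
    }
}
```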
Given that our objective is to implement a parser that is both resilient and gives proper error messages, properly handling syntax errors is quite an important aspect of the parser. A syntax error is a case where an unexpected token is reached during parsing, or in other words, where the next token does not match the defined grammar. In such cases, the parser asks the ‘error handler’ (see the next section) to recover from the syntax error, and once the error handler recovers, the parser continues to parse the rest of the input.
Error Handler
As discussed in the previous section, the objective of the error handler is to recover from syntax errors. It plays a key role in a modern resilient parser, especially in producing a valid parse tree even in the presence of syntax errors, and in giving proper and meaningful error messages.
Error handling capabilities can also be embedded into the parser itself. The advantage of doing so is that, since errors are handled then and there in the parser, a lot of contextual information is available at the point of recovery. However, embedding recovery capabilities into the parser has more disadvantages than advantages:
- Trying to recover at each and every place will result in a lot of repetitive tasks and duplicated code.
- The parser logic will be cluttered with error handling logic, which will eventually make the codebase hard to read and understand.
- By having a separate error handler, it is also possible to plug in different error handlers for different use cases. For example, one might use one error handling approach for CLI tools and a different one for interactive IDEs, since IDEs might want to facilitate code completion, and hence the pattern of recovery should be closer to the user's writing pattern. A sketch of such a pluggable interface follows this list.
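As a sketch, such a pluggable error handler could be as simple as an interface that the parser calls upon encountering an unexpected token. The interface and the strategies named in the comments are illustrative; token insertion and deletion are common recovery techniques, not specific to any one implementation:

```java
// A pluggable error handler interface (illustrative sketch).
public interface ErrorHandler {

    // Report a syntax error at the offending token, with a readable message.
    void reportError(Token found, String message);

    // Attempt to recover so parsing can continue: e.g. synthesize the token
    // the grammar expected (a "missing" token), or drop the unexpected token
    // and retry. Returns the token the parser should proceed with.
    Token recover(Token found, TokenKind expected);
}
```

A CLI-oriented implementation might simply report the error and bail out early, while an IDE-oriented implementation might favor inserting missing tokens so that the tree stays useful for features like code completion.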
In this article, we discussed the basic architecture of a parser, some terminologies, and when to go for a handwritten parser and when not to. We also discussed some high-level details of the different components of a parser and their requirements. In the next article, I will discuss each of these components in detail, along with the algorithms to follow, including some implementation details.