Writing a Parser — Part III: Syntax Error Handling
One of the nightmares of implementing a production-grade parser is having to deal with user errors and invalid code. My previous article discusses the algorithms and some implementation details of a lexer and a parser. In this article, I will discuss how to handle syntax errors in a parser and how to recover from them to produce a resilient parse tree.
Syntax Errors
A syntax error is a case where an unexpected token is reached, or in other words, where the next token does not match the defined grammar. Many different syntax errors can occur across languages. However, if you look closely, any syntax error in any language stems from only two causes:
- An expected token is missing at some point.
- Some additional tokens (junk tokens) exist before the expected token.
This makes life a little bit easier for a parser, as it needs to worry about only two cases. But the parser still needs to recover from these kinds of syntax errors in order to continue parsing the rest of the document/input. And this recovery is not as simple as it may sound.
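For example, consider the following variable declarations (illustrative snippets, matching the cases referenced below):

int a = 5;    // (1) valid declaration
int = 5;      // (2) syntax error: the variable name is missing
int a b = 5;  // (3) syntax error: an extra identifier token 'b'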
Recovering
The error handler is responsible for recovering from such syntax errors. The parser asks the error handler to try to recover from the syntax error, and the error handler gives the recovered node back to the parser. In the above sample, the solution for recovering in (2) is to insert an identifier token (a variable name), and the solution for (3) is to remove one of the identifier tokens (preferably ‘b’). This is easy for the human eye to see, but how does the error handler know which action to take? From the error handler’s perspective, upon a syntax error:
- Should the next token(s) be deleted?
- Should some token(s) be inserted? If so, what is/are the token(s) to be inserted?
To answer these questions, we could make use of an error recovery strategy.
Approach I:
One of the simplest approaches is to keep deleting tokens until a matching token is reached. However, this may end up removing many tokens, and it is not a good approach for errors caused by missing tokens, as in example (2) above.
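A minimal sketch of this approach in Java (the Lexer, Token, and TokenKind types here are illustrative assumptions, not a real API):

import java.util.Set;

class PanicModeRecovery {
    // Discard tokens until one of the tokens the grammar currently
    // expects (or EOF) shows up.
    void recover(Lexer lexer, Set<TokenKind> expected) {
        while (lexer.peek().kind() != TokenKind.EOF
                && !expected.contains(lexer.peek().kind())) {
            lexer.consume(); // drop the unexpected (junk) token
        }
    }

    // Assumed minimal interfaces, for illustration only.
    interface Lexer { Token peek(); Token consume(); }
    record Token(TokenKind kind) {}
    enum TokenKind { INT_KEYWORD, IDENTIFIER, EQUAL, NUM_LITERAL, SEMICOLON, EOF }
}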
Approach II:
When an unexpected token is reached, check whether this token is an ‘expected token’ within the next k rules of the grammar (e.g., k=5), and decide as follows (a sketch follows the list):
- If yes, then that means the expected token at the current position is missing. Hence insert the missing token and continue.
- If not, that means this is an extraneous token. Hence remove the token.
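A sketch of that decision logic, assuming the parser can list the token kinds the grammar expects next (all names here are illustrative):

import java.util.List;

class LookaheadDecision {
    static final int K = 5; // how far ahead in the grammar to search

    // If the unexpected token appears among the next K expected tokens,
    // the tokens before it are missing and should be inserted; otherwise
    // the unexpected token is extraneous and should be removed.
    Fix decide(TokenKind unexpected, List<TokenKind> expectedAhead) {
        int limit = Math.min(K, expectedAhead.size());
        for (int i = 0; i < limit; i++) {
            if (expectedAhead.get(i) == unexpected) {
                return new Fix(Fix.Kind.INSERT, expectedAhead.subList(0, i));
            }
        }
        return new Fix(Fix.Kind.REMOVE, List.of());
    }

    record Fix(Kind kind, List<TokenKind> missing) {
        enum Kind { INSERT, REMOVE }
    }
    enum TokenKind { INT_KEYWORD, IDENTIFIER, EQUAL, NUM_LITERAL, SEMICOLON }
}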
This approach can yield better results than the previous one. However, it does not work well when there is more than one error at a single location. Going a bit further along the same path, the following approach is more involved, but it can yield far superior recovery results.
Approach III (Recommended):
When an unexpected token is reached, the error handler does the following (a minimal sketch is given after this list):
- Insert a token, and see how far the parser can successfully progress. The current context of the parser (i.e., which grammar rule is currently being parsed) and the language’s formal grammar can be used to decide the kind of token to be inserted.
- Delete a token and see how far the parser can successfully progress.
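A minimal sketch of that insert-vs-delete evaluation (the tryToParse helper and all types are assumptions for illustration):

class InsertOrDeleteRecovery {
    // On a syntax error, speculatively apply each fix and keep whichever
    // lets the parser progress further.
    Solution recover(ParserRuleContext ctx) {
        Solution insertFix = tryToParse(ctx, Fix.INSERT); // pretend the expected token exists
        Solution deleteFix = tryToParse(ctx, Fix.DELETE); // skip the unexpected token
        return insertFix.matches() >= deleteFix.matches() ? insertFix : deleteFix;
    }

    // Assumed helper: parses ahead speculatively (without consuming input
    // for real) and reports how many tokens matched before the next error.
    Solution tryToParse(ParserRuleContext ctx, Fix fix) { /* elided */ return null; }

    enum Fix { INSERT, DELETE }
    enum ParserRuleContext { VAR_DECL, EXPRESSION }
    record Solution(Fix fix, int matches) {}
}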
Measuring ‘how far the parser can progress’ can be done by maintaining two streams of tokens:
- The expected set of tokens — Can be formed by looking at the grammar.
- The actual set of tokens — This is the stream returned by the lexer.
Then, comparing the two token streams can be used to find the number of correct matches, and the token(s) to be inserted upon an incorrect match. Consider the above example (2), where the variable name is missing in a variable declaration. If we compare the two streams, there is a mismatch at the second token (see below).
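For example (2), int = 5;, the comparison looks roughly as follows (an illustrative rendering; token kinds are simplified):

expected:  int   <identifier>   =   <expr>   ;
actual:    int   =              5   ;
                 ^ mismatch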
As you can see in the above diagram, inserting an identifier token correctly aligns the actual token stream with the expected token stream. However, this is not as simple as it sounds. For example, in place of ‘expr’, different kinds of expressions are allowed, such as int-literals, string-literals, etc. So the error handler needs to check all of these alternative paths when comparing the token streams. Another complication is that there could be more than one syntax error close to each other, which affects the counting of correct matches. Thus, we can extend the above solution as follows to overcome these issues.
After fixing the current head and trying to progress, if the error handler encounters more errors, it tries to fix those as well. All possible combinations of insertions and deletions are tried out for such errors. Once all possible paths have been explored, the error handler picks the combination that leads to the best recovery. The optimal solution is selected using the criteria below (a ranking sketch follows the list):
- Pick the solution with the longest matching sequence.
- If there’s a tie, then pick the solution that requires the lowest number of ‘fixes’.
- If there’s still a tie, then give priority to ‘insertion’, as that doesn’t require removing input the user has given.
- If two or more alternative paths give the same result, give priority to the order of occurrence.
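In Java, this ranking could be expressed as a comparator over candidate solutions (the Solution record and its fields are illustrative assumptions):

import java.util.Comparator;

class SolutionRanking {
    // Criteria 1-4 from the list above, applied in order.
    static final Comparator<Solution> BEST_FIRST =
        Comparator.comparingInt(Solution::matches).reversed()          // 1. longest matching sequence
            .thenComparingInt(Solution::fixCount)                      // 2. fewest fixes
            .thenComparingInt(s -> s.firstFixIsInsert() ? 0 : 1)       // 3. prefer insertion over deletion
            .thenComparingInt(Solution::order);                        // 4. order of occurrence

    record Solution(int matches, int fixCount, boolean firstFixIsInsert, int order) {}
}

Given a list of candidate solutions, candidates.stream().min(SolutionRanking.BEST_FIRST) then yields the winner.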
Finally, the best solution is applied and parsing continues. For example, if the best combination of fixes was [insert, insert, remove, remove], then apply only the first fix and continue:
- If the fix was a ‘remove’ — then consume one token from the token stream, and continue from the same rule again.
- If the fix was an ‘insert’ — then insert the missing node, and continue from the next rule, without consuming the token stream.
The reason for applying only the first fix is that, once an error is fixed and the cursor has moved forward, the set of tokens that will be considered for recovering the second error could be different. Thus, there is a possibility of arriving at a different solution for the second fix than the one computed earlier.
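Putting the two cases together (a sketch; the node types and helpers below are assumptions for illustration):

class ApplyFirstFix {
    // Apply only the first fix of the winning solution, then continue.
    Node apply(Solution best, Lexer lexer, Rule currentRule) {
        if (best.firstFix() == Fix.REMOVE) {
            lexer.consume();           // consume one token from the stream...
            return parse(currentRule); // ...and retry the same rule
        }
        // INSERT: fabricate a 'missing' token node; the token stream is
        // not consumed, and parsing continues from the next rule.
        return missingTokenNode(best.expectedKind());
    }

    // Assumed helpers and types, for illustration only.
    Node parse(Rule rule) { return null; }
    Node missingTokenNode(TokenKind kind) { return null; }
    interface Lexer { void consume(); }
    interface Node {}
    enum Fix { INSERT, REMOVE }
    enum Rule { VAR_DECL }
    enum TokenKind { IDENTIFIER }
    record Solution(Fix firstFix, TokenKind expectedKind) {}
}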
Optimizations
The algorithm we discussed in the previous section is a brute-force algorithm: it tries out all possible combinations of tokens (according to the grammar) to find the best possible solution. To make it more optimal in terms of runtime performance, we can apply the optimizations below.
Early exit
Suppose there's a syntax error at a position where the grammar supports branching/alternative paths. An example would be a syntax error at a place where an expression is expected. See the below example, where the value expression is missing in a variable definition.
int a = ; // syntax error
int b = 5;
The grammar at the point of syntax error is:
expr := basic-literal | variable-ref | func-call | binary-expr
Now the error handler will insert/delete a token and try out all the alternatives (basic-literal, variable-ref, func-call, binary-expr) to figure out the best match. But the most optimal solution here is to insert a single token (a number or an identifier), which would match either basic-literal or variable-ref. When the error handler tries basic-literal as the first option, it gets a result where every lookahead token matches, which is the best result the error handler can ever get. Thus, there's no point in trying out the rest of the alternatives, as we have already got the best possible solution. Therefore the error handler can stop processing further alternatives and return that result.
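A sketch of that early-exit loop (the names and the lookahead limit are illustrative assumptions):

import java.util.List;

class EarlyExit {
    static final int LOOKAHEAD_LIMIT = 5; // assumed depth of speculative parsing

    Solution pickBest(List<Alternative> alternatives) {
        Solution best = null;
        for (Alternative alt : alternatives) {
            Solution s = evaluate(alt);
            if (s.matches() == LOOKAHEAD_LIMIT) {
                return s; // a full match; no other alternative can do better
            }
            if (best == null || s.matches() > best.matches()) {
                best = s;
            }
        }
        return best;
    }

    // Assumed helper: speculatively parses the alternative and counts matches.
    Solution evaluate(Alternative alt) { /* elided */ return null; }
    record Solution(int matches) {}
    enum Alternative { BASIC_LITERAL, VARIABLE_REF, FUNC_CALL, BINARY_EXPR }
}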
Memoizing
In scenarios like the above, where alternative paths are present, the error handler will face situations where the same set of tokens is matched against the same expected-token pattern (the same grammar productions) multiple times. In such cases, if the error handler stores the result of the first visit, it can return the same result the second time around, without having to re-evaluate the same rule. This can vastly improve the performance of the error handler if the grammar is complex and has a large number of alternative paths.
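A sketch of such a cache, keyed by the pair (token position, grammar rule); the types are illustrative assumptions:

import java.util.HashMap;
import java.util.Map;

class MemoizedEvaluator {
    private final Map<Long, Solution> memo = new HashMap<>();

    // The same rule evaluated at the same token position is computed once.
    Solution evaluate(int tokenIndex, Rule rule) {
        long key = ((long) tokenIndex << 32) | rule.ordinal();
        return memo.computeIfAbsent(key, k -> doEvaluate(tokenIndex, rule));
    }

    // Assumed helper: the actual (expensive) evaluation.
    Solution doEvaluate(int tokenIndex, Rule rule) { /* elided */ return null; }
    record Solution(int matches) {}
    enum Rule { BASIC_LITERAL, VARIABLE_REF, FUNC_CALL, BINARY_EXPR }
}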
A fail-safe
A common issue that recursive-descent parsers with error recovery face is getting stuck in infinite recursion. This can happen when the error handler selects a token to insert as the best solution, but the parser decides that the token is not a good match (due to a difference in the lookaheads). This can be reduced by keeping the lookahead and matching logic of the parser and the error handler in sync.
However, in reality it is pretty much impossible to eliminate this completely. So it is always good to have a fail-safe mechanism to terminate such possible infinite recursions. One approach is to keep track of the token index: if the parser keeps trying to parse at the same index over and over again, it can be concluded that the parser is stuck in an infinite recursion, and it can be bailed out by explicitly consuming a token.
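One way to sketch such a fail-safe (the threshold and the lexer interface are assumptions):

class InfiniteRecursionGuard {
    private static final int STUCK_LIMIT = 5; // assumed threshold
    private int lastIndex = -1;
    private int visits = 0;

    // Called on every parse attempt; force progress if the parser keeps
    // revisiting the same token index.
    void check(int tokenIndex, Lexer lexer) {
        if (tokenIndex != lastIndex) {
            lastIndex = tokenIndex;
            visits = 0;
            return;
        }
        if (++visits > STUCK_LIMIT) {
            lexer.consume(); // bail out by explicitly consuming a token
            visits = 0;
        }
    }

    interface Lexer { void consume(); }
}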