Here’s a simple grammar that describes a five-digit zip code:
1 2 3 | zip-code : digit digit digit digit digit digit : "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" |
A grammar consists of a series of production rules, written one per line. On the left of each rule is the name of a structural element of the language. The name is like a variable—it can be anything we want (though as usual, short & meaningful names are wise). A colon goes in the middle of the rule. + This notation style is known as Extended Backus–Naur form or EBNF, so this kind of grammar is also known as an EBNF grammar. Above, we have production rules for two elements: zip-code and digit.
On the right of each rule, we have a pattern for that element. This side of the rule is like a regular expression: it can include literal strings (like "<" or "def"), classes of strings (like UPPERCASE), or names of other production rules (denoting an appearance of that element). If a pattern has multiple possibilities for a certain position, they’re separated by a vertical bar |.
The first rule in a grammar defines the top-level element—in this case, a zip-code. A zip code has five digits, described by the pattern digit digit digit digit digit. Whitespace in each production rule is ignored during the matching, so this rule denotes five adjacent digits.
On the next line, we define a digit as any string from "0" to "9", using vertical bars to separate the choices. Again, because whitespace within the pattern is ignored, we can break the rule onto two lines.
How does a grammar-based parser work? The parser takes a string of source code as input (or an input port pointing at the code). Starting with the first production rule in the grammar, the parser tries to match the source code to the pattern on the right. If that pattern contains names of other rules in the grammar, the parser recursively tries to match those rules, again using the patterns on the right.
This process continues until one of two things happen:
The parser decomposes the source code into things that can’t be decomposed further, called terminals. In the zip-code grammar, the digit strings "0" through "9" would be terminals. The parser returns its result: a parse tree describing the structure of the program. + Corollary: the “leaves” of a parse tree are always terminals in the grammar.
The parser can’t find any way to decompose the source code into terminals according to the rules of the grammar. The parse fails.
Suppose we tried to parse the string "123ABC" with the grammar above:
Starting with the top rule, the parser would see that to make a zip-code, it would need to match five digits in a row.
It would match "1" "2" "3" as digits.
But when it reached "ABC", it would fail, because neither "A" nor "AB" nor "ABC" matches a possible pattern for digit.
Or suppose we tried to parse "1234":
This time, every character would be matched as a digit.
But when the parser went to match a fifth digit, it would get stuck again, because it would be out of characters.
Now suppose we tried to parse "01234":
This time, the parser would look for five digits, and would find them: each character of "01234" would be matched as a digit, with nothing left over.
Our reward would be this parse tree:
1 2 3 4 5 6 | '(zip-code (digit "0") (digit "1") (digit "2") (digit "3") (digit "4")) |
By the way, how does a parser avoid going down fruitless paths? It doesn’t. On the contrary, a parser might explore many cul-de-sacs on its way to a finished parse tree (just as a regular expression might end up trying lots of unsuccessful substring matches). It’s trial and error.
Grammars are not unique. We can always find another grammar to describe a language. For instance, our zip-code grammar could also be written like this:
1 2 3 4 5 | zip-code : digit-pair digit digit-pair digit-pair : digit digit digit : odd-digit | even-digit odd-digit : "1" | "3" | "5" | "7" | "9" even-digit : "0" | "2" | "4" | "6" | "8" |
It should be apparent that any string that matched (or did not match) the first zip-code grammar would also match (or not match) this one. For instance, our string 01234 would now be parsed like so:
1 2 3 4 5 6 | '(zip-code (digit-pair (digit (even-digit "0")) (digit (odd-digit "1"))) (digit (even-digit "2")) (digit-pair (digit (odd-digit "3")) (digit (even-digit "4")))) |
It would be perverse to define a zip-code grammar this way. Nevertheless, we have latitude to do so. As usual, however, simpler tends to be better.
It’s also possible to write grammars that are ambiguous, in the sense that the rules are sufficiently lenient that a string can produce more than one valid parse tree.
For instance, in this zip-code grammar, the pattern for the zip-code rule has two equivalent options for getting to five digits:
1 2 3 4 5 6 | zip-code : three-digits two-digits | two-digits three-digits two-digits : digit digit three-digits : digit digit digit digit : "1" | "3" | "5" | "7" | "9" | "0" | "2" | "4" | "6" | "8" |
With this grammar, our string 01234 could be parsed one of two ways:
1 2 | '(zip-code (three-digits (digit "0") (digit "1") (digit "2")) (two-digits (digit "3") (digit "4"))) |
1 2 | '(zip-code (two-digits (digit "0") (digit "1")) (three-digits (digit "2") (digit "3") (digit "4"))) |
A parser generator is unlikely to complain about this grammar—as long as it can complete a parse tree, it’s happy. But ambiguities like these often indicate lurking illogic, and are best avoided.
The production rules in our zip-code grammar only relied on simple string matching, and a vertical bar | to denote alternate possibilities. Let’s look at some of the other special notation we can use in our production-rule patterns.
Recall that stacker programs consisted of a list of instructions separated by line breaks, where each instruction was either an integer num or the functions + and *. A grammar for stacker might look like this:
1 2 3 4 5 6 | stacker-program : "\n"* instruction ("\n"+ instruction)* instruction : integer | func integer : ["-"] digit+ digit : "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" func : "+" | "*" |
Stepping through each rule:
As usual, the top-level element of our parse tree, stacker-program, is the name of the first production rule.
The pattern for this rule starts with "\n"*. The * quantifier should be familiar from regular expressions—it means “zero or more of the preceding item”. Taken together, "\n"* means “zero or more newlines”.
The next element in the pattern is instruction. There’s no quantifier on this element, which means every stacker-program needs to have at least one instruction.
Parentheses create subsequences. So the parenthesized expression ("\n"+ instruction) means “match the sequence "\n"+ followed by instruction“.
The + quantifier means “match one or more of the preceding item” (again, analogous to its meaning in regular expressions.) So "\n"+ means “one or more newlines”. This guarantees that multiple instructions are separated by one or more line breaks.
The * quantifier on the whole parenthesized subsequence once again means “zero or more of the preceding item”. So a stacker-program may or may not have multiple instruction elements separated by newlines.
Our instruction rule uses the | operator to indicate that an instruction can be either an integer or a func.
Square brackets mean “zero or one of the preceding item,” aka an optional match. In our integer rule, the ["-"] in front of digit+ means that an integer may or may not have a "-" prefix. digit+ itself means “match one or more digits.”
As before, the digit and func rules use the | operator to list out the possibilities.
So if we start with our good old stacker test program:
Our parse tree will look like this:
1 2 3 4 5 6 7 8 9 10 | '(stacker-program (instruction (integer (digit "4"))) "\n" (instruction (integer (digit "8"))) "\n" (instruction (func "+")) "\n" (instruction (integer (digit "3"))) "\n" (instruction (func "*"))) |
We should note how the parse tree lines up with the grammar:
Each parenthesized node in the parse tree corresponds to a production rule, starting with the name of the rule, and followed by the elements that matched the pattern for that rule.
Rules that rely on other rules lead to deeper nesting. For instance, an integer node will always contain a digit node.
Every character that appeared in the original source string also appears in the parse tree.
Of course, most programming languages are more structurally intricate than stacker or a zip code. Fortunately, because the rules in a grammar can be refer to each other recursively, a lot of complexity can be expressed in a small set of rules.
For instance, let’s invent M-expressions, which we’ll define as the subset of S-expressions that only contain addition and multiplication of integers. M-expressions can be nested to any depth. A short example:
1 | (+ 1 (* 2 (+ 3 4) 5) 6) |
The grammar for M-expressions might look like this:
1 2 3 4 5 6 | m-expr : m-list | integer m-list : "(" func (" "+ m-expr)* ")" integer : ["-"] digit+ digit : "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" func : "+" | "*" |
As before, the top-level element of our parse tree, m-expr, is the name of the first production rule. We can recycle the integer, digit, and func rules from the stacker grammar above.
The new possibility is an m-list, which is a series of items between parentheses, starting with a func, and followed by any number of m-exprs separated with word spaces. Notice that the pattern for an m-expr refers to an m-list, and vice versa. Mutually recursive definitions like these are not only legit, they’re often necessary to express recursive features of a language.
The result is that these five rules suffice to match any M-expression, no matter how deeply nested. The parse tree for our test expression will look like this (adjusting the white space for clarity):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | '(m-expr (m-list "(" (func "+") " " (m-expr (integer (digit "1"))) " " (m-expr (m-list "(" (func "*") " " (m-expr (integer (digit "2"))) " " (m-expr (m-list "(" (func "+") " " (m-expr (integer (digit "3"))) " " (m-expr (integer (digit "4"))) ")")) " " (m-expr (integer (digit "5"))) ")")) " " (m-expr (integer (digit "6"))) ")")) |
As before, we can see that all the characters in our source string appear in a node of the parse tree, and these nodes correspond to the names and patterns of the production rules.