Before we get to grammars themselves, some context.
When making a reader for a programming language, we often face the problem of converting source code from a format optimized for a human (= a string stored in a file) to one optimized for a computer (= a hierarchical data structure). This process is called parsing. We often delegate it to a special function called a parser. Because the data structure that emerges from the parser is hierarchical, it’s known as a parse tree.
There’s no magic to a parse tree. It simply describes, for the benefit of the computer, the structure that we humans easily see when we read code. For instance, if we look at this Python conditional:
if y > 0:
  x / y
else:
  print("nope")
We understand that the pieces fit into a hierarchy like this, with the if at the top and its pieces grouped beneath it:

if
├── y > 0
├── :
├── x / y
├── else:
└── print("nope")
This is the parse tree. In Racket, a parse tree can be notated as an S-expression, so we might write the above parse tree like so (where each node is grouped with its parent node using parentheses):
'(if (> y 0)
     :
     (/ x y)
     else:
     (print "nope"))
A parse tree describes the structure of source code. That’s why it’s useful in the reader for our language. But the parse tree doesn’t tell us how the code will actually run. For that, we’ll need further information about what the parsed pieces mean. We’ll add that meaning later, when we make our expander.
Suppose we translated the original Python conditional directly to Racket:
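; one direct translation of the Python conditional, minus the : and else: markers
(if (> y 0)
    (/ x y)
    (print "nope"))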
Look familiar? This code is almost identical to the parse tree we saw in the last section:
'(if (> y 0)
     :
     (/ x y)
     else:
     (print "nope"))
This similarity between parse tree and code is one reason Racket is suited to making programming languages. Recall that a Racket-implemented language acts as a source-to-source compiler that rewrites our language as Racket code. Generating a parse tree is often a giant step in the right direction.
This is why Racket and other Lisp languages are sometimes said to have “no syntax”. In any language, the parse tree is what remains after the surface notation of the language has been parsed. But in Racket, it’s almost like we’re starting with the parse tree.
When we implement a programming language, we need to teach the computer how to “see” the parse tree. Because computers are dumb, we do this by writing a parser, which consumes source code as input and returns a parse tree.
Every reader we make in Racket will include a parser. In fact, the parser often does the heaviest lifting in the reader. But it won’t always be fancy. For instance, in our read-syntax function for stacker, the parsing happened in one line of code:
(define src-datums (format-datums '(handle ~a) src-lines))
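Here’s roughly what that line does, sketched at a Racket REPL. (This assumes the br library that the stacker tutorial’s #lang br/quicklang builds on; format-datums comes from its br/datum module.)

(require br/datum)
; format-datums fills the ~a slot of the template with each source string,
; then reads the result back as a datum
(format-datums '(handle ~a) '("4" "+" ""))
; => '((handle 4) (handle +) (handle))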
Why does this count as a parser? Because it’s doing a parser’s job: reading the source code and converting it to a parse tree. For example, this stacker program (the #lang line below assumes stacker is installed as a package; the tutorial’s #lang reader form works the same way):
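#lang stacker
4
8
+
3
*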
Yielded a parse tree that looked like this (and which was stored in src-datums):
'((handle)
  (handle 4)
  (handle 8)
  (handle +)
  (handle 3)
  (handle *))
Our parser for stacker could be written in one line because the structure of the source code was so simple. We deliberately defined stacker to have only one argument per line.
For more complex languages (especially the large subset that has recursive structures, like bf), that won’t work.
It’s possible to create parsers by hand. But getting them right, and making them fast, is trickier than it looks. We need a better tool.
For many languages, rather than write a parser by hand, it makes more sense to use a parser generator to make a parser from a specification of the structure of the language. This specification is called a grammar.
Recall that a parse tree is a way of notating the structure of a particular program written in a certain language. A grammar can be thought of as a way of notating the structure of every possible program written in that language. Therefore, a grammar can generate the parse tree for any program written in the language. For this reason, it makes sense to build the parser for the language around the grammar.
This isn’t a new idea. A grammar-based parser is a lot like a regular expression: they both match their input to a certain pattern that describes the structure of the input. But as output, the regular expression produces substrings; the grammar-based parser produces parse trees.
Making a grammar doesn’t mean we have to acquire new knowledge. Once we learn any programming language, we understand its general structure. A grammar is simply a way of notating this knowledge.
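To make this concrete, here’s a sketch (assuming the brag parser generator available for Racket) of what a grammar for bf can look like; the rule names are illustrative:

#lang brag
bf-program : (bf-op | bf-loop)*
bf-op : ">" | "<" | "+" | "-" | "." | ","
bf-loop : "[" (bf-op | bf-loop)* "]"

Each line is a rule: a bf-program is any sequence of operations and loops, and a bf-loop is a bracketed sequence of the same. That last rule captures the recursive structure that a line-by-line parser like stacker’s can’t express.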
Couldn’t we skip the grammar and just write the parser by hand? For bf, we could. It’s simple enough. But knowing how to write a grammar pays a recurring dividend. Once we have a grammar, we can use a parser generator to make our parser at no extra cost. Furthermore:
If we’re implementing an existing language, someone has probably worked out a grammar already. For instance, the Python documentation publishes the official grammar for the language. We could use this to make a parser for a Python interpreter, or adapt the grammar to make a Python-like language.
If we’re implementing a new language, a grammar is a reality check. If we can write a grammar, chances are good that we can make our language work.
If we can’t, it’s a warning: our language has wrinkles or ambiguities that might cause problems later. For instance, there is no grammar for Markdown. This, in turn, is why Markdown parsers often disagree, and why the concept of “standardizing” Markdown is difficult at best. (As Andrew S. Tanenbaum quipped, “The nice thing about standards is that you have so many to choose from.”)
As we grow our language and experiment with new features, we can amend the grammar more easily than we could a hand-crafted parser.
Could we use regular expressions instead? Not unless we want two problems. Regular expressions treat every string as flat, sequential data. They can’t treat strings as hierarchically structured data (which describes almost all the source code we’ll be parsing). While it’s possible to write rudimentary parsers with regular expressions, for bigger jobs they quickly lead to despair. That said, our familiarity with regular expressions will help, because grammar notation is similar.
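For instance, here’s the flat view a regular expression gives us of a small bf snippet at a Racket REPL (the pattern and snippet are just for illustration):

; a regexp hands back a flat list of substrings...
(regexp-match* #rx"." "+[->+<]")
; => '("+" "[" "-" ">" "+" "<" "]")
; ...nothing in this result records that the brackets enclose the
; operations between them. That nesting is exactly what a grammar-based
; parser preserves in the parse tree.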