Back to the task at hand—writing a grammar for bf so we can parse source code like this:
1 2 3 4 5 | ++++++[>++++++++++++<-]>. >++++++++++[>++++++++++<-]>+. +++++++..+++.>++++[>+++++++++++<-]>. <+++[>----<-]>.<<<<<+++[>+++++<-]>. >>.+++.------.--------.>>+. |
It turns out, however, that bf has a grammar even simpler than our fictitious M-expressions.
Let’s recall how bf works:
There are six operators that manipulate a byte array: > < + - . ,.
There’s also a looping construct [···] that can contain operators or other loops.
Whitespace and other characters have to be handled too—but we’ll leave those out of the grammar, because in the next section, we’ll learn a shortcut for dealing with them.
Let’s build up our grammar line by line. The first production rule represents the top element, which we’ll call bf-program.
1 | bf-program : |
What’s the pattern for a bf-program? We know that bf is just a list of operations and loops. Let’s call them bf-op and bf-loop respectively. Thus our pattern for bf-program is a list of any number of bf-ops or bf-loops:
1 | bf-program : (bf-op | bf-loop)* |
To recap, (bf-op | bf-loop) means “either a bf-op or a bf-loop” and * means “zero or more of the preceding item”.
If we stopped here, our grammar wouldn’t work, because we’ve referred to bf-op and bf-loop without defining them. We’ll do that next:
1 2 3 | bf-program : (bf-op | bf-loop)* bf-op : bf-loop : |
So what’s a bf-op? As we noted, it’s one of the characters > < + - . ,, which we notate like so:
1 2 3 | bf-program : (bf-op | bf-loop)* bf-op : ">" | "<" | "+" | "-" | "." | "," bf-loop : |
Finally, what’s a bf-loop? It’s a pair of square brackets with another list of bf-ops or bf-loops inside:
1 2 3 | bf-program : (bf-op | bf-loop)* bf-op : ">" | "<" | "+" | "-" | "." | "," bf-loop : "[" (bf-op | bf-loop)* "]" |
That’s it—we have a complete grammar for bf. Meaning, bf may be a crazy language, but no matter how big or complex a bf program someone handed us, we could parse it using just these three rules.
Is this the only valid grammar for bf? No. We could recursively reuse our bf-program element like so:
1 2 3 | bf-program : (bf-op | bf-loop)* bf-op : ">" | "<" | "+" | "-" | "." | "," bf-loop : "[" bf-program "]" |
We could also create a separate element, call it bf-exprs, to represent a list of bf-ops and bf-loops:
1 2 3 4 | bf-program : bf-exprs bf-exprs : (bf-op | bf-loop)* bf-op : ">" | "<" | "+" | "-" | "." | "," bf-loop : "[" bf-exprs "]" |
A case could be made for either of these alternatives. The tiebreaker is convenience. For reasons that will become clear when we write our expander, the easiest grammar to work with will be the first one, because bf-program only appears once, and it otherwise uses the fewest possible rules.
Our small investment in making a grammar will pay an immediate dividend. Because now that we’ve made our bf grammar:
1 2 3 | bf-program : (bf-op | bf-loop)* bf-op : ">" | "<" | "+" | "-" | "." | "," bf-loop : "[" (bf-op | bf-loop)* "]" |
We need to convert it to an actual bf parser. How?
Consistent with the idea of using a domain-specific language to solve specialized problems, we’ll use a parser-generating language called brag that takes a list of production rules as its source code, and turns those rules into a working parser.
Here’s how we use it. We make a new file called "parser.rkt" like so:
1 2 3 4 | #lang brag bf-program : (bf-op | bf-loop)* bf-op : ">" | "<" | "+" | "-" | "." | "," bf-loop : "[" (bf-op | bf-loop)* "]" |
All we’ve done is take our grammar and add a #lang brag line to the top. When we import this module into our bf reader, we’ll get a function called parse that implements this grammar, and another called parse-to-datum that will let us check the generated parse tree.
Let’s do that now, by making another file in the same directory, and pasting in the source code from our demo program as the input to parse-to-datum:
1 2 3 | #lang br (require "parser.rkt") (parse-to-datum "++++-+++-++-++[>++++-+++-++-++<-]>.") |
When we run "parser-tester.rkt", we’ll get this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | '(bf-program (bf-op "+") (bf-op "+") (bf-op "+") (bf-op "+") (bf-op "-") (bf-op "+") (bf-op "+") (bf-op "+") (bf-op "-") (bf-op "+") (bf-op "+") (bf-op "-") (bf-op "+") (bf-op "+") (bf-loop "[" (bf-op ">") (bf-op "+") (bf-op "+") (bf-op "+") (bf-op "+") (bf-op "-") (bf-op "+") (bf-op "+") (bf-op "+") (bf-op "-") (bf-op "+") (bf-op "+") (bf-op "-") (bf-op "+") (bf-op "+") (bf-op "<") (bf-op "-") "]") (bf-op ">") (bf-op ".")) |
Once again, all the characters in our source code appear in a node of the parse tree, and these nodes correspond to the names and patterns of the production rules. (The curious can change the source string in "parser-tester.rkt" and see how the parse tree changes, though take care not to use any characters other than the eight handled by our grammar.)
Not bad—with four lines of code, the heavy lifting for the parser is now done. To finish our parser, we need to build one helper function that will take care of whitespace and other characters.