Admittedly, we took a few shortcuts in stacker. To be fair, we had a lot of ground to cover. We had to learn about the two essential components of a Racket language—the reader and the expander. We had to learn some basic concepts of the Racket language itself like expressions and macros. In funstacker we filled in some details about functional programming.
But with that under our belt, we can migrate toward a more realistic technique that can be adapted to other language projects. In this tutorial, we’ll lay the foundation for this new technique. In the rest of the tutorials, we’ll refine and adapt it.
Here, we’ll implement a real (but still simple) programming language called bf. We’ll organize this tutorial around a new idea—a grammar, which lets us specify the structure of a language. We’ll also add two new pieces to our reader: a parser and a tokenizer, which will help us convert our bf source code into a structured representation of a program.
bf was created as an exercise in programming-language minimalism. + bf is a common abbreviation of its full name, which rhymes with drainsuck. The bf language uses only eight characters, which are nevertheless sufficient to make it Turing complete. For instance, here’s the traditional “Hello World” program in bf:
1 2 3 4 5 | ++++++[>++++++++++++<-]>. >++++++++++[>++++++++++<-]>+. +++++++..+++.>++++[>+++++++++++<-]>. <+++[>----<-]>.<<<<<+++[>+++++<-]>. >>.+++.------.--------.>>+. |
1 | Hello, World! |
And here’s a factorial generator:
1 2 3 4 5 6 | >++++++++++>>>+>+[>>>+[-[<<<<<[+<<<<<]>>[[-]>[<<+>+>-] <[>+<-]<[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<-[>+<- [>[-]>>>>+>+<<<<<<-[>+<-]]]]]]]]]]]>[<+>-]+>>>>>]<<<<< [<<<<<]>>>>>>>[>>>>>]++[-<<<<<]>>>>>>-]+>>>>>]<[>++<-] <<<<[<[>+<-]<<<<]>>[->[-]++++++[<++++++++>-]>>>>]<<<<< [<[>+>+<<-]>.<<<<<]>.>>>>] |
1 2 3 4 5 6 7 8 9 | 1 1 2 6 24 120 720 5040 ··· |
We can try these (and other) bf programs in DrRacket by using #lang bf-demo. This is the implementation that we’ll build from scratch in this tutorial:
1 2 3 4 5 6 | #lang bf-demo ++++++[>++++++++++++<-]>. >++++++++++[>++++++++++<-]>+. +++++++..+++.>++++[>+++++++++++<-]>. <+++[>----<-]>.<<<<<+++[>+++++<-]>. >>.+++.------.--------.>>+. |
1 | Hello, World! |
Is bf good for anything? In the sense of professional advancement, no. But even though bf code looks cryptic, its underlying structure is simple. Therefore, it’s a good way to learn how to translate the specification of a language into an implementation. (Writing programs in bf will remain hard, however.)
When a bf program starts, it creates an array of bytes in memory (by convention 30,000 bytes long, each byte initialized to 0) and a pointer into that array (initialized to the 0 position). The current byte is the byte in the array at the location indicated by the pointer.
Then it runs the code of the bf program, which consists of six operations:
> Increase the pointer position by one
< Decrease the pointer position by one
+ Increase the value of the current byte by one
- Decrease the value of the current byte by one
. Write the current byte to stdout
, Read a byte from stdin and store it in the current byte (overwriting the existing value)
bf also has a looping construct [···] that will repeat the code within the brackets until the current byte is zero. If the current byte is already zero, the loop will not run. Loops can be recursively nested within other loops.
Any other characters in the source code, including whitespace, are ignored.
Now intellectually armed, we can puzzle out what this bf program will do:
1 2 | Greatest language ever! ++++-+++-++-++[>++++-+++-++-++<-]>. |
We picture our array of bytes and our pointer at 0 (the first byte).
Greatest language ever! consists of characters that are ignored.
++++-+++-++-++ is 11 increments and 3 decrements of the current (first) byte, meaning we end up with a value of 8 in the current byte. (The + and - operations could be in any order, and the result would be the same.)
[ A loop starts. The current byte isn’t 0, so we go inside the loop.
> The pointer moves to the second byte and, with the same
++++-+++-++-++ pattern, leaves a value of 8 there as well.
< We move the pointer back to the first position and
- decrement the value of that byte from 8 to 7.
] At the end of the loop, the current byte is 7, not 0, so we do it again. The value in the first byte thus serves as a loop counter. The loop will therefore run a total of 8 times, leaving a value of 0 in the first byte and 64 in the second byte. After that:
> We move back to the second byte.
. We print the second byte. The character with value 64 is:
1 | @ |
Yes, that’s a lot of effort to print an @ sign. (We can run this program with #lang bf-demo to verify that it does.) Though bf may be inefficient, it’s not illogical. We might now recognize that this looping technique is the basis of the “Hello World” program we saw before:
1 2 3 4 5 | ++++++[>++++++++++++<-]>. >++++++++++[>++++++++++<-]>+. +++++++..+++.>++++[>+++++++++++<-]>. <+++[>----<-]>.<<<<<+++[>+++++<-]>. >>.+++.------.--------.>>+. |
Don’t spend too long penciling it out, however—that’s why we’re making a bf interpreter.
Recall that every language in Racket must provide two functions:
A reader, which converts the source file from a string of characters into parenthesized S-expressions. The reader starts with a function called read-syntax.
An expander, which determines how these S-expressions correspond to Racket code, which is then evaluated to produce a result. The expander starts with a macro called #%module-begin.
As it turns out, a lot is riding on the S-expressions produced by the reader. Why? Because they become the input to the expander, and thereby influence the rest of the implementation. If our reader produces clean, well-structured S-expressions, writing our expander will be easier. If not, then it won’t.
In stacker, we designed the language so that each line of the source became one S-expression. That was simple, but not realistic. Most programming languages have a more intricate structure. A bf program, for instance, might be a flat list of operations. But it also might have loops nested to any depth. We won’t be able to rely on the same shortcut.
“So how do we convert bf code into S-expressions? It doesn’t look anything like Racket.” There’s a secret weapon we can use to convert any well-specified programming language—no matter how weird—into S-expressions. That tool is a grammar.