Though it’s essentially true that the parser takes a string of source code as input, we glossed over a detail. It’s more precise to say that the parser takes as input a sequence of tokens. A token is the smallest meaningful chunk of a string of source code. A source string is converted to tokens with a function called a tokenizer that sits between the source string and the parser.
Strictly speaking, a tokenizer is optional. If we don’t use a tokenizer, then every character that appears in our source code counts as a token, and thus also has to appear in our grammar.
For that reason, a tokenizer is often convenient: it reduces the number of distinct tokens we have to handle in our grammar. In this sense, the tokenizer and parser aren’t fully independent. Rather, the tokenizer gives us options for allocating the labor between the two.
For instance, some tasks are more easily handled with a tokenizer (each is sketched in the preview after this list):
Strings in the source code that are meaningless—e.g., those that represent comments—can be removed.
Strings that represent a type of value—e.g., those that represent numbers—can be labeled with a generic token type, like NUMBER. This simplifies the grammar, as we can just use NUMBER to mean any number string, rather than having to make grammar rules that cover every possible number pattern, e.g., 23.8, 3/4, 42+3i.
Strings that should be handled literally—e.g., a single character like < representing an operation—can just pass through.
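As a preview of how these three tasks can play out, here’s a minimal sketch (not tutorial code) using helpers from the brag/support library we’ll meet properly below; the comment delimiter and token names are illustrative:

#lang racket
(require brag/support)

;; A sketch, assuming brag/support: one lexer rule per task above.
(define demo-lexer
  (lexer
   ;; task 1: remove comments (here, ";" to end of line) by
   ;; skipping ahead to the next token
   [(from/to ";" "\n") (demo-lexer input-port)]
   ;; task 2: label any digit string with a generic token type
   [(:+ (char-set "0123456789")) (token 'NUMBER lexeme)]
   ;; task 3: pass a literal operator character straight through
   ["<" lexeme]))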
Recall our zip-code grammar:
zip-code : digit digit digit digit digit
digit : "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
The tokenizer gets a string as input (or a port pointing at a string), so any function that we can use with a string can be used in the tokenizer. Here, we could use a regular expression to match the strings "0" through "9" and convert each of these to the token type DIGIT-TOKEN. We could then rewrite our grammar like this:
zip-code : digit digit digit digit digit
digit : DIGIT-TOKEN
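Such a DIGIT-TOKEN tokenizer might look like the following sketch (the function name is ours; token comes from the brag/support library that appears later in this tutorial):

#lang racket
(require brag/support)

;; A sketch, not tutorial code: use a regular expression to match
;; the next digit in the port, and label it DIGIT-TOKEN.
;; regexp-match on a port returns byte strings, hence the conversion.
(define (next-digit-token port)
  (define m (regexp-match #rx"[0-9]" port))
  (if m
      (token 'DIGIT-TOKEN (bytes->string/utf-8 (car m)))
      eof))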
The parse tree for a zip code like 01234 will look the same:
'(zip-code
  (digit "0")
  (digit "1")
  (digit "2")
  (digit "3")
  (digit "4"))
In that way, the tokenizer can hide details about the source string that the grammar doesn’t need to care about.
What’s the downside of a tokenizer?
Substrings from the source code that are removed (like comments) are then completely invisible to the parser.
Tokens are indivisible. Once we fuse a substring into a token, it can’t be decomposed further by the parser.
For instance, we could put a regular expression inside the tokenizer to match five sequential digits, and then package them into the token type FIVE-DIGIT-TOKEN. Then we could update our grammar to look like this:
zip-code : FIVE-DIGIT-TOKEN
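A tokenizer rule for that token type might look like this sketch (again leaning on brag/support, with := as its exact-repetition operator):

#lang racket
(require brag/support)

;; A sketch: greedily match exactly five digits and fuse them
;; into a single FIVE-DIGIT-TOKEN.
(define five-digit-lexer
  (lexer
   [(:= 5 (char-set "0123456789"))
    (token 'FIVE-DIGIT-TOKEN lexeme)]))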
This isn’t wrong. But our parse tree would now look like this, which might be less detail than we need:
'(zip-code "01234")
Bigger tokens can be convenient, because they reduce the complexity of the grammar. But they also reduce its flexibility. Suppose we wanted to upgrade our zip-code grammar to cover nine-digit zip codes (= five digits, optionally followed by a hyphen and four more digits). We wouldn’t be able to do it with FIVE-DIGIT-TOKEN, because it’s not granular enough. But we could do it with DIGIT-TOKEN:
zip-code : five-digits [("-" four-digits)]
five-digits : digit digit digit digit digit
four-digits : digit digit digit digit
digit : DIGIT-TOKEN
All that said, the division of labor between parser and tokenizer is more art than science. We can use the tokenizer in whatever way makes the most sense for a given language.
Let’s recall our bf grammar:
#lang brag
bf-program : (bf-op | bf-loop)*
bf-op : ">" | "<" | "+" | "-" | "." | ","
bf-loop : "[" (bf-op | bf-loop)* "]"
This captures the structure of any bf program. But it omits one detail about bf source files: any characters aside from the eight used in the grammar, including whitespace, should be ignored.
So our tokenizer for bf will be simple: we’ll pass through the eight meaningful characters intact, and toss out everything else.
We can start assembling the reader for bf. Let’s create a file called "reader.rkt" in the same folder as the existing "parser.rkt". We need to require our "parser.rkt" module (to get access to parse) and add a read-syntax function:
#lang br/quicklang
(require "parser.rkt")

(define (read-syntax path port)
  (define parse-tree (parse path (make-tokenizer port)))
  (define module-datum `(module bf-mod "expander.rkt"
                          ,parse-tree))
  (datum->syntax #f module-datum))
(provide read-syntax)
Here’s the plan:
Just like the read-syntax in stacker—and every read-syntax we’ll ever write—this read-syntax will take as input a source path and input port.
But this time, instead of manually reading strings of code from port, we pass the port to make-tokenizer, which returns a function that reads characters from the port and generates tokens.
In turn, we make these tokens available to parse, which uses our grammar to produce our parse-tree.
As we did in stacker, we create a module-datum representing the code for a module, and put our parse tree inside it.
Finally, we use datum->syntax to package this code as a syntax object.
In stacker, we put the reader and expander in the same source file. This time we’re putting our expander in its own file. Within our module datum, we designate "expander.rkt" as the path to our expander (though we still need to create that file).
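To make the plan concrete, here’s roughly what module-datum would contain for a two-character bf program like +. (a sketch, based on the parse-tree shapes we’ll see shortly):

'(module bf-mod "expander.rkt"
   (bf-program (bf-op "+") (bf-op ".")))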
Next, we add our new make-tokenizer function. We’re passing the input port—which points at the source string—to make-tokenizer. Rather than return one big pile of tokens, make-tokenizer creates and returns a function called next-token that the parser will call repeatedly to retrieve new tokens. This “incremental tokenizing” approach is consistent with the idea of using a port to read from a source file incrementally:
#lang br/quicklang
(require "parser.rkt")

(define (read-syntax path port)
  (define parse-tree (parse path (make-tokenizer port)))
  (define module-datum `(module bf-mod "expander.rkt"
                          ,parse-tree))
  (datum->syntax #f module-datum))
(provide read-syntax)

(define (make-tokenizer port)
  (define (next-token)
    ···)
  next-token)
Finally, the tokenizing rules. This isn’t the moment for us to go deep on tokenizer-rule notation—we’ll save that for a later tutorial. In short, the tokenizer relies on a helper function called a lexer. Each branch of the lexer represents a rule. On the left side of the branch is a pattern that works like a regular expression. On the right side is a token-creating expression. Each time next-token is called, bf-lexer will read as many characters from the port as it can while still matching a rule pattern (aka “greedy” matching). The right side of the rule will convert the matched characters into a token, and this token will be returned as the result.
#lang br/quicklang
(require "parser.rkt")

(define (read-syntax path port)
  (define parse-tree (parse path (make-tokenizer port)))
  (define module-datum `(module bf-mod "expander.rkt"
                          ,parse-tree))
  (datum->syntax #f module-datum))
(provide read-syntax)

(require brag/support)

(define (make-tokenizer port)
  (define (next-token)
    (define bf-lexer
      (lexer
       [(char-set "><-.,+[]") lexeme]
       [any-char (next-token)]))
    (bf-lexer port))
  next-token)
We require brag/support so we can get lexer, which manages the pattern matching.
The first rule uses the lexer helper char-set to match one of our eight special bf characters ><-.,+[]. We pass these through directly with the special lexer variable lexeme (= needlessly obscure word for “that thing we just matched”).
The other rule uses the lexer helper any-char, which matches any other character—we can think of it like an else branch. In bf, these characters should all be ignored. We can accomplish this by calling (next-token) again, which has the effect of skipping to the next available token.
When a port reaches the end of a file, it emits the special eof signal. Thus, a lexer always needs to handle eof. We could write an explicit rule. But we can also just rely on the lexer’s default eof behavior: the lexer will emit an eof token, which in turn will stop the parser. For bf, that suffices.
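If we ever did want to be explicit, the rule might look like this sketch (inside next-token; it just reproduces the default behavior):

(define bf-lexer
  (lexer
   [(char-set "><-.,+[]") lexeme]
   ;; explicit eof rule: emit eof so the parser knows to stop
   ;; (the same thing the lexer does by default)
   [(eof) eof]
   [any-char (next-token)]))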
That’s it. Our tokenizer is done. Therefore, so is our reader. We’ll keep in mind that if we ever want to extend our bf interpreter—for instance, to support new characters with other meanings—we’ll first have to adjust our make-tokenizer function to let them through, and then update our grammar to handle them.
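Before testing the full pipeline, we can poke at the tokenizer by hand with a string port (a hypothetical interaction, not part of the tutorial files):

;; Hypothetical check: only the eight bf characters come through.
(define next-token (make-tokenizer (open-input-string "hi [+] bye")))
(next-token) ; => "["
(next-token) ; => "+"
(next-token) ; => "]"
(next-token) ; => #<eof>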
We already did a quick test of our parser to make sure a snippet of bf source code could be converted into a parse tree. Now that our reader is assembled, let’s make sure we get the same result from a full source file.
We’ll take the code for our demo program and make it into a source file that relies on our new bf reader. As before, we’ll use the #lang reader path bootstrapping command. Save this file as "atsign.rkt" in the same directory as "reader.rkt":
#lang reader "reader.rkt"
Greatest language ever!
++++-+++-++-++[>++++-+++-++-++<-]>.
When we run this file, we’ll get an error like this:
open-input-file: cannot open module file
  module path: #<path:/my/dir/expander.rkt>
  path: /my/dir/expander.rkt
···
That’s a good sign. It means we’re successfully invoking "reader.rkt", which in turn is looking for "expander.rkt", but can’t find it. Because we haven’t made it.
So let’s stub it out. As usual, our expander needs to start with a macro called #%module-begin. As we did in stacker, we’ll avoid a conflict with the #%module-begin already defined by br/quicklang by calling our new macro bf-module-begin, and then using rename-out to change the exported name.
#lang br/quicklang

(define-macro (bf-module-begin PARSE-TREE)
  #'(#%module-begin
     'PARSE-TREE))
(provide (rename-out [bf-module-begin #%module-begin]))
Look carefully—the only work that bf-module-begin is doing is adding a ' prefix to the PARSE-TREE that’s arriving from the reader as our input. We did this in stacker too—it’s a quick way to print and debug the result we’re getting from the reader.
Now when we run "atsign.rkt", we’ll reveal the parse tree:
'(bf-program
  (bf-op "+")
  (bf-op "+")
  (bf-op "+")
  (bf-op "+")
  (bf-op "-")
  (bf-op "+")
  (bf-op "+")
  (bf-op "+")
  (bf-op "-")
  (bf-op "+")
  (bf-op "+")
  (bf-op "-")
  (bf-op "+")
  (bf-op "+")
  (bf-loop
   "["
   (bf-op ">")
   (bf-op "+")
   (bf-op "+")
   (bf-op "+")
   (bf-op "+")
   (bf-op "-")
   (bf-op "+")
   (bf-op "+")
   (bf-op "+")
   (bf-op "-")
   (bf-op "+")
   (bf-op "+")
   (bf-op "-")
   (bf-op "+")
   (bf-op "+")
   (bf-op "<")
   (bf-op "-")
   "]")
  (bf-op ">")
  (bf-op "."))
What are we seeing here? Two things:
We’re seeing the effect of our tokenizer, which discarded the characters in the line Greatest language ever!. They never reached the parser, so they don’t appear in the parse tree.
We’re otherwise seeing the same parse tree that we got before—as we should. Every meaningful character in our source code appears in a node of the parse tree, and these nodes correspond to the names and patterns of the production rules.
While we’re here, let’s see what happens when we run a poorly formed bf program. We know that brackets have to come in pairs. So let’s deliberately add an unmatched left bracket to the end of the program:
#lang reader "reader.rkt"
Greatest language ever!
++++-+++-++-++[>++++-+++-++-++<-]>.[
This time, when we run the program, we won’t see a parse tree—instead we’ll get an error like this:
Encountered parsing error near "[" (token '|[|) while parsing #<path:atsign.rkt> [line=#f, column=#f, offset=#f]
This is the right result. Our parser can’t find a way to match this defective code string to its grammar. Thus, when it reaches the unmatched bracket—"["—it gives up and raises an error.
Before we continue, let’s remove the spurious left bracket from the end of "atsign.rkt", and also the ' prefix from PARSE-TREE, so our three files ("atsign.rkt", "expander.rkt", and "reader.rkt") look like this:
#lang reader "reader.rkt"
Greatest language ever!
++++-+++-++-++[>++++-+++-++-++<-]>.
#lang br/quicklang

(define-macro (bf-module-begin PARSE-TREE)
  #'(#%module-begin
     PARSE-TREE))
(provide (rename-out [bf-module-begin #%module-begin]))
#lang br/quicklang
(require "parser.rkt")

(define (read-syntax path port)
  (define parse-tree (parse path (make-tokenizer port)))
  (define module-datum `(module bf-mod "expander.rkt"
                          ,parse-tree))
  (datum->syntax #f module-datum))
(provide read-syntax)

(require brag/support)

(define (make-tokenizer port)
  (define (next-token)
    (define bf-lexer
      (lexer
       [(char-set "><-.,+[]") lexeme]
       [any-char (next-token)]))
    (bf-lexer port))
  next-token)
Now we can finish the expander.