Recall that a token is the smallest meaningful chunk of a string of source code. Within the reader, a source string is converted to tokens with a helper function called a tokenizer.
We won’t be able to reuse the bf tokenizer outright, but the jsonic tokenizer will follow a similar pattern. Let’s create a new "tokenizer.rkt" module with a shell for the make-tokenizer function, and we’ll step through the individual tokenizing rules:
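Here’s a sketch of that shell. A placeholder any-char rule keeps the module compiling; the real rules, which we’ll develop next, appear in the finished listing later in this section.

#lang br/quicklang
(require brag/support)

(define (make-tokenizer port)
  (define (next-token)
    (define jsonic-lexer
      (lexer
       ;; placeholder rule so the module compiles; the real rules go here
       [any-char (token 'CHAR-TOK lexeme)]))
    (jsonic-lexer port))
  next-token)
(provide make-tokenizer)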
As before, make-tokenizer takes a port as input and returns a function, next-token, that the parser will call to retrieve tokens.
Within next-token, we use a helper function called a lexer to break down the source code into tokens. We first used a lexer in bf. But let’s review how it works:
The lexer must be able to handle everything that might appear in the source, including eof (which signals the end of the input).
The lexer consists of a series of branches, each representing a lexing rule. On the left side of the branch is a pattern that works like a regular expression. On the right is a token-creating expression.
Each time next-token is called, jsonic-lexer will read as many characters from the input port as it can while still matching a rule pattern.
The lexer rule will convert the matched characters (known as the lexeme) into a token using the expression on the right.
This token will be returned as the result. The process repeats until the lexer gets the eof signal.
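To make this behavior concrete, here’s a small throwaway lexer (a demo of our own, not part of jsonic) that we can paste into DrRacket and run:

#lang br/quicklang
(require brag/support)

;; Two branches: a pattern on the left, a token-creating expression on the right.
(define demo-lexer
  (lexer
   [(:+ (char-set "0123456789")) (token 'NUM-TOK lexeme)]
   [any-char (token 'CHAR-TOK lexeme)]))

;; Each call consumes the longest available match and returns one token:
(define p (open-input-string "42x"))
(demo-lexer p) ; a NUM-TOK token whose value is "42"
(demo-lexer p) ; a CHAR-TOK token whose value is "x"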
We import brag/support, which provides the tools we need for the lexer: the basic lexer function; from/to, a matching helper; trim-ends, another helper for processing match results; and token, which creates a token data structure that will cooperate with the parser we’ll make later using #lang brag.
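To get a feel for two of these helpers, here’s roughly what they do at the REPL once brag/support is required (the printed token matches the test output we’ll see at the end of this section):

(require brag/support)

;; trim-ends strips a prefix and a suffix from a string:
(trim-ends "@$" "@$ (+ 6 7) $@" "$@") ; " (+ 6 7) "

;; token packages a name and a value into a token structure for the parser:
(token 'SEXP-TOK " (+ 6 7) ")
; prints something like (token-struct 'SEXP-TOK " (+ 6 7) " #f #f #f #f #f)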
We’ll need three rules in jsonic-lexer. Taking each in turn:
The first rule handles our line comments. Recall that any time we find the // character combination in the source, we want to ignore the rest of the line. So this rule matches everything from // to the next newline, a pattern we can write with the from/to helper. Once we’ve gotten our match, we ignore it by calling (next-token) again, which has the effect of skipping ahead to the next available token.
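In the finished lexer (shown in full below), the comment rule is a single branch. Because from/to includes both delimiters in the match, the newline is consumed along with the comment:

[(from/to "//" "\n") (next-token)]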
Next, we match the embedded Racket expressions that will appear between the @$ and $@ delimiters. This is an instance where it’s helpful to make a “big” token—ultimately, these Racket expressions will just pass through to the expander intact, so we don’t need to tokenize any of them into smaller pieces.
Again, we use a from/to rule to match everything between the @$ and $@ delimiters (including the delimiters themselves). Once we’ve matched the embedded expression—a value that will be held in the special lexer variable lexeme—we use trim-ends to remove the delimiters.
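That gives us the second branch of the lexer, excerpted here from the full listing below:

[(from/to "@$" "$@")
 (token 'SEXP-TOK (trim-ends "@$" lexeme "$@"))]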
Finally, we package this trimmed lexeme into a token structure with the name SEXP-TOK. Named tokens can make a grammar simpler, because we can then refer to tokens within the grammar by name rather than by specific values. By convention, named tokens use CAPS names to distinguish them from names of production rules in the grammar. One notational wrinkle: though we write 'SEXP-TOK here (the ' prefix makes it a symbol rather than a variable), in the grammar we’ll write this token’s name simply as SEXP-TOK.
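As a preview, here’s a hypothetical sketch of how these named tokens might appear in a brag grammar (the rule names are placeholders; we’ll write the real grammar in the next section):

#lang brag
jsonic-program : (jsonic-char | jsonic-sexp)*
jsonic-char : CHAR-TOK
jsonic-sexp : SEXP-TOK

With that in mind, here’s the complete "tokenizer.rkt", including the third and final rule, which we’ll look at next: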
#lang br/quicklang
(require brag/support)

(define (make-tokenizer port)
  (define (next-token)
    (define jsonic-lexer
      (lexer
       [(from/to "//" "\n") (next-token)]
       [(from/to "@$" "$@")
        (token 'SEXP-TOK (trim-ends "@$" lexeme "$@"))]
       [any-char (token 'CHAR-TOK lexeme)]))
    (jsonic-lexer port))
  next-token)
(provide make-tokenizer)
For the last rule, we use the any-char pattern to match everything else. In a lexer, this works like an else branch, handling all characters not processed by earlier rules. We create another named token, this time called CHAR-TOK, and include lexeme as the token value.
When the port reaches the end of the file, it emits the special eof signal. By default, the lexer handles eof automatically by emitting an eof token, which in turn stops the parser. If we like, we can write our own rule that matches eof and performs other actions before returning the eof token. Usually this isn’t necessary, and we can rely on the default behavior. (For an example of an explicit eof rule, see the syntax-coloring tutorial.)
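Purely as a hypothetical sketch (we don’t need one in jsonic), an explicit eof branch might perform a side effect and then return the eof signal itself:

#lang br/quicklang
(require brag/support)

;; hypothetical lexer with an explicit eof rule
(define chatty-lexer
  (lexer
   [any-char (token 'CHAR-TOK lexeme)]
   ;; do something extra, then return the eof signal ourselves
   [(eof) (begin (displayln "reached the end of the input") eof)]))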
It’s not a substitute for rigorous unit tests—we’ll get to those later—but let’s quickly check that our tokenizer gives us the results we expect.
We can use the REPL for "tokenizer.rkt" to test our tokenizer on sample strings. First, click Run to refresh make-tokenizer. Then we can enter some test expressions on the REPL using the helper function apply-tokenizer-maker.
If we pass a comment to our tokenizer, it should result in no tokens:
(apply-tokenizer-maker make-tokenizer "// comment\n")

'()
A Racket expression between delimiters should result in a corresponding SEXP-TOK token:
(apply-tokenizer-maker make-tokenizer "@$ (+ 6 7) $@")

(list (token-struct 'SEXP-TOK " (+ 6 7) " #f #f #f #f #f))
And any other string should become a list of named CHAR-TOK tokens:
(apply-tokenizer-maker make-tokenizer "hi")

(list
 (token-struct 'CHAR-TOK "h" #f #f #f #f #f)
 (token-struct 'CHAR-TOK "i" #f #f #f #f #f))
The trailing #f values are placeholders in the token data structure for source location fields that we’re not collecting now. But we’ll start using those fields in the next tutorial.
Now that we have our make-tokenizer function, we can make our parser. As we’ll find, the parser will be very simple thanks to the extra work we did here.
More broadly, when implementing a language, we’ll often have a choice about where to handle certain tasks: in the tokenizer, the parser, or the expander. It’s more art than science. Though working with tokenizer rules is sometimes a chore, they can substantially simplify the rest of the implementation.