Recall that a token is the smallest meaningful chunk of a string of source code. Within the reader, a source string is converted to tokens with a helper function called a tokenizer.
We won’t be able to reuse the bf tokenizer as is. But the jsonic tokenizer will follow a similar pattern. Let’s create a new "tokenizer.rkt" module with a shell for the make-tokenizer function, and then we’ll step through the individual tokenizing rules.
As before, make-tokenizer takes a port as input and returns a function, next-token, that the parser will call to retrieve tokens.
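To make that calling pattern concrete, here is a minimal sketch in plain Racket. It is not the jsonic tokenizer: make-char-tokenizer and collect-tokens are hypothetical names invented for this illustration, and each "token" is just a single character. The point is only that the returned function closes over the port, so a caller (standing in for the parser) can keep calling it until eof comes back.

```racket
#lang racket/base

;; Hypothetical stand-in for a tokenizer maker: takes a port and
;; returns a zero-argument function that yields one "token" per call.
(define (make-char-tokenizer port)
  ;; next-token closes over port, so each call resumes where the last one stopped
  (define (next-token)
    (read-char port)) ; read-char returns eof once the port is exhausted
  next-token)

;; Hypothetical stand-in for the parser's side of the protocol:
;; keep asking for tokens until eof arrives.
(define (collect-tokens next-token)
  (let loop ([acc '()])
    (define tok (next-token))
    (if (eof-object? tok)
        (reverse acc)
        (loop (cons tok acc)))))

(collect-tokens (make-char-tokenizer (open-input-string "hi"))) ; => '(#\h #\i)
```

The real make-tokenizer has the same outer shape. Only the body of next-token differs, because it delegates to a lexer instead of read-char.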
Within next-token, we use a helper function called a lexer to break down the source code into tokens. We first used a lexer in bf. But let’s review how it works:
The lexer must be able to process every token that might appear in the source, including eof (which signals the end).
The lexer consists of a series of branches, each representing a lexing rule. On the left side of the branch is a pattern that works like a regular expression. On the right is a token-creating expression.
Each time next-token is called, jsonic-lexer will read as many characters from the input port as it can while still matching a rule pattern.
The lexer rule will convert the matched characters (known as the lexeme) into a token using the expression on the right.
This token will be returned as the result. The process repeats until the lexer gets the eof signal.
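The "as many characters as it can" behavior can be seen in a small experiment. This is a sketch only: demo-lexer, make-demo-tokenizer, and the token names AB-TOK and CHAR-TOK are hypothetical and not part of jsonic. It assumes the same brag/support tools the tutorial uses, plus the token-struct-type and token-struct-val accessors from that module.

```racket
#lang br/quicklang
(require brag/support)

;; Hypothetical lexer for illustration: one rule for the two-character
;; string "ab", one fallback rule for any single character.
(define demo-lexer
  (lexer
   [(eof) eof]
   ["ab" (token 'AB-TOK lexeme)]
   [any-char (token 'CHAR-TOK lexeme)]))

;; Same shape as make-tokenizer: take a port, return a token-fetching function.
(define (make-demo-tokenizer port)
  (define (next-token) (demo-lexer port))
  next-token)

;; Reduce each token to its name and value so the result is easy to read.
(define demo-pairs
  (map (lambda (tok) (list (token-struct-type tok) (token-struct-val tok)))
       (apply-tokenizer-maker make-demo-tokenizer "abc")))

demo-pairs ; => '((AB-TOK "ab") (CHAR-TOK "c"))
```

At the start of "abc", both the "ab" rule and the any-char rule could match, but the "ab" rule consumes more characters, so it wins. Only the leftover "c" falls through to any-char.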
We import brag/support, which provides tools we need for the lexer: the basic lexer function; from/to, which is a matching helper; trim-ends, another helper function to process match results; and token, a data structure that will cooperate with the parser we’ll make later, using #lang brag.
We’ll need four rules in jsonic-lexer. Taking each in turn:
When port reaches the end of the file, it emits the special eof signal. Thus, we always need a rule that handles eof. We put that in the first rule, returning an eof token, which stops the parser.
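As a rough sketch of the tokenizer at this stage, here is a version with the eof rule and nothing else. It is inferred from the fuller listings later in this section rather than copied from them, and first-result is a hypothetical name used only for the check at the end.

```racket
#lang br/quicklang
(require brag/support)

(define (make-tokenizer port)
  (define (next-token)
    (define jsonic-lexer
      (lexer
       [(eof) eof])) ; the only rule so far: pass the eof signal through
    (jsonic-lexer port))
  next-token)

;; Calling next-token once on an empty source yields eof immediately.
(define first-result ((make-tokenizer (open-input-string ""))))
(eof-object? first-result) ; => #t
```

With only this rule in place, any non-empty source has no matching rule, so we would expect the lexer to raise an error. The remaining rules fix that.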
This next rule handles our line comments. Recall that any time we find the // character combination in the source, we want to ignore the rest of the line. So this rule matches everything from // to the next newline, inclusive. We build the pattern with the from/to matching helper. Once we’ve gotten our match, we ignore it by calling (next-token) again, which has the effect of skipping to the next available token.
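A sketch of the tokenizer with just the eof and comment rules shows the skipping in action. The listing is inferred from the fuller versions later in this section, and comment-tokens is a hypothetical name for the test result.

```racket
#lang br/quicklang
(require brag/support)

(define (make-tokenizer port)
  (define (next-token)
    (define jsonic-lexer
      (lexer
       [(eof) eof]
       ;; match a whole line comment, then discard it by asking for
       ;; whatever token comes next
       [(from/to "//" "\n") (next-token)]))
    (jsonic-lexer port))
  next-token)

;; Two comment lines in a row: each one is skipped, then eof is reached.
(define comment-tokens
  (apply-tokenizer-maker make-tokenizer "// one\n// two\n"))

comment-tokens ; => '()
```

Because the rule's right-hand side calls (next-token) rather than returning a token, consecutive comments chain together: each recursive call either skips another comment or returns the first real result.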
    #lang br/quicklang
    (require brag/support)

    (define (make-tokenizer port)
      (define (next-token)
        (define jsonic-lexer
          (lexer
           [(eof) eof]
           [(from/to "//" "\n") (next-token)]
           [(from/to "@$" "$@")
            (token 'SEXP-TOK (trim-ends "@$" lexeme "$@"))]
           ···))
        (jsonic-lexer port))
      next-token)
    (provide make-tokenizer)
Next, we match the embedded Racket expressions that will appear between the @$ and $@ delimiters. This is an instance where it’s helpful to make a “big” token—ultimately, these Racket expressions will just pass through to the expander intact, so we don’t need to tokenize any of them into smaller pieces.
Again, we use a from/to rule to match everything between the @$ and $@ delimiters (including the delimiters themselves). Once we’ve matched the embedded expression—a value that will be held in the special lexer variable lexeme—we use trim-ends to remove the delimiters.
Finally, we package this trimmed lexeme into a token structure with the name SEXP-TOK. Named tokens can make a grammar simpler, because we can then refer to tokens within the grammar by name rather than by specific values. By convention, named tokens use CAPS names to distinguish them from names of production rules in the grammar. One notational wrinkle: though we write 'SEXP-TOK here (the ' prefix makes it a symbol rather than a variable), in the grammar we’ll write this token’s name simply as SEXP-TOK.
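The right-hand side of this rule can be tried on its own, outside the lexer. In this sketch, matched is a hypothetical stand-in for the value that lexeme would hold after a successful match, and the token-struct-type and token-struct-val accessors from brag/support are used to look inside the resulting token.

```racket
#lang br/quicklang
(require brag/support)

;; Stand-in for lexeme: the full match, delimiters included.
(define matched "@$ (+ 6 7) $@")

;; trim-ends removes the left and right delimiters, leaving the rest intact.
(define trimmed (trim-ends "@$" matched "$@"))
trimmed ; => " (+ 6 7) "

;; Package the trimmed string as a named token.
(define sexp-token (token 'SEXP-TOK trimmed))
(token-struct-type sexp-token) ; => 'SEXP-TOK
(token-struct-val sexp-token)  ; => " (+ 6 7) "
```

Note that the spaces just inside the delimiters survive, since trim-ends removes only the delimiter strings themselves.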
    #lang br/quicklang
    (require brag/support)

    (define (make-tokenizer port)
      (define (next-token)
        (define jsonic-lexer
          (lexer
           [(eof) eof]
           [(from/to "//" "\n") (next-token)]
           [(from/to "@$" "$@")
            (token 'SEXP-TOK (trim-ends "@$" lexeme "$@"))]
           [any-char (token 'CHAR-TOK lexeme)]))
        (jsonic-lexer port))
      next-token)
    (provide make-tokenizer)
Finally, we use an any-char rule to match everything else. In a lexer, this works like an else branch, handling all characters not processed by earlier rules. We create another named token, this time called CHAR-TOK, and include lexeme as the token value.
We can use the REPL for "tokenizer.rkt" to test our tokenizer on sample strings. First, click Run to refresh make-tokenizer. Then we can enter some test expressions on the REPL using the helper function apply-tokenizer-maker.
If we pass a comment to our tokenizer, it should result in no tokens:
(apply-tokenizer-maker make-tokenizer "// comment\n")
A Racket expression between delimiters should result in a corresponding SEXP-TOK token:
(apply-tokenizer-maker make-tokenizer "@$ (+ 6 7) $@")
And any other string should become a list of named CHAR-TOK tokens:
(apply-tokenizer-maker make-tokenizer "hi")
The trailing #f values are placeholders in the token data structure for fields that we’re not collecting now. But we’ll start using those fields in the next tutorial.
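The three checks above can also be written as a small script. This sketch repeats the finished tokenizer so that it runs on its own. The summarize helper is hypothetical, introduced here for illustration: it keeps only each token's name and value (via the token-struct-type and token-struct-val accessors from brag/support) and leaves out the remaining placeholder fields.

```racket
#lang br/quicklang
(require brag/support)

(define (make-tokenizer port)
  (define (next-token)
    (define jsonic-lexer
      (lexer
       [(eof) eof]
       [(from/to "//" "\n") (next-token)]
       [(from/to "@$" "$@")
        (token 'SEXP-TOK (trim-ends "@$" lexeme "$@"))]
       [any-char (token 'CHAR-TOK lexeme)]))
    (jsonic-lexer port))
  next-token)

;; Hypothetical helper: tokenize str and reduce each token
;; to a (name value) pair.
(define (summarize str)
  (map (lambda (tok) (list (token-struct-type tok) (token-struct-val tok)))
       (apply-tokenizer-maker make-tokenizer str)))

(summarize "// comment\n")    ; => '()
(summarize "@$ (+ 6 7) $@")   ; => '((SEXP-TOK " (+ 6 7) "))
(summarize "hi")              ; => '((CHAR-TOK "h") (CHAR-TOK "i"))
```

Comparing name and value pairs, rather than whole token structures, keeps checks like these stable if the other token fields start carrying data later.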
Now that we have our make-tokenizer function, we can make our parser. As we’ll find, the parser will be very simple thanks to the extra work we did here.
More broadly, when implementing a language, we’ll often have a choice about where to handle certain tasks: in the tokenizer, or in the parser, or in the expander. It’s more art than science. Though working with tokenizer rules is sometimes a chore, they can substantially simplify the rest of the implementation.