In previous tokenizer modules—for instance, those for bf and jsonic—we made one function, make-tokenizer. It returned a function called next-token that successively retrieved tokens from an input port using a character-matching function called a lexer.
Since then, we’ve learned that a lexer can also be useful for syntax coloring. So this time, we’ll separate our lexer from our tokenizer. That way, we can reuse it later for the syntax colorer.
Rather than start with "basic/tokenizer.rkt", we create a new "basic/lexer.rkt" module:
Previously we used lexer to set up our matching rules. That worked fine, but when we wanted to add source locations, we found ourselves doing a lot of housekeeping—we had to manually fill in the source-location fields for the tokens generated on the right-hand side.
This time, we switch to lexer-srcloc, a variant of lexer that automatically gathers source-location information and attaches it to our tokens. We get the same source locations as before, but with zero manual labor.
```racket
#lang br
(require brag/support)

(define basic-lexer
  (lexer-srcloc
   ···))

(provide basic-lexer)
```
To make this happen, lexer-srcloc returns instances of a structure called srcloc-token. This structure holds two values: a token (that is, whatever we put on the right-hand side of the rule, which will usually be a string or an instance of the token structure) and a source location (which is itself an instance of the srcloc structure).
But we don’t need to get bogged down in the details. lexer-srcloc will automatically create the correct srcloc and srcloc-token structures. In turn, any parser made with #lang brag can take these srcloc-token structures as input. So the output from lexer-srcloc can be passed to our parser without any extra housekeeping.
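For orientation, here’s the shape of a single result from the finished lexer; it’s the same shape we’ll match against in the tests at the end of this section:

```racket
; What lexer-srcloc produces for the input "goto": a srcloc-token pairing
; the token with a srcloc recording source, line, column, position, and span.
(srcloc-token (token "goto" "goto")
              (srcloc 'string 1 0 1 4))
```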
By the way, since a lexer takes an input port as an argument, why aren’t we declaring this function as (basic-lexer port)? If we look at our previous examples, for instance the jsonic tokenizer, we also didn’t declare the lexer with an argument. The reason is that both lexer and lexer-srcloc return a lambda that takes one argument, so we omit the argument from the left-hand side of the declaration.
For the same reason, these two declarations are the same:
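```racket
; Two equivalent ways of writing the same function (the function body here
; is just an arbitrary example; use one form or the other, not both):
(define (f x)       ; argument declared on the left-hand side
  (* x x))

(define f           ; right-hand side is a lambda, so no argument on the left
  (lambda (x)
    (* x x)))
```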
In the second case, because the right-hand side is a lambda expression, we don’t declare the x argument on the left. In fact, the lambda notation on the bottom is the “real” way functions are handled in Racket—the top notation is a convenience that’s transformed into the bottom notation. (For more about lambda notation, see functions.)
If we wanted, we could declare basic-lexer like so:
```racket
(define (basic-lexer port)
  (define the-lexer
    (lexer-srcloc
     ···))
  (the-lexer port))
```
But that’s just a wordier way of saying the same thing.
First, we handle whitespace. Most whitespace in BASIC is ignored. But newlines have special status, because each line of the program has to be on a separate line of the file. This program is valid:
```
10 print "hello world"
20 goto 10
```
But this version, with the newline removed, is not:
```
10 print "hello world" 20 goto 10
```
When we write the parser, we’ll need to keep these two cases separate. We can only do that if we preserve the newline as a token. So our first whitespace-lexing rule will match the "\n" character, and put it inside a token structure named NEWLINE:
But all the other whitespace, we can ignore. In previous lexers, we ignored whitespace or comments by recursively calling the lexer to fetch the next token. This shortcut doesn’t work with lexer-srcloc, because each recursive call also recursively wraps more layers of source-location information. That’s not what we want.
Let’s learn an alternate technique:
```racket
#lang br
(require brag/support)

(define basic-lexer
  (lexer-srcloc
   ["\n" (token 'NEWLINE lexeme)]
   [whitespace (token lexeme #:skip? #t)]))

(provide basic-lexer)
```
Rather than ignoring certain source characters within the lexer, we can mark the tokens as “skippable” so that the parser ignores them. We do this by wrapping them in a token structure as usual, but we make the token skippable by passing the #:skip? keyword argument with a #t value. The parser will ignore this token, as if it had never been there in the first place. Problem solved.
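For example, a lone space still comes back as a token, just one the parser will pass over (this is exactly what one of our tests at the end will check):

```racket
; Lexing " " produces a srcloc-token whose inner token carries #:skip? #t:
(srcloc-token (token " " #:skip? #t)
              (srcloc 'string 1 0 1 1))
```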
A rem at the beginning of a line marks a line comment:
```racket
#lang br
(require brag/support)

(define basic-lexer
  (lexer-srcloc
   ["\n" (token 'NEWLINE lexeme)]
   [whitespace (token lexeme #:skip? #t)]
   [(from/stop-before "rem" "\n") (token 'REM lexeme)]))

(provide basic-lexer)
```
Recall from specification and setup that a rem doesn’t remove the line entirely from the program—it can still be the target of a goto statement. So we only want to soak up the characters on the right side of the rem statement (not the line number).
Furthermore, though the "\n" marks the end of the rem, we don’t want to absorb the "\n" itself (because of its semantic significance in separating lines). So we use from/stop-before to capture these characters and put them in a REM token. Unlike from/to, which captures the left and right boundary characters, from/stop-before captures the left boundary character but not the right (hence “stop before”).
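So a commented line lexes into two tokens: the REM token, which stops just before the newline, and then the NEWLINE token itself (mirrored in our tests later):

```racket
; Lexing "rem ignored\n": the REM token absorbs "rem ignored" but not the
; "\n", which survives as its own NEWLINE token.
(list (srcloc-token (token 'REM "rem ignored")
                    (srcloc 'string 1 0 1 11))
      (srcloc-token (token 'NEWLINE "\n")
                    (srcloc 'string 1 11 12 1)))
```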
Then we add our other supported statements—print, goto, and end—our only supported numerical operation (+), the separator between multiple print arguments (;), and the separator between multiple statements (:):
```racket
#lang br
(require brag/support)

(define basic-lexer
  (lexer-srcloc
   ["\n" (token 'NEWLINE lexeme)]
   [whitespace (token lexeme #:skip? #t)]
   [(from/stop-before "rem" "\n") (token 'REM lexeme)]
   [(:or "print" "goto" "end" "+" ":" ";")
    (token lexeme lexeme)]))

(provide basic-lexer)
```
:or is a lexer function that matches one of a listed set of possibilities.
As the return value, we make a token structure, passing lexeme in both the type and value fields. Strictly speaking, we don’t need to return a token structure for this rule. We could return the raw lexeme value, because it’ll be matched literally in the parser. But when we get to the syntax colorer, we’ll find it more convenient to have the lexer results consistently packaged into token structures.
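In other words, either of these rule fragments would satisfy the parser; we prefer the second for consistency:

```racket
; Returning the raw lexeme works, because the parser matches it literally:
[(:or "print" "goto" "end" "+" ":" ";") lexeme]

; But wrapping it in a token keeps every rule's result uniformly packaged:
[(:or "print" "goto" "end" "+" ":" ";") (token lexeme lexeme)]
```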
Finally we need to put in our data types: integers, decimals, and strings.
Both integers and decimals need to match a string of digits. To simplify our notation, we use define-lex-abbrev to make a little named subpattern that we can use in other rules. Our abbreviation digits will match any sequence of digits (where char-set means “match any character in this string” and :+ still means “one or more”):
```racket
#lang br
(require brag/support)

(define-lex-abbrev digits (:+ (char-set "0123456789")))

(define basic-lexer
  (lexer-srcloc
   ["\n" (token 'NEWLINE lexeme)]
   [whitespace (token lexeme #:skip? #t)]
   [(from/stop-before "rem" "\n") (token 'REM lexeme)]
   [(:or "print" "goto" "end" "+" ":" ";")
    (token lexeme lexeme)]))

(provide basic-lexer)
```
Once we have digits, we can easily write rules to match integers and decimals:
```racket
#lang br
(require brag/support)

(define-lex-abbrev digits (:+ (char-set "0123456789")))

(define basic-lexer
  (lexer-srcloc
   ["\n" (token 'NEWLINE lexeme)]
   [whitespace (token lexeme #:skip? #t)]
   [(from/stop-before "rem" "\n") (token 'REM lexeme)]
   [(:or "print" "goto" "end" "+" ":" ";")
    (token lexeme lexeme)]
   [digits (token 'INTEGER (string->number lexeme))]
   [(:or (:seq (:? digits) "." digits)
         (:seq digits "."))
    (token 'DECIMAL (string->number lexeme))]))

(provide basic-lexer)
```
An integer is just digits. We label the token with the name 'INTEGER, and use string->number to pass through the numeric value rather than its string representation.
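A quick check at the REPL shows the conversion:

```racket
; string->number turns the matched text into an actual numeric value:
(string->number "12")   ; ⇒ 12
(string->number "1.2")  ; ⇒ 1.2
```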
The decimal rule is a little trickier, because we want to capture three valid possibilities:
```
.456
123.456
123.
```
We can use :seq, a lexer function that lets us match a series of patterns in a row, to express these possibilities as patterns, and then use :or to choose among them:
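```racket
; One pattern per case, sketched directly from the three examples above:
(:or (:seq "." digits)          ; .456
     (:seq digits "." digits)   ; 123.456
     (:seq digits "."))         ; 123.
```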
But we can combine the first and second cases into a single pattern by using :? to make the first digits subpattern optional (again, just like ? in a regular expression).
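Combined, the first two cases collapse into one, leaving us with:

```racket
(:or (:seq (:? digits) "." digits)  ; covers both .456 and 123.456
     (:seq digits "."))             ; 123.
```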
And that becomes the pattern for our rule. Though we shouldn’t get overzealous about streamlining the pattern, because this is wrong:
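```racket
; Tempting, but wrong — with both digits subpatterns optional,
; this would also match a lone "." with no digits at all:
(:seq (:? digits) "." (:? digits))
```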
This rule is too broad, because it would also match a naked decimal point. A decimal number always needs digits on one side or the other. So we need two pieces to our pattern.
To finish, we label the token with the name 'DECIMAL, and again use string->number to pass through its numeric value.
Last, we have strings:
```racket
#lang br
(require brag/support)

(define-lex-abbrev digits (:+ (char-set "0123456789")))

(define basic-lexer
  (lexer-srcloc
   ["\n" (token 'NEWLINE lexeme)]
   [whitespace (token lexeme #:skip? #t)]
   [(from/stop-before "rem" "\n") (token 'REM lexeme)]
   [(:or "print" "goto" "end" "+" ":" ";")
    (token lexeme lexeme)]
   [digits (token 'INTEGER (string->number lexeme))]
   [(:or (:seq (:? digits) "." digits)
         (:seq digits "."))
    (token 'DECIMAL (string->number lexeme))]
   [(:or (from/to "\"" "\"") (from/to "'" "'"))
    (token 'STRING
           (substring lexeme
                      1 (sub1 (string-length lexeme))))]))

(provide basic-lexer)
```
For these, we can use from/to to match anything delimited by either single or double quotes. When we make the token, we don’t need the quote marks, so we use substring to drop the first and last characters of the matched lexeme. Then we package this into a token called 'STRING.
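For instance, applied to a quoted lexeme, the substring call peels off the delimiters:

```racket
; Drop the first and last characters (the quote marks) from the lexeme:
(substring "\"foo\"" 1 (sub1 (string-length "\"foo\"")))  ; ⇒ "foo"
```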
And that’s our finished lexer.
Well, almost finished. No function is really done until it’s thoroughly tested. We could put tests inside the "basic/lexer.rkt" module using a test submodule, as we’ve done before. But here’s how we could make a set of tests for the lexer in a separate file:
```racket
#lang br
(require "lexer.rkt" brag/support rackunit)

(define (lex str)
  (apply-port-proc basic-lexer str))

···
```
First, we define lex, which is a helper function that will take a string and lex it into a list of tokens using basic-lexer and the helper function apply-port-proc, which comes from brag/support. apply-port-proc applies a function to an input port (or string) repeatedly until it reaches the end, and then returns the results in a list.
Then we use testing functions from rackunit like check-equal? and check-exn to compare the results of lex with lists of srcloc-token structures. Here are some simple sample tests that cover every lexing rule we wrote for basic-lexer (but feel free to add others):
```racket
#lang br
(require "lexer.rkt" brag/support rackunit)

(define (lex str)
  (apply-port-proc basic-lexer str))

(check-equal? (lex "") empty)

(check-equal? (lex " ")
              (list (srcloc-token (token " " #:skip? #t)
                                  (srcloc 'string 1 0 1 1))))

(check-equal? (lex "rem ignored\n")
              (list (srcloc-token (token 'REM "rem ignored")
                                  (srcloc 'string 1 0 1 11))
                    (srcloc-token (token 'NEWLINE "\n")
                                  (srcloc 'string 1 11 12 1))))

(check-equal? (lex "print")
              (list (srcloc-token (token "print" "print")
                                  (srcloc 'string 1 0 1 5))))

(check-equal? (lex "goto")
              (list (srcloc-token (token "goto" "goto")
                                  (srcloc 'string 1 0 1 4))))

(check-equal? (lex "end")
              (list (srcloc-token (token "end" "end")
                                  (srcloc 'string 1 0 1 3))))

(check-equal? (lex "+")
              (list (srcloc-token (token "+" "+")
                                  (srcloc 'string 1 0 1 1))))

(check-equal? (lex ";")
              (list (srcloc-token (token ";" ";")
                                  (srcloc 'string 1 0 1 1))))

(check-equal? (lex ":")
              (list (srcloc-token (token ":" ":")
                                  (srcloc 'string 1 0 1 1))))

(check-equal? (lex "12")
              (list (srcloc-token (token 'INTEGER 12)
                                  (srcloc 'string 1 0 1 2))))

(check-equal? (lex "1.2")
              (list (srcloc-token (token 'DECIMAL 1.2)
                                  (srcloc 'string 1 0 1 3))))

(check-equal? (lex "12.")
              (list (srcloc-token (token 'DECIMAL 12.)
                                  (srcloc 'string 1 0 1 3))))

(check-equal? (lex ".12")
              (list (srcloc-token (token 'DECIMAL .12)
                                  (srcloc 'string 1 0 1 3))))

(check-equal? (lex "\"foo\"")
              (list (srcloc-token (token 'STRING "foo")
                                  (srcloc 'string 1 0 1 5))))

(check-equal? (lex "'foo'")
              (list (srcloc-token (token 'STRING "foo")
                                  (srcloc 'string 1 0 1 5))))

(check-exn exn:fail:read?
           (lambda () (lex "x")))
```