As we remember from the second jsonic tutorial, a syntax colorer is a lexer called by DrRacket that reads in source code, matches the code to a list of rules, and then returns the matched strings with syntax-coloring annotations.
Each annotation contains five values: the string to be colored, a coloring category, a parenthesis shape, and the starting and ending positions of the coloring. The valid coloring categories are 'error, 'comment, 'sexp-comment, 'white-space, 'constant, 'string, 'no-color, 'parenthesis, 'hash-colon-keyword, 'symbol, 'eof, or 'other. The valid parenthesis shapes are the symbols for the characters ()[]{}, or #f if the token isn't a parenthesis.
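For instance, here's a hypothetical sketch (not code from our colorer) of the five values that would be returned as an annotation for a string token occupying positions 5 through 12:

```racket
; Hypothetical sketch of a single coloring annotation,
; returned as five values.
(values "\"hello\"" ; the matched string
        'string     ; coloring category
        #f          ; parenthesis shape (none here)
        5           ; start position
        12)         ; end position
```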
Right now, our coloring for our sample program looks like this, using the default Racket-language colorer and DrRacket’s default color scheme:
Not everything is wrong: the strings, numbers, and identifiers look right, because they happen to be written the same way in Racket. But the rem comments on lines 30 and 70 aren't colored as comments. The value 'three' on line 60 isn't colored as a string. And on line 10, the part of the line after the ; is formatted as a comment (because it would be a comment if this were Racket code), but in BASIC it just represents extra arguments to print, which should be colored like any other values.
In jsonic, we wrote a separate lexer for the syntax colorer. This time, our colorer will rely on the lexer we already wrote for the main language interpreter:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | #lang br (require brag/support) (define-lex-abbrev digits (:+ (char-set "0123456789"))) (define basic-lexer (lexer-srcloc ["\n" (token 'NEWLINE lexeme)] [whitespace (token lexeme #:skip? #t)] [(from/stop-before "rem" "\n") (token 'REM lexeme)] [(:or "print" "goto" "end" "+" ":" ";") (token lexeme lexeme)] [digits (token 'INTEGER (string->number lexeme))] [(:or (:seq (:? digits) "." digits) (:seq digits ".")) (token 'DECIMAL (string->number lexeme))] [(:or (from/to "\"" "\"") (from/to "'" "'")) (token 'STRING (substring lexeme 1 (sub1 (string-length lexeme))))])) (provide basic-lexer) |
This is a convenient idea, because later, if we update this main lexer, our colorer will automatically pick up the changes.
Here’s the plan. We’ll read tokens from the lexer, each of which will be an instance of a srcloc-token structure. We’ll return one syntax-coloring annotation for each token. To make each annotation, we’ll read values from inside the token and use them to figure out what coloring category applies and what parenthesis shape we should use (if any). We’ll also read the necessary source-location values from inside the token, and put those into the coloring annotation.
There’s one additional wrinkle. When the lexer gets a character that it can’t match to any lexing rule, it raises an error. When we’re trying to run a program, that’s the right behavior, because nothing can happen until the error is corrected.
But in a syntax colorer, it’s the wrong behavior. Why? Because the syntax colorer is called after every keystroke, not just when the user clicks Run. It’s likely that after any given keystroke, the program can’t be lexed without errors. So we want to handle those errors when they arise, so that the user’s code editing can continue peacefully.
As we did in jsonic, we make our colorer available to DrRacket by adding a get-info function to our "main.rkt" module:
```racket
#lang br/quicklang
(require "parser.rkt" "tokenizer.rkt")

(module+ reader
  (provide read-syntax get-info))

(define (read-syntax path port)
  (define parse-tree (parse path (make-tokenizer port path)))
  (strip-bindings
   #`(module basic-mod basic/expander
       #,parse-tree)))

(define (get-info port src-mod src-line src-col src-pos)
  (define (handle-query key default)
    (case key
      [(color-lexer)
       (dynamic-require 'basic/colorer 'basic-colorer)]
      [else default]))
  handle-query)
```
When get-info gets the color-lexer key, it uses dynamic-require to import the basic-colorer function from the basic/colorer module. We’ll set up this module next.
Let’s start our new "basic/colorer.rkt" module. We import "lexer.rkt" to get our main lexer, and set up a new function called basic-colorer:
DrRacket passes our colorer function a port argument. To start, we just pass this port through to basic-lexer, which will return an instance of srcloc-token that we’ll use to construct our coloring annotation.
We know, however, that basic-lexer might raise an error. We want to catch this error. As we’ve done before, we wrap our call to basic-lexer in a with-handlers expression:
```racket
#lang br
(require "lexer.rkt" brag/support)
(provide basic-colorer)

(define (basic-colorer port)
  (define (handle-lexer-error excn)
    (define excn-srclocs (exn:fail:read-srclocs excn))
    (srcloc-token (token 'ERROR) (car excn-srclocs)))
  (define srcloc-tok
    (with-handlers ([exn:fail:read? handle-lexer-error])
      (basic-lexer port)))
  ···)
```
As we would find out from reading the docs for lexer-srcloc, when it can’t match any rule, it raises an exception of type exn:fail:read. Therefore, our with-handlers expression invokes the related predicate, exn:fail:read?, and passes any matching exception to our helper function handle-lexer-error.
To patch over the error, our helper function needs to return an instance of srcloc-token. The srcloc-token constructor function takes two arguments: a plain token value, and a srcloc structure.
An exception of type exn:fail:read has a srclocs field that contains a list of srcloc structures related to the error. (Different exception types have different fields that are useful in handling errors; the documentation has details for each type.) We use the accessor exn:fail:read-srclocs to store these in excn-srclocs and then use car to get the topmost one.
As for the token value, we just make a token structure with type 'ERROR.
Taken together, our srcloc-tok variable will either get its value from basic-lexer, or if an error occurs, from handle-lexer-error.
The rest of our colorer will read fields out of the srcloc-tok value and use them to create coloring annotations. To make this easy, we’ll use two new functions: match and match-define.
match is one of Racket's secret weapons: an extremely clever feature that few other programming languages offer. Just as regular expressions let us deconstruct strings, and syntax patterns let us deconstruct syntax objects, match lets us deconstruct any Racket value.
The basic match form works like cond: it takes a value as input, plus a series of branches. If the pattern on the left side matches the value, then the value on the right side of the branch is returned. An optional else branch handles everything else:
```racket
(struct thing (x y))

(define (m in)
  (match in
    ["foo" 'got-foo]         ; literal match
    [(? number?) 'got-number] ; predicate match
    [(list a b c) (list b)]   ; list match + assignment
    [(thing i j) (+ i j)]     ; structure match + assignment
    [else 'no-match]))

(m "foo")         ; 'got-foo
(m 42)            ; 'got-number
(m (list 1 2 3))  ; '(2)
(m (thing 25 52)) ; 77
(m "bar")         ; 'no-match
```
In this example, we see a few of the options for left-hand patterns. (Many more can be found in the docs for match.) A literal value like "foo" is matched exactly. A predicate like number? can be matched by wrapping it in the ? operator. When list is used as a match pattern, it not only matches the input but also assigns the pieces to new identifiers, in this case extracting the middle element b into a new list. Similarly, any structure type can be used as a pattern, and its values matched by position.
The match-define form uses the same pattern-matching vocabulary as match to directly create new variables by “reverse engineering” existing values:
```racket
(define xs (list 1 2 3))
(match-define (list a b c) xs)
a ; 1
b ; 2
c ; 3

(struct thing (x y))
(define th (thing 25 52))
(match-define (thing i j) th)
i ; 25
j ; 52
```
In other words, when we use list (or thing) on the left side of a match-define, we’re not creating a list (or instance of the thing structure type) but rather using it as a template for decomposing the value on the right.
We’ll use match and match-define to help finish the colorer.
The rest of the colorer will proceed by matching against srcloc-tok. As usual, we need to create a special rule to handle the eof case:
```racket
#lang br
(require "lexer.rkt" brag/support)
(provide basic-colorer)

(define (basic-colorer port)
  (define (handle-lexer-error excn)
    (define excn-srclocs (exn:fail:read-srclocs excn))
    (srcloc-token (token 'ERROR) (car excn-srclocs)))
  (define srcloc-tok
    (with-handlers ([exn:fail:read? handle-lexer-error])
      (basic-lexer port)))
  (match srcloc-tok
    [(? eof-object?) (values srcloc-tok 'eof #f #f #f)]
    [else ···]))
```
We use eof-object? wrapped in ? as a left-hand match pattern. On the right, we return a color annotation with srcloc-tok as the value and 'eof as the category. The other fields will be ignored.
Then we can move on to the more interesting cases. First we use match-define to decompose our srcloc-tok into fields and assign them to variables:
```racket
#lang br
(require "lexer.rkt" brag/support)
(provide basic-colorer)

(define (basic-colorer port)
  (define (handle-lexer-error excn)
    (define excn-srclocs (exn:fail:read-srclocs excn))
    (srcloc-token (token 'ERROR) (car excn-srclocs)))
  (define srcloc-tok
    (with-handlers ([exn:fail:read? handle-lexer-error])
      (basic-lexer port)))
  (match srcloc-tok
    [(? eof-object?) (values srcloc-tok 'eof #f #f #f)]
    [else
     (match-define
       (srcloc-token
        (token-struct type val _ _ _ _ _)
        (srcloc _ _ _ posn span)) srcloc-tok)
     ···]))
```
We know each srcloc-tok is an instance of srcloc-token with a token-struct as the first value and a srcloc as the second. So our match pattern combines all three, letting us directly define variables to hold the fields we need to make our coloring annotation: type and val from the token, plus posn and span from the source location. (Each _ in the pattern means "ignore this field".)
Now we need to assemble the five values we need for our coloring annotation: the token value, the coloring category, the parenthesis shape, and the start and end locations. We already stored the token value in val. The start location is just posn, and the end location is (+ start span). Then we use another match-define to figure out the cat and paren values:
```racket
#lang br
(require "lexer.rkt" brag/support)
(provide basic-colorer)

(define (basic-colorer port)
  (define (handle-lexer-error excn)
    (define excn-srclocs (exn:fail:read-srclocs excn))
    (srcloc-token (token 'ERROR) (car excn-srclocs)))
  (define srcloc-tok
    (with-handlers ([exn:fail:read? handle-lexer-error])
      (basic-lexer port)))
  (match srcloc-tok
    [(? eof-object?) (values srcloc-tok 'eof #f #f #f)]
    [else
     (match-define
       (srcloc-token
        (token-struct type val _ _ _ _ _)
        (srcloc _ _ _ posn span)) srcloc-tok)
     (define start posn)
     (define end (+ start span))
     (match-define (list cat paren)
       (match type
         ['STRING '(string #f)]
         ['REM '(comment #f)]
         ['ERROR '(error #f)]
         [else (match val
                 [(? number?) '(constant #f)]
                 [(? symbol?) '(symbol #f)]
                 ["(" '(parenthesis |(|)]
                 [")" '(parenthesis |)|)]
                 [else '(no-color #f)])]))
     (values val cat paren start end)]))
```
Each branch of our match-define for (list cat paren) will return a list with a category value and parenthesis shape. This list will be assigned to cat and paren, which can then be inserted into the final values expression that returns the finished coloring annotation, along with the already calculated val, start, and end.
Within the match-define, we use two instances of match. The first matches against the type. A token of type 'STRING gets colored as a 'string, a 'REM token gets colored as a 'comment, and an 'ERROR token gets colored as an 'error. (In all cases, there is no parenthesis shape.) Astute observers may have noticed that we gave our 'ERROR tokens a type but not a value, which defaults to #f. DrRacket primarily relies on the start and end positions for coloring. So as long as those are correct, the colorer will work, despite the missing token value.
If type doesn't match any of those cases, the else branch runs a nested match on val. Tokens that are number? are colored as a 'constant; those that are symbol? are colored as a 'symbol. We use literal "(" and ")" matches to catch any parentheses (this time our coloring annotation has both a 'parenthesis category and the correct parenthesis shape). Because parentheses already serve as list delimiters in Racket, a parenthesis used as a literal symbol has to be escaped with vertical bars: thus '|)| rather than '). Finally, anything else gets a 'no-color annotation.
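The vertical-bar escape works for any symbol whose name contains characters Racket would otherwise treat specially. A quick illustration in the REPL:

```racket
; A symbol whose name is a single close-parenthesis character.
'|)|

; It behaves like any other symbol:
(symbol? '|)|)                   ; #t
(symbol->string '|)|)            ; ")"
(eq? '|)| (string->symbol ")"))  ; #t
```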
And that's the finished colorer. Keep in mind that these coloring decisions are arbitrary. We're just trying to find the most intuitive mapping from things in our language to DrRacket's coloring categories. DrRacket neither knows nor cares whether, for instance, something colored as a 'comment is in fact a comment. Syntax coloring doesn't affect how the language works, just how it looks.
Recall how our sample program looked before we wrote our colorer:
Some elements were colored correctly. But the rem comments on lines 30 and 70 were not colored as comments. The value 'three' on line 60 was not colored as a string. And the part of line 10 after the ; was colored as a comment when it should appear like ordinary code.
Let’s open our "sample.rkt" file in DrRacket. For faster performance, DrRacket caches the result of get-info for each language. We have to force a refresh. If we’re using Racket v6.9 or later, we select Racket → Reload #lang Extensions, which reloads our get-info function and our new colorer. If not, we quit and restart DrRacket, which has the same effect. After a moment, the code will look like this:
Notice that the rem comments are now colored correctly, 'three' is correctly colored as a string value, and line 10 is colored normally.
If we type some nonsense characters on the last line that are unrecognized by the lexer, they’re colored as error characters, but they don’t interrupt our editing:
If we go back and put quote marks around the nonsense, we convert it to a valid string, which is recolored accordingly:
What are the limits of sharing the main language lexer with the syntax colorer? For instance, what if we wanted to do something slick, like apply a different color to the line numbers?
In principle it’s possible. But we can only apply different colors to things that can be differentiated from within the lexer (i.e., with different rules and token types).
For instance, right now a line number is delivered inside a token of type 'INTEGER, so it can't be differentiated from an ordinary number that's also delivered inside an 'INTEGER token. It's true that these can be differentiated by the parser, because a line number occupies a specific position in the grammar, but we can't use a parser here. Why not? Because we can't parse something unless it can first be lexed and tokenized, and as we noticed at the beginning, the code is often in a non-lexable state during editing.
To apply a separate color to line numbers, we’d have to add a rule to the lexer to specially match line numbers and wrap them in a new token type, say 'LINE-NUMBER. Then we’d be able to apply a color to 'LINE-NUMBER tokens that’s different from 'INTEGER tokens.
So what is this new lexer rule? Because every program line has to be on a separate line of the file, we could imagine a lexer rule that captured a sequence of a newline and an integer as a 'LINE-NUMBER. But once we change how newlines are tokenized, we'd also have to adjust our parser grammar, as well as the source location of each 'LINE-NUMBER token to account for the leading newline.
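As a rough sketch only (this rule is hypothetical and not part of our actual lexer; adopting it would require the parser and source-location adjustments just described), such a rule might look like this, reusing the digits abbreviation from "lexer.rkt":

```racket
; Hypothetical rule sketch: capture a newline followed by digits
; as a 'LINE-NUMBER token. NOT part of the real basic-lexer.
[(:seq "\n" digits)
 (token 'LINE-NUMBER
        ; drop the leading newline before converting
        (string->number (string-trim lexeme)))]
```

A colorer branch could then map 'LINE-NUMBER tokens to their own category, distinct from 'INTEGER tokens.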
In short: it could be done. But it might be more trouble than it’s worth.
The alternative is to write a separate lexer for the syntax colorer, as we did for jsonic. This would give us the freedom to lex the source code however we want to produce coloring effects. Of course, because the syntax colorer is strictly cosmetic, this new lexer wouldn't affect the behavior of the language (nor require changes to the parser, and so on).
Both approaches are reasonable. And even if we don't share the main lexer with the syntax colorer, we can still share a set of lexer abbreviations (by exporting them from one lexer and importing them into the other). These abbreviations can hold the core patterns used by the rules of both lexers, making both simpler to write.