Before we move into DrRacket integration, we need to take a short detour to prepare our language.
A source location is a set of fields that pinpoint where an S-expression (or other code) came from within a source file. Source locations are used throughout Racket (as they are in other languages) for various tasks. For instance, error reports usually have a source location:
#lang br
(require rackunit)
(check-true #f)
--------------------
FAILURE
name:       check-true
location:   unsaved-editor:3:0
params:     (#f)
expression: (check-true #f)
--------------------
The message unsaved-editor:3:0 tells us that the error occurred on line 3, column 0.
Source locations are typically tracked as a set of five fields, any of which can be #f if the value is unknown:
The source origin, which is usually a path in the filesystem.
A position (counting from 1), which is the number of characters from the start of the file.
A line number (counting from 1), which is the vertical line number from the top.
A column number (counting from 0), which is the horizontal offset within the current line.
A span (counting from 0), which is the number of characters that the code occupies relative to its position measurement.
One small gotcha: not every Racket function that uses source locations prints all the source-location fields, or prints them in the same order. As we saw above, rackunit prints only the source name, line, and column from the source location.
Racket has a standard structure type called srcloc that holds all five of these values. Functions that handle source locations often use srcloc structures for input or output.
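For instance, here's a quick sketch of building a srcloc by hand and reading back its fields (the source name is invented for illustration):

; fields in order: source, line, column, position, span
(define loc (srcloc "sample.rkt" 3 0 42 5))
(srcloc-source loc)   ; "sample.rkt"
(srcloc-line loc)     ; 3
(srcloc-column loc)   ; 0
(srcloc-position loc) ; 42
(srcloc-span loc)     ; 5
(srcloc->string loc)  ; "sample.rkt:3:0"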
As we learned in stacker, one of the big benefits of implementing a language in Racket is that it can use all of Racket’s existing libraries and tools. This works because every Racket-implemented language is really a source-to-source compiler that translates the new language into a Racket program.
This means that even a graphical tool like DrRacket can handle languages that don’t look anything like Racket. For instance, source locations are used by DrRacket to handle error-highlighting effects and other GUI conveniences. All we have to do is attach the original source locations to the new Racket code. Suppose our new language lets us define a variable like so:
let x be 42 / 0
And suppose it’s translated into Racket code that looks something like this (a sketch; the exact translation depends on our expander):
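(define x (/ 42 0)) ; sketch of a plausible translation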
When the divide-by-zero error occurs, DrRacket will be able to use the original source location to highlight the error in the original code. (We’ll learn how to actually do this in the basic tutorial.)
But the cost of this cooperation is a little extra housekeeping: our language has to provide the information needed to support other Racket tools, including source locations.
As with contracts and unit tests, using source locations is optional. In earlier tutorials, we didn’t worry about source locations because we wanted to keep the focus on learning the core mechanics of making a language. Nothing bad happened.
But since we now want to integrate our language with DrRacket, we need to take this detour.
We’ve already worked with syntax objects as a way of packaging a reference to literal Racket code with certain metadata fields. One field we’ve mentioned is lexical context, which is a list of variables visible to the code.
Source location is another field that’s stored in a syntax object. For instance, when we use the #' prefix to make a syntax object from a datum, its source location will automatically be stored in the resulting syntax object. In turn, these source-location fields can be read with syntax-line, syntax-column, etc.:
(define stx #'foobar)
(syntax-position stx) ; 24
(syntax-line stx)     ; 2
(syntax-column stx)   ; 14
(syntax-span stx)     ; 6
As with lexical context, a syntax object retains a reference to its original source location unless these fields are explicitly changed.
As we might guess, source locations are derived from the original source file that holds the code. Therefore, whatever function reads the source code is responsible for collecting source locations.
After that, functions that handle the code need to preserve the source locations. If we don’t manipulate the syntax objects much, this is easy. But more complex manipulations can incur a little extra effort to make sure source locations stay where they should.
In jsonic, the function that first reads in the source code is make-tokenizer. Right now, make-tokenizer doesn’t collect source locations. We can see this if we feed make-tokenizer a character:
(require brag/support jsonic/tokenizer)
(apply-tokenizer-maker make-tokenizer "x")
(list (token-struct 'CHAR-TOK "x" #f #f #f #f #f))
Our lexer is using the helper function token to create an instance of token-struct named CHAR-TOK with a value of "x". But the next four fields are the source-location fields. They’re set to #f because no source-location information has been collected. The fifth #f signals whether the parser should ignore the token. (We’ll do more with this in a later tutorial.)
For complete source locations, we need four pieces of data: position, line, column, and span. We’ll collect this data, then embed it within the token structures emitted by make-tokenizer. After that, as long as our parse and read-syntax functions preserve the source locations, the source locations will be available to other tools, like DrRacket.
Let’s open "tokenizer.rkt" so we can add the necessary code to make-tokenizer.
Line count and column count are available from the input port. But by default, an input port doesn’t track this information. So first, we activate line and column counting for our port by adding a call to port-count-lines!:
···
(define (make-tokenizer port)
  (port-count-lines! port) ; <- turn on line & column counting
  (define (next-token)
    (define jsonic-lexer
      (lexer
       [(from/to "//" "\n") (next-token)]
       [(from/to "@$" "$@")
        (token 'SEXP-TOK (trim-ends "@$" lexeme "$@"))]
       [any-char (token 'CHAR-TOK lexeme)]))
    (jsonic-lexer port))
  next-token)
(provide (contract-out
          [make-tokenizer
           (input-port? . -> . (-> jsonic-token?))]))
···
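To see what this changes, here's a quick REPL sketch (not part of jsonic) using port-next-location, which reports the line, column, and position of the next item to be read from a port:

(define p (open-input-string "hi"))
(port-next-location p)  ; #f #f 1  <- line & column aren't tracked yet
(port-count-lines! p)
(port-next-location p)  ; 1 0 1    <- now they are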
We won’t attach source-location data to our line-comment rule, because it doesn’t produce a token. But we will add it to our SEXP-TOK and CHAR-TOK rules.
Let’s start with CHAR-TOK because it’s a little easier:
···
(define (make-tokenizer port)
  (port-count-lines! port)
  (define (next-token)
    (define jsonic-lexer
      (lexer
       [(from/to "//" "\n") (next-token)]
       [(from/to "@$" "$@")
        (token 'SEXP-TOK (trim-ends "@$" lexeme "$@"))]
       [any-char
        (token 'CHAR-TOK lexeme
               #:position (pos lexeme-start)
               #:line (line lexeme-start)
               #:column (col lexeme-start)
               #:span (- (pos lexeme-end)
                         (pos lexeme-start)))]))
    (jsonic-lexer port))
  next-token)
(provide (contract-out
          [make-tokenizer
           (input-port? . -> . (-> jsonic-token?))]))
···
Just as lexer creates a special variable called lexeme that holds the matched characters, it also creates lexeme-start and lexeme-end, special variables that hold the position, line, and column for the start and end of the lexeme. We retrieve these values from lexeme-start or lexeme-end with the helper functions pos, line, and col. Because the span is a relative measurement, we calculate it by subtracting the start position from the end position. We then pass these values to token using its corresponding keyword arguments—#:position, #:line, #:column, and #:span.
Let’s use the REPL to see how this changes the result from make-tokenizer. If we run "tokenizer.rkt" now, we’ll get errors because our unit tests will fail. Don’t panic—we’ll fix those in a minute. Let’s jump down to the REPL and enter the sample expression we tried earlier:
(apply-tokenizer-maker make-tokenizer "x")
(list (token-struct 'CHAR-TOK "x" 1 1 0 1 #f))
Last time, our source-location fields were all #f, because we hadn’t filled them in. This time, they show the source location we embedded for "x": it’s in position 1 of the input, line 1, column-offset 0, and has a span of 1.
Now we’ll handle the SEXP-TOK rule:
···
(define (make-tokenizer port)
  (port-count-lines! port)
  (define (next-token)
    (define jsonic-lexer
      (lexer
       [(from/to "//" "\n") (next-token)]
       [(from/to "@$" "$@")
        (token 'SEXP-TOK (trim-ends "@$" lexeme "$@")
               #:position (+ (pos lexeme-start) 2)
               #:line (line lexeme-start)
               #:column (+ (col lexeme-start) 2)
               #:span (- (pos lexeme-end)
                         (pos lexeme-start) 4))]
       [any-char
        (token 'CHAR-TOK lexeme
               #:position (pos lexeme-start)
               #:line (line lexeme-start)
               #:column (col lexeme-start)
               #:span (- (pos lexeme-end)
                         (pos lexeme-start)))]))
    (jsonic-lexer port))
  next-token)
(provide (contract-out
          [make-tokenizer
           (input-port? . -> . (-> jsonic-token?))]))
···
The basic idea is the same. But we need to adjust our source-location fields because we’re trimming two characters from each end of the lexeme, but the source-location data comes from the untrimmed lexeme. Because each delimiter is two characters, we add 2 to both the position and column, and subtract 4 from the overall span.
That’s everything we need to change within make-tokenizer. We can see, perhaps, why not every project needs to track source locations—it requires some extra housekeeping to keep the source-location data in sync, especially if we’re performing other processing on our lexemes. On the other hand, we only have to do it once, and then we can enjoy the benefits of source locations throughout our language, including DrRacket.
Now that we’ve improved make-tokenizer, we have to update its unit tests to reflect the new behavior. (We didn’t change anything about jsonic-token?, so its tests will remain the same.)
We’ll rewrite our tests to use token and its keyword arguments to generate test tokens with the right source-location data. We can do this by just counting characters. Here’s the first test case that broke:
(apply-tokenizer-maker make-tokenizer "@$ (+ 6 7) $@")
According to our lexer rules, this will become an SEXP-TOK token. Its lexeme will have two characters trimmed from the beginning and end. For the trimmed lexeme, the position is two more than the original 1, so 3. The line number remains 1. The column offset is two more than the original 0, so 2. And the span is four less than the original 13, so 9. So we’ll expect a list with one token that looks like this:
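(list (token 'SEXP-TOK " (+ 6 7) "
             #:position 3
             #:line 1
             #:column 2
             #:span 9))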
We do the same for our other test case:
(apply-tokenizer-maker make-tokenizer "hi")
According to our lexer rules, this will become two CHAR-TOK tokens. They will both be on line 1 and both have a span of 1. The first will be in position 1 and column 0. The second will be in position 2 and column 1:
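(list (token 'CHAR-TOK "h"
             #:position 1
             #:line 1
             #:column 0
             #:span 1)
      (token 'CHAR-TOK "i"
             #:position 2
             #:line 1
             #:column 1
             #:span 1))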
Substituting these new test results, the completed module will look like this:
#lang br/quicklang
(require brag/support racket/contract)

(module+ test
  (require rackunit))

(define (jsonic-token? x)
  (or (eof-object? x) (token-struct? x)))

(module+ test
  (check-true (jsonic-token? eof))
  (check-true (jsonic-token? (token 'A-TOKEN-STRUCT "hi")))
  (check-false (jsonic-token? 42)))

(define (make-tokenizer port)
  (port-count-lines! port)
  (define (next-token)
    (define jsonic-lexer
      (lexer
       [(from/to "//" "\n") (next-token)]
       [(from/to "@$" "$@")
        (token 'SEXP-TOK (trim-ends "@$" lexeme "$@")
               #:position (+ (pos lexeme-start) 2)
               #:line (line lexeme-start)
               #:column (+ (col lexeme-start) 2)
               #:span (- (pos lexeme-end)
                         (pos lexeme-start) 4))]
       [any-char
        (token 'CHAR-TOK lexeme
               #:position (pos lexeme-start)
               #:line (line lexeme-start)
               #:column (col lexeme-start)
               #:span (- (pos lexeme-end)
                         (pos lexeme-start)))]))
    (jsonic-lexer port))
  next-token)
(provide (contract-out
          [make-tokenizer
           (input-port? . -> . (-> jsonic-token?))]))

(module+ test
  (check-equal? (apply-tokenizer-maker make-tokenizer "// comment\n")
                empty)
  (check-equal? (apply-tokenizer-maker make-tokenizer "@$ (+ 6 7) $@")
                (list (token 'SEXP-TOK " (+ 6 7) "
                             #:position 3
                             #:line 1
                             #:column 2
                             #:span 9)))
  (check-equal? (apply-tokenizer-maker make-tokenizer "hi")
                (list (token 'CHAR-TOK "h"
                             #:position 1
                             #:line 1
                             #:column 0
                             #:span 1)
                      (token 'CHAR-TOK "i"
                             #:position 2
                             #:line 1
                             #:column 1
                             #:span 1))))
When we run "tokenizer.rkt" again, the unit-testing errors will be gone.
Earlier, we noted that once we’ve captured the source locations, our parse and read-syntax functions need to leave them intact so they’ll be available to tools like DrRacket. Now that we’ve updated make-tokenizer to collect the source locations, we should verify that they’re making it all the way through.
Let’s open "reader.rkt". Our read-syntax function relies on parse. So it should suffice to pass some source code to read-syntax and see if the source locations come through correctly (because that will also imply they’re being handled correctly by parse).
Let’s try a simple test case on the REPL:
(read-syntax #f (open-input-string "//x\ny\nz"))
read-syntax usually takes two arguments: a path to a source file and an input port that points at that file. But when we’re making test cases, we can also use it with a source string rather than a file. For the first argument, we’ll just pass #f. For the second argument, instead of a file port, we can convert a string into an input port with open-input-string. The string we’re using, "//x\ny\nz", is equivalent to this source:
//x
y
z
The x is inside a line comment, so it should disappear. The y and z (and the intervening newline) should appear in the parse tree, with source locations on lines 2 and 3, respectively.
When we run this expression on the REPL, the result will look like this:
> #<syntax (module jsonic-module jsonic/...>
But in DrRacket, the little arrow on the left end of the line will be clickable. Click on it, and DrRacket will reveal a panel that we can use to explore the syntax object returned by read-syntax:
The left side of the panel shows the literal code inside the syntax object. We can see that it’s what we expected—the first line //x disappears, and the other characters go into the parse tree.
DrRacket also lets us click on items of interest within the syntax object. The properties of each item will be shown in the right-hand panel labeled Syntax Info. Again, click the triangle to the left of the panel’s name to reveal the whole panel. If we then click the "y" in the parse tree (as shown below), its source location will be revealed: position 5, line 2, column 0, and span 1.
This is exactly right. Likewise, if we click "z", we’ll see position 7, line 3, column 0, and span 1.
Thus, we’ve verified that parse and read-syntax are preserving source locations correctly. It would also be possible to write some unit tests that automatically verify this, though that would be a detour from the current detour. (For the curious, one possible sketch of such a test appears below.) Let’s press onward.
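A sketch of such a test, which could live in the test submodule of "reader.rkt". Here, find-syntaxes is a hypothetical helper (not part of jsonic), and the test assumes that "y" survives as a string leaf in the parse tree:

(module+ test
  (require rackunit racket/list)
  ; hypothetical helper: collect every sub-syntax whose datum equals val
  (define (find-syntaxes stx val)
    (define subs (syntax->list stx)) ; #f when stx isn't a syntax list
    (append (if (equal? (syntax->datum stx) val) (list stx) empty)
            (if subs
                (append-map (lambda (s) (find-syntaxes s val)) subs)
                empty)))
  (define stx (read-syntax #f (open-input-string "//x\ny\nz")))
  (define y-stx (first (find-syntaxes stx "y")))
  (check-equal? (syntax-line y-stx) 2)
  (check-equal? (syntax-column y-stx) 0))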
What’s neat about source locations in Racket is that they’re a property of the original source code that’s carried along with the code wherever it goes. So even if our language implementation arranges the source code into a parse tree, then slices & dices it with any number of macros—the source locations remain attached. (Or if we need to replace the code entirely, we can take its source location and attach it to the new item.) This, in turn, is possible because Racket handles code as recursively annotated syntax objects rather than plain strings.
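For instance, here's a minimal sketch of that replacement trick: datum->syntax accepts an optional source-location argument, which can simply be another syntax object whose location we want to reuse (the replacement datum is invented):

(define old-stx #'(/ 42 0))
(define new-stx
  (datum->syntax old-stx         ; reuse lexical context
                 '(error "nope") ; invented replacement code
                 old-stx))       ; reuse source location
(equal? (syntax-line old-stx) (syntax-line new-stx)) ; #t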
Just as there’s more than one way to read a source file—we’ve been using a combination of a tokenizer and parser, but that’s just one possibility—there’s more than one way to collect source-location data. For instance, we’ll learn a sleeker way of collecting source locations in the basic tutorial.
The essential idea always remains the same. As we read the source file, we also capture the source-location fields. We then annotate the parsed expressions with this information.