As we did in bf, we’ll use #lang brag to generate a parser for jsonic based on a grammar that describes the structure of jsonic programs. The parser will take the tokens generated by the tokenizer and organize them into a parse tree. (Later, this parse tree will be passed to the expander for further processing.)
Let’s start a new file called "parser.rkt". From there, we’ll build the grammar piece by piece:
```racket
#lang brag
```
The first rule in the grammar is the starting production rule, which will become the top node of the parse tree. We name this rule jsonic-program:
```racket
#lang brag
jsonic-program :
```
What’s in a jsonic-program? Let’s be careful: though our source code combines syntax from JSON and S-expressions, our grammar doesn’t have to actually handle the details of parsing JSON and S-expressions. It just needs to be able to distinguish one from the other. After that, the S-expressions can be evaluated as usual; everything else is JSON, and can be passed through as is.
So we describe a jsonic-program as a sequence of elements, each of which is either a jsonic-char or a jsonic-sexp. We use | in the grammar to indicate a choice between elements, and * to indicate zero or more occurrences of the element:
```racket
#lang brag
jsonic-program : (jsonic-char | jsonic-sexp)*
```
Now we need to add production rules for our two new elements, jsonic-char and jsonic-sexp:
```racket
#lang brag
jsonic-program : (jsonic-char | jsonic-sexp)*
jsonic-char :
jsonic-sexp :
```
A jsonic-sexp is any sequence of characters between our opening and closing delimiters. But our tokenizer has already done that work: it packaged each S-expression into a named token called SEXP-TOK, and every other character into a named token called CHAR-TOK. So we can use these named tokens directly in our grammar:
```racket
#lang brag
jsonic-program : (jsonic-char | jsonic-sexp)*
jsonic-char : CHAR-TOK
jsonic-sexp : SEXP-TOK
```
By the way, when the parser matches a named token, it pulls the matched string out of the token and puts it in the parse tree. Thus, the finished parse tree will contain no references to SEXP-TOK or CHAR-TOK. It will contain the strings that were inside those tokens.
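To make that last point concrete, here is a rough model in plain Racket of what happens to a named token. It is only a sketch: the tok struct and the toy-parse function are invented for illustration, and brag’s real parser works differently under the hood. The point is just that the token type picks the node, and the string inside the token is what lands in the tree.

```racket
#lang racket
;; Hypothetical stand-in for a brag token: a type tag plus the matched string.
(struct tok (type val) #:transparent)

;; Toy model of our grammar (not brag's implementation): each token is
;; replaced by a node named after the rule that matches its type, and
;; only the token's string survives into the tree.
(define (toy-parse toks)
  (cons 'jsonic-program
        (for/list ([t (in-list toks)])
          (match t
            [(tok 'CHAR-TOK str) (list 'jsonic-char str)]
            [(tok 'SEXP-TOK str) (list 'jsonic-sexp str)]))))

(toy-parse (list (tok 'CHAR-TOK "h") (tok 'SEXP-TOK " 42 ")))
;; => '(jsonic-program (jsonic-char "h") (jsonic-sexp " 42 "))
```

Neither CHAR-TOK nor SEXP-TOK appears in the result, which is the behavior we expect from the real parser too.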
And that’s all we need.
As we did with the tokenizer, let’s do some quick tests to make sure our grammar generates sensible parse trees. Again, this isn’t a substitute for thorough unit tests. But it provides a basic sanity check before we move on. Also, as we’re building a DSL, it’s good to know how we can watch the pieces work together provisionally before we get all the way to testing source files.
Let’s start a new source file in DrRacket. We’ll require our new tokenizer and parser from jsonic/tokenizer and jsonic/parser, respectively. We’ll also require brag/support so we can use apply-tokenizer-maker again.
```racket
#lang br
(require jsonic/parser jsonic/tokenizer brag/support)
```
We can now tokenize and parse toy programs, either within the definitions window, or at the REPL prompt. For instance, a jsonic program consisting only of a comment should be parsed into a tree with a top jsonic-program node, and nothing else:
```racket
#lang br
(require jsonic/parser jsonic/tokenizer brag/support)

(parse-to-datum (apply-tokenizer-maker make-tokenizer "// line comment\n"))
```
```racket
'(jsonic-program)
```
A program with a single S-expression between delimiters:
```racket
#lang br
(require jsonic/parser jsonic/tokenizer brag/support)

(parse-to-datum (apply-tokenizer-maker make-tokenizer "@$ 42 $@"))
```
```racket
'(jsonic-program (jsonic-sexp " 42 "))
```
A program with no delimiters at all, just ordinary characters:
```racket
#lang br
(require jsonic/parser jsonic/tokenizer brag/support)

(parse-to-datum (apply-tokenizer-maker make-tokenizer "hi"))
```
```racket
'(jsonic-program
  (jsonic-char "h")
  (jsonic-char "i"))
```
A three-line program that contains all three items:
```racket
#lang br
(require jsonic/parser jsonic/tokenizer brag/support)

(parse-to-datum (apply-tokenizer-maker make-tokenizer "hi\n// comment\n@$ 42 $@"))
```
```racket
'(jsonic-program
  (jsonic-char "h")
  (jsonic-char "i")
  (jsonic-char "\n")
  (jsonic-sexp " 42 "))
```
Exactly right.
This is an apt moment to learn an easier way to write multiline input for any Racket function: a here string. A here string is introduced with #<<LABEL, where LABEL is an arbitrary name that will terminate the here string. The here string starts on the next line, and ends when LABEL appears on a line of its own. Below, we’ll use DEREK as a terminator, and include unescaped quotes around hi:
```racket
#lang br
(require jsonic/parser jsonic/tokenizer brag/support)

(parse-to-datum (apply-tokenizer-maker make-tokenizer #<<DEREK
"hi"
// comment
@$ 42 $@
DEREK
))
```
```racket
'(jsonic-program
  (jsonic-char "\"")
  (jsonic-char "h")
  (jsonic-char "i")
  (jsonic-char "\"")
  (jsonic-char "\n")
  (jsonic-sexp " 42 "))
```
This time, the quotes around "hi" are treated as part of the source string, and thus appear in the parse tree.
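If we want to convince ourselves that a here string is just an ordinary string, we can compare one against its escaped equivalent. What follows is a small sketch in plain Racket; the names escaped and here are made up for this example.

```racket
#lang racket
;; The same three-line program written two ways.
;; With ordinary string syntax, quotes and newlines must be escaped:
(define escaped "\"hi\"\n// comment\n@$ 42 $@")

;; With a here string, everything between the #<<DEREK line and the
;; DEREK terminator line is taken literally:
(define here #<<DEREK
"hi"
// comment
@$ 42 $@
DEREK
  )

(equal? escaped here) ; => #t
```

Notice that the newline just before the terminator isn’t part of the string. That matches the parse tree we got from the here string, which ends with the jsonic-sexp node rather than a trailing newline character.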
As we look over these results, we should recall that in every parse tree, all the characters in our source code will appear in a node of the parse tree (except those that are deliberately removed, like comments). Furthermore, the nodes of the parse tree correspond to the names and patterns of the production rules in the grammar. This structure, in turn, will guide the structure of our expander. We’ll add that next.
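Because each node mirrors a production rule, we can restate the grammar as a pattern over the parse-tree datum. The sketch below does that with match in plain Racket. The function name jsonic-tree? is invented for this example, and it’s only a sanity check on the shape of the results we’ve seen, not part of the jsonic language itself.

```racket
#lang racket
;; Hypothetical helper: does this datum have the shape our grammar
;; promises? The pattern follows the rules one for one:
;;   jsonic-program : (jsonic-char | jsonic-sexp)*
;;   jsonic-char    : CHAR-TOK   (appears in the tree as a string)
;;   jsonic-sexp    : SEXP-TOK   (appears in the tree as a string)
(define (jsonic-tree? datum)
  (match datum
    [(list 'jsonic-program
           (or (list 'jsonic-char (? string?))
               (list 'jsonic-sexp (? string?))) ...)
     #t]
    [_ #f]))

(jsonic-tree? '(jsonic-program))                       ; => #t
(jsonic-tree? '(jsonic-program (jsonic-char "h")
                               (jsonic-sexp " 42 ")))  ; => #t
(jsonic-tree? '(jsonic-program (comment "oops")))      ; => #f
```

The or pattern plays the role of | in the grammar, and the trailing ... plays the role of *.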
Before we move on, let’s make sure our reader source files are correct:
```racket
#lang br/quicklang
(module reader br
  (require "reader.rkt")
  (provide read-syntax))
```
"reader.rkt":

```racket
#lang br/quicklang
(require "tokenizer.rkt" "parser.rkt")

(define (read-syntax path port)
  (define parse-tree (parse path (make-tokenizer port)))
  (define module-datum `(module jsonic-module jsonic/expander
                          ,parse-tree))
  (datum->syntax #f module-datum))
(provide read-syntax)
```
"tokenizer.rkt":

```racket
#lang br/quicklang
(require brag/support)

(define (make-tokenizer port)
  (define (next-token)
    (define jsonic-lexer
      (lexer
       ;; skip a line comment by moving on to the next token
       [(from/to "//" "\n") (next-token)]
       ;; package a delimited S-expression, minus its delimiters
       [(from/to "@$" "$@")
        (token 'SEXP-TOK (trim-ends "@$" lexeme "$@"))]
       ;; every other character becomes its own token
       [any-char (token 'CHAR-TOK lexeme)]))
    (jsonic-lexer port))
  next-token)
(provide make-tokenizer)
```
"parser.rkt":

```racket
#lang brag
jsonic-program : (jsonic-char | jsonic-sexp)*
jsonic-char : CHAR-TOK
jsonic-sexp : SEXP-TOK
```