Fancier lexing for lexing fanciers.
1 2 3 4 5 6 7 8 9 | (define-lex-abbrev regex-chars (char-set "()*+?.^$|<!=[]")) (define lex (lexer [(:+ "\n") (token 'NEWLINE lexeme)] [(from/stop-before ";" "\n") (token 'COMMENT #:skip? #t)] [(:+ whitespace) (token 'SP lexeme #:skip? #t)] [regex-chars lexeme] [alphabetic (token 'LITERAL lexeme)])) |
The token structure type lets us create a token that cooperates specially with a brag parser. The first field of the structure is the value matched in the grammar. But the second field of the structure is the value inserted in the parse tree.
SYMBOLIC-NAMES (in all caps) are used to classify a set of tokens (like identifiers, numbers, or strings). When used in the grammar, these names can be referenced without surrounding quotes:
1 2 | #lang brag top : ("x" | FOO)+ |
We make a token with a symbolic name by passing a symbol as the first argument, and the token value as the second:
1 | (token 'FOO 42) |
So when we feed these tokens to our parser:
1 | (parse-to-datum (list "x" (token 'FOO 42))) |
We get this result:
1 | '(top "x" 42) |
The lexer supports many useful matching operators, e.g. :+, from/stop-before, char-set, and alphabetic.
define-lex-abbrev lets us assign names to lexer patterns. This can be handy for defining a set of “reserved” characters that are assigned special meaning in our language.
token structures with #:skip? #t are ignored by a brag parser.
Make a language called algebra that can evaluate the following source. Implement f and g as functions; implement f(zz,zz) and g(10) and g(23) as function applications. # denotes a line comment.
Don’t worry about generalizing the grammar (we’ll do that in the next project). Just make it good enough to run the sample.
Example:
1 2 3 4 5 6 | #lang algebra fun f(x,y) = x + y # fun f(x,y) = x * y fun g(zz) = f(zz,zz) g(10) g(23) |
Result:
1 2 | 20 46 |
Hint: Starter files:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | #lang br/quicklang (require brag/support "grammar.rkt") (provide top fun expr app) (module+ reader (provide read-syntax)) (define-lex-abbrev reserved-toks (:or #;···)) (define tokenize-1 (lexer [whitespace #;···] [(from/stop-before "#" "\n") #;···] [reserved-toks #;···] [(:+ alphabetic) #;···] [(:+ (char-set "0123456789")) #;···])) (define-macro top #;···) (define-macro-cases fun #;···) (define-macro-cases expr #;···) (define-macro app #;···) (define (read-syntax src ip) (define parse-tree (parse src (λ () (tokenize-1 ip)))) (strip-bindings (with-syntax ([PT parse-tree]) #'(module algebra-mod algebra PT)))) |
1 2 3 4 5 6 | #lang brag top : # ··· fun : # ··· expr : # ··· app : # ··· |
Hint: In the lexer, assign symbolic token names to your identifiers (like ID) and integers (like INT).
Hint: In the lexer, make tokens with #:skip #true to ignore parts of the source that serve no purpose. (For instance, whitespace and comments.) Then you don’t have to handle them in the grammar.
Hint: In the grammar, use cuts to remove tokens that serve no further purpose. (For instance, reserved tokens can often be discarded.) Then you don’t have to handle them in the expander.
Hint: You don’t need to implement multiplication to run the test file. (Do you see why?)
Hint: In the expander, you can leverage interposition points. For instance, if you need a macro called app that handles function applications in your program, you could do:
1 | (define-macro (app ARG ...) #'(#%app ARG ...)) |
Or equivalently:
1 | (define-macro app #'#%app) |
Hint: At the REPL, you can feed test cases to your tokenizer like so:
1 | (apply-port-proc tokenize-1 "g(23)") |
Result:
1 | (list (token-struct 'ID 'g #f #f #f #f #f) "(" (token-struct 'INT 23 #f #f #f #f #f) ")") |
You can feed test cases to your parser like so:
1 | (parse-to-datum (apply-port-proc tokenize-1 "g(23)")) |
Result:
1 | '(top (app g 23)) |
A little knowledge is still dangerous.