We introduce the second most consequential idea of our workshop: the grammar of a language.
A grammar describes the structure of every syntactically correct program in a language. It does not describe every semantically correct program.
Question (hard): is it possible to do that?
A grammar is a sequence of production rules consisting of names (on the left) and patterns (on the right). By default, the parse starts with the first rule.
The patterns are made of tokens (aka terminals) or names of production rules (aka nonterminals). These pattern elements can be further modified with quantifiers, groups, or choice patterns.
Possible grammars for US zip codes:
zip-code : digit{5} ["-" digit{4}]
digit : "0" | "1" | "2" | "3" | "4"
| "5" | "6" | "7" | "8" | "9"
zip-code : beginning [middle] end
beginning : digit digit
middle : end "-" digit
end : beginning digit
digit : number | zero
number : odd | even
odd : "1" | "3" | "5" | "7" | "9"
even : "2" | "4" | "6" | "8"
zero : "0"
Question: which is better? Why?
Questions (of no import): what is this grammar notation called? Who is credited with inventing the idea of a grammar?
A grammar can be used as source code for #lang brag to generate a parser.
#lang brag
zip-code : digit{5} ["-" digit{4}]
digit : "0" | "1" | "2" | "3" | "4"
| "5" | "6" | "7" | "8" | "9"
(parse-to-datum "01234-5678")
The result of a parse is a parse tree. Each leaf corresponds to the result of a production rule: it contains the name of the rule followed by the terminal elements matched to its pattern.
'(zip-code (digit "0") (digit "1") (digit "2") (digit "3") (digit "4") "-" (digit "5") (digit "6") (digit "7") (digit "8"))
Question: how do we know the parse tree only contains terminals?
Question: what makes this output format exceptionally convenient for makers of languages?
A grammar in brag can be further annotated with cuts and splices to remove or merge elements.
#lang brag
zip-code : @digit{5} [/"-" digit{4}] ; splice and cut
digit : "0" | "1" | "2" | "3" | "4"
| "5" | "6" | "7" | "8" | "9"
(parse-to-datum "01234-5678")
'(zip-code "0" "1" "2" "3" "4" (digit "5") (digit "6") (digit "7") (digit "8"))
A brag module exports two functions. parse applies the grammar to a list of tokens (or a token-producing thunk) and produces a syntax object. parse-to-datum does the same, but returns the parse tree as a raw datum rather than syntax object.
Question: why is this grammar faulty?
#lang brag
foo : "x" bar
bar : foo
Warning: brag won’t complain about ambiguous grammars.
#lang brag
top : foo | bar
foo : "x"
bar : "x"
Question: is this a bug or a feature?
It’s possible to start a parse with something other than the first rule. This is a technique we can use to adjust how the REPL works for our language.
Try writing a grammar in brag for one or more of the examples below. You don’t have to consider every case, but convince yourself the grammar works on common cases (with parse-to-datum).
16-bit binary numbers, like 0100101101101101
US phone numbers, like (213) 555-6890
CSS hex strings representing grays, like "#ccc" and "#999999"
CSS hex strings representing any color, like "#f9c" and "#036a7e"
Binary numbers that are palindromes, like 11 and 10010100100101001
Binary numbers with an equal number of 0 and 1 digits, like 01101010000111
Math expressions involving addition and multiplication (don’t forget about operator precedence), like 1 + 2 * 3 + 4
Make a language called tacogram that takes the same input as tacopocalypse and converts it back into a Racket program, but using a grammar-based parser rather than a lexer.
Example:
#lang tacogram
##$%#$%#$#$#$$##$%#$%#$#$#$$##$%#$#$#$%#$$##$#$#$%#$%%$#%#$%#$#$%%$##$#$%%#$%%$##$#$%%#$%%$#%%%%#$%%$##$#$#$#$#$%#$$#%%%#$%%%$#%%%%#$%%$##$%#$#$%%%$##$#$%%#$%%$##$#$%#$#$%%$##$%#$#$#$%#$$##$%#$%#$#$#$$##$#$#$%#$%#$$#%%#$%#$%#$$##$#$#$#$#$%#$$#%#$#$#$%%#$$##$#$#$#$#$%#$$##$#$#$%#$%#$$##$%#$%#$%#$$##$#$#$#$#$%#$$##$%#$#$%%#$$##$#$#$#$#$%#$$##$#$#$%#$%#$$#%#$%%#$%#$$##$#$#$#$#$%#$$##$#$#$%%%%$#%#$#$%#$%#$$#%#$#$%#$%#$$#%#$#$%#$%#$$
Result:
"hello world"
(+ 1 (* 2 (- x)))
Hint: you’ll need to organize your project with two files. I’ll give you the first one for free:
#lang br/quicklang
(require "grammar.rkt")
(module+ reader
(provide read-syntax))
(define (tokenize ip)
(for/list ([tok (in-port read-char ip)])
tok))
(define (parse src toks)
(define parse-tree-datum (parse-to-datum toks))
(for/list ([leaf (in-list (cdr parse-tree-datum))])
(integer->char
(for/sum ([val (in-list (cdr leaf))]
[power (in-naturals)]
#:when (equal? val '(taco)))
(expt 2 power)))))
(define (read-syntax src ip)
(define toks (tokenize ip))
(define parse-tree (parse src toks))
(strip-context
(with-syntax ([PT parse-tree])
#'(module tacogram-mod tacogram
PT))))
(define-macro (mb PT)
#'(#%module-begin
(display (list->string 'PT))))
(provide (rename-out [mb #%module-begin]))
The other is not free:
#lang brag
;; ···
Hint: work on "grammar.rkt" in isolation until this source string on the REPL:
(parse-to-datum "\n##$%#$%#$#$#$$\n")
Yields this result:
'(taco-program (taco-leaf (not-a-taco) (taco) (not-a-taco) (taco) (not-a-taco) (not-a-taco) (not-a-taco)))
Hint: your "main.rkt" doesn’t have a lexer. It just treats each character in the source file as a token. So your grammar has to consume these tokens.
Hint: hopefully you didn’t forget what you learned about the notation while you were making tacopocalypse.
You’re ready for the boss battle.