Before we start writing the parser, let’s remind ourselves of the possible tokens we could get from the lexer via the tokenizer:
A NEWLINE token, which holds "\n".
A REM token, which holds a line comment that starts with "rem".
An INTEGER or DECIMAL token, which holds a number.
A STRING token, which holds a string.
A statement name, specifically "print", "goto", or "end".
A "+" operator.
A ":" statement separator.
A ";" print-item separator.
Using these tokens, and the ground rules laid out in specification and setup, we can reason out a grammar for BASIC. Of course, we’ll be using #lang brag once again to generate a parser from that grammar.
Because many elements of BASIC have the same names as things in Racket, we’ll prefix our production rules with b- for clarity. This convention isn’t required by #lang brag—it’ll just help keep things readable, especially once we get into the expander.
Our first rule is b-program, which will become the root node of our parse tree. Every BASIC program is made of lines separated by NEWLINE tokens, which we could express like so:
1 2 3 | #lang brag b-program : b-line (NEWLINE b-line)* b-line : |
The * quantifier means “zero or more repetitions” of the preceding group, in this case the parenthesized subpattern NEWLINE b-line.
This rule will work with our sample program:
1 2 3 4 5 6 7 8 9 | #lang basic 30 rem print 'ignored' 35 50 print "never gets here" 40 end 60 print 'three' : print 1.0 + 3 70 goto 11. + 18.5 + .5 rem ignored 10 print "o" ; "n" ; "e" 20 print : goto 60.0 : end |
But as language designers, we want to be alert to the possibility of other mischief, like a program with nothing in it:
1 | #lang basic |
Or a program with nothing but newlines:
1 2 3 4 5 | #lang basic
|
Or a program with extra newlines scattered around:
1 2 3 4 5 6 7 8 | #lang basic 10 print "hello world" 20 goto 10 |
There’s no reason these programs shouldn’t run—the obvious interpretation is “ignore the newlines and continue as usual”. But if our parser rule for b-program isn’t sufficiently lenient, then these oddball programs will raise parsing errors. + An application of Postel’s Law.
So let’s revise the rule. We can make each b-line element optional by putting it in square brackets:
1 2 3 | #lang brag
b-program : [b-line] (NEWLINE [b-line])*
b-line :
|
Reading the rule, we should be able to convince ourselves that it will handle a range of edge cases without violating the key requirement, which is that multiple lines must be separated by at least one newline. (Of course, we could & should write unit tests to be sure.)
Moving on to the grammar for each individual line. A line has a line number, which is an integer, followed by zero statements, one statement, or multiple statements separated by :. If we also make separate rules for b-line-num and b-statement, we can write our b-line rule with three cases:
1 2 3 4 5 6 7 | #lang brag b-program : [b-line] (NEWLINE [b-line])* b-line : b-line-num | b-line-num b-statement | b-line-num b-statement (":" b-statement)+ b-line-num : INTEGER b-statement : |
But we can also combine the three cases into one by using a * rather than a + quantifier on the colon-separated statements, and putting brackets around the statements, to represent an optional match:
1 2 3 4 5 | #lang brag
b-program : [b-line] (NEWLINE [b-line])*
b-line : b-line-num [b-statement (":" b-statement)*]
b-line-num : INTEGER
b-statement :
|
Not bad. And it would work with our sample program. But having just learned about the importance of lenient rules, we should ask ourselves what should happen in edge cases like these:
1 2 3 | 10 print "foo" : 20 : print "bar" : : 30 ::: |
There’s no reason these lines shouldn’t work. So let’s make our line grammar less finicky, and just skip over blank statement separators. We can do this the same way we did in the b-program rule—by putting square brackets around each b-statement to make them optional:
1 2 3 4 5 | #lang brag
b-program : [b-line] (NEWLINE [b-line])*
b-line : b-line-num [b-statement] (":" [b-statement])*
b-line-num : INTEGER
b-statement :
|
That rule implies that the only thing that must be in a line is a line number.
A rem statement can form the tail of any line. We add a b-rem rule that only matches our special REM token. Then we make b-rem an optional match at the end of any b-line:
1 2 3 4 5 6 | #lang brag b-program : [b-line] (NEWLINE [b-line])* b-line : b-line-num [b-statement] (":" [b-statement])* [b-rem] b-line-num : INTEGER b-statement : b-rem : REM |
We make three rules to cover each of our remaining statements: b-end, b-print, and b-goto. Then, the rule for b-statement simply chooses among these possibilities:
1 2 3 4 5 6 7 8 9 | #lang brag b-program : [b-line] (NEWLINE [b-line])* b-line : b-line-num [b-statement] (":" [b-statement])* [b-rem] b-line-num : INTEGER b-statement : b-end | b-print | b-goto b-rem : REM b-end : b-print : b-goto : |
The b-end rule only matches a literal "end":
1 2 3 4 5 6 7 8 9 | #lang brag
b-program : [b-line] (NEWLINE [b-line])*
b-line : b-line-num [b-statement] (":" [b-statement])* [b-rem]
b-line-num : INTEGER
b-statement : b-end | b-print | b-goto
b-rem : REM
b-end : "end"
b-print :
b-goto :
|
We remember from specification and setup that the print statement is made of a literal print token, and then any number of string or numerical expressions as input, separated by ;. To express the possible input, we introduce a helper rule called b-printable that’s a choice between a STRING token and a new b-expr (which we’ll figure out momentarily). Then, in the b-print rule, we follow the approach we used in b-line, expressing b-print as an optional b-printable, followed by any number of optional b-printable arguments with ; separators:
1 2 3 4 5 6 7 8 9 10 11 | #lang brag b-program : [b-line] (NEWLINE [b-line])* b-line : b-line-num [b-statement] (":" [b-statement])* [b-rem] b-line-num : INTEGER b-statement : b-end | b-print | b-goto b-rem : REM b-end : "end" b-print : "print" [b-printable] (";" [b-printable])* b-printable : STRING | b-expr b-goto : b-expr : |
Finally, the b-goto consists of the literal "goto" token and a numerical expression:
1 2 3 4 5 6 7 8 9 10 11 | #lang brag
b-program : [b-line] (NEWLINE [b-line])*
b-line : b-line-num [b-statement] (":" [b-statement])* [b-rem]
b-line-num : INTEGER
b-statement : b-end | b-print | b-goto
b-rem : REM
b-end : "end"
b-print : "print" [b-printable] (";" [b-printable])*
b-printable : STRING | b-expr
b-goto : "goto" b-expr
b-expr :
|
All that’s left is to sort out numbers and numerical expressions. The only numbers we have are integers and decimals—already represented by the INTEGER and DECIMAL tokens—and the only numerical expression we have to worry about is a sum:
1 2 3 4 5 6 7 8 9 10 11 12 13 | #lang brag b-program : [b-line] (NEWLINE [b-line])* b-line : b-line-num [b-statement] (":" [b-statement])* [b-rem] b-line-num : INTEGER b-statement : b-end | b-print | b-goto b-rem : REM b-end : "end" b-print : "print" [b-printable] (";" [b-printable])* b-printable : STRING | b-expr b-goto : "goto" b-expr b-expr : b-sum : b-number : INTEGER | DECIMAL |
Let’s do b-sum first. A sum is any series of numbers with + signs in between. Following the pattern we figured out for b-program and b-line, we can write the rule like so:
1 2 3 4 5 6 7 8 9 10 11 12 13 | #lang brag
b-program : [b-line] (NEWLINE [b-line])*
b-line : b-line-num [b-statement] (":" [b-statement])* [b-rem]
b-line-num : INTEGER
b-statement : b-end | b-print | b-goto
b-rem : REM
b-end : "end"
b-print : "print" [b-printable] (";" [b-printable])*
b-printable : STRING | b-expr
b-goto : "goto" b-expr
b-expr :
b-sum : b-number ("+" b-number)*
b-number : INTEGER | DECIMAL
|
This time, however, we’re not going to be lenient about input, because an expression like + + + or 4 + + 5 has no natural meaning as a sum. Instead, we leave it alone, so it will raise a parsing error.
Now to finish b-expr. It would be fine to make this rule a choice between a b-sum and a b-number. But a b-sum can itself be a single b-number. So it suffices to make b-expr a b-sum:
1 2 3 4 5 6 7 8 9 10 11 12 13 | #lang brag
b-program : [b-line] (NEWLINE [b-line])*
b-line : b-line-num [b-statement] (":" [b-statement])* [b-rem]
b-line-num : INTEGER
b-statement : b-end | b-print | b-goto
b-rem : REM
b-end : "end"
b-print : "print" [b-printable] (";" [b-printable])*
b-printable : STRING | b-expr
b-goto : "goto" b-expr
b-expr : b-sum
b-sum : b-number ("+" b-number)*
b-number : INTEGER | DECIMAL
|
The interposition of b-expr may seem unnecessary now. But in the expander, we’ll see why it’s useful—we can use it as a “gateway” through which all our numerical expressions have to pass (and smooth over the differences between Racket math and BASIC math). It also means that other elements, like b-print, only need to work with b-expr, rather than having to separately deal with other elements like b-sum, b-number, etc.
In terms of implementing the rules in specification and setup, our grammar is done. Our parser will be able to handle all valid programs, including some weird ones.
Let’s make sure our parser works as expected. First, let’s try this small program:
1 2 3 | 10 print "hello" : print "world" 20 goto 9 + 10 + 11 30 end |
We can check the parse tree by setting up a little test file:
1 2 3 4 5 6 7 8 9 10 11 | #lang br (require basic/parser basic/tokenizer brag/support) (define str #<<HERE 10 print "hello" : print "world" 20 goto 9 + 10 + 11 30 end HERE ) (parse-to-datum (apply-tokenizer make-tokenizer str)) |
The result looks like this:
1 2 3 4 5 6 | '(b-program (b-line (b-line-num 10) (b-statement (b-print "print" (b-printable "hello"))) ":" (b-statement (b-print "print" (b-printable "world")))) "\n" (b-line (b-line-num 20) (b-statement (b-goto "goto" (b-expr (b-sum (b-number 9) "+" (b-number 10) "+" (b-number 11)))))) "\n" (b-line (b-line-num 30) (b-statement (b-end "end")))) |
As usual, we see that the parse tree contains every token that the parser consumed. The first element of each parenthesized node is the name of the rule in the grammar that was used to create the node. The remaining elements are the tokens that were matched to the pattern on the right-hand side of that rule. Named tokens like STRING and INTEGER are represented by their actual token values.
(Those interested in making unit tests—this would be a great moment to start copying out these test expressions & their results, to be combined into tests using check-equal?.)
Let’s also make sure our parser works on our original sample program:
1 2 3 4 5 6 7 8 9 | #lang basic 30 rem print 'ignored' 35 50 print "never gets here" 40 end 60 print 'three' : print 1.0 + 3 70 goto 11. + 18.5 + .5 rem ignored 10 print "o" ; "n" ; "e" 20 print : goto 60.0 : end |
We can move the source code into our testing file:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | #lang br (require basic/parser basic/tokenizer brag/support) (define str #<<HERE 30 rem print 'ignored' 35 50 print "never gets here" 40 end 60 print 'three' : print 1.0 + 3 70 goto 11. + 18.5 + .5 rem ignored 10 print "o" ; "n" ; "e" 20 print : goto 60.0 : end HERE ) (parse-to-datum (apply-tokenizer make-tokenizer str)) |
This time, the result looks like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | '(b-program (b-line (b-line-num 30) (b-rem "rem print 'ignored'")) "\n" (b-line (b-line-num 35)) "\n" (b-line (b-line-num 50) (b-statement (b-print "print" (b-printable "never gets here")))) "\n" (b-line (b-line-num 40) (b-statement (b-end "end"))) "\n" (b-line (b-line-num 60) (b-statement (b-print "print" (b-printable "three"))) ":" (b-statement (b-print "print" (b-printable (b-expr (b-sum (b-number 1.0) "+" (b-number 3))))))) "\n" (b-line (b-line-num 70) (b-statement (b-goto "goto" (b-expr (b-sum (b-number 11.0) "+" (b-number 18.5) "+" (b-number 0.5))))) (b-rem "rem ignored")) "\n" (b-line (b-line-num 10) (b-statement (b-print "print" (b-printable "o") ";" (b-printable "n") ";" (b-printable "e")))) "\n" (b-line (b-line-num 20) (b-statement (b-print "print")) ":" (b-statement (b-goto "goto" (b-expr (b-sum (b-number 60.0))))) ":" (b-statement (b-end "end")))) |
This is correct. Admittedly, it’s hard to know this for sure unless we were to manually pencil out the parse, which for a larger program is a drag. For unit tests, it’s simpler to use one- or two-line sample programs, where it’s easy to see how the source will be converted into a parse tree. But this proves, at least, that our sample program doesn’t trigger any errors from the tokenizer or the parser.
“Why are we copying source code into another file? Why not just make a dialect called basic/parse-only that reads the source & returns the parse tree?”
Ah, what a wonderful idea. And we now have the skills to do it.
Let’s confirm the mission objectives. We want to be able to invoke this new basic/parse-only dialect like so:
1 2 3 4 | #lang basic/parse-only 10 print "hello" : print "world" 20 goto 9 + 10 + 11 30 end |
And get its parse tree as the result:
1 2 3 4 5 6 7 8 | '(b-program "\n" (b-line (b-line-num 10) (b-statement (b-print "print" (b-printable "hello"))) ":" (b-statement (b-print "print" (b-printable "world")))) "\n" (b-line (b-line-num 20) (b-statement (b-goto "goto" (b-expr (b-sum (b-number 9) "+" (b-number 10) "+" (b-number 11)))))) "\n" (b-line (b-line-num 30) (b-statement (b-end "end"))) "\n") |
This should be easy now. To make #lang basic/parse-only, we create a "parse-only.rkt" module and fit it with a simple reader and expander:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | #lang br/quicklang (require "parser.rkt" "tokenizer.rkt") (define (read-syntax path port) (define parse-tree (parse path (make-tokenizer port path))) (strip-bindings #`(module basic-parser-mod basic/parse-only #,parse-tree))) (module+ reader (provide read-syntax)) (define-macro (parser-only-mb PARSE-TREE) #'(#%module-begin 'PARSE-TREE)) (provide (rename-out [parser-only-mb #%module-begin])) |
Here, our read-syntax function takes a path and port and converts it into a module expression. (Recall that our improved make-tokenizer function takes both the port and path as input, so it can calculate source locations correctly.) The expander we use in the module expression is the same file, basic/parse-only. We then provide our read-syntax from a reader submodule.
The expander is likewise simple: just a parse-only-mb that will be renamed on export as #%module-begin. All this macro does is put a quote prefix on our PARSE-TREE.
Now we can run our test program with our new interpreter:
1 2 3 4 | #lang basic/parse-only 10 print "hello" : print "world" 20 goto 9 + 10 + 11 30 end |
Sure enough, we get the parse tree:
1 2 3 4 5 6 7 8 | '(b-program "\n" (b-line (b-line-num 10) (b-statement (b-print "print" (b-printable "hello"))) ":" (b-statement (b-print "print" (b-printable "world")))) "\n" (b-line (b-line-num 20) (b-statement (b-goto "goto" (b-expr (b-sum (b-number 9) "+" (b-number 10) "+" (b-number 11)))))) "\n" (b-line (b-line-num 30) (b-statement (b-end "end"))) "\n") |
What’s nice about using a dialect here is that it’s easy to switch back to running the file normally—we simply adjust the #lang line at the top (using basic-demo in this example since we haven’t made our own basic expander yet): + What’s also interesting is the idea that a single source file can have different results depending on the interpreter that runs it.
1 2 3 4 | #lang basic-demo 10 print "hello" : print "world" 20 goto 9 + 10 + 11 30 end |
1 2 | hello world |
Likewise, it’s easy to make a similar basic/tokenize-only dialect that reveals the output from the tokenizer:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | #lang br/quicklang (require brag/support "tokenizer.rkt") (define (read-syntax path port) (define tokens (apply-tokenizer make-tokenizer port)) (strip-bindings #`(module basic-tokens-mod basic/tokenize-only #,@tokens))) (module+ reader (provide read-syntax)) (define-macro (tokenize-only-mb TOKEN ...) #'(#%module-begin (list TOKEN ...))) (provide (rename-out [tokenize-only-mb #%module-begin])) |
And invoke it like so:
1 2 3 4 | #lang basic/tokenize-only 10 print "hello" : print "world" 20 goto 9 + 10 + 11 30 end |
1 2 3 4 5 6 | (list (srcloc-token (token-struct 'NEWLINE "\n" #f #f #f #f #f) (srcloc 'unsaved-editor 1 28 29 1)) (srcloc-token (token-struct 'INTEGER 10 #f #f #f #f #f) (srcloc 'unsaved-editor 2 0 30 2)) (srcloc-token (token-struct '| | #f #f #f #f #f #t) (srcloc 'unsaved-editor 2 2 32 1)) (srcloc-token (token-struct 'print "print" #f #f #f #f #f) (srcloc 'unsaved-editor 3 3 36 5)) ··· |
The point here is that once we know how to create languages quickly—at this point, we can make one with a few lines of code—they become a convenient tool even for small jobs.
Before we move on, let’s pause to enjoy the benefits of tracking source locations using lexer-srcloc instead of lexer. Let’s return to our "test.rkt" file:
1 2 3 4 | #lang basic/parse-only 10 print "hello" : print "world" 20 goto 9 + 10 + 11 30 end |
First, let’s induce a lexing error. A lexing error will occur if the lexer encounters a character that it can’t match with any of its lexing rules. We add a bogus line 25 xyzzy to our test program and run it again:
1 2 3 4 5 | #lang basic/parse-only 10 print "hello" : print "world" 20 goto 9 + 10 + 11 25 xyzzy 30 end |
We get an error in the REPL from the lexer. But DrRacket will also highlight the first character that failed to lex, and move the insertion point there:
The lexer only reports the x as an error—not the whole xyzzy—because it stops as soon as it can’t match any rule, and going further would be futile. We can induce a slightly different lexing error by changing line 25 to prinxyzzy:
1 2 3 4 5 | #lang basic/parse-only 10 print "hello" : print "world" 20 goto 9 + 10 + 11 25 prinxyzzy 30 end |
This time, DrRacket highlights prinx because the lexer got as far as matching prin—the first four characters of the valid token print—before things went sideways.
Let’s now induce a parsing error. To trigger a parsing error, our program needs to lex correctly (that is, it has to be made of valid elements) but some element has to be in a place that the parser grammar doesn’t expect to find it. For instance, let’s change line 25 to print goto:
1 2 3 4 5 | #lang basic/parse-only 10 print "hello" : print "world" 20 goto 9 + 10 + 11 25 print goto 30 end |
This time, we get a parsing error, with the offending goto highlighted:
By the way, if a program contains both a lexing error and a parsing error, the lexing error will be highlighted first—regardless of where in the program it occurs—because the lexer runs before the parser.
Feel free to noodle around with "test.rkt" and induce other errors. What we should notice is that we can start benefiting from DrRacket integration even before we’ve started the expander. Moreover, all we had to do to get error highlighting was use lexer-srcloc. Our lexer, tokenizer, and parser cooperated with DrRacket to do the heavy lifting.
Though our parser works, we’d like to make our parse tree tidier.
By default, every matched token shows up in the parse tree, and then the whole parse tree is passed to the expander. The problem is that our parse tree will end up with a bunch of tokens that were only needed to complete the parsing. Once they’ve served their purpose, it would be nice to filter them out, so that our expander can deal with simpler, cleaner input.
For instance, let’s take a closer look at the parse tree for our shorter sample program (which we can generate with our new #lang basic/parse-only dialect):
1 2 3 4 | #lang basic/parse-only 10 print "hello" : print "world" 20 goto 9 + 10 + 11 30 end |
1 2 3 4 5 6 7 | '(b-program "\n" (b-line (b-line-num 10) (b-statement (b-print "print" (b-printable "hello"))) ":" (b-statement (b-print "print" (b-printable "world")))) "\n" (b-line (b-line-num 20) (b-statement (b-goto "goto" (b-expr (b-sum (b-number 9) "+" (b-number 10) "+" (b-number 11)))))) "\n" (b-line (b-line-num 30) (b-statement (b-end "end")))) |
We could work with this parse tree if we had to. But it contains a lot of superfluous elements that don’t affect the meaning of the program:
The "\n" characters that separated the lines are unnecessary once the lines have been parsed.
The ":" characters separating multiple statements in a line are unnecessary once the statements have been parsed.
The "+" signs in a sum are unnecessary once the sum has been parsed.
The statement names—"print", "goto", "end"—are all unnecessary, because we’ve matched those statements to unique rules in the grammar.
Several of the rules—b-line-num, b-statement, b-number, b-printable—helped make our grammar more readable, but don’t serve any semantic purpose in the parse tree. They just wrap their arguments. Those arguments could be spliced into their surrounding nodes.
If we omitted the unnecessary elements, and spliced the elements from inside the b-line-num, b-statement, and b-number nodes, our parse tree would end up like this:
1 2 3 4 | '(b-program (b-line 10 (b-print "hello") (b-print "world")) (b-line 20 (b-goto (b-expr (b-sum 9 10 11)))) (b-line 30 (b-end))) |
How can we get there? The brag language lets us add cuts and splices to our grammar to simplify the parse tree. Using cuts and splices, we can improve our grammar so that it automatically produces this simplified version of the parse tree. This, in turn, will simplify the work we do in the expander, because we don’t have to handle a bunch of extraneous input.
A cut in a brag grammar will delete the item from the parse tree. A cut is notated by adding a slash / prefix to either the rule name (appearing on the left side of the rule) or a pattern element (appearing on the right). If the cut is applied to a rule name, the rule name disappears from the parse tree, but its node and its matched elements remain. If the cut is applied to a pattern element, that element disappears entirely from nodes matching that rule in the parse tree.
In this case, we only need to use cuts on the pattern elements. We want to get rid of the NEWLINE in the b-program rule, the ":" in the b-line rule, the ";" in the b-print rule, the "+" in the b-sum rule, and the statement names "end", "print", and "goto". We do this by prefixing each of these pattern elements with /:
1 2 3 4 5 6 7 8 9 10 11 12 13 | #lang brag b-program : [b-line] (/NEWLINE [b-line])* b-line : b-line-num [b-statement] (/":" [b-statement])* [b-rem] b-line-num : INTEGER b-statement : b-end | b-print | b-goto b-rem : REM b-end : /"end" b-print : /"print" [b-printable] (/";" [b-printable])* b-printable : STRING | b-expr b-goto : /"goto" b-expr b-expr : b-sum b-sum : b-number (/"+" b-number)* b-number : INTEGER | DECIMAL |
Now, if we generate our sample parse tree again:
1 2 3 4 | #lang basic/parse-only 10 print "hello" : print "world" 20 goto 9 + 10 + 11 30 end |
We can see that the cut elements are gone:
1 2 3 4 5 6 7 | '(b-program (b-line (b-line-num 10) (b-statement (b-print (b-printable "hello"))) (b-statement (b-print (b-printable "world")))) (b-line (b-line-num 20) (b-statement (b-goto (b-expr (b-sum (b-number 9) (b-number 10) (b-number 11)))))) (b-line (b-line-num 30) (b-statement (b-end)))) |
A splice in a brag grammar will merge the elements of a node into the surrounding node. A splice is notated by prefixing either the rule name or a pattern element with an at sign @. As with cuts, if the splice is applied to a pattern element, that element is spliced only within that rule. If the splice is applied to a rule name, then the splice is applied every time the rule appears in the grammar.
Here, we want to splice instances of b-line-num, b-statement, b-printable, and b-number. Since we want these splices to be global—that is, happen every time the rule appears—we apply the splices to the rule names by prefixing them with @:
1 2 3 4 5 6 7 8 9 10 11 12 13 | #lang brag b-program : [b-line] (/NEWLINE [b-line])* b-line : b-line-num [b-statement] (/":" [b-statement])* [b-rem] @b-line-num : INTEGER @b-statement : b-end | b-print | b-goto b-rem : REM b-end : /"end" b-print : /"print" [b-printable] (/";" [b-printable])* @b-printable : STRING | b-expr b-goto : /"goto" b-expr b-expr : b-sum b-sum : b-number (/"+" b-number)* @b-number : INTEGER | DECIMAL |
This time, when we generate our sample parse tree:
1 2 3 4 | #lang basic/parse-only 10 print "hello" : print "world" 20 goto 9 + 10 + 11 30 end |
We get the tidy parse tree we were originally aiming for:
1 2 3 4 | '(b-program (b-line 10 (b-print "hello") (b-print "world")) (b-line 20 (b-goto (b-expr (b-sum 9 10 11)))) (b-line 30 (b-end))) |
We can also try parsing our larger sample program:
1 2 3 4 5 6 7 8 9 | #lang basic/parse-only 30 rem print 'ignored' 35 50 print "never gets here" 40 end 60 print 'three' : print 1.0 + 3 70 goto 11. + 18.5 + .5 rem ignored 10 print "o" ; "n" ; "e" 20 print : goto 60.0 : end |
With similar results (notice in particular how close we’re already getting to the pseudocode for our implementation):
1 2 3 4 5 6 7 8 9 | '(b-program (b-line 30 (b-rem "rem print 'ignored'")) (b-line 35) (b-line 50 (b-print "never gets here")) (b-line 40 (b-end)) (b-line 60 (b-print "three") (b-print (b-expr (b-sum 1.0 3)))) (b-line 70 (b-goto (b-expr (b-sum 11.0 18.5 0.5))) (b-rem "rem ignored")) (b-line 10 (b-print "o" "n" "e")) (b-line 20 (b-print) (b-goto (b-expr (b-sum 60.0))) (b-end))) |
Cuts and splices are strictly a convenience. We’re not required to use them. And they’re not the only way of pruning a parse tree. We could do the same work in the expander, sorting through the raw parse tree when it arrives and tossing out what we don’t need.
But often, that amounts to writing a function that re-parses the parse tree. Given that we’re already making a parser, it makes sense to consolidate the housekeeping here. As we’ve seen, we can easily streamline the parse tree by adding a few / and @ marks.
We are now really, truly done with our parser, and can move on to finishing the reader and then tackling the the expander. The extra effort we invested in the parser will pay dividends then.
1 2 3 4 5 6 7 8 9 10 11 12 13 | #lang brag b-program : [b-line] (/NEWLINE [b-line])* b-line : b-line-num [b-statement] (/":" [b-statement])* [b-rem] @b-line-num : INTEGER @b-statement : b-end | b-print | b-goto b-rem : REM b-end : /"end" b-print : /"print" [b-printable] (/";" [b-printable])* @b-printable : STRING | b-expr b-goto : /"goto" b-expr b-expr : b-sum b-sum : b-number (/"+" b-number)* @b-number : INTEGER | DECIMAL |