This is a summary of the major steps in creating a new language in Racket. This recipe is not “master” in the sense of “comprehensive”. Rather, it’s a core process that you can customize as you wish—except for a few points that are non-negotiable, which are signaled with the word must.
This recipe assumes familiarity with Racket’s language-making workflow and terminology. If you want to learn about that workflow, start with the tutorials, not here.
Let’s assume your new language is called dsl. The rest of the recipe will use #lang dsl as an example. But the process is analogous for any #lang dsl/dialect.
The pseudocode uses #lang br. But you can use any Racket language to implement your language, as long as your code follows the other requirements set out below.
When working on a language project, it’s convenient to be able to refer to modules within the package by global names like dsl/module-name, and on the #lang line as #lang dsl. To do this, start by installing your language project as a package.
Within your path/to/source, create a project directory for the language called dsl. Switch into the new directory and use raco pkg install to add it to your local Racket installation as a package.
    > cd path/to/source
    > mkdir dsl
    > cd dsl
    > raco pkg install
Alternatively: at the command line, switch to your path/to/source and use raco pkg new dsl, which will create the dsl subdirectory and stub out other files that will be useful in the package (including documentation and configuration).
    > cd path/to/source
    > raco pkg new dsl
    > cd dsl
    > raco pkg install
The boot module is the first module Racket loads when running a program in your language.
If your language is invoked as #lang dsl, its boot module will be "dsl/main.rkt".
If your language is invoked as #lang dsl/dialect, its boot module will be "dsl/dialect.rkt". (Caution: the boot module will not be "dsl/dialect/main.rkt".)
A language can have any number of dialects living in the same project directory, with the boot modules following this naming convention.
As the first step in running any program, Racket invokes the reader for the language. The reader converts source code (= a string of characters in a source file) into S-expressions (= Racket-style syntactic forms).
To start the reader, Racket calls the read-syntax function for the language. Racket looks for this function in the reader submodule of the boot module of the #lang, which must provide it. For instance:
    #lang br
    (module reader br
      (provide read-syntax)
      ···)
read-syntax can either be defined within the reader submodule:
    #lang br
    (module reader br
      (provide read-syntax)
      (define (read-syntax name port)
        ···))
Or imported from elsewhere:
    #lang br
    (module reader br
      (provide read-syntax)
      (require module/that/exports/read-syntax))
read-syntax must accept two input arguments: a source name, which holds the location of the source (e.g., for file input, the name would be a path), and an input port, which supplies the source data.

read-syntax should read its data from the port. (In principle, it’s possible to read directly from the source name when it’s a path. But this is unreliable, because it assumes the input source is a file, which may not be the case. By contrast, the port argument always contains the source data, regardless of the underlying input type.)

read-syntax should consume all the data from the port (that is, until the port returns eof). In pseudocode:

    #lang br
    (module reader br
      (provide read-syntax)
      (define (read-syntax name port)
        (define s-exprs (read-code-from port))
        ···))
read-syntax must return one value: code for a module expression, represented as a syntax object. Typically, the converted S-expressions from the source file are inserted into this syntax object. This syntax object must have no identifier bindings. This module code must include a reference to the expander that will provide the initial set of bindings when the module code is evaluated. In pseudocode:
    #lang br
    (module reader br
      (provide read-syntax)
      (define (read-syntax name port)
        (define s-exprs (read-code-from port))
        (strip-bindings
         #`(module dsl-mod-name dsl/expander
             #,@s-exprs))))
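For readers who want something runnable outside of #lang br, here is a minimal sketch of the same idea in plain racket/base. It is an illustration under stated assumptions, not the recipe’s required form: strip-context from syntax/strip-context stands in for strip-bindings, and the choice to treat each source line as a string datum is purely hypothetical.

```racket
#lang racket/base
(require racket/port            ; for port->lines
         syntax/strip-context)  ; for strip-context

;; Hypothetical reader: every line of the source becomes one string datum.
(define (read-syntax name port)
  ;; port->lines reads until the port returns eof, so all input is consumed
  (define s-exprs (port->lines port))
  ;; return one module expression, as a syntax object with no bindings
  (strip-context
   #`(module dsl-mod-name dsl/expander
       #,@s-exprs)))

;; Try it on a string port instead of a file: the reader cannot tell the difference.
(define example-stx
  (read-syntax "example-source" (open-input-string "hello\nworld")))
(syntax->datum example-stx)
```

Because the result is only constructed here and never evaluated, dsl/expander does not need to exist for this sketch to run.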
Racket takes the module expression returned from read-syntax and uses it to replace the code in the source file. So a source file that looks like this:

    #lang dsl
    dsl source code;
    ···

is treated as if it contained the module expression that read-syntax returned:

    (module dsl-mod-name dsl/expander
      dsl-sexprs ···)

Evaluation of this module expression continues from here, starting by importing bindings from the expander.
It’s common, but not mandatory, for read-syntax to rely on two helper functions: a parser and a tokenizer.
The tokenizer reads characters from the input port and converts them to tokens, which are the smallest meaningful units of source code.
You can write a tokenizer by hand. You can also use a helper function called a lexer that uses regexp-style rules to convert source code into tokens.
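As one possible illustration of the lexer approach, the sketch below uses the lexer form from parser-tools/lex, which ships with the main Racket distribution. The token rules (integers and a + sign) and the names dsl-lexer and tokenize are hypothetical, chosen only to keep the example small.

```racket
#lang racket/base
(require parser-tools/lex
         (prefix-in : parser-tools/lex-sre))

;; Each rule pairs a regexp-style pattern with the token to produce.
(define dsl-lexer
  (lexer
   [(eof) eof]                                ; end of input
   [whitespace (dsl-lexer input-port)]        ; skip whitespace, lex the next token
   [(:+ numeric) (string->number lexeme)]     ; one or more digits become a number
   ["+" '+]))                                 ; a plus sign becomes the symbol +

;; Call the lexer repeatedly until it reports eof, collecting the tokens.
(define (tokenize port)
  (for/list ([tok (in-producer (lambda () (dsl-lexer port)) eof-object?)])
    tok))

(tokenize (open-input-string "12 + 3"))
```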
These tokens are passed to the parser, which arranges them into a hierarchical S-expression called a parse tree.
Together, in pseudocode:
    #lang br
    (module reader br
      (require module/that/provides/parse
               module/that/provides/tokenize)
      (provide read-syntax)
      (define (read-syntax name port)
        (define the-parse-tree (parse (tokenize port)))
        (strip-bindings
         #`(module dsl-mod-name dsl/expander
             #,the-parse-tree))))
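To make the division of labor concrete, here is a small self-contained sketch in racket/base with a hand-written tokenizer and a toy parser. Everything about the language it reads is an assumption made for illustration: the input is integers separated by + signs, and the node names dsl-program and dsl-sum are invented. A real parser would also validate the token order, which this one does not.

```racket
#lang racket/base
(require syntax/strip-context)

;; Hand-written tokenizer (hypothetical): integers and `+`, whitespace skipped.
(define (tokenize port)
  (let loop ([tokens '()])
    (cond
      [(regexp-try-match #px"^\\s+" port) (loop tokens)]
      [(regexp-try-match #px"^[0-9]+" port)
       => (lambda (m)
            (loop (cons (string->number (bytes->string/utf-8 (car m)))
                        tokens)))]
      [(regexp-try-match #rx"^[+]" port) (loop (cons '+ tokens))]
      [(eof-object? (peek-char port)) (reverse tokens)]
      [else (error 'tokenize "unexpected character: ~a" (read-char port))])))

;; Toy parser (hypothetical): keeps the numbers and nests them under named nodes.
(define (parse tokens)
  `(dsl-program (dsl-sum ,@(filter number? tokens))))

;; The reader wires the two helpers together.
(define (read-syntax name port)
  (define the-parse-tree (parse (tokenize port)))
  (strip-context
   #`(module dsl-mod-name dsl/expander
       #,the-parse-tree)))

(syntax->datum (read-syntax "example-source" (open-input-string "1 + 22 + 3")))
```

The tokenizer only decides where one token ends and the next begins; the parser alone decides how tokens nest. Keeping those two jobs apart makes it possible to swap either helper without touching read-syntax.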
Once the reader finishes, the expander starts. At the end of the reader phase, the source code has been converted into a module expression that contains S-expressions, but has no bindings, e.g., (module dsl-mod-name dsl/expander dsl-sexprs ···).
The expander provides the initial set of bindings for the module expression returned by the reader, thereby determining the meaning of the identifiers within the S-expressions. This, in turn, allows the expressions to be evaluated as Racket code.
Any module (that meets the other requirements below) can be used as the expander for a language. It’s invoked by the first line of the module code returned by the reader. For instance, a module expression that begins (module dsl-mod-name dsl/expander ···) will rely on dsl/expander as its expander.
In essence, the expander name is like an implied require at the beginning of the module.
Most Racket languages have their own custom expander module. But it’s not mandatory.
Every expander must provide a #%module-begin macro, which will be the first thing invoked in the expander. #%module-begin must accept as input all the expressions that appear in the body of the module expression that read-syntax makes. Therefore, it’s not a bad idea to build your #%module-begin around a syntax pattern that accepts any number of input arguments.
To evaluate the module expression returned by the reader, Racket imports the #%module-begin from the expander specified in the module expression. It wraps the body of the module expression in a call to this #%module-begin, passing it all the parsed expressions that are in the body. So this:

    (module dsl-mod-name dsl/expander
      dsl-sexprs ···)

becomes this:

    (module dsl-mod-name dsl/expander
      (#%module-begin ;; imported from `dsl/expander`
        dsl-sexprs ···))
Often, the #%module-begin for a language will perform some language-specific processing on the parse tree, and call the #%module-begin in the implementation language. To prevent namespace collisions between the two #%module-begin macros, use rename-out. In pseudocode:
    #lang br
    (define-macro (dsl-module-begin PARSED-EXPR ...)
      #'(#%module-begin ;; from `br`
         PARSED-EXPR ...))
    (provide (rename-out [dsl-module-begin #%module-begin]))
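The same pattern can be tried in a single file with plain racket/base, which may be handy for experimenting before a package is installed. In this sketch the expander lives in a submodule and a second submodule plays the role of the module expression a reader would return. The behavior of dsl-module-begin (collecting the values of all body expressions into a list named result) is a hypothetical choice made for illustration.

```racket
#lang racket/base

;; Hypothetical expander, declared as a submodule so the sketch fits in one file.
(module expander racket/base
  (require (for-syntax racket/base))
  (provide #%datum   ; lets literal numbers appear in the DSL module
           (rename-out [dsl-module-begin #%module-begin]))
  (define-syntax (dsl-module-begin stx)
    (syntax-case stx ()
      ;; the ellipsis pattern accepts any number of body expressions
      [(_ parsed-expr ...)
       ;; the #%module-begin in this template is the one from racket/base
       #'(#%module-begin
          (define result (list parsed-expr ...))
          (provide result))])))

;; Stand-in for the module expression returned by a reader.
(module dsl-mod-name (submod ".." expander)
  1 2 3)

(require (submod "." dsl-mod-name))
result
```

Inside the expander submodule, the name #%module-begin still refers to the racket/base version, while modules that use the expander see dsl-module-begin under that name. That is the collision rename-out avoids.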
Optionally, an expander can provide certain interposition points:
#%top-interaction is used to activate the REPL.
#%app adds support for function calls.
#%datum adds support for self-evaluating values (like numbers and strings).
#%top adds support for references to unbound (missing) identifiers.
The br/quicklang dialect automatically exports these macros.
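As a rough illustration of one interposition point, the sketch below shows an expander that supplies its own #%top. The policy it implements, where an unbound identifier evaluates to its own name as a symbol, is a hypothetical choice and not something the recipe prescribes. The expander here reuses the #%module-begin from racket/base and deliberately exports very little else.

```racket
#lang racket/base

;; Hypothetical expander: exports only a handful of bindings plus a custom #%top.
(module expander racket/base
  (require (for-syntax racket/base))
  (provide #%module-begin define provide
           (rename-out [dsl-top #%top]))
  ;; Racket wraps each unbound identifier reference as (#%top . id).
  ;; Here that reference is turned into a quoted symbol instead of an error.
  (define-syntax (dsl-top stx)
    (syntax-case stx ()
      [(_ . id) #''id])))

;; Stand-in for the module expression returned by a reader.
(module dsl-mod-name (submod ".." expander)
  (provide greeting)
  (define greeting hello)) ; `hello` has no binding, so the custom #%top handles it

(require (submod "." dsl-mod-name))
greeting
```

Note that names such as list or + would also count as unbound inside dsl-mod-name, because this expander does not export them. The expander alone determines which identifiers have meaning.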
Writing unit tests.
Integrating the language with DrRacket, including syntax coloring and indenting.
Adding Scribble documentation.
Adding an "info.rkt" file.
Making the language available through the Racket package server.