# The meta5ix compiler-compiler #

# This 18-line compiler can compile itself to an assembly language for
# the abstract machine in meta5ixrun.py; it can also compile a wide
# range of languages (LL(1) I think) to similar assembly languages.
# It is based on Val Schorre’s META II and on VPRI RN 2010-01,
# “Programming and Programming Languages.” It is shorter than META II
# (18 lines instead of 30) and more readable, uses a simpler abstract
# machine, and supports a wider range of input and output languages,
# while retaining META II’s ruthlessly simple register-machine
# execution model.

# It is best understood, as Long’s ACM Queue article describes META
# II, as “a field-improvised lever”; as Schorre’s 01963 article says
# of META II, it is “not intended as a standard language which
# everyone will use to write compilers”, and indeed it has many
# shortcomings. But it is expressive enough to implement such a
# language.

### Entry point ###

# Although meta5ix is fairly flexible with respect to input ordering,
# the first rule is the entry point of the compiler. Let me show you
# its features:

- program: ["-" name @it ":" terms {return}, "#" [:notin ""]]

# That defines the rule `program` as zero or more rules or comments,
# because `[]` denotes repetition† and `,` separates alternatives‡. A
# rule consists of "-", then a `name` (defined by a `name` rule
# below), ":", and `terms`; a comment consists of "#" followed by zero
# or more characters, implicitly excluding line separators. Schorre
# calls this a “syntax equation”. Juxtaposition in such rules, for
# example where "-" is followed by `name`, denotes concatenation§.

# In the case of parsing a rule, after parsing the name, we emit an
# assembly-language label (the `@`) consisting of the parsed name,
# copied from the input to the output with the magic word `it`; and,
# after parsing the terms (and emitting whatever code is emitted
# during their parsing), we emit the instruction “return”. Thus, a
# call to the label will begin running the code emitted for the
# terms, and when execution reaches its end, the subroutine will
# return.

# Whitespace, including newlines, is implicitly skipped before
# attempting to parse a quoted string such as "-" or ":".

# The output from running this program on itself begins with the
# compilation of the above rule, which is abbreviated as follows:

# program
# many12
#         literal "-"
#         else seq15
#         call name
#         assert
#         ...
#         return

# _______
# † Repetition is Kleene closure.

# ‡ Alternatives are united by addition in a Kleene algebra, but in
# the interest of readability, meta5ix separates them with `,` rather
# than the Kleene-algebra `+`, the BNF `|`, or Schorre’s `/`.

# § Concatenation is multiplication in a Kleene algebra.

### Terms ###

- terms: term ["," {continue $choice} term] @choice

# This defines `terms` as one or more `term`s separated by commas,
# which, as explained above, separate alternatives. The two
# occurrences of `choice` denote a locally-scoped identifier beginning
# with `choice`, unique to this invocation of this rule; it is used
# once as a target for the “continue” conditional jump instruction
# emitted after each comma, and once as a label after the last
# alternative. This way, once any alternative succeeds, execution
# will continue after the end of the list of alternatives.
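
# For concreteness, the code that `terms` arranges for a rule with two
# alternatives has roughly the following shape, where `choice1` is a
# hypothetical stand-in for the generated label (the real names come
# from the “gen” instruction described later):

#         ...code emitted for the first alternative...
#         continue choice1
#         ...code emitted for the second alternative...
# choice1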

- term: (factor {else $seq}, output) [factor {assert}, output] @seq

# Analogously, this defines `term` as a sequence of one or more items
# that are either a `factor` or an `output`; the parentheses simply
# provide syntactic grouping. After the first item, if it was a
# factor (which attempts to parse some input and thus might fail),
# we emit a conditional jump (the “else” instruction) to the
# locally-defined label at the end (and this is in fact the code that
# emitted the `else seq15` instruction quoted earlier); but, after
# subsequent items, we instead emit an “assert”, which reports a
# parse failure if the item failed. This is because the meta5ix
# abstract machine does not support backtracking except in a very
# limited form.

# The concatenation of factors in the grammar being compiled is thus
# mapped to sequential emission of code by the compiler, which then
# maps to sequential execution by the target machine.

# So now we have given the implementation of two of our forms of
# composition, namely alternation and sequencing; repetition and
# invocation by name still remain, as well as the atomic units of
# input and output from which we thus compose our compiler.

### Factors ###

# The definition of `factor` is somewhat more involved; each part will
# be explained below:

- factor: string {literal $it}
    , "(" terms ")"
    , "[" @many terms {continue $many} "]"
    , name {call $it}
    , "<<" {begin} terms ">>" {end}
    , ":fnord" {fnord} factor
    , ":notin" string {notin $it}
    , ":between" string {between $it}

# string {literal $it}

# This is the first alternative for `factor`: a quoted string in the
# input compiles to a “literal” instruction with that string as an
# argument. This is what generated the `literal "-"` instruction
# quoted above.

# , "(" terms ")"

# A factor can also be parenthesized `terms`.

# , "[" @many terms {continue $many} "]"

# Or it can be `terms` wrapped in square brackets, denoting
# repetition; in this case a label is generated before the code to
# parse `terms`, and a conditional jump is generated to attempt
# parsing them repeatedly until some attempt is unsuccessful. Thus we
# have our third form of composition.

# , name {call $it}

# A factor can also be a `name`, in which case we generate a “call”
# instruction to that name, like the `call name` quoted above; this
# is, in a sense, the fourth form of composition.

# , "<<" {begin} terms ">>" {end}

# We haven’t seen an example of this yet, but a factor can also
# describe how to parse a *token*, which can be copied to the output
# with “it”; such a token consists of `terms` wrapped in double
# angle brackets. “begin” and “end” instructions are emitted to
# delimit the extent of the token at runtime; the `end` instruction
# backs up the input stream to where `begin` marked it if the attempt
# to parse the token failed. This is the only form of backup or
# backtracking supported by this abstract machine, and it does not
# nest.

# , ":fnord" {fnord} factor

# A factor can also be a fnorded factor, which skips any whitespace
# with the “fnord” instruction before attempting to parse the factor
# itself, in the same way that attempts to parse literal strings like
# "-" or "<<" skip whitespace. This is defined as a unary operator
# rather than a concatenation, because in a concatenation like `"-"
# name` or `"(" terms ")"`, if the first item succeeds, the other
# items are required to succeed (by inserting an `assert` instruction
# after each one). But skipping whitespace always succeeds, so it is
# not a good thing to use for such “predictive parsing”.
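
# Before going on to the last two alternatives, here is a sketch of
# how the pieces so far fit together. A hypothetical rule
# `- ab: "a" ["b"]` ("a" followed by zero or more "b"s) compiles to
# code of roughly this shape (the label numbers here are made up for
# illustration; the real ones come from the “gen” instruction
# described later):

# ab
#         literal "a"
#         else seq1
# many2
#         literal "b"
#         else seq3
# seq3
# choice4
#         continue many2
#         assert
# seq1
# choice5
#         return

# Here `seq1` is the end-of-term label shared by the first factor’s
# “else” and the trailing `@seq`, and `choice5` is the label emitted
# by `@choice` in `terms` even though there is only one alternative.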
# , ":notin" string {notin $it} # A factor can also be a `:notin` construct, as seen above in the # definition of comments, which is compiled to a “notin” instruction # which matches any single character not in the argument string (except # line separators). # , ":between" string {between $it} # A factor can also be a `:between` construct, which is compiled # analogously, but matches a single character whose ASCII code lies in # the closed range between two given characters; for example, `between # "AZ"` will match any ASCII capital letter. ### Output ### # The Meta5ix virtual machine has a one-line output buffer in which an # output line is constructed by a series of instructions. - output: (quasiquote, "@" {dedent} (var, quasiquote)) {writeline} # Finally we have finished the definition of `factor` and can move on # to `output`, which defines the `{foo}` and `@x` constructs that # generate the compiled program. `@` turns what follows into a label # with a “dedent” instruction, causing the resulting output line to # not be indented. Strings for output are quasiquoted: - quasiquote: "{" [(<>, "\" <<:notin "">>) {say "$it"}, "$" var] "}" - ch: :notin "$}\" # Verbatim text is tokenized here and emitted verbatim with “say”. # The verbatim text is just one or more normal characters `ch`, or a # backslash (which is dropped) followed by any character, while “$” # introduces a “var” — either an autogenerated local label or the # magic variable “it”. # There’s a subtle point here about skipping whitespace: This # definition does not ignore whitespace *within* the quasiquoted # string, because the abstract machine’s “notin” instruction does not # do that, unlike “literal”. But when Meta5ix tries to parse the “$” # literal, it will first irreversibly discard any whitespace that # might be impeding its greedy quest for “$”. So it is important that # the `"$" var` case and the "\" case are tried *after* the # ordinary-text case — otherwise any initial whitespace in verbatim # text would be lost. # There’s a subtle bug here with respect to breaking quasiquoted text # across lines. # Local labels and “it” are are handled in the same way within {} or # after @: - var: "it" {copyinput}, name {gen $it} # The “gen” instruction is defined by the abstract machine to generate a # name that’s local to the current rule invocation. ### Tokens ### # This leaves only the basic tokens of the language to define, the # things that will be copied to the output by `it`. - string: :fnord <<'"' [:notin '"'] '"', "'" [:notin "'"] "'">> # A `string` is a token, before which we ignore whitespace with # `:fnord`, which can be either "stuff" or 'stuff'; in either case the # contents are prohibited from including the closing delimiter (or, # implicitly, a line ending). - name: :fnord <> # A `name`, similarly, consists of a letter followed by zero or more # letters or digits. - letter: :between "az", :between "AZ", :between "__" # A `letter` is either an ASCII lowercase letter, an ASCII uppercase # letter, or underscore. You’d think you could handle the last case # with just "_" instead of `:between "__"` but that may add unwanted # whitespace to your identifier. 

### Notes ###

# Meta5ix is the simplest self-compiling compiler-compiler I am aware
# of, being 18 lines of fairly simple code, and the abstract machine
# it targets is also fairly simple and lightweight; although I’m
# currently running it on a virtual machine consisting of 109 lines of
# Python code which takes 668 million amd64 instructions to recompile
# meta5ix, I think it should be straightforward to reimplement in a
# few hundred instructions of assembly.

# Its current implementation requires a bootstrap compiler, provided
# by a checked-in previous output; I initially hand-compiled an early
# version of meta5ix to its abstract assembly to get it running, but
# the bootstrapping chain is currently broken.

# It follows META II fairly closely, and it shares META II’s
# limitation of being unable to backtrack beyond a single token, and
# being unable to reorder tokens copied from its input to its output,
# which enables the abstract machine to be implemented with only small
# input and output buffers. But, by moving tokenization into the
# grammar, meta5ix can support a much wider range of languages than
# META II could, simplifying the abstract machine into the bargain.
# It also supports an arbitrary set of local labels, rather than META
# II’s two, and its syntax is not only more readable, but also more
# amenable to editing — for example, adding and removing alternatives.
# Moreover, by virtue of being more restrained about inserting
# whitespace in the output, it can support a wider variety of output
# languages.

# There’s a strange hall-of-mirrors quality when adding a new feature
# to meta5ix and then using it, because there are three versions at
# play at the same time: the old version, which does not support or
# use the feature; the intermediate version, which supports it but
# does not use it; and the new version, which supports it and uses it.
# The intermediate version is compiled by the old version; it is the
# intermediate version’s code that you then run, and the new version
# that you feed it as input. It’s very easy to get confused between
# the levels, especially when the feature is backwards-incompatible —
# for example, changing the syntax of an existing construct — and
# you’re debugging. When the code produced from compiling the new
# version is incorrect, is that because the new version is buggy,
# because the intermediate version is buggy, or because the
# intermediate version was compiled with a buggy old version? It’s
# easy to find and fix the bug in the wrong version. And it’s easy to
# fix it in, say, the intermediate version and then forget to copy
# the change to the new version.