# The meta5ix compiler-compiler #

# This 18-line compiler can compile itself to an assembly language for
# the abstract machine in meta5ixrun.py; it can also compile a wide
# range of languages (LL(1) I think) to similar assembly languages.
# It is based on Val Schorre’s META II and on VPRI RN 2010-01,
# “Programming and Programming Languages.” It is shorter than META II
# (18 lines instead of 30) and more readable, uses a simpler abstract
# machine, and supports a wider range of input and output languages,
# while retaining META II’s ruthlessly simple register-machine
# execution model.

# It is best understood, as Long’s ACM Queue article describes META
# II, as “a field-improvised lever”; as Schorre’s 01963 article says
# of META II, it is “not intended as a standard language which
# everyone will use to write compilers”, and indeed it has many
# shortcomings. But it is expressive enough to implement such a
# language.

### Entry point ###

# Although meta5ix is fairly flexible with respect to input ordering,
# the first rule is the entry point of the compiler. Let me show you
# its features:

- program: ["-" name @it ":" terms {return}, "#" [:notin ""]]

# That defines the rule `program` as zero or more rules or comments,
# because `[]` denotes repetition† and `,` separates alternatives‡. A
# rule consists of "-", then a `name` (defined by a `name` rule
# below), ":", and `terms`; a comment consists of "#" followed by zero
# or more characters, implicitly excluding line separators. Schorre
# calls this a “syntax equation”. Juxtaposition in such rules, for
# example where "-" is followed by `name`, denotes concatenation§.

# In the case of parsing a rule, after parsing the name, we emit an
# assembly-language label (the `@`) consisting of the parsed name,
# copied from the input to the output with the magic word `it`; and,
# after parsing the terms (and emitting whatever code is emitted
# during their parsing), we emit the instruction “return”. Thus, a
# call to the label will begin running the code emitted for the
# terms, and when execution reaches its end, the subroutine will
# return.

# Whitespace, including newlines, is implicitly skipped before
# attempting to parse a quoted string such as "-" or ":".

# The output from running this program on itself begins with the
# compilation of the above rule, which is abbreviated as follows:

# program
# many12
#         literal "-"
#         else seq15
#         call name
#         assert
#         ...
#         return

# _______
# † Repetition is Kleene closure.

# ‡ Alternatives are united by addition in a Kleene algebra, but in
# the interest of readability, meta5ix separates them with `,` rather
# than the Kleene-algebra `+`, the BNF `|`, or Schorre’s `/`.

# § Concatenation is multiplication in a Kleene algebra.

### Terms ###

- terms: term ["," {continue $choice} term] @choice

# This defines `terms` as one or more `term`s separated by commas,
# which, as explained above, separate alternatives. The two
# occurrences of `choice` denote a locally-scoped identifier beginning
# with `choice`, unique to this invocation of this rule; it is used
# once as a target for the “continue” conditional jump instruction
# emitted after each comma, and once as a label after the last
# alternative. This way, once any alternative succeeds, execution
# will continue after the end of the list of alternatives.
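
# For concreteness, the code that `terms` arranges for a rule with two
# alternatives has roughly the following shape, where `choice1` is a
# hypothetical stand-in for the generated label (the real names come
# from the “gen” instruction described later):

#         ...code emitted for the first alternative...
#         continue choice1
#         ...code emitted for the second alternative...
# choice1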

- term: (factor {else $seq}, output) [factor {assert}, output] @seq

# Analogously, this defines `term` as a sequence of one or more items
# that are either a `factor` or an `output`; the parentheses simply
# provide syntactic grouping. After the first item, if it was a
# factor (which attempts to parse some input and thus might fail),
# we emit a conditional jump (the “else” instruction) to the
# locally-defined label at the end (and this is in fact the code that
# emitted the `else seq15` instruction quoted earlier); but, after
# subsequent items, we instead emit an “assert”, which reports a
# parse failure if the item failed. This is because the meta5ix
# abstract machine does not support backtracking except in a very
# limited form.

# The concatenation of factors in the grammar being compiled is thus
# mapped to sequential emission of code by the compiler, which then
# maps to sequential execution by the target machine.

# So now we have given the implementation of two of our forms of
# composition, namely alternation and sequencing; repetition and
# invocation by name still remain, as well as the atomic units of
# input and output from which we thus compose our compiler.

### Factors ###

# The definition of `factor` is somewhat more involved; each part will
# be explained below:

- factor: string {literal $it}
    , "(" terms ")"
    , "[" @many terms {continue $many} "]"
    , name {call $it}
    , "<<" {begin} terms ">>" {end}
    , ":fnord" {fnord} factor
    , ":notin" string {notin $it}
    , ":between" string {between $it}

# string {literal $it}

# This is the first alternative for `factor`: a quoted string in the
# input compiles to a “literal” instruction with that string as an
# argument. This is what generated the `literal "-"` instruction
# quoted above.

# , "(" terms ")"

# A factor can also be parenthesized `terms`.

# , "[" @many terms {continue $many} "]"

# Or it can be `terms` wrapped in square brackets, denoting
# repetition; in this case a label is generated before the code to
# parse `terms`, and a conditional jump is generated to attempt
# parsing them repeatedly until some attempt is unsuccessful. Thus we
# have our third form of composition.

# , name {call $it}

# A factor can also be a `name`, in which case we generate a “call”
# instruction to that name, like the `call name` quoted above; this
# is, in a sense, the fourth form of composition.

# , "<<" {begin} terms ">>" {end}

# We haven’t seen an example of this yet, but a factor can also
# describe how to parse a *token*, which can be copied to the output
# with “it”; such a token consists of `terms` wrapped in double
# angle brackets. “begin” and “end” instructions are emitted to
# delimit the extent of the token at runtime; the `end` instruction
# backs up the input stream to where `begin` marked it if the attempt
# to parse the token failed. This is the only form of backup or
# backtracking supported by this abstract machine, and it does not
# nest.

# , ":fnord" {fnord} factor

# A factor can also be a fnorded factor, which skips any whitespace
# with the “fnord” instruction before attempting to parse the factor
# itself, in the same way that attempts to parse literal strings like
# "-" or "<<" skip whitespace. This is defined as a unary operator
# rather than a concatenation, because in a concatenation like `"-"
# name` or `"(" terms ")"`, if the first item succeeds, the other
# items are required to succeed (by inserting an `assert` instruction
# after each one). But skipping whitespace always succeeds, so it is
# not a good thing to use for such “predictive parsing”.
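
# Before going on to the last two alternatives, here is a sketch of
# how the pieces so far fit together. A hypothetical rule
# `- ab: "a" ["b"]` ("a" followed by zero or more "b"s) compiles to
# code of roughly this shape (the label numbers here are made up for
# illustration; the real ones come from the “gen” instruction
# described later):

# ab
#         literal "a"
#         else seq1
# many2
#         literal "b"
#         else seq3
# seq3
# choice4
#         continue many2
#         assert
# seq1
# choice5
#         return

# Here `seq1` is the end-of-term label shared by the first factor’s
# “else” and the trailing `@seq`, and `choice5` is the label emitted
# by `@choice` in `terms` even though there is only one alternative.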
# , ":notin" string {notin $it} # A factor can also be a `:notin` construct, as seen above in the # definition of comments, which is compiled to a “notin” instruction # which matches any single character not in the argument string (except # line separators). # , ":between" string {between $it} # A factor can also be a `:between` construct, which is compiled # analogously, but matches a single character whose ASCII code lies in # the closed range between two given characters; for example, `between # "AZ"` will match any ASCII capital letter. ### Output ### # The Meta5ix virtual machine has a one-line output buffer in which an # output line is constructed by a series of instructions. - output: (quasiquote, "@" {dedent} (var, quasiquote)) {writeline} # Finally we have finished the definition of `factor` and can move on # to `output`, which defines the `{foo}` and `@x` constructs that # generate the compiled program. `@` turns what follows into a label # with a “dedent” instruction, causing the resulting output line to # not be indented. Strings for output are quasiquoted: - quasiquote: "{" [(<>, "\" <<:notin "">>) {say "$it"}, "$" var] "}" - ch: :notin "$}\" # Verbatim text is tokenized here and emitted verbatim with “say”. # The verbatim text is just one or more normal characters `ch`, or a # backslash (which is dropped) followed by any character, while “$” # introduces a “var” — either an autogenerated local label or the # magic variable “it”. # There’s a subtle point here about skipping whitespace: This # definition does not ignore whitespace *within* the quasiquoted # string, because the abstract machine’s “notin” instruction does not # do that, unlike “literal”. But when Meta5ix tries to parse the “$” # literal, it will first irreversibly discard any whitespace that # might be impeding its greedy quest for “$”. So it is important that # the `"$" var` case and the "\" case are tried *after* the # ordinary-text case — otherwise any initial whitespace in verbatim # text would be lost. # There’s a subtle bug here with respect to breaking quasiquoted text # across lines. # Local labels and “it” are are handled in the same way within {} or # after @: - var: "it" {copyinput}, name {gen $it} # The “gen” instruction is defined by the abstract machine to generate a # name that’s local to the current rule invocation. ### Tokens ### # This leaves only the basic tokens of the language to define, the # things that will be copied to the output by `it`. - string: :fnord <<'"' [:notin '"'] '"', "'" [:notin "'"] "'">> # A `string` is a token, before which we ignore whitespace with # `:fnord`, which can be either "stuff" or 'stuff'; in either case the # contents are prohibited from including the closing delimiter (or, # implicitly, a line ending). - name: :fnord <> # A `name`, similarly, consists of a letter followed by zero or more # letters or digits. - letter: :between "az", :between "AZ", :between "__" # A `letter` is either an ASCII lowercase letter, an ASCII uppercase # letter, or underscore. You’d think you could handle the last case # with just "_" instead of `:between "__"` but that may add unwanted # whitespace to your identifier. 

### Notes ###

# Meta5ix is the simplest self-compiling compiler-compiler I am aware
# of, being 18 lines of fairly simple code, and the abstract machine
# it targets is also fairly simple and lightweight; although I’m
# currently running it on a virtual machine consisting of 109 lines of
# Python code which takes 668 million amd64 instructions to recompile
# meta5ix, I think it should be straightforward to reimplement in a
# few hundred instructions of assembly.

# Its current implementation requires a bootstrap compiler, provided
# by a checked-in previous output; I initially hand-compiled an early
# version of meta5ix to its abstract assembly to get it running, but
# the bootstrapping chain is currently broken.

# It follows META II fairly closely, and it shares META II’s
# limitation of being unable to backtrack beyond a single token, and
# being unable to reorder tokens copied from its input to its output,
# which enables the abstract machine to be implemented with only small
# input and output buffers. But, by moving tokenization into the
# grammar, meta5ix can support a much wider range of languages than
# META II could, simplifying the abstract machine into the bargain.
# It also supports an arbitrary set of local labels, rather than META
# II’s two, and its syntax is not only more readable, but also more
# amenable to editing — for example, adding and removing alternatives.
# Moreover, by virtue of being more restrained about inserting
# whitespace in the output, it can support a wider variety of output
# languages.

# There’s a strange hall-of-mirrors quality when adding a new feature
# to meta5ix and then using it, because there are three versions at
# play at the same time: the old version, which does not support or
# use the feature; the intermediate version, which supports it but
# does not use it; and the new version, which supports it and uses it.
# The intermediate version is compiled by the old version; it is the
# intermediate version’s code that you then run, and the new version
# that you feed it as input. It’s very easy to get confused between
# the levels, especially when the feature is backwards-incompatible —
# for example, changing the syntax of an existing construct — and
# you’re debugging. When the code produced from compiling the new
# version is incorrect, is that because the new version is buggy,
# because the intermediate version is buggy, or because the
# intermediate version was compiled with a buggy old version? It’s
# easy to find and fix the bug in the wrong version. And it’s easy to
# fix it in, say, the intermediate version and then forget to copy
# the change to the new version.