Ur-Scheme: A self-hosting compiler for a subset of R5RS Scheme to x86 asm

Ur-Scheme is a compiler from a small subset of R5RS Scheme to x86 assembly language. It can compile itself. It is free software, licensed under the GNU GPLv3+. It might be useful as a base for a more practical implementation (or a more compact one), or it might be enjoyable to read, but it probably isn't that useful in its current form.

Ur-Scheme is:

Downloading

If you want to get it, even though it's impractical, you can use Darcs to snarf the source repository:

darcs get http://pobox.com/~kragens/sw/urscheme/

Or you can download the source tarball of Ur-Scheme version 0.

If you have MzScheme installed, you can build it by just typing "make".

Limitations

Extensions

These are not in R5RS.

Bugs

Output is unbuffered, which makes it slow.

I still don't have a garbage collector, and programs crash when they run out of memory.

The values returned by output procedures are invalid.

+ and - aren't first-class procedures.
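To make that last bug concrete: in standard R5RS you can hand + itself to a higher-order procedure, and that is what doesn't work here. A minimal sketch of the usual workaround follows, assuming the compiler handles + in operator position inside an ordinary procedure body. The names add and reduce-list are made up for illustration and are not part of Ur-Scheme.

```scheme
;; Hypothetical helper names; not taken from Ur-Scheme's source.
;; Wrap + in an ordinary procedure so there is a first-class value to pass.
(define (add a b) (+ a b))

;; A small left-to-right reduction over a list, written with plain recursion.
(define (reduce-list f acc lst)
  (if (null? lst)
      acc
      (reduce-list f (f acc (car lst)) (cdr lst))))

;; Pass add where standard Scheme would let you pass + directly.
(define total (reduce-list add 0 '(1 2 3 4)))
```

Here total is 10. The wrapper costs one extra procedure call per addition.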

Origins

In February 2008, I wanted to write a metacompiler for Bicicleta, but I was intimidated because I'd never written a compiler before, and nobody had ever written a compiler either in Bicicleta or for Bicicleta. So I thought I'd pick a language that other people knew a lot about writing compilers for and that wasn't too hairy, write a compiler for it, and then use what I'd learned to write the Bicicleta compiler.

It took me about 18 days from the time I started on the project to the time that the compiler could actually compile itself, which was a lot longer than I expected.

I learned a bunch from doing it. Here are some of the main things I learned:

  1. Metainterpreters are a better way to bootstrap. Interpreters are a lot simpler than compilers, especially if they don't have to run fast, and especially if you can write them in a language like Scheme that gives you garbage collection and closures for free. The restriction that the compiler had to be correct both in R5RS Scheme and in the language that it could compile was really a pain. For example, although I could add new syntactic forms to the language that it could compile, and I could add new syntactic forms with R5RS macro definitions, I couldn't simplify the compiler by adding new syntactic forms, because the compiler can't compile R5RS macro definitions. (There's a portable implementation of R5RS macros out there, but it's about twice the size of the entire Ur-Scheme compiler.) Similarly, I was stuck with a bunch of the boneheaded design decisions of the Scheme built-in types — no auto-growing mutable containers, separate types for strings and characters, and the difficulty of doing any string processing without arithmetic and side effects, for example.
  2. Start with the simplest thing that could possibly work. I keep on learning this every year. In this case, the really expensive thing was that I wanted normal function calls to be fast, so I used normal C-style stack frames for their arguments. This seems to have paid off in speed, but it meant I spent four days and about 200 lines of code implementing lexical closures of unlimited extent. (Really! According to my change log, from February 13 to February 16, I basically didn't do anything else except add macros.) If I'd just allocated all my call frames on the heap, the result would have been slow, but I would have gotten done a lot sooner.
  3. Tail recursion makes your code hard to read. More traditional control structures, such as explicit loops, are both terser and clearer. This is the largest Scheme program I've written.
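To show why closures of unlimited extent are awkward to combine with C-style stack frames, here is a small sketch in standard R5RS Scheme. It is not taken from the compiler's source. The variable count is created during a call to make-counter, but the returned procedure keeps using it after that call has returned, so it cannot live only in make-counter's stack frame.

```scheme
;; make-counter returns a procedure that still reads and updates `count`
;; after make-counter itself has returned.
(define (make-counter)
  (let ((count 0))
    (lambda ()
      (set! count (+ count 1))
      count)))

;; Each call to make-counter creates a separate `count`.
(define c1 (make-counter))
(define c2 (make-counter))

(define first  (c1))   ; 1
(define second (c1))   ; 2
(define other  (c2))   ; 1, because c2 has its own count
```

With every call frame on the heap, count would simply stay alive for as long as the closure does. With stack frames, the compiler has to notice that count is captured and store it somewhere that outlives the frame.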

Future Work

First, of course, there are the bugs to fix, especially the absence of a garbage collector.

If this compiler has any merit at all, it is in its small size and comprehensibility. Darius Bacon's brilliant 385-line "ichbins" self-compiling Lisp-to-C compiler is much better at that, being less than one-fifth of the size. So one direction of evolution is to figure out what can be stripped out of it. ichbins has no arithmetic, no closures, no separate string or symbol type, and only one side effect.

There's probably a lot that could be made clearer, as well.

Another direction is to try to improve the speed and size of its output code a bit. For example:

Another direction is to make it more practical. For example, improved error-reporting, a profiler, an FFI, and access to files and sockets would be handy.

Another direction is to make the compiler itself faster. Its assembly output is composed largely of fixed sequences of instructions stuck together, but those sequences are rebuilt through many layers of procedure calls every time they are emitted.
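One way to picture that is the sketch below. Every procedure name in it is hypothetical, and the AT&T-style instruction text is only an assumption about what the output looks like; none of it is copied from Ur-Scheme. It contrasts rebuilding a fixed instruction sequence through nested calls with building it once and reusing the result.

```scheme
;; Hypothetical emitter helpers; not Ur-Scheme's actual code.
;; Each helper builds assembly text for one instruction.
(define (insn op args)
  (string-append "        " op " " args "\n"))
(define (push-reg reg) (insn "push" reg))
(define (mov src dst) (insn "mov" (string-append src ", " dst)))

;; Slow style: the same fixed text is rebuilt through several calls
;; every time a prologue is needed.
(define (prologue-rebuilt)
  (string-append (push-reg "%ebp") (mov "%esp" "%ebp")))

;; Faster style: build the fixed sequence once and hand back the same string.
(define prologue-cached (prologue-rebuilt))
(define (prologue) prologue-cached)
```

Both versions produce the same text. The cached one trades a little memory for skipping the repeated calls and string allocation, which matters more here because there is no garbage collector to reclaim the temporaries.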