Ur-Scheme: A GPL self-hosting compiler from a subset of R5RS Scheme to fast Linux x86 asm

Ur-Scheme is a compiler from a small subset of R5RS Scheme to Intel x86 assembly language for Linux. It can compile itself. It is free software, licensed under the GNU GPLv3+. It might be useful as a base for a more practical implementation (or a more compact one), or it might be enjoyable to read (particularly since the entire development history is browsable using darcs), but it probably isn't that useful in its current form.

Ur-Scheme is:

Downloading

If you want to get it, even though it's impractical, you can use Darcs to snarf the source repository:

darcs get http://canonical.org/~kragen/sw/urscheme/

Or you can download the source tarball of Ur-Scheme version 3, or version 2, version 1, or version 0 if you like.

On Linux, if you have MzScheme, GCC, and gas installed, you can build it by just typing "make". So far there it only runs on Linux, although it looks like a MacOS port would be pretty easy.

Limitations

Extensions

These are not in R5RS.

Bugs

Output is unbuffered, which accounts for a third of its run-time.

I still don't have a garbage collector, and programs crash when they run out of memory.

Influences and Inspirations

Abdulaziz Ghuloum's paper, An Incremental Approach to Compiler Construction, was an enjoyable read and inspirational. Ghuloum explained his objective and approach thus:

Compilers are perceived to be magical artifacts, carefully crafted by the wizards, and unfathomable by the mere mortals... Real-life compilers are too complex to serve as an educational tool. And the gap between real-life compilers and the educational toy compilers is too wide. The novice compiler writer stands puzzled facing an impenetrable barrier, "better write an interpreter instead."

The goal of this paper is to break that barrier. We show that building a compiler can be as easy as building an interpreter. The compiler we construct accepts a large subset of the Scheme programming language and produces assembly code for the Intel-x86 architecture... The development of the compiler is broken into many small incremental steps. Every step yields a fully working compiler for a progressively expanding subset of Scheme. Every compiler step produces real assembly code that can be assembled then executed directly by the hardware. ...

Compiler construction is not as complex as it is commonly perceived to be. In this paper, we showed that constructing a compiler for a large subset of Scheme that targets a real hardware is simple. The basic compiler is achieved by concentrating on the essential aspects of compilation and freeing the compiler from sophisticated analysis and optimization passes. This helps the novice compiler writers build the intuition for the inner-workings of compilers without being distracted by details.

This is the objective and approach with which I started out Ur-Scheme, and it turned out to be extremely successful.

Marc Feeley's 2004 talk on the "90 minute Scheme to C compiler" was also quite inspirational, although I didn't use any of the techniques he discussed! It was useful to see that a relatively complete Scheme compiler could be written in 800 lines of Scheme, and that the performance of the resulting C code wasn't too terrible.

In the realm of small self-hosting compilers, Fabrice Bellard's 2002 OTCC is near the pinnacle of compactness. In 4748 bytes of code, otccelf compiles the subset of C it is written in into standalone x86 ELF executables. What's even more stunning is that the original otcc, missing the ELF output, is only 3301 bytes, and Bellard explains that he actually started with a smaller subset of C than that! The extended version, TCC, was up to 100KiB and could compile a Linux kernel from scratch in 10 seconds.

Looking at disassembled code from Bernd Paysan's bigFORTH showed me that a code generator could be really very simple. (The bigFORTH web page says, "The compiler generates optimized native code for the i386.", but I don't know what meaning of "optimized" Bernd has in mind.) bigFORTH gets pretty decent performance in my experience, really, much better than the gFORTH threaded-code interpreter. But it looks like most of the bigFORTH primitives simply inline fixed instruction sequences, and no register allocation is done. I figured I could glue together fixed instruction sequences too.

Once I got started, I learned about Darius Bacon's brilliant seven-page self-hosting "ichbins" compiler and read it. It's definitely worth a read.

I decided to call it "Ur-Scheme" because it's not quite a Scheme, with all the missing parts mentioned above; it's more like something that could grow into a Scheme. I was inspired by UrJTAG, but of course the echo of "Ur-Quan Masters" (one of my favorite games) was a plus as well. Also, the ancient capital of Sumer was Ur of the Chaldees (אור כשדים in Hebrew), and it's an Unoriginal Reimplementation of Scheme. (Thanks to Ben Goetter for that.) Or at least some of it.

Origins and Lessons

In February 2008, I wanted to write a metacompiler for Bicicleta, but I was intimidated because I'd never written a compiler before, and nobody had ever written a compiler either in Bicicleta or for Bicicleta. So I thought I'd pick a language that other people knew a lot about writing compilers for and that wasn't too hairy, write a compiler for it, and then use what I'd learned to write the Bicicleta compiler.

It took me about 18 days from the time I started on the project to the time that the compiler could actually compile itself, which was a lot longer than I expected.

I learned a bunch from doing it. Here are some of the main things I learned:

  1. Interpreters are a better way to bootstrap than metacompilers. Instead of writing a compiler for a subset of Scheme in that subset, I should have written an interpreter in standard Scheme (or whatever) and a metacompiler in the language implemented by the interpreter. (This is what Darius Bacon did with "ichbins".) Interpreters are a lot simpler than compilers, especially if they don't have to run fast, and especially if you can write them in a language like Scheme that gives you garbage collection and closures for free.

    The restriction that the compiler had to be correct both in R5RS Scheme and in the language that it could compile was really a pain. For example, although I could add new syntactic forms to the language that it could compile, and I could add new syntactic forms with R5RS macro definitions, I couldn't simplify the compiler by adding new syntactic forms, because the compiler can't compile R5RS macro definitions. (There's a portable implementation of R5RS macros out there, but it's about twice the size of the entire Ur-Scheme compiler.) Similarly, I was stuck with a bunch of the boneheaded design decisions of the Scheme built-in types — for example, no auto-growing mutable containers, separate types for strings and characters, and the difficulty of doing any string processing without arithmetic and side effects.

  2. Start with the simplest thing that could possibly work. I keep on learning this every year. In this case, the really expensive thing was that I wanted normal function calls to be fast, so I used normal C-style stack frames for their arguments. This seems to have paid off in speed, but it meant I spent four days and about 200 lines of code implementing lexical closures of unlimited extent. (Really! According to my change log, from February 13 to February 16, I basically didn't do anything else except add macros.) If I'd just allocated all my call frames on the heap, the result would have been slow, but I would have gotten done a lot sooner.
  3. Tail-recursion makes your code hard to read. More traditional control structures, such as explicit loops, are both terser and clearer. This is the largest Scheme program I've written, so this is my first time really experiencing this. Look at this code (discarded in favor of something simpler now, thank goodness):
    (define (string-append-3 length s2 buf idx)
      (if (= idx (string-length buf)) buf
          (begin
            (string-set! buf idx (string-ref s2 (- idx length)))
            (string-append-3 length s2 buf (1+ idx)))))
    (define (string-append-2 s1 s2 buf idx)
      (if (= idx (string-length s1))
          (string-append-3 (string-length s1) s2 buf idx)
          (begin
            (string-set! buf idx (string-ref s1 idx))
            (string-append-2 s1 s2 buf (1+ idx)))))
    (define (string-append s1 s2)       ; standard
      (string-append-2 s1 s2 (make-string (+ (string-length s1)
                                             (string-length s2)))
                       0))
    

    That horrible rat's nest is my attempt to straightforwardly translate this very straightforward procedural approach, without introducing any forward references:

    def string_append(s1, s2):
        buf = make_string(len(s1) + len(s2))
        idx = 0
        while idx != len(s1):
            buf[idx] = s1[idx]
            idx += 1
        length = len(s1)
        while idx != len(buf):
            buf[idx] = s2[idx - length]
            idx += 1
        return buf
    

    (You wouldn't really want to do that in Python — I'm just explaining what I was thinking.)

    Tail-calls are merely "goto with arguments", which means that they can be implemented very efficiently, and also means that you can easily use them to create unreadable code.

  4. Native-code compilers get OK performance pretty easily. On the standard stupid Fibonacci microbenchmark, on my laptop, Ur-Scheme outperforms MzScheme's interpreter by about a factor of 7.6 (excluding time to compile and assemble):
    (define (fib2 n) 
      (cond ((= n 0) 1)
            ((= n 1) 1)
            (else (+ (fib2 (1- n)) 
                     (fib2 (- n 2))))))
    

    GCC outperforms Ur-Scheme on that same benchmark by about a factor of 5.

    On larger programs, such as the Ur-Scheme compiler itself, Ur-Scheme only outperforms MzScheme's interpreter by about a factor of 5 (rather than 7 or so), which (if you're keeping track) means that Scheme code interpreted in MzScheme is somewhere around 25-50 times slower than C code compiled with GCC.

    The literature about writing compilers is full of hairy techniques like CPS conversion, SSA conversion, higher-order control-flow analysis, type inference, automated theorem proving, partial evaluation, escape analysis, and so on. Ur-Scheme doesn't do any of those things. (Well, actually, it does do a little bit of escape analysis.) In fact, Ur-Scheme doesn't even do register allocation, and it does full dynamic type checking at run-time. I imagine that if you wanted to get performance only 2 or 3 or 4 times worse than C compiled with GCC (which is not exactly a gold standard itself, these days) then you'd need to start hauling out the hairy techniques. But it already beats the pants off all the interpreters I can find, and at least for itself, it generates faster code than the three well-known Scheme compilers I've tried on it (mzc, Gambit-C, and Chicken.) (I suspect that Stalin and Ikarus would do better...)

Interesting Features

It has a one-level parser, like old versions of Microsoft BASIC; there's no separate tokenizer. I wrote the parser after reading the "ichbins" parser, and there is definitely some influence visible.

It contains relatively little mutation. Although almost every line of the compiler has "side effects" like outputting lines of assembly code, there are fairly few locations where the compiler's internal state is mutated. I count 25 calls to set! and string-set! in the 1600 lines of code, including the standard library. The I/O side effects are similarly centralized — there are three calls to display, no calls to newline, and one call to read-char in the whole code. When was the last time you saw a 1600-line C program with only 25 assignment statements?

To simplify code generation, it follows BigFORTH and colorForth by treating the x86 as a stack machine.

What little optimization it does is done by optimization macros that rewrite certain patterns in the source — for example, (if (not a) b c) gets rewritten into (if a c b), and (if (null? x) y z) is rewritten into a call to a special built-in (%ifnull x y z). As described in R5RS, these macros also compile the various Scheme control structures (cond, case, and, and so on) into more fundamental control structures like if and lambda.

At present, because it compiles one expression at a time, you can interactively type code at it and see the compiled assembly results. Unfortunately this isn't very much fun because the generated code is so verbose. Darius discovered this feature.

The assembly code it produces can compile to statically linked, standalone executables because it doesn't depend on any external libraries — it just uses the Linux system call interface directly. They can be relatively small; a minimal executable is about 17KiB, because it includes the whole standard library (not just the parts that are used).

Future Work

First, of course, there are the bugs to fix, especially including the absence of a garbage collector.

If this compiler has any merit at all, it is in its small size and comprehensibility. Darius Bacon's brilliant 385-line "ichbins" self-compiling Lisp-to-C compiler is much better at that, being less than one-fifth of the size. So one direction of evolution is to figure out what can be stripped out of it. ichbins has no arithmetic, no closures, no separate string or symbol type, and only one side effect.

There's probably a lot that could be made clearer, as well.

Another direction is to try to improve the speed and size of its output code a bit. For example:

Another direction is to make it more practical. For example, improved error-reporting, a profiler, an FFI, and access to files and sockets would be handy.

Another direction is to make it run faster.