Ur-Scheme: A GPL self-hosting compiler from a subset of R5RS Scheme to fast Linux x86 asm

Ur-Scheme is a compiler from a small subset of R5RS Scheme to Intel x86 assembly language for Linux. It can compile itself. It is free software, licensed under the GNU GPLv3+. It might be useful as a base for a more practical implementation (or a more compact one), or it might be enjoyable to read (particularly since the entire development history is browsable using darcs), but it probably isn't that useful in its current form.

Ur-Scheme is:

Reasonably fast. It generates reasonably fast code — when compiled with itself, it runs 2½ times faster (in user CPU time) than when it's compiled with Chicken, 1½ times faster than when it's compiled with Gambit-C, and much faster than when it's interpreted by any of Guile, MzScheme, SCM, Elk, or Chicken's interpreter. It also runs reasonably fast; on my laptop, it compiles about 1000 lines of Scheme per user CPU second, which is not fast in absolute terms (on the same hardware, gas assembles the resulting 66 000-line assembly file in just over half a second, which is about 100 000 lines of assembly per second) but seems to be faster than a lot of Scheme compilers. I imagine that it would run faster if compiled with Ikarus or Stalin, but my CPU is too old to run Ikarus, and I can't get Stalin to compile it successfully.
Impractical. It hardly implements anything beyond what's in R5RS, and it omits quite a bit of the stuff that is in R5RS; for example, files, macros, vectors, eval, quasiquotation, call/cc, and floating-point ("inexact numbers"). Only things that were needed for the compiler to compile itself and run some unit tests are included.
Small. The implementation is about 1600 lines of Scheme, plus 700 lines of comments and blank lines. Compiled with itself on x86 Linux and stripped, the executable is 134kiB. On my 700MHz laptop from the previous millennium, compiling it from scratch with MzScheme and gas takes just under 12 seconds.
Safe. Programs compiled with it do not crash or corrupt their memory unless there is a bug in the compiler (er, except if they run out of stack or heap). All types and array indices are dynamically checked at run-time.
Portable. Last time I checked, it could compile itself successfully when compiled by Guile, MzScheme (PLT Scheme), SCM, Elk, Gambit-C, or Chicken, although it doesn't run in Bigloo, Stalin, RScheme, or TinyScheme. Retargeting it to a different processor wouldn't be a huge pain (after all, it's only 1600 lines of code) but wouldn't be that trivial either.
Unit-tested. It comes with a fairly comprehensive set of unit tests, and additionally, it compiles itself well enough that the compiled version produces the same assembly-language output, byte-for-byte, as when it's running interpreted in some other Scheme — at least when it's compiling itself.

Downloading

If you want to get it, even though it's impractical, you can use Darcs to snarf the source repository:

darcs get http://canonical.org/~kragen/sw/urscheme/

Or you can download the source tarball of Ur-Scheme version 3, or version 2, version 1, or version 0 if you like.

On Linux, if you have MzScheme, GCC, and gas installed, you can build it by just typing "make". So far there it only runs on Linux, although it looks like a MacOS port would be pretty easy.

Limitations

The following are missing: user-defined macros, eval, quasiquotation, first-class continuations, vectors, numerical constants with radix prefixes, ports, "let*", "letrec", non-top-level defines ("internal definitions"), "do", promises, non-integer numbers, bignums, load, apply, case-insensitivity, multiple-value returns.
Pairs are immutable, so there are no set-car! and set-cdr!. Despite this, eqv? is not the same as equal? for lists.
Procedures are allowed to take fixed numbers of arguments, or any arbitrary number of arguments, but the normal at-least-N syntax (lambda (a b . rest) ...) is not supported.
Rebinding standard procedures may break your program.
However, some standard procedures are treated as special forms by the compiler, so rebinding them will rarely have any effect.
Integer arithmetic silently overflows.
Most arithmetic operators are omitted; only 1+, 1-, +, -, quotient, and remainder are present.
Very many standard library procedures are missing. Only the following 55 procedures were actually present last time I updated this list: procedure? string? make-string string-set! string-ref string-length car cdr cons pair? symbol? symbol->string string->symbol display newline eq? current-input-port read-char integer? remainder quotient < eof-object? char? integer->char char->integer list length assq memq memv append not string-append char-whitespace? char<? char<=? char-between? char-alphabetic? = char=? eqv? equal? string=? null? boolean? number? for-each map reverse string->list list->string number->string string->number write.
make-string only takes one argument.
read-char takes a port argument and ignores it; write, display, and newline do not take port arguments. current-input-port returns nil.
map and for-each only take two arguments.
number->string only takes one argument.
string->symbol will be slow with enough symbols.

Extensions

These are not in R5RS.

(display-stderr foo) is like display, but uses stderr.
(exit 37) makes the program exit with exit code 37.
(error foo bar baz) aborts the program and sends to stderr "error: " followed by foo, bar, and baz.
1+ and 1- are as in Common Lisp.
(char->string c) is like (make-string 1 c).
(escape string) returns a list of character strings which, when concatenated, form a string representing the original string by escaping all the backslashes, quotes, or newlines in the string by preceding them with a backslash.
#\tab is the tab character. I wouldn't think to mention it but apparently Stalin doesn't support this.

Bugs

Output is unbuffered, which accounts for a third of its run-time.

I still don't have a garbage collector, and programs crash when they run out of memory.

Influences and Inspirations

Abdulaziz Ghuloum's paper, An Incremental Approach to Compiler Construction, was an enjoyable read and inspirational. Ghuloum explained his objective and approach thus:

Compilers are perceived to be magical artifacts, carefully crafted by the wizards, and unfathomable by the mere mortals... Real-life compilers are too complex to serve as an educational tool. And the gap between real-life compilers and the educational toy compilers is too wide. The novice compiler writer stands puzzled facing an impenetrable barrier, "better write an interpreter instead."
The goal of this paper is to break that barrier. We show that building a compiler can be as easy as building an interpreter. The compiler we construct accepts a large subset of the Scheme programming language and produces assembly code for the Intel-x86 architecture... The development of the compiler is broken into many small incremental steps. Every step yields a fully working compiler for a progressively expanding subset of Scheme. Every compiler step produces real assembly code that can be assembled then executed directly by the hardware. ...

Compiler construction is not as complex as it is commonly perceived to be. In this paper, we showed that constructing a compiler for a large subset of Scheme that targets a real hardware is simple. The basic compiler is achieved by concentrating on the essential aspects of compilation and freeing the compiler from sophisticated analysis and optimization passes. This helps the novice compiler writers build the intuition for the inner-workings of compilers without being distracted by details.

This is the objective and approach with which I started out Ur-Scheme, and it turned out to be extremely successful.

Marc Feeley's 2004 talk on the "90 minute Scheme to C compiler" was also quite inspirational, although I didn't use any of the techniques he discussed! It was useful to see that a relatively complete Scheme compiler could be written in 800 lines of Scheme, and that the performance of the resulting C code wasn't too terrible.

In the realm of small self-hosting compilers, Fabrice Bellard's 2002 OTCC is near the pinnacle of compactness. In 4748 bytes of code, otccelf compiles the subset of C it is written in into standalone x86 ELF executables. What's even more stunning is that the original otcc, missing the ELF output, is only 3301 bytes, and Bellard explains that he actually started with a smaller subset of C than that! The extended version, TCC, was up to 100KiB and could compile a Linux kernel from scratch in 10 seconds.

Looking at disassembled code from Bernd Paysan's bigFORTH showed me that a code generator could be really very simple. (The bigFORTH web page says, "The compiler generates optimized native code for the i386.", but I don't know what meaning of "optimized" Bernd has in mind.) bigFORTH gets pretty decent performance in my experience, really, much better than the gFORTH threaded-code interpreter. But it looks like most of the bigFORTH primitives simply inline fixed instruction sequences, and no register allocation is done. I figured I could glue together fixed instruction sequences too.

Once I got started, I learned about Darius Bacon's brilliant seven-page self-hosting "ichbins" compiler and read it. It's definitely worth a read.

I decided to call it "Ur-Scheme" because it's not quite a Scheme, with all the missing parts mentioned above; it's more like something that could grow into a Scheme. I was inspired by UrJTAG, but of course the echo of "Ur-Quan Masters" (one of my favorite games) was a plus as well. Also, the ancient capital of Sumer was Ur of the Chaldees (אור כשדים in Hebrew), and it's an Unoriginal Reimplementation of Scheme. (Thanks to Ben Goetter for that.) Or at least some of it.

Origins and Lessons

In February 2008, I wanted to write a metacompiler for Bicicleta, but I was intimidated because I'd never written a compiler before, and nobody had ever written a compiler either in Bicicleta or for Bicicleta. So I thought I'd pick a language that other people knew a lot about writing compilers for and that wasn't too hairy, write a compiler for it, and then use what I'd learned to write the Bicicleta compiler.

It took me about 18 days from the time I started on the project to the time that the compiler could actually compile itself, which was a lot longer than I expected.

I learned a bunch from doing it. Here are some of the main things I learned:

Interpreters are a better way to bootstrap than metacompilers. Instead of writing a compiler for a subset of Scheme in that subset, I should have written an interpreter in standard Scheme (or whatever) and a metacompiler in the language implemented by the interpreter. (This is what Darius Bacon did with "ichbins".) Interpreters are a lot simpler than compilers, especially if they don't have to run fast, and especially if you can write them in a language like Scheme that gives you garbage collection and closures for free.
The restriction that the compiler had to be correct both in R5RS Scheme and in the language that it could compile was really a pain. For example, although I could add new syntactic forms to the language that it could compile, and I could add new syntactic forms with R5RS macro definitions, I couldn't simplify the compiler by adding new syntactic forms, because the compiler can't compile R5RS macro definitions. (There's a portable implementation of R5RS macros out there, but it's about twice the size of the entire Ur-Scheme compiler.) Similarly, I was stuck with a bunch of the boneheaded design decisions of the Scheme built-in types — for example, no auto-growing mutable containers, separate types for strings and characters, and the difficulty of doing any string processing without arithmetic and side effects.
Start with the simplest thing that could possibly work. I keep on learning this every year. In this case, the really expensive thing was that I wanted normal function calls to be fast, so I used normal C-style stack frames for their arguments. This seems to have paid off in speed, but it meant I spent four days and about 200 lines of code implementing lexical closures of unlimited extent. (Really! According to my change log, from February 13 to February 16, I basically didn't do anything else except add macros.) If I'd just allocated all my call frames on the heap, the result would have been slow, but I would have gotten done a lot sooner.

Tail-recursion makes your code hard to read. More traditional control structures, such as explicit loops, are both terser and clearer. This is the largest Scheme program I've written, so this is my first time really experiencing this. Look at this code (discarded in favor of something simpler now, thank goodness):

(define (string-append-3 length s2 buf idx)
  (if (= idx (string-length buf)) buf
      (begin
        (string-set! buf idx (string-ref s2 (- idx length)))
        (string-append-3 length s2 buf (1+ idx)))))
(define (string-append-2 s1 s2 buf idx)
  (if (= idx (string-length s1))
      (string-append-3 (string-length s1) s2 buf idx)
      (begin
        (string-set! buf idx (string-ref s1 idx))
        (string-append-2 s1 s2 buf (1+ idx)))))
(define (string-append s1 s2)       ; standard
  (string-append-2 s1 s2 (make-string (+ (string-length s1)
                                         (string-length s2)))
                   0))

That horrible rat's nest is my attempt to straightforwardly translate this very straightforward procedural approach, without introducing any forward references:

def string_append(s1, s2):
    buf = make_string(len(s1) + len(s2))
    idx = 0
    while idx != len(s1):
        buf[idx] = s1[idx]
        idx += 1
    length = len(s1)
    while idx != len(buf):
        buf[idx] = s2[idx - length]
        idx += 1
    return buf

(You wouldn't really want to do that in Python — I'm just explaining what I was thinking.)

Tail-calls are merely "goto with arguments", which means that they can be implemented very efficiently, and also means that you can easily use them to create unreadable code.

Native-code compilers get OK performance pretty easily. On the standard stupid Fibonacci microbenchmark, on my laptop, Ur-Scheme outperforms MzScheme's interpreter by about a factor of 7.6 (excluding time to compile and assemble):
```
(define (fib2 n) 
  (cond ((= n 0) 1)
        ((= n 1) 1)
        (else (+ (fib2 (1- n)) 
                 (fib2 (- n 2))))))
```
GCC outperforms Ur-Scheme on that same benchmark by about a factor of 5.

On larger programs, such as the Ur-Scheme compiler itself, Ur-Scheme only outperforms MzScheme's interpreter by about a factor of 5 (rather than 7 or so), which (if you're keeping track) means that Scheme code interpreted in MzScheme is somewhere around 25-50 times slower than C code compiled with GCC.

The literature about writing compilers is full of hairy techniques like CPS conversion, SSA conversion, higher-order control-flow analysis, type inference, automated theorem proving, partial evaluation, escape analysis, and so on. Ur-Scheme doesn't do any of those things. (Well, actually, it does do a little bit of escape analysis.) In fact, Ur-Scheme doesn't even do register allocation, and it does full dynamic type checking at run-time. I imagine that if you wanted to get performance only 2 or 3 or 4 times worse than C compiled with GCC (which is not exactly a gold standard itself, these days) then you'd need to start hauling out the hairy techniques. But it already beats the pants off all the interpreters I can find, and at least for itself, it generates faster code than the three well-known Scheme compilers I've tried on it (mzc, Gambit-C, and Chicken.) (I suspect that Stalin and Ikarus would do better...)

Interesting Features

It has a one-level parser, like old versions of Microsoft BASIC; there's no separate tokenizer. I wrote the parser after reading the "ichbins" parser, and there is definitely some influence visible.

It contains relatively little mutation. Although almost every line of the compiler has "side effects" like outputting lines of assembly code, there are fairly few locations where the compiler's internal state is mutated. I count 25 calls to set! and string-set! in the 1600 lines of code, including the standard library. The I/O side effects are similarly centralized — there are three calls to display, no calls to newline, and one call to read-char in the whole code. When was the last time you saw a 1600-line C program with only 25 assignment statements?

To simplify code generation, it follows BigFORTH and colorForth by treating the x86 as a stack machine.

What little optimization it does is done by optimization macros that rewrite certain patterns in the source — for example, (if (not a) b c) gets rewritten into (if a c b), and (if (null? x) y z) is rewritten into a call to a special built-in (%ifnull x y z). As described in R5RS, these macros also compile the various Scheme control structures (cond, case, and, and so on) into more fundamental control structures like if and lambda.

At present, because it compiles one expression at a time, you can interactively type code at it and see the compiled assembly results. Unfortunately this isn't very much fun because the generated code is so verbose. Darius discovered this feature.

The assembly code it produces can compile to statically linked, standalone executables because it doesn't depend on any external libraries — it just uses the Linux system call interface directly. They can be relatively small; a minimal executable is about 17KiB, because it includes the whole standard library (not just the parts that are used).

Future Work

First, of course, there are the bugs to fix, especially including the absence of a garbage collector.

If this compiler has any merit at all, it is in its small size and comprehensibility. Darius Bacon's brilliant 385-line "ichbins" self-compiling Lisp-to-C compiler is much better at that, being less than one-fifth of the size. So one direction of evolution is to figure out what can be stripped out of it. ichbins has no arithmetic, no closures, no separate string or symbol type, and only one side effect.

There's probably a lot that could be made clearer, as well.

Another direction is to try to improve the speed and size of its output code a bit. For example:

Lambda-lifting would keep let-expressions from consing storage for all of the arguments of the outer procedure, and if you were slightly more ambitious, you could even optimize away the procedure call entirely. That would make them a lot smaller, too.
Although it does some inlining now, it could do more. Also, the existing inlining would benefit from a preliminary scanning pass to see which globals are never mutated and can therefore be inlined safely without breaking Scheme's semantics. Such a scan could also remove the type-check, the two fetches, and the computed call instruction, on the vast majority of procedure calls, and the argument-count check in the vast majority of procedure prologues.
Especially conditional expressions could benefit from some more inlining and integration with their parent if; currently only null?, eq?, and not have this special handling. (Although that covers about half of the conditionals in the compiler.) The procedure-call-and-return-and-test overhead totals about 18 instructions. The implementation of case is particularly inefficient because it actually calls memv.
A little bit of peephole optimization usually goes a long way, but I'm not sure it would in this case; we could get a little bit of dead-code removal (e.g. procedure epilogues preceded by an unconditional jump, as in the very common case of a tail call) and there's the occasional pop %eax; push %eax; movl foo, %eax sequence. But overall I don't think there's much of an advantage yet.

The procedure prologues and epilogues are pretty large, and they could probably be shrunk quite a bit, especially for non-variadic procedures. It might be possible to arrange the stack frame so that a simple leave; ret $12 sequence, or something similarly simple, could replace the current eleven bytes of crap, at least for non-variadic procedures. Tail calls are particularly horrific; here's a three-argument tail call:

 804977f:	50                   	push   %eax
 8049780:	a1 20 89 06 08       	mov    0x8068920,%eax
 8049785:	8d 5c 24 08          	lea    0x8(%esp),%ebx
 8049789:	8b 55 fc             	mov    0xfffffffc(%ebp),%edx
 804978c:	8b 65 f8             	mov    0xfffffff8(%ebp),%esp
 804978f:	8b 6d f4             	mov    0xfffffff4(%ebp),%ebp
 8049792:	ff 33                	pushl  (%ebx)
 8049794:	ff 73 fc             	pushl  0xfffffffc(%ebx)
 8049797:	ff 73 f8             	pushl  0xfffffff8(%ebx)
 804979a:	52                   	push   %edx
 804979b:	e8 fb e8 ff ff       	call   804809b <ensure_procedure>
 80497a0:	8b 58 04             	mov    0x4(%eax),%ebx
 80497a3:	ba 03 00 00 00       	mov    $0x3,%edx
 80497a8:	ff e3                	jmp    *%ebx

That's 43 bytes of code! In this case, as is typical, the function being called (char-between?) is never redefined; it takes a fixed number of arguments and is never called with any other number of arguments; and it is not a closure. These observations would allow the following abbreviated version:

50                   	push   %eax
8d 5c 24 08          	lea    0x8(%esp),%ebx
8b 55 fc             	mov    0xfffffffc(%ebp),%edx
8b 65 f8             	mov    0xfffffff8(%ebp),%esp
8b 6d f4             	mov    0xfffffff4(%ebp),%ebp
ff 33                	pushl  (%ebx)
ff 73 fc             	pushl  0xfffffffc(%ebx)
ff 73 f8             	pushl  0xfffffff8(%ebx)
52                   	push   %edx
e9 xx xx xx xx          jmp    _char_betweenP_4

That's only 28 bytes, a substantial improvement, but still ugly. If we could skip the prologue of tail-called non-closure procedures and just jump directly to the body, then in cases where the caller has at least as many arguments as the callee, we could also avoid copying the saved %esp, %ebp, and return address to a new location on the stack. In that case we could simply do something like this:

8d 5c 24 08          	lea    0x8(%esp),%ebx
8b 65 f8             	mov    0xfffffff8(%ebp),%esp
ff 33                	pushl  (%ebx)
ff 73 fc             	pushl  0xfffffffc(%ebx)
50                   	push   %eax
e9 xx xx xx xx          jmp    _char_betweenP_4

That's only 18 bytes. Still not that pretty, but no longer disgusting.

Storing symbols in a skip list or hash table would probably help a lot with programs like compilers that use string->symbol a lot.

Another direction is to make it more practical. For example, improved error-reporting, a profiler, an FFI, and access to files and sockets would be handy.

Another direction is to make it run faster.