httpdito: a useful web server in just over 2000 bytes, compiled
===============================================================

We all have moments where we do things we regret. Sometimes when
we’re drunk, or under the influence of bad friends, or lonely. In my
case, I guess it was probably sleep deprivation that led me to think,
“You know what? Writing a web server in assembly language sounds like
a fun thing to do.” It turned out pretty well and revealed some
surprising things about modern Linux systems. You might even find it
useful. You probably didn’t expect you could write a useful HTTP
server in just over 2000 bytes of machine code.

1. What httpdito is useful for
2. What httpdito is
3. How to build httpdito
4. How to configure httpdito
5. Performance
6. Security
7. What it tells us about Linux
8. What it suggests about operating system design
9. Are you insane?

To the extent possible under law, Kragen Javier Sitaker has waived
all copyright and related or neighboring rights to Dercuano. This
work is published from Argentina.

What httpdito is useful for
---------------------------

httpdito can serve static files from your filesystem on Linux. It’s
probably the easiest way to serve static files, and like any static
web server written in a language faster than Tcl, it’s plenty fast
for just about anything. And it’s almost smaller than 2K, so you can
copy it anywhere easily.

It serves up the files in the directory where you run it. It won’t
serve up files elsewhere, unless you have a symlink in there. It
doesn’t serve up files whose names begin with “.”, such as
“.htaccess” or “.htpasswd”. This could be a problem if your intended
use is serving up non-bare Git repositories. You can serve up a bare
clone, though:

    $ git clone --bare calculusvaporis calculusvaporis.git
    Cloning into bare repository 'calculusvaporis.git'...
    done.
    $ ./server &
    [1] 29425
    $ (cd calculusvaporis.git; git update-server-info)
    $ time git clone http://localhost:8086/calculusvaporis.git cv-clone
    Cloning into 'cv-clone'...

    real    0m0.259s
    user    0m0.016s
    sys     0m0.048s
    $

It *does* %-decode URLs, so it should work fine to serve up files
whose names are in UTF-8, not just ASCII.

It won’t run on non-x86 machines, at least without something like
qemu-user; it doesn’t do PHP, CGI, or any embedded scripting
language; it doesn’t automatically generate directory indexes; it
doesn’t do HTTPS or content negotiation; it doesn’t log.

With a little hacking, you could probably build it into some other
application as a low-overhead embedded HTTP server.

What httpdito is
----------------

It’s for fun. It’s a full, useful web server in just over two
kilobytes. (Earlier versions were smaller, but an expanded set of
MIME types has pushed it up to 2060 bytes.)

It’s written in self-contained i386 assembly language; while it uses
a lot of macros, all of those macros are defined in the source file
itself. Similarly, although it calls a couple of functions, those
functions are defined in httpdito, not loaded from a library. It
makes system calls, though.

It’s well-commented: it’s more intended to be read than to be
executed. One commenter exclaimed that it looked more like Python
than assembly.

The system calls it uses are open(), close(), read(), write(),
fork(), _exit(), waitpid(), alarm(), socket(), bind(), setsockopt(),
listen(), and accept(). I think that none of them have changed their
semantics since 4.2BSD introduced the socket calls around 1983,
although in Linux some of the constants have different values. In
that sense, httpdito is unreasonably portable. I think it’s probably
binary-compatible with even very early Linux versions, although I
haven’t tried it.
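The %-decoding mentioned above is simple enough to sketch in C. This
is not httpdito’s actual code (which is assembly); the names
`hexval` and `percent_decode` are invented for this sketch, which
shows the logic a server needs to apply to the request path before
open()ing the file:

```c
/* Sketch (in C, not httpdito's assembly) of in-place %XX decoding. */
static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;                        /* not a hex digit */
}

/* Decode in place: "%41" becomes "A"; malformed escapes pass through
   unchanged. Because UTF-8 filenames are just byte sequences, decoding
   "%C3%A9" yields the two bytes of an encoded "é" with no special
   Unicode handling needed. */
void percent_decode(char *s)
{
    char *out = s;
    while (*s) {
        if (s[0] == '%' && hexval(s[1]) >= 0 && hexval(s[2]) >= 0) {
            *out++ = (char)(hexval(s[1]) * 16 + hexval(s[2]));
            s += 3;
        } else {
            *out++ = *s++;
        }
    }
    *out = '\0';
}
```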
How to build httpdito
---------------------

I’ll write a Makefile soon, but for now:

    gcc -m32 -nostdlib server.s -o server.bloated

and then optionally

    objcopy -S -R .note.gnu.build-id server.bloated server

If you don’t do that part, it will be over 4kB instead of under 2kB.
If you really care about that, you’re probably insane. (Take note
that I’ve spent many hours achieving its current size.)

How to configure httpdito
-------------------------

Before starting it, cd to the directory containing the files you
want it to serve up. If you want it to run on a different port, edit
the line near the top of server.s that says

    .equiv port_number, 8086

and set it to the port you want.

In its current configuration, httpdito will run up to 2048 concurrent
child processes. This sounds like an absurdly huge number, since
setting Apache’s MaxClients setting to more than about 100 or 1000 is
likely to start giving you OOM kills. But that’s because Apache
process working sets are typically a few megabytes, like 4 to 16
megabytes. By contrast, ps usually displays httpdito’s RSS as between
4 and 16 *kilo*bytes. That means you can have a thousand times as
many processes before you run out of memory. In practice, 2048
processes on my netbook is no big deal; this is the same netbook
where Firefox struggles to render Twitter, often begging me for
permission to kill the JavaScript that is bogging it down.

Performance
-----------

httpdito typically performs a little bit worse than a default Debian
configuration of Apache on the same hardware. For example, on my
8-core laptop, both Apache and httpdito manage in the neighborhood of
20000 to 30000 requests per second, according to apachebench (ab),
but while httpdito tops out at about 1.8 gigabits per second of
outbound bandwidth, Apache manages closer to 2.5. But these numbers
aren’t really trustworthy, because the limit on them (the bottleneck)
is probably ab, not Apache or httpdito.
These numbers are not noticeably worsened by increasing the number of
concurrent clients up to 1000. The bigger point, though, is that even
httpdito is able to max out the gigabit Ethernet network card in the
machine. So you probably shouldn’t worry about performance with
httpdito.

Security
--------

I fucking hate dealing with security, but it’s a necessity. Assembly
language is not the language of choice for writing software that
safely handles untrusted input. It’s easy to write bugs that provide
an attacker with remote arbitrary code execution. However, httpdito
is extremely simple (296 CPU instructions as of 2013) and it’s
intended to be secure.

It is expected that an attacker on a remote machine can:

- Use arbitrary amounts of outgoing bandwidth by requesting files
  from httpdito.
- Deny service to other clients by opening 2048 concurrent
  connections to httpdito and continuing to open 64 new connections
  per second. (This could be mitigated in a variety of ways: a
  scoreboard by hashed IP address, and leaving SYN cookies enabled.)
- Use arbitrary amounts of CPU time by flooding httpdito with
  requests.
- Read the contents of the files in the directory where httpdito is
  running and its descendant directories, including via symlinks.

It is not expected that a remote attacker can:

- Exhaust your memory or your process table, causing problems for
  other processes on the machine where httpdito is running.
- Execute code they provide, for example via a buffer overflow.
- Cause any data to be written to your filesystem.
- Read any files outside the directory tree httpdito is serving up.
- Provide a link to a third party that causes attacker-provided data
  to be interpolated into HTML, JS, PDF, or other executable data
  sent by the server, in an executable way. (That is, httpdito is
  intended to be XSS-proof.)
- Receive any information provided by other users of the server,
  except a crude measure of server load.
- Deny service to other clients without continually sending new
  requests.

Please let me know if you find a weakness (a way that a remote
attacker can do something unexpected), or something I should have
thought about.

What it tells us about Linux
----------------------------

1. Hardware is fast now.

Apache became popular in 1994 and 1995 in large part because it
performed better than its ancestor, NCSA httpd, while being a drop-in
replacement for it. Philip Greenspun [memorably explained in 1997,
speaking of CGI][0], “Computers do not like to fork 500 000 times a
day.” Spawning a new process is a fairly expensive operation.
Apache’s “preforking” model, in which a pool of worker processes sits
around waiting to handle incoming requests, each handling many
requests (say, 100) before it dies, essentially eliminated that
expense.

[0]: http://philip.greenspun.com/panda/server-programming

But, at the time, a Linux web server might have been running on a
133MHz Pentium. As a rough estimate, that machine could do a 32-bit
add every cycle, 133 million times a second, while my laptop, a
2.5 GHz Core i7, can do ten 64-bit adds every cycle on each of its
four cores, or 100 billion 64-bit adds per second, roughly
equivalent to 200 billion 32-bit adds. So it’s about 1500 times as
fast. One second for my laptop is like 25 minutes for the 1994 Linux
web server. So some things that were impractical then are practical
now. That is, in 1997, forking 500 000 times a day was expensive.
Now my laptop can fork 500 000 times in about 20 seconds.

2. Processes are cheap now.

But it’s not all about hardware. If you had thousands of concurrently
runnable processes on 1994 Linux, it would spend most of its time in
the scheduler. But, in recent years, the scheduler has improved a
lot; its cost no longer grows linearly with the number of runnable
processes (the 2.6 “O(1)” scheduler eliminated the linear scan, and
its successor, CFS, is O(log n)). That means it’s now practical to
have truly ungodly numbers of processes awake, as long as they’re
doing useful work.
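The claim that forking is now cheap is easy to sanity-check. Here is
a small C sketch (the name `forks_per_second` is mine, not
httpdito’s) that spawns children which _exit() immediately, reaps
each one, and reports the fork rate:

```c
#define _POSIX_C_SOURCE 199309L
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Measure the cost of fork(): spawn n children that exit at once,
   reaping each before forking the next, and return forks/second. */
double forks_per_second(int n)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) _exit(0);       /* child: exit immediately */
        if (pid < 0) return -1.0;     /* fork failed */
        waitpid(pid, 0, 0);           /* reap before the next fork */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec)
                + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return n / secs;
}
```

On recent hardware this kind of loop typically reports tens of
thousands of forks per second, which is consistent with the
500 000-forks-in-20-seconds figure above.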
In a sense, this is “back to the future”: cheap processes were one of
the interesting novelties of Unix. The idea of spawning off multiple
time-shared processes for a single command line was rather bizarre
for the 1970s, but made it easy to do computations that did not fit
comfortably into the 16-bit memory address space of the time, by
breaking them into multiple cooperating processes. But there’s cheap,
and then there’s *really* cheap.

3. System calls are fast now.

Henry (now Alexia) Massalin’s 1992 dissertation, on the experimental
operating system “Synthesis”, passionately advocates making system
calls cheaper, and describes several novel ways of achieving this.
Among other things, expensive system calls force you to add
complexity to your userspace programs to reduce the number of calls
to them: I/O buffering, caching, attempts to duplicate kernel logic
in userspace, and so on; and they create incentives for “batch”
system calls like writev() and poll(). And this complexity creates
its own performance problems.

Synthesis was noticeably faster than traditional Unixes on the same
hardware, and dramatically faster when measuring latency rather than
bandwidth --- when your write() calls are fast enough to efficiently
pass a single audio sample through a pipe, you can get much lower
end-to-end latency on an audio processing pipeline of several
processes.

Synthesis has been very influential on OS implementors since the
dissertation was published, although to my knowledge, the code has
never been published, and the results have never been reproduced. But
many of the techniques for reducing system call costs have been
incorporated into Linux. Partly as a consequence, httpdito can use a
one-kilobyte buffer for reading file data and passing it to the
network stack, yet still push out gigabits per second; that means
each read()/write() pair is taking less than 16 microseconds.
If you increase the size of this buffer to reduce the number of
calls, it doesn’t help, because apachebench is still the bottleneck.

And this helps a lot with making processes cheap. Whenever you fork()
a process, you inherently need to make a copy of all the memory pages
that process writes to. Every time you can use a 1K buffer instead of
a 32K buffer, you can use one page instead of eight pages, which
means several times as many processes can fit into your L2 cache.

4. Most of the work of an HTTP server is in the kernel.

httpdito is currently a 2060-byte executable, of which 1019 bytes are
its 296 instructions of machine code. If you look at this code, most
of it looks like this:

    80482c3:  05 e0 96 04 08          add    $0x80496e0,%eax
    80482c8:  89 c1                   mov    %eax,%ecx
    80482ca:  89 eb                   mov    %ebp,%ebx
    80482cc:  b8 03 00 00 00          mov    $0x3,%eax
    80482d1:  cd 80                   int    $0x80
    80482d3:  c7 05 b8 96 04 08 c3    movl   $0x80495c3,0x80496b8
    80482da:  95 04 08
    80482dd:  83 f8 00                cmp    $0x0,%eax
    80482e0:  7d 13                   jge    0x80482f5
    80482e2:  e8 d1 fd ff ff          call   0x80480b8
    80482e7:  89 eb                   mov    %ebp,%ebx
    80482e9:  b8 06 00 00 00          mov    $0x6,%eax
    80482ee:  cd 80                   int    $0x80

That is, it consists largely of system calls (`int $0x80`) and
setting up arguments for them. According to the time command,
httpdito spends about 40 times as much “system” time as it does
“user” time. That is, for every cycle spent in httpdito’s code
itself, another 40 cycles are spent in the system calls httpdito is
invoking. In essence, httpdito is just scripting the Linux kernel’s
capabilities in assembly language.

It could be even more extreme; there’s a sendfile() system call which
might save a loop of about 10 instructions reading and writing file
contents. I chose not to use sendfile(), because although it would
probably be faster, httpdito is already fast enough, and it’s not
clear that it would be simpler.
What it suggests about operating system design
----------------------------------------------

The Linux kernel is already capable of efficiently creating tens of
thousands of processes per second on modern hardware, if not hundreds
of thousands, and of hosting at least thousands, if not millions, of
live processes. So maybe it’s possible, even without changing
kernels, to break up new software systems into many tiny processes
like httpdito, with a working set of under 100 kilobytes. Then the
kernel can provide fault isolation between the processes, and
integrating objects written in different languages should be little
harder than integrating objects written in the same language. Erlang
has been exploring this kind of design, and the results have been
promising. But Erlang is slow, even with HiPE, and you have to write
your programs in Erlang’s weird language (which, to be fair, is
pretty nice, even if it’s weird).

Imagine, for example, a feature-complete web server built this way. A
front-end process, functionally equivalent to a load balancer, parses
the initial HTTP request and sends it to a router process, encoded
with something like protobufs or Thrift. The router process examines
the path and decides whether to send it to a filesystem-server
backend like httpdito, a PHP server, or something else. httpdito
sends the response back to the load balancer, which posts a message
to the access-logger process and sends the response back to the HTTP
client. Perhaps the parts of the system rendezvous with one another
using the filesystem, for example with Unix-domain sockets or named
pipes; and perhaps a supervisor process is responsible for starting
up enough backend processes to handle the current demand.

Rather than shared libraries, in this world, you would have server
processes that accept and respond to requests.
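The “server process instead of shared library” idea can be sketched
in a few lines of C. Everything here is invented for the sketch (a
real system would rendezvous through the filesystem via a named
Unix-domain socket rather than socketpair()): the child process is a
tiny one-request “service” that upcases whatever it is sent, and the
parent is its client.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

/* Toy process-as-a-service: fork a child that answers one request
   over a Unix-domain socket pair, then reap it. */
ssize_t upcase_service(const char *req, char *reply, size_t cap)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;
    if (fork() == 0) {                 /* child: the "service" */
        char buf[256];
        ssize_t n = read(sv[1], buf, sizeof buf);
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] >= 'a' && buf[i] <= 'z') buf[i] -= 32;
        write(sv[1], buf, n);          /* send the response back */
        _exit(0);
    }
    write(sv[0], req, strlen(req));    /* parent: the client */
    ssize_t n = read(sv[0], reply, cap);
    wait(NULL);                        /* reap the service process */
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

The kernel gives you fault isolation for free: if the service
crashes, the client just sees a closed socket, not a corrupted
address space.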
In essence, this is the original Unix architecture from the 1970s, adapted to the much larger machines of today by using many more processes, rather than only slightly more processes, each containing many times the memory of a PDP-11. This kind of architecture has major potential pitfalls, as the GNU Hurd team found. It’s almost a distributed system, with many of the difficulties of debugging distributed systems and making them robust. On the plus side, we have a lot more knowledge now about how to do that, thanks to Roy Fielding, the Erlang folks, Jeff Dean, and so on. Are you insane? --------------- Yes. It’s 2013. You have to be insane to write application software in assembly language. You got the bonus package: here’s the executable ------------------------------------------------ Feed this text to the uudecode command. (This is an older version.) begin 755 httpdito.gz M'XL("(NYM5("`W-E97?O,0M9TQ$+AK@/@E88HAO=ARJE#AQR26!F MG;H6K+A^$&'##P6A)>Y+/[2K0\_GN21-VDW$%^Y][_D]O^=WS_L\[]TENE_N ML=ELI#QJB)V@]=HYKJX=UGQM$6\G%#RMA"</V"YB^:^" M??4CK@ZONP#L!9W:DI\C#Q\<^7??]K'^:V\Z<'F"J_/ESQ\U5V?6;[V8FDO- MIIT6=KNKRT"EA;'L6[!D,(LJ\,3R;EA]BYV9QS[>XKCA6\PX4'UC/,]V]O8M M_V&:YALGQF:][S=-H3"$3&U]P(WET\`!P+!9=J81)'-X;PQ8P$T>4VH!-'F/ MQ&^EOD^Y76!D<,H]NIWW?(GW1D)BW2VH1CS(?B7/__;-`L_^1:3@0M@$L9G<3^^>=]M M8X<5!0'LJ7B=R:._\$ORWAJK+\9_AO&G(-ZH*3,3#>S9HG,*G9W@3._#QQJU M%L5THLK*FY!DVHUXCF)^S:6=?`<(=OP2WF1-.R%ML_PNIPJ&O6J_7Q7W M>[9Q^;>_8,.%DOZ*#919F]U2FL3)<#J:W'9C+G0,?CS-Y36%WIA*UF ML%HY=!I\N0?QWR?G+D#Y6G!CX8FJC9T/+%3>HN3&1F+_N<3]038E-V#\;L.N_#QY>Z!?F,-D_^"$"Z"6>+ M41&?+Y%KMQ8>:EY\P\A!&E64(:K':$2F'EU3/1$Y%-7#DD?4AMUO2[*D1D(> M255E11BDHAS^#Y([*&H2,"TCX,`FE3]QFA(:DEAKVR9@L[Z1A+1"/3!?36+( M46*L]?BQ4WW=KQ_O/GSD2%^;19:`M,L*:)E`.QB1PZU%%W8F&M&8))>`*U"# M/%QB*"3%\(D$3OT9,<)B$8P94-0A6%1)1.MH?W^OQR=X:;NWG;ZJP,%0=#GL M='0I,B@R-QN)20JJ397TDS%A4C\D/R)$30!C6F,C%( M!&02(2PRD0A!32/_8SQ2ZAV>)0/^60:L]^U;_STX&DH\=%G_+W#X)8 ?#;)X^.VY!,(N[D%>6:L\D+>WRBY+_P-R"FIKB`<````` ` end