httpdito: a useful web server in just over 2000 bytes, compiled
===============================================================
We all have moments where we do things we regret. Sometimes when
we’re drunk, or under the influence of bad friends, or lonely. In my
case, I guess it was probably sleep deprivation that led me to think,
“You know what? Writing a web server in assembly language sounds like
a fun thing to do.”
It turned out pretty well and revealed some surprising things about
modern Linux systems. You might even find it useful. You probably
didn’t expect you could write a useful HTTP server in just over 2000
bytes of machine code.
1. What httpdito is useful for
2. What httpdito is
3. How to build httpdito
4. How to configure httpdito
5. Performance
6. Security
7. What it tells us about Linux
8. What it suggests about operating system design
9. Are you insane?
To the extent possible under law, Kragen Javier Sitaker has waived all
copyright and related or neighboring rights to Dercuano. This work is
published from Argentina.
What httpdito is useful for
---------------------------
httpdito can serve static files from your filesystem on Linux. It’s
probably the easiest way to serve static files, and like any static
web server written in a language faster than Tcl, it’s plenty fast for
just about anything. And it’s almost smaller than 2K, so you can copy it
anywhere easily.
It serves up the files in the directory where you run it. It won't
serve up files elsewhere, unless you have a symlink in there.
It doesn't serve up files whose names begin with “.”, such as
“.htaccess” or “.htpasswd”. This could be a problem if your intended
use is serving up non-bare Git repositories. You can serve up a bare
clone, though:
$ git clone --bare calculusvaporis calculusvaporis.git
Cloning into bare repository 'calculusvaporis.git'...
done.
$ ./server &
[1] 29425
$ (cd calculusvaporis.git; git update-server-info)
$ time git clone http://localhost:8086/calculusvaporis.git cv-clone
Cloning into 'cv-clone'...
real 0m0.259s
user 0m0.016s
sys 0m0.048s
$
It *does* %-decode URLs, so it should work fine to serve up files
whose names are in UTF-8, not just ASCII.
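Percent-decoding is a small enough job to sketch here. This C fragment is a hypothetical illustration of the idea, not a transcription of httpdito's assembly:

```c
/* Decode %xx escapes in a URL path in place.  A sketch of the idea,
   not httpdito's actual code.  Decoded bytes are passed through
   verbatim, which is why UTF-8 filenames work. */
static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;                  /* not a hex digit */
}

static void percent_decode(char *s)
{
    char *out = s;
    while (*s) {
        if (s[0] == '%' && hexval(s[1]) >= 0 && hexval(s[2]) >= 0) {
            *out++ = (char)(hexval(s[1]) * 16 + hexval(s[2]));
            s += 3;
        } else {
            *out++ = *s++;
        }
    }
    *out = '\0';
}
```

So a request for “caf%C3%A9.html” opens the file whose name contains the two raw UTF-8 bytes of “é”.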
It won’t run on non-x86 machines, at least without something like
qemu-user; it doesn’t do PHP, CGI, or any
embedded scripting language; it doesn’t automatically generate
directory indexes; it doesn’t do HTTPS or content negotiation; it
doesn’t log.
With a little hacking, you could probably build it into some other
application as a low-overhead embedded HTTP server.
What httpdito is
----------------
It’s for fun.
It’s a full, useful web server in just over two kilobytes. (Earlier
versions were smaller, but an expanded set of MIME types has pushed it
up to 2060 bytes.)
It’s written in self-contained i386 assembly language; while it uses a
lot of macros, all of those macros are defined in the source file
itself. Similarly, although it calls a couple of functions, those
functions are defined in httpdito, not loaded from a library. It
makes system calls, though.
It’s well-commented: it’s more intended to be read than to be
executed. One commenter exclaimed that it looked more like Python
than assembly.
The system calls it uses are open(), close(), read(), write(), fork(),
_exit(), waitpid(), alarm(), socket(), bind(), setsockopt(), listen(),
and accept(). I think that none of them have changed their semantics
since 4.3BSD, released in 1986, although in Linux some of the constants
have different values. In that sense, httpdito is unreasonably
portable. I think it’s probably binary-compatible with even very
early Linux versions, although I haven’t tried it.
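The control flow those calls add up to is the classic fork-per-connection server. Heavily simplified, and in C rather than assembly, it looks roughly like this (a sketch of the skeleton, with error handling and the actual HTTP parsing omitted):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

/* The socket()/setsockopt()/bind()/listen() setup. */
int make_listener(unsigned short port)
{
    int one = 1;
    struct sockaddr_in addr;
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0)
        return -1;
    setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
    memset(&addr, 0, sizeof addr);    /* sin_addr zeroed: INADDR_ANY */
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);      /* httpdito defaults to 8086 */
    if (bind(s, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(s, 128) < 0) {
        close(s);
        return -1;
    }
    return s;
}

/* The fork-per-connection accept loop. */
void serve(int s)
{
    for (;;) {
        int conn = accept(s, NULL, NULL);
        if (conn < 0)
            continue;
        if (fork() == 0) {            /* child: handle one request */
            alarm(4);                 /* crude timeout on slow clients */
            /* ... read() the request, open()/read()/write() the
               file, then close() and _exit() ... */
            _exit(0);
        }
        close(conn);                  /* parent: reap finished children */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;
    }
}
```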
How to build httpdito
---------------------
I’ll write a Makefile soon, but for now:
gcc -m32 -nostdlib server.s -o server.bloated
And then optionally
objcopy -S -R .note.gnu.build-id server.bloated server
If you don’t do that part, it will be over 4kB instead of just over 2kB.
If you really care about that, you’re probably insane. (Take note
that I’ve spent many hours achieving its current size.)
How to configure httpdito
-------------------------
Before starting it, cd to the directory containing the files you want
it to serve up.
If you want it to run on a different port, edit the line near the top
of server.s that says
.equiv port_number, 8086
and set it to the port you want.
In its current configuration, httpdito will run up to 2048 concurrent
child processes. This sounds like an absurdly huge number, since
setting Apache’s MaxClients setting to more than about 100 or 1000 is
likely to start giving you OOM kills. But that’s because Apache
process working sets are typically a few megabytes, like 4 to 16
megabytes. By contrast, ps usually displays httpdito RSS as between 4
and 16 *kilo*bytes. That means you can have a thousand times as many
processes before you run out of memory.
In practice, 2048 processes on my netbook is no big deal, the same
netbook where Firefox struggles to render Twitter, often begging me
for permission to kill the JavaScript that is bogging it down.
Performance
-----------
httpdito typically performs a little bit worse than a default Debian
configuration of Apache on the same hardware. For example, on my
8-core laptop, both Apache and httpdito manage in the neighborhood of
20000 to 30000 requests per second, according to apachebench (ab), but
while httpdito tops out at about 1.8 gigabits per second of outbound
bandwidth, Apache manages closer to 2.5. But these numbers aren’t
really trustworthy because the limit on them (the bottleneck) is
probably ab, not Apache or httpdito.
These numbers are not noticeably worsened by increasing the number of
concurrent clients up to 1000.
The bigger point, though, is that even httpdito is able to max out the
gigabit Ethernet network card in the machine. So you probably
shouldn’t worry about performance with httpdito.
Security
--------
I fucking hate dealing with security, but it’s a necessity.
Assembly language is not the language of choice for writing software
that safely handles untrusted input. It’s easy to write bugs that
provide an attacker with remote arbitrary code execution.
However, httpdito is extremely simple (296 CPU instructions as of 02013)
and it’s intended to be secure. It is expected that an
attacker on a remote machine can:
- Use arbitrary amounts of outgoing bandwidth by requesting files from
httpdito.
- Deny service to other clients by opening 2048 concurrent connections
to httpdito and continuing to open 64 new connections per second.
(This could be mitigated in a variety of ways, such as keeping a
scoreboard keyed by hashed IP address, or leaving SYN cookies enabled.)
- Use arbitrary amounts of CPU time by flooding httpdito with
requests.
- Read the contents of the files in the directory where httpdito is
running and its descendant directories, including via symlinks.
It is not expected that a remote attacker can:
- Exhaust your memory or your process table, causing problems for
other processes on the machine where httpdito is running.
- Execute code they provide, for example via a buffer overflow.
- Cause any data to be written to your filesystem.
- Read any files outside the directory tree httpdito is serving up.
- Provide a link to a third party that causes attacker-provided data
to be interpolated into HTML, JS, PDF, or other executable data sent
by the server, in an executable way. (That is, httpdito is intended
to be XSS-proof.)
- Receive any information provided by other users of the server,
except a crude measure of server load.
- Deny service to other clients without continually sending new
requests.
Please let me know if you find a weakness (a way that a remote
attacker can do something unexpected), or something I should have
thought about.
What it tells us about Linux
----------------------------
1. Hardware is fast now.
Apache became popular in 1994 and 1995 in large part because it
performed better than its ancestor, NCSA httpd, while being a drop-in
replacement for it. Philip Greenspun [memorably explained in 1997,
speaking of CGI][0], “Computers do not like to fork 500 000 times a day.”
Spawning a new process is a fairly expensive operation. Apache’s
“preforking” model, in which a pool of worker processes sits
around waiting to handle incoming requests, each handling many
requests (say, 100) before it dies, essentially eliminated that
expense.
[0]: http://philip.greenspun.com/panda/server-programming
But, at the time, a Linux web server might have been running on a
133MHz Pentium. As a rough estimate, that machine could do a
32-bit add every cycle, 133 million times a second, while my
laptop, a 2.5 GHz Core i7, can do ten 64-bit adds every cycle on
each of its four cores, or 100 billion 64-bit adds per second,
roughly equivalent to 200 billion 32-bit adds. So it’s about 1500
times as fast. One second for my laptop is like 25 minutes for the
1994 Linux web server. So some things that were impractical then
are practical now.
That is, in 1997, forking 500 000 times a day was expensive. Now
my laptop can fork 500 000 times in about 20 seconds.
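That claim is easy to check for yourself. A throwaway benchmark along these lines (forking and immediately reaping, with hypothetical names and a small N so it finishes quickly) gives a rough per-fork cost on a given machine:

```c
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Rough fork() benchmark: fork and immediately reap n children,
   returning wall-clock seconds per fork.  A sketch; scale n up
   for a serious measurement. */
double seconds_per_fork(int n)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);           /* child does nothing and exits */
        waitpid(pid, NULL, 0);  /* parent reaps it immediately */
    }
    gettimeofday(&t1, NULL);
    return ((t1.tv_sec - t0.tv_sec) +
            (t1.tv_usec - t0.tv_usec) / 1e6) / n;
}
```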
2. Processes are cheap now.
But it’s not all about hardware. If you had thousands of
concurrently runnable processes on 1994 Linux, it would spend most
of its time in the scheduler. But, in recent years, the scheduler
has improved a lot: the 2.6 kernels gained an O(1) scheduler, and
its successor, CFS, is only O(log n) in the
number of runnable processes. That means it’s now practical to
have truly ungodly numbers of processes awake, as long as they’re
doing useful work.
In a sense, this is “back to the future”: cheap processes were one
of the interesting novelties of Unix. The idea of spawning off
multiple time-shared processes for a single command line was
rather bizarre for the 1970s, but made it easy to do computations
that did not fit comfortably into the 16-bit memory address space
of the time, by breaking them into multiple cooperating processes.
But there’s cheap, and then there’s *really* cheap.
3. System calls are fast now.
Henry (now Alexia) Massalin’s 1992 dissertation, on the
experimental operating system “Synthesis”, passionately advocates
making system calls cheaper, and describes several novel ways of
achieving this. Among other things, expensive system calls force
you to add complexity to your userspace programs to reduce the
number of calls to them: I/O buffering, caching, attempts to
duplicate kernel logic in userspace, and so on; and they create
incentives for “batch” system calls like writev() and poll(). And
this complexity creates its own performance problems.
Synthesis was noticeably faster than traditional Unixes on the
same hardware, and dramatically faster when measuring latency
rather than bandwidth --- when your write() calls are fast enough
to efficiently pass a single audio sample through a pipe, you can
get much lower end-to-end latency on an audio processing pipeline
of several processes.
Synthesis has been very influential on OS implementors since the
dissertation was published, although to my knowledge, the code has
never been published, and the results have never been reproduced.
But many of the techniques for reducing system call costs have
been incorporated into Linux.
Partly as a consequence, httpdito can use a one-kilobyte buffer
for reading file data and passing it to the network stack, yet
still push out gigabits per second; that means each read()/write()
pair is taking less than 16 microseconds. If you increase the
size of this buffer to reduce the number of calls, it doesn’t
help, because apachebench is still the bottleneck.
And this helps a lot with making processes cheap. Whenever you
fork() a process, you inherently need to make a copy of all the
memory pages that process writes to. Every time you can use a 1K
buffer instead of a 32K buffer, you can use one page instead of
eight pages, which means several times as many processes can fit
into your L2 cache.
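That inner loop is just this, give or take; here it is sketched in C with a hypothetical helper name, where httpdito does the same thing with raw read and write syscalls:

```c
#include <unistd.h>

/* The heart of serving a file: shuttle bytes from an open file
   descriptor to the connection through a small 1-kilobyte buffer.
   Returns 0 on success, -1 on error.  A C sketch of httpdito's
   assembly loop, not its actual code. */
static int copy_fd(int in_fd, int out_fd)
{
    char buf[1024];
    ssize_t n;
    while ((n = read(in_fd, buf, sizeof buf)) > 0) {
        char *p = buf;
        while (n > 0) {                 /* write() may be partial */
            ssize_t w = write(out_fd, p, (size_t)n);
            if (w < 0)
                return -1;
            p += w;
            n -= w;
        }
    }
    return n < 0 ? -1 : 0;              /* n == 0 means clean EOF */
}
```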
4. Most of the work of an HTTP server is in the kernel.
httpdito is currently a 2060-byte executable, of which 1019 bytes
are its 296 instructions of machine code. If you look at this
code, most of it looks like this:
80482c3: 05 e0 96 04 08 add $0x80496e0,%eax
80482c8: 89 c1 mov %eax,%ecx
80482ca: 89 eb mov %ebp,%ebx
80482cc: b8 03 00 00 00 mov $0x3,%eax
80482d1: cd 80 int $0x80
80482d3: c7 05 b8 96 04 08 c3 movl $0x80495c3,0x80496b8
80482da: 95 04 08
80482dd: 83 f8 00 cmp $0x0,%eax
80482e0: 7d 13 jge 0x80482f5
80482e2: e8 d1 fd ff ff call 0x80480b8
80482e7: 89 eb mov %ebp,%ebx
80482e9: b8 06 00 00 00 mov $0x6,%eax
80482ee: cd 80 int $0x80
That is, it consists largely of system calls (`int $0x80`) and
setting up arguments for them.
According to the time command, httpdito spends about 40 times as
much “system” time as it does “user” time. That is, for every
cycle spent in httpdito’s code itself, another 40 cycles are spent
in the system calls httpdito is invoking.
In essence, httpdito is just scripting the Linux kernel’s capabilities
in assembly language. It could be even more extreme; there’s a
sendfile() system call which might save a loop of about 10
instructions reading and writing file contents. I chose not to
use sendfile(), because although it would probably be faster,
httpdito is already fast enough, and it’s not clear that it would
be simpler.
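For comparison, the read()/write() loop collapses to roughly one call with sendfile(). This is a Linux-specific sketch, assuming a kernel recent enough (2.6.33 or later) that the output descriptor need not be a socket:

```c
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* What serving a file looks like with sendfile(2): the kernel moves
   the bytes itself, so the data never passes through userspace.
   A sketch, not httpdito's code -- httpdito deliberately doesn't
   use sendfile(). */
static int send_file(int out_fd, int in_fd)
{
    struct stat st;
    off_t offset = 0;
    if (fstat(in_fd, &st) < 0)
        return -1;
    while (offset < st.st_size) {
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n <= 0)
            return -1;          /* error, or no forward progress */
    }
    return 0;
}
```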
What it suggests about operating system design
----------------------------------------------
The Linux kernel is already capable of efficiently creating tens of
thousands of processes per second on modern hardware, if not hundreds
of thousands, and of hosting at least thousands, if not millions, of
live processes. So maybe it’s possible, even without changing
kernels, to break up new software systems into many tiny processes
like httpdito, with a working set of under 100 kilobytes. Then the
kernel can provide fault isolation between the processes, and
integrating objects written in different languages should be little
harder than integrating objects written in the same language.
Erlang has been exploring this kind of design, and the results have
been promising. But Erlang is slow, even with HIPE, and you have to
write your programs in Erlang’s weird language (which, to be fair, is
pretty nice, even if it’s weird).
Imagine, for example, a feature-complete web server built this way. A
front-end process, functionally equivalent to a load balancer, parses
the initial HTTP request and sends it to a router process, encoded
with something like protobufs or Thrift. The router process examines
the path and decides whether to send it to a filesystem-server backend
like httpdito, a PHP server, or something else. httpdito sends the
response back to the load balancer, which posts a message to the
access-logger process and sends the response back to the HTTP client.
Perhaps the parts of the system rendezvous with one another using the
filesystem, for example with Unix-domain sockets or named pipes; and
perhaps a supervisor process is responsible for starting up enough
backend processes to handle the current demand.
Rather than shared libraries, in this world, you would have server
processes that accept and respond to requests.
In essence, this is the original Unix architecture from the 1970s,
adapted to the much larger machines of today by using many more
processes, rather than only slightly more processes, each containing
many times the memory of a PDP-11.
This kind of architecture has major potential pitfalls, as the GNU
Hurd team found. It’s almost a distributed system, with many of the
difficulties of debugging distributed systems and making them robust.
On the plus side, we have a lot more knowledge now about how to do
that, thanks to Roy Fielding, the Erlang folks, Jeff Dean, and so on.
Are you insane?
---------------
Yes. It’s 2013. You have to be insane to write application software
in assembly language.
You got the bonus package: here’s the executable
------------------------------------------------
Feed this text to the uudecode command, then gunzip the result.
(This is an older version.)
begin 755 httpdito.gz
M'XL("(NYM5("`W-E97?O,0M9TQ$+AK@/@E88HAO=ARJE#AQR26!F
MG;H6K+A^$&'##P6A)>Y+/[2K0\_GN21-VDW$%^Y][_D]O^=WS_L\[]TENE_N
ML=ELI#QJB)V@]=HYKJX=UGQM$6\G%#RMA"</V"YB^:^"
M??4CK@ZONP#L!9W:DI\C#Q\<^7??]K'^:V\Z<'F"J_/ESQ\U5V?6;[V8FDO-
MIIT6=KNKRT"EA;'L6[!D,(LJ\,3R;EA]BYV9QS[>XKCA6\PX4'UC/,]V]O8M
M_V&:YALGQF:][S=-H3"$3&U]P(WET\`!P+!9=J81)'-X;PQ8P$T>4VH!-'F/
MQ&^EOD^Y76!D<,H]NIWW?(GW1D)BW2VH1CS(?B7/__;-`L_^1:3@0M@$L9G<3^^>=]M
M8X<5!0'LJ7B=R:._\$ORWAJK+\9_AO&G(-ZH*3,3#>S9HG,*G9W@3._#QQJU
M%L5THLK*FY!DVHUXCF)^S:6=?`<(=OP2WF1-.R%ML_PNIPJ&O6J_7Q7W
M>[9Q^;>_8,.%DOZ*#919F]U2FL3)<#J:W'9C+G0,?CS-Y36%WIA*UF
ML%HY=!I\N0?QWR?G+D#Y6G!CX8FJC9T/+%3>HN3&1F+_N<3]038E-V#\;L.N_#QY>Z!?F,-D_^"$"Z"6>+
M41&?+Y%KMQ8>:EY\P\A!&E64(:K':$2F'EU3/1$Y%-7#DD?4AMUO2[*D1D(>
M255E11BDHAS^#Y([*&H2,"TCX,`FE3]QFA(:DEAKVR9@L[Z1A+1"/3!?36+(
M46*L]?BQ4WW=KQ_O/GSD2%^;19:`M,L*:)E`.QB1PZU%%W8F&M&8))>`*U"#
M/%QB*"3%\(D$3OT9,<)B$8P94-0A6%1)1.MH?W^OQR=X:;NWG;ZJP,%0=#GL
M='0I,B@R-QN)20JJ397TDS%A4C\D/R)$30!C6F,C%(
M!&02(2PRD0A!32/_8SQ2ZAV>)0/^60:L]^U;_STX&DH\=%G_+W#X)8
?#;)X^.VY!,(N[D%>6:L\D+>WRBY+_P-R"FIKB`<`````
`
end