Body of page last modified 1998-12-07. Lots of new information since then; among other things, there will be a max of 6 CPUs per board, it will probably cost more like $2500 if I remember correctly, things should be out in March or so of 1999, there are some photos of the prototypes, and -- by the way -- the machine that handles the discussion list is down for the count. This paragraph added 1999-02-26. Eventually I'll rewrite this page.

Beowulf on StrongARM boards for $2000

Abstract

The StrongARM SA-110 (originally from DEC, now from Intel) is comparable in integer speed to a Pentium II of the same clock speed, and much cheaper, smaller, and more efficient. Its current top clock speed is 233 MHz. A couple of folks are talking about building some PCI cards with six of these on them and selling them for $2000 each. You'll be able to get boards with eight processors for a bit more -- `more like $3000'. Each card has two PCI buses onboard, 128-256K flash memory (and you'll get tools you can use to program the flash memory), and you could stick several boards in one PC -- so you could have 32 ARM processors if you currently have four free PCI slots.

This page probably needs to be reorganized.

History

Simon Thorpe <thorpe@cerco.ups-tlse.fr> of the Centre de Recherche Cerveau et Cognition of the Université Paul Sabatier was working on some neural-network simulation software, and wanted a faster machine to run it on. He's been working with Neil A. Carson <neil@causality.com> of Chalice Technologies on some hardware to solve his problem. On 1998-09-09, he sent a mail to several people who had previously talked about running StrongARM Beowulfs, me <kragen@pobox.com> among them.

He said:

Briefly, what we are proposing is the following. The basic board will be a standard PCI board that could be used in PCs, Macs or indeed any other computer with PCI slots in it (the attached GIF file gives my impression of what the board would look like). [A more recent version, with some other possible daughtercards, is now available. The other daughtercards are currently vaporware; also, the layout is a little inaccurate. -KJS] Each board would be fitted with 6 or 8 processor modules, split between the two sides of the board, which would plug into standard SODIMM memory slots (Small Outline Dual In-Line Memory Modules), [It looks like there'll be eight SODIMM slots on each board, each of which can take a CPU -KJS] the same type of slots that are used for memory expansion on laptop computers. Each of these daughter boards would have a StrongARM SA-110 micro-processor, a Digital 21285-A PCI bridge circuit, and 32 or 64 Mbytes of SDRAM. In effect, each board would be a computer in its own right, and would run a version of Unix (Linux or NetBSD). It would also have an IP address, allowing messages to be sent efficiently via the PCI bus from processor to processor. [Each board will have two PCI buses onboard. -KJS] Neil and the other programmers at Causality will look after the message passing mechanisms, probably using I₂O protocols. [Neil would like to emphasize that we're just talking about using the I₂O PCI messaging stuff, not the whole proprietary I₂O framework. -KJS]
Essentially, developing will be like developing for any other Unix computers. The idea is that, given the boards can be communicated with using IP, you can telnet them, ftp them, they can nfs mount their filing systems, etc. To start up remote processes, you can rsh them etc. Tools will be gcc/g++, et al, and it should be straightforward to implement standard message passing systems such as PVM and MPI.
There are a couple of limitations. The StrongARM SA-110 doesn't have a floating point unit, but as long as you only want to do integer calculations, it goes like a rocket. Our own software (which is integer only) runs as fast on the StrongARM as on a Pentium II of the same clock speed. This is really impressive, since the StrongARM doesn't have a second level cache (unlike the Pentium II) and the code has (as yet) not been optimised for the StrongARM at all.
The reasons for choosing the StrongARM are pretty straightforward. It is very small, doesn't get hot (< 1 watt) and is cheap. This means that it becomes perfectly feasible to imagine packing large numbers of StrongARMs in a very small space without having to worry about overheating (imagine trying to do the same thing with Pentium IIs). In addition, although the future of StrongARM was once in doubt (it was co-developed by Digital and Advanced Risc Machines), it has now been bought up by Intel who have recently announced that they will be investing heavilly in StrongARM development. See http://developer.intel.com/design/strong/ for details.
The current top-of-the-range StrongARM runs at 233 MHz, and this is what Neil Carson is proposing to use in this first batch. However, in the not-too-distant future, there should be much faster StrongARMs with faster memory buses. One of the nice features of this daughter board arrangement is that it would be pretty simple and cost effective to do a new batch of boards using whatever the best technology is at that moment. Of course, one of the nice things about using this sort of parallel hardware is that (if you have a good problem), even last years technology will still be useful to you -- not like conventional PCs where you feel that you have to buy a new computer every six months if you don't want to be obsolete.

He hoped to find buyers for 15-17 boards to help justify the cost of the initial run of 25. (This has now happened.)

Over the past few months I have been in discussion with Neil Carson from Causality Ltd. http://www.causality.com/ in the UK about the possibility of developing hardware for running a neural network simulator that we have developed in our lab. Right now, the project is looking very promising, and Neil has said that he would be happy to go ahead and build a first batch of 25 boards as soon as he can be reasonably confident that there will be enough buyers. I myself will be buying 8-10 boards, but we need a few other interested people to get the project off the ground. I was wondering whether you [the people to whom he sent this message] might be interested.
. . .
``What about prices?'' you may be saying. Well, it should be possible to do such a board for 1200 pounds ($2000) on this first run of 25. Each board would only take one PCI slot, so with four free PCI slots you could put up to 24 processors in a single PC! [At the time, he thought each board would only host six processors; now it looks like we can do eight. A board with eight processors will cost more than $2000. -KJS]

I discussed this with the other people Simon had sent the mail to, plus Neil Carson. Over the next two days, the following additional information surfaced:

Such a machine would be really cool for some of us, but it would probably be useless for floating-point work, since software floating-point is orders of magnitude slower than hardware floating-point. The Pentium Pro, the Pentium II, and the AMD K6 have floating-point math that's nearly as fast as their integer math, so one K6 or Pentium II would probably be a better buy than two of these six-processor boards.
One person commented that he had plenty of students to rewrite code as fixed-point, and I (Kragen) commented that there's a lot of older code for x86 machines that studiously avoids floating-point.
According to Miguel Barreiro Paz <enano@ceu.fi.udc.es>, there are already 275MHz StrongARMs, if he remembers correctly. Brian C Merrell <merrell@eng.umd.edu> said that these are actually 233MHz units successfully overclocked to 278MHz.
Linux's TCP top speeds on dual 100Mbps Ethernet segments are well below the theoretical 200Mbps top speed of these segments; it's likely that the TCP code (or its equivalent) will be a bottleneck for communication over the PCI bus, which is better than a gigabit per second, and will soon be more than two gigabits per second. Miguel pointed out that he'd never gotten better than 30MB/sec (which is 240 megabits/sec) using Linux TCP even over the loopback interface.
Neil Carson commented:

We will provide a TCP/IP bypass later on, so you call socket with something like PF_BYPASS instead, which looks up an IP address->PCI map, but sends the data direct rather than encapsulating it in several layers.
William T. Rankin <wrankin@ee.duke.edu> commented:

However, I am a little skeptical at the potential for such special purpose cards. Historically, the ``multi-cpus on a daughtercard'' approach has always been of technical interest, but has rarely, if ever, suceeded commercially. A solution in search of a problem, as it were.

I added:

I've often wondered why this is. It seems like it would be such a great solution for image processing, for ray-tracing (POVRay uses all fixed-point math, IIRC), for cellular automata, for neural networks, for recompiling glibc (imagine a glibc2.1 compile in only three hours!), etc.
I remember something in Dr. Dobb's about putting a bunch of i860s on a VLB daughtercard (1993 or so?). I don't remember if it ever got widely used, or if not, why not.
If people are buying IBM SP2s with sixteen nodes, and used to buy Sequent Symmetrys with sixteen processors, why have things like this never succeeded commercially? It's essentially a straightforward hybrid of the two.
Neil Carson commented:

1) When we say I₂O, we mean using the I₂O messaging facility on PCI, not the ``complete I₂O framework'' which is expensive and non-free.
2) Yes, people can telnet into a CPU, build with gcc/g++/whatever. The CPU cards will probably NFS mount filesystems from the host.
. . .
Nope, we've not written it [the software to do the internode communication] yet but it shouldn't be particularly difficult. The host will need to be a machine with a device driver for a card. You can bet that initially this will be a PC of some sort. I don't know how bus-independent Linux's device drivers are, but if they are then with a bit of luck the same driver may work on other Linux platforms too.
Miguel committed to buying a board for testing.
Neil Carson commented further, in response to questions:

Yes, it [each PCI board] will have a bridge. There will actually be two local PCI buses on the card, bridged to the main system bus by a two-bus bridge.
. . .
Yep, got it [the ability for the SA-110s to access the host computer's main memory] in one---if someone wanted to.
Simon Thorpe commented further:

. . . I'm trying to drum up as much support as possible. If the first run isn't just 25 boards (which is Simtec's minimum viable run) but 250, then it gets cheaper for everyone!
Miguel suggested that it would be really cool if the SA-110s could directly talk to other things on the host's PCI bus, and I agreed. Neil said, ``I'm pretty certain that's possible, but I don't know the mechanics by which it would work.''

So it sounds like a pretty incredible piece of hardware. Off-the-shelf free software could make it different from other similar projects that have gone before -- it'll be easy to use (at least for those of us who have done massively parallel processing on Linux before) and won't have a long lag time to develop software for it.

I'm going to see if I can buy one. I don't know yet if I'll have money.

New developments

Enough people have signed up to get some that they're going to go ahead with the first run.
Possible future developments: daughtercards with other things on them, like EIDE controllers, Fast Ethernet cards, or the ARM 10 processor with floating-point. Faster PCI bridge circuits for better memory bandwidth. Faster PCI buses -- 64 bits and/or 66 MHz. None of these exist now.
The current schedule for the hardware looks like the PCI cards and the daughtercards should both be out by early November 1998. We don't know about software for sure. And it's possible the hardware might need redesigns, which might delay things.
Simtec, who is doing the hardware design, is at http://www.simtec.demon.co.uk/.
A guy named Mark van Doesburg has already done a lot of the work on IP over PCI -- he had a very similar idea of his own, and was going to build some boards like these himself.

Conclusion

If you want to buy some of these boards, let either Simon <thorpe@cerco.ups-tlse.fr> or Neil <neil@causality.com> know.

There's a mailing list about this card; you can subscribe by sending an empty email to sa-beowulf-subscribe@kragen.dnaco.net.

kragen@pobox.com