Body of page last modified 1998-12-07. Lots of new information since then; among other things, there will be a maximum of six CPUs per board; it will probably cost more like $2500, if I remember correctly; things should be out in March or so of 1999; there are some photos of the prototypes; and -- by the way -- the machine that handles the discussion list is down for the count. This paragraph added 1999-02-26. Eventually I'll rewrite this page.
The StrongARM SA-110 (originally from DEC, now from Intel) is comparable in integer speed to a Pentium II of the same clock speed, and it is much cheaper, smaller, and more efficient. Its current top clock speed is 233 MHz. A couple of folks are talking about building PCI cards with six of these on them and selling them for $2000 each; you'll be able to get boards with eight processors for a bit more -- ``more like $3000''. Each card has two PCI buses onboard and 128-256K of flash memory (and you'll get tools you can use to program the flash), and you can stick several boards in one PC -- so with four free PCI slots you could have 32 ARM processors.
This page probably needs to be reorganized.
Simon Thorpe <thorpe@cerco.ups-tlse.fr> of the Centre de Recherche Cerveau et Cognition of the Université Paul Sabatier was working on some neural-network simulation software, and wanted a faster machine to run it on. He's been working with Neil A. Carson <neil@causality.com> of Chalice Technologies on some hardware to solve his problem. On 1998-09-09, he sent a mail to several people who had previously talked about running StrongARM Beowulfs, me <kragen@pobox.com> among them.
He said:
Briefly, what we are proposing is the following. The basic board will be a standard PCI board that could be used in PCs, Macs or indeed any other computer with PCI slots in it (the attached GIF file gives my impression of what the board would look like). [A more recent version, with some other possible daughtercards, is now available. The other daughtercards are currently vaporware; also, the layout is a little inaccurate. -KJS] Each board would be fitted with 6 or 8 processor modules, split between the two sides of the board, which would plug into standard SODIMM memory slots (Small Outline Dual In-Line Memory Modules), [It looks like there'll be eight SODIMM slots on each board, each of which can take a CPU -KJS] the same type of slots that are used for memory expansion on laptop computers. Each of these daughter boards would have a StrongARM SA-110 micro-processor, a Digital 21285-A PCI bridge circuit, and 32 or 64 Mbytes of SDRAM. In effect, each board would be a computer in its own right, and would run a version of Unix (Linux or NetBSD). It would also have an IP address, allowing messages to be sent efficiently via the PCI bus from processor to processor. [Each board will have two PCI buses onboard. -KJS] Neil and the other programmers at Causality will look after the message passing mechanisms, probably using I2O protocols. [Neil would like to emphasize that we're just talking about using the I2O PCI messaging stuff, not the whole proprietary I2O framework. -KJS]
Essentially, developing will be like developing for any other Unix computer. The idea is that, since the boards can be communicated with using IP, you can telnet to them, ftp to them, they can NFS-mount their filesystems, and so on. To start up remote processes, you can rsh to them, etc. Tools will be gcc/g++ et al., and it should be straightforward to implement standard message passing systems such as PVM and MPI.
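[To make that concrete: if MPI does get implemented, ordinary portable MPI code ought to run on the nodes unchanged. The sketch below is just a generic MPI hello-world in C, assuming some MPI implementation (MPICH, say) has been built with gcc for the ARM nodes; it isn't anything Simon or Neil have written. -KJS]

    /* hello_mpi.c -- a minimal MPI program of the kind that ought to run
     * unchanged on the SA-110 nodes, assuming an MPI implementation
     * (MPICH, for example) has been built with gcc for them. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start the MPI runtime    */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which node am I?         */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many nodes in total? */

        printf("hello from node %d of %d\n", rank, size);

        MPI_Finalize();                        /* shut down cleanly        */
        return 0;
    }

[You'd compile it with something like `mpicc hello_mpi.c -o hello_mpi' and launch it across the nodes with `mpirun -np 8 hello_mpi'. -KJS]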
There are a couple of limitations. The StrongARM SA-110 doesn't have a floating point unit, but as long as you only want to do integer calculations, it goes like a rocket. Our own software (which is integer only) runs as fast on the StrongARM as on a Pentium II of the same clock speed. This is really impressive, since the StrongARM doesn't have a second level cache (unlike the Pentium II) and the code has (as yet) not been optimised for the StrongARM at all.
The reasons for choosing the StrongARM are pretty straightforward. It is very small, doesn't get hot (< 1 watt) and is cheap. This means that it becomes perfectly feasible to imagine packing large numbers of StrongARMs into a very small space without having to worry about overheating (imagine trying to do the same thing with Pentium IIs). In addition, although the future of the StrongARM was once in doubt (it was co-developed by Digital and Advanced RISC Machines), it has now been bought up by Intel, who have recently announced that they will be investing heavily in StrongARM development. See http://developer.intel.com/design/strong/ for details.
The current top-of-the-range StrongARM runs at 233 MHz, and this is what Neil Carson is proposing to use in this first batch. However, in the not-too-distant future, there should be much faster StrongARMs with faster memory buses. One of the nice features of this daughterboard arrangement is that it would be pretty simple and cost-effective to do a new batch of boards using whatever the best technology is at that moment. Of course, one of the nice things about using this sort of parallel hardware is that (if you have a good problem) even last year's technology will still be useful to you -- not like conventional PCs, where you feel you have to buy a new computer every six months if you don't want to be obsolete.
He hoped to find buyers for 15-17 boards to help justify the cost of the initial run of 25. (This has now happened.)
Over the past few months I have been in discussion with Neil Carson from Causality Ltd. http://www.causality.com/ in the UK about the possibility of developing hardware for running a neural network simulator that we have developed in our lab. Right now, the project is looking very promising, and Neil has said that he would be happy to go ahead and build a first batch of 25 boards as soon as he can be reasonably confident that there will be enough buyers. I myself will be buying 8-10 boards, but we need a few other interested people to get the project off the ground. I was wondering whether you [the people to whom he sent this message] might be interested.
. . .
``What about prices?'' you may be saying. Well, it should be possible to do such a board for 1200 pounds ($2000) on this first run of 25. Each board would only take one PCI slot, so with four free PCI slots you could put up to 24 processors in a single PC! [At the time, he thought each board would only host six processors; now it looks like we can do eight. A board with eight processors will cost more than $2000. -KJS]
I discussed this with the other people Simon had sent the mail to, plus Neil Carson. Over the next two days, the following additional information surfaced:
One person commented that he had plenty of students to rewrite code as fixed-point, and I (Kragen) commented that there's a lot of older code for x86 machines that studiously avoids floating-point.
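To give an idea of what ``rewriting code as fixed-point'' amounts to, here's a tiny sketch of 16.16 fixed-point arithmetic in C. The format and the names are mine, picked purely for illustration; they're not from anybody's actual simulator.

    /* fixed.c -- a tiny sketch of 16.16 fixed-point arithmetic in C.
     * The 16.16 format and these names are purely illustrative. */
    #include <stdio.h>

    typedef long fix;                    /* 16 integer bits, 16 fraction bits */

    #define FIX_ONE    (1L << 16)
    #define INT2FIX(i) ((fix)(i) << 16)
    #define FIX2INT(f) ((long)((f) >> 16))

    /* Multiplying two 16.16 numbers needs a 64-bit intermediate result,
     * which gcc provides as `long long' even on the 32-bit StrongARM. */
    static fix fix_mul(fix a, fix b)
    {
        return (fix)(((long long)a * b) >> 16);
    }

    int main(void)
    {
        fix three_halves = INT2FIX(3) / 2;                 /* 1.5       */
        fix result = fix_mul(three_halves, INT2FIX(10));   /* 1.5 * 10  */
        printf("1.5 * 10 = %ld\n", FIX2INT(result));       /* prints 15 */
        return 0;
    }

It's all adds, subtracts, shifts, and integer multiplies -- exactly the sort of thing the SA-110 is fast at.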
Neil Carson commented:
We will provide a TCP/IP bypass later on, so you call socket with something like PF_BYPASS instead, which looks up an IP address->PCI map, but sends the data direct rather than encapsulating it in several layers.
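[Nothing like PF_BYPASS exists yet, of course; the attraction is that it would be nearly a drop-in replacement at socket-creation time. The sketch below is mine, not Neil's: the PF_BYPASS constant is hypothetical, so the code falls back to an ordinary PF_INET socket wherever it isn't defined, and the port and address are made up. -KJS]

    /* bypass.c -- a sketch of how a PF_BYPASS socket might be used.
     * PF_BYPASS is hypothetical (the bypass hasn't been written yet),
     * so this falls back to a plain TCP socket where it doesn't exist. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #ifndef PF_BYPASS
    #define PF_BYPASS PF_INET          /* no bypass available: plain IP */
    #endif

    int main(void)
    {
        struct sockaddr_in peer;
        int s = socket(PF_BYPASS, SOCK_STREAM, 0);
        if (s < 0) { perror("socket"); return 1; }

        memset(&peer, 0, sizeof peer);
        peer.sin_family = AF_INET;
        peer.sin_port = htons(5000);                   /* made-up port         */
        peer.sin_addr.s_addr = inet_addr("10.0.0.2");  /* made-up node address */

        /* The point of the bypass: this connect() and the writes after it
         * would go straight over the PCI bus to whichever node owns
         * 10.0.0.2, instead of through all the layers of the IP stack. */
        if (connect(s, (struct sockaddr *)&peer, sizeof peer) < 0)
            perror("connect");

        close(s);
        return 0;
    }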
However, I am a little skeptical about the potential for such special-purpose cards. Historically, the ``multiple CPUs on a daughtercard'' approach has always been of technical interest, but has rarely, if ever, succeeded commercially. A solution in search of a problem, as it were.
I added:
I've often wondered why this is. It seems like it would be such a great solution for image processing, for ray-tracing (POVRay uses all fixed-point math, IIRC), for cellular automata, for neural networks, for recompiling glibc (imagine a glibc2.1 compile in only three hours!), etc.
I remember something in Dr. Dobb's about putting a bunch of i860s on a VLB daughtercard (1993 or so?). I don't remember if it ever got widely used, or if not, why not.
If people are buying IBM SP2s with sixteen nodes, and used to buy Sequent Symmetrys with sixteen processors, why have things like this never succeeded commercially? It's essentially a straightforward hybrid of the two.
1) When we say I2O, we mean using the I2O messaging facility on PCI, not the ``complete I2O framework'' which is expensive and non-free.
2) Yes, people can telnet into a CPU, build with gcc/g++/whatever. The CPU cards will probably NFS mount filesystems from the host.
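[NFS mounting from the host is bog-standard Unix, so the boot scripts on each CPU card would just need something like the lines below. The hostname and export paths are made up; nothing about the actual layout has been decided. -KJS]

    # /etc/fstab on a CPU card: mount /usr and home directories from the
    # host PC over NFS (hostname `host' and the paths are illustrative)
    host:/export/arm/usr   /usr    nfs   ro   0 0
    host:/export/home      /home   nfs   rw   0 0

[Or, by hand: `mount -t nfs host:/export/arm/usr /usr'. -KJS]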
. . .
Nope, we've not written it [the software to do the internode communication] yet, but it shouldn't be particularly difficult. The host will need to be a machine with a device driver for the card. You can bet that initially this will be a PC of some sort. I don't know how bus-independent Linux's device drivers are, but if they are, then with a bit of luck the same driver may work on other Linux platforms too.
Yes, it [each PCI board] will have a bridge. There will actually be two local PCI buses on the card, bridged to the main system bus by a two-bus bridge.
. . .
Yep, got it [the ability for the SA-110s to access the host computer's main memory] in one---if someone wanted to.
. . . I'm trying to drum up as much support as possible. If the first run isn't just 25 boards (which is Simtec's minimum viable run) but 250, then it gets cheaper for everyone!
So it sounds like a pretty incredible piece of hardware. Off-the-shelf free software could make it different from similar projects that have gone before -- it'll be easy to use (at least for those of us who have done massively parallel processing on Linux before), and there won't be a long lag while software is developed for it.
I'm going to see if I can buy one. I don't know yet if I'll have money.
If you want to buy some of these boards, let either Simon <thorpe@cerco.ups-tlse.fr> or Neil <neil@causality.com> know.
There's a mailing list about this card; you can subscribe by sending an empty email to sa-beowulf-subscribe@kragen.dnaco.net.