Scripps Research Institute
Biological Mass Spectrometry Lab


Shamu Alpha Beowulf Cluster Frequently Asked Questions

Shamelessly patterned after (plagiarised from) the Avalon FAQ at LANL ... #113 on Nov. 1998 Top 500 Supercomputers List

  1. What is the Shamu cluster?
  2. What does each node contain?
  3. How are the machines connected together?
  4. How much did it all cost?
  5. Which operating system do you use?
  6. How long did it take to get the machine running?
  7. How well does the machine perform?
  8. Those were benchmarks, what about real production code?
  9. Which distribution of Linux do you use?
  10. Which kernel are you currently running?
  11. Which compilers did you use?
  12. What message passing library did you use?
  13. Who did you buy the hardware from?
  14. Linux is obviously less expensive than a commercial operating system like Solaris, HP-UX, AIX or Windows NT, but wouldn't one of those operating systems offer better performance?
  15. Isn't fast ethernet too slow for such a machine? Why didn't you use Myrinet?
  16. Where did the name Shamu come from?
  17. Why are you doing this?

  1. What is the Shamu cluster?
    Shamu is a cluster of twenty Alpha computers dedicated to running the SEQUEST algorithm. The computers, each running Red Hat Linux, are coordinated through the use of PVM software. The cluster was initially designed by Jimmy Eng. Because the cluster uses commodity computers and free operating systems to solve supercomputer-class problems, it is a Beowulf cluster.
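
    As a rough illustration of the master/worker idea (this is not the actual SEQUEST or PVM code), the sketch below deals 500 spectra out to twenty workers and gathers the results. Python threads stand in for the nodes, and the names split_round_robin, search_chunk, and run_master are hypothetical, made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_round_robin(items, n_workers):
    """Deal items out to n_workers lists, one at a time, like dealing cards."""
    chunks = [[] for _ in range(n_workers)]
    for i, item in enumerate(items):
        chunks[i % n_workers].append(item)
    return chunks


def search_chunk(chunk):
    """Hypothetical stand-in for a per-node search: tags each spectrum id."""
    return [(spectrum_id, "searched") for spectrum_id in chunk]


def run_master(spectrum_ids, n_workers):
    """Hand one chunk to each worker, then merge the results in id order."""
    chunks = split_round_robin(spectrum_ids, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        per_worker = list(pool.map(search_chunk, chunks))
    return sorted(r for results in per_worker for r in results)


# 500 spectra and twenty nodes, matching the figures quoted in this FAQ.
results = run_master(list(range(500)), 20)
print(len(results), "spectra searched")
```

    Each worker only talks to the master at the start (to get its chunk) and at the end (to return results), which is what makes this kind of job a good fit for a loosely coupled cluster.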

  2. What does each node contain?
    Each node of the machine is a DEC/Compaq Alpha workstation in an ATX case and contains
    Additionally, a single node (the "master" which also doubles as a compute node) contains a second fast ethernet card which allows it to connect the cluster to the external network. This node also has an additional 256MB SDRAM (two 128MB DIMMs, CAS3, PC100, ECC, 3.3V, unbuffered, 4 clock) installed in it to run the other applications we use.

    If I had to do it all over again, the SCSI hard drive, SCSI CDROM drive, and Matrox graphics cards would be replaced with cheaper ATA/IDE drives and minimal graphics cards. I would keep a decent graphics card in the single master node. Memory is so cheap now that I would also think about upping the memory in each node to at least 128MB and more likely 256MB.

    The IBM Deskstar 8.4GB IDE drives were not part of the initial cluster setup; they were added later for sequence database storage.

  3. How are the machines connected together?
    With a Bay Networks 350T fast ethernet switch. The 350T is a 16 port 10Base-T/100Base-TX autosense switch.

    A Raritan MCP16 MasterConsole KVM (keyboard, video, mouse) switch (16-nodes) is used to allow access to each node from one monitor, PS/2 keyboard, and PS/2 mouse set.

    With the two 16-port switches, we have the ability to add four more nodes to this cluster before requiring additional connectivity hardware.

  4. How much did it all cost?
    50-something thousand American dollars (in late 1997/early 1998). I would guess the price for the entire setup as of early 1999 would be in the mid-30K range (and falling!).

  5. Which operating system do you use?
    Started out using NT4 but ended up running Red Hat Linux.

  6. How long did it take to get the machine running?
    Too long ... the PVM port for NT-Win32/Alpha wasn't working correctly for me initially and Linux was a whole learning experience in its own right. If I had to do it all over again now, I could probably have everything up and running in a matter of a few days with most of the time involved in installing the OS on each box.

  7. How well does the machine perform?
    I'll get some industry-standard benchmarks up here as soon as I figure out what they are and how to run them.

  8. Those were benchmarks, what about real production code?
    For those mass spectrometrists out there, here are the overall times taken to search five hundred (500) MS/MS spectra against various databases using a PVM port of SEQUEST version 27 (benchmarks run on 12/98). A mass tolerance of +/- 1.0 amu was used in all searches; tryptic searches (including preprocessed database searches) allowed 1 internal cleavage site. DNA databases were translated and searched in the 3 forward reading frames.

    Database                        # sequence  PVM search   PVM search      PVM search
                                    entries     enzyme=none  enzyme=trypsin  preprocessed DB
                                                                             enzyme=trypsin
                                                (HH:MM:SS)   (HH:MM:SS)      (HH:MM:SS)
    ------------------------------  ----------  -----------  --------------  ---------------
    Unigene (clustered human ESTs)  52,277      00:38:37     00:15:58        00:01:21
    Non-redundant protein           382,465     01:43:56     01:24:14        00:01:37
    Human protein                   58,692      00:07:33     00:05:15        00:01:16
    Yeast ORFs                      6,351       00:02:38     00:01:38        00:00:52

    To give some perspective to these numbers, I would guess a majority of SEQUEST users out there experience search times anywhere from 1 to 10+ minutes for a single MS/MS spectrum through a protein database on Intel x86 and slower DEC Alpha boxes.
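
    To put the table on the same per-spectrum footing, the short script below converts each HH:MM:SS total into seconds and divides by the 500 spectra in the benchmark. The times are copied from the table above; the helper names to_seconds and per_spectrum are just for this example. The slowest case (non-redundant protein, enzyme=none) comes to 6236 seconds in total, or roughly 12.5 seconds per spectrum.

```python
SPECTRA = 500  # number of MS/MS spectra in each benchmark run

# (enzyme=none, enzyme=trypsin, preprocessed DB) totals from the table above
TIMES = {
    "Unigene (clustered human ESTs)": ("00:38:37", "00:15:58", "00:01:21"),
    "Non-redundant protein": ("01:43:56", "01:24:14", "00:01:37"),
    "Human protein": ("00:07:33", "00:05:15", "00:01:16"),
    "Yeast ORFs": ("00:02:38", "00:01:38", "00:00:52"),
}


def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a whole number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def per_spectrum(hms, n_spectra=SPECTRA):
    """Average wall-clock seconds per spectrum for one benchmark total."""
    return to_seconds(hms) / n_spectra


for database, row in TIMES.items():
    averages = ", ".join(f"{per_spectrum(t):.2f} s" for t in row)
    print(f"{database}: {averages}")
```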

    The benchmark times vary from run to run (by anywhere from a few seconds to a few minutes), probably because the communication across nodes varies (e.g. sometimes the slave processes all start up quickly ... other times it takes a little longer to start them all). Binaries compiled with the C compiler supplied with Digital Unix 4.0B don't seem to be appreciably faster than those compiled with gcc. Also, I timed searches with local databases stored on the IBM 5400 rpm IDE drive vs. the Quantum 7200 rpm Ultra SCSI drive, and there doesn't seem to be a significant difference in performance whether the databases reside on one drive or the other.

  9. Which distribution of Linux do you use?
    We use RedHat 5.0 and 6.2.

  10. Which kernel are you currently running?
    It is running the basic 2.0.30 kernel compiled with the 0.89F tulip ethernet driver written by Donald Becker at CESDIS, NASA Goddard Space Flight Center.

  11. Which compilers did you use?
    The standard gcc 2.0. compiler that came with RedHat 5.0.

  12. What message passing library did you use?
    I use the PVM (Parallel Virtual Machine) software package.

  13. Who did you buy the hardware from?
    We bought them from Aspen Systems and Lodgepole Technology Inc. (they worked together for this order). The IBM hard drives came from Hard Drives Northwest. The additional 256MB SDRAM for the master node was ordered from The Memory Man.

  14. Linux is obviously less expensive than a commercial operating system like Solaris, HP-UX, AIX or Windows NT, but wouldn't one of those operating systems offer better performance?
    We started out with NT4.0 but had a hard time getting PVM (3.4beta6) to run on it. Finally gave up and went with Linux. Linux (or any Unix) makes working with the cluster so much easier (for example, I can telnet to the master node and from there into the slave nodes ... something I didn't bother to figure out how to do in NT, since telnet there is a Windows GUI application).

  15. Isn't fast ethernet too slow for such a machine? Why didn't you use Myrinet?
    The problem we're trying to solve is coarse-grained and not I/O bound. Myrinet would have added significant cost to each node with little, if any, return in performance.
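
    A back-of-the-envelope way to see why a coarse-grained job tolerates a slow network is to compare transfer time with compute time per task. The sketch below does that arithmetic for the nominal 100 Mbit/s rate of 100Base-TX. The message size and compute time in the example are hypothetical round numbers, not measurements from Shamu, and the model ignores latency and protocol overhead.

```python
FAST_ETHERNET_BITS_PER_S = 100e6  # nominal 100Base-TX rate


def transfer_seconds(n_bytes, bits_per_s=FAST_ETHERNET_BITS_PER_S):
    """Ideal time to move n_bytes over the link (no latency or overhead)."""
    return n_bytes * 8 / bits_per_s


def comm_fraction(compute_s, n_bytes, bits_per_s=FAST_ETHERNET_BITS_PER_S):
    """Share of a task's total time spent on the wire rather than computing."""
    wire_s = transfer_seconds(n_bytes, bits_per_s)
    return wire_s / (wire_s + compute_s)


# Hypothetical task: 10 KB exchanged with the master, 1 second of computing.
example = comm_fraction(compute_s=1.0, n_bytes=10_000)
print(f"time on the wire: {example:.4%} of the task")
```

    When that fraction is tiny, a faster interconnect can only shave a tiny slice off the total run time, which is the argument made above for skipping Myrinet.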

  16. Where did the name Shamu come from?
    Seeking a witty name, our lab members took a few minutes and brainstormed ... the best name that we came up with is Shamu. Speed, strength, size, ... ????

  17. Why are you doing this?
    We needed a big/fast computer to run our analysis software and this solution seemed to be the best balance of cost and utility.


Webmaster <dtabb@scripps.edu>

Updated 7/3/2000