[Tfug] recommendations for installing a cluster..

Wed Dec 31 08:22:50 MST 2008

Hello all,

I'm looking for opinions or advice on reconfiguring and reinstalling my
lab's cluster.

History:
My lab has a 32-node Dell cluster for simulations, and it recently imploded
during a facility cooling meltdown. It seems that no hardware was damaged,
but the disks on the master node had lost their partition table. I
suspect that was unrelated to the cooling problem and only showed up because
we had to reboot for the first time in ages. I wasn't too upset, because the
system has been in a downward spiral for some time and we've been looking
for an excuse to reinstall everything (Red Hat cruft + too many admins
cycling through trying to do things differently = bad times).

Hardware:
1 masternode with 6 swappable raid drives (2 in raid1, 73GB, for /; 4 in
raid5, 440GB, originally for /home but currently unused)
32 dual-proc slave nodes with smallish (maybe 100GB) disks and CD drives
1 storage server added recently: dual quad cores, 8GB RAM, and 7TB software
raid5 for serving /home

Plan:
So far, I'm pretty well bent on Debian/Ubuntu because I would otherwise go
through apt withdrawal. I also haven't used Red Hat much since about 2001, and
I find myself doing things like spending 5 minutes to remember that I
should be editing /etc/sysconfig/network-scripts/ifcfg-eth0 instead of
/etc/network/interfaces. We also have plans to add a second cluster with
newer hardware in an adjacent rack and probably have the storage server
serve /home to that as well. Now what I think I want to do is diskless nodes
booting from a chroot on the master node. But I have questions:

1. Should I go diskless and use PXE boot from a chroot on the masternode (I
like this) or just install on the nodes' drives? It seems like it will be
easier to maintain and upgrade when all I have to do is work with the
chroot. Perhaps the only disadvantage is bootup time? These should reboot
infrequently, so I think that should be fine.

2. I think it was purely a kludge because the storage was added later, but
the masternode was mounting the homedirs and then serving those to the nodes.
It seems like the MN and the slaves should all just mount /home from the
storage server directly, right? Any reason to do otherwise?

3. For queuing, I'm leaning towards Torque/Maui, which looks like the newer
version of OpenPBS. We were previously using SGE. Any opinions/experience?

4. If I leave the hardware raid config alone, I have 73GB raid1 as sda and
440GB raid5 as sdb. I would plan to use sda as /. Since raid1 should be
faster than raid5, I thought I would put swap on sda as well. Any reason to
do otherwise?

5. Now this one is far less important, but if I go with diskless boot nodes
and am using the storage server for /home, that leaves 32 nodes with 100GB
drives in them not doing anything, plus that 440GB raid5 array on the
masternode. Any clever use I should put these disks to?
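
To put a rough number on question 5, here is a quick sketch of how much
space would sit idle. It assumes the per-node figure really is about 100GB
(that number is only an estimate above), uses the 440GB listed under
Hardware for the masternode array, and ignores any filesystem or redundancy
overhead. The function name is just illustrative.

```python
# Rough tally of the idle disk space described in question 5.
# All figures are the approximate ones quoted in this message.
NODE_COUNT = 32          # slave nodes
NODE_DISK_GB = 100       # "maybe 100GB" per node, so treat as an estimate
MASTER_RAID5_GB = 440    # unused raid5 array, as listed under Hardware


def idle_capacity_gb(node_count, node_disk_gb, master_array_gb):
    """Return (space on node disks, total idle space) in GB."""
    node_total = node_count * node_disk_gb
    return node_total, node_total + master_array_gb


node_total, grand_total = idle_capacity_gb(
    NODE_COUNT, NODE_DISK_GB, MASTER_RAID5_GB
)
print(f"node disks: {node_total} GB, with masternode raid5: {grand_total} GB")
```

Under those assumptions that is roughly 3.2TB across the nodes, or about
3.6TB counting the masternode array, which is around half the size of the
7TB storage server.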

I think there are other questions I may have as I get going on this, but I
welcome any comments or suggestions anyone has.
Thanks,
JDR

--
Jeremy D. Rogers, Ph.D.
Postdoctoral Fellow
Biomedical Engineering
Northwestern University