[Tfug] recommendations for installing a cluster..

Jeremy D Rogers jdrogers at optics.arizona.edu
Thu Jan 1 09:07:55 MST 2009


On Wed, Dec 31, 2008 at 11:15 AM, Dean Jones <dean.jones at gmail.com> wrote:
> Hi,
>
> I have managed (and currently do manage) a few clusters.  I'll throw in
> my 2c here.

Trust me, it's appreciated.

[snip]
> Check out OSCAR or SystemImager for automating the system installations.
>  That way when you add the new nodes it will take very little time.
>
> If you have some money, Scyld is worth looking at.  It automates software
> installs and back end cluster management as well as scheduling and metrics.
>
> These tools should not care which flavor of Linux you are using.

Heh. Ah, the benefit of not replying for a day. :-)

Another one that someone recommended to me for a diskless setup was
JessWulf. Has anyone heard of it?
https://www.middleware.georgetown.edu/confluence/display/CCF/JessWulf+-+A+Diskless+Beowulf+Cluster+Toolkit


>>
>> 1. Should I go diskless and use PXE boot from a chroot on the masternode
>> (I like this) or just install on the nodes' drives? It seems like it will be
>> easier to maintain and upgrade when all I have to do is work with the
>> chroot. Perhaps the only disadvantage is bootup time? These should reboot
>> infrequently, so I think that should be fine.
>
> This really depends on how your jobs/applications behave.
>
> Are you thinking of NFS mounting root from the server?  That really causes
> the host to spend a lot of network cycles just doing normal OS operations,
> not including any additional programs you are running.  And like you say,
> booting is very slow, especially if you have to reboot every node.
>
> It was slowing down one of our clusters terribly, but our workloads are
>  hard on the network anyway and pull large amounts of data to crunch on.  A
> small mathematical model would not see as much of a delay and might not
> notice the overhead of NFS root.

Those are really great points. We primarily run two bits of code. One
(Monte Carlo) is all CPU with no node-to-node communication (and
almost no RAM). The other is FDTD, which is very heavy on
communication. Perhaps this is a good reason to lean toward a local
install after all. In my head I was thinking that a diskless install
would only really tax the network during bootup, and that should be a
rare event. But you raise a good point: if ALL OS operations need
bandwidth, multiplied by the number of nodes, that could add up.
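
For reference, the diskless setup I had been picturing is roughly the
following. The paths, addresses, and kernel version are just
placeholders, and the chroot's initramfs would need to be built for
NFS booting, so treat this as a sketch only:

    # On the master, export the node chroot read-only (/etc/exports):
    /srv/node-root  10.0.0.0/255.255.255.0(ro,no_root_squash,async)

    # pxelinux.cfg/default served over TFTP: boot a kernel with an NFS root
    DEFAULT node
    LABEL node
        KERNEL vmlinuz-2.6.26
        APPEND initrd=initrd.img-2.6.26 root=/dev/nfs nfsroot=10.0.0.1:/srv/node-root ip=dhcp ro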

>> 2. I think it was purely a kludge because the storage was added later,
>> but the masternode was mounting homedirs and then serving those to the
>> nodes. It seems like the MN and the slaves should all just mount /home
>> from the storage server directly, right? Any reason to do it otherwise?
>
> No, mount from the server that is actually doing the serving; otherwise
> you are doubling up the network load on the master.  Perhaps there was a
> strange network-separation reason this was done, but that should be fixed.

That confirms my suspicion. I just want to make sure I wasn't crazy. I
think the reason was that it was the minimal change to the setup when
the storage server was added.
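
In other words, something like this, with the master and all the nodes
pointing at the storage box directly (the hostname, subnet, and export
path here are made up):

    # On the storage server (/etc/exports):
    /export/home  10.0.0.0/255.255.255.0(rw,async,no_subtree_check)

    # On the master and every node (/etc/fstab):
    storage:/export/home  /home  nfs  rw,hard,intr  0  0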

>>
>> 3. For queueing, I'm leaning towards Torque/Maui, which looks like the
>> newer version of OpenPBS. We were previously using SGE. Any
>> opinions/experience?
>>
>> 4. If I leave the hardware RAID config alone, I have 73GB RAID1 as sda
>> and 440GB RAID5 as sdb. I would plan to use sda as /. Since RAID1 should
>> be faster than RAID5, I thought I would put swap on sda as well. Any
>> reason to do otherwise?
>
> You do not want the nodes swapping over NFS.  You at least want that
> locally on their internal disks.

Yes, for sure. I meant only that local swap for the master node would
be on sda. I would of course not NFS-mount swap, although that might
be fun to watch sometime. I'm picturing something like the Terminator
dying in the pool of molten metal. :-)
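
For the master itself I'm picturing an fstab roughly like this, with /
and swap on the RAID1 and the big RAID5 left for data (the partition
numbers and the /data mount point are just placeholders):

    # Master node /etc/fstab (sketch)
    /dev/sda1   /       ext3    defaults,errors=remount-ro   0  1
    /dev/sda2   none    swap    sw                           0  0
    /dev/sdb1   /data   ext3    defaults                     0  2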

>
> root over NFS is bad enough, so keep at least something local.
>
> The cluster management tools I mentioned earlier all expect an internal
> disk for storing the OS, at least as a default.

One interesting thing I was considering with local swap, which I'm
sure is way more effort to set up than it's worth, is hibernation.
This system has been plagued with power losses, though they are much
rarer now that we have moved to a facility. Since each node only has
4 GB of RAM, it might not be completely out of the question to write
that to disk within the 15-30 minutes of UPS time we have. Wouldn't it
be cool to be able to hibernate the whole cluster and come back online
without interrupting anyone's week-long calculation?! Anyway, that's
at best on the back burner.
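
If I ever do get to it, I imagine it would be something as dumb as the
loop below, kicked off by the UPS monitoring software when the battery
runs low. Completely untested: the node names are placeholders, and it
assumes root ssh keys, kernels with suspend-to-disk enabled, local swap
at least as big as RAM, and a resume= parameter on each node's kernel
command line so they come back up afterwards.

    #!/bin/sh
    # Hibernate the compute nodes, then the master (sketch)
    for n in node01 node02 node03; do
        ssh root@$n 'sync; echo disk > /sys/power/state' &
    done
    wait
    sync
    echo disk > /sys/power/state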

>>
>> 5. Now this one is far less important, but if I go with diskless boot
>> nodes and am using the storage server for /home, that leaves 32 nodes
>> with 100GB drives in them not doing anything. There is also that 400GB
>> RAID5 array on the masternode. Any clever use I should put these disks to?
>
> At least put swap and /tmp (or something similar) onto the nodes' local
> disks.  Depending on what you are running on them, they may want some actual
> fast disk for storing temporary data.  A local slice would be ideal.  Our
> applications were written to take advantage of a local slice but like I
> mentioned before, they have to move a lot of data around.
>
> Personally I do not think that the payoff of management ease with a
> chroot/NFS root is enough to make up for the performance loss involved.
>
> The stability of our nodes has increased since moving away from NFS root as
> well.
>
> If you are using a scheduler and a node dies, you can remove that node from
> the list of available ones and no one should notice.
>
> Hopefully this helps, but clusters can be very specific to the
> job/application running.

OK, I think you almost have me convinced to go ahead and install
locally.  I may play with the chroot for a day or two and see how far
I get, but I'm starting to see your point here. As long as there is a
good reason (performance) to go with local installs, that may actually
be easier to set up anyway. I've been looking into setting up a DHCP
server and getting node naming to be consistent (possibly based on MAC
address), and it almost makes plain static IPs on the nodes sound easier.
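
Roughly what I have in mind for the dhcpd side, in case anyone spots a
problem with it (the subnet, MAC addresses, and names here are
invented, and I haven't tested it yet):

    # /etc/dhcp3/dhcpd.conf (sketch)
    ddns-update-style none;
    use-host-decl-names on;       # hand each node its host-declaration name

    subnet 10.0.0.0 netmask 255.255.255.0 {
        option routers 10.0.0.1;
        next-server 10.0.0.1;     # TFTP server, only needed for PXE booting
        filename "pxelinux.0";
    }

    host node01 { hardware ethernet 00:16:3e:00:00:01; fixed-address 10.0.0.101; }
    host node02 { hardware ethernet 00:16:3e:00:00:02; fixed-address 10.0.0.102; }
    # ...one entry per node, so names always follow the MAC addresses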

The IT guy here pointed me to
http://debianclusters.cs.uni.edu
which is a nice walkthrough of much of this stuff.

Thanks to all who've commented. It's all been very useful. This will
be quite a learning experience for me and will take some time, so keep
the comments and advice coming if you think of anything else.
JDR



