fai success story
Venkata
venkata at cs.uno.edu
Tue Feb 3 23:53:55 CET 2004
Hello all. Here's some info on the success I had using FAI to install
a 72-node Beowulf cluster at the Louisiana State University New Orleans.
----------------------------------------------------------
1. Set up a regular Debian box (woody), then ran "apt-get dist-upgrade"
to move it to testing. This machine acts as the Beowulf cluster head
node. The config is:
* Dual Xeon 2.2 GHz
* 2 GB RAM
* 2 x 36 GB Ultra160 SCSI disks
* 100 Mbps 3Com 3c59x NIC / 1000 Mbps Broadcom Tigon3 NIC
2. Installed fai 2.5.1 and fai-kernels 1.5.3 on this box to make it the
install server for the internal Beowulf subnet (eth1 is the Broadcom
Tigon3 NIC).
3. Set up the dhcp3 server to serve the fai install image -- a snippet
of the file is below:
# dhcpd.conf for fai
# replace FAISERVER with the name of your install server
# deny unknown-clients;
option dhcp-max-message-size 2048;
use-host-decl-names on;
#always-reply-rfc1048 on;
filename "/boot/fai/installimage";
# the server from which to load the initial boot file if different
# from server-name
#next-server FAISERVER;
subnet 192.168.1.0 netmask 255.255.255.0 {
    server-name "master";
    default-lease-time 6000;
    max-lease-time 6000;
    option subnet-mask 255.255.255.0;
    option broadcast-address 192.168.1.255;
    option domain-name-servers 192.168.1.1;
    option routers 192.168.1.1;
    option domain-name "linux.beowulf";
    option nis-domain "gumbo";
    option nis-servers 192.168.1.1;
    option root-path "/usr/lib/fai/nfsroot";
}
host gumbo01 {
    hardware ethernet 00:00:00:00:00:00;
    fixed-address gumbo01;
    option host-name "gumbo01";
}
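With 72 nodes, writing one host entry per machine by hand gets tedious.
Here is a minimal sketch of generating the entries with a shell function.
It assumes a list of "hostname MAC" pairs on stdin; the function name
gen_dhcp_hosts and that input format are made up for illustration and are
not part of FAI or dhcp3.

```shell
#!/bin/sh
# Hypothetical helper (not part of FAI): emit one dhcpd.conf host block
# per input line. Input on stdin: "<hostname> <mac-address>" pairs.
gen_dhcp_hosts() {
    while read -r name mac; do
        # Skip blank lines and comment lines in the input list.
        case "$name" in ''|'#'*) continue ;; esac
        printf 'host %s {\n' "$name"
        printf '    hardware ethernet %s;\n' "$mac"
        printf '    fixed-address %s;\n' "$name"
        printf '    option host-name "%s";\n' "$name"
        printf '}\n'
    done
}
```

Something like "gen_dhcp_hosts < nodes.txt" (nodes.txt being a
hypothetical file holding the pairs) prints blocks in the same shape as
the gumbo01 entry, ready to be appended to dhcpd.conf.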
4. Set up NIS on the master node using the info in the NIS HOWTO. No
problems here.
5. Configured the partitioning for the client nodes and for the NFS file
server (a separate machine from the master node that serves home
directories and user apps). The partitioning for the nodes is:
# filename: GUMBO_IDE
disk_config hda
primary swap 2048 rw
primary / 4096 rw,errors=remount-ro ;-j ext3
primary /scratch 0- rw,errors=remount-ro ;-j ext3
Each node is a 2 GHz Pentium 4 with 1 GB of RAM and 20 GB of disk. The
fileserver is a dual 1.4 GHz Xeon Dell PowerEdge 2550 (2 GB RAM)
connected to a Dell PowerVault 220S with about 0.5 terabyte of total
storage. The fileserver was configured similarly to the nodes, but the
RAID filesystem was configured manually after the FAI install. The disk
partitioning used was as follows:
# filename: GUMBO_FILESERV
disk_config sda
primary swap 8192 rw
primary / 0- rw,errors=remount-ro ;-j ext3
6. All the nodes use the autofs automounter to mount shared apps and
user home directories -- the config is read from NIS.
The following files were added to the custom files directory of fai:
# filename: FILES_HOME (auto.home)
+auto.home
# filename: FILES_AUTO (auto.master)
+auto.master
# filename: RSH_FILE (hosts.equiv)
# This file contains a list of the hostnames of all the nodes and is
# duplicated on every node. This is to support passwordless rsh access.
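The hosts.equiv list does not have to be typed out by hand. Here is a
small sketch that prints one hostname per line, assuming the nodes are
named gumbo01 through gumbo72 (only gumbo01 appears in the config above,
so the numbering scheme is an assumption, and gen_hosts_equiv is a
made-up name):

```shell
#!/bin/sh
# Hypothetical helper: print hostnames <prefix>01 .. <prefix>NN,
# one per line, zero-padded to two digits.
# $1: hostname prefix, $2: number of nodes
gen_hosts_equiv() {
    prefix=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%s%02d\n' "$prefix" "$i"
        i=$((i + 1))
    done
}
```

Running "gen_hosts_equiv gumbo 72 > RSH_FILE" would then produce the file
to drop into the FAI files directory.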
7. For the package selection, I pretty much stuck with the default set
for a Beowulf node as defined by the fai samples, but I added the
following selections:
# filename: GUMBO
PACKAGES install
ganglia-monitor
gmetad
libganglia1
libganglia1-dev
xlibmesa-dev
xlibmesa3
tk8.3
tk8.4
tk8.4-dev
tcl8.3-dev
tcl8.4-doc
python-dev
python-doc
ssh
ntp
8. In the scripts dir, I added the following lines to the LAST file:
# NIS SPECIFIC HACKS
cat > $target/etc/yp.conf <<-EOF
ypserver 192.168.1.1
EOF
rmdir $target/etc/network/if-up.d
# HACK TO FIX X11 DIR PERMISSIONS
chmod 755 $target/usr/X11R6/bin
# SET UP NTP.CONF FILE
cp /fai/files/etc/ntp.conf/NTP_FILE $target/etc/ntp.conf
NIS, X11, and NTP would not work without these hacks. Here's the
ntp.conf file:
tinker panic 0
logfile /var/log/ntpd
driftfile /var/lib/ntp/ntp.drift
#broadcastclient yes
server master
9. I used the mkdebmirror and debmirror scripts to create a partial
(testing / sarge) mirror on the master node. After that, I edited
fai.conf (it's attached to this mail) and ran fai-setup. Here I ran into
a problem: libdetect0 is not part of sarge, so I had to remove all
references to it from make-fai-nfsroot -- I believe that it has been
replaced by discover in sarge. After that, fai-setup ran fine (apart
from the usual apt complaints).
10. I booted each node using floppies from www.rom-o-matic.net. Here's
where I ran into major trouble, but this has more to do with the current
state of sarge than with FAI. A lot of the packages (especially Gnome 2)
have broken dependencies in sarge, and when these failed to install, the
entire install was aborted. I got around this by studying the logs after
each install and removing the broken package from the package list(s).
It took quite a few installs to get everything right, but once all the
broken packages were removed, there were no problems.
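The "remove the broken package and retry" loop above can be scripted a
little. This is only a sketch: it assumes you have already collected the
broken package names from the logs into a file, one name per line, and
prune_packages is a made-up helper, not an FAI tool.

```shell
#!/bin/sh
# Hypothetical helper: print a package list with the broken packages
# removed.
# $1: package class file, one package name per line (like GUMBO above)
# $2: file listing the broken package names, one per line
prune_packages() {
    # -v: invert the match, -x: match whole lines only,
    # -F: treat names as fixed strings, -f: read the names from file $2
    grep -vxFf "$2" "$1"
}
```

For example, "prune_packages GUMBO broken.txt > GUMBO.new" (broken.txt
being the hypothetical list) leaves the "PACKAGES install" line and every
package not named in the list. Because of -x, names in the class file
must sit alone on their line with no leading whitespace to be matched.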
--------------------------------------------------------------------------------
I think that's about it regarding the FAI install -- like I said before,
FAI itself didn't have many issues, but the state of the (sarge)
packages led to some headaches. As a side note, we use a Raritan KVM
system to access the consoles of each node -- this is nice because both
video and keyboard signals can be sent over regular CAT 5 cable, so
management is easy (no need to connect keyboards or screens to the
nodes, etc.). Let me know if you need any more info. Thanks.
--Venkata