- Install openssh-server from http://packages.ubuntu.com
- Create the same new user on every box of the cluster
- Log in as the new user; we used 'shadowfax'
- If you have no .ssh directory in your home directory, ssh to some other machine in the lab, then press Ctrl-D to close the connection; this creates .ssh and some related files.
- From your home directory, make .ssh secure by entering:
chmod 700 .ssh
- Next, make .ssh your working directory by entering:
cd .ssh
- To list/view the contents of the directory, enter:
ls -a [we used ls -l]
- To generate your public and private keys, enter:
ssh-keygen -t rsa
The first prompt asks for the name of the file in which your private key will be stored; press Enter to accept the default name (id_rsa). The next two prompts ask for a passphrase, and since we are trying to avoid entering passwords, just press Enter at both prompts, returning you to the system prompt.
- To compare the previous output of ls and see what new files have been created, enter:
ls -a [we used ls -l]
You should see id_rsa containing your private key, and id_rsa.pub containing your public key.
- To make your public key the only thing needed for you to ssh to a different machine, enter:
cat id_rsa.pub >> authorized_keys
[The Linux boxes on our LAN, soon to be a cluster, have IPs ranging from 10.5.129.1 to 10.5.129.24. So, we copied each id_rsa.pub file to temp01-temp24 and uploaded these files via ssh to the teacher station. Then we just ran cat tempnn >> authorized_keys for each temp file to generate one master authorized_keys file for all nodes that we could then download to each node's .ssh dir.]
- [optional] To make it so that only you can read or write the file containing your private key, enter:
chmod 600 id_rsa
- [optional] To make it so that only you can read or write the file containing your authorized keys, enter:
chmod 600 authorized_keys
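The key-merging procedure above can be sketched as a short shell loop. This is only a sketch: the temp01-temp24 filenames are the ones we used, but the key lines here are fakes standing in for each node's real id_rsa.pub contents, and the working directory is arbitrary.

```shell
# Sketch of building one master authorized_keys file from per-node
# public keys named temp01..temp24, as described above.
# Here we fake three keys; on the real cluster each tempNN is a
# node's id_rsa.pub uploaded to the teacher station.
mkdir -p /tmp/keymerge && cd /tmp/keymerge
for n in 01 02 03; do
    echo "ssh-rsa AAAAB3...fake$n jobs@10.5.129.$n" > temp$n
done

: > authorized_keys           # start with an empty master file
for f in temp*; do
    cat "$f" >> authorized_keys
done

wc -l < authorized_keys       # one line per node's key
```

The same master file then gets copied back into every node's ~/.ssh, so any node can ssh to any other without a password.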
InstantCluster Step 5: Software Stack II (Meeting V)
We then installed openMPI (we had far fewer dependencies this year with Natty 11.04 64bit) and tested multi-core with flops. Testing the cluster as a whole will have to wait until the next meeting when we scale the cluster!

We followed openMPI install instructions for Ubuntu from http://www.cs.ucsb.edu/~hnielsen/cs140/openmpi-install.html These instructions say to use sudo and run apt-get install openmpi-bin openmpi-doc libopenmpi-dev However, the way our firewall is set up at school, I can never update my apt-get sources files properly. So, I used http://packages.ubuntu.com and installed openmpi-bin, gfortran and libopenmpi-dev. That's it!

Then we used the following FORTRAN code to test multi-core. FORTRAN, really? I haven't used FORTRAN77 since 1979! ...believe it or don't!
We compiled flops.f on the Master Node (any node can be a master):
mpif77 -o flops flops.f
and tested openMPI, getting just under 800 MFLOPS using 2 cores (one PC):
mpirun -np 2 flops
Next, we generated a "machines" file to tell mpirun where all the nodes (Master and Workers) are (for example, 2 PCs or nodes with 2 cores each). Every node has the same "machines" text file in /home/jobs listing all the IPs, one per line. Every node has the same "flops" executable file (or whatever your executable will be) in /home/jobs. Every node has the same "authorized_keys" text file with all 25 keys in /home/jobs/.ssh
mpirun -np 4 --hostfile machines flops
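For the record, a machines hostfile is nothing fancy: just one node IP per line. This sketch writes one for two of our nodes (the IPs are examples from our 10.5.129.x range) and notes the matching run command.

```shell
# Sketch: generate a minimal "machines" hostfile for mpirun,
# one node IP per line, as described above. IPs are examples.
mkdir -p /tmp/jobs && cd /tmp/jobs
cat > machines <<'EOF'
10.5.129.21
10.5.129.22
EOF
cat machines

# With 2 dual-core nodes listed, 4 processes use every core:
#   mpirun -np 4 --hostfile machines flops
```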
Note: last year we got about 900 MFLOPS per node. This year we still have 64bit AMD Athlon dual-core processors. However, these are new PCs, so these Athlons have slightly different specs. Also, last year we were running Maverick 10.04 32bit ... and ... these new PCs were supposed to be quad-cores! We are still awaiting shipment.
InstantCluster Step 6: Scaling the cluster
UPDATE: 2011.1126 (Meeting VI)
Including myself, we only had 3 members attending this week. So, we added 3 new nodes. We had nodes 21-24 working well last time. Now we have nodes 19-25 for a total of 7 nodes, 14 cores and over 5 GFLOPS! This is how we streamlined the process:
(1) adduser jobs and login as jobs
(2) go to http://packages.ubuntu.com and install openssh-server from the natty repository
(3) create the /home/jobs/.ssh dir and cd there
(4) run ssh-keygen -t rsa
(5) add the new public keys to /home/jobs/.ssh/authorized_keys on all nodes
(6) add the new IPs to /home/jobs/machines on all nodes
(7) go to http://packages.ubuntu.com and install openmpi-bin, gfortran and libopenmpi-dev from the natty repository
(8) download flops.f to /home/jobs from our ftp site, then compile and run:
mpif77 -o flops flops.f and
mpirun -np 2 flops or
mpirun -np 4 --hostfile machines flops
NB: since we are using the same hardware, firmware and compiler everywhere, we don't need to recompile flops.f on every box, just copy flops from another node!
(9) The secret is setting up each node identically.
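Step (9) can be spot-checked with a small script. This is only a sketch assuming our file layout (/home/jobs with machines, the flops binary, and .ssh/authorized_keys), and it is demonstrated here against a mock directory rather than a real node.

```shell
# Sketch: verify a node is set up identically to the others by
# checking for the files every node must share. The /home/jobs
# layout is ours; here we check a mock root instead of a real node.
check_node() {
    root=$1
    for f in machines flops .ssh/authorized_keys; do
        if [ -e "$root/home/jobs/$f" ]; then
            echo "OK $f"
        else
            echo "MISSING $f"
        fi
    done
}

# Build a mock node that is missing the flops binary:
mkdir -p /tmp/mocknode/home/jobs/.ssh
touch /tmp/mocknode/home/jobs/machines
touch /tmp/mocknode/home/jobs/.ssh/authorized_keys

check_node /tmp/mocknode
```

On the real cluster you could run the same checks over ssh from the master, e.g. against each IP in the machines file.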
UPDATE: 2011.1214 (Meeting VII)
We had 5 members slaving away today. Nodes 19-25 were running at about 5 GFLOPS last meeting. Today we added nodes 10, 11, 12, 17 and 18. However, we ran into some errors when testing more than the 14 cores we had last time. We should now have 24 cores and nearly 10 GFLOPS but that will have to wait until next time when we debug everything again....
UPDATE: 2012.0111 (Meeting VIII)
We found that some nodes did not have gfortran installed and many of the authorized_keys files were inconsistent. So, we made sure every node had a user called jobs. Then we made sure every node had openssh-server, openmpi-bin, openmpi-doc (optional), libopenmpi-dev and gfortran installed. We generated all the public ssh keys and copied them over to one master authorized_keys file on shadowfax using ssh. Then we copied the master file back to each node over ssh to /home/jobs/.ssh, and we tested everything with flops. Then I wrote this up on FB.
UPDATE: 2012.0209 (Meeting IX)
(1) While testing the cluster, we got up to 32 cores and 12 GFLOPS. Anything more than 16 nodes had very poor efficiency (the more cores we added, the fewer GFLOPS we gained)! We will need to isolate the bottlenecks. We isolated node #12, but there must be more. Is it the Ethernet cards or cables? Do we have defective cores? Did we install some nodes inconsistently?
(2) There are recruiting screencasts for CSH, CSAP and CRL, but none for CIS? So, we made one:
(3) Then we booted the whole room in about 5 minutes using the pelicanHPC liveCD. We got up to 48 cores and 18 GFLOPS. This tells us that it's probably not the cores or the Ethernet causing the problems. We may have to reinstall a couple of nodes.
InstantCluster Step 7: Compiling MPI Code! (2012.0229, Meeting X)
We downloaded the Life folder from BCCD and ran "make -f Makefile" to generate Life.serial, Life.MPI, Life.MP, Life.hybrid with just one snafu. Only PC25 had all the X11 libraries to compile properly! There must be some way to get those libraries installed on the other nodes. So, I wrote the following email to the folks at BCCD:

Hello Skylar, et al: I only get to meet with my Computing Independent Study class once or twice a month after school. So, we hadn't had a chance to try some of your code until now. If you recall, I asked if the MPI code on BCCD could be used on any openMPI cluster and you encouraged me to try some of the code here, http://bccd-ng.cluster.earlham.edu/svn/bccd-ng/tags/bccd-3.1/trees/home/bccd Yesterday, we were playing around with the Life folder. When I tried "make -f Makefile" on my linux box, everything compiled fine. However, when my students did so they got:

gcc -o Life.serial Life.c pkit.o -O3 -ggdb -pg -lm -lX11 -L/usr/X11R6/lib
In file included from Life.h:39:0, from Life.c:41:
XLife.h:40:62: fatal error: X11/Xlib.h: No such file or directory
compilation terminated.
make: *** [Life.serial] Error 1

Both my PC and theirs are set up identically for clustering in that we are running 64bit Ubuntu 11.04 + openMPI + openSSH with public key authentication. Apparently their machines don't have the X11 libraries. I'm thinking that I installed something more on my box to enable various other programs (SmartNotebook, VLC, etc) to work that may have automagically installed those libraries on my box? Do you know how to get those libraries? TIA, A. Jorge Garcia Applied Math and CompSci http://shadowfaxrant.blogspot.com http://www.youtube.com/calcpage2009

And the reply was:

I think you'll want to install libx11-dev on the systems generating the error. You can either use "apt-get install" on the command line, or the package manager GUI.
-- Skylar Thompson (firstname.lastname@example.org) -- http://www.cs.earlham.edu/~skylar/
So, that's what we are going to do next time! THANX, SKYLAR!!
UPDATE: 2012.0314 (Meeting XI - PI Day!)
We wrote to Skylar once again:
Thanx for your help regarding libx11-dev. We installed it on every Linux box we have in the classroom in case we need it again, we ran "make -f Makefile" in the Life directory we downloaded from http://bccd-ng.cluster.earlham.edu/svn/bccd-ng/tags/bccd-3.1/trees/home/bccd and we got Life.serial working.

Now, we tried to run Life.mpi and ran into another SNAFU. Just to recap, we're trying to run some of your BCCD openMPI code on our cluster. Our cluster has a common user called jobs on every Linux box (64bit dual-core Athlons running Ubuntu 11.04) with openSSH public-key authentication. In other words, a user need only log in once on any of our 25 boxes and can ssh to any other box without providing a userid or passwd. Of course, we also have openMPI installed with gfortran. We ran "scp Life.mpi email@example.com:Life.mpi" to populate all the /home/jobs directories. Each node also has the same /home/jobs/machines file listing the static IPs of all the nodes in the cluster as well as the same /home/jobs/.ssh/authorized_keys file.

So, we figured we could run Life.mpi on one node (2 cores, 5 threads per core):
mpirun -np 10 Life.mpi
or on five nodes (10 cores, 1 process per core):
mpirun -np 10 --hostfile machines Life.mpi
using any box as the master node, running from /home/jobs. Single node worked great. Many nodes gave: "Error: Could not open display Life.h:93..."

Do you have any idea how to remedy this issue? Also, -r, -c and -g seem to work fine as commandline input for the Life executables, but -t doesn't affect the output. We didn't try file input yet. TIA, A. Jorge Garcia Applied Math and CompSci http://shadowfaxrant.blogspot.com http://www.youtube.com/calcpage2009
And David Joiner, a developer for the http://BCCD.net team, replied:

OK, in that case the tricky thing is that each node has its own account and its own .bashrc file, so you will need to make changes to all of them. Assuming your head node has IP www.xxx.yyy.zzz, each .bashrc on every client node would need a line like:
export DISPLAY=www.xxx.yyy.zzz:0.0
and on your head node you would want to allow other machines to display to your xhost by running this at the command line before the mpirun command:
xhost +
Let me know if that fixes your display problems. Dave
UPDATE: 2012.0425 (Meeting XII)
OK, we added the export command:
export DISPLAY=10.5.129.11:0.0
to every node's .bashrc file, making PC11 (10.5.129.11) our master node. Also, our machines file looks like this:
10.5.129.5
10.5.129.6
10.5.129.7
...
10.5.129.25
Then we ran from the master node:
xhosts +
jobs@lambda:~$ mpirun -np 4 --hostfile ~/machines Life.mpi
and we got this error:
Error: Could not open display at XLife.h:93
Error: Could not open display at XLife.h:93
Error: Could not open display at XLife.h:93
Error: Could not open display at XLife.h:93
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 6704 on node 10.5.129.7 exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
jobs@lambda:~$
Do you know how to fix this? TIA, A. Jorge Garcia Applied Math and CompSci http://shadowfaxrant.blogspot.com http://www.youtube.com/calcpage2009
Now, David Joiner replied that we should add the line "export DISPLAY=10.5.129.11:0.0" to the .bashrc file for each worker node, if 10.5.129.11 is the master node. Adding that line to the master itself will only mess up Life.mpi on that single node. Then, when running Life.mpi on the cluster from the master node, we run:
mpirun -np 4 --hostfile ~/machines Life.mpi
We did all that and still there's no joy....
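For reference, the setup we attempted can be sketched as follows. This is only a sketch of the procedure, demonstrated against mock .bashrc files instead of real worker nodes; the master IP 10.5.129.11 is ours and the /tmp paths are stand-ins.

```shell
# Sketch: point every worker's DISPLAY at the master (10.5.129.11),
# as Dave suggested. Here we append to mock .bashrc files; on the
# real cluster you'd make this edit on each worker node over ssh.
MASTER=10.5.129.11
for node in worker1 worker2; do
    mkdir -p /tmp/cluster/$node
    echo "export DISPLAY=$MASTER:0.0" >> /tmp/cluster/$node/.bashrc
done

grep DISPLAY /tmp/cluster/worker1/.bashrc

# On the real master node, before mpirun, allow remote X clients:
#   xhost +
#   mpirun -np 4 --hostfile ~/machines Life.mpi
```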
NB: ssh -Y xxx.xxx.xxx.xxx doesn't work anymore when sshing into a worker node!
PS: next time try NFS?
LAST UPDATE: 2012.0530 (Meeting XIII)
We had our last meeting today! We burned a bunch of BCCD DVDs and ran GalaxSee without a problem on several cores. This year's team actually achieved several milestones. We got pelicanHPC running on all 50 cores from one CD via PXEboot for the first time. We got BCCD with Life and GalaxSee running for the first time. We even installed openMPI over public key authenticated openSSH natively on our LAN and ran flops.f at 20 GFLOPS. Congrats on a job well done!
BTW, the best results we had this year hardware-wise were:
BCCD: 4 nodes, 8 cores, 3.2 GFLOPS, 8 GB RAM, 1 TB HDD
LAN: 16 nodes, 32 cores, 12.8 GFLOPS, 32 GB RAM, 4 TB HDD
HPC: 25 nodes, 50 cores, 20 GFLOPS, 50 GB RAM, 6.25 TB HDD
Many thanx also go to the developers of the following projects who helped us:
1995-2005: openMOSIX, ClusterKNOPPIX (with fractals), QUANTIAN (with povray), ParallelKNOPPIX (old version of pelicanHPC), Bootable Cluster CD 2.x
2005-2012: openSSH, Bootable Cluster CD 3.x (with c++), Cluster by Night (with ruby), pelicanHPC (new version of ParallelKNOPPIX with c++, python and fortran)
(we used to use the BCCD liveCD, look at their openMPI code):
Conway's 2D Game of Life
N-Body Orbits
(maybe we can use Python on our MPI cluster?):
Parallel Python
IPython
(look at some clustering environments we've used in the past):
We used PVM and PVMPOV in the 1990s.
openMOSIX and kandel were fun in the 2000s.
(look what other people are doing with MPI):
MPI intro, nice!
MPI on Ubuntu
Sample MPI code
http://www.cc.gatech.edu/projects/ihpcl/mpi.html
==============================================
What we are researching I (Sept)
(look what this school did in the 80s and 90s):
Thomas Jefferson High courses
Thomas Jefferson High paper
Thomas Jefferson High ftp
Thomas Jefferson High teacher
==============================================
Today's Topic: CIS(theta) 2011-2012 - Compiling MPI Code! - Meeting XIII
Today's Attendance: CIS(theta) 2011-2012: GeorgeA, KennyK, LucasE
Today's Reading: Chapter 9: Building Parallel Programs (BPP) using clusters and parallelJava
==============================================
Membership (alphabetic by first name):
CIS(theta) 2011-2012: Graham Smith, George Abreu, Kenny Krug, Lucas Eager-Leavitt
CIS(theta) 2010-2011: David Gonzalez, Herbert Kwok, Jay Wong, Josh Granoff, Ryan Hothan
CIS(theta) 2009-2010: Arthur Dysart*, Devin Bramble, Jeremy Agostino, Steve Beller
CIS(theta) 2008-2009: Marc Aldorasi, Mitchel Wong*
CIS(theta) 2007-2008: Chris Rai, Frank Kotarski, Nathaniel Roman
CIS(theta) 1988-2007: A. Jorge Garcia, Gabriel Garcia, James McLurkin, Joe Bernstein
* nonFB
==============================================
Well, that's all folks, enjoy!