This post covers the initial steps I took to set up my five-node Raspberry Pi 3 cluster with SLURM, a scheduler and resource manager for High-Performance Computing (HPC) clusters. Did I end up with a fast HPC resource? Well, no. With 1 GB of memory and a 4-core ARM processor each, it is more like I strung together a handful of cell phones. Do I really need to use SLURM, a scheduler used on some of the largest supercomputers in the world? Also, no. There are much simpler options with less overhead for a small cluster like this. But since when are at-home DIY projects like this intended to be practical?

There are some key items missing from this post. I still haven’t created any shared users or set up any monitoring/maintenance software, and I haven’t queued any starter MPI programs to run yet. That will come later.

Materials

  • Five Raspberry Pi 3 boards
  • Heat sinks for each Pi
  • Five microSD cards loaded with NOOBS
  • A stackable cluster case with fans
  • A power supply for the Pis
  • A network switch, a travel router, and Ethernet cords
  • A USB flash drive (for the shared filesystem)
  • A monitor, keyboard, and mouse for the initial setup

Assemble Rack

Do not plug any of the Pis into power yet!

  1. Add the heat sinks on top of the CPU and the LAN chip of each Pi.
  2. Insert a NOOBS microSD card, with the label facing downward, into each Pi.
  3. Follow the cluster case instructions. Install the fans with the label facing up, with the red wire plugged into a 5V GPIO pin and the black wire plugged into a ground pin (see images).


Set Up the Master Node

  1. Plug in a monitor using HDMI, a mouse using USB, and a keyboard using USB.
  2. Plug into power supply.
  3. Select your WiFi network.
  4. At the bottom of the screen, change your keyboard language if necessary.
  5. Select the Raspbian option and click install.
  6. After it boots up, select your country, language, and timezone.
  7. Change your Pi’s password. Select a strong password as you have connected it to the internet.
  8. Finish filling out the initial config options and wait for the Pi to update.
  9. Restart your Pi.
  10. Open Terminal (icon on top menu bar)
  11. Run these 3 commands:
    1. sudo apt-get update
    2. sudo apt-get install realvnc-vnc-server realvnc-vnc-viewer
    3. sudo raspi-config
      1. Select interfacing options
      2. Enable VNC
      3. Select network options
      4. Set the hostname to node01. If you want to install SLURM later, you need to use this naming convention: <consistent_name_for_all><unique_number>, for example node01.
      5. Reboot your Pi for the changes to take effect
  12. Sign up for RealVNC – https://www.realvnc.com/en/raspberrypi/#sign-up
  13. On the Pi, click the VNC icon on the top-right corner of the screen. Then click the hamburger button in the top right and select Licensing
  14. Add the email and password you used to sign up for RealVNC, and change the computer name in the team to the hostname you set in step 11.
  15. Download/install VNC Viewer on the computer you want to control the Pi from (so that the monitor, keyboard, and mouse are no longer necessary).
  16. Open VNC Viewer and sign in. You will see your Pi listed. You may need to authorize the sign-in through your email.
  17. Enter your Pi’s username and password. The default username is pi and the password is the one you set in step 7.
  18. Disconnect the monitor, keyboard, mouse. You will now finish setting up through VNC viewer.
  19. In the VNC GUI, click the Raspberry Pi Menu button (Upper left) > Preferences > Raspberry Pi configuration
    1. In the Interfaces tab, click enable on all options. You might want these later.
    2. Reboot Pi
  20. Open Terminal again and run the following commands:
    1. sudo apt-get install build-essential
    2. sudo apt-get install manpages-dev
    3. sudo apt-get install gfortran
    4. sudo apt-get install nfs-common
    5. sudo apt-get install nfs-kernel-server
    6. sudo apt-get install vim
    7. sudo apt-get install openmpi-bin
    8. sudo apt-get install libopenmpi-dev
    9. sudo apt-get install openmpi-doc
    10. sudo apt-get install keychain
    11. sudo apt-get install nmap
    12. sudo apt install ntpdate -y
  21. Run cat /sys/class/net/eth0/address to get your MAC address.
  22. Run ifconfig to get your IP address and netmask. These are listed under the wlan0 section since you are currently connected via WiFi.
  23. Run netstat -nr and write down the gateway address.
  24. Set up static IP addresses for Ethernet and WiFi (these need to be different from each other and not something another device on your network would use).
    1. sudo nano /etc/dhcpcd.conf
    2. Add the following lines to the top of the file (minus the numbers in front of the lines), modifying the values for your new IPs and your gateway:
      1. interface eth0 # ethernet
      2. static ip_address=10.0.2.200/24 # the part before the slash is your new IP; it should start with your gateway’s numbers, with the number after the last dot unique to this Pi. The /24 corresponds to a 255.255.255.0 netmask, which is likely for a home network
      3. static routers=10.0.2.1 # the gateway
      4. static domain_name_servers=10.0.2.1 # the gateway
      5. interface wlan0 # wifi
      6. static ip_address=10.0.2.201/24 # this is for WiFi, use a different IP
      7. static routers=10.0.2.1
      8. static domain_name_servers=10.0.2.1
    3. Restart your Pi with sudo reboot.
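
After the reboot, you can confirm the static WiFi address took effect (the eth0 address will show up once the Pi is cabled to the switch in the Networking section below). A quick check, assuming the example addresses above (swap in your own):

  ip addr show wlan0   # should list 10.0.2.201/24, or whatever you chose
  ping -c 3 10.0.2.1   # confirm the gateway is still reachable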


Set Up the Servant Nodes (a.k.a. Compute Nodes)

  • Follow the same steps as above.
  • Use and track different static IP addresses.
  • Track the MAC addresses for connecting via switch.
  • Use consistent names if you plan on installing SLURM as your resource manager.
  • You don’t need to sign up for RealVNC again. Sign in on each Pi with your existing account.
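
For reference, a tracking sheet for the five nodes might look like this. The hostnames follow the SLURM-friendly naming convention; the IPs are only examples continuing the 10.0.2.x scheme from above, and the MAC addresses are the ones you recorded in step 21:

  Hostname   Ethernet IP   WiFi IP      MAC address
  node01     10.0.2.200    10.0.2.201   (from step 21)
  node02     10.0.2.202    10.0.2.203   ...
  node03     10.0.2.204    10.0.2.205   ...
  node04     10.0.2.206    10.0.2.207   ...
  node05     10.0.2.208    10.0.2.209   ...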


Networking

Some switches will require you to log into their admin interface and set the static IPs. Use the MAC addresses and static IPs you marked down if this is the case.

  1. Connect Pis to switch with ethernet cords.
  2. Connect switch to travel router with ethernet cord.
  3. Power on switch and travel router.
  4. Set up travel router as a client according to its instructions.
  5. Use VNC Viewer to log into each Pi and turn off its WiFi.
  6. On your main computer, edit your /etc/hosts file so that you can SSH to your Pis using their hostnames instead of their IPs. Add a line for each node, using the static Ethernet IP you set up for it. Example line:
    1. 10.0.2.200 node01
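
With all five nodes added, that block of /etc/hosts might look like this (using the example Ethernet IPs from above; substitute your own):

  10.0.2.200 node01
  10.0.2.202 node02
  10.0.2.204 node03
  10.0.2.206 node04
  10.0.2.208 node05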


Shared Network File System (NFS)

  1. SSH to your master node (remember to use the default username at this point: ssh pi@node01).
  2. Plug in your flash drive to a USB port on your master node.
  3. Figure out its /dev location with lsblk. In this example it is located at /dev/sda1.
  4. Format the drive (this erases anything already on it) with this command: sudo mkfs.ext4 /dev/sda1
  5. Create the mount directory with these commands. I used open permissions at this point.
    1. sudo mkdir /scratch
    2. sudo chown nobody.nogroup -R /scratch
    3. sudo chmod 777 -R /scratch
  6. Find the UUID for /dev/sda1 with blkid.
  7. Edit fstab to mount this filesystem on boot with sudo nano /etc/fstab
    1. Add the following line. Modify with your uuid.
    2. UUID=fdcd70f1-fd9e-465d-9bd0-f9169b7cfb47 /scratch ext4 defaults 0 2
  8. Mount the drive with sudo mount -a
  9. Again, setting open permissions on the mounted drive:
    1. sudo chown nobody.nogroup -R /scratch
    2. sudo chmod -R 777 /scratch
  10. Edit /etc/exports. Add the following line, but replace the IP address with your gateway address with its final octet changed to zero, followed by a slash and your netmask number.
    1. /scratch 10.0.2.0/24(rw,sync,no_root_squash,no_subtree_check)
  11. Update the NFS kernel server with sudo exportfs -a
  12. SSH to and mount the directory on all of the servant nodes:
    1. Run these commands:
      1. sudo mkdir /scratch
      2. sudo chown nobody.nogroup /scratch
    2. Add the following line to /etc/fstab, replacing <master node ip> with your master node’s IP:
      1. <master node ip>:/scratch /scratch nfs defaults 0 0
    3. Mount it with sudo mount -a
    4. Change the file permissions with sudo chmod -R 777 /scratch
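
A quick way to confirm the share works, assuming the mount points above: create a file on the master and check that it appears on a servant node.

  # on the master node
  touch /scratch/nfs-test.txt
  # on any servant node
  ls -l /scratch   # nfs-test.txt should be listed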


Install SLURM on Master Node

  1. SSH to your master node.
  2. Edit the /etc/hosts file and add lines for the other four nodes, like you did on your main computer, so the master recognizes the other nodes by their hostnames.
  3. Install SLURM with sudo apt install slurm-wlm -y
  4. Use SLURM’s default config file as the starting point for your own:
    1. cd /etc/slurm-llnl
    2. sudo cp /usr/share/doc/slurm-client/examples/slurm.conf.simple.gz .
    3. sudo gzip -d slurm.conf.simple.gz
    4. sudo mv slurm.conf.simple slurm.conf
  5. Edit /etc/slurm-llnl/slurm.conf
    1. Add the ip address of the master node:
      1. SlurmctldHost=node01(<ip addr of node01>)
    2. Comment out this line:
      1. #SlurmctldHost=workstation
    3. Set the cluster name
      1. ClusterName=hi-five-pi
    4. Towards the end of the file, delete the example compute node entry and then add information for all your nodes. Example line for one node:
      1. NodeName=node01 NodeAddr=<ip addr node01> CPUs=4 State=UNKNOWN
    5. Create a default partition.
      1. PartitionName=general Nodes=node[02-05] Default=YES MaxTime=INFINITE State=UP
  6. Create a file called /etc/slurm-llnl/cgroup.conf (example contents after this list).
  7. Whitelist system devices by creating /etc/slurm-llnl/cgroup_allowed_devices_file.conf (example contents after this list).
  8. Copy all of the files we created/edited to /scratch so that we can use them on the other nodes:
    1. sudo cp slurm.conf cgroup.conf cgroup_allowed_devices_file.conf /scratch
    2. sudo cp /etc/munge/munge.key /scratch
  9. Start SLURM with these commands:
    1. sudo systemctl enable munge
    2. sudo systemctl start munge
    3. sudo systemctl enable slurmd
    4. sudo systemctl start slurmd
    5. sudo systemctl enable slurmctld
    6. sudo systemctl start slurmctld
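
Steps 6 and 7 reference two small config files whose contents are not shown inline. A minimal sketch of each, based on common SLURM cgroup settings for a Raspberry Pi cluster of this kind (treat these as a starting point, not the exact files from the post):

  /etc/slurm-llnl/cgroup.conf:
    CgroupMountpoint="/sys/fs/cgroup"
    CgroupAutomount=yes
    CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
    AllowedDevicesFile="/etc/slurm-llnl/cgroup_allowed_devices_file.conf"
    ConstrainCores=no
    TaskAffinity=no
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=no
    ConstrainDevices=no
    AllowedRamSpace=100
    AllowedSwapSpace=0
    MaxRAMPercent=100
    MaxSwapPercent=100
    MinRAMSpace=30

  /etc/slurm-llnl/cgroup_allowed_devices_file.conf:
    /dev/null
    /dev/urandom
    /dev/zero
    /dev/sda*
    /dev/cpu/*/*
    /dev/pts/*
    /scratch*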


Install SLURM on Compute Nodes (Repeat on all Compute Nodes)

  1. SSH to node
  2. Install SLURM with sudo apt install slurmd slurm-client -y
  3. Edit the /etc/hosts file to include the four other Pis.
  4. Copy the configuration files with these commands
    1. sudo cp /scratch/munge.key /etc/munge/munge.key
    2. sudo cp /scratch/slurm.conf /etc/slurm-llnl/slurm.conf
    3. sudo cp /scratch/cgroup* /etc/slurm-llnl
  5. Start SLURM with these commands.
    1. sudo systemctl enable munge
    2. sudo systemctl start munge
    3. sudo systemctl enable slurmd
    4. sudo systemctl start slurmd
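
Before moving on, it can be worth a quick sanity check that munge authentication works across nodes, since SLURM relies on it. From the master node, something like:

  ssh pi@node02 munge -n | unmunge   # should report STATUS: Success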


Test SLURM

If this part fails, try rebooting all nodes. Then inspect your config files. Restart the services if you make changes.

  1. SSH to master node
  2. Run sinfo to see your partition information.
  3. Run a job on four nodes to make them print their hostname:
    1. srun --nodes=4 hostname
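
If everything is wired up, the srun command should print one hostname per allocated node, something like the following (order may vary, and with the default partition above the job lands on the compute nodes):

  node02
  node03
  node04
  node05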


Resources