DIY cluster: software [Autumn 2017]
In a previous post I talked a bit about the hardware I used to build my DIY compute cluster in the autumn of 2017. The primary goal of the project was to gain experience building and debugging parallel software, but one of the secondary goals was to familiarise myself with Linux as an operating environment, with which I had little prior experience. Naturally, the first choice was which flavour of Linux to use, and there are literally hundreds. At the time of building the cluster, the only distribution I had any exposure to was Ubuntu, so I opted for the then-current server version (16.04), reasoning that it would minimise any potential compatibility issues with the desktop variant of Ubuntu (also 16.04) installed on the machine where I planned to do my development.
The first thing that needed doing was setting up remote access. I quickly learned that SSH, or "secure shell", is the industry standard here. Setting up SSH on each of the nodes turned out to be a breeze: Ubuntu offers an option to do this automatically during the OS install, and even if you leave that option unchecked, the SSH server can be installed with a single command once the operating system is up and running. SSH can either give you a standard interactive shell, where you run commands yourself on the remote machine, or a non-interactive session where it logs in, runs a command string, then logs out again. Another tool I made extensive use of was PSSH, or "parallel ssh". It's a simple tool: all it does is open an SSH session to a number of hosts, run a command non-interactively, then report back the exit code from each host. Handy for avoiding repetitive operations.
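For reference, something along these lines is all it takes (the addresses and user name here are placeholders matching the setup described below):

    # install the SSH server on a node, if it wasn't selected during the Ubuntu install
    sudo apt-get install openssh-server

    # run a single command on one node non-interactively
    ssh ac@192.168.0.70 uptime

    # run the same command on every node at once
    # (Ubuntu's pssh package installs the binary as parallel-ssh)
    parallel-ssh -i -H "192.168.0.70 192.168.0.71 192.168.0.72 192.168.0.73 192.168.0.74" uptime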
NETWORKING:
Typically, clusters have a "head node" from which the cluster is managed, where shared resources are stored and from which programs are launched. In my case, my desktop machine serves as the head node. To access each of the nodes easily, though, I needed them to have static IP addresses. In Windows, or in Linux distributions with a rich desktop environment, such things are typically handled by a widget somewhere on the desktop. On a Linux server, however, network interfaces are configured at boot time based on the contents of a configuration file (/etc/network/interfaces on Ubuntu 16.04). In my case, the home router where I was living at the time assigned IP addresses in the typical 192.168.0.x range, so I configured my desktop (the head node) as 192.168.0.80 and the compute nodes as 192.168.0.70-74. The interfaces file on the compute nodes ended up looking like this:
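(A reconstruction rather than the original file; in particular, 16.04 may name the first NIC something like enp2s0 rather than eth0, and the router is assumed to sit at 192.168.0.1.)

    # /etc/network/interfaces on a compute node (this one gets 192.168.0.70)
    auto lo
    iface lo inet loopback

    auto eth0
    iface eth0 inet static
        address 192.168.0.70
        netmask 255.255.255.0
        gateway 192.168.0.1
        dns-nameservers 192.168.0.1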
Another necessity for clustering is some form of shared storage, so that executables and other resources are accessible by all the compute nodes in the cluster. Fortunately, Linux has a simple, ubiquitous tool for this too: NFS (Network File System). NFS allows a system to share directories that other machines can then mount remotely. Once installed, the NFS server is configured by modifying the contents of the /etc/exports file. I created a "cluster" directory in my user account's home directory on the head node and added a line for each compute node to the exports file:
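(Roughly like this; the export options shown are the common defaults rather than necessarily the exact ones I used. The server itself comes from the nfs-kernel-server package, and sudo exportfs -ra re-reads the file after editing.)

    # /etc/exports on the head node
    /home/ac/cluster 192.168.0.70(rw,sync,no_subtree_check)
    /home/ac/cluster 192.168.0.71(rw,sync,no_subtree_check)
    /home/ac/cluster 192.168.0.72(rw,sync,no_subtree_check)
    /home/ac/cluster 192.168.0.73(rw,sync,no_subtree_check)
    /home/ac/cluster 192.168.0.74(rw,sync,no_subtree_check)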
Similarly to network interfaces, drives (remote or otherwise) can be mounted at startup by listing the mounts in /etc/fstab. In my case, the same user account name is used on all the machines in the cluster, including the head node, so cluster-related files end up at the same place in the filesystem everywhere (/home/ac/cluster/); this will make things easier later. The following line instructs the operating system to mount the NFS share defined on the head node. I added an identical line to the /etc/fstab file on each of the compute nodes so that they would all mount the share at startup:
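(Something like the following; the compute nodes also need the nfs-common package installed so that the nfs filesystem type is available.)

    # /etc/fstab on each compute node: mount the head node's share at the same path
    192.168.0.80:/home/ac/cluster  /home/ac/cluster  nfs  defaults  0  0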
INSTALLING MPICH:
A stack of machines, remotely accessible or not, does you little good without the tools to leverage it. MPI, or "message passing interface", is an industry-standard framework for multi-node, message-passing-based parallel programming, so it seemed like a sensible place to start. MPI, however, is just a standard. There are a few different implementations to choose from: OpenMPI is a widely deployed open-source implementation, MPICH is a reference implementation of the MPI standard maintained by Argonne National Laboratory, and MVAPICH is a version of MPICH built specifically to support InfiniBand and other exotic interconnects.
For my cluster, I opted for MPICH, as I didn't need the advanced features of either OpenMPI or MVAPICH. Installing it involves building from source, which is not nearly as involved as it sounds. After downloading and extracting MPICH, I found a README detailing the installation process. The --prefix argument lets you specify where the output binaries end up; in my case, I chose to put them in the NFS-shared "cluster" directory I created earlier, so that they would be accessible at the same path on all the nodes. I also disabled Fortran, as the configure command complained that it couldn't find a Fortran compiler and I didn't need Fortran support anyway:
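(The version number and prefix below are illustrative; the sequence is the usual configure/make dance.)

    cd mpich-3.2
    ./configure --prefix=/home/ac/cluster/mpich --disable-fortran
    make
    make install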
It turns out MPI is a pretty complicated piece of software, so this process can take a while. Once the compilation is complete, the final step is to set up the shell so that it can find the compiled binaries. If you configure MPICH without specifying your own nonstandard installation directory, you don't need to do this, but I wanted to keep the option of installing other MPI implementations down the line without having to uninstall MPICH. So, rather than install to the default location, I added a few lines to my .bashrc as follows:
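(Assuming the install prefix used above; the MPI_HOME variable is just a shorthand.)

    # appended to ~/.bashrc
    export MPI_HOME=/home/ac/cluster/mpich
    export PATH=$MPI_HOME/bin:$PATH
    export LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH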
FIRST CODE:
Naturally, the first thing to do in any new programming environment is to write a hello world program. The program just prints out a message from each process, telling the user which process the message is coming from and the name of the host where that process is running. So here it is:
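(A minimal sketch of such a program, using the standard MPI calls for rank, size and processor name.)

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, name_len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);                   /* start up the MPI runtime      */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* which process am I?           */
        MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many processes are there? */
        MPI_Get_processor_name(name, &name_len);  /* which host am I running on?   */

        printf("Hello from process %d of %d on %s\n", rank, size, name);

        MPI_Finalize();
        return 0;
    }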
MPI effectively acts as its own compiler toolchain: gcc becomes mpicc, g++ becomes mpicxx, and so on. These executables are really just wrappers around whatever C/C++ compiler you used to build your MPI installation, but all the same, compiling MPI programs requires the MPI toolchain rather than your ordinary compiler:
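(Assuming the source file from above is saved as hello.c.)

    mpicc -o hello hello.c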
A running MPI program is ultimately just a bunch of instances of the same program, running in different places. All MPI really does is provide an abstraction for communication between processes and a convenient means of launching those processes. As such, an MPI program can actually be run the same way as any ordinary program, that is, all on its lonesome, and MPI algorithms can fairly trivially be designed to tolerate this, memory and compute requirements notwithstanding. But if you are using MPI, you are probably thinking parallel, or at least you should be. There are a few different ways to launch MPI programs in parallel, with various different process managers and so forth, but generally it comes down to invoking a command and passing it some parameters describing how and where you want to run your instances. The following command, for example, launches 4 instances of the hello world program:
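(Reusing the binary name from the compile step above.)

    # launch 4 instances on the local machine
    mpirun -n 4 ./hello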
In this case, though, the instances are all launched locally. If you have a cluster of machines, you want to be able to launch your MPI program on all of them. When passed a list of hosts, mpirun/mpiexec will invoke SSH to launch remote instances. Usually an SSH session will prompt the user for a password, though, which is a problem: the sessions launched by mpirun are not interactive, and even if they were, who has time to type the same password dozens of times every time they want to run a program? You could of course just disable the SSH password or use a blank one, but that is not a particularly good idea, for mostly security-related reasons. A safer approach is to use public/private key pairs for authentication. In my case, the head node needs password-free SSH access to the compute nodes. This little shell snippet generates an RSA key pair and copies the public key to each of my nodes, prompting for each machine's password as it goes:
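(A sketch of the idea; ssh-copy-id does the heavy lifting of appending the public key to each node's authorized_keys.)

    # generate a key pair with no passphrase, then push the public key to every compute node
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    for ip in 192.168.0.70 192.168.0.71 192.168.0.72 192.168.0.73 192.168.0.74; do
        ssh-copy-id ac@$ip
    done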
The last step is to set up a file defining the hosts:
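(Nothing fancy, just one address per line; MPICH's Hydra launcher also accepts a :N suffix on each line to set the number of processes per host.)

    # /home/ac/cluster/hosts
    192.168.0.70
    192.168.0.71
    192.168.0.72
    192.168.0.73
    192.168.0.74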
Then, run the program with reference to the host file:
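(The -f flag is the MPICH/Hydra way of pointing at a host file; -n 20 here assumes four processes on each of the five compute nodes, so adjust it to match the actual core counts.)

    mpirun -n 20 -f /home/ac/cluster/hosts ./hello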
And voilà! An MPI-capable cluster.
BENCHMARKS:
There exists a set of benchmarks called the OSU microbenchmarks, maintained by Ohio State University (who also maintain the aforementioned MVAPICH), which give you low-level performance insight into a variety of MPI primitives on any given system. This, for example, is a graph of how point-to-point bandwidth varies with message size:
Bandwidth is poor at small transfer sizes and gets progressively higher as messages become larger. A peak bandwidth of 117 MB/s is observed at the largest message size, which is only about 94% of the 125 MB/s theoretical maximum for gigabit Ethernet (1 Gbps / 8). I'm not sure where this limit comes from, but if I had to guess, it's probably down to overhead from TCP/IP packet headers and from overheads in MPI itself. The OSU suite also includes latency benchmarks:
As expected, the lowest latency occurs at the smallest message sizes, with the best achieved latency being about 26 microseconds. As the message size increases, though, the measured latency becomes dominated by transmission time rather than the actual latency of the network. This is supported by the fact that using the largest measured latency (35.9 ms at the 4 MB message size) to calculate the bandwidth of the transfer gives just shy of 117 MB/s, very close to the peak bandwidth observed in the bandwidth benchmark. There are a shed-load of other benchmarks in the suite too, testing all sorts of MPI functionality. They're definitely a useful tool to have, especially as a sanity check when installing new hardware, but I'm not sure how much you can infer from them about an actual MPI workload. Time will tell, I guess!
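For reference, the point-to-point figures above come from the suite's osu_bw and osu_latency tests; a two-node run looks something like this (binary paths depend on where the suite is built, and the addresses are the compute nodes from earlier):

    mpirun -n 2 -hosts 192.168.0.70,192.168.0.71 ./osu_bw
    mpirun -n 2 -hosts 192.168.0.70,192.168.0.71 ./osu_latency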