Infiniband for the cluster

One of the biggest limiting factors for my cluster, insofar as performance is concerned, has been the interconnect. I know it seems odd to talk about the performance of a cluster built for educational purposes from nearly 10 year old hardware, but the interconnect is a very big part of what makes an HPC system "HPC" and not just a bunch of dumb servers on a LAN. I knew this would probably be the case when building the thing, of course: 1 gigabit is not a lot of bandwidth, and I have found that I have to be extremely sparing with communication when writing code for it. So, since building it just over a year ago, I've kept my eye on eBay, searching for something cheap and fast to replace the network with. Meet the Mellanox IS5023, an 18-port 40 gigabit unmanaged infiniband switch, which I found on eBay for £125.


I was also able to find single-port infiniband cards for ~£20 each from China and QSFP copper direct-attach cables for ~£8 each, also from China. I also bought a 10 metre fibre optic cable, in case I want to run the cluster and switch in a separate room at some point. All in, including a couple of spare cards and cables, the whole lot came to a bit over £300. For the uninitiated, infiniband comes in a variety of different flavours, each referred to by a three-letter designation. There are versions of it that operate at anything from 10 to 200 gigabits per second, with plans in the works for 400Gbps "NDR" and 1Tbps "XDR" standards in the near future. This switch supports infiniband "QDR", which uses four lanes per cable, each running at a comparatively modest 10Gbps. All of the kit I picked up operates at QDR speeds.

The network cards are Mellanox ConnectX-2, and operate over a PCI-Express 2.0 x8 link, limiting bandwidth into any given machine to 32Gbps in each direction. In addition to this, the infiniband link uses 8b/10b encoding, so 10 bits are sent on the wire for every 8 bits of data. This means that the real theoretical peak bandwidth of the 40Gbps QDR link is actually "only" 32 gigabits per second, the same as the PCIe interface anyway. In practice though, I expect to see less than that in benchmarks due to overheads elsewhere in the system. The boards in the cluster machines aren't capable of PCIe 3.0 anyway, so going for the more expensive PCIe 3.0 capable ConnectX-3 wouldn't have been worthwhile. To mount the switch, I created a structural frame and added it to the top of the cluster, then ran the network cables up the front of the cluster, securing them in a bundle with cable ties at each node:


This is neat, and leaves space at the top of the cluster for a dedicated head node or storage server should I decide to add one in the future. There is, however, an elephant in the room. Namely, the noise. The switch is equipped with four 15,000 rpm Sunon fans. I knew in advance that the switch was probably going to be loud, but I wasn't prepared for just how loud it would be. Worse still, on its perch atop the cluster, it's right at ear height beside my desk. I can just about tolerate it with earphones in and music blasting at maximum volume, so long as I am willing to endure some short term hearing loss, but I'd really rather not. Fortunately, I don't actually need the switch to set up a rudimentary infiniband network: I can connect the nodes directly to test the cards, at least until I can move the setup to another room or do something to solve the noise problem.

DRIVERS:

The driver package that these cards use is called OFED, or the "OpenFabrics Enterprise Distribution". Finding somewhere to download it turned out to be a chore and a half. The cards have (perhaps understandably, given their age) been designated by Mellanox as end of life, which means they won't provide support for them of any kind. Downloading driver packages for them also doesn't seem to be possible through the menus on their website. I eventually found a link in a reddit thread of all places, which took me directly to a download page on the Mellanox website that didn't seem to be accessible from any of the menus. However, it was for an Ubuntu 16.10 version of the OFED. Attempts to install this under 16.04 were fruitless. Modifying the URL directly, though, allowed me to get to the download page for the version I was looking for. The install process was actually pretty simple:
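
For reference, the process boiled down to something like this (a rough sketch from memory; the exact archive name depends on the OFED version and Ubuntu release you grab):

```
# Unpack the OFED archive (the filename varies with version and distro)
tar xzf MLNX_OFED_LINUX-*-ubuntu16.04-x86_64.tgz
cd MLNX_OFED_LINUX-*-ubuntu16.04-x86_64

# Run the bundled installer, then restart the infiniband driver stack
sudo ./mlnxofedinstall
sudo /etc/init.d/openibd restart
```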

All fine and dandy then? Well, not quite. The download from the Mellanox website would actually fail, but appear in my browser to have completed properly! Then, during the install process, errors would occur, different for every download attempt, as different files would be missing from the archive (because the download would fail in slightly different places each time). It turned out the download would actually fail after a fixed amount of time; the randomness was the result of varying download rates causing the download to get cut off at a different point each time. How bizarre... In the end, I installed a different browser, one that supported resuming failed downloads, and managed to get the whole archive down and install the drivers.

SETTING UP THE NETWORK:

Infiniband is actually an entirely alien network stack, which is to say, it doesn't use TCP/IP as its underlying transport protocol. The main reason behind this design choice is processing overhead. In an IP network, messages have to trundle down the TCP/IP stack of the sending machine, over the network, then back up the stack on the receiving machine. Doing all this requires CPU cycles, interrupts, context switches and a fair amount of synchronisation between the kernel and whatever user-space process is actually using the socket. TCP/IP was engineered this way based on the assumption that CPUs can handle data much faster than networks can deliver it. This is still true to a large extent; you'll be hard pressed to find a contemporary processor that can't handily manage a wifi connection or a gigabit Ethernet link. However, once you get into the gigabytes-per-second range, it no longer holds. The infiniband stack is engineered from the ground up for efficiency, and actually bypasses the kernel entirely, allowing user-space programs to interact with each other directly. What little protocol processing is required is handled mostly by hardware on the network card, allowing communication to take place with very little overhead.

With the cards installed, the next step is to ensure that they are detected by their host systems. The command "ibstat" will deliver information about any installed infiniband hardware and report back on its status:
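
In case it's useful, this is roughly what I ran on each node; the fields worth checking in the output are the port state, physical state and rate:

```
# Query the ConnectX-2 adapter; look at "State", "Physical state"
# and "Rate" in the output
ibstat
```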

The card is seen by the system, excellent. Its state is "down" and "polling", indicating that it is actively searching for a connection. This is about what we expect. Previously I mentioned that the PCIe 2.0 interface would likely limit the achievable bandwidth over the link, so it makes sense to ensure that the card has negotiated the proper speed with the system. Linux contains a command called lspci which will list information associated with devices connected to the PCI/PCIe bus. It tends to output a lot of text, so I grepped it for the device name "MT26428":
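
Roughly the following; the amount of context needed after the match may vary, but the line of interest is "LnkSta", which reports the negotiated link speed and width:

```
# Find the ConnectX-2 (device name MT26428) in the verbose PCI listing
# and pull out the link capability and status lines
sudo lspci -vv | grep -A 30 MT26428 | grep -E "LnkCap|LnkSta"
```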

The link status reads "PCIe 2.0 5GT/s", just as it's supposed to. The card is an x8 card, meaning that it uses eight lanes at 5GT/s each. It's technically possible that the card negotiated a narrower link (x4 and x1 are common widths), but probably not. We'll see once I compile and run some benchmarks.

Similar to DHCP on a TCP/IP network, an infiniband network requires a process somewhere on the network to manage addressing. Many infiniband switches have integrated "subnet managers" which handle this, but a subnet manager can actually be run on any system in the network. In fact, unlike DHCP, which is centralised, the infiniband subnet manager uses a distributed algorithm and can be run on several machines in the network for redundancy. In my case, I opted to run an instance on the topmost node in the cluster for now, as I usually use the nodes in descending order when I don't need all of them. The following commands start the subnet manager and print a list of hosts that are visible on the network:
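
Approximately the following; on my OFED install the subnet manager ships as the opensmd service, and ibhosts comes with the infiniband diagnostic tools:

```
# Start the OpenSM subnet manager on this node
sudo /etc/init.d/opensmd start

# List the channel adapters (hosts) visible on the fabric
sudo ibhosts
```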

Sure enough, there are my two hosts, cluster nodes 4 and 3. The next point of concern is that infiniband hardware will, in a manner similar to ethernet, negotiate port speed when connected. So for example, a QDR (40 gigabits) capable card might be connected to a DDR (20 gigabit) capable switch, causing the link to auto-negotiate a slower data-rate than the specification of that card would suggest. This is a concern because some switches from Mellanox are capable of QDR speeds, but without appropriate licensing, will only operate at DDR speeds. So the next thing to do is to check the speed of the link:
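
Another look at ibstat does the job; a QDR link should report a rate of 40:

```
# Check that the port is now active and has negotiated QDR (40Gbps)
ibstat | grep -E "State|Rate"
```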

The link is up, and the link speed reads as 40 gigabits, just as it should. One final touch is to set the subnet manager to run on startup, which it doesn't do by default. Adding the startup command to /etc/rc.local did the trick.
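
For completeness, the line I mean is something like the following, added before the final "exit 0" in /etc/rc.local:

```
# /etc/rc.local - bring up the infiniband subnet manager at boot
/etc/init.d/opensmd start
```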

CONFIGURING MPI FOR INFINIBAND:

Previously, I installed MPICH on the system to facilitate parallel programming. This has worked great over plain old ethernet, but MPICH doesn't actually support infiniband out of the box. MPICH does have support for MXM, a communication framework that in turn supports infiniband, and the previously installed OFED does come with MXM binaries and headers, but I was unable to get MPICH to compile against them. An alternative might be to use "IPoIB", or IP over infiniband, which essentially just uses the infiniband network as a channel for IP traffic, but that would reintroduce much of the overhead that infiniband is engineered to avoid. In the end I decided to switch to OpenMPI, which can be found on the Open MPI website. At the time of writing, the most recent version is 4.0.0. Like MPICH, it is distributed as source and needs to be compiled; the following commands did it for me:
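
Roughly speaking, that is (the install prefix here is a placeholder; it just needs to sit somewhere under the shared "cluster" directory mentioned below):

```
# Unpack the OpenMPI 4.0.0 source and build it with an install
# prefix on the shared filesystem
tar xzf openmpi-4.0.0.tar.gz
cd openmpi-4.0.0
./configure --prefix=/home/cluster/openmpi
make -j 4
make install
```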

The "cluster" directory is shared, and is visible at the same point in the filesystem on the head node and all compute nodes. I installed OpenMPI it to it's own directory alongside MPICH, making sure to compile it on one of the compute nodes so that it would spot the infiniband hardware and compile the code needed to support it. Lastly, I modified the lines in the bashrc on each of the nodes and my head node to point at the new MPI installation.

When trying to run an MPI program, I was warned that OpenMPI doesn't support infiniband directly and uses something called UCX instead. It's possible, however, to pass a parameter to mpirun that tells OpenMPI to use the infiniband hardware directly. I'm not sure why it was decided that this shouldn't be the default behaviour, but I imagine there is a reason. If I run into problems down the line, I'll bear it in mind. In any case, I was able to compile and run my trusty MPI hello world program as follows:
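
Something along these lines; the source file and hostfile names are placeholders, and the MCA parameter shown is the one OpenMPI 4.0's warning message suggests for letting it drive the infiniband ports directly:

```
# Compile with the OpenMPI wrapper compiler
mpicc hello_world.c -o hello_world

# Run across the infiniband-connected nodes, explicitly allowing
# OpenMPI to use the infiniband hardware
mpirun -np 8 --hostfile hosts --mca btl_openib_allow_ib true ./hello_world
```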

Excellent, everything appears to be working. Though it's worth pointing out that the hello world program I use here doesn't do any inter-node communication whatsoever. We'll need real programs for that.

As a final note, a peculiarity of my system now is that the head node, my desktop machine, doesn't actually have access to the infiniband network. It's a small form factor machine with only a single PCIe slot, which is already populated. This means that when running an MPI program, warnings are sometimes generated to the effect of "I can't find the infiniband device you said would be here". It seems OpenMPI is smart enough to deal with this though. I have actually been toying with the idea of replacing my ageing desktop machine for a while now; this is unlikely to be the straw that breaks the camel's back, but it's certainly one more for the heap.

BENCHMARKS:

To test the link, I recompiled the OSU MPI microbenchmarks using the OpenMPI compiler wrappers (they were previously compiled with MPICH), then ran a bandwidth test:
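
Roughly the recipe, assuming the OSU benchmark source is already sitting in the shared directory (paths and hostfile name are placeholders):

```
# Rebuild the OSU microbenchmarks against the OpenMPI wrappers
cd osu-micro-benchmarks
./configure CC=mpicc CXX=mpicxx
make

# Point-to-point bandwidth test, one process on each of the two nodes
mpirun -np 2 --hostfile hosts --mca btl_openib_allow_ib true \
    ./mpi/pt2pt/osu_bw
```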


Network bandwidth is now actually comparable to the bandwidth of socket-to-socket transfers within a node! Ethernet, of course, is left in the dust. The peak transfer rate is 3400MB/s, or 85% of the theoretical maximum bandwidth imposed by the link and by the PCIe 2.0 interface used by the cards. The deficit could come from a variety of sources; overhead in MPI, overhead in the PCIe code in the driver, overhead between the chipset and the processor, or perhaps all of the above. It really is anybody's bet. I certainly don't have a deep enough understanding of the infiniband stack to comment with any certainty. In any case, throughput is clearly much improved. Latency also sees drastic improvements at both large and small message sizes:
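
The infiniband latency numbers come from osu_latency in the same suite, run in the same way:

```
# Point-to-point latency test between the same pair of nodes
mpirun -np 2 --hostfile hosts --mca btl_openib_allow_ib true \
    ./mpi/pt2pt/osu_latency
```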


The latency of infiniband transfers at small message sizes was roughly 1.6 microseconds, whilst the direct socket-to-socket transfer came in at about 600 nanoseconds. Ethernet is once again nowhere to be seen, with latencies in the 25-26 microsecond range. These tests were run with only two nodes, connected directly by a cable and not via the switch. The spec sheet claims that the switch should exhibit a maximum port-to-port latency of 100 nanoseconds; it will be interesting to see if that is true. I think I'm going to hold off on further testing though until I can do something about the fan noise situation.
