
Showing posts from 2018

Silencing a noisy InfiniBand switch

Recently, I installed InfiniBand networking in my DIY cluster, including a rather noisy 18-port switch that I've been meaning to do something about. The plan is to either replace the fans in the switch with ones that spin a bit slower or, if that isn't possible, admit defeat and just move the whole setup to another room. Maybe I can use the cluster as an impromptu central-heating system. 1) Inside the Mellanox IS5023. This is the first real piece of enterprise-grade kit I've owned. The first thing that struck me about it was the build quality. The chassis is sturdy, all internal cables are tightly secured out of the way, the motherboard is significantly thicker than what you see in consumer gear, and the design has clearly been optimised for airflow and redundancy of cooling. Speaking of airflow, the cooling situation is very simple. Four forty-millimetre fans draw cool air through a vent at the rear of the unit and through the rear compartment containing the po…

InfiniBand for the cluster

One of the biggest limiting factors for my cluster, insofar as performance is concerned, has been the interconnect. I know it seems odd to talk about the performance of a cluster built for educational purposes from nearly 10-year-old hardware, but the interconnect is a very big part of what makes an HPC system "HPC" and not just a bunch of dumb servers on a LAN. I knew this would probably be the case when building the thing, of course: 1 gigabit is not a lot of bandwidth, and I have found that I have to be extremely sparing with communication when writing code for it. So, since building it just over a year ago, I've kept my eye on eBay, searching for something cheap and fast to replace the network with. Meet the Mellanox IS5023, an 18-port 40-gigabit unmanaged InfiniBand switch, which I found on eBay for £125. I was also able to find single-port InfiniBand cards for ~£20 each from China and QSFP copper direct-attach cables for ~£8 each, also from China. I also bo…
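To put the bandwidth question in concrete terms, here is a minimal sketch of how one might measure point-to-point bandwidth between two nodes with MPI. It is an illustration rather than code from the post, and assumes Python with mpi4py and NumPy installed, launched on two ranks:

    # Point-to-point bandwidth sketch using mpi4py (assumed installed).
    # Run with, e.g.: mpirun -np 2 python bandwidth.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    nbytes = 64 * 1024 * 1024          # 64 MiB payload
    reps = 10
    buf = np.zeros(nbytes, dtype=np.uint8)

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)    # rank 0 sends the buffer
        elif rank == 1:
            comm.Recv(buf, source=0, tag=0)  # rank 1 receives it
    comm.Barrier()
    t1 = MPI.Wtime()

    if rank == 0:
        gbit = nbytes * reps * 8 / (t1 - t0) / 1e9
        print(f"~{gbit:.2f} Gbit/s one-way")

Over gigabit Ethernet a test like this cannot exceed roughly 1 Gbit/s, which is exactly the ceiling the InfiniBand upgrade is meant to lift.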

DIY cluster: software [Autumn 2017]

In a previous post I talked a bit about the hardware I used to build my DIY compute cluster in the autumn of 2017. The primary goal of the project was to gain experience building and debugging parallel software, but one of the secondary goals was to familiarise myself with Linux as an operating environment, which I had little prior experience with. Naturally, the first choice was which flavour of Linux to use, of which there are literally hundreds. At the time of building the cluster, the only distribution I had any exposure to was Ubuntu, so I opted for the then-current server version (16.04), as I thought it would minimise any potential compatibility issues with the desktop variant of Ubuntu (also 16.04) I had installed on the machine where I planned to do my development. The first thing that needed doing was setting up remote access. I quickly learned that SSH, or "secure shell", is the industry standard here. Setting up SSH on each of the nodes turned out to be a breeze…
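As a rough illustration of what that remote access makes possible, the following sketch (the hostnames are placeholders, not taken from the post) fans a single command out to every node over SSH, assuming key-based login is already set up:

    # Run one command on every node over SSH. Assumes passwordless
    # key-based auth; the node names below are hypothetical.
    import subprocess

    NODES = ["node01", "node02", "node03", "node04"]

    def run_on_all(command):
        results = {}
        for host in NODES:
            # BatchMode=yes makes ssh fail fast instead of prompting for a password
            proc = subprocess.run(
                ["ssh", "-o", "BatchMode=yes", host, command],
                capture_output=True, text=True,
            )
            results[host] = (proc.returncode, proc.stdout.strip())
        return results

    if __name__ == "__main__":
        for host, (code, out) in run_on_all("uptime").items():
            print(f"{host}: [{code}] {out}")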

Building a router [Spring 2018]

One of the problems I had after building my cluster was to do with address allocation. The router at the house where I was living at the time had been locked down by the landlords, so I was unable to set static IP addresses on it, or even look at the router settings for that matter. It is possible to configure a static address on each machine, and that is exactly what I did when setting everything up. Occasionally, though, the router would allocate one of my static IPs to another device on the network, leaving me unable to connect to some of my machines or otherwise messing things up. The protocol responsible for handing out addresses on an IP network is called DHCP (Dynamic Host Configuration Protocol). When a device connects to a network, one of the first things it does is broadcast a DHCP request. A listening DHCP server then responds with an IP address for the device to use, ensuring that it doesn't conflict with any already allocated addresses. It stands to reason then tha…
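To make the idea concrete, here is a toy sketch of the allocation problem DHCP solves: hand each client a unique address from a pool and give a returning client the same one back. It is a conceptual illustration only, not a real DHCP server, and the address range and MAC addresses are made up:

    # Toy lease pool: unique addresses per client, stable on renewal.
    import ipaddress

    class LeasePool:
        def __init__(self, network, first, last):
            net = ipaddress.ip_network(network)
            hosts = list(net.hosts())[first - 1:last]   # hosts .first .. .last
            self.free = [str(ip) for ip in hosts]
            self.leases = {}                             # MAC address -> IP

        def request(self, mac):
            if mac in self.leases:                       # renewing client keeps its IP
                return self.leases[mac]
            ip = self.free.pop(0)                        # otherwise take the next free one
            self.leases[mac] = ip
            return ip

    pool = LeasePool("192.168.1.0/24", first=100, last=200)
    print(pool.request("aa:bb:cc:dd:ee:01"))             # 192.168.1.100
    print(pool.request("aa:bb:cc:dd:ee:02"))             # 192.168.1.101
    print(pool.request("aa:bb:cc:dd:ee:01"))             # 192.168.1.100 again

A real DHCP server of course also has to track lease lifetimes and exclude any statically assigned addresses, which is precisely where the locked-down house router fell short.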

DIY cluster: hardware [Autumn 2017]

Following graduation, one of the first projects I set myself was to design and build a cluster. The motivation was that during my master's project, in an effort to produce an original thesis, I had inadvertently glanced off a much more general problem, namely the decomposition and parallelisation of tasks. At a conceptual level, this is pretty simple: you just break down whatever process you want to do concurrently into blocks, determine the dependencies between those blocks, then allocate them to functional elements as appropriate. These "functional elements" can in principle be anything: humans in a team, servers in a rack, and so on. Simple, right? The devil, of course, as always, is in the details. Engineering concurrent software is hard. Engineering systems that can handle arbitrary concurrent software is very hard. Which brings me neatly to this project. I wanted to build a system which I could use to get a feel for the sort of problems that arise when writing, using…
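As a toy illustration of that decomposition idea (not code from the project), the sketch below declares a handful of made-up blocks and their dependencies, then hands ready blocks to a small pool of workers, the "functional elements", using Python's standard graphlib and concurrent.futures:

    # Blocks of work with dependencies, dispatched only once their inputs are done.
    from graphlib import TopologicalSorter        # Python 3.9+
    from concurrent.futures import ThreadPoolExecutor

    # block -> set of blocks it depends on (an invented example graph)
    graph = {
        "load": set(),
        "split_a": {"load"},
        "split_b": {"load"},
        "merge": {"split_a", "split_b"},
    }

    def run_block(name):
        print(f"running {name}")

    ts = TopologicalSorter(graph)
    ts.prepare()
    with ThreadPoolExecutor(max_workers=2) as pool:
        while ts.is_active():
            ready = list(ts.get_ready())          # blocks whose dependencies are satisfied
            list(pool.map(run_block, ready))      # run them concurrently on the workers
            ts.done(*ready)                       # mark them finished, unlocking successors

The hard parts, as the post goes on to suggest, are everything this sketch glosses over: communication between blocks, load balancing, and what happens when a "functional element" fails.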