DIY cluster: hardware [Autumn 2017]

Following graduation, one of the first projects I set myself was to design and build a cluster. The motivation was that during my master's project, in an effort to produce an original thesis, I had inadvertently glanced off a much more general problem, namely the decomposition and parallelization of tasks. At a conceptual level, this is pretty simple. You break down whatever process you want to run concurrently into blocks, determine the dependencies between those blocks, then allocate them to functional elements as appropriate. These "functional elements" can in principle be anything: humans in a team, servers in a rack, and so on. Simple, right? The devil, as always, is in the details. Engineering concurrent software is hard. Engineering systems that can handle arbitrary concurrent software is very hard. Which brings me neatly to this project. I wanted to build a system I could use to get a feel for the sort of problems that arise when writing, using and debugging parallel software: a field I now know as HPC, "high performance computing", or more colloquially, super-computing. Along the way I also ended up familiarising myself with networking, the Linux command line, and a whole host of supporting (mostly Linux-y) technologies. But more on that later.
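To make that a bit more concrete, here is a toy Python sketch of the idea: a job broken into blocks, a record of the dependencies between them, and a loop that hands blocks to a pool of workers as soon as their dependencies are satisfied. The block names and the "work" are invented purely for illustration, not anything from the cluster itself.

```python
# Toy illustration: decompose a job into blocks, track dependencies,
# and run each block concurrently once its dependencies have finished.
# Block names and workloads here are made up for the example.
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# Each block maps to the set of blocks it depends on.
dependencies = {
    "load_a": set(),
    "load_b": set(),
    "process_a": {"load_a"},
    "process_b": {"load_b"},
    "combine": {"process_a", "process_b"},
}

def run_block(name):
    # Stand-in for real work (e.g. one chunk of a larger computation).
    print(f"running {name}")
    return name

def run_graph(deps, max_workers=4):
    done, running = set(), {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(done) < len(deps):
            # Launch every block whose dependencies are all satisfied.
            for name, needs in deps.items():
                if name not in done and name not in running and needs <= done:
                    running[name] = pool.submit(run_block, name)
            # Wait for at least one running block to finish, then retire it.
            finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
            for name in [n for n, f in running.items() if f in finished]:
                done.add(name)
                del running[name]

run_graph(dependencies)
```

The "functional elements" here are just threads in a pool, but the same bookkeeping applies whether the workers are threads, processes, or nodes in a rack.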

The goal for the hardware was, within budget and power constraints, to build something as analogous to a real-world HPC installation as possible. So stacking up some Raspberry Pis was out of the question. I had been working with commercial PC hardware for years, so identifying compatible parts wasn't too difficult. After a couple of weeks scrounging on eBay, I settled on some old circa-2010 dual-socket motherboards from HP DL180/SE1220/SE216M1 storage servers. On a good day, they can be had on eBay for £30 each. They have two LGA1366 CPU sockets, 12 DDR3 memory slots, and a whole bunch of IO options. For the processors, I opted for L5640s: 6-core, 12-thread, low-power Xeon chips. They are far from the fastest processors available for the platform, but I went with them anyway, partially because I wanted to save some money on electricity, but mostly because I could get them in matched pairs for £10 each, shipped from South Korea. To summarise, each node would have full-fat x86 cores with SSE4.2 (though alas no AVX), multiple CPU sockets, and PCI Express expansion options, making them fairly good analogues for the kind of servers you might find in a real-world HPC installation. In the end, after all expenses (fans, heatsinks, power supplies, memory, etc.) each node came to roughly £130. Not too shabby.
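As an aside, if you want to confirm which instruction set extensions a Linux node actually reports before committing to anything, the flags are easy to read out of /proc/cpuinfo. This is a generic check, nothing specific to these boards:

```python
# Report a few instruction set extensions by reading the "flags" line
# from /proc/cpuinfo (Linux only).
def cpu_flags(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("sse4_2", "avx", "avx2"):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")
```

On one of these L5640 nodes you would expect sse4_2 to show up and the AVX entries not to.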

The first issue I encountered was actually anticipated. Typically, for any given processor socket, motherboard manufacturers adhere to a standardised heatsink mounting hole layout. However, some board manufacturers (and this is especially common with servers) have their own ideas about how CPU heatsinks ought to be bolted to things. In the case of these HP boards, there are only two mounting holes, one at either end of the CPU socket, rather than the usual four. One solution might have been to buy heatsinks that were originally designed for this brand of server. If the boards are available, then the heatsinks ought to be too, right? Unfortunately, these boards originally came out of 2U servers, so the heatsinks were quite small and would have required high-speed fans (like the ones you find in a server, who would have guessed?) to keep the processors cool. Additionally, they were unreasonably expensive, and I wanted to be able to sit next to the thing while it was running, so I decided against using them. In the end I found a brand of aftermarket CPU heatsink that was cheap and whose original mounting brackets were detachable via small screws at the base. Perfect. I removed the original mounting hardware and replaced it with plates cut from aluminium siding.

1) CPU heatsink mounted with a makeshift mounting solution.

My second problem was not anticipated. The boards I ordered must have been manufactured before the newer generation of LGA1366 Xeon chips was released, and despite the assurances of the seller, they arrived with old firmware that made them incompatible with my (slightly) more modern Westmere L5640s. The L5640 is on the qualified vendor list for the servers these boards came out of, though, so this was fixable with a firmware update. Updating firmware, however, requires a working processor. I ended up buying another CPU that was guaranteed to work, an L5520, for less than £5.

At this point, another problem arose. The boards would now POST successfully with the older processor installed, but then immediately shut down, complaining of fan failures. The fans, of course, weren't plugged in. As it turns out, the boards check for the presence and speed of the installed fans before allowing the system to boot, to protect the hardware from overheating. Smart in a data centre or some other high-availability environment, but not very helpful here. To make matters worse, these boards came with non-standard fan headers, so I couldn't just plug in some conventional PC case fans.

For the uninitiated, there are generally two flavours of PC case fan: 4-pin and 3-pin. 3-pin fans use the same connector layout as 4-pin ones, but omit the fourth wire, which carries a PWM (pulse width modulation) signal generated by the motherboard to control the speed of the fan. Additionally, PWM fans read an open circuit on that wire as "full speed", so 3-pin and 4-pin fans can be used interchangeably. The motherboard manufacturer can then choose to control attached fans either using PWM or by directly varying the supply voltage.
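On an ordinary Linux box with a supported sensor chip you can see both halves of this picture, the reported RPM and the 0-255 PWM duty value, through the hwmon sysfs interface. Which hwmon device (if any) exposes fan control depends entirely on the board and driver, so the sketch below is illustrative rather than specific to this hardware:

```python
# Scan the Linux hwmon sysfs tree for devices that expose a fan
# tachometer (fan1_input, in RPM) and a PWM duty value (pwm1, 0-255).
from pathlib import Path

def read_int(path):
    return int(path.read_text().strip())

for hwmon in sorted(Path("/sys/class/hwmon").glob("hwmon*")):
    fan = hwmon / "fan1_input"
    pwm = hwmon / "pwm1"
    if fan.exists():
        rpm = read_int(fan)
        duty = read_int(pwm) if pwm.exists() else None
        print(f"{hwmon.name}: fan1 at {rpm} rpm"
              + (f", pwm duty {duty}/255" if duty is not None else ""))

# Taking manual control of a fan (root required, driver permitting):
#   echo 1   > .../pwm1_enable    # 1 = manual PWM control
#   echo 128 > .../pwm1           # roughly 50% duty cycle
```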

2) Pin-outs for various fan headers

In the case of the proprietary connectors on these boards, though, it's not so simple. After asking around on some forums, I was able to find someone who had a pin-out for them. Each header has a single ground pin, a single PWM pin, two 12-volt pins for supplying power, and two RPM signal pins. There are six such headers on the board: four are checked when the system starts up, and two appear to serve as auxiliary headers and aren't checked. A quick peek at the internal organisation of the server sheds some light on why this is:

3) Internal fan configuration of an HP DL180 G6

Each header supports two fans in series, providing redundancy should one of them fail. If both fans on any given header fail, the system refuses to boot. To get around this, I took the RPM signals from each of the CPU cooler fans, which luckily spin fast enough not to trip the motherboard's built-in safeguards, and connected them to each of the RPM pins on all four headers with a series of jumpers. I then connected some large 140mm case fans to the auxiliary fan headers for general component cooling. After that, I had to put a copy of Windows on a hard drive, as the firmware update utility provided by HP was not the usual flash-drive-at-boot affair that you often see with consumer boards, but instead a Windows-based utility that performs the upgrade from within the operating system. Sketchy, but it worked. Finally, I was able to boot the board with two L5640s installed.


4) Finally up and running. The cheapest fans available came with blue LEDs.

The final step was to build an enclosure for the cluster. This ended up being a pretty simple affair. I went down to the local DIY store, had some chipboard cut to an appropriate size, and used M4 threaded rod to bolt the whole assembly together. I added some aluminium braces to the corners, mounted the case fans to the back, hooked up a gigabit Ethernet switch, and with that the hardware was ready to go.

5) The completed cluster
