Silencing a noisy infiniband switch
Recently, I installed infiniband networking in my DIY cluster, including a rather noisy 18 port switch that I've been meaning to do something about. The plan is to either replace the fans in the switch with ones that spin a bit slower or, if that isn't possible, admit defeat and move the whole setup to another room. Maybe I can use the cluster as an impromptu central-heating system.
This is the first real piece of enterprise-grade kit I've owned. The first thing that struck me about it was the build quality. The chassis is sturdy, all internal cables are tightly secured out of the way, the motherboard is significantly thicker than what you see in consumer gear, and the design has clearly been optimised for airflow and redundancy of cooling. Speaking of airflow, the cooling arrangement is very simple. Four forty millimetre fans draw cool air through a vent at the rear of the unit and through the rear compartment containing the power supply. The fans push the air into a forward compartment, which contains the business end of the switch. The air immediately passes over a large heatsink which cools whatever application-specific silicon the switch uses. At the front, where the cables attach, there is a series of comparatively small vent holes through which air is exhausted; the restriction these present sets up positive pressure in the forward compartment, which in turn ensures that the transceivers all get roughly equal airflow. I'm by no means experienced enough to comment with any authority (I have never designed or built anything like this, and likely never will), but the whole design screams attention to detail to me. It's almost a shame to mess with it. However, mess with it I shall.
1) Inside the Mellanox IS5023.
TEMPERATURE TESTING
The first order of business, then, is to figure out whether I can actually mess with it without busting something. Does it run hot? Does it absolutely need those four 45 decibel 12,000 rpm fans, or can I get away with something a little more conservative? I'm inclined to say yes, on three counts: firstly, the switch's spec sheet states that it is suitable for operation at 45 degree ambient temperatures, and I can just not do that; secondly, cooling performance sees diminishing returns as airflow increases; and lastly, the switch has to support the power draw of connected cables, and this 18 port model is a cut-down version of a larger 36 port one, so the cooling is quite likely sized for that configuration. I'd better test it anyway though, just to be sure. This of course means stressing the thing out. I have (for the moment) only five nodes, meaning I likely can't come close to maxing it out, but I can load it about as far as the nodes in this cluster ever will, which will have to do for now. I'll just have to pay attention to it if I decide to add more hardware later.
To drive as much data through the switch as I can, I implemented a simple load test program which rotates a set of buffers between processes. Each process initialises two identical buffers, filling one with random numbers and leaving the other empty. It then transmits the contents of the first buffer to the next process whilst simultaneously receiving the buffer from the previous one. References to the two buffers are then swapped and the process repeats:
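The original listing isn't reproduced here, but the idea looks something like the following sketch in C with MPI. The buffer size and iteration count are made-up values for illustration, not the ones I actually used.

```c
/* Ring-rotation network load test (sketch). Each rank sends one buffer
 * to the next rank while receiving into the other from the previous
 * rank, then swaps the two buffers and repeats. */
#include <mpi.h>
#include <stdlib.h>

#define BUF_WORDS (16 * 1024 * 1024)   /* 64 MiB of ints per buffer (illustrative) */
#define ITERATIONS 100000              /* illustrative; run it as long as you need */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    int *send_buf = malloc(BUF_WORDS * sizeof(int));
    int *recv_buf = malloc(BUF_WORDS * sizeof(int));
    for (size_t i = 0; i < BUF_WORDS; i++)
        send_buf[i] = rand();          /* fill one buffer, leave the other empty */

    for (long iter = 0; iter < ITERATIONS; iter++) {
        /* Send to the next rank while receiving from the previous one. */
        MPI_Sendrecv(send_buf, BUF_WORDS, MPI_INT, next, 0,
                     recv_buf, BUF_WORDS, MPI_INT, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Swap the buffer references and go again. */
        int *tmp = send_buf;
        send_buf = recv_buf;
        recv_buf = tmp;
    }

    free(send_buf);
    free(recv_buf);
    MPI_Finalize();
    return 0;
}
```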
It's pretty crude for sure, but it should be pretty heavy on the network. Certainly heavier than any application I have written for the cluster thus far, and probably about as communication-heavy as an MPI program can reasonably be. I compiled it and left it running for about half an hour, then went looking for hot-spots inside the switch with an infrared thermometer. Turns out it fared pretty well: the main heatsink was scarcely above room temperature, and the only parts on the board over 30 degrees were the black surface mount resistors in the regulator circuits on the left hand side of the board:
2) The power regulator section of the board.
This isn't too much of a concern, as components in switched mode regulators like this often operate comfortably at temperatures approaching 100 degrees. I also couldn't detect any difference in the switch's exhaust air temperature between idle and load. Besides, if the power supply circuits do end up requiring additional cooling, I can always add some heatsinks.
FAN INSTALLATION
The fans that came installed in the switch are Sunon units, rated for 19 CFM (cubic feet per minute) of airflow each. I obviously wanted quieter ones, but I didn't want to stray too far from the specifications of the fans that were already installed. I eventually came across a set of five 40 mm fans for under £10 on eBay:
3) Replacement fans.
They are rated for 7.5 cubic feet per minute and draw much less power, only 1.2 watts. As for noise, they are rated at only 27 decibels, which, whilst still louder than your average consumer system, should be far more tolerable than the deafening 45 decibel fans they are replacing. To get them installed in the switch, though, I had to do some rewiring. It turns out that the fan headers on the switch motherboard, despite using a standard 3-pin 0.1" pitch connector of the kind you see on consumer PC hardware all the time, have a non-standard pinout. Not to worry though: modifying these connectors is pretty trivial, you just poke out the pins with something pointy and put them back in whatever order you need:
4) Modified fan connectors.
Next, I had to remove the old fans and find a way to mount the new ones. I expected this to be a pain because of the positioning of the screws, but the central divider to which the fans are mounted turned out to be removable, so this was pretty easy too:
5) New fans installed in the switch.
Finally, I re-ran the stress test to confirm that there were no thermal issues. Upon turning the switch on, a warning light appeared on the front indicating a problem with the fans. It makes sense that the firmware would check the fan speed, I suppose. However, the switch operated normally, and temperatures under stress weren't appreciably higher than with the old fans, so this doesn't seem to be a problem (fingers crossed).
SOME BENCHMARKS
When I first installed the new network hardware, I ran some point to point bandwidth and latency benchmarks, but nothing that required more than two nodes. Now that the switch has been quietened down, I thought I would test the performance of the interconnect again, this time with collective communication. The test uses the MPI_Allgather primitive, which is a compound scatter/gather operation. For the uninitiated, this involves each process broadcasting some data to the other hosts (scattering), whilst simultaneously receiving (gathering) the data that the other hosts are scattering. As I understand it, the broadcast part of this process is not done in a simple loop, but with a tree-like structure that leverages multiple nodes for each broadcast. This helps massively when there are a large number of broadcast targets, as broadcast operations then happen in logarithmic rather than linear time. In any case, the operation ends up being a lot more complex than a simple transfer, so the OSU benchmarks don't offer a "bandwidth" test for it, instead just reporting the time required for the operation.
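For a concrete picture of what the primitive does, here's a minimal sketch of my own (not the OSU benchmark itself): each rank contributes one block, and afterwards every rank holds the concatenation of all the blocks, ordered by rank.

```c
/* Minimal MPI_Allgather illustration. The block size and payload are
 * arbitrary; they're just here to show the shape of the call. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int block = 1024;                      /* elements contributed per rank */
    int *mine = malloc(block * sizeof(int));
    int *all  = malloc((size_t)block * size * sizeof(int));

    for (int i = 0; i < block; i++)
        mine[i] = rank;                          /* dummy payload */

    /* Every rank ends up with size * block elements in 'all'. */
    MPI_Allgather(mine, block, MPI_INT, all, block, MPI_INT, MPI_COMM_WORLD);

    free(mine);
    free(all);
    MPI_Finalize();
    return 0;
}
```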
The infiniband results are pretty much what we saw with the point to point tests back when I originally installed the infiniband hardware: transfer time is limited by the latency of the interconnect for small messages, but as they get larger, it becomes increasingly dominated by bandwidth. Nothing unexpected. Here's the same test for ethernet:
The ethernet results surprised me. In previous point to point tests, ethernet was slower than infiniband, as you would expect, and it is here too, by roughly the same margin. However, when you start using multiple processes per node, ethernet performance becomes a little unpredictable, with an anomaly at a message size of 64k. This actually makes some sense: with multiple processes on each box, some of the communication is intra-node and never touches the network at all. That intra-node communication is on the order of 100 times faster than gigabit ethernet, as I discovered when I installed the infiniband network. This means that, from the point of view of whoever implements MPI_Allgather, it makes sense to do the intra-node communication first to minimise use of the (slower) network, and only then gather between nodes. Something like this could in part explain the wonky graph above. I know little about how MPI is actually implemented though, so take my words with a grain of salt.
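To make that idea concrete, here's a rough sketch of how a two-level allgather could be staged: gather within each node over shared memory, exchange the combined blocks between one leader per node over the network, then broadcast the result back within each node. This is just an illustration of the general technique, not how any particular MPI implementation actually does it, and it assumes an equal number of processes per node with ranks numbered contiguously within each node.

```c
/* Two-level ("hierarchical") allgather sketch. Assumes equal processes
 * per node and node-contiguous rank numbering. */
#include <mpi.h>
#include <stdlib.h>

void hierarchical_allgather(const int *mine, int count, int *all, MPI_Comm comm)
{
    int world_size;
    MPI_Comm_size(comm, &world_size);

    /* Split into per-node communicators (processes that share memory). */
    MPI_Comm node;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node);

    int node_rank, node_size;
    MPI_Comm_rank(node, &node_rank);
    MPI_Comm_size(node, &node_size);

    /* One "leader" per node joins a cross-node communicator. */
    MPI_Comm leaders;
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leaders);

    /* 1. Cheap intra-node gather onto the node leader. */
    int *node_buf = NULL;
    if (node_rank == 0)
        node_buf = malloc((size_t)count * node_size * sizeof(int));
    MPI_Gather(mine, count, MPI_INT, node_buf, count, MPI_INT, 0, node);

    /* 2. Leaders exchange their nodes' combined blocks over the network. */
    if (node_rank == 0)
        MPI_Allgather(node_buf, count * node_size, MPI_INT,
                      all, count * node_size, MPI_INT, leaders);

    /* 3. Broadcast the full result back to every process on the node. */
    MPI_Bcast(all, count * world_size, MPI_INT, 0, node);

    if (node_rank == 0) {
        free(node_buf);
        MPI_Comm_free(&leaders);
    }
    MPI_Comm_free(&node);
}
```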
Either way, I can use the network now without being deafened. And with the blog more or less up to date with the cluster project as a whole, I can get back to actually writing code!