The Next Generation Open Compute Hardware: Tried and Tested
by Johan De Gelas & Wannes De Smet on April 28, 2015 12:00 PM EST

The Latest and Greatest: Leopard
Leopard, the latest update to the Windmill motherboard, is equipped with the Intel C226 chipset to support up to two E5-2600 v3 Haswell Xeons.
Single-processor mode is fully supported, in which one CPU can access all of the onboard RAM. Increased thermal margins (mainly from upping the chassis height to 2 OU in Winterfell), bigger CPU heatsinks, and better airflow guidance allow the system to take CPUs with a maximum TDP of 145W, which means you can insert every Xeon except for the E5-2687W v3 (160W TDP). Only eight DIMM slots are connected per CPU, but DDR4 allows for a maximum capacity of 128GB per DIMM, resulting in a theoretical maximum of 2TB of RAM, which Facebook reckons is plenty for years to come. New in this generation is that you can now plug in NVDIMM modules (persistent flash storage in a DIMM form factor), which Facebook is testing to see if they can replace PCIe-based add-in cards.
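That ceiling is simple multiplication; a quick sanity check, using the socket and slot counts described above:

```python
# Back-of-the-envelope check of Leopard's theoretical memory ceiling,
# using the figures from the text: 2 sockets, 8 DIMM slots per socket,
# and 128GB DDR4 DIMMs.
sockets = 2
dimm_slots_per_socket = 8
gb_per_dimm = 128

total_gb = sockets * dimm_slots_per_socket * gb_per_dimm
print(f"{total_gb} GB = {total_gb // 1024} TB")  # -> 2048 GB = 2 TB
```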
Besides the generational CPU update, other major changes include the removal of the onboard external PCIe connector, support for a mezzanine card with dual QSFP receptacles, a TPM header, the addition of an mSATA/M.2 slot for SATA/NVMe-based storage, and 8 more PCIe lanes routed to the riser card slot, for a total of 24. The SAS connector has been removed, as Leopard will not be used as a head node for Knox.
Leopards, with the optional debug board (power/reset buttons and serial-to-USB) plugged in
A big addition to the board is a baseboard management controller (BMC). A simple headless ASPEED AST1250 controller provides traditional out-of-band IPMI access to query sensor and FRU data, control system power, and provide Serial over LAN. But Facebook taught it some new tricks: to aid bare-metal debugging, it keeps the last 256 POST codes in a buffer, retains 128KB of serial console output, and lets you remotely dump MSR data, which is done automatically when the IERR/MCERR signal is asserted.
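The generic parts of that feature set are reachable with stock IPMI tooling. Below is a minimal sketch driving the standard ipmitool CLI from Python; the host and credentials are placeholders, and Facebook's OEM extensions (the POST-code buffer, console log, and MSR dump) would require vendor-specific raw commands not shown here.

```python
# Minimal sketch of out-of-band BMC queries, shelling out to the stock
# ipmitool CLI. Host and credentials are placeholders; the OEM extensions
# Facebook added are not modeled here.
import subprocess

BMC_HOST = "leopard-bmc.example.com"  # placeholder
BMC_USER = "admin"                    # placeholder
BMC_PASS = "secret"                   # placeholder

def ipmi(*args):
    """Run one ipmitool command against the BMC over the LANplus interface."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", BMC_HOST,
           "-U", BMC_USER, "-P", BMC_PASS, *args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

print(ipmi("sensor", "list"))              # query sensor data
print(ipmi("fru", "print"))                # query FRU data
print(ipmi("chassis", "power", "status"))  # query system power state
# Serial over LAN is an interactive session: ipmitool ... sol activate
```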
A rather unique feature of the BMC is that it allows you to update the CPLD, VR, BMC, and UEFI firmware (basically all the firmware present on the motherboard) remotely, a feature fully validated by all suppliers of the mentioned components. Another addition is average power reporting: the BMC keeps a buffer of 600 power measurements and lets you query the buffer for a specific interval via IPMI. To improve the accuracy of the power sensor data, factory-determined (non)linear compensations are applied to the measured power usage. Lastly, another unique feature that stems from better rack-level integration is the ability to throttle CPU power usage when power demand in the power zone exceeds capacity, for instance when a PSU dies. When the load increases to the PSU's capacity, the PSU executes a quick temporary drop to 1V. This triggers an 'under-voltage' condition in the servers, which in turn asserts the Fast PROCHOT signal on the CPUs, causing them to clock down for a certain amount of time and thus decreasing PSU load, allowing it to remain active instead of shutting down.
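The interval query itself goes through OEM IPMI commands that are not publicly documented, but the behavior is easy to mimic client-side with standard DCMI power readings. A sketch, where the one-second sampling interval and the calibration coefficients are assumptions (the real compensation curves are factory-determined per board):

```python
# Client-side mimic of the BMC's 600-sample power buffer, using the
# standard `ipmitool dcmi power reading` command. Sampling interval and
# calibration coefficients below are made-up placeholders.
import re
import subprocess
import time
from collections import deque

BUFFER_SIZE = 600  # matches the BMC-side buffer described above

def read_power_watts():
    """Parse the instantaneous reading from `ipmitool dcmi power reading`."""
    out = subprocess.run(["ipmitool", "dcmi", "power", "reading"],
                         capture_output=True, text=True, check=True).stdout
    return float(re.search(r"Instantaneous power reading:\s+(\d+)", out).group(1))

def corrected(watts):
    """Apply a hypothetical (non)linear factory compensation curve."""
    a, b, c = 1.5e-5, 1.02, -3.0  # made-up coefficients
    return a * watts ** 2 + b * watts + c

samples = deque(maxlen=BUFFER_SIZE)
for _ in range(BUFFER_SIZE):
    samples.append(corrected(read_power_watts()))
    time.sleep(1.0)

print(f"Average over {len(samples)} samples: {sum(samples) / len(samples):.1f} W")
```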
Comments
Black Obsidian - Tuesday, April 28, 2015
I've always hoped for more in-depth coverage of the OpenCompute initiative, and this article is absolutely fantastic. It's great to see a company like Facebook innovating and contributing to the standard just as much as (if not more than) the traditional hardware OEMs.

ats - Tuesday, April 28, 2015
You missed the best part of the MS OCS v2 in your description: support for up to 8 M.2 x4 PCIe 3.0 drives!

nmm - Tuesday, April 28, 2015
I have always wondered why they bother with a bunch of little PSUs within each system or rack to convert AC power to DC. Wouldn't it make more sense to just provide DC power to the entire room/facility, then use less expensive hardware with no inverter to convert it to the needed voltages near each device? This type of configuration would get along better with battery backups as well, allowing systems to run much longer on battery by avoiding the double conversion between the battery and server.

extide - Tuesday, April 28, 2015
The problem with doing datacenter-wide power distribution is that at only 12V, to power hundreds of servers you would need to provide thousands of amps, and it is essentially impossible to do that efficiently. Basically the way FB is doing it is the way to go -- you keep the 12V current to reasonable levels and only have to pass that high current a reasonable distance. Remember, 6kW at 12V is already 500A! And that's just for HALF of a rack.
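The figure quoted in that comment is straightforward to verify, since current is just power divided by voltage:

```python
# Current needed at a given power and voltage: I = P / V.
power_w = 6000    # 6 kW, half a rack per the comment above
voltage_v = 12.0
print(f"{power_w / voltage_v:.0f} A")  # -> 500 A
```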
tspacie - Tuesday, April 28, 2015

Telcos have done this at -48VDC for a while. I wonder: did data center power consumption get too high to support this, or do the big data centers just not have the same continuous uptime requirements? Anyway, love the article.
Notmyusualid - Wednesday, April 29, 2015
Indeed. In the submarine cable industry (your internet backbone), ALL our equipment is -48V DC, even down to routers/switches (which are fitted with DC power modules, rather than the normal 100-250V AC units one expects to see).
Only the management servers (not my decision) and the converters that charge the DC plant run from AC power.
But 'extide' has a valid point - the lower voltage and higher currents require huge cabling. Once an electrical contractor dropped a piece of metal conduit from high up onto the copper bus bars in the DC plant. Need I describe the fireworks that resulted?
toyotabedzrock - Wednesday, April 29, 2015
48V allows 4 times the power at a given amperage. 12VDC doesn't like to travel far, and at the needed amperage it would require too much expensive copper.
I think a pair of square-wave pulsed DC feeds at a higher voltage could allow them to use just a transformer and some capacitors for the power supply shelf. The pulses would have to directly oppose each other.
Jaybus - Tuesday, April 28, 2015
That depends. The low voltage DC requires a high current, and so correspondingly high line loss. Line loss is proportional to the square of the current, so the 5V "rail" will have more than 4x the line loss of the 12V "rail", and the 3.3V rail will be high current and so high line loss. It is probably NOT more efficient than a modern PSU. But what it does do is move the heat-generating conversion process outside of the chassis and, more importantly, free up considerable space inside the chassis.
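The ratios in that comment follow from the physics: at a fixed delivered power, current scales as 1/V, and resistive loss as I²R, so the loss ratio between two rail voltages over the same cable is (V_high/V_low)².

```python
# Resistive line loss is P_loss = I^2 * R; at fixed delivered power,
# I = P / V, so for the same cable the loss ratio between two rails
# is (V_high / V_low)^2.
def loss_ratio(v_low, v_high):
    return (v_high / v_low) ** 2

print(f"5V vs 12V:   {loss_ratio(5.0, 12.0):.2f}x the loss")  # -> 5.76x
print(f"3.3V vs 12V: {loss_ratio(3.3, 12.0):.2f}x the loss")  # -> 13.22x
```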
Menno vl - Wednesday, April 29, 2015

There is already a lot going on in this direction. See http://www.emergealliance.org/ and especially their 380V DC white paper.
Going DC all the way, but at a higher voltage to keep the demand for cables reasonable. Switching 48VDC to 12VDC or whatever you need requires very similar technology to switching 380VDC to 12VDC. Of course the safety hazards are different, and it is similar when compared to mixing AC and DC, which is a LOT of trouble.
Casper42 - Monday, May 4, 2015
Indeed, HP already makes 277VAC and 380VDC power supplies for both the Blades and Rackmounts. 277VAC is apparently what you get when you split 480VAC 3-phase into individual phases.
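That last figure checks out: the line-to-neutral voltage of a three-phase system is the line-to-line voltage divided by the square root of 3.

```python
# Line-to-neutral voltage of a 480VAC 3-phase feed: V_LL / sqrt(3).
import math
print(f"{480 / math.sqrt(3):.0f} VAC")  # -> 277 VAC
```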