The Next Generation Open Compute Hardware: Tried and Tested
by Johan De Gelas & Wannes De Smet on April 28, 2015 12:00 PM EST

Networking
Server contributions aren't the only thing happening under the Open Compute Project. Over the last couple of years a new focus on networking has been added. Accton, Alpha Networks, Broadcom, Mellanox and Intel have each released a draft specification of a bare-metal switch to the OCP networking group. The premise of standardized bare-metal switches is simple: you can source standard switch models from multiple vendors and run the OS of your choosing on them, along with your own management tools like Puppet. There is no lock-in, and almost no migration path to worry about when bringing in different equipment.
To that end, Facebook created Wedge, a 40G QSFP+ ToR switch, together with the Linux-based FBOSS switch operating system, to spur development in the switching industry and, as always, to offer better value for the price. FBOSS (along with Wedge) was recently open sourced, and in the process accomplished something far bigger: convincing Broadcom to release OpenNSL, an open SDK for their Trident II switching ASIC. Wedge's main purpose is to decrease vendor dependency (e.g. choose between an Intel or ARM CPU, choice of switching silicon) and allow consistency across part vendors. FBOSS lets the switch be managed with Facebook's standard fleet management tools. And it is no longer Facebook alone who gets to play with Wedge, as Accton announced it will bring a Wedge-based switch to market.
Facebook Wedge in all its glory
Logical structure of the Wedge software stack
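To give a feel for what OpenNSL exposes, below is a minimal C sketch that attaches to the switching ASIC and drops a few front-panel ports into a VLAN. It is modeled on Broadcom's published OpenNSL example applications, not on FBOSS itself; the port numbers are made up, and the header paths and exact `opennsl_driver_init()` signature vary between SDK releases.

```c
/* Minimal OpenNSL sketch: bring up the ASIC driver and put four
 * front-panel ports into VLAN 42. Port numbers are hypothetical. */
#include <stdio.h>
#include <opennsl/error.h>
#include <opennsl/types.h>
#include <opennsl/vlan.h>
#include <sal/driver.h>

#define CHECK(rv, what) do {                                       \
        if (OPENNSL_FAILURE(rv)) {                                 \
            printf("%s failed: %s\n", (what), opennsl_errmsg(rv)); \
            return 1;                                              \
        }                                                          \
    } while (0)

int main(void)
{
    int unit = 0;               /* first (and only) switch ASIC */
    opennsl_vlan_t vid = 42;    /* arbitrary example VLAN */
    opennsl_pbmp_t pbmp, upbmp; /* member / untagged port bitmaps */
    int rv, port;

    /* Attach the driver to the switching silicon; note that some
     * OpenNSL releases take an opennsl_init_t* argument here. */
    rv = opennsl_driver_init();
    CHECK(rv, "opennsl_driver_init");

    rv = opennsl_vlan_create(unit, vid);
    CHECK(rv, "opennsl_vlan_create");

    /* Add front-panel ports 1-4 as untagged members of the VLAN. */
    OPENNSL_PBMP_CLEAR(pbmp);
    OPENNSL_PBMP_CLEAR(upbmp);
    for (port = 1; port <= 4; port++) {
        OPENNSL_PBMP_PORT_ADD(pbmp, port);
        OPENNSL_PBMP_PORT_ADD(upbmp, port);
    }
    rv = opennsl_vlan_port_add(unit, vid, pbmp, upbmp);
    CHECK(rv, "opennsl_vlan_port_add");

    printf("VLAN %d configured\n", vid);
    return 0;
}
```

The point is less the VLAN itself than the fact that, with the SDK open, the same handful of calls works on any box built around the same silicon, regardless of who assembled it.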
But in Facebook's leaf-spine network design, you need some heavier core switches as well, connecting all the individual ToR switches to build the datacenter fabric. Traditionally those high-capacity switches are sold by the big network gear vendors like Cisco and Juniper, and at no small cost. You might then be able to guess what happens next: a few days ago Facebook launched '6-pack', its modular, high-capacity switch.
Facebook 6-pack, with 2 groups of line/fabric cards
A '6-pack' switch consists of two module types: line cards and fabric cards. A line card is not so different from a Wedge ToR switch: the 16 40GbE QSFP+ ports at the front are fed by 640Gbps (16 x 40Gbps) of the ASIC's 1.28Tbps switching capacity, and the main difference with Wedge is that the remaining 640Gbps is linked to a new backside Ethernet-based interconnect, all in a smaller form factor. The line card also has a Panther micro server with a BMC for ASIC management. In the chassis, two rows of two line cards form one group, with each line card operating independently of the others.
Line card (note the debug header pins left to the QSFP+ ports)
The fabric card is the bit that connects all of the line cards together, and is thus the center of the fabric. Though the fabric card appears to be one module, it actually contains two switches (two 1.28Tbps packet crunchers, each paired with a Panther micro server), and like the line cards, they operate separately from each other. The only shared part is the management networking path, used by the Panthers and their BMCs, along with the management ports for each of the line cards.
Fabric card, with management ports and debug headers for the Panther cards
With these systems, Facebook has come a long way towards building its entire datacenter network from open, commodity components and running it on open software. The networking vendors are likely to take notice of these developments, and not only because of their pretty blue color.
ONIE
An effort to increase modularity even more is ONIE, short for Open Network Install Environment. ONIE is focused on eliminating operating system lock-in by providing an environment for installing common operating systems like CentOS and Ubuntu on your switching equipment. ONIE is baked into the switch firmware, and after installation the onboard bootloader (GRUB) directly boots the OS. But before you start writing Puppet or Chef recipes to manage your switches, a small but important side note is in order: to operate the switching silicon of the Trident ASIC you need a proprietary firmware blob from Broadcom, and until very recently Broadcom would not hand over that blob unless you had some kind of agreement with them. This is why, currently, the only OSes you can install on ONIE-enabled switches are commercial ones from the likes of Big Switch and Cumulus, who have agreements in place with the silicon vendors.
Luckily, Microsoft, Dell, Facebook, Broadcom, Intel and Mellanox have started work on a Switch Abstraction Interface (proposals), which would obviate the need for any custom firmware blobs and allow standard cross-vendor compatibility, though it remains to be seen to what degree this can completely replace proprietary firmware.
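As a rough illustration of what a network OS sitting on top of SAI would look like, here is a C sketch based on the early 0.9.x proposal headers, where `create_vlan()` still took a raw VLAN ID (later SAI revisions moved to object IDs and attribute lists); the error handling and the VLAN number are our own assumptions, not anything from the proposals themselves.

```c
/* Sketch: a network OS consuming a vendor's SAI driver. Modeled on
 * the early (0.9.x) SAI proposal; the interface has evolved since. */
#include <stdio.h>
#include <sai.h>

int main(void)
{
    /* The service method table lets the vendor library query
     * platform settings; NULL callbacks fall back to defaults. */
    service_method_table_t services = { NULL, NULL };
    sai_vlan_api_t *vlan_api = NULL;
    sai_status_t status;

    status = sai_api_initialize(0, &services);
    if (status != SAI_STATUS_SUCCESS) {
        printf("sai_api_initialize failed: %d\n", (int)status);
        return 1;
    }

    /* Ask the driver for its implementation of the VLAN API. */
    status = sai_api_query(SAI_API_VLAN, (void **)&vlan_api);
    if (status != SAI_STATUS_SUCCESS) {
        printf("sai_api_query failed: %d\n", (int)status);
        return 1;
    }

    /* The same function table is implemented by every compliant
     * ASIC driver, which is the whole point of the abstraction. */
    status = vlan_api->create_vlan(100);
    if (status != SAI_STATUS_SUCCESS) {
        printf("create_vlan failed: %d\n", (int)status);
        return 1;
    }

    printf("VLAN 100 created through SAI\n");
    return 0;
}
```

If this takes off, the firmware-blob problem above largely disappears: the OS only needs to target the function tables, and the vendor-specific bits stay behind the interface.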
Comments
Black Obsidian - Tuesday, April 28, 2015
I've always hoped for more in-depth coverage of the OpenCompute initiative, and this article is absolutely fantastic. It's great to see a company like Facebook innovating and contributing to the standard just as much as (if not more than) the traditional hardware OEMs.

ats - Tuesday, April 28, 2015
You missed the best part of the MS OCS v2 in your description: support for up to 8 M.2 x4 PCIe 3.0 drives!

nmm - Tuesday, April 28, 2015
I have always wondered why they bother with a bunch of little PSUs within each system or rack to convert AC power to DC. Wouldn't it make more sense to just provide DC power to the entire room/facility, then use less expensive hardware with no inverter to convert it to the needed voltages near each device? This type of configuration would get along better with battery backups as well, allowing systems to run much longer on battery by avoiding the double conversion between the battery and server.

extide - Tuesday, April 28, 2015
The problem with doing datacenter-wide power distribution is that at only 12V, to power hundreds of servers you would need to provide thousands of amps, and it is essentially impossible to do that efficiently. Basically, the way FB is doing it is the way to go -- you keep the 12V current to reasonable levels and only have to pass that high current a reasonable distance. Remember, 6kW at 12V is already 500A! And that's just for HALF of a rack.

tspacie - Tuesday, April 28, 2015
Telcos have done this at -48VDC for a while. I wonder: did data center power consumption get too high to support this, or do the big data centers just not have the same continuous uptime requirements? Anyway, love the article.
Notmyusualid - Wednesday, April 29, 2015
Indeed. In the submarine cable industry (your internet backbone), ALL our equipment is -48V DC. Even down to routers/switches (which are fitted with DC power modules, rather than the normal 100-250V AC units one expects to see).
Only the management servers (not my decision) and the converters that charge the DC plant run from AC power.
But 'extide' has a valid point - the lower voltage and higher currents require huge cabling. Once an electrical contractor dropped a piece of metal conduit from high up onto the copper 'bus bars' in the DC plant. Need I describe the fireworks that resulted?
toyotabedzrock - Wednesday, April 29, 2015
48V allows 4 times the power at a given amperage. 12VDC doesn't like to travel far, and at the needed amperage it would require too much expensive copper.
I think a pair of square-wave pulsed DC feeds at a higher voltage could allow them to just use a transformer and some capacitors for the power supply shelf. The pulses would have to directly oppose each other.
Jaybus - Tuesday, April 28, 2015
That depends. Low-voltage DC requires a high current, and so correspondingly high line loss. Line loss is proportional to the square of the current, so the 5V "rail" will have more than 4x the line loss of the 12V "rail", and the 3.3V rail will be high current and so high line loss as well. It is probably NOT more efficient than a modern PSU. But what it does do is move the heat-generating conversion process outside of the chassis and, more importantly, free up considerable space inside the chassis.
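To put rough numbers on that square-law point, here is a quick sketch using the thread's own 6kW half-rack figure and an assumed round-trip cable resistance of 1 milliohm (both values purely illustrative):

```c
/* I^2*R comparison: the same 6kW load delivered at different
 * distribution voltages over the same assumed 1 milliohm run. */
#include <stdio.h>

int main(void)
{
    const double power_w = 6000.0;  /* half-rack load from the thread */
    const double r_ohm = 0.001;     /* hypothetical cable resistance  */
    const double volts[] = { 12.0, 48.0, 277.0, 380.0 };
    const int n = sizeof(volts) / sizeof(volts[0]);

    for (int i = 0; i < n; i++) {
        double amps = power_w / volts[i];  /* I = P / V   */
        double loss = amps * amps * r_ohm; /* P = I^2 * R */
        printf("%6.0f V: %6.1f A, %7.2f W lost in cabling\n",
               volts[i], amps, loss);
    }
    return 0;
}
```

At 12V this example run burns 250W in the copper alone, against under 16W at 48V: the 16x gap behind the comments above, and the reason the discussion keeps drifting toward higher distribution voltages.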
Menno vl - Wednesday, April 29, 2015

There are already a lot of things going on in this direction. See http://www.emergealliance.org/ and especially their 380V DC white paper.
Going DC all the way, but at a higher voltage to keep the demand on cables reasonable. Switching 48VDC to 12VDC or whatever you need requires very similar technology to switching 380VDC to 12VDC. Of course the safety hazards are different, similar to when you mix AC and DC, which is a LOT of trouble.
Casper42 - Monday, May 4, 2015
Indeed, HP already makes 277VAC and 380VDC power supplies for both the blades and rackmounts. 277VAC is apparently what you get when you split 480VAC 3-phase into individual phases.