Today we’re excited to introduce Backpack, our second-generation modular switch platform. Together with our recently announced Wedge 100, this completes our 100G data center pod. We’ve already begun putting Backpack in production at Facebook, and we’ve submitted the specification to the Open Compute Project for review. Going forward, we are excited to work with the community to develop an ecosystem around Backpack and help others build on this platform.

As we migrate our infrastructure from 40G to 100G, we’re faced with several hardware design challenges. Not only does our network need to run at a higher speed, but we also need a better cooling system to support 100G ASICs and optics, which consume significantly more power than previous-generation components. Backpack is designed to meet both of these requirements while keeping our fabric design simple and scalable.

Backpack has a fully disaggregated architecture that uses simple building blocks called switch elements, and it has a clear separation of the data, control, and management planes.
It has an orthogonal direct chassis architecture, which enables better signal integrity and opens up more air channel space for better thermal performance. Finally, it has a sophisticated thermal design to support low-cost 55C optics.

On the software side, we continue to use FBOSS and OpenBMC, our own open-source networking stack and baseboard management implementations. To support the new chassis, we were able to leverage a lot of the work we did for 6-pack, our first-generation switch. Architecturally, we kept the no-supervisor model when building these bigger switches. The Backpack chassis is equivalent to a set of 12 Wedge 100 switches connected together.

We continue to use the BGP routing protocol to distribute routes between the different line cards in the chassis. To that end, we have fabric cards set up as route reflectors and line cards set up as route reflector clients. We are integrating the switch’s software into the existing BGP monitoring infrastructure at Facebook, where we log all BGP route updates to our centralized in-memory data store, Scuba.
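To make the internal route-reflection setup concrete, here is a minimal sketch of the iBGP peering it implies. The 8 line-card / 4 fabric-card split, the identifiers, and the session representation are assumptions for illustration, not a description of FBOSS internals.

```python
# Hypothetical sketch: fabric cards as route reflectors, line cards as clients.
# Card counts and names are illustrative assumptions.
from itertools import product

LINE_CARDS = [f"lc{i}" for i in range(1, 9)]    # route-reflector clients
FABRIC_CARDS = [f"fc{i}" for i in range(1, 5)]  # route reflectors

def build_ibgp_sessions(line_cards, fabric_cards):
    """Return one iBGP session per (fabric card, line card) pair.

    Because the fabric cards reflect routes, each line card only peers with
    the fabric cards instead of maintaining a full mesh with every other card.
    """
    return [
        {"reflector": fc, "client": lc, "rr_client": True}
        for fc, lc in product(fabric_cards, line_cards)
    ]

sessions = build_ibgp_sessions(LINE_CARDS, FABRIC_CARDS)
total_cards = len(LINE_CARDS) + len(FABRIC_CARDS)
full_mesh = total_cards * (total_cards - 1) // 2
print(f"{len(sessions)} route-reflector sessions vs. {full_mesh} for a full iBGP mesh")
```

With 12 switch elements in the chassis, route reflection keeps the number of internal BGP sessions well below what a full mesh between all cards would require.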
For our particular simultaneous update case, had we been able to zoom in, we would have seen a number of BGP withdraw messages where none were expected. Analogously, while our BGP monitor gave us a control plane view, we realized we were lacking the data plane view. To that end, we also built a FIB monitor that can now log FIB updates (see the sketch below). This ability to debug the deepest internals of FBOSS switches with sophisticated tools like Scuba has proven to be very effective.

With this distributed design, the control plane load is spread across all the cards rather than centralized on a single one. This allows us to move to different form factors without requiring serious per-chassis tweaks to the software. In addition, because the software remains largely unchanged across our different hardware platforms, any improvements we make to scale out for bigger switches flow down naturally to rack switches as well.
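As a rough illustration of the FIB monitoring idea mentioned above, here is a minimal sketch of a monitor that records route programming events so they can be lined up against the control plane view. The event schema and the JSON-lines sink are assumptions for illustration; the production pipeline described here logs into Scuba.

```python
# Hypothetical sketch of a FIB monitor: record every data plane route
# programming event with a timestamp so it can later be correlated with the
# control plane view from the BGP monitor. The schema and the JSON-lines
# sink are illustrative assumptions, not the actual Scuba pipeline.
import json
import time

class FibMonitor:
    def __init__(self, sink_path="fib_updates.jsonl"):
        self.sink_path = sink_path

    def log_update(self, action, prefix, next_hops):
        """Append one FIB update; action is 'add', 'change', or 'delete'."""
        event = {
            "ts": time.time(),
            "action": action,
            "prefix": prefix,
            "next_hops": sorted(next_hops),
        }
        with open(self.sink_path, "a") as sink:
            sink.write(json.dumps(event) + "\n")

# An unexpected withdraw would show up as a 'delete' event with no matching
# change in the intended routing state.
monitor = FibMonitor()
monitor.log_update("add", "10.0.0.0/24", ["fc1", "fc2", "fc3", "fc4"])
monitor.log_update("delete", "10.0.0.0/24", ["fc1", "fc2", "fc3", "fc4"])
```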
Our testing approach at Facebook is segmented into the following phases:

Engineering validation testing: This is the phase for early prototypes of the hardware. As part of this phase, we perform basic testing around the mechanical, thermal, power, and electrical engineering characteristics of the platform. We also run a basic snake test in which traffic passes through all ports of the modular chassis in a standalone state (sketched below).

Design validation testing: This is the phase where we perform comprehensive functional testing of the platform while it is integrated with other network devices. This testing typically takes place in a mini-fabric put together in our labs.

Production validation testing: This is the last stage before the hardware moves to mass production. In this phase, we place the platform in a test pod inside our production network and monitor its performance at real production scale.

When we ran 6-pack through these testing phases, we uncovered bugs and performance deficiencies. Because Backpack relies on the same software code base, the performance gains from fixing these issues in 6-pack carried over automatically.
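The snake test mentioned in the engineering validation phase can be pictured with the following minimal sketch, which chains front-panel ports into one long forwarding path. The port count, VLAN numbering, and wiring scheme are assumptions for illustration.

```python
# Hypothetical snake-test wiring: ports are paired per VLAN, and a loopback
# cable from the egress port of one pair to the ingress port of the next pair
# stitches the pairs into a single path through every port of the chassis.
# Port count and VLAN base are illustrative assumptions.
def snake_pairs(num_ports, base_vlan=100):
    """Pair consecutive ports and give each pair its own VLAN."""
    pairs = []
    for i in range(0, num_ports - 1, 2):
        pairs.append({
            "vlan": base_vlan + i // 2,
            "ingress_port": i,
            "egress_port": i + 1,
        })
    return pairs

# Assuming a fully loaded chassis with 128 front-panel 100G ports.
for hop in snake_pairs(128)[:3]:
    print(hop)
```

Traffic injected into the first port then has to traverse every port before it exits the last one, exercising the whole data plane of the standalone chassis.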
For example, removing the BGP next hop from the ECMP groups improved our previously slow reaction to failure conditions. We also switched to a new port striping method in which separate line cards were dedicated to rack switch connectivity and to spine switch connectivity (illustrated below). With this new configuration, traffic was load-balanced properly through the chassis whether it had to traverse the fabric spine or egress through another front-panel interface. Incorporating these fixes into Backpack from the start helped us achieve all our design goals within a year and gave us a head start on optimizing Backpack for a production environment.

The Backpack system-level testing took place in a testing pod at our Altoona Data Center. In the next phase, we replaced one fabric switch with a Backpack chassis in a production pod, and it integrated seamlessly with the production environment. Our plan going forward is to continue deploying Backpack in production pods hosting various frontend and backend services, to ensure that it can meet the demands of the wide range of applications served by our data centers.
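As a closing illustration of the port striping change referenced above, here is a minimal sketch in which whole line cards are dedicated to either rack switch or spine switch connectivity. The even split and the card names are assumptions for illustration.

```python
# Hypothetical port striping: dedicate entire line cards to one role instead
# of mixing rack-facing and spine-facing ports on the same card. The 8-card
# chassis and the even split are illustrative assumptions.
LINE_CARDS = [f"lc{i}" for i in range(1, 9)]

def stripe_by_role(line_cards):
    """Assign the first half of the line cards to downlinks, the rest to uplinks."""
    half = len(line_cards) // 2
    return {
        "rack_switch_facing": line_cards[:half],   # downlinks toward rack switches
        "spine_switch_facing": line_cards[half:],  # uplinks toward the fabric spine
    }

for role, cards in stripe_by_role(LINE_CARDS).items():
    print(role, cards)
```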