Data Center Networking @ Facebook by David Swafford

I’ve been looking into data center fabrics and how to handle the scale of large networks lately, so today I took the time to go through the full presentation (video and PDF) that David Swafford gave at NANOG 59 late last year.

I met David Swafford when Facebook came to town for MPLS 2013. He was a really cool guy. Even then, I was inspired by hearing how they go about supporting their networks. Very smart!

I took away a lot of nuggets from watching it. Here are a few:

  • Assume we can’t trust any rack
  • We can’t trust networking boxes either
  • Backbone devices are powerful in the wrong ways for a data center. They can handle many routes but don’t have the desired port density.
  • Going from 2 large leaf switches to many smaller leaf switches allows you to move from 1+1 to N+1.
  • Beware of silent failures by complex networking devices. They are hard to detect, BTW.
  • Automating ToR switch upgrades and handing a “push-button” interface to the service owners helped to remove the roadblocks for full upgrades of ToR switches. (I found it analogous to app upgrades on my phone)
  • They even scripted many parts of the process, such as determining who the on-call is for a given group at a given time. Fascinating.
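The 1+1 versus N+1 point in the list above is easy to see with a little arithmetic. This is my own back-of-the-envelope sketch, not something from the talk: it assumes all switches in a tier are the same size and that traffic is spread evenly across them, and the function name is made up for illustration.

```python
def capacity_after_one_failure(num_switches: int) -> float:
    """Fraction of the tier's capacity that remains after one switch fails.

    Assumes equal-size switches and evenly spread traffic (my assumption).
    """
    if num_switches < 1:
        raise ValueError("need at least one switch")
    return (num_switches - 1) / num_switches


if __name__ == "__main__":
    # Compare a pair of big switches with progressively wider tiers.
    for n in (2, 4, 8, 16):
        remaining = capacity_after_one_failure(n)
        print(f"{n:>2} switches: {remaining:.1%} of capacity left after one failure")
```

With two big switches, a single failure takes out half of the capacity, so each box has to be sized to carry everything on its own. With many smaller switches, one failure only costs a small slice, which is the whole appeal of N+1.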

Monitor all the things:

  • interface statistics and state
  • BGP statistics and state
  • FIBs
  • TCP retransmits

Respond to your Alerts with Automation:

  • FBAR stands for Facebook Auto-Remediation
  • Receive the alert, log in to the device, verify that the problem still exists, then either ignore it or remedy it.
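The alert-handling flow above can be sketched in a few lines of Python. To be clear, this is not FBAR's code or API, which the talk does not show in this form. The `Alert` class, `handle_alert`, and the two callbacks are hypothetical names of my own; in real life the callbacks would log in to the device to re-check and to apply the fix.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alert:
    device: str  # hypothetical: name of the device the alert fired for
    check: str   # hypothetical: what was detected, e.g. "interface down"


def handle_alert(
    alert: Alert,
    still_failing: Callable[[Alert], bool],
    remedy: Callable[[Alert], None],
) -> str:
    """Re-verify the alert on the device, then either ignore it or fix it."""
    if not still_failing(alert):
        # The problem cleared on its own, so there is nothing to do.
        return "ignored"
    remedy(alert)
    return "remediated"


if __name__ == "__main__":
    # Stub callbacks stand in for logging in to a real device.
    fixed = []
    alert = Alert(device="tor-switch-01", check="interface down")
    print(handle_alert(alert, lambda a: False, fixed.append))  # ignored
    print(handle_alert(alert, lambda a: True, fixed.append))   # remediated
```

Passing the verification and remedy steps in as functions keeps the decision logic tiny and lets you test it without touching a real switch.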

He also shares a lot of thoughts on engineers who automate:

  • Spend less time doing repetitive tasks
  • Spend more time solving interesting problems or learning

His final challenge: What would you do if you weren’t afraid?