I just read Pete Welcher's superb series on NSX, DFA, ACI, and other SDN stuff on the Chesapeake Netcraftsmen blog, and it helped me think more clearly about a problem that's been bothering me for a long time: how do we do realistically scalable packet capture in networks that make extensive use of ECMP and/or tunnels? Here's a sample network that Pete used:
Conventionally, we place packet capture devices at choke points in the network. But in medium-to-large data center designs, one of the main goals is to eliminate choke points: if we assume this is a relatively small standard ECMP leaf-spine design, each of the leaf switches has four equal-cost routed paths through the spine switches, and each spine switch has at least as many downlinks as there are leaf switches. The hypervisors each have two physical paths to the leaf switches, and in a high-density virtualization design we probably don't have a very good idea of what VM resides on what hypervisor at any point in time.
Now, add to that the tunneling features present in hypervisor-centric network virtualization schemes: traffic between two VMs attached to different hypervisors is tunneled inside VXLAN, GRE, or STT packets, depending on how you have things set up. The source and destination IP addresses of the "outer" hypervisor-to-hypervisor packet are not the sample as the "inner" VM-to-VM addresses, and presumably it's the latter that interest us. Thus, it's hard to even figure out which packets to capture. If we capture all of them (hard at any kind of scale; good 10G line-rate capture is still expensive and troublesome), we still have to filter for the inner tunnel headers to figure out what we're looking at.
What can we do? I see a few options:
- Put a huge mirror/tap switch between the leaf and spine. Gigamon makes some big ones, with up to 64 x 40GE ports or 256 x 10GE ports. When you max those out, start putting them at the end of each row. They advertise the ability to pop all kinds of different tunnel headers in hardware, along with lots of cool filtering and load-balancing capabilities.
- Buy or roll-your-own rack-mount packet capture appliances on commodity hardware, and run ad hoc SPAN sessions to them from the leaf switches.
- Install hypervisor-based packet capture VMs on all your hypervisors and capture from promiscuous mode vSwitches. There are lots of commercial solutions here, or you could roll your own. Update: Pete Welcher responded on Twitter and mentioned the option of doing packet capture pre-tunnel-encap or post-tunnel-decap. That's originally what I was thinking of with this option, but after reviewing a couple of his posts again, it appears there may be scenarios where the hypervisor makes tunnels to itself, so a better way of doing it might be to implement a packet capture API in the hypervisor itself that can control the point in the tunnel chain where the capture takes place. The next question is: where do we retrieve the capture? Does the API send the capture to a VM, save it on a datastore, dump it to a physical port, send it via another tunnel akin to ERSPAN? I'd want multiple options.
- Make sure Wireshark, tshark, or tcpdump is installed on every VM.
- Give up on intra-DC capture and focus only at the ingress/egress points.
However, the whole point of these designs is "SDN". What I *hope* is going to happen as SDN controllers start to become available is that the controller will be sufficiently aware of VM location that it can instruct the appropriate vSwitch OR leaf-switch OR spine switch to copy packets that meet a certain set of criteria to a particular destination port. Call it super-SPAN (can you tell I'm not headed for a new career in product naming?). It would be nice to be able to define the copied packets in different ways:
- Conventional L3/L4 5-tuple. This would be nice because it could be informed by NetFlow/IPFIX data, without the need for DPI on the flow-exporter.
- VM DNS name, port profile, or parent hypervisor.
- QoS class.
- "Application profile" -- it remains to be seen exactly what this means, but this is one of those SDN holy-grail things that allows more granular definition of traffic types.
Again, I know nothing about plans for this stuff from any vendor. But I hope the powers-that-be in the SDN/etc world are thinking about at least some of these kinds of capabilities.
And... about the time they get that all figured out, we'll have to be dealing with a bunch of that traffic being encrypted between VMs or hypervisors...