Loopback Mountain: March 2014

Here's something that's been on my radar lately: while all the talk in the networking world seems to be about the so-called "massively scalable" data center, almost all of the people I talk to in my world are dealing with the fact that data centers are rapidly getting smaller due to virtualization efficiencies. This seems to be the rule rather than the exception for small-to-medium sized enterprises.

In the micro data center that sits down the hall from me, for example, we've gone from 26 physical servers to 18 in the last few months, and we're scheduled to lose several more as older hypervisor hosts get replaced with newer, denser models. I suspect we'll eventually stabilize at around a dozen physical servers hosting in the low hundreds of VMs. We could get much denser, but things like political boundaries inevitably step in to keep the count higher than it might be otherwise. The case is similar in our other main facility.

From a networking perspective, this is interesting: I've heard vendor and VAR account managers remark lately that virtualization is cutting into their hardware sales. I'm most familiar with Cisco's offerings, and at least right now they don't seem to be looking at the micro-DC market as a niche: high-port count switches are basically all that are available. Cisco's design guide for the small-to-medium data center starts in the 100-300 10GE port range, which with modern virtualization hosts will support quite a few thousand typical enterprise VMs.

Having purchased the bulk of our older-generation servers before 10GE was cheap, we're just getting started with 10GE to the server access layer. Realistically, within a year or so a pair of redundant, reasonably feature-rich 24-32 port 10GE switches will be all we need for server access, probably in 10GBASE-T. Today, my best Cisco option seems to be the Nexus 9300 series, but it still has a lot of ports I'll never use.

One thought I've had is to standardize on the Catalyst 4500-X for all DC, campus core, and campus distribution use. With VSS, the topologies are simple. The space, power, and cooling requirements are minimal, and the redundancy is there. It has good layer 3 capabilities, along with excellent SPAN and NetFlow support. The only thing it seems to be lacking today is an upgrade path to 40GE, but that may be acceptable in low-port-density environments. Having one platform to manage would be nice. The drawbacks, of course, are a higher per-port cost and lack of scalability -- but again, scalability isn't really the problem.

Comments welcome.

(With the usual caveats that I am just a hick from Colorado, I don't know what I'm talking about, etc.)

I just read Pete Welcher's superb series on NSX, DFA, ACI, and other SDN stuff on the Chesapeake Netcraftsmen blog, and it helped me think more clearly about a problem that's been bothering me for a long time: how do we do realistically scalable packet capture in networks that make extensive use of ECMP and/or tunnels? Here's a sample network that Pete used:

Conventionally, we place packet capture devices at choke points in the network. But in medium-to-large data center designs, one of the main goals is to eliminate choke points: if we assume this is a relatively small standard ECMP leaf-spine design, each of the leaf switches has four equal-cost routed paths through the spine switches, and each spine switch has at least as many downlinks as there are leaf switches. The hypervisors each have two physical paths to the leaf switches, and in a high-density virtualization design we probably don't have a very good idea of what VM resides on what hypervisor at any point in time.

Now, add to that the tunneling features present in hypervisor-centric network virtualization schemes: traffic between two VMs attached to different hypervisors is tunneled inside VXLAN, GRE, or STT packets, depending on how you have things set up. The source and destination IP addresses of the "outer" hypervisor-to-hypervisor packet are not the sample as the "inner" VM-to-VM addresses, and presumably it's the latter that interest us. Thus, it's hard to even figure out which packets to capture. If we capture all of them (hard at any kind of scale; good 10G line-rate capture is still expensive and troublesome), we still have to filter for the inner tunnel headers to figure out what we're looking at.

What can we do? I see a few options:

Put a huge mirror/tap switch between the leaf and spine. Gigamon makes some big ones, with up to 64 x 40GE ports or 256 x 10GE ports. When you max those out, start putting them at the end of each row. They advertise the ability to pop all kinds of different tunnel headers in hardware, along with lots of cool filtering and load-balancing capabilities.
Buy or roll-your-own rack-mount packet capture appliances on commodity hardware, and run ad hoc SPAN sessions to them from the leaf switches.
Install hypervisor-based packet capture VMs on all your hypervisors and capture from promiscuous mode vSwitches. There are lots of commercial solutions here, or you could roll your own. Update: Pete Welcher responded on Twitter and mentioned the option of doing packet capture pre-tunnel-encap or post-tunnel-decap. That's originally what I was thinking of with this option, but after reviewing a couple of his posts again, it appears there may be scenarios where the hypervisor makes tunnels to itself, so a better way of doing it might be to implement a packet capture API in the hypervisor itself that can control the point in the tunnel chain where the capture takes place. The next question is: where do we retrieve the capture? Does the API send the capture to a VM, save it on a datastore, dump it to a physical port, send it via another tunnel akin to ERSPAN? I'd want multiple options.
Make sure Wireshark, tshark, or tcpdump is installed on every VM.
Give up on intra-DC capture and focus only at the ingress/egress points.

Option 1 is the only one that really confronts both the network diversity and tunnel encapsulation head-on. Those boxes and their administrative overhead don't come cheap, but today this is probably the most fiddle-free option. The other options require a lot of customization and manual intervention that may or may not interfere with change-control procedures, and don't provide obvious solutions for de-obfuscating the tunneled traffic. Option 2 also suffers from serious scalability problems in ECMP designs. Option 5 just avoids the problem, but might work for some people.

However, the whole point of these designs is "SDN". What I *hope* is going to happen as SDN controllers start to become available is that the controller will be sufficiently aware of VM location that it can instruct the appropriate vSwitch OR leaf-switch OR spine switch to copy packets that meet a certain set of criteria to a particular destination port. Call it super-SPAN (can you tell I'm not headed for a new career in product naming?). It would be nice to be able to define the copied packets in different ways:

Conventional L3/L4 5-tuple. This would be nice because it could be informed by NetFlow/IPFIX data, without the need for DPI on the flow-exporter.
VM DNS name, port profile, or parent hypervisor.
QoS class.
"Application profile" -- it remains to be seen exactly what this means, but this is one of those SDN holy-grail things that allows more granular definition of traffic types.

Finally, it would be nice if the controller was smart enough to be able to load-balance the copied packets when necessary, so that the same capture target sees both sides of a given flow.

Again, I know nothing about plans for this stuff from any vendor. But I hope the powers-that-be in the SDN/etc world are thinking about at least some of these kinds of capabilities.

And... about the time they get that all figured out, we'll have to be dealing with a bunch of that traffic being encrypted between VMs or hypervisors...

Loopback Mountain

Friday, March 21, 2014

Quick Thoughts on the Micro Data Center

Wednesday, March 5, 2014

Packet Capture in Diverse / Tunneled Networks?