Virtual data center traffic capture and troubleshooting

Before discussing traffic capture in a virtual data center, let’s start with some clear facts:

  • According to Cisco, 76% of the overall network traffic is exchanged inside data centers
  • According to CIO Insight, over 85% of server processing power is now virtualized

The equation is very easy to understand: if you cannot run diagnostics on traffic exchanged between virtual machines (which does not always reach the physical wires) in a virtual data center, you will fail to find the root cause of your performance degradations!

What’s the issue with traffic capture in virtual data centers?

Virtualization is not just about hosting several machines on one physical host. It comes with a whole new set of practices that have changed the way data centers are run, and it brings along a good number of new challenges:

  • Dynamism: knowing where a server is at any point in time is now a challenge. Facilities such as vMotion provide increased flexibility and automatically move virtual servers depending upon where resources are available. The consequence is that, at any given time, you cannot predict where this or that virtual machine will be located. Tapping the traffic of a given virtual machine by extracting it from the ESX (physical) host through a physical NIC is no longer feasible. Identifying where to capture the traffic is therefore your first challenge.
  • Traffic capture mechanisms inside the virtual data center: we were used to relying upon TAP devices and port mirroring on network switches (SPAN) to get a copy of the traffic and analyze it. This is not as straightforward as it was in the physical world.

When considering ways to conduct troubleshooting inside a virtual data center, you have to address these two points.

Troubleshooting in a virtual data center is a challenge

What can you still do WITHOUT any traffic capture inside the virtual data center?

Well, you have some workarounds to keep doing things the good old way, but they have limits. As a result, you are:

  • Limited in scope (what you can see)
  • Limited to certain contexts (you have to match a certain number of criteria)

Here are the options that remain:

  • Capture traffic between clients and front servers on the core switches. This will work provided that:
    • Your VDI / Thin client architecture is not also hosted inside the virtual data center
    • You have no need to investigate performance issues beyond the front server
  • Keep routing between VLANs on the core switches and maintain the VLAN segmentation between the different application tiers. In this particular situation, you can capture the traffic between the VLANs on the physical network. This will work provided that:
    • You have no interest in the traffic inside a given VLAN

These workarounds obviously do not work if you need to conduct in-depth application performance troubleshooting.
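To make the limits of these workarounds concrete, here is a minimal Python sketch of which flows a capture on the core switches could see. It is a simplified model, not a description of any product: the `Endpoint` class and the `visible_on_core_switches` function are hypothetical names, and the model assumes that inter-VLAN routing happens on the core switches and that traffic between two virtual machines in the same VLAN never reaches them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    vlan: int
    virtual: bool  # True if hosted inside the virtual data center


def visible_on_core_switches(src: Endpoint, dst: Endpoint) -> bool:
    """Return True if a capture on the core switches would see this flow.

    Assumptions of this toy model: physical endpoints always reach the
    core switches, and VM-to-VM traffic only does so when it is routed
    between two different VLANs.
    """
    if not (src.virtual and dst.virtual):
        return True
    return src.vlan != dst.vlan


client = Endpoint("client-pc", vlan=10, virtual=False)
front = Endpoint("front-server", vlan=20, virtual=True)
app = Endpoint("app-server", vlan=30, virtual=True)
app2 = Endpoint("app-server-2", vlan=30, virtual=True)

print(visible_on_core_switches(client, front))  # True: client to front server
print(visible_on_core_switches(front, app))     # True: routed between VLANs
print(visible_on_core_switches(app, app2))      # False: stays inside one VLAN
```

The last case is the blind spot: traffic inside one application tier is invisible from the physical network in this model.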

What are your options for capturing traffic inside the virtual data center?

There are several options for capturing traffic between servers, depending upon the resources available (e.g., budget, flexibility, security guidelines, etc.). You have to consider two sets of issues to define how you are going to capture traffic inside your virtual data center:

Option 1: Promiscuous mode

When it comes to troubleshooting a precise problem, the oldest solution around is to use “promiscuous mode” to be able to visualize the traffic between several virtual machines within a virtual server farm.

“Promiscuous mode” consists of defining a group of virtual machines (called a port group in VMware environments) that are connected through the virtual switch in such a way that every packet sent by one VM is received by all VMs inside the port group. This is quite similar to having the virtual switch act like a hub for these virtual machines.

You can then either add to the group a virtual machine that runs the packet analysis software (e.g., Wireshark or SkyLIGHT PVX Virtual Capture Probe), or dedicate a physical network interface to the group so that either a virtual or an external physical device can capture and analyze the traffic to/from the machines in the group.
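The hub-like behavior described above can be illustrated with a toy model. This is a sketch only: `ToyVirtualSwitch` and its methods are hypothetical names and do not reflect how VMware or any other hypervisor implements its virtual switch. It simply shows that a probe VM in the port group receives a copy of the traffic once promiscuous mode is enabled for that group.

```python
from collections import defaultdict


class ToyVirtualSwitch:
    """Simplified virtual switch with per-port-group promiscuous mode."""

    def __init__(self):
        self.port_group = {}               # VM name -> port group name
        self.promiscuous_groups = set()    # port groups that behave like a hub
        self.received = defaultdict(list)  # VM name -> frames delivered to it

    def connect(self, vm, group):
        self.port_group[vm] = group

    def set_promiscuous(self, group, enabled=True):
        if enabled:
            self.promiscuous_groups.add(group)
        else:
            self.promiscuous_groups.discard(group)

    def send(self, src, dst, payload):
        group = self.port_group[src]
        frame = (src, dst, payload)
        for vm, vm_group in self.port_group.items():
            if vm == src or vm_group != group:
                continue  # never echo to the sender or leak to other groups
            if vm == dst or group in self.promiscuous_groups:
                self.received[vm].append(frame)


switch = ToyVirtualSwitch()
for vm in ("web", "db", "probe"):
    switch.connect(vm, "app-tier")

switch.send("web", "db", "query 1")  # normal mode: only "db" gets the frame
switch.set_promiscuous("app-tier")
switch.send("web", "db", "query 2")  # hub-like mode: "probe" gets a copy too

print(switch.received["probe"])  # [('web', 'db', 'query 2')]
```

Note that the probe misses everything sent before promiscuous mode was enabled, and that every other VM in the group also receives the copies.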

Option 2: Virtual TAPs

Some software vendors have issued software solutions that can interact with the virtualization infrastructure to copy the traffic without having to get it from the virtual switch. Good examples of these solutions are Phantom TAP from Ixia and Gigamon’s GigaVue. Although these solutions may sound attractive at first glance, you should consider the following facts:

  • Their implementation at the kernel level of the virtualization can be quite intrusive
  • They will send the traffic to an analysis device through a tunnel; you still need to make sure that carrying the copy of the traffic (either inside or outside the virtualization host) is doable without any risk or additional difficulty. What is considered here is the raw traffic, which means potentially a high volume of traffic.
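To get a feel for the volume involved, the back-of-the-envelope estimate below computes the bandwidth needed to carry a tunneled copy of the traffic. It is a rough sketch: the function name is hypothetical, and the default of 50 bytes of encapsulation overhead per packet is an assumed placeholder, not a figure from any vendor. Check the actual overhead of the tunnel your solution uses.

```python
def tunnel_bandwidth_mbps(mirrored_mbps, avg_packet_bytes, overhead_bytes=50):
    """Estimate the bandwidth needed to carry a tunneled copy of traffic.

    overhead_bytes is an ASSUMED per-packet encapsulation cost; the real
    value depends on the tunnel protocol in use.
    """
    if avg_packet_bytes <= 0:
        raise ValueError("avg_packet_bytes must be positive")
    # Each mirrored packet grows by the encapsulation overhead.
    return mirrored_mbps * (avg_packet_bytes + overhead_bytes) / avg_packet_bytes


# 800 Mbps of VM-to-VM traffic with 500-byte packets on average
print(tunnel_bandwidth_mbps(800, 500))  # 880.0
```

Smaller average packet sizes make the relative overhead larger, so chatty application traffic costs more to mirror than bulk transfers of the same volume.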

Option 3: Virtual switch with ER/R/SPAN capabilities

The native virtual switch (e.g., vSwitch for VMware, Open vSwitch for VirtualBox and Xen) can be replaced by a virtual switch providing advanced capabilities in terms of traffic capture (e.g., SPAN, RSPAN, ERSPAN). These advanced switches supporting such traffic capture capabilities are:

They can provide features such as SPAN (port mirroring), RSPAN (port mirroring to a physical port on another switch), ERSPAN (port mirroring with a destination to a specific IP address through a tunnel).
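The difference between the three mirroring modes comes down to where the copy is sent. The sketch below encodes that distinction as described above; `MirrorDestination` and `classify_mirror_session` are hypothetical names used for illustration, not part of any switch API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MirrorMode(Enum):
    SPAN = "SPAN"      # copy to a port on the same switch
    RSPAN = "RSPAN"    # copy to a physical port on another switch
    ERSPAN = "ERSPAN"  # copy tunneled to a specific IP address


@dataclass
class MirrorDestination:
    switch: Optional[str] = None      # switch hosting the destination port
    ip_address: Optional[str] = None  # set when the copy goes through a tunnel


def classify_mirror_session(source_switch: str, dest: MirrorDestination) -> MirrorMode:
    """Name the mirroring mode implied by a given destination."""
    if dest.ip_address is not None:
        return MirrorMode.ERSPAN
    if dest.switch is None:
        raise ValueError("destination needs a switch or an IP address")
    if dest.switch == source_switch:
        return MirrorMode.SPAN
    return MirrorMode.RSPAN


print(classify_mirror_session("vswitch-1", MirrorDestination(switch="vswitch-1")).value)
print(classify_mirror_session("vswitch-1", MirrorDestination(switch="core-sw-2")).value)
print(classify_mirror_session("vswitch-1", MirrorDestination(ip_address="10.0.0.5")).value)
```

The three calls print SPAN, RSPAN and ERSPAN respectively; the switch names and the IP address are made-up examples.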

Some of these features will have some prerequisites in terms of hardware or virtual appliance used to receive the ER/RSPAN, as well as in terms of server farm licenses (e.g., type of vSphere license – VDS is available from the VMware Enterprise Plus license).

Advantages & Drawbacks

Options for traffic capture in a virtual data center

Obviously, this comparison is based on the current capabilities of the different solutions, which are subject to future changes. We shall keep you posted if we become aware of new features.

Once you have defined how you can capture traffic inside your virtual data center, you should consider how you are going to analyze that traffic and transform it into operational analytics. There are several approaches for doing so; they are described in this second article: Best practices for performance troubleshooting in a virtual data center.