Why automate packet analysis for performance troubleshooting?

There are several ways to analyze network traffic. Flow records exported by network devices, which reflect traffic volumes, are one; packet analysis is another.

When it comes to performance troubleshooting and application transaction visibility, your sole option is to rely on packet data and packet analysis to understand the scope and the root causes of a performance degradation.

The main constraints that apply to packet analysis are:

  • the time required to manually analyze packet traces
  • the skills required
  • the need to locate, with sufficient precision, the packets that are worth analyzing; otherwise you will get nowhere. This means that you must already know:
    • which application is affected
    • which client is affected
    • precisely when the degradation happened
    • what the normal response time for that application is

If you do not meet these criteria, you will most likely waste your time analyzing packets.

Automating packet analysis (i.e., implementing wire data analysis to get your performance analytics) is an alternative.

1. Traffic volume

The size of networks, the bandwidth used by each application, and the number of applications in use are continuously growing. This represents a first challenge when you want to leverage network traffic to diagnose performance issues.

What’s the maximum size of the capture file you can analyze with a software sniffer?

Most IT professionals agree that you cannot load a file larger than 100MB within a reasonable processing time. Others say that even a few MB is already too much.

What does 100MB represent for your network?

  • 100MB = 800Mb
  • 800Mb / 10Gbps = 0.08 seconds, or 800Mb / 1Gbps = 0.8 seconds

Each time you load this type of file, you view only a short snapshot of your network activity: 0.8 seconds of traffic at 1Gbps and 0.08 seconds at 10Gbps.

Even though this is a short timeframe, if an application transaction represents a data exchange (query and response) of 50kB, you will still have collected 2000 transactions that will have to be analyzed manually.
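The arithmetic above can be reproduced with a minimal Python sketch. It assumes decimal units (1MB = 1,000,000 bytes) and a fully loaded link, as the figures in the text do; the function names are illustrative and do not come from any particular tool.

```python
def capture_window_seconds(file_bytes: float, link_bits_per_second: float) -> float:
    """Seconds of traffic a capture file holds on a fully loaded link."""
    return file_bytes * 8 / link_bits_per_second


def transactions_in_capture(file_bytes: int, bytes_per_transaction: int) -> int:
    """Number of transactions of a given size that fit in the capture file."""
    return file_bytes // bytes_per_transaction


MB = 1_000_000  # decimal megabyte, as used in the text
KB = 1_000      # decimal kilobyte

window_1g = capture_window_seconds(100 * MB, 1e9)    # 100MB file on a 1Gbps link
window_10g = capture_window_seconds(100 * MB, 10e9)  # same file on a 10Gbps link
transactions = transactions_in_capture(100 * MB, 50 * KB)

print(f"1Gbps: {window_1g:.2f}s, 10Gbps: {window_10g:.2f}s, transactions: {transactions}")
```

Changing the file size, link speed, or transaction size shows how quickly the number of transactions to review grows even for very short capture windows.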

If you cannot tell which of these transactions deserves your attention, and which are the normal ones you can use as a reference, you will probably not get any answers to your questions.

2. History and retention time

One of the essential characteristics of performance degradations is that they are intermittent, caused either by congestion on a given system or infrastructure device, or by an application flaw.

One of the key challenges is determining when the degradation occurred. This requires you:

  • to retain the network traffic for a sufficient period of time (each hour of packet capture on a fully loaded 1Gbps link requires roughly 450 to 500GB of storage; retaining 24 hours of that raw traffic requires about 11 to 12TB).
  • to first define the normal behavior and response time of the application (overall and for a given type of transaction), and then to identify when performance degraded.
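As a rough sizing aid, the storage requirement can be estimated as follows. This is a sketch under stated assumptions: decimal units, no capture-file overhead, and a hypothetical `utilization` parameter for links that are not fully loaded. For a fully loaded 1Gbps link the raw arithmetic gives 450GB per hour and 10.8TB per day, which is consistent with the rounded figures above.

```python
def capture_storage_bytes(link_bits_per_second: float,
                          hours: float,
                          utilization: float = 1.0) -> float:
    """Bytes needed to retain raw packets for the given duration.

    utilization is the average fraction of link capacity in use
    (an assumption of this sketch; 1.0 means a fully loaded link).
    """
    bytes_per_second = link_bits_per_second / 8
    return bytes_per_second * 3600 * hours * utilization


GB = 1e9
TB = 1e12

per_hour_gb = capture_storage_bytes(1e9, hours=1) / GB
per_day_tb = capture_storage_bytes(1e9, hours=24) / TB
per_day_tb_30pct = capture_storage_bytes(1e9, hours=24, utilization=0.3) / TB

print(f"{per_hour_gb:.0f}GB per hour, {per_day_tb:.1f}TB per day at full load")
print(f"{per_day_tb_30pct:.2f}TB per day at 30% average utilization")
```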

3. Performance overview

To identify which flows are worth analyzing in detail, you need to be able to delimit the degradation. For this, you need an overview of the performance of a given segment or application that lets you identify the scope of the degradation:

  • For which client
  • Connecting to which server
  • For which transactions (all, some and which ones)
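One way to picture such an overview is to compute, for each client, server, and transaction type, the share of transactions that exceed a response-time threshold. The sketch below is illustrative only: the `Transaction` record, the sample values, and the 500ms threshold are assumptions, not the output of any specific product.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Transaction(NamedTuple):
    client: str
    server: str
    name: str           # transaction type
    response_ms: float  # measured response time


def slow_ratio_by(transactions: Iterable[Transaction],
                  key: str,
                  threshold_ms: float) -> Dict[str, float]:
    """Share of slow transactions for each value of one dimension."""
    totals = defaultdict(int)
    slow = defaultdict(int)
    for t in transactions:
        value = getattr(t, key)
        totals[value] += 1
        if t.response_ms > threshold_ms:
            slow[value] += 1
    return {value: slow[value] / totals[value] for value in totals}


# Hypothetical sample data
sample = [
    Transaction("10.0.0.1", "db1", "login", 40),
    Transaction("10.0.0.1", "db1", "search", 900),
    Transaction("10.0.0.2", "db1", "search", 850),
    Transaction("10.0.0.2", "db2", "login", 35),
    Transaction("10.0.0.3", "db2", "search", 60),
]

for dimension in ("client", "server", "name"):
    print(dimension, slow_ratio_by(sample, dimension, threshold_ms=500))
```

In this made-up sample, the slow transactions are concentrated on one server and one transaction type, which is the kind of scoping the overview is meant to provide.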

4. Metric computation

To troubleshoot performance issues and identify the root cause of a degradation, you need a comprehensive set of metrics:

  • Network health metrics (latency, packet loss, retransmissions, and flow characteristics such as QoS settings and path)
  • TCP metrics (session setup metrics, TCP errors, server response times, query and response transfer times)
  • Common services metrics (DNS response times and success rates)
  • Application transaction metrics (processing time, query and response PDU transfer, error code, page load times for HTTP)
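To make some of the TCP metrics concrete, here is a simplified sketch that derives them from packet timestamps. It assumes a single conversation with one query and one response, already decoded into a hypothetical `Packet` record; a real implementation would also have to handle retransmissions, out-of-order packets, and multiple transactions per session.

```python
from typing import Dict, List, NamedTuple


class Packet(NamedTuple):
    ts: float          # capture timestamp in seconds
    from_client: bool  # True if the client sent the packet
    flags: str         # simplified TCP flags: "S", "SA", "A", "PA"
    payload_len: int   # TCP payload size in bytes


def tcp_metrics(packets: List[Packet]) -> Dict[str, float]:
    """Timing metrics for one query/response exchange (assumed complete)."""
    syn = next(p.ts for p in packets if p.from_client and p.flags == "S")
    syn_ack = next(p.ts for p in packets if not p.from_client and p.flags == "SA")
    query = [p.ts for p in packets if p.from_client and p.payload_len > 0]
    response = [p.ts for p in packets if not p.from_client and p.payload_len > 0]
    return {
        # SYN to SYN-ACK: approximates the round trip when captured near the client
        "session_setup_time": syn_ack - syn,
        "query_transfer_time": query[-1] - query[0],
        # last query packet to first response packet
        "server_response_time": response[0] - query[-1],
        "response_transfer_time": response[-1] - response[0],
    }


# Hypothetical conversation: handshake, one query packet, two response packets
conversation = [
    Packet(0.000, True, "S", 0),
    Packet(0.020, False, "SA", 0),
    Packet(0.021, True, "A", 0),
    Packet(0.030, True, "PA", 400),
    Packet(0.250, False, "PA", 1460),
    Packet(0.300, False, "PA", 900),
]

metrics = tcp_metrics(conversation)
for name, seconds in metrics.items():
    print(f"{name}: {seconds * 1000:.0f}ms")
```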

You certainly want to manipulate data based on these metrics on large traffic volumes, see their evolution over time, and then drill down on the flows that present an anomaly. For that, you cannot rely on manual (or close to manual) ex-post metric computation, which adds up to long processing times.
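Following a metric over time and spotting the anomaly can be sketched as below. The example groups response-time samples into one-minute buckets and flags the buckets whose median exceeds a multiple of the overall baseline. The bucket size, the factor of 2, the use of the median as baseline, and the sample data are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def bucket_medians(samples: Iterable[Tuple[float, float]],
                   bucket_seconds: int = 60) -> Dict[int, float]:
    """Median response time per time bucket; samples are (timestamp, response_ms)."""
    buckets = defaultdict(list)
    for ts, response_ms in samples:
        start = int(ts // bucket_seconds) * bucket_seconds
        buckets[start].append(response_ms)
    return {start: median(values) for start, values in sorted(buckets.items())}


def degraded_buckets(medians: Dict[int, float],
                     baseline_ms: float,
                     factor: float = 2.0) -> List[int]:
    """Start times of the buckets whose median exceeds factor x baseline."""
    return [start for start, value in medians.items() if value > factor * baseline_ms]


# Hypothetical samples: (seconds since start of capture, response time in ms)
samples = [(5, 40), (30, 50), (65, 45), (90, 55), (125, 300), (150, 280), (185, 48)]

medians = bucket_medians(samples)
baseline = median(medians.values())  # crude baseline: median of the bucket medians
print(degraded_buckets(medians, baseline))
```

Once the degraded time window is known, only the flows captured during that window need to be examined in detail.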

5. Application transaction visibility

You cannot troubleshoot degradations efficiently by looking at application performance metrics aggregated across all transactions. To get to the root cause and to provide actionable information to fix the issue, you need to be able to define the perimeter of the degradation at the individual transaction level.

This requires you to identify the slow transactions, but also to compare them to baseline information for that specific transaction type. For example, you would compare the performance of a specific SQL query to that of similar SQL queries.

Doing this requires that you can:

  • Identify similar queries in the large volume of traffic
  • Easily access the metrics for these transactions for comparison
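The two requirements above can be illustrated with a small sketch: queries are reduced to a "fingerprint" by replacing literal values with placeholders, so that similar queries fall into the same group, and each query is then compared to the median response time of its group. The regular expressions are a deliberately naive normalization, and the factor of 3 and the sample queries are assumptions for illustration.

```python
import re
from collections import defaultdict
from statistics import median
from typing import List, Tuple


def fingerprint(sql: str) -> str:
    """Collapse literals so queries differing only in values compare equal."""
    text = re.sub(r"'[^']*'", "?", sql)              # string literals
    text = re.sub(r"\b\d+(\.\d+)?\b", "?", text)     # numeric literals
    return re.sub(r"\s+", " ", text).strip().lower()


def slow_against_baseline(observations: List[Tuple[str, float]],
                          factor: float = 3.0) -> List[Tuple[str, float, float]]:
    """Queries slower than factor x the median of their own query type."""
    by_type = defaultdict(list)
    for sql, response_ms in observations:
        by_type[fingerprint(sql)].append(response_ms)
    baselines = {fp: median(times) for fp, times in by_type.items()}
    return [
        (sql, response_ms, baselines[fingerprint(sql)])
        for sql, response_ms in observations
        if response_ms > factor * baselines[fingerprint(sql)]
    ]


# Hypothetical observations: (query text, response time in ms)
observations = [
    ("SELECT * FROM orders WHERE id = 17", 12),
    ("SELECT * FROM orders WHERE id = 42", 14),
    ("select * from orders where id = 99", 480),
    ("SELECT name FROM users WHERE email = 'a@example.com'", 30),
    ("SELECT name FROM users WHERE email = 'b@example.com'", 32),
]

for sql, response_ms, baseline_ms in slow_against_baseline(observations):
    print(f"{response_ms}ms (baseline {baseline_ms}ms): {sql}")
```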

Although there is still value in using a packet analyzer and accessing the details of frames once you know exactly which frames correspond to the defective transactions, you need a new approach to packet analysis.

The essential requirement is automation, in order to:

  • cope with volumes
  • automate metric computation
  • provide an overview of the entire network and IT infrastructure
  • view application transactions