By Boris Rogier August 31, 2016

Why retention time is critical to performance troubleshooting

Capturing network traffic to get performance analytics is the first step in troubleshooting. The second step is to store the information you need and keep it available over time. Depending on your environment, business context, organization and budget constraints, you may want to retain that data for a shorter or longer period; this period is called the retention time.

What is the retention time?

The retention time is the period for which you retain your performance analytics. Whether you keep a one-hour window or one year of history, the retention time defines how far back in time you can look.

Why is the retention time a critical criterion for troubleshooting?

There are three major reasons why a real-time (only) performance analytics solution is insufficient:

1. The feedback chain is generally slow: from end users to the troubleshooting team

The engineers who perform in-depth troubleshooting and leverage performance analytics are not usually in direct contact with end users. Most of the time, end users report their problems to a helpdesk team. The helpdesk performs an initial set of checks to qualify the problem and verify that the network and application are in good condition, and only then escalates to the level 2 engineers for deeper analysis.

This leads to a number of different delays:

End users do not report all the problems instantly
The helpdesk will need some time to qualify and report the issue
The engineers will not be immediately available and will have to arbitrate between competing priorities
Performance degradations (unlike software bugs) generally cannot be reproduced on demand, so the original event has to be analysed after the fact

In the end, it may take hours or days before an in-depth diagnosis really starts.

2. Intermittent degradations

Performance degradations would be so simple to analyse if only they:

Happened all the time in a stable and repeatable way
Had only a single possible cause
Applied to all the application transactions in a homogeneous way

Well, unfortunately this is definitely NOT how it works…

Indeed, by nature, performance degradations:

Are intermittent, occurring at different times of the day and the week and possibly in different circumstances
Have numerous potential causes (and often more than one at a time)
Apply only to certain types of transactions or users

This means that if you do not retain performance analytics and have a purely instant view of performance, you may simply not see the degradation itself.

3. Baselining: defining what is normal

This is one of the most important questions when performing diagnostics: “What is the normal behavior of that user / application?”

This translates into the following investigations:

What is the normal network or application usage?
What is the usual response time for this application overall? For this transaction? For this user?

Without a baseline, labelling the recent transactions with the longest response times as “slow transactions” is simply wrong: those response times may be perfectly normal for that transaction or user, so the result may be a complete misinterpretation.

This means that if a short retention time leaves you without enough data history to define normal behavior (volume and response times for a given user, application or transaction) and to compare it with a wide set of user activities, you cannot draw any serious conclusion.
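As a rough illustration of baselining, the sketch below summarizes a retained history of response times and only flags a new measurement when it exceeds the historical 95th percentile. The function names, the sample values and the choice of the 95th percentile as a threshold are all assumptions made for this example, not a method prescribed by any particular product.

```python
from statistics import median, quantiles

def build_baseline(history_ms):
    """Summarize historical response times (ms) as a median and a 95th percentile."""
    # quantiles(..., n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(history_ms, n=20)[-1]
    return {"median": median(history_ms), "p95": p95}

def is_abnormal(response_ms, baseline):
    """Flag a response time only if it exceeds the historical 95th percentile."""
    return response_ms > baseline["p95"]

# Hypothetical retained history (ms) for one transaction type.
history = [100, 110, 95, 105, 120, 98, 102, 115, 108, 101,
           99, 104, 111, 97, 103, 109, 106, 112, 100, 107]

baseline = build_baseline(history)
print(baseline["median"])          # 104.5
print(is_abnormal(101, baseline))  # False: within the usual range
print(is_abnormal(400, baseline))  # True: far above anything in the history
```

The shorter the retained history, the fewer samples feed the baseline and the less trustworthy the threshold becomes, which is the point made above.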

Working out how much storage a given retention time requires is not as simple as it seems! Retention time and storage certainly have a close relationship, but that relationship is also strongly driven by what kind of data is going to be retained.

1. Packet data vs Performance analytics

If you are using a standard raw packet storage solution, the relationship between retention time and storage is very simple:

Retention Time = Storage / Average Throughput (with both expressed in consistent units, e.g. storage in bits and throughput in bits per second)

As an example, if you want to retain 7 days of history for a small volume of traffic (say a sustained average of 500 Mbps, or about 5.4 TB per day), you will need roughly 38 TB of storage. With this older approach, the data is merely stored: the calculation says nothing about how fast you can access the appropriate information.
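The relationship can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a sustained average throughput, decimal units (1 TB = 10^12 bytes) and no compression, deduplication or capture overhead; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400

def required_storage_tb(throughput_mbps, days):
    """Raw packet storage (decimal TB) needed to keep `days` of traffic."""
    bytes_per_second = throughput_mbps * 1_000_000 / 8  # megabits/s -> bytes/s
    return bytes_per_second * days * SECONDS_PER_DAY / 1e12

def retention_time_days(storage_tb, throughput_mbps):
    """Retention time (days) that a given storage volume allows."""
    bytes_per_second = throughput_mbps * 1_000_000 / 8
    return storage_tb * 1e12 / bytes_per_second / SECONDS_PER_DAY

print(round(required_storage_tb(500, 7), 1))     # 37.8
print(round(retention_time_days(37.8, 500), 1))  # 7.0
```

Real deployments will deviate from this figure depending on traffic peaks, packet slicing and storage format, so treat the result as an order of magnitude.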

To get to the appropriate information faster, instead of keeping full packet details, you will want to calculate performance analytics (e.g., detailed indicators for performance, network usage, TCP behaviors, transaction details, etc.). If you store these metrics directly, you will get two big advantages:

Storage: the required storage space will be smaller, so for an equal volume of storage the retention time will be much longer
Information retrieval: it will also be much faster to get to the correct information, as well as to build dashboards / KPIs and trigger alerts for proactive monitoring
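To make the storage difference concrete, here is a simplified sketch that collapses per-packet records into one summary record per conversation. The record layout, field names and sample addresses are hypothetical, and real performance analytics would carry many more indicators (TCP health, response times, transaction details).

```python
# Hypothetical packet records: (timestamp_s, source, destination, size_bytes)
packets = [
    (0.00, "10.0.0.1", "10.0.0.9", 1500),
    (0.02, "10.0.0.1", "10.0.0.9", 1500),
    (0.05, "10.0.0.2", "10.0.0.9", 400),
    (0.30, "10.0.0.1", "10.0.0.9", 600),
]

def summarize(packets):
    """Collapse per-packet records into one record per (source, destination) pair."""
    summary = {}
    for ts, src, dst, size in packets:
        rec = summary.setdefault(
            (src, dst),
            {"packets": 0, "bytes": 0, "first_seen": ts, "last_seen": ts},
        )
        rec["packets"] += 1
        rec["bytes"] += size
        rec["first_seen"] = min(rec["first_seen"], ts)
        rec["last_seen"] = max(rec["last_seen"], ts)
    return summary

conversations = summarize(packets)
print(len(packets), "packets ->", len(conversations), "conversation records")
```

Only the summary records need to be retained, and because they are already keyed by conversation, looking one up does not require scanning the raw packets again.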

2. Granularity

When evaluating performance analytics solutions, you have to consider the level of granularity available, and whether that level of detail fits your troubleshooting requirements:

Conversation details: can you go back to every single conversation/session?
Transaction details: can you isolate each and every application transaction?
Filtering: how precisely can you filter on these conversations?
Performance indicators available: how complete is your set of indicators for the network, TCP health, and application performance?

How does this impact your troubleshooting strategy?

Before investing in any kind of network-based troubleshooting capability, you have to ask yourself these questions:

How long should I retain troubleshooting data?
Do I really have to keep all packet details? Is there any compliance obligation or not?
What are the performance indicators I need to run a proper diagnostic?
How quickly can I isolate slow transactions, whatever the cause?
How easily can I compare a given activity with a standard behavior?

To make the most of network traffic when troubleshooting performance degradations and monitoring end-user response times, you need to take a new approach to how you gather and analyze information from your network traffic. Skylight is one example of such an approach.
