End-to-End Performance Monitoring: 3 Easy Steps to Jump-Start it Effectively

A Short-Cut for Putting a Wire Data Performance Monitoring Solution into Action

In our day-to-day practice as performance monitoring experts (also known as “relationship therapists”) for cloud services, applications, and networks, we have learned that putting a wire data solution into action for performance monitoring is a challenging, complex task. Most of the time, the primary concern is the lack of visibility into the complete application chain.

Fortunately, the low-hanging fruit when starting with end-to-end performance monitoring is rationalizing and prioritizing user complaints, which means complete visibility into the application chain is not required to get started.

This is because, in today’s hybrid, software-defined infrastructures, the largest grey area by far is the combination of security zones and the different types of WAN (Wide Area Network) connections related to them, including local internet connections and the security layers on top of them.

A typical hybrid infrastructure setup

As it happens, this large grey area is the most important source of wire data for analyzing the behavior of cloud services and applications as experienced by users. Moreover, identifying the appropriate data collection points is easy since the number of network devices involved in a WAN connection is very limited.

This allows us to simplify and standardize the solution deployment into 3 easy steps:

  1. Setting up data collection
  2. Modelling incoming wire data
  3. Validating configuration and initial results

Once these steps are completed, your teams have a clear understanding of the user experience for each of the cloud services and applications being monitored. This includes an understanding of which of the IT domains is causing the delays:

  1. Is it the end-user device, the network, a cloud service, or an application?
  2. If it is a combination of these, how much does each contribute, and where should you start improving things?

Opportunities for improvement: where to start improving things and which party to involve
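
To make the two questions above more concrete, here is a minimal Python sketch of how a per-domain delay breakdown could be ranked. The domain keys and the sample delay values are hypothetical and chosen only for illustration; a real wire data solution will expose its own metrics and field names.

```python
from collections import defaultdict

# Hypothetical per-transaction delay measurements in milliseconds,
# split by the IT domain that contributed the delay.
transactions = [
    {"end_user_device": 40, "network": 120, "cloud_service": 300, "application": 40},
    {"end_user_device": 60, "network": 80, "cloud_service": 200, "application": 160},
]


def delay_share_by_domain(rows):
    """Return (domain, share of total delay) pairs, largest share first."""
    totals = defaultdict(float)
    for row in rows:
        for domain, delay_ms in row.items():
            totals[domain] += delay_ms
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    shares = [(domain, total / grand_total) for domain, total in totals.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


ranking = delay_share_by_domain(transactions)
for domain, share in ranking:
    print(f"{domain}: {share:.0%}")
```

The domain at the top of the ranking is a reasonable place to start improving things, and it also indicates which team to involve first.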

Moreover, these teams will also benefit from a consistent troubleshooting workflow with a quick and predictable outcome when analyzing performance issues.

Combined, this helps your IT organization avoid time-consuming conversations about which party (or parties!) should be involved when improvements are needed (yesterday!).

Step 1 – Setting up Data Collection

Since this is a Wire Data setup, we need access to packets, either through port mirroring (also known as SPAN ports) or through taps (combined with aggregators).

Port mirroring is configured on a network device and copies packets to a specific port on that device.

A tap is a device connected in-line between 2 network devices. It forwards the traffic between those devices and sends a copy of all incoming data to one or more separate monitoring ports.

As a rule of thumb we recommend the following:

  • Taps are used for collecting data between 2 network devices such as routers, switches, load balancers and firewalls.
  • Port-spanning/port-mirroring is used for collecting data about the hosts connected to a network device and hosts running on a hypervisor.

A typical use case for taps is the need for a detailed performance breakdown of an application chain, including an analysis of the performance impact of network devices such as load balancers, firewalls and IDS/IPS systems.

Typical use cases for port mirroring are monitoring the user experience in offices and monitoring the performance between two or more virtual machines.

More information on the pros and cons of taps and port-spanning/port-mirroring can be found here and here.

Once this stage is completed, the monitoring solution is connected and data collection starts. The next step is modelling the incoming Wire Data.

Step 2 – Modelling Wire Data

Modelling Wire Data is all about translating the raw data into easy-to-understand dashboards and reports. Assigning data to groups representing applications, locations and hosts already covers 80% of the configuration work for improving problem resolution times. This is because:

  • Each location group represents a certain group of end-user devices.
  • Each host group represents a certain group of systems related to a security zone in a data center.
  • Each application group represents a certain group of systems and/or cloud services belonging to a certain application chain.

These 3 group types are the key elements in an improved workflow around analyzing and troubleshooting performance issues.
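
As an illustration of how the 3 group types could be applied to incoming wire data, the Python sketch below labels a single client/server flow by matching IP addresses against subnets. The group names, subnets and the `classify_flow` helper are hypothetical; your monitoring solution will have its own way of defining groups.

```python
import ipaddress

# Hypothetical group definitions: group name -> list of subnets.
LOCATION_GROUPS = {"Office A": ["10.10.0.0/16"], "Office B": ["10.20.0.0/16"]}
HOST_GROUPS = {"DMZ": ["192.168.100.0/24"], "Internal servers": ["172.16.0.0/16"]}
APPLICATION_GROUPS = {"CRM": ["172.16.5.0/24", "192.168.100.10/32"]}


def find_groups(ip, groups):
    """Return the names of all groups whose subnets contain the address."""
    address = ipaddress.ip_address(ip)
    return [
        name
        for name, subnets in groups.items()
        if any(address in ipaddress.ip_network(subnet) for subnet in subnets)
    ]


def classify_flow(client_ip, server_ip):
    """Label one flow with its location, host and application groups."""
    return {
        "location": find_groups(client_ip, LOCATION_GROUPS),
        "host": find_groups(server_ip, HOST_GROUPS),
        "application": find_groups(server_ip, APPLICATION_GROUPS),
    }


print(classify_flow("10.10.3.7", "172.16.5.20"))
```

A flow that matches no group comes back with empty lists, which is a simple way to spot traffic that still needs to be modelled.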

The remaining 20% is about jump-starting the plan-do-check-act cycle to embed this improved troubleshooting workflow in your current processes.

We recommend starting with 3 reports as common ground for making this happen:

  • A daily report covering day- and night-time (i.e. from 7 AM to 7 PM and from 7 PM to 7 AM).
  • A weekly report covering Monday 7 AM to the following Monday 7 AM.
  • A monthly report covering the 1st day of the month at 7 AM to the 1st day of the next month at 7 AM.
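
The report windows listed above can be computed as in the following Python sketch. The function names are hypothetical, and the sketch assumes local wall-clock time without time zone handling.

```python
from datetime import date, datetime, time, timedelta

SEVEN_AM = time(7, 0)


def daily_windows(day):
    """Day-time (7 AM to 7 PM) and night-time (7 PM to 7 AM next day) windows."""
    start = datetime.combine(day, SEVEN_AM)
    middle = start + timedelta(hours=12)
    end = start + timedelta(hours=24)
    return {"day": (start, middle), "night": (middle, end)}


def weekly_window(day):
    """Monday 7 AM of the week containing `day` to the following Monday 7 AM."""
    monday = day - timedelta(days=day.weekday())
    start = datetime.combine(monday, SEVEN_AM)
    return start, start + timedelta(days=7)


def monthly_window(day):
    """1st of the month 7 AM to the 1st of the next month 7 AM."""
    first = day.replace(day=1)
    # Adding 32 days always lands in the next month, whatever its length.
    next_first = (first + timedelta(days=32)).replace(day=1)
    return datetime.combine(first, SEVEN_AM), datetime.combine(next_first, SEVEN_AM)


example = date(2024, 5, 15)
print(daily_windows(example))
print(weekly_window(example))
print(monthly_window(example))
```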

The recommended content for each report includes (1) a high-level summary of the health of the most important, business-critical applications, (2) a breakdown into performance levels and (3) the details needed to start the analysis and troubleshooting process.

An example of a high-level summary of performance monitoring

The combination of 3 group types and 3 reports allows your teams to significantly reduce problem resolution times.

Step 3 – Validating the Configuration and the Initial Results

Once the Wire Data is modelled, we run a quick health-check of the monitoring system to make sure everything is working as expected:

  • The utilization of the data capture interfaces;
  • The number of packet drops (if any);
  • The CPU, memory and disk utilization of the monitoring solution.

We strongly recommend an end-to-end performance monitoring solution that gives you visibility into short- and long-term metrics for each of these 3 topics. The short-term metrics are especially important because, more often than not, microbursts of data are the root cause when a monitoring solution is struggling.
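
To illustrate why short-term metrics matter, the Python sketch below compares the long-term average of a capture interface with its short-term peak. The thresholds, the sample values and the `check_capture_interface` helper are assumptions made for this example only; they are not taken from any specific monitoring product.

```python
from statistics import mean

# Hypothetical thresholds; tune them for your own monitoring solution.
MAX_AVG_UTILIZATION = 70.0   # percent, long-term average
MAX_PEAK_UTILIZATION = 95.0  # percent, any single short-interval sample


def check_capture_interface(samples, packet_drops):
    """Summarize interface health from short-interval utilization samples (percent)."""
    average = mean(samples)
    peak = max(samples)
    findings = []
    if average > MAX_AVG_UTILIZATION:
        findings.append("sustained high utilization")
    if peak > MAX_PEAK_UTILIZATION and average <= MAX_AVG_UTILIZATION:
        findings.append("possible microburst: peak is high while the average looks healthy")
    if packet_drops > 0:
        findings.append(f"{packet_drops} packets dropped")
    return {"average": average, "peak": peak, "findings": findings}


# Ten short-interval samples: mostly idle, with one burst.
report = check_capture_interface([12, 10, 11, 99, 13, 10, 12, 11, 10, 12], packet_drops=42)
print(report)
```

In this sample the average utilization looks perfectly healthy, yet a single burst combined with packet drops shows that the monitoring solution may be missing data.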

A quick health-check of the monitoring system