
By Will Moonen

Troubleshooting slow application performance: Packets vs. insight

Which is better for troubleshooting slow software-as-a-service (SaaS) application performance: packet crunching or number crunching?

Most enterprise IT professionals and network performance management (NPM) tool vendors consider it mandatory to have the original packets for analysis when troubleshooting slow application performance. That’s because visibility into network, TCP transport, and application performance is essential, especially when web applications, SaaS applications, and hybrid cloud infrastructure are involved.

While there are certainly cases where you need the original packets, there are also cases where sifting through large volumes of packets slows down problem isolation and lengthens resolution time.

That is why we contend, based on real customer scenarios, that the insights offered by automated “number crunching” wire-data analytics trump terabytes of stored packets.

Let’s examine further.

Achieving fast MTTR in complex IT environments: number vs. packet crunching

Handling slow cloud applications

Here’s a situation we encountered at one client that had a large number of branch offices. Users at many branch offices reported slow application performance, mostly around three cloud applications. From initial analysis, we discovered that the performance degradations occurred between 7 and 11 AM and recurred every three to five working days.

The network team indicated this was not a network issue. The branch location suffering the most from these degradations had a fully meshed 10 Gbps backbone with a redundant 1 Gbps Internet connection. Additionally, the Internet connection was running in active-standby mode with an average utilization of 10%.

A traditional packet-crunching approach would require a troubleshooting kit that stores packets for several days, consuming 8 TB of storage based on the average Internet connection utilization. We knew there would be a number of key challenges:

    1. The troubleshooting team would need to undertake a time-consuming cross-check of the three cloud applications with other applications to evaluate performance.
    2. Beyond the initial report noting performance degradations, there would be no expedited processes to analyze the packets by looking into the network, front-end server, application, or client.
    3. Even if the network remained suspect, there may be no process to eliminate any network dependency issues such as the impact of switching back and forth between two Internet connections.

In fact, if we took this approach to analyze network dependencies, the process would have required more than 16 TB of storage and possibly doubled the resolution time.
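To make the storage figures concrete, here is a rough back-of-the-envelope sketch. It assumes the 10% average utilization is sustained around the clock on the 1 Gbps link and ignores per-packet capture overhead; the function names are illustrative and do not come from any particular tool.

```python
SECONDS_PER_DAY = 86_400


def capture_storage_tb(link_gbps: float, utilization: float, days: float) -> float:
    """Decimal terabytes needed to store every packet seen on a link."""
    bits_per_second = link_gbps * 1e9 * utilization
    total_bytes = bits_per_second / 8 * days * SECONDS_PER_DAY
    return total_bytes / 1e12


def days_until_full(link_gbps: float, utilization: float, capacity_tb: float) -> float:
    """Days of full packet capture that fit into a given storage capacity."""
    return capacity_tb / capture_storage_tb(link_gbps, utilization, 1)


if __name__ == "__main__":
    per_day = capture_storage_tb(1, 0.10, 1)
    print(f"Full capture per day: {per_day:.2f} TB")
    print(f"Days covered by 8 TB: {days_until_full(1, 0.10, 8):.1f}")
```

Under these assumptions, 8 TB covers roughly a week of traffic on one connection, which is in line with the "several days" of retention described above.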

Analyzing the impact of firewalls, IDS/IPS and load balancers

From packet to number crunching: The better path to resolution

As highlighted above, a traditional packet-crunching analysis would be time-consuming and ineffective. That’s why the team took a number-cruncher approach as it would:

    • Generate metadata by tracking, analyzing, and storing all TCP flags and anomalies against sequence numbers
    • Perform real-time and historical analysis related to specific time slots, user locations, and applications
    • Expedite cross-checking capabilities leveraging all application and user metadata
    • Rule out network-related dependencies by leveraging the troubleshooting kit’s network ports
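The general idea behind the list above can be sketched as follows: instead of storing each packet, keep only per-minute counters keyed by time slot, user location, and application. This is a simplified illustration, not the product's actual implementation. The `Packet` fields, the `site` and `app` labels, and the "repeated sequence number means retransmission" heuristic are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

SYN, RST = 0x02, 0x04  # standard TCP flag bits


@dataclass(frozen=True)
class Packet:
    ts: float         # capture time in seconds
    site: str         # user location label (hypothetical)
    app: str          # application label (hypothetical)
    flow: tuple       # (src_ip, src_port, dst_ip, dst_port)
    seq: int          # TCP sequence number
    flags: int        # TCP flags byte
    payload_len: int  # TCP payload size in bytes


@dataclass
class MinuteStats:
    packets: int = 0
    payload_bytes: int = 0
    syn: int = 0
    rst: int = 0
    retransmissions: int = 0


def crunch(packets):
    """Reduce packets to per-minute counters; the packets themselves are discarded."""
    stats = defaultdict(MinuteStats)
    seen_segments = set()
    for p in packets:
        bucket = stats[(int(p.ts // 60), p.site, p.app)]
        bucket.packets += 1
        bucket.payload_bytes += p.payload_len
        if p.flags & SYN:
            bucket.syn += 1
        if p.flags & RST:
            bucket.rst += 1
        if p.payload_len > 0:
            segment = (p.flow, p.seq)
            if segment in seen_segments:
                # Simplified heuristic: same flow and sequence number seen before.
                bucket.retransmissions += 1
            else:
                seen_segments.add(segment)
    return dict(stats)
```

Because the output is keyed by minute, location, and application, questions such as "which sites saw retransmissions between 7 and 11 AM?" become lookups over small counters rather than searches through stored packets.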

Number crunching: the proof is in the numbers

Let’s return to the branch office scenario we talked about at the beginning of this blog. By using a solution with automated number-cruncher analysis on the problematic Internet connection, after three weeks, the team would have stored 21 days of per-minute layer 2-7 wire analytics in just 346 GB of disk space. This would be significantly less than the 8 TB of storage space potentially used by a packet-cruncher solution.
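As a rough illustration of what those two figures imply, the sketch below works out the daily metadata footprint and how long the same 8 TB budget would last if it held wire analytics instead of packets. It assumes the database grows linearly over time, which may not hold in practice; the function names are illustrative.

```python
def metadata_gb_per_day(total_gb: float, days: float) -> float:
    """Average daily growth of the analytics database, in GB."""
    return total_gb / days


def retention_days(budget_tb: float, gb_per_day: float) -> float:
    """Days of analytics that fit into a storage budget (decimal TB)."""
    return budget_tb * 1000 / gb_per_day


if __name__ == "__main__":
    daily = metadata_gb_per_day(346, 21)
    print(f"Analytics growth: {daily:.1f} GB per day")
    print(f"Retention in an 8 TB budget: {retention_days(8, daily):.0f} days")
```

Under the linear-growth assumption, the same storage that holds days of raw packets would hold well over a year of per-minute analytics.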

The database size after three weeks of number-crunching, automated wire-data analytics

Over this same three-week period, the system reported up to 4 Gbps of traffic (most of it classified as TLS). Through number-crunching, less than one minute was needed to process packets related to 307 million flows.

The power of number-crunching

Conclusion

From the scenario above, you can see how the number-cruncher processed hundreds of millions of packets in minutes, something not possible with a traditional packet-cruncher approach. For a first-line help desk, imagine the gains in productivity when trouble reports come in every day or every couple of days. Analysis time could be reduced significantly.

The Skylight experts have found, from years of experience, that you rarely need the real packets. Instead, by relying on number-cruncher analysis tools, help desk professionals can move away from manual or semi-automated packet-crunching to resolve performance issues. Because, as we have seen above, the numbers tell you all that you need to know.