
Troubleshooting TCP Performance With SaaS and Cloud Applications

Troubleshooting TCP performance in complex IT environments that integrate SaaS and cloud-hosted applications can be quite challenging.

SaaS and cloud-hosted applications often degrade because of unhealthy TCP relationships (sessions) between clients and servers in physical, SaaS, and cloud infrastructure. The way TCP sessions are set up and torn down directly impacts SaaS and cloud performance and the user experience, especially when there is reason to believe that hosts are overloaded and messages are being dropped. A persistent increase in the number of TCP zero window (0-Win) events and duplicate acknowledgements (DupAck) is typically a good indicator that end users are suffering from degraded performance. With a structured approach, detecting and resolving poor TCP/IP performance that impacts SaaS and cloud applications is straightforward. It leads to quick resolution of network, server, and application degradations, and keeps dysfunctional relationships from ruining your users’ day, and your own.
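To make the two indicators concrete, the following is a minimal sketch of how 0-Win and DupAck events could be counted from an ordered packet capture. The Segment record and the count_events function are hypothetical names introduced for illustration; they are not part of any monitoring product. The duplicate-ACK test is a simplification: it treats a payload-free ACK that repeats the previous ACK number and window in the same direction as a duplicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical, simplified view of one captured TCP segment."""
    src: str           # sender as "ip:port"
    dst: str           # receiver as "ip:port"
    ack: int           # acknowledgement number
    window: int        # advertised receive window in bytes
    payload_len: int   # bytes of TCP payload carried
    flags: frozenset = frozenset({"ACK"})


def count_events(segments):
    """Return (zero_window_count, dup_ack_count) for an ordered capture."""
    zero_win = 0
    dup_ack = 0
    last_pure_ack = {}  # (src, dst) -> (ack, window) of the previous pure ACK
    for seg in segments:
        # RST segments commonly carry window 0, so they are not counted here.
        if seg.window == 0 and "RST" not in seg.flags:
            zero_win += 1
        direction = (seg.src, seg.dst)
        is_pure_ack = seg.flags == frozenset({"ACK"}) and seg.payload_len == 0
        if is_pure_ack:
            if last_pure_ack.get(direction) == (seg.ack, seg.window):
                dup_ack += 1
            last_pure_ack[direction] = (seg.ack, seg.window)
        else:
            # Simplification: any data or control segment ends a duplicate run.
            last_pure_ack.pop(direction, None)
    return zero_win, dup_ack


if __name__ == "__main__":
    client, server = "192.0.2.10:50000", "203.0.113.10:443"
    capture = [
        Segment(client, server, ack=100, window=1000, payload_len=0),
        Segment(client, server, ack=100, window=1000, payload_len=0),
        Segment(client, server, ack=100, window=1000, payload_len=0),
        Segment(server, client, ack=50, window=0, payload_len=0),
    ]
    print(count_events(capture))
```

Running such a count per time interval and comparing intervals is one way to spot the persistent increase described above, rather than reacting to a single burst.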

Troubleshooting TCP Performance: 6 Easy Steps to Find the Root Cause

Finding the root cause of TCP performance issues impacting SaaS and cloud applications can be challenging and time-consuming. The following six steps speed up this process and offer actionable pointers toward the “low-hanging fruit” when you are looking for ways to mitigate TCP performance issues and improve the SaaS and cloud application user experience.

Step 1. Start by ruling out an overloaded client or server by looking at the number of 0-Win events. A host advertises a zero window when its receive buffer is full, which usually means it cannot process incoming data fast enough. If these events are coming in rapidly, you may want to involve the respective desktop or system administrator(s) and have a look at the workload on these hosts.

Step 2. If the number of 0-Win events is close to zero, then most likely the TCP transmission problem is somewhere on the network path between the client and server side. If both are within the same subnet, it should be fairly easy to figure out where the delays and/or drops are coming from. A quick look at the MAC tables from the connected network devices should tell you which devices and interfaces are involved.

Step 3. If the client and server side are not within the same subnet, it means that one or more routers (or something similar) are involved. Start by finding the intermediate subnets, devices, and interfaces by looking at the MAC addresses and routing tables of the designated gateway on the client and server side. This should tell you which other routers and interfaces are actively involved in sending and receiving messages.

Step 4. If it turns out that both MAC addresses point to the same routing device, then most likely that device has too many things to do besides routing messages. For example, the device may actually be a firewall with (too?) many policies. Perhaps it is a load balancer running CPU-intensive tasks such as intrusion detection and prevention (IDS/IPS), SSL offloading, or data compression. This is probably a good time to involve the system administrator of these devices.

Step 5. However, if the MAC addresses point to different routing devices, then most likely one or more WAN connections are involved in accessing the cloud or SaaS applications. If these connections are redundant, check the load-sharing algorithm on the routers. Modern IP routers and switches support packet-based load sharing. While this is a very effective way of sharing load, it may have some unexpected side effects. Because packets from the same session can take different network paths, they may arrive out of order, which requires additional processing time on the hosts and can trigger duplicate acknowledgements.

Step 6. Once you have an understanding of the devices and interfaces between the client and server side, start looking at things like CPU and memory utilization, frame drops, CRC errors, buffer overflows, and interface utilization. These are good indicators for figuring out what could have caused packet drops and, in turn, the additional delays due to retransmissions.
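The decision flow in Steps 1 through 5 can be sketched as a small triage function. This is only an illustration under stated assumptions: the Observation record, the triage function, the returned labels, and the zero_window_threshold default of 5 are all hypothetical, and the sketch assumes both hosts use the same prefix length. What counts as "close to zero" 0-Win events depends on your environment.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical inputs gathered manually or from a monitoring tool."""
    zero_window_events: int
    client_ip: str
    server_ip: str
    prefix_len: int           # assumed identical for client and server
    client_gateway_mac: str   # MAC of the client's designated gateway
    server_gateway_mac: str   # MAC of the server's designated gateway


def same_subnet(ip_a, ip_b, prefix_len):
    """True when both addresses fall inside the same network."""
    net_a = ipaddress.ip_interface(f"{ip_a}/{prefix_len}").network
    net_b = ipaddress.ip_interface(f"{ip_b}/{prefix_len}").network
    return net_a == net_b


def triage(obs, zero_window_threshold=5):
    """Map an observation to the area that Steps 1-5 suggest checking first."""
    # Step 1: many 0-Win events point at an overloaded client or server.
    if obs.zero_window_events > zero_window_threshold:
        return "host"
    # Step 2: same subnet, so inspect MAC tables of the connected devices.
    if same_subnet(obs.client_ip, obs.server_ip, obs.prefix_len):
        return "local-segment"
    # Steps 3-4: both gateways resolve to one device, which may be overloaded.
    if obs.client_gateway_mac.lower() == obs.server_gateway_mac.lower():
        return "shared-router"
    # Step 5: different routing devices, so WAN links and load sharing.
    return "wan-path"


if __name__ == "__main__":
    obs = Observation(0, "10.0.0.5", "10.0.1.9", 24,
                      "AA:BB:CC:00:00:01", "aa:bb:cc:00:00:02")
    print(triage(obs))
```

Whatever label comes back, Step 6 still applies: check CPU, memory, drops, CRC errors, and utilization on the devices and interfaces the label points to.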

How an N/APM Monitoring Solution Helps in Troubleshooting TCP Performance

When you need to perform these steps regularly, consider deploying a wire data analytics monitoring solution. Typically, the topology capabilities of such solutions automate device discovery between two hosts by translating the contents of MAC and routing tables into a topology map. They can also automate the analysis and reporting of TCP metrics for each session (SYN, SYN-ACK, RST, 0-WIN, and more), which allows you to isolate problems quickly without having to perform manual packet analysis.
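As a rough idea of what per-session reporting involves, the sketch below tallies SYN, SYN-ACK, RST, and 0-WIN events for each session. It is not how any particular N/APM product works; the tuple layout (src, dst, flags, window) and the names session_key and tally_sessions are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def session_key(src, dst):
    """Direction-independent key so both sides land in the same session."""
    return tuple(sorted((src, dst)))


def tally_sessions(packets):
    """Count TCP events per session.

    packets: iterable of (src, dst, flags, window) tuples, where src and dst
    are "ip:port" strings and flags is a set of flag names.
    """
    stats = defaultdict(Counter)
    for src, dst, flags, window in packets:
        counters = stats[session_key(src, dst)]
        if "SYN" in flags and "ACK" in flags:
            counters["SYN-ACK"] += 1
        elif "SYN" in flags:
            counters["SYN"] += 1
        if "RST" in flags:
            counters["RST"] += 1
        elif window == 0:
            # Window 0 on a non-RST segment is treated as a zero window event.
            counters["0-WIN"] += 1
    return stats


if __name__ == "__main__":
    client, server = "192.0.2.10:50000", "203.0.113.10:443"
    packets = [
        (client, server, {"SYN"}, 64240),
        (server, client, {"SYN", "ACK"}, 65535),
        (client, server, {"ACK"}, 0),
        (server, client, {"RST"}, 0),
    ]
    for key, counters in tally_sessions(packets).items():
        print(key, dict(counters))
```

Sessions with many RST or 0-WIN events relative to their SYN count are natural candidates for the six-step investigation.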

Figure: Screenshots from the SkyLIGHT PVX N/APM solution displaying TCP behavior for two cloud-hosted email applications