
By Boris Rogier

Why is retransmission representative of network congestion?

How does network congestion impact user experience?

Network congestion impacts the user experience

Context

This is a story about network congestion and its impact on user experience. We were recently visiting a customer who had just installed an APS unit in its data center.

The customer’s infrastructure was fairly simple:

  • they had all their production servers in the data center
  • their users were located either in the headquarters (where the data center is located) or in one of their nearly 100 remote sites; these sites were connected to the headquarters through an MPLS network
  • they are part of a larger group, whose data center is connected to the same MPLS cloud, that provides access to some central services like DNS, mail (Lotus Notes), and access to a secured Internet gateway.

Key concepts

Retransmission

Packets resent after having been lost or damaged. Retransmissions are identified by their TCP sequence and acknowledgement numbers, as well as their checksum values. Only packets with a non-null payload are checked.
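The detection logic above can be sketched as follows. This is a simplified illustration, not the APS implementation: the packet fields (`flow`, `seq`, `payload_len`) are hypothetical names, and a real analyzer would also handle sequence-number wraparound and partially overlapping segments.

```python
def detect_retransmissions(packets):
    """Flag packets whose payload repeats an already-seen TCP
    sequence range within the same flow.

    Each packet is a dict with illustrative keys:
      flow        -- an identifier for the TCP conversation
      seq         -- TCP sequence number
      payload_len -- payload size in bytes
    Pure ACKs (empty payload) are skipped, matching the definition
    above that only packets with a non-null payload are checked.
    """
    seen = {}           # flow -> set of (seq_start, seq_end) ranges observed
    retransmitted = []
    for pkt in packets:
        if pkt["payload_len"] == 0:          # ignore empty-payload packets
            continue
        rng = (pkt["seq"], pkt["seq"] + pkt["payload_len"])
        ranges = seen.setdefault(pkt["flow"], set())
        if rng in ranges:                    # same bytes sent again
            retransmitted.append(pkt)
        else:
            ranges.add(rng)
    return retransmitted
```

For example, a flow that sends the segment starting at sequence number 1 twice would have that second copy flagged as a retransmission.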

Retransmission Delay

RD stands for Retransmission Delay. RD is defined as the time elapsed between the original transmission of a packet and its last retransmission.

Retransmission Rate

RR stands for Retransmission Rate. RR is defined as the ratio of retransmitted packets to the total number of packets in a conversation.
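Putting the two definitions together, RD and RR can be computed from a captured conversation roughly like this. Again a sketch under the same assumptions as above (illustrative packet fields `ts`, `seq`, `payload_len`; a repeated sequence range is treated as a retransmission):

```python
def retransmission_metrics(packets):
    """Compute RR and per-segment RD for one TCP conversation.

    Illustrative packet keys:
      ts          -- capture timestamp, in seconds
      seq         -- TCP sequence number
      payload_len -- payload size in bytes
    """
    first_seen = {}   # (seq_start, seq_end) -> timestamp of first transmission
    last_retx = {}    # (seq_start, seq_end) -> timestamp of last retransmission
    total = retx = 0
    for pkt in packets:
        if pkt["payload_len"] == 0:          # only non-null payloads count
            continue
        total += 1
        key = (pkt["seq"], pkt["seq"] + pkt["payload_len"])
        if key in first_seen:
            retx += 1
            last_retx[key] = pkt["ts"]
        else:
            first_seen[key] = pkt["ts"]
    # RD: time between the original packet and its last retransmission
    rd = {k: last_retx[k] - first_seen[k] for k in last_retx}
    # RR: ratio of retransmitted packets to all payload packets
    rr = retx / total if total else 0.0
    return rr, rd
```

A conversation of three payload packets in which one is sent twice would thus have RR = 1/3, and the RD for that segment would be the gap between its two transmissions.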

(See our series of articles on TCP performance and analysis.)

What we observed

The customer’s network manager initially referred to a complaint from an end-user regarding slow access from their remote site to an application located in the data center.

Looking at the application performance chart, we saw significant Retransmission Delay (from server to client) and no other change apart from a slight increase in the DTT (Data Transfer Time) values for the application in question.

We extended the scope of our investigation by looking at the network performance chart for all applications (i.e., for clients located in the remote sites and servers in the data center). The retransmission delay was high regardless of which application was being used, and the RD occurred mostly in the server-to-client direction.

I hypothesized that there might be congestion between the data center and the remote site in the DC → remote site direction. So we looked at the APS bandwidth graph for the traffic between the data center and the remote site. We observed peak traffic of 1.2Mbps, mainly due to Windows SUS—which told me a lot about how little control they had over their network and system administration traffic flows.

The customer found the 1.2Mbps figure interesting, although at first glance not enough to convince him that there was network congestion: the bandwidth available on the remote-site end was 2Mbps, the maximum bandwidth available on the DC end was 80Mbps, and only a very low proportion of that maximum was in use.

So we decided to look at the SNMP graphs provided by their telecom operator for the remote site router… and the bandwidth graph showed a flat line at 2Mbps for 30 minutes on the incoming traffic: the remote-site link had been saturated at its full capacity, confirming the congestion.

Conclusion on the meaning of retransmissions

Retransmissions are significant, and you should examine them to determine:

  • whether they are intermittent or continuous
  • their perimeter (which client zone(s) they affect; whether they occur for all servers or just one)

What they tell us, in the end, is that some packets are not reaching the other host, or that acknowledgment packets are not getting back to the sender.

The direction of the retransmission (server → client or client → server) may not be so significant, because network congestion in one direction can induce retransmissions in both. For example, congestion from server to client generates retransmissions from server to client (the packets sent by the server are not acknowledged fast enough, so the server retransmits them), but also from client to server (the client's packets reach the server quickly, yet the server's acknowledgments suffer from the congestion on the way back, so the client retransmits the original packets). Keep in mind as well that the balance of retransmissions between client → server and server → client also depends on the balance of traffic in the two directions.

Conclusion on the impact of the retransmission delay on the User Experience

Retransmission Delay is a good indicator of the impact of network congestion on users' Quality of Experience (QoE), in two ways:

  1. The Retransmission Delay in one direction grows with the quantity of data sent in that direction; this value corresponds to the additional time a user must wait to receive all the data.
  2. Retransmission has a secondary consequence on the time required to receive data: after a retransmission timeout, the sender resets its congestion window to its minimum size. Each time this happens, the throughput drops back to a very low level and then increases again progressively—the mechanism known as TCP Slow Start. If retransmissions are frequent, the throughput keeps falling back to this minimum and never reaches an optimal level. This translates into a much larger Data Transfer Time, because the throughput available to transfer the application response remains very low.
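The second effect can be illustrated with a toy model of the sender's congestion window. This is a deliberately simplified sketch (round-based, timeout-only loss, made-up parameter values), not a faithful TCP implementation, but it shows how frequent retransmissions keep the window—and hence the throughput—pinned near its minimum:

```python
def cwnd_trace(rounds, loss_rounds, mss=1, ssthresh_init=64):
    """Toy congestion-window model, in segments per round.

    Slow start doubles cwnd each round up to ssthresh, then
    congestion avoidance adds one segment per round. A retransmission
    timeout (any round listed in `loss_rounds`) halves ssthresh and
    resets cwnd to its minimum, forcing a new slow start.
    """
    cwnd, ssthresh = mss, ssthresh_init
    trace = []
    for r in range(rounds):
        trace.append(cwnd)
        if r in loss_rounds:                 # retransmission timeout
            ssthresh = max(cwnd // 2, 2)
            cwnd = mss                       # back to minimum: slow start again
        elif cwnd < ssthresh:
            cwnd *= 2                        # slow start: exponential growth
        else:
            cwnd += 1                        # congestion avoidance: linear growth
    return trace
```

Without losses, `cwnd_trace(10, set())` climbs to 64 segments and keeps growing; with a timeout every few rounds, such as `cwnd_trace(10, {3, 6, 9})`, the window never gets past 8 segments—exactly the "throughput keeps falling back to a minimum" behavior described above.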