By Thierry Notermans

DNS series #3 : troubleshooting DNS protocol errors

This is the third and final article of a series (article 1, article 2) covering some important aspects of the DNS protocol for troubleshooting application performance issues.

Many applications perform name resolution through DNS queries. As explained in the first article, this provides much more flexibility than working with fixed IP addresses. The drawback of adding this step to an end-to-end application chain is that it can have a disastrous impact on performance when it does not work properly, for example because of DNS protocol errors.

Troubleshooting DNS Performance

The result of a DNS query can fall into one of the following categories:

  • DNS request is successful
  • DNS request has been received by the DNS server but has not been processed properly and the client gets an error message
  • The client does not receive an answer from the DNS server
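To make these three outcomes concrete, here is a minimal Python sketch that sends a single A-record query over UDP and sorts the result into one of the categories above. It assumes plain UDP, IPv4 and no retransmission. `build_query`, `classify` and `resolve_once` are hypothetical helper names, and this is not how SkyLIGHT PVX measures DNS traffic.

```python
import random
import socket
import struct
from typing import Optional


def build_query(name: str, qid: int, qtype: int = 1) -> bytes:
    """Build a minimal RFC 1035 query for `name` (qtype 1 = A record)."""
    # Header: ID, flags (only RD, "recursion desired", set), QDCOUNT=1, other counts 0
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length; a zero byte ends the name
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN


def classify(response: Optional[bytes]) -> str:
    """Map a query outcome onto the three categories: success, error, no answer."""
    if response is None:
        return "no answer"
    flags = struct.unpack("!H", response[2:4])[0]
    rcode = flags & 0x000F  # low 4 bits of the flags field
    return "success" if rcode == 0 else f"error (rcode {rcode})"


def resolve_once(server: str, name: str, timeout: float = 2.0, port: int = 53) -> str:
    """Send one query and classify the outcome; a timeout counts as "no answer"."""
    qid = random.randint(0, 0xFFFF)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qid), (server, port))
        try:
            data, _ = sock.recvfrom(512)  # classic UDP DNS messages are at most 512 bytes
        except socket.timeout:
            return classify(None)
    if len(data) < 12 or struct.unpack("!H", data[:2])[0] != qid:
        return "no answer"  # simplification: a malformed or foreign packet is not a reply
    return classify(data)
```

For instance, `resolve_once("172.16.1.12", "example.com")` would query the local DNS server used in the examples below and return `"success"`, an `"error (rcode N)"` string, or `"no answer"`.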

When troubleshooting DNS performance issues, it’s important to quickly assess DNS query processes:

  • Are the DNS processes successful? Always? For each user and request name?
  • Do successful DNS processes perform well enough compared to the baselines?
  • Which DNS processes are unsuccessful and why? No answer? Specific problems encountered?

Successful but Non-Optimal DNS Processes

Let’s take a simple example. In enterprise environments, each client’s TCP/IP stack is normally configured to use a local DNS server. In complex multi-site architectures, however, it’s not uncommon for clients to use remote DNS servers. Instead of staying on the local network, each DNS request then traverses the whole Wide Area Network (WAN), which adds latency.

Bypassing the local DNS server by using external DNS services is also often encountered in production environments. This can impact overall network and application performance, as the client does not benefit from features of the local DNS server such as caching.
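One way to spot both situations in exported flow or log data is to compare each DNS server a client talks to against the list of servers it is supposed to use. The sketch below assumes a hypothetical list of (client, server) IP pairs and a hypothetical set of approved local servers. It uses Python's standard `ipaddress` module to tell public resolvers apart from unapproved internal ones, such as a remote site's server reached across the WAN.

```python
from ipaddress import ip_address


def flag_server_usage(transactions, approved_servers):
    """Return (client, server, reason) for DNS traffic that bypasses the approved servers.

    transactions: iterable of (client_ip, server_ip) strings (hypothetical input format).
    approved_servers: the DNS servers clients are expected to use.
    """
    approved = {ip_address(s) for s in approved_servers}
    findings = []
    for client, server in transactions:
        srv = ip_address(server)
        if srv in approved:
            continue
        # is_global is True for publicly routable addresses such as public resolvers
        reason = "external DNS server" if srv.is_global else "unapproved internal DNS server"
        findings.append((client, server, reason))
    return findings
```

Passing `["172.16.1.12"]` as the approved list would flag a client querying `8.8.8.8` as using an external server, and a client querying another private address as using an unapproved internal one.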

With SkyLIGHT PVX, troubleshooting DNS performance is as simple as viewing the global DNS performance dashboard:

Global DNS performance dashboard

In this particular case, you notice a DNS performance problem at 14:04.

By clicking on this peak, you can then see which client has been impacted as well as the DNS resolution processes that were involved:

DNS resolution processes

In this case, the client was 172.16.8.58, requesting the name “baXXXXXes.com” from the local DNS server 172.16.1.12. This request was successful (Response Code: “NoError”), but the entire DNS resolution process took 15.7 seconds! By looking at the corresponding network-related KPIs (click the “L3” icon on the left), you can quickly correlate network performance with this particular DNS request and determine whether or not the network was to blame.
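Correlating the DNS duration with network KPIs boils down to asking how much of the total time the network round trip explains. The following is a deliberately rough heuristic, not SkyLIGHT PVX's logic. `attribute_delay` and its 50% threshold are assumptions, and the sketch ignores packet loss and client retransmissions, which can also inflate resolution time.

```python
def attribute_delay(dns_duration_s: float, network_rtt_s: float, share: float = 0.5):
    """Split a DNS transaction time into a network part and a remainder.

    The remainder is time spent on the server side (processing, upstream
    recursion, forwarding). Heuristic: if the measured round-trip time accounts
    for at least `share` of the total, point at the network first.
    """
    remainder_s = max(dns_duration_s - network_rtt_s, 0.0)
    if network_rtt_s >= share * dns_duration_s:
        suspect = "network"
    else:
        suspect = "DNS server or upstream resolution"
    return remainder_s, suspect
```

With a hypothetical 4 ms round trip, the 15.7 s request above would leave about 15.696 s unexplained by the network, pointing at the DNS server or its upstream resolution.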

As mentioned before, it’s also important to check whether the DNS servers are used appropriately (i.e., no external DNS servers and the proper performance of local DNS servers).

The “Top DNS Servers” view provides this information concisely:

Top DNS servers

As you can see here, the DNS server at 172.16.1.12 seems to have some performance problems: its performance is not stable, as shown by a huge deviation bar and an average DNS process time of 2.5 seconds. Furthermore, 4 of the 37 DNS requests were not successful. This is something to look into when further troubleshooting DNS performance.
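The figures in this view (request count, failed requests, average and spread of the processing time) are straightforward aggregates. Here is a minimal sketch, assuming hypothetical per-transaction records of (server, duration in seconds, response code), with `None` in both fields when no answer came back:

```python
from collections import defaultdict
from statistics import mean, pstdev


def server_summary(records):
    """Aggregate per-server DNS statistics.

    records: iterable of (server_ip, duration_s, rcode); duration_s and rcode
    are None when the query got no answer (hypothetical record layout).
    """
    by_server = defaultdict(list)
    for server, duration, rcode in records:
        by_server[server].append((duration, rcode))

    summary = {}
    for server, rows in by_server.items():
        durations = [d for d, rc in rows if rc is not None]  # answered requests only
        summary[server] = {
            "requests": len(rows),
            "failed": sum(1 for _, rc in rows if rc != 0),  # errors and unanswered
            "avg_s": mean(durations) if durations else None,
            "stdev_s": pstdev(durations) if durations else None,
        }
    return summary
```

A large `stdev_s` relative to `avg_s` corresponds to the unstable behaviour visible as a wide deviation bar.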

Unsuccessful DNS Process: Error Response Codes

In a DNS response packet, the response code is carried in the RCODE bits of the header flags field, as shown in the Wireshark trace below:

Wireshark trace
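The response code occupies the lowest four bits of the 16-bit flags field in the 12-byte DNS header (RFC 1035, section 4.1.1). The sketch below decodes that field into the parts Wireshark displays. The function name is hypothetical, and the input is assumed to be a raw DNS message (the UDP payload, without IP or UDP headers).

```python
import struct


def decode_flags(message: bytes) -> dict:
    """Split the DNS header flags field into its individual parts."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 bytes long")
    qid, flags = struct.unpack("!HH", message[:4])
    return {
        "id": qid,
        "qr": flags >> 15,              # 0 = query, 1 = response
        "opcode": (flags >> 11) & 0xF,  # 0 = standard query
        "aa": (flags >> 10) & 1,        # authoritative answer
        "tc": (flags >> 9) & 1,         # truncated
        "rd": (flags >> 8) & 1,         # recursion desired
        "ra": (flags >> 7) & 1,         # recursion available
        "rcode": flags & 0xF,           # response code
    }
```

For example, a response whose flags field is `0x8183` decodes to `qr=1`, `rd=1`, `ra=1` and `rcode=3`, i.e. a name error.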

There are many different error codes supported by DNS.

SkyLIGHT PVX identifies these codes and lets you filter on them if you want to focus on one or more DNS response messages in particular:

Filtering DNS response codes

From the perspective of troubleshooting DNS performance, the main DNS response codes we are interested in are the following:

  • Code 0: No error
  • Code 1: Format error – query cannot be interpreted
  • Code 2: Server failure – process impossible
  • Code 3: Name error – domain name does not exist
  • Code 4: Not implemented – query type not supported by the DNS server
  • Code 5: Refused – name server refuses due to policy

If everything goes fine, the response code should be “0” and the related network performance should be good (i.e., the whole DNS request/response process should not take more than a few milliseconds, depending on your IT infrastructure).

All the other codes listed indicate a DNS problem.
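Taken together, the list of codes and the rule above give a simple per-transaction verdict. In this sketch, the mnemonics follow the IANA DNS RCODE registry, while the 50 ms baseline and the `assess` function are assumptions to adapt to your own infrastructure.

```python
from typing import Optional

RCODE_NAMES = {
    0: "NoError",   # no error
    1: "FormErr",   # query cannot be interpreted
    2: "ServFail",  # server failure
    3: "NXDomain",  # domain name does not exist
    4: "NotImp",    # query type not implemented
    5: "Refused",   # refused due to policy
}


def assess(rcode: Optional[int], duration_ms: Optional[float], baseline_ms: float = 50.0) -> str:
    """Classify one DNS transaction as ok, slow, an error, or unanswered."""
    if rcode is None:
        return "no answer"
    if rcode != 0:
        return "error: " + RCODE_NAMES.get(rcode, f"RCODE {rcode}")
    if duration_ms is not None and duration_ms > baseline_ms:
        return "slow"
    return "ok"
```

Under these assumptions, the 15.7 s “NoError” request from the earlier example would be reported as `"slow"` rather than `"ok"`.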

So, the first step in troubleshooting DNS-related problems is to get an overview of all response codes and the corresponding performance.

With SkyLIGHT PVX, this is achieved by visualizing the DNS Requests Overview dashboard:

DNS Requests Overview dashboard

On the first line of this dashboard, you can see that 411 DNS requests for IPv4 addresses (Request Type “A”, as described in the first article of this series) did not succeed. In this case, the problem is that the requested name simply does not exist (Response Code = 3).
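This overview essentially groups transactions by request type and response code. Here is a minimal sketch of the same grouping, assuming hypothetical records of (request type, response code, duration in milliseconds):

```python
from collections import defaultdict
from statistics import mean


def overview(records):
    """Group DNS transactions by (request type, response code).

    records: iterable of (qtype, rcode, duration_ms) (hypothetical layout).
    Returns {(qtype, rcode): (count, average duration in ms)}.
    """
    durations = defaultdict(list)
    for qtype, rcode, duration_ms in records:
        durations[(qtype, rcode)].append(duration_ms)
    return {key: (len(values), mean(values)) for key, values in sorted(durations.items())}
```

A large count under `("A", 3)` would correspond to the failed IPv4 lookups described above.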

The next logical step is to check which FQDNs were requested:

Fully Qualified Domain Names

As you can see, the FQDN “ntp.labo.securactive.lan”, among others, does not exist. You can also view all the clients that requested this FQDN in order to further troubleshoot and solve the problem.
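Finding which clients ask for names that do not exist is again a grouping problem. This sketch assumes hypothetical (client, FQDN, response code) records. It normalises names to lower case without a trailing dot, since DNS names are case-insensitive.

```python
from collections import defaultdict


def nxdomain_clients(records):
    """Map each non-existent FQDN (rcode 3) to the sorted list of clients requesting it.

    records: iterable of (client_ip, fqdn, rcode) (hypothetical layout).
    """
    result = defaultdict(set)
    for client, fqdn, rcode in records:
        if rcode == 3:  # NXDomain: name does not exist
            result[fqdn.lower().rstrip(".")].add(client)
    return {name: sorted(clients) for name, clients in result.items()}
```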

DNS Queries Without Responses

DNS queries that never receive a response should be clearly identified for further analysis. They can be a sign of various issues, such as:

  • Misconfigured TCP/IP stacks on endpoints pointing to a non-existent or decommissioned DNS server
  • DNS server overload
  • Network-related problems
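Detecting unanswered queries in captured traffic means pairing each query with its response. DNS matches the two through the 16-bit transaction ID in the header, so a minimal sketch can key on (client, server, transaction ID) and accept only responses arriving within a timeout. The record layout and the 2-second timeout are assumptions. A production tool would also have to handle ID reuse and retransmissions.

```python
from collections import defaultdict


def find_unanswered(queries, responses, timeout_s=2.0):
    """Return the queries that got no matching response within timeout_s.

    Both inputs are iterables of (timestamp_s, client_ip, server_ip, txid);
    responses are assumed to be normalised so client_ip is the original requester.
    """
    response_times = defaultdict(list)
    for ts, client, server, txid in responses:
        response_times[(client, server, txid)].append(ts)

    missing = []
    for ts, client, server, txid in queries:
        times = response_times.get((client, server, txid), [])
        if not any(ts <= r <= ts + timeout_s for r in times):
            missing.append((ts, client, server, txid))
    return missing
```

Counting the returned queries per server (for example with `collections.Counter`) then shows whether they all point at the same DNS server.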

With SkyLIGHT PVX, you can easily identify such DNS requests by looking at the DNS Requests Overview dashboard:

DNS Requests Overview dashboard

We can see from this dashboard that 5 requests for IPv4 addresses did not receive an answer. Which one(s) exactly?

Well, it’s quite easy to determine with SkyLIGHT PVX. Just click on the “+” sign on the left to get the answer:

IPv4 requests that didn’t receive an answer

All these failed requests have been sent to the same DNS server at 172.16.1.12.

Perhaps this server is down. How can we check this? Again, it’s quite simple: filter on this DNS server’s IP address and request all DNS processes that occurred during the chosen timeframe:

DNS server IP addresses

As you can see, this DNS server seems to be up and running, since it answered some other requests correctly.
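This check (filter on one server and one timeframe, then see what it actually answered) can be sketched as follows. The (timestamp, server, response code) record layout and the helper names are hypothetical. A server that answered anything at all in the window is at least reachable and running.

```python
from collections import Counter


def server_activity(records, server, start_s, end_s):
    """Count outcomes for one DNS server within [start_s, end_s].

    records: iterable of (timestamp_s, server_ip, rcode); rcode is None when
    the query got no answer (hypothetical layout).
    """
    outcomes = Counter()
    for ts, srv, rcode in records:
        if srv == server and start_s <= ts <= end_s:
            outcomes["no answer" if rcode is None else f"rcode {rcode}"] += 1
    return outcomes


def looks_alive(outcomes):
    """True if the server answered at least one request, whatever the response code."""
    return sum(n for key, n in outcomes.items() if key != "no answer") > 0
```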

The troubleshooting process can continue by looking at network-related KPIs and so on.

Conclusions

DNS is an important process in today’s business applications. Bad or non-optimal name resolution processes can have a dramatic impact on overall application performance. SkyLIGHT PVX can decode DNS transactions at layer 7, providing you with the ability to quickly analyze and troubleshoot complex DNS-related performance issues.