Network troubleshooting in distributed Wi-Fi environments
Wi-Fi is usually installed after everything else in the network is already set up – switches, routers, servers, firewalls, VPNs, and so on. Naturally, customers rely on their Wi-Fi solution provider to resolve any network problems that arise during Wi-Fi deployments, even though the problems are not necessarily Wi-Fi specific.
Network issues are nothing new in any project. However, troubleshooting becomes challenging when it must be done remotely and when there isn’t much onsite IT help, which is often the case with distributed Wi-Fi deployments. Also, because the network infrastructure in the distributed vertical is so heterogeneous, some very stealthy network problems turn up. Consider two recent troubleshooting examples that underscore these points.
Number Jumble in DNS Forwarding
In a recent distributed Wi-Fi install that we were involved in, the AP would fail to connect to the cloud. After eliminating a few obvious possibilities, we found that the AP was failing to perform DNS resolution. The customer was using Google DNS, so we put the DNS request packet from the AP under the microscope and found it to be properly formatted. We did not have access to many parts of the customer’s network to view logs. Then one of us made an interesting observation: when the AP pinged something on the Internet, the AP would briefly succeed in performing DNS resolution. Pursuing that lead, we found that when there was no ping, the DNS request used source port 1024 and failed; but when there was a concurrent ping, it used a source port higher than 1024 (since 1024 was taken up by the ping) and succeeded.
That meant some network element in the packet path was blocking DNS requests with source port 1024. The DNS protocol places no such restriction on the source port number. Linux implementations have traditionally reserved low ports for processes with kernel (root) privileges, and some network element in the packet path seemed to have incorrectly adapted that convention into a packet filter rule on an interface. Even then, the implementation glossed over a detail: Linux privileged ports are those below 1024, not including 1024 itself. Patching the AP software to use only source ports above 1024 in DNS requests solved the problem (though we, and even the customer, still do not know what was blocking the original DNS requests).
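The workaround boils down to never letting the resolver bind a source port at or below 1024. A minimal Python sketch of the idea (the function names and the hard-coded Google DNS server are illustrative, not the AP’s actual code):

```python
import random
import socket
import struct

def pick_dns_source_port() -> int:
    """Pick an ephemeral UDP source port, skipping everything at or
    below 1024 -- the range the misbehaving middlebox apparently
    treated as reserved."""
    return random.randint(1025, 65535)

def build_dns_query(hostname: str, txid: int = 0x1234) -> bytes:
    """Build a minimal DNS query: A record, recursion desired."""
    # Header: ID, flags (RD set), QDCOUNT=1, AN/NS/AR counts = 0
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode() for label in hostname.split(".")
    ) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def send_dns_query(hostname: str, server: str = "8.8.8.8") -> bytes:
    """Send the query from an explicitly chosen source port > 1024."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", pick_dns_source_port()))  # never source port <= 1024
    sock.settimeout(3.0)
    try:
        sock.sendto(build_dns_query(hostname), (server, 53))
        data, _ = sock.recvfrom(512)
        return data
    finally:
        sock.close()
```

In production code the `bind()` would be retried on a collision with a port already in use; the key point is simply constraining the random range.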
Secret Trouble in RADIUS
In another case, we got a panic call from a customer saying that adding an AP to their network had disrupted the network. Everyone’s first thought was that some malformed packet coming out of the AP was upsetting a switch or router. We examined several packets from the AP and found no rogue packet. The investigation then moved to the network servers, starting with DHCP and DNS, which we found to be in good health. Then came the turn of the RADIUS server, and that is where unusual activity was spotted: after the AP deployment, the RADIUS process showed frequent restarts. But why would that happen? After all, the RADIUS requests from the AP were well formatted. Digging further, we discovered that the contractor had entered an incorrect RADIUS secret in the AP. Instead of gracefully rejecting the requests coming from the AP, the RADIUS server process would restart upon seeing the incorrect secret (and without writing any logs). Entering the correct RADIUS secret in the AP fixed the issue. Hopefully, the customer will now maintain or upgrade their RADIUS infrastructure.
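For context, the graceful behavior is spelled out in the RADIUS RFCs: a request that fails the shared-secret check should be silently discarded, not crash the server. One way a server detects a wrong secret is the Message-Authenticator attribute of RFC 2869, an HMAC-MD5 over the request (with the attribute’s own value zeroed) keyed by the shared secret. A hedged sketch of that server-side check – the function names and dispatch logic are illustrative, not any particular server’s code:

```python
import hashlib
import hmac

def message_authenticator_ok(packet: bytes, mac_offset: int,
                             secret: bytes) -> bool:
    """Verify the 16-byte Message-Authenticator value at mac_offset:
    HMAC-MD5 over the packet with that value zeroed, keyed by the
    shared secret (RFC 2869). Fails if either side has the wrong
    secret."""
    received = packet[mac_offset:mac_offset + 16]
    zeroed = packet[:mac_offset] + b"\x00" * 16 + packet[mac_offset + 16:]
    expected = hmac.new(secret, zeroed, hashlib.md5).digest()
    return hmac.compare_digest(expected, received)

def handle_access_request(packet: bytes, mac_offset: int,
                          secret: bytes) -> str:
    """A well-behaved server drops a request that fails the secret
    check -- it logs and discards, it never crashes or restarts."""
    if not message_authenticator_ok(packet, mac_offset, secret):
        return "discard"   # silent discard per the RFCs
    return "process"
```

The customer’s server apparently took an unhandled error path on the failed check instead of hitting the “discard” branch.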
Toast to Network Jedi
I have many such stories to tell, and I am sure many of you do too. The real heroes of troubleshooting stories like these are the engineers who jump in to resolve insidious problems during Wi-Fi installs. What is also impressive is that they do this in networks that are completely new to them, and often only remotely and partially accessible. My shout-out to all the networking Jedi for making Wi-Fi work in more networks every day!