DIY Downsides: When successful troubleshooting deprives you of an excuse for upgrading.

The same base urges that led me to upgrade my home server have had me itching to upgrade my home network. The server build sated that hunger for a while, but now that it is done, I’ve been feeling the pull again. I’ve held out, but yesterday, I thought I was going to HAVE to upgrade because one of my routers started acting up. Happily/sadly, successful troubleshooting has stymied such rationalizations.

The current network arrangement consists of a Netgear WNDR3800 running OpenWRT as the main router/firewall. It has direct connections to a couple of small ARM servers running Debian, which run various services, and it also provides WiFi access. At the other side of the house, there is another router, a Netgear WNDR3700, also running OpenWRT, that is bridged to the main router over 5GHz WiFi. It provides wired connectivity for a printer, a server, and an Acu-Link bridge that connects a weather station to the Internet. It also provides WiFi access in the 2.4GHz band and ensures decent connectivity from our back yard.

I’ve been considering various options for upgrading, with the goals of getting a higher-speed link between the two routers and a speed bump for my laptop, which has a 3-stream radio, whereas the routers are limited to two streams. I wasn’t entirely happy with any of the options though, which is one of the reasons I’d been able to resist the upgrade urge.

Yesterday though, as I’ve already mentioned, things started getting flaky. I noticed that the second router seemed to be rebooting every ten to fifteen minutes, interrupting connectivity for connected devices for a good 60-90 seconds. I could have used this as an excuse to upgrade, but I tried to troubleshoot the problem first.

I’d recently moved some equipment around, so the first thing I checked was that the power cord hadn’t been partially unplugged. It didn’t seem like it had, but I made sure it was well plugged in and waited to see if the problem continued. It did.

As part of the equipment moves, I’d swapped in a higher-capacity UPS. The UPS had been functioning just fine in its previous location, but I wondered if it was adding noise that the router’s power supply was having trouble contending with, so I swapped in another 12V DC wall wart. At first, that seemed to solve the problem: no reboots in over 20 minutes. But before another 20 minutes were over, the router rebooted again.

I wondered if it was a temperature issue. It was a warm day, though no warmer than other days this summer. I decided to take the router apart to see if there was any dust impeding convection through the case. Once I got the case open though, I realized that I’d already opened and cleaned out the router a few weeks before. Back to the drawing board.

I decided to open up an SSH session so I could watch the system log (OpenWRT logs to a ring buffer in memory, so you have to use “logread -f”) in the hopes that I’d see something useful before the device rebooted itself. Once I had that running, I took my dog for a ~60 minute walk. When I returned, the router was still running, still logging to my screen. I scrolled back over the logs and didn’t see anything unusual. I started to wonder if it was a “heisenbug.” Perhaps I was going to have to upgrade after all.
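
For the curious, that amounts to something like the following (“router2.lan” is a stand-in for my second router’s actual address):

    # From another machine: follow the OpenWRT in-memory system log over SSH.
    ssh root@router2.lan logread -f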

The log filled with information about connections and disconnections from WiFi devices, and IPv6-related housekeeping. A few minutes later though, something different flashed past on the console. I scrolled back and saw that the kernel had detected a low-memory condition and killed off a maintenance script to keep from running out of memory. I thought it strange, but I didn’t see what would have changed to make low-memory conditions commonplace, or how they’d lead to a reboot. It was, however, my only lead, and as I sat there thinking it over, I saw another low-memory warning.
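
As an aside, the kernel’s messages also accumulate in its ring buffer, so a check like this would have turned up the same evidence after the fact:

    # Search the kernel ring buffer for out-of-memory killer activity.
    dmesg | grep -i "out of memory"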

It was starting to make sense. I imagined how the router could get into a situation where it killed off an essential process and ended up rebooting. It might even be that the router hardware had a watchdog function, which would reboot the device if a process didn’t reset the watchdog timer at a regular interval. Killing that process could lead to a reboot.

Now that my only lead was starting to seem plausible, I dug deeper. I first checked the amount of memory in use with the “free” command. It reported less than 2MB free, which surprised me, since I remembered it was typically many times that. Next I used “ps” to see what processes were running and which were using the most memory. None of them were obviously huge, but a few of the larger processes didn’t look quite right.
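
Both are stock BusyBox commands on OpenWRT; the checks were roughly:

    # Report how much memory is free and in use.
    free
    # List running processes; the VSZ column gives a rough per-process
    # memory footprint to eyeball.
    ps w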

I have an Acu-Rite weather station and an Acu-Link bridge. The bridge relays readings from the weather station to Acu-Rite’s servers. I have a system in place to capture the data as it is being transmitted and feed it to WeeWX on one of my servers. WeeWX keeps its own record, and also updates Weather Underground. To accomplish this, I have a startup script on the secondary router that uses “ncat” to wait for an incoming connection from the driver I wrote for WeeWX. Once the connection comes in, ngrep sniffs for packets from the Acu-Link bridge relaying data from my weather station to the Internet service. That data is forwarded over my network to the waiting WeeWX driver on my server.
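
A minimal sketch of that arrangement follows; the listen port, the bridge’s address, and the ngrep filter are all invented for illustration rather than taken from my actual script:

    #!/bin/sh
    # Hypothetical sketch of the sniffer script; the address, port,
    # and filter below are placeholders.
    BRIDGE_IP=192.168.1.50

    # Listen for the WeeWX driver; for each connection, run ngrep and
    # send whatever it captures back over the socket. -k keeps listening
    # after a connection closes; -m caps simultaneous connections.
    ncat -l -k -m 4 \
        -c "ngrep -l -q -W byline '' src host $BRIDGE_IP and dst port 80" \
        9000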

The problem was that there were 3-4 copies of these processes running on the router. Ordinarily, there should only be one. After a connection is lost temporarily and WeeWX has reconnected, there might be two running before the old copy times out. I tried reducing the max number of connections from four to two, which helped a bit, but it was still cycling the connection much too quickly and for no obvious reason.

I tried restarting WeeWX, but that didn’t help; it was still connecting and then disconnecting too frequently. To debug things further, I decided to simulate the server connection and see if the data was being captured and forwarded properly. It wasn’t!
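
Simulating the connection just meant connecting by hand the way the driver would (again, the host and port are stand-ins):

    # Connect to the sniffer script the way the WeeWX driver would, and
    # watch whether any captured weather data comes back.
    ncat router2.lan 9000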

From here, I wanted to connect to the diagnostic webserver running on the Acu-Link to confirm that it had a connection to the weather station. I looked at the main router to find the Acu-Link’s IP address, and in the process I realized that it had changed. I hadn’t created a DHCP reservation for the Acu-Link device because DNSMasq, which provides DHCP service on the router, generally provides a stable IP address to devices. Something had happened though, probably when I was moving around hardware, and it had assigned a new IP address.
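
On OpenWRT, DNSMasq records its active leases in /tmp, which is where the change showed up:

    # Each lease line: expiry, MAC address, IP address, hostname, client ID.
    cat /tmp/dhcp.leases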

So, to recap:

  1. My script was depending on the default behavior of DNSMasq to keep the IP address of the Acu-Link bridge stable.
  2. As a result of reconfiguring my network, something changed, and with it the result of that default behavior, so the Acu-Link bridge was assigned a new IP address.
  3. As a result of the changed IP address, my script for sniffing weather data failed to collect that data and forward it to WeeWX running on my server.
  4. Because it wasn’t getting data as expected, WeeWX tried to reconnect to the sniffer script.
  5. Because WeeWX was reconnecting as frequently as it was, excess copies of ncat, ngrep and the ash shell accumulated on the router, eating up memory.
  6. Because of the low memory condition, the kernel’s out-of-memory killer (OOM killer) started killing off processes.
  7. Because some key process was killed by the OOM killer, the router rebooted, continuing the cycle.
  8. Because the router was rebooting frequently, I decided to troubleshoot it.
  9. Because I succeeded in troubleshooting the root cause, I have removed a reason to buy a new router.

Sometimes being awesome has to be its own reward, I guess.

Since I couldn’t in good conscience buy a new router, I did a few easy things to fix the problem. I configured the DHCP server to assign a predefined IP to the Acu-Link bridge, and updated the script to look for packets from that IP.
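
On OpenWRT, that reservation can be made through UCI; the hostname, MAC address, and IP below are placeholders:

    # Pin the Acu-Link bridge to a fixed address (placeholder values).
    uci add dhcp host
    uci set dhcp.@host[-1].name='aculink'
    uci set dhcp.@host[-1].mac='00:11:22:33:44:55'
    uci set dhcp.@host[-1].ip='192.168.1.50'
    uci commit dhcp
    /etc/init.d/dnsmasq restart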

There are additional remediations for this issue that I probably won’t undertake, including:

  1. Updating my WeeWX driver to make sure it properly cleans up sockets when reconnecting. This might lead to quicker cleanup of the old processes on the router.
  2. Updating my WeeWX driver to throttle the rate at which it retries connections.
  3. Trying to automatically detect the IP of the Acu-Link bridge and then using that IP for the ongoing packet sniffing (a minimal sketch follows this list).
  4. Adding some code to the sniffer script that will be more aggressive about cleaning up unused connections.
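
For number 3, the lease file from earlier would make the lookup straightforward; a sketch, assuming the bridge registers the hostname “aculink”:

    # Look up the bridge's current lease by hostname (field 4) and grab
    # its IP (field 3). "aculink" is an assumed hostname.
    BRIDGE_IP=$(awk '$4 == "aculink" { print $3 }' /tmp/dhcp.leases)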