Monitoring has alerted me to some network outages which occurred between about 12:30 and 13:00 today. Over the weekend I had changed the way routing processes are restarted on my core routers if they become unresponsive, and I believe this kept the disruption shorter than last time.
I think I’ve finally pinned down the root cause of these outages. One of the core routers is running short of memory, and this seems to make its routing processes run slowly. (The other core router has twice the memory and does not seem to have the same kind of problem.) Protocol timers then expire, at which point other routers tear down and re-establish their associations with the problem router, causing an outage in the meantime. On its own this would not be too disruptive, but the router’s internal monitoring also detects that the routing processes have become unresponsive, so it shuts them down and restarts them, which takes additional time.
To mitigate the problem I have disabled a link between two core routers for now. I expect this to reduce the memory pressure and keep things running smoothly until I can implement a more permanent fix.
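
For anyone curious about the timer expiry described above, here is a minimal sketch of how a neighbouring router might decide that an association has gone down. It is an illustration only: the function name, the 10-second keepalive spacing and the 40-second hold time are all hypothetical values chosen for the example, not taken from my routers or from any particular routing protocol, and the time needed to re-establish the association is ignored.

```python
def adjacency_outages(keepalive_times, hold_time):
    """Return (down_at, up_at) intervals as seen by a neighbouring router.

    keepalive_times: sorted arrival times (seconds) of keepalives from the
                     slow router. Hypothetical data for illustration.
    hold_time:       how long the neighbour waits without hearing a
                     keepalive before dropping the association.
    """
    outages = []
    # Compare each keepalive with the one that follows it.
    for prev, nxt in zip(keepalive_times, keepalive_times[1:]):
        if nxt - prev > hold_time:
            # The timer expired at prev + hold_time. In this simplified
            # model the association comes back when the next keepalive
            # arrives.
            outages.append((prev + hold_time, nxt))
    return outages


# A healthy router sends keepalives on schedule: no timer ever expires.
healthy = adjacency_outages([0, 10, 20, 30], hold_time=40)

# A router short of memory stalls for 55 seconds between keepalives,
# which is longer than the hold time, so the neighbour drops it.
stalled = adjacency_outages([0, 10, 20, 75, 85], hold_time=40)
```

In this simplified model a single stall longer than the hold time is enough to drop the association, which matches the behaviour I have been seeing: the problem router does not need to crash, only to fall far enough behind.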
I have recently been configuring a pair of new core routers to replace the (aging) current ones. The new routers have much faster processors and 32 times the memory of the current ones. I have been prioritising this work since the outage last Thursday, especially over the weekend, and I expect to get the new routers into service very soon, which should fully resolve the recent network problems.