[00:30] The Servology colo network just suffered a connectivity outage lasting nearly 5 minutes, from 00:10 to 00:14 today. This appears to have been caused by an outage of the router we peer with at our primary transit provider, which in turn appears to be due to scheduled maintenance. I can only guess that the router was not shut down cleanly and that this is why it took several minutes for BGP to reconverge on our backup transit provider.
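As a rough illustration of why an unclean shutdown slows failover: a router that goes away without sending a BGP NOTIFICATION is only declared dead once the hold timer expires, and routes via the backup are not selected before then. The sketch below works out that detection window. The timer values are hypothetical examples, not our actual session settings, and the function name is made up for illustration.

```python
def detection_window(hold_time_s: int, keepalive_s: int) -> tuple[int, int]:
    """Return (best, worst) seconds until the hold timer expires
    after a peer stops responding without closing the session.

    The hold timer restarts on every KEEPALIVE or UPDATE received, so:
      - worst case: the peer dies right after sending a KEEPALIVE
      - best case:  the peer dies just before its next KEEPALIVE was due
    """
    if hold_time_s <= 0:
        # A negotiated hold time of 0 means no keepalives are sent at all,
        # so the timer never detects a silent failure.
        raise ValueError("hold timer disabled; silent failure is not detected")
    if keepalive_s >= hold_time_s:
        raise ValueError("keepalive interval must be shorter than hold time")
    return hold_time_s - keepalive_s, hold_time_s


if __name__ == "__main__":
    # Hypothetical timer pairs (hold, keepalive) in seconds.
    for hold, keepalive in [(180, 60), (90, 30)]:
        best, worst = detection_window(hold, keepalive)
        print(f"hold={hold}s keepalive={keepalive}s -> "
              f"failure noticed after {best}-{worst}s")
```

Detection is only the first part of the delay; withdrawing the dead routes and propagating the alternatives upstream adds further time on top of it.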
I have manually shut down our BGP sessions to our primary transit provider for the time being. I am going to see whether I can swap the primary and backup roles of the two connections for the remainder of the maintenance window. That would let me bring the BGP sessions back up, so that we are not relying on a single transit provider for longer than necessary, without adding to the risk of this problem recurring.
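For readers unfamiliar with how a primary/backup swap works in BGP, here is a minimal sketch of the two knobs involved: local preference, which decides the path our own routers choose for outbound traffic, and AS-path prepending, which nudges remote networks toward one inbound path. It models only the first two steps of best-path selection. The AS numbers are from the documentation range and the preference values are invented, so none of it reflects the real configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    via: str              # label for the transit the route was learned from
    local_pref: int       # higher is preferred
    as_path: tuple        # shorter is preferred when local_pref ties


def best_route(routes):
    """Pick the best route: highest local_pref, then shortest AS path."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path)))


def prepend(as_path, own_asn, times):
    """Lengthen an announced AS path by repeating our own ASN."""
    return (own_asn,) * times + tuple(as_path)


OUR_ASN = 64500  # documentation ASN, hypothetical

# Outbound: we set local_pref on routes received from each transit.
normal = [Route("primary", 200, (64501,)), Route("backup", 100, (64502,))]
swapped = [Route("primary", 100, (64501,)), Route("backup", 200, (64502,))]

# Inbound: a remote network compares the paths it hears towards us.
# Prepending on the announcement sent via the primary makes that path longer.
remote_view = [
    Route("primary", 100, (64501,) + prepend((OUR_ASN,), OUR_ASN, 2)),
    Route("backup", 100, (64502, OUR_ASN)),
]

if __name__ == "__main__":
    print("outbound, normal: ", best_route(normal).via)
    print("outbound, swapped:", best_route(swapped).via)
    print("inbound, prepended on primary:", best_route(remote_view).via)
```

Prepending is only a hint: a remote network that sets its own local preference will ignore AS-path length, so inbound traffic may not shift completely.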
[16:00] The BGP sessions with the primary and backup transit providers are both up and running normally. I plan to shift traffic back to the primary transit tonight shortly after 22:00. There should (I hope!) be no noticeable outage, but there may be brief periods of packet loss as routing protocols reconverge.
[23:50] Reverting last night’s localpref and prepend changes and soft-refreshing BGP sessions.
[00:25] Work complete.