[Resolved] Connectivity outage

The Servology network suffered a short connectivity outage from approximately 12:55 to 12:58 today, caused by emergency upstream network maintenance. I received only about 12 minutes’ notice of the maintenance window, and the outage occurred less than half an hour into that window, so unfortunately I wasn’t in a position to take any action to mitigate the effects of the maintenance. My apologies for this.

[Resolved] Web service overload

[15:50] At around 15:15 today a large number of near-simultaneous connections were made to one of the former-Artonezero web servers. Because of the time taken to process each request, the server’s request queue filled up and new requests had to wait so long to be processed that the server was effectively unusable. I restarted the web server process at approximately 15:36, resetting all outstanding connections and allowing the service to return to normal. I will keep an eye on it and will try to find the cause if this happens again.
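
In case it’s useful, below is a rough sketch of the kind of check I might use to spot this sort of connection flood a little earlier. It assumes a Linux host with the ss utility, web traffic on ports 80 and 443, and an entirely arbitrary alert threshold – none of those details come from the actual server.

```python
"""Rough sketch of a connection-count check (not the real monitoring setup).

Assumptions: a Linux host with the `ss` utility, web traffic on ports 80/443,
and an arbitrary alert threshold that would need tuning to the server's
normal load.
"""
import subprocess

ALERT_THRESHOLD = 500  # hypothetical value, not measured from the real server


def established_web_connections() -> int:
    # Count established TCP connections whose local port is 80 or 443.
    out = subprocess.run(
        ["ss", "-Htn", "state", "established",
         "( sport = :80 or sport = :443 )"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())


if __name__ == "__main__":
    count = established_web_connections()
    print(f"established web connections: {count}")
    if count > ALERT_THRESHOLD:
        print("WARNING: connection count unusually high, possible overload")
```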

[Resolved] Outage on several virtual servers

[08:15] There has been an outage on several virtual servers hosted by Servology, including servers supporting Servology’s DNS, email and web services. This appears to be due to a problem with the physical host running those virtual servers. I am currently restarting the affected servers on alternative hardware. The outage will have affected most services hosted by Servology, including DNS resolution and web and email hosting.

[10:00] I believe everything was back up and running as of approximately 08:40 this morning. As far as I can tell the outage started around 07:30, although it looks like there may have been degraded performance on Servology-hosted VMs throughout the night.

[Recovered] Server shutdown due to overtemperature

[15:30] As far as I can tell, one of the former-Artonezero servers may have suffered a hardware failure and stopped working. This will affect some services for former-Artonezero customers; in particular, email for these customers is currently not working. Support via support@servology.co.uk is also not working. I am working on determining the extent of the problems and making plans to restart affected services on alternative hardware, but as I’m not at my desk today it may take some time. I will update this post as I make progress.

[17:30] I have begun starting virtual servers on alternative hardware. Servers should be starting to come back online. I have been prioritising email servers. Depending on the server, fsck (filesystem checks) may take a considerable time.

[18:10] My datacentre provider has advised that they were carrying out work on the cooling systems in the suite today. Despite the extra temporary cooling units they installed in the suite during the work, it seems some machines saw an increase in temperatures, in one case beyond the emergency shutdown threshold, which is presumably what caused today’s problems. I’m told the work is all complete now and the suite’s usual cooling is fully back in operation, and I have restarted the failed server. The machine seems to be working normally now, and I expect services to return to normal soon.

[18:45] I believe all servers and services used by former-Artonezero customers have come back up.

[Resolved] Upstream maintenance

[01:30] Servology’s primary transit provider is due to perform hardware maintenance on the router to which Servology connects, between 03:00 and 05:00 today. I am therefore about to adjust settings on our BGP router to move the majority of both inbound and outbound traffic away from our primary transit provider to our secondary transit provider for the duration of the maintenance. This should help to minimise disruption if the router undergoing maintenance is not shut down cleanly.
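
I won’t reproduce the actual router configuration here, but roughly: raising local-preference on routes learned from the secondary provider moves our outbound traffic, while prepending our own AS number on announcements to the primary makes remote networks prefer the secondary for inbound traffic. The toy Python model below illustrates the idea only; it is a simplification of BGP best-path selection, not Servology’s real configuration, and all the names in it are placeholders.

```python
"""Toy model of BGP best-path selection, to illustrate how local-preference
(outbound) and AS-path prepending (inbound) shift traffic between two transit
providers. This is a simplification, not Servology's actual configuration;
all names are placeholders.
"""
from dataclasses import dataclass, field


@dataclass
class Route:
    via: str             # transit provider the route is associated with
    local_pref: int      # higher wins (influences our outbound traffic)
    as_path: list = field(default_factory=list)  # shorter wins (influences inbound)


def best(routes):
    # Simplified decision process: highest local-pref, then shortest AS path.
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))


# Outbound: temporarily raise local-pref on routes learned from the secondary,
# so our router sends traffic that way.
outbound = [
    Route("primary", 100, ["AS-primary", "AS-destination"]),
    Route("secondary", 200, ["AS-secondary", "AS-destination"]),
]
print("outbound traffic leaves via:", best(outbound).via)  # -> secondary

# Inbound: prepend our own AS several times on announcements to the primary,
# so remote networks see a longer path that way and deliver via the secondary.
inbound_as_seen_remotely = [
    Route("primary", 100, ["AS-primary"] + ["AS-servology"] * 4),
    Route("secondary", 100, ["AS-secondary", "AS-servology"]),
]
print("inbound traffic arrives via:", best(inbound_as_seen_remotely).via)  # -> secondary
```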

[11:00] The upstream maintenance appears not to have gone ahead. My router did not see any disconnection or other interruption to service. The transit provider declared a major incident due to a hardware failure related to a maintenance window earlier in the night, so I can only guess they have postponed the maintenance that would have affected Servology’s connection to them. I have reverted last night’s routing changes to restore traffic to its usual paths, and will keep an eye out for future maintenance by the transit provider.

[Resolved] DNS resolver glitches

[14:30] I discovered today that there have been glitches in connectivity to one or both of the Servology DNS resolvers over the last day or two. The start of the problem appears to coincide with my installing the latest security updates on resolver B at the weekend, but as far as I can see the glitches were not caused by those updates – for now I am assuming the timing was coincidental. The cause seems to be a faulty VRRP process on an older router: two routers were both trying to claim the same IP address on a specific subnet where some of Servology’s internal servers are hosted. I have shut down the faulty VRRP process and this appears to have fixed the connectivity glitches.

I believe the problems were confined to a few IP addresses which are statically configured on routers and distributed via OSPF, so access to most services should not have been affected. Affected services included the Servology DNS resolvers and possibly some Servology NTP and SMTP servers. The IP addresses in question most likely worked fine from outside the Servology network (my monitoring server is hosted externally and did not report any problems) – I think the glitches were probably only visible from local subnets.

I will continue to monitor this in case of further problems.
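
For what it’s worth, the sketch below shows the sort of check that might have caught this sooner: watch which MAC address is currently answering for the shared IP and warn when it changes hands. It assumes a Linux host on the affected subnet with the ping and ip utilities available, and the address in it is a documentation placeholder, not one of Servology’s real addresses.

```python
"""Sketch of a check that might have caught the duplicate-address problem
sooner: watch which MAC address currently answers for a given IP and warn
when it changes hands. Assumes a Linux host on the affected subnet with the
`ping` and `ip` utilities; the address below is a documentation placeholder,
not a real Servology address.
"""
import re
import subprocess
import time

WATCHED_IP = "192.0.2.53"   # placeholder for the contested address
POLL_SECONDS = 30


def current_mac(ip: str):
    # Ping once to refresh the kernel's neighbour entry, then read it back.
    subprocess.run(["ping", "-c", "1", "-W", "1", ip], capture_output=True)
    out = subprocess.run(["ip", "neigh", "show", "to", ip],
                         capture_output=True, text=True).stdout
    match = re.search(r"lladdr\s+(\S+)", out)
    return match.group(1) if match else None


if __name__ == "__main__":
    last = current_mac(WATCHED_IP)
    print(f"{WATCHED_IP} is currently answered by {last}")
    while True:
        time.sleep(POLL_SECONDS)
        mac = current_mac(WATCHED_IP)
        if mac and last and mac != last:
            print(f"WARNING: {WATCHED_IP} moved from {last} to {mac} "
                  "- possible duplicate VRRP master")
        last = mac or last
```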

[Resolved] Connectivity outage

[00:30] The Servology colo network just suffered a connectivity outage lasting nearly five minutes, from 00:10 to 00:14 today. This appears to have been caused by an outage of the router we peer with at our primary transit provider, which in turn appears to be due to scheduled maintenance. I can only guess that the router was not shut down cleanly and that this is why it took several minutes for BGP to reconverge on our backup transit provider – presumably our router had to wait for the hold timer to expire before it would declare the session down and withdraw the routes.

I have manually shut down our BGP sessions to our primary transit provider for the time being. I am going to see if I can reverse the primary/backup configuration of the two connections for the remainder of the maintenance window, so that I can bring the BGP sessions back up (to avoid relying on a single transit provider for longer than necessary) without undue risk of the same problem happening again.

[16:00] The BGP sessions with the primary and backup transit providers are both up and running normally. I plan to shift traffic back to the primary transit tonight shortly after 22:00. There should (I hope!) be no noticeable outage, but there may be brief periods of packet loss as routing protocols reconverge.

[23:50] Reverting last night’s localpref and prepend changes and soft-refreshing BGP sessions.

[00:25] Work complete.

[Monitoring] Connectivity glitches

[01:40] We’ve had a couple of glitches in connectivity which were picked up by monitoring: one at about 00:58 which lasted for a minute or two, and another at about 01:30 which lasted a little more than five minutes. I can see that Servology’s BGP sessions have not been interrupted, so I can only assume the problem happened upstream of us. I will keep an eye on it.

[Resolved] Network instability

Monitoring has alerted me to some short network outages, the first of which lasted from around 22:06 to 22:12 this evening. This appears to have been caused by a problem on the BGP link between the Servology network and our primary upstream transit provider. The connection broke with a “Hold Timer Expired” notification sent by our router, which generally means the BGP process at the far end of the link was not responding for some reason.
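
For readers unfamiliar with the term: a BGP speaker expects to hear a KEEPALIVE or UPDATE from its peer within the negotiated hold time, and if it doesn’t, it sends a “Hold Timer Expired” notification and drops the session. The toy sketch below illustrates that behaviour only; it is not real BGP code and the timings are illustrative.

```python
"""Toy illustration of the hold-timer behaviour described above. This is not
real BGP code; the hold time and the way the timer is polled are illustrative
only.
"""
import time

HOLD_TIME = 90  # seconds; commonly negotiated values are 90 or 180


class ToyBgpSession:
    def __init__(self):
        self.last_heard = time.monotonic()
        self.established = True

    def on_keepalive_or_update(self):
        # Any KEEPALIVE or UPDATE from the peer resets the hold timer.
        self.last_heard = time.monotonic()

    def check_hold_timer(self):
        # If the peer has been silent for longer than the hold time, send a
        # NOTIFICATION (Hold Timer Expired) and tear the session down.
        if self.established and time.monotonic() - self.last_heard > HOLD_TIME:
            self.established = False
            print("NOTIFICATION sent: Hold Timer Expired - session torn down")


# A real implementation would poll check_hold_timer() continuously and send
# its own keepalives roughly every HOLD_TIME / 3 seconds.
```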

As there has been more than one such event, I have temporarily shut down our IPv4 session with that transit provider and all IPv4 connectivity is via our backup transit provider for now. I will keep an eye on the situation and reestablish our primary IPv4 BGP session when it seems safe to do so.

[02:00] This incident appears to have been due to outages of the transit provider’s router. My nagios monitoring installation regularly pings that router’s loopback IP address (i.e. not the interface address we use to connect to it) from a third-party hosting provider’s network, and it saw two periods when that address did not answer pings, corresponding to the two outages we saw. It has now been several hours without nagios noticing any further problems, so I have reestablished the BGP session, returning the network configuration to normal.
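
For reference, the external check amounts to something like the sketch below: ping the transit router’s loopback from a host outside the Servology network and log any gaps. The target address is a placeholder and the interval is arbitrary; the real checks are done by nagios rather than a hand-rolled script.

```python
"""Sketch of the external reachability check described above: ping the transit
router's loopback from a host outside the Servology network and log any gaps.
The target address is a placeholder and the interval is arbitrary; the real
checks are done by nagios rather than a hand-rolled script.
"""
import datetime
import subprocess
import time

TARGET = "198.51.100.1"   # placeholder for the transit router's loopback
INTERVAL = 60             # seconds between checks


def reachable(host: str) -> bool:
    # One ICMP echo request with a 2-second timeout (Linux ping options).
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            capture_output=True)
    return result.returncode == 0


if __name__ == "__main__":
    while True:
        if not reachable(TARGET):
            stamp = datetime.datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {TARGET} did not answer a ping")
        time.sleep(INTERVAL)
```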

[Resolved] Power redundancy failure in rack 2

At 05:18 this morning nagios notified me of a failure of one of the two power feeds in Servology rack 2 at Telehouse North. The ATS (automatic transfer switch) in the rack reported a loss of redundancy (although as far as I can see, the input which failed was the one it was not using at the time, so it did not need to switch), and several servers with dual power supplies reported that one of their two PSUs had failed. I contacted Star Europe; one of their staff found that a circuit breaker had tripped and reset it at about 07:15. This restored redundancy and I believe the problem is now resolved.

This does leave the question of why the breaker tripped. I’m not sure there’s much I can do on this front as the Star staff member couldn’t see anything untoward, but if it trips again I will have to find a way to investigate.