[Resolved] Connectivity outage

The Servology network suffered a short connectivity outage from approximately 12:55 to 12:58 today, caused by emergency upstream network maintenance. I received only about 12 minutes’ notice of the maintenance window, and the outage occurred less than half an hour into that window, so unfortunately I wasn’t in a position to take any action to mitigate the effects of the maintenance. My apologies for this.

[Resolved] Web service overload

[15:50] At around 15:15 today a large number of near-simultaneous connections were made to one of the former-Artonezero web servers. Because of the time taken to process each request, the server’s request queue filled up and new requests had to wait so long to be processed that the server was effectively unusable. I restarted the web server process at approximately 15:36, resetting all outstanding connections and allowing the service to return to normal. I will keep an eye on it and will try to find the cause if this happens again.
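
In case it’s useful, below is a rough sketch of the kind of check I might use to spot this sort of connection flood a little earlier. It assumes a Linux host with the ss utility, web traffic on ports 80 and 443, and an entirely arbitrary alert threshold – none of those details come from the actual server.

```python
"""Rough sketch of a connection-count check (not the real monitoring setup).

Assumptions: a Linux host with the `ss` utility, web traffic on ports 80/443,
and an arbitrary alert threshold that would need tuning to the server's
normal load.
"""
import subprocess

ALERT_THRESHOLD = 500  # hypothetical value, not measured from the real server


def established_web_connections() -> int:
    # Count established TCP connections whose local port is 80 or 443.
    out = subprocess.run(
        ["ss", "-Htn", "state", "established",
         "( sport = :80 or sport = :443 )"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())


if __name__ == "__main__":
    count = established_web_connections()
    print(f"established web connections: {count}")
    if count > ALERT_THRESHOLD:
        print("WARNING: connection count unusually high, possible overload")
```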

[Resolved] Outage on several virtual servers

[08:15] There has been an outage on several virtual servers hosted by Servology, including servers supporting Servology’s DNS, email and web services. This appears to be due to a problem with the physical host running those virtual servers. I am currently restarting the affected servers on alternative hardware. The outage will have affected most services hosted by Servology, including DNS resolution and web and email hosting.

[10:00] I believe everything was back up and running as of approximately 08:40 this morning. As far as I can tell the outage started around 07:30, although it looks like there may have been degraded performance on Servology-hosted VMs throughout the night.

[Recovered] Server shutdown due to overtemperature

[15:30] As far as I can tell, one of the former-Artonezero servers may have suffered a hardware failure and stopped working. This will affect some services for former-Artonezero customers; in particular, email for these customers is currently not working. Support via support@servology.co.uk is also not working. I am working on determining the extent of the problems and making plans to restart affected services on alternative hardware, but as I’m not at my desk today it may take some time. I will update this post as I make progress.

[17:30] I have begun starting virtual servers on alternative hardware. Servers should be starting to come back online. I have been prioritising email servers. Depending on the server, fsck (filesystem checks) may take a considerable time.

[18:10] My datacentre provider has advised that they were carrying out work on the cooling systems in the suite today. Despite the extra temporary cooling units they installed in the suite during the work, it seems some machines saw an increase in temperatures, in one case beyond the emergency shutdown threshold, which is presumably what caused today’s problems. I’m told the work is all complete now and the suite’s usual cooling is fully back in operation, and I have restarted the failed server. The machine seems to be working normally now, and I expect services to return to normal soon.

[18:45] I believe all servers and services used by former-Artonezero customers have come back up.

[Resolved] Upstream maintenance

[01:30] Servology’s primary transit provider is due to perform hardware maintenance on the router to which Servology connects, between 03:00 and 05:00 today. I am therefore about to adjust settings on our BGP router to move the majority of both inbound and outbound traffic away from our primary transit provider to our secondary transit provider for the duration of the maintenance. This should help to minimise disruption if the router undergoing maintenance is not shut down cleanly.
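
I won’t reproduce the actual router configuration here, but roughly: raising local-preference on routes learned from the secondary provider moves our outbound traffic, while prepending our own AS number on announcements to the primary makes remote networks prefer the secondary for inbound traffic. The toy Python model below illustrates the idea only; it is a simplification of BGP best-path selection, not Servology’s real configuration, and all the names in it are placeholders.

```python
"""Toy model of BGP best-path selection, to illustrate how local-preference
(outbound) and AS-path prepending (inbound) shift traffic between two transit
providers. This is a simplification, not Servology's actual configuration;
all names are placeholders.
"""
from dataclasses import dataclass, field


@dataclass
class Route:
    via: str             # transit provider the route is associated with
    local_pref: int      # higher wins (influences our outbound traffic)
    as_path: list = field(default_factory=list)  # shorter wins (influences inbound)


def best(routes):
    # Simplified decision process: highest local-pref, then shortest AS path.
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))


# Outbound: temporarily raise local-pref on routes learned from the secondary,
# so our router sends traffic that way.
outbound = [
    Route("primary", 100, ["AS-primary", "AS-destination"]),
    Route("secondary", 200, ["AS-secondary", "AS-destination"]),
]
print("outbound traffic leaves via:", best(outbound).via)  # -> secondary

# Inbound: prepend our own AS several times on announcements to the primary,
# so remote networks see a longer path that way and deliver via the secondary.
inbound_as_seen_remotely = [
    Route("primary", 100, ["AS-primary"] + ["AS-servology"] * 4),
    Route("secondary", 100, ["AS-secondary", "AS-servology"]),
]
print("inbound traffic arrives via:", best(inbound_as_seen_remotely).via)  # -> secondary
```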

[11:00] The upstream maintenance appears not to have gone ahead. My router did not see any disconnection or other interruption to service. The transit provider declared a major incident due to a hardware failure related to a maintenance window earlier in the night, so I can only guess they have postponed the maintenance that would have affected Servology’s connection to them. I have reverted last night’s routing changes to restore traffic to its usual paths, and will keep an eye out for future maintenance by the transit provider.

[Resolved] DNS resolver glitches

[14:30] I discovered today that there have been glitches in connectivity to one or both of the Servology DNS resolvers over the last day or two. The start of the problem appears to coincide with my installing the latest security updates on resolver B at the weekend, but as far as I can see the glitches were not caused by those updates – for now I am assuming the timing was coincidental. The cause seems to be a faulty VRRP process on an older router: two routers were both trying to claim the same IP address on a specific subnet where some of Servology’s internal servers are hosted. I have shut down the faulty VRRP process and this appears to have fixed the connectivity glitches.

I believe the problems were confined to a few IP addresses which are statically configured on routers and distributed via OSPF, so access to most services should not have been affected. Affected services included the Servology DNS resolvers and possibly some Servology NTP and SMTP servers. The IP addresses in question most likely worked fine from outside the Servology network (my monitoring server is hosted externally and did not report any problems) – I think the glitches were probably only visible from local subnets.

I will continue to monitor this in case of further problems.
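
For what it’s worth, the sketch below shows the sort of check that might have caught this sooner: watch which MAC address is currently answering for the shared IP and warn when it changes hands. It assumes a Linux host on the affected subnet with the ping and ip utilities available, and the address in it is a documentation placeholder, not one of Servology’s real addresses.

```python
"""Sketch of a check that might have caught the duplicate-address problem
sooner: watch which MAC address currently answers for a given IP and warn
when it changes hands. Assumes a Linux host on the affected subnet with the
`ping` and `ip` utilities; the address below is a documentation placeholder,
not a real Servology address.
"""
import re
import subprocess
import time

WATCHED_IP = "192.0.2.53"   # placeholder for the contested address
POLL_SECONDS = 30


def current_mac(ip: str):
    # Ping once to refresh the kernel's neighbour entry, then read it back.
    subprocess.run(["ping", "-c", "1", "-W", "1", ip], capture_output=True)
    out = subprocess.run(["ip", "neigh", "show", "to", ip],
                         capture_output=True, text=True).stdout
    match = re.search(r"lladdr\s+(\S+)", out)
    return match.group(1) if match else None


if __name__ == "__main__":
    last = current_mac(WATCHED_IP)
    print(f"{WATCHED_IP} is currently answered by {last}")
    while True:
        time.sleep(POLL_SECONDS)
        mac = current_mac(WATCHED_IP)
        if mac and last and mac != last:
            print(f"WARNING: {WATCHED_IP} moved from {last} to {mac} "
                  "- possible duplicate VRRP master")
        last = mac or last
```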

[Resolved] Connectivity outage

[00:30] The Servology colo network just suffered a connectivity outage lasting nearly five minutes, from 00:10 to 00:14 today. This appears to have been caused by an outage of the router we peer with at our primary transit provider, which in turn appears to be due to scheduled maintenance. I can only guess that the router was not shut down cleanly and that this is why it took several minutes for BGP to reconverge on our backup transit provider – presumably our router had to wait for the hold timer to expire before it would declare the session down and withdraw the routes.

I have manually shut down our BGP sessions to our primary transit provider for the time being. I am going to see if I can reverse the primary/backup configuration of the two connections for the remainder of the maintenance window, so that I can bring the BGP sessions back up (to avoid relying on a single transit provider for longer than necessary) without undue risk of the same problem happening again.

[16:00] The BGP sessions with the primary and backup transit providers are both up and running normally. I plan to shift traffic back to the primary transit tonight shortly after 22:00. There should (I hope!) be no noticeable outage, but there may be brief periods of packet loss as routing protocols reconverge.

[23:50] Reverting last night’s localpref and prepend changes and soft-refreshing BGP sessions.

[00:25] Work complete.

[Monitoring] Connectivity glitches

[01:40] We’ve had a couple of glitches in connectivity which were picked up by monitoring: one at about 00:58 which lasted for a minute or two, and another at about 01:30 which lasted a little more than five minutes. I can see that Servology’s BGP sessions have not been interrupted, so I can only assume the problem happened upstream of us. I will keep an eye on it.

[Resolved] Network instability

Monitoring has alerted me to some short network outages, the first of which lasted from around 22:06 to 22:12 this evening. This appears to have been caused by a problem on the BGP link between the Servology network and our primary upstream transit provider. The connection broke with a “Hold Timer Expired” notification sent by our router, which generally means the BGP process at the far end of the link was not responding for some reason.
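
For readers unfamiliar with the term: a BGP speaker expects to hear a KEEPALIVE or UPDATE from its peer within the negotiated hold time, and if it doesn’t, it sends a “Hold Timer Expired” notification and drops the session. The toy sketch below illustrates that behaviour only; it is not real BGP code and the timings are illustrative.

```python
"""Toy illustration of the hold-timer behaviour described above. This is not
real BGP code; the hold time and the way the timer is polled are illustrative
only.
"""
import time

HOLD_TIME = 90  # seconds; commonly negotiated values are 90 or 180


class ToyBgpSession:
    def __init__(self):
        self.last_heard = time.monotonic()
        self.established = True

    def on_keepalive_or_update(self):
        # Any KEEPALIVE or UPDATE from the peer resets the hold timer.
        self.last_heard = time.monotonic()

    def check_hold_timer(self):
        # If the peer has been silent for longer than the hold time, send a
        # NOTIFICATION (Hold Timer Expired) and tear the session down.
        if self.established and time.monotonic() - self.last_heard > HOLD_TIME:
            self.established = False
            print("NOTIFICATION sent: Hold Timer Expired - session torn down")


# A real implementation would poll check_hold_timer() continuously and send
# its own keepalives roughly every HOLD_TIME / 3 seconds.
```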

As there has been more than one such event, I have temporarily shut down our IPv4 session with that transit provider and all IPv4 connectivity is via our backup transit provider for now. I will keep an eye on the situation and reestablish our primary IPv4 BGP session when it seems safe to do so.

[02:00] This incident appears to have been due to outages of the transit provider’s router. My nagios monitoring installation regularly pings that router’s loopback IP address (i.e. not the interface address we use to connect to it) from a third-party hosting provider’s network, and it saw two periods when that address did not answer pings, corresponding to the two outages we saw. It has now been several hours without nagios noticing any further problems, so I have reestablished the BGP session, returning the network configuration to normal.
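
For reference, the external check amounts to something like the sketch below: ping the transit router’s loopback from a host outside the Servology network and log any gaps. The target address is a placeholder and the interval is arbitrary; the real checks are done by nagios rather than a hand-rolled script.

```python
"""Sketch of the external reachability check described above: ping the transit
router's loopback from a host outside the Servology network and log any gaps.
The target address is a placeholder and the interval is arbitrary; the real
checks are done by nagios rather than a hand-rolled script.
"""
import datetime
import subprocess
import time

TARGET = "198.51.100.1"   # placeholder for the transit router's loopback
INTERVAL = 60             # seconds between checks


def reachable(host: str) -> bool:
    # One ICMP echo request with a 2-second timeout (Linux ping options).
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            capture_output=True)
    return result.returncode == 0


if __name__ == "__main__":
    while True:
        if not reachable(TARGET):
            stamp = datetime.datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {TARGET} did not answer a ping")
        time.sleep(INTERVAL)
```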

[Resolved] Power redundancy failure in rack 2

At 05:18 this morning nagios notified me of a failure of one of the two power feeds in Servology rack 2 at Telehouse North. The ATS (automatic transfer switch) in the rack reported a loss of redundancy (although as far as I can see, the input which failed was the one it was not using at the time, so it did not need to switch), and several servers with dual power supplies reported that one of their two PSUs had failed. I contacted Star Europe; one of their staff found that a circuit breaker had tripped and reset it at about 07:15. This restored redundancy and I believe the problem is now resolved.

This does leave the question of why the breaker tripped. I’m not sure there’s much I can do on this front as the Star staff member couldn’t see anything untoward, but if it trips again I will have to find a way to investigate.