All systems are operational
Scheduled Maintenance
Test of Main Power Failure

We have been informed by our data center provider that they will perform a main power failure test. Both main power feeds of the data center are backed by independent UPS systems as well as diesel generators, and all of our systems are connected to both power feeds. Therefore, we do not expect any impact.

Date / Time
Saturday, 2019-10-19, 13:00 CEST to 18:00 CEST

Expected Impact
No impact expected.

Thank you for your understanding.

Network Upgrade: Partial Migration to New Network Hardware

In the evening and night from Monday, 2019-10-21, to Tuesday, 2019-10-22, we will live-migrate a subset of the virtual servers at cloudscale.ch to compute nodes that are attached to new network equipment. It is possible that your virtual servers will be live-migrated multiple times during this maintenance work. You may also experience short periods of packet loss (up to 1-2 minutes) or higher RTTs for connections to and from the Internet as well as between virtual servers (even when using the private network interface).

Date / Time
Monday, 2019-10-21, 20:00 CEST to Tuesday, 2019-10-22, 02:00 CEST

Expected Impact
Short periods of packet loss (up to 1-2 minutes) or higher RTTs for connections to and from the Internet as well as between virtual servers.
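
If you would like to quantify the impact on your own servers during this window, the following minimal sketch may help; it is not part of our tooling, and the host, port, and interval values are placeholders. It repeatedly measures the TCP connect time to one of your servers and logs failed probes.

# Hypothetical monitoring sketch; HOST, PORT, and INTERVAL are placeholders.
import socket
import time
from typing import Optional

HOST = "203.0.113.10"   # placeholder: public address of one of your servers
PORT = 22               # placeholder: any TCP port your server listens on
INTERVAL = 5            # seconds between probes

def probe(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if the probe fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

if __name__ == "__main__":
    while True:
        rtt = probe(HOST, PORT)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        if rtt is None:
            print(f"{stamp}  probe failed (timeout or connection error)")
        else:
            print(f"{stamp}  connect time: {rtt:.1f} ms")
        time.sleep(INTERVAL)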

We apologize for any inconvenience this may cause and thank you for your understanding.

Past Incidents

Saturday, 19th October 2019

No incidents reported

Friday, 18th October 2019

No incidents reported

Thursday, 17th October 2019

No incidents reported

Wednesday, 16th October 2019

No incidents reported

Tuesday, 15th October 2019

Linux Cloud Servers: Incident Report Regarding Partial Connectivity Issues on 2019-10-08

Context

On 2019-09-30, during a scheduled maintenance window, we migrated the default gateways of your virtual servers at cloudscale.ch to new network equipment. In order to keep the downtime of the default gateways as short as possible, we used one of our Ansible playbooks to stop the advertisements of the old default gateways just seconds before deploying the new ones.

Timeline

On 2019-10-07, during another scheduled maintenance window, we live-migrated a subset of the virtual servers at cloudscale.ch to compute nodes that were physically attached to new top-of-rack switches.

On the morning of 2019-10-08, the day after the migration, we received several reports from customers regarding connectivity issues. All of the affected virtual servers had a few things in common: they were running on compute nodes attached to the new switches, they had a public network interface, and they had lost connectivity minutes to hours after completion of the scheduled maintenance the night before. Virtual servers that only had a private network interface did not seem to be affected.

We immediately gathered support data and escalated the case to the vendor of our new network equipment. Shortly after, we started a first debugging session with one of their escalation engineers.

At the same time, we decided to live-migrate the affected servers away from the compute nodes that were attached to the new top-of-rack switches. This immediately resolved the connectivity issues for the virtual servers in question. According to our observations, connectivity remained stable even after live-migrating these servers to compute nodes attached to the new network equipment once more.

Having skimmed through the support data in the meantime, the vendor suspected hardware misprogramming to be the root cause of the connectivity issues. As we had reason to believe that the issue had been remediated by the live-migrations, we then focused on root cause analysis in collaboration with the vendor.

On 2019-10-08 at 14:05 CEST we received the first report that the connectivity issues had recurred for a virtual server even though it had already been live-migrated back and forth. This immediately led to the decision to completely roll back the migration of the night before. Minutes later, two of our engineers set off for our data center.

On 2019-10-08 at 17:20 CEST we successfully completed the rollback process: All virtual servers had been live-migrated back to compute nodes that had been re-attached to the old network equipment. Within the following hours, several customers confirmed that they had not experienced any further issues after the rollback.

Root Cause

Further investigation in collaboration with the vendor revealed that a feature called "ARP and ND suppression" in combination with stale neighbor entries for the old default gateways was the root cause of this incident. In fact, the new top-of-rack switches were still replying to ARP and ND requests using the MAC address of the old default gateways (which had gone out of service on 2019-09-30). This only affected virtual servers running on compute nodes that were physically attached to the new top-of-rack switches.

The vendor confirmed that the persistence of these stale entries, more than a week after the migration of the default gateways, was caused by a bug in the firmware of our new network equipment.
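
For illustration, the following is a minimal diagnostic sketch showing how such a stale reply could be detected from an affected server; it is not part of our tooling, and the gateway address, expected MAC address, and interface name are placeholders. It broadcasts an ARP request for the default gateway using scapy and compares the answering MAC address against the expected one.

# Hypothetical diagnostic sketch, not part of cloudscale.ch tooling.
# Requires scapy and root privileges; all addresses below are placeholders.
from scapy.all import ARP, Ether, srp

GATEWAY_IP = "192.0.2.1"             # placeholder: default gateway address
EXPECTED_MAC = "aa:bb:cc:dd:ee:ff"   # placeholder: MAC of the new default gateway
IFACE = "eth0"                       # placeholder: public network interface

def check_gateway_mac(gateway_ip, expected_mac, iface):
    """Return True if the ARP reply for the gateway matches the expected MAC."""
    # Broadcast an ARP "who-has" request for the gateway IP and wait for a reply.
    request = Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=gateway_ip)
    answered, _ = srp(request, iface=iface, timeout=2, verbose=False)
    if not answered:
        print("No ARP reply received")
        return False
    answering_mac = answered[0][1][ARP].hwsrc.lower()
    if answering_mac != expected_mac.lower():
        # A stale entry would show up here as the MAC of the old default gateway.
        print(f"Stale ARP reply: {gateway_ip} answered from {answering_mac}")
        return False
    print(f"ARP reply OK: {gateway_ip} is at {answering_mac}")
    return True

if __name__ == "__main__":
    check_gateway_mac(GATEWAY_IP, EXPECTED_MAC, IFACE)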

After clearing those stale entries, we were able to confirm that connectivity was fully restored for all virtual servers. In agreement with the respective customers, we live-migrated several of their virtual servers back to compute nodes that were still attached to the new network equipment, along with internal (specifically monitoring) servers. We have been experiencing stable connectivity without any further issues ever since.

Next Steps

As confirmed by the vendor, the newly discovered bug, which caused stale entries to be advertised through the "ARP and ND suppression" feature, could only have this effect because the MAC addresses of the default gateways changed in the course of the initial migration. Since the gateway migration was completed on 2019-09-30 and the stale entries were cleared manually while researching and resolving this case, it is safe to assume that this issue cannot recur. This conclusion is also backed by the stable operation of the virtual servers mentioned above.

After thorough analysis, we have therefore decided to resume the migration and again plan to live-migrate a subset of the virtual servers at cloudscale.ch to compute nodes attached to the new network equipment next Monday, 2019-10-21 (see upcoming maintenance announcements).

Please accept our apologies for the inconvenience this issue may have caused you and your customers.

Monday, 14th October 2019

No incidents reported

Sunday, 13th October 2019

No incidents reported