Tuesday, 15th October 2019

Linux Cloud Servers (RMA1) Incident Report Regarding Partial Connectivity Issues on 2019-10-08

Context

On 2019-09-30, during a scheduled maintenance window, we migrated the default gateways of your virtual servers at cloudscale.ch to new network equipment. In order to keep the downtime of the default gateways as short as possible, we used one of our Ansible playbooks to stop the advertisements of the old default gateways just seconds before deploying the new ones.

Timeline

On 2019-10-07, during another scheduled maintenance window, we live-migrated a subset of the virtual servers at cloudscale.ch to compute nodes that were physically attached to the new top-of-rack switches.

On the morning of 2019-10-08, the day after the migration, we received several reports from customers regarding connectivity issues. All of the affected virtual servers had a few things in common: they were running on compute nodes attached to the new switches, they had a public network interface, and they had lost connectivity minutes to hours after completion of the scheduled maintenance the night before. Virtual servers with only a private network interface did not appear to be affected.

We immediately gathered support data and escalated the case to the vendor of our new network equipment. Shortly after, we started a first debugging session with one of their escalation engineers.

At the same time, we decided to live-migrate the affected servers away from the compute nodes that were attached to the new top-of-rack switches. This immediately resolved the connectivity issues for the virtual servers in question. According to our observations, connectivity remained stable even after live-migrating these servers to compute nodes attached to the new network equipment once more.

Having skimmed through the support data in the meantime, the vendor suspected hardware misprogramming to be the root cause of the connectivity issues. As we had reason to believe that the issue had been remediated by the live-migrations, we then focused on root cause analysis in collaboration with the vendor.

On 2019-10-08 at 14:05 CEST we received the first report of a virtual server whose connectivity issues had recurred even though it had already been live-migrated back and forth. This immediately led to the decision to completely roll back the announced migration of the night before. Minutes later, two of our engineers set off for our data center.

On 2019-10-08 at 17:20 CEST we successfully completed the rollback process: All virtual servers had been live-migrated back to compute nodes that had been re-attached to the old network equipment. Within the following hours, several customers confirmed that they had not experienced any further issues after the rollback.

Root Cause

Further investigation in collaboration with the vendor revealed that a feature called "ARP and ND suppression", in combination with stale neighbor entries for the old default gateways, was the root cause of this incident. Specifically, the new top-of-rack switches were still replying to ARP and ND requests with the MAC addresses of the old default gateways (which had gone out of service on 2019-09-30). This only affected servers that were physically attached to the new top-of-rack switches.
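From a server's point of view, such a stale entry can be observed directly on the wire: an ARP request for the default gateway is answered with the MAC address of the decommissioned gateway instead of the new one. The following minimal sketch shows one way such a condition could be checked using Scapy; the interface name, IP address and MAC addresses are hypothetical placeholders and not the actual values involved in this incident. An analogous check could be performed for IPv6 neighbor discovery.

    #!/usr/bin/env python3
    """Minimal sketch: probe the default gateway via ARP and compare the MAC
    address in the reply against the expected one. All addresses below are
    hypothetical placeholders."""

    from typing import Optional

    from scapy.all import ARP, Ether, srp  # requires scapy; must run as root

    GATEWAY_IP = "192.0.2.1"            # placeholder gateway IP
    EXPECTED_MAC = "02:00:00:00:00:02"  # placeholder MAC of the new default gateway

    def probe_gateway_mac(ip: str, iface: str = "eth0", timeout: float = 2.0) -> Optional[str]:
        """Broadcast an ARP request for `ip` and return the MAC of the first reply."""
        request = Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=ip)
        answered, _ = srp(request, iface=iface, timeout=timeout, verbose=False)
        for _, reply in answered:
            return reply[ARP].hwsrc
        return None

    if __name__ == "__main__":
        mac = probe_gateway_mac(GATEWAY_IP)
        if mac is None:
            print(f"no ARP reply for {GATEWAY_IP}")
        elif mac.lower() != EXPECTED_MAC:
            # A stale reply from the switches' suppression cache would show up here.
            print(f"stale ARP reply: {GATEWAY_IP} is at {mac}, expected {EXPECTED_MAC}")
        else:
            print(f"ARP reply OK: {GATEWAY_IP} is at {mac}")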

The vendor confirmed that the persistence of these stale entries more than a week after the migration of the default gateways was caused by a bug in the firmware of our new network equipment.

After clearing those stale entries, we were able to confirm that connectivity was fully restored for all virtual servers. In agreement with the respective customers, we live-migrated several of their virtual servers back to compute nodes that were still attached to the new network equipment, along with internal (specifically monitoring) servers. We have been experiencing stable connectivity without any further issues ever since.
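For an ongoing verification of this kind, a simple check of the kernel's neighbor table can reveal whether the gateway's entry still points at the expected MAC address. The sketch below is an illustration only, built on the standard "ip neigh" command; the addresses are again hypothetical placeholders.

    #!/usr/bin/env python3
    """Minimal sketch: compare the kernel's neighbor entry for the default
    gateway against the expected MAC address. Addresses are placeholders."""

    import subprocess
    from typing import Optional

    GATEWAY_IP = "192.0.2.1"            # placeholder gateway IP
    EXPECTED_MAC = "02:00:00:00:00:02"  # placeholder MAC of the new default gateway

    def neighbor_mac(ip: str) -> Optional[str]:
        """Return the MAC recorded in the kernel neighbor table for `ip`, if any."""
        # Example output: "192.0.2.1 dev eth0 lladdr 02:00:00:00:00:02 REACHABLE"
        fields = subprocess.run(
            ["ip", "-4", "neigh", "show", ip],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        return fields[fields.index("lladdr") + 1] if "lladdr" in fields else None

    if __name__ == "__main__":
        mac = neighbor_mac(GATEWAY_IP)
        if mac is not None and mac.lower() != EXPECTED_MAC:
            print(f"WARNING: {GATEWAY_IP} resolves to {mac}, expected {EXPECTED_MAC}")
        else:
            print(f"neighbor entry for {GATEWAY_IP}: {mac}")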

Next Steps

As confirmed by the vendor, the newly discovered bug, which caused stale entries to be advertised through the "ARP and ND suppression" feature, could only have this effect because the MAC addresses of the default gateways changed in the course of the initial migration. Since the gateway migration was completed on 2019-09-30 and the stale entries were cleared manually in the process of researching and resolving this case, it is safe to assume that this issue cannot recur. This conclusion is also backed by the stable operation of the various virtual servers mentioned above.

After thorough analysis, we have therefore decided to resume the migration: next Monday, 2019-10-21, we again plan to live-migrate a subset of the virtual servers at cloudscale.ch to compute nodes attached to the new network equipment (see upcoming maintenance announcements).

Please accept our apologies for the inconvenience this issue may have caused you and your customers.