Wednesday, 27th November 2019

Network Infrastructure (RMA) Incident Report Regarding Outages on 2019-11-22

Context

Over a period of more than a year, cloudscale.ch has been evaluating and testing new network equipment. Our recently opened cloud location in Lupfig has been built on this new equipment right from the start. The existing network equipment at our location in Rümlang, on the other hand, has been replaced gradually with the new devices during several announced maintenance windows. Replacing the existing network devices allowed us to further increase the speed of key links (e.g. of our NVMe-only storage cluster) to 100 Gbps and to optimally interconnect the existing and new cloud region, Rümlang (RMA) and Lupfig (LPG). Furthermore, the new setup provides more network ports overall to accommodate our continued growth.

During the evaluation and testing period, we closely worked with the vendor's engineering team to test and verify the new setup for our specific environment. After extensive testing of different configurations and over the course of multiple new firmware releases, we gained confidence that the new setup had reached the required maturity to take over our productive workload.

Timeline

On 2019-09-30, we took the first step by switching over the default gateways of all servers at our then only location RMA1 in a previously announced maintenance window. Between 2019-10-07 and 2019-11-09, we gradually live-migrated our customers' virtual servers to compute nodes that were physically attached to the new network devices in further announced maintenance windows.

During productive operation, we noticed short periods of link flaps leading to partial packet loss that correlated with high CPU utilization on the new network devices. We analyzed this behavior in close collaboration with the vendor's engineering team and applied several changes and fixes in multiple steps as proposed by the vendor, mitigating some aspects of the overall issue, but not resolving it completely.

On 2019-11-22 at 07:48 CET we received alerts from both our internal as well as external monitoring, indicating connectivity issues within our networking infrastructure at RMA1. Within minutes, one of our engineers started investigating. However, the issues grew to a complete outage at RMA1 by 08:07 CET. With the exception of creating new virtual servers, normal operation could be restored starting 08:24 CET after we had disabled BFD to take some load off the CPUs of our network devices.

Between 10:16 CET and 12:59 CET, further periods of partial or full network outages at RMA1 occurred. After most of our infrastructure was operational again by 12:20 CET, we focused on fully restoring Floating IPs and IPv6 connectivity. From 13:15 CET, all services and features at RMA1 were back to full availability.

We identified the root cause (see below) while handling these outages. As a first mitigation, we applied a blacklist ACL at 13:59 CET, which proved very effective and reduced CPU utilization on the switches by almost 50%. Right after implementing this first mitigation, we added specific checks and triggers to our internal monitoring system in order to immediately detect any new issues arising from the same root cause.
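As an illustration of what such a check might look like, here is a minimal, Nagios-style sketch. The SNMP OID, thresholds and exit-code convention are assumptions made for this example and are not the actual values or tooling used in our monitoring system.

```python
#!/usr/bin/env python3
"""Minimal sketch of a CPU-utilization check for a network device.

The OID, thresholds and community string below are illustrative placeholders,
not the actual values used in our monitoring setup.
"""
import subprocess
import sys

CPU_OID = "1.3.6.1.4.1.9999.1.1.0"   # placeholder OID for "CPU utilization in percent"
WARN, CRIT = 60, 80                   # placeholder thresholds in percent

def switch_cpu_percent(host: str, community: str = "public") -> int:
    """Query the device via SNMP (net-snmp's snmpget) and return the CPU utilization."""
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host, CPU_OID],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

if __name__ == "__main__":
    cpu = switch_cpu_percent(sys.argv[1])
    if cpu >= CRIT:
        print(f"CRITICAL - switch CPU at {cpu}%")
        sys.exit(2)
    if cpu >= WARN:
        print(f"WARNING - switch CPU at {cpu}%")
        sys.exit(1)
    print(f"OK - switch CPU at {cpu}%")
    sys.exit(0)
```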

As the traffic pattern changed over time, however, the blacklist had to be adapted promptly to prevent the switch CPUs from spiking again. So while this mitigation did help us get back into a stable state by massively reducing the CPU utilization on the network equipment, it did not provide a sustainable mid-term fix.

The vendor's engineering team and our own worked very closely together that same day to find a mid-term fix. We scheduled synchronization calls every 90 minutes and jointly evaluated various options. At around 22:45 CET, we started implementing the mid-term fix, based on a whitelist ACL approach, which we all believed to be the most robust option while not introducing further variables. Around midnight, we deployed the fix to the production network in region RMA and confirmed its effectiveness shortly after.

The vendor will continue working on code changes that will replace the mid-term fix in the future. We will test future firmware releases in our lab and deploy them on a regular basis during previously announced maintenance windows following our standard procedures.

Root Cause

As it turned out, the root cause was a high rate of inbound IPv6 traffic directed to many different, unused IPv6 addresses. While data traffic is usually forwarded directly by the ASIC, packets to unknown destinations in connected networks are punted to the device's CPU so that it can perform neighbor resolution (ARP/ND). In this case, the volume of IPv6 neighbor solicitation messages overwhelmed the CPU, competing with the control plane traffic necessary to keep existing BFD, BGP and LACP sessions up. Failure to send or process control plane traffic in time results in flapping of the respective links, causing a period of packet loss for the connected subset of switches and servers.
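In simplified terms, and without reference to any specific vendor implementation, the forwarding decision that causes this punting behaves roughly as sketched below; the addresses and data structures are purely illustrative.

```python
# Simplified model of why scans of unused addresses load the CPU.
# Prefixes, addresses and data structures are illustrative only.
import ipaddress

connected_prefix = ipaddress.ip_network("2001:db8::/64")    # directly connected subnet
neighbor_cache = {ipaddress.ip_address("2001:db8::10")}     # already resolved neighbors

def handle_packet(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    if addr not in connected_prefix:
        # Transit traffic: routed towards the next hop entirely in hardware.
        return "routed in hardware"
    if addr in neighbor_cache:
        # Known neighbor: the ASIC rewrites the frame and forwards it in hardware.
        return "forwarded in hardware"
    # Unknown destination inside the connected prefix: the packet is punted to the
    # CPU, which has to send an ICMPv6 neighbor solicitation and wait for a reply.
    # A scan produces one such punt per previously unseen address, and this work
    # competes with keeping BFD, BGP and LACP sessions alive.
    return "punted to CPU for neighbor resolution"

print(handle_packet("2001:db8::10"))      # forwarded in hardware
print(handle_packet("2001:db8::dead"))    # punted to CPU for neighbor resolution
```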

While the basic problem on 2019-11-22 was the same as in the weeks before, the CPU utilization peaks were slightly higher, causing a watchdog process to kill other processes running on our network devices on multiple occasions. This, in turn, led to complete crashes of the affected devices and also interfered with our problem analysis and debugging attempts.

Given the immense number of IPv6 addresses within an allocation or a subnet, scanning through an IPv6 address range can tie up significant resources for processing the necessary neighbor solicitations. This is a weakness inherent in the IPv6 protocol and a potential attack vector if exploited in a targeted manner. While our new network equipment rate-limits traffic that requires neighbor resolution, the respective vendor defaults turned out not to be strict enough.
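To give a sense of scale, here is a back-of-the-envelope calculation; the scan rate used is an arbitrary example, not a value measured in our network.

```python
# Back-of-the-envelope numbers for scanning a single IPv6 /64.
# The scan rate below is an arbitrary example, not a measured value.

addresses_in_64 = 2 ** 64        # 18,446,744,073,709,551,616 addresses in one /64
scan_rate_pps = 10_000           # example: 10,000 packets per second

seconds = addresses_in_64 / scan_rate_pps
years = seconds / (365 * 24 * 3600)
print(f"Exhausting a /64 at {scan_rate_pps} pps would take ~{years:.0f} years")
# => roughly 58 million years; a scan therefore never runs out of unused
#    destinations, and each previously unseen address can trigger a neighbor
#    solicitation on the control plane unless it is rate-limited or filtered.
```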

As a mid-term fix, we currently block inbound IPv6 traffic which is not directed to an existing IPv6 address using a whitelist ACL, effectively avoiding unnecessary neighbor solicitations and the CPU utilization they cause. Now that the flaw has been identified and mitigated, we continue to work closely with the vendor to replace the mid-term fix with a more dynamic and scalable long-term solution.
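Conceptually, the whitelist permits traffic to addresses that are actually in use and drops the remainder of the connected prefixes before any neighbor resolution is attempted. The sketch below shows the idea; the addresses are examples and the rule syntax is generic pseudo-ACL, since the real configuration is vendor-specific and not reproduced here.

```python
# Sketch: derive whitelist ACL entries from the set of IPv6 addresses that are
# actually assigned (e.g. taken from a provisioning database). The rule syntax
# is generic and illustrative; real ACL syntax is vendor-specific.

active_addresses = [
    "2001:db8::10",
    "2001:db8::11",
    "2001:db8:0:1::5",
]
connected_prefixes = ["2001:db8::/64", "2001:db8:0:1::/64"]

rules = [f"permit ipv6 any host {addr}" for addr in sorted(active_addresses)]
rules += [f"deny ipv6 any {prefix}" for prefix in connected_prefixes]
rules.append("permit ipv6 any any")   # transit traffic remains unaffected

print("\n".join(rules))
```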

While scanning address ranges can be part of a malicious attack (including DoS/DDoS), this is not always the case. Scanning address ranges has also been part of scientific projects and security research in the past, so such scans need to be evaluated on a case-by-case basis. We currently have no reason to believe that the IPv6 address scans hitting our network and causing the described issues were performed with malicious intent.

SLA Considerations

The total duration of the outages of services in region RMA exceeds the downtime allowed for by our 99.99% availability SLA stated in our terms and conditions.
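For reference, 99.99% availability leaves only a small downtime budget. The calculation below is purely illustrative; the applicable evaluation period is the one defined in our terms and conditions.

```python
# Illustrative only: downtime budget implied by 99.99% availability,
# depending on the evaluation period.

availability = 0.9999
periods = [("per day", 86_400), ("per 30-day month", 30 * 86_400), ("per year", 365 * 86_400)]
for label, seconds in periods:
    budget = (1 - availability) * seconds
    print(f"{label}: {budget / 60:.1f} minutes of allowed downtime")
# => per day: 0.1 min, per 30-day month: 4.3 min, per year: 52.6 min
```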

Currently, we can neither classify the address scans as a DoS attack (which would not constitute a breach of the SLA as defined) nor exclude the possibility of them being part of such an attack.

Assuming the outages were not the result of an attack, all affected customers would be eligible for "a pro-rata credit note for the duration of said failures for the services you use that have been affected". However, as a symbolic gesture, we decided to refund the full daily cost of the services charged on 2019-11-22 to all of our customers, regardless of the SLA evaluation.

You can find the refund of your total daily charges of 2019-11-22 in a combined credit transaction in your billing overview in our cloud control panel.

Next Steps

As outlined above, we consider the current mid-term fix fully effective, protecting our infrastructure against further issues stemming from IPv6 address scans.

We are currently working with the vendor on multiple possible long-term solutions to the fundamental problem which facilitated this incident in the first place. Once available, we will test any changes or new firmware versions in our lab first and then deploy them to production during previously announced maintenance windows following our standard procedures.

While the outages also affected our cloud control panel and therefore the creation and modification of virtual servers, servers already running at our second location LPG1 were not affected by the incident on 2019-11-22. We encourage our users to look into geo-redundancy where it fits their individual use case. It goes without saying that we have applied the same mid-term fix to our network equipment at LPG1 as well, effectively preventing this issue from recurring at either cloud location in the future.

Please accept our sincere apologies for all the inconvenience this may have caused you and your customers.