Saturday, 23rd November 2019

Network Infrastructure (RMA) Network Issues: Fix in Place

The network in region RMA has been stable for more than 24 hours now and we are convinced that it will remain that way. Here is why:

Reproduction
After collecting more data during yesterday's outage, the vendor was able to re-create the problem in their lab. This allowed both the vendor's and our own engineering teams to work on a short-term mitigation as well as on mid-term fixes. It will also provide a way to validate long-term solutions that will be introduced in subsequent firmware releases.

Short-Term Mitigation
After identifying the potential root cause yesterday afternoon, we implemented a workaround to prevent unusual traffic patterns from triggering the suspected weakness in the firmware. However, as the traffic pattern changed over time, the mitigation had to be adapted promptly each time to prevent the switch CPUs from spiking again. So, while this helped us get back into a stable state by massively reducing the CPU utilization on the network equipment, it did not provide a sustainable mid-term fix.

Mid-Term Fix
Both engineering teams worked very closely together to find a mid-term fix the same day. We scheduled synchronization calls every 90 minutes and jointly evaluated various options for mid-term fixes. At around 22:45 CET, in a final conference call, we started implementing the mid-term fix that we all believed to be the most robust while not introducing further variables. Around midnight, we deployed the fix to the production network in region RMA and confirmed its effectiveness shortly after.

Monitoring
Right after implementing the first mitigation, we added specific checks and triggers to our internal monitoring system so that we would be alerted should the situation suddenly change. We also added very aggressive triggers for the CPU utilization and expected to receive some "false positive" alerts during the night. However, we did not receive any alerts, as the CPU utilization has been very steady for more than 24 hours now.
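
To illustrate what an "aggressive" trigger means in this context, here is a minimal Python sketch. It is not the configuration of our monitoring system: the class name, the threshold percentages and the sample counts are all hypothetical values chosen for the example. The idea is simply that a trigger with a low threshold and no hold-off fires on a single elevated sample, whereas a conventional trigger waits for several consecutive high samples.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CpuTrigger:
    """Hypothetical trigger definition (all values are illustrative)."""
    name: str
    threshold_pct: float       # fire when utilization is at or above this value
    consecutive_samples: int   # ...for this many samples in a row


def fires(trigger: CpuTrigger, samples: Iterable[float]) -> bool:
    """Return True if the trigger would fire for the given CPU samples."""
    run = 0
    for value in samples:
        # Count consecutive samples at or above the threshold; reset otherwise.
        run = run + 1 if value >= trigger.threshold_pct else 0
        if run >= trigger.consecutive_samples:
            return True
    return False


# Assumed example values, not real thresholds.
AGGRESSIVE = CpuTrigger("cpu-aggressive", threshold_pct=40.0, consecutive_samples=1)
STANDARD = CpuTrigger("cpu-standard", threshold_pct=80.0, consecutive_samples=3)

steady = [12.0, 14.0, 13.0, 15.0]       # steady utilization: no alert expected
brief_spike = [12.0, 45.0, 13.0, 14.0]  # short bump: only the aggressive trigger fires
sustained = [85.0, 90.0, 88.0, 91.0]    # sustained load: both triggers fire
```

With such a definition, a brief and harmless bump is enough to raise an alert from the aggressive trigger, which is why some "false positive" alerts were expected.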

Long-Term Solution
The vendor will continue working on code changes that will replace the mid-term fix in the future. We will test future firmware releases in our lab and deploy them on a regular basis during previously announced maintenance windows following our standard procedures.

Incident Report
We will follow up with a detailed incident report by the middle of next week.

Please accept our sincere apologies for all the inconvenience this may have caused you and your customers.