Core Network: Emergency Software Upgrade from 2023-09-18 20:00 CEST to 2023-09-18 23:11 CEST
Updates
Post Mortem
During the scheduled emergency maintenance of our core routers on Monday, 2023-09-18, an unfortunate combination of issues led to an outage of IPv6 traffic in our LPG1 zone between 20:39 and 20:56 CEST, followed by a complete loss of Internet connectivity for about 160 seconds between 21:00 and 21:03 CEST at the same location. While the scheduled maintenance work also included the routers in our RMA1 zone, we stopped and reverted our changes before services in RMA1 were impacted.
Timeline
At 20:00 CEST, we started the maintenance work as announced and upgraded the first core router in our LPG1 zone. Once this router had been upgraded and was ready to take over customer traffic, we shifted traffic away from the second router to the first one. Up to this point, everything worked as intended, and the newly upgraded router handled the traffic as expected.
At 20:39 CEST, we rebooted the second, now-idle router in our LPG1 zone to upgrade it. Immediately after the reboot, we started to see IPv6 issues. This unexpected effect appeared to be triggered by the reboot of the second router, and we therefore suspected a bug in the overall setup.
Since we had not seen this behavior when testing the upgrade in our lab, we decided to cancel the upgrade and revert everything to the previous, working state. We therefore did not install the new software version on the second router, but booted it on the existing, old version again. Once back up, the second router did not fully load its configuration. This is a reproducible malfunction of the old software version; it required additional manual intervention, which further delayed the recovery.
By 20:56 CEST, IPv6 connectivity in our LPG1 zone was restored.
At 21:00 CEST, we decided to reload the configuration on the second, now-active router once again to be sure that all relevant parts were active. By mistake, the service was restarted instead of reloaded, which caused a full outage of Internet traffic for about 160 seconds in our LPG1 zone.
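The difference between the two operations can be illustrated with a minimal, hypothetical sketch. It does not describe the router software or any tooling involved in this incident. The names `Action`, `DISRUPTIVE`, and `check_action` are assumptions made for illustration, as is the general convention that a reload re-reads the configuration while a restart stops and starts the service.

```python
from enum import Enum


class Action(Enum):
    # Assumed semantics, for illustration only.
    RELOAD = "reload"    # re-read the configuration; the service keeps running
    RESTART = "restart"  # stop and start the service; traffic is interrupted


# Hypothetical classification of actions that interrupt traffic.
DISRUPTIVE = {Action.RESTART}


def check_action(action: Action, confirm_disruptive: bool = False) -> str:
    """Return the action name, refusing disruptive actions unless confirmed."""
    if action in DISRUPTIVE and not confirm_disruptive:
        raise PermissionError(
            f"'{action.value}' interrupts traffic; pass confirm_disruptive=True"
        )
    return action.value


print(check_action(Action.RELOAD))  # reload
```

In this sketch, `check_action(Action.RESTART)` raises `PermissionError` unless `confirm_disruptive=True` is passed explicitly, so the disruptive operation cannot be selected by a simple slip.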
By 21:03 CEST, traffic in our LPG1 zone was stable again and remained so thereafter.
Once we were sure that no issues remained open, we started reverting the first, already upgraded router to its old software version. The rollback to the previous overall state was completed shortly before midnight (CEST), with no further impact on our customers.
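For reference, the durations of the two impact windows follow directly from the timestamps in the timeline. The sketch below is illustrative only; the helper `at` and the fixed UTC+2 offset used for CEST are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# CEST is UTC+2; a fixed offset avoids depending on a time zone database.
CEST = timezone(timedelta(hours=2), "CEST")


def at(hhmm: str) -> datetime:
    """Return a timestamp on 2023-09-18 (CEST) for an HH:MM string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2023, 9, 18, hour, minute, tzinfo=CEST)


# Times taken from the timeline.
ipv6_outage = at("20:56") - at("20:39")
full_outage_window = at("21:03") - at("21:00")
full_outage = timedelta(seconds=160)  # "about 160 seconds"

print(ipv6_outage)                        # 0:17:00
print(full_outage <= full_outage_window)  # True
```

The IPv6 outage therefore lasted about 17 minutes, and the full outage of about 160 seconds fits within the three-minute window between 21:00 and 21:03 CEST.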
We sincerely apologize for the inconvenience these issues have caused you and your customers.
We are still investigating the root cause that led to the unexpected behavior when rebooting the second router. Once we are confident that this issue is understood and mitigated, we will announce a new maintenance window for installing the required upgrade to our core routers.
We had to revert our change since we were facing unexpected (IPv6) issues with the new software release. As a next step, we will have to go back to the lab and try to reproduce the issues we have seen tonight. We will follow up with an incident report later this week.
Please accept our apologies for the inconvenience this may have caused you and your customers.
Emergency Maintenance
From Monday, 2023-09-18, 20:00 CEST to Tuesday, 2023-09-19, 02:00 CEST, we will perform an emergency software upgrade of our core network to fix a critical vulnerability in both our RMA (Rümlang) and LPG (Lupfig) regions (in sequential order). During this maintenance work you may experience short periods of packet loss (up to 1-2 minutes) or higher RTTs for connections from and towards the Internet. Connections between virtual servers at cloudscale.ch will not be affected by this maintenance work.
Date / Time
From Monday, 2023-09-18, 20:00 CEST to Tuesday, 2023-09-19, 02:00 CEST
Expected Impact
Short periods of packet loss (up to 1-2 minutes) or higher RTTs for Internet-facing connections. Thanks to our redundant setup we do not expect any further impact on already running virtual servers.
We apologize for any inconvenience this may cause and thank you for your understanding.