Today at 13:48 CEST, one of our switches rebooted abruptly:
2018/05/11-11:48:15, [HASM-1001], 27750, SW/1 | Standby | FFDC, CRITICAL, VDX6940-3, An unexpected failover event occured.
2018/05/11-11:51:10, [HASM-1004], 126557, INFO, VDX6940-3, Processor reloaded - Software Fault:ASSERT.
As expected, the second switch in the same rack took over within seconds, so at that point the impact on customers was minimal.
At 14:09 CEST, once the reboot was almost complete, the rebooted switch sent out an IPv6 router advertisement (RA) with the wrong prefix-information flags set. As a result, all cloud servers with a public network interface used SLAAC to configure an additional IPv6 address (alongside the existing, correct DHCP-assigned address). These servers then tried to communicate with the outside world using the wrong, auto-configured IPv6 address. Due to the port-security filters we have in place, those packets were dropped on the compute hosts.
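For readers less familiar with SLAAC: whether hosts self-configure an address from an advertised prefix is controlled by a single bit, the A (autonomous) flag, in the RA's prefix-information option (RFC 4861, section 4.6.2). The sketch below builds such an option byte-by-byte to show where that flag lives; the function name and values are illustrative, not taken from our routers' configuration.

```python
import struct
import ipaddress

L_FLAG = 0x80  # on-link flag
A_FLAG = 0x40  # autonomous flag: hosts may self-assign an address (SLAAC)

def prefix_info_option(prefix, prefix_len, flags, valid, preferred):
    """Build an RFC 4861 Prefix Information option (type 3, 32 bytes)."""
    return struct.pack(
        "!BBBBIII16s",
        3,           # option type: Prefix Information
        4,           # option length, in units of 8 octets
        prefix_len,  # prefix length in bits
        flags,       # L / A flags live in this single byte
        valid,       # valid lifetime (seconds)
        preferred,   # preferred lifetime (seconds)
        0,           # reserved
        ipaddress.IPv6Address(prefix).packed,
    )

# An RA carrying this option with the A flag set tells every host on the
# segment to derive its own address from the prefix via SLAAC:
opt = prefix_info_option("2001:db8::", 64, L_FLAG | A_FLAG, 86400, 14400)
print(len(opt), bool(opt[3] & A_FLAG))  # → 32 True
```

With the A flag cleared, the same prefix would only be marked on-link and no host would auto-configure an address from it, which is why a single wrong flag byte was enough to affect every server on the segment.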
At 14:13 CEST, we received the first alert that IPv6 connectivity was potentially broken and started investigating.
At 14:28 CEST, we found the root cause and started to send out corrective IPv6 RAs on each router interface. These RAs advertised the prefix with a valid lifetime of only 10 seconds, overriding the earlier, faulty announcement. Shortly after, the auto-configured IPv6 addresses were flagged as "no longer valid", and a few minutes later IPv6 connectivity was confirmed to be back to normal.
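The mechanism behind the fix: a host updates each SLAAC address's lifetimes from every RA it receives, and an address whose valid lifetime has expired may no longer be used at all (RFC 4862). The following is a simplified model of that state machine, not our routers' or any host's actual implementation; real hosts additionally restrict how far an unauthenticated RA can shorten a valid lifetime.

```python
def apply_ra_lifetimes(addr_state, valid, preferred, now):
    """Update a SLAAC address's expiry times from a newly received RA
    (simplified RFC 4862 model)."""
    addr_state["valid_until"] = now + valid
    addr_state["preferred_until"] = now + preferred
    return addr_state

def address_status(addr_state, now):
    if now >= addr_state["valid_until"]:
        return "invalid"      # address may no longer be used at all
    if now >= addr_state["preferred_until"]:
        return "deprecated"   # existing sessions only, no new ones
    return "preferred"

# Corrective RA: valid lifetime 10 s, preferred lifetime 0 s.
# The bogus address is deprecated at once and invalid ten seconds later.
state = apply_ra_lifetimes({}, valid=10, preferred=0, now=1000.0)
print(address_status(state, 1000.0))  # → deprecated
print(address_status(state, 1011.0))  # → invalid
```

This is why connectivity recovered within minutes rather than waiting hours for the originally advertised lifetimes to run out.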
At 15:03 CEST, we started seeing hardware errors resulting in unstable LACP operations on the second switch:
Chip ioctl take more than 2 sec, before_lock 0xba46da0a, after_lock 0xba46db4d, after_proc 0xba46db52
At 15:24 CEST, we decided to reboot the second switch in an orderly fashion to mitigate the LACP situation. Because LACP was no longer operating properly, the reboot may have caused several seconds of packet loss.
At 15:35 CEST, the second switch came back up and LACP was working as expected again. Since then, the situation has been stable.
Please accept our apologies for the inconvenience this incident has caused you and your customers. We continue to do our best to prevent such situations from happening in the future.
Please do not hesitate to contact us if you have any follow-up questions.
On a related matter:
In summer 2017, we started to evaluate a replacement for our current network equipment. We purchased new hardware in September 2017 and have been testing the next-generation platform intensively ever since. However, we still need to resolve a few remaining open issues with the vendor so that the new platform does not land us in the same situation we are in now. Rest assured that we are doing everything we can to provide you with a stable network in the future.