Temporary Disruption of Storage Service

Major incident · Region RMA (Rümlang, ZH, Switzerland) · Linux Cloud Servers (RMA1)
2019-02-12 10:50 CET · 16 minutes

Updates

Post-mortem

Management Summary

On 2019-02-12 at 10:50 CET, we applied a small firewall change on our production Ceph storage servers, which had previously been tested successfully in our lab environment. However, part of the traffic between the storage servers was subsequently dropped, resulting in virtual machines’ disk IO requests being temporarily blocked.

After we disabled firewalling on these purely internal links as an immediate measure, all blocked IO requests resumed, and the disruption was resolved by 11:06 CET.

Our investigation revealed differences in the connection tracking mechanism between the two Linux kernels in use. After validation in our lab environment, we have enabled a sysctl flag to compensate for this behavior in newer kernel versions. At the same time, we harmonized the kernel versions of all storage servers.

Root Cause Analysis

While our latest-generation NVMe-only storage servers are running Linux kernel 4.15 for hardware compatibility reasons, the rest of our SSD-only storage cluster was still running kernel 4.4.

In the course of our investigation we became aware of a series of changes to how connection tracking is implemented in Linux kernel 4.10 and later. This is consistent with our own finding that, after the firewall rules were reloaded, some pre-existing network connections were no longer recognized as such if one or both of the communication partners were running kernel 4.15. In those cases, packets of the affected connections were dropped and the corresponding storage IO operations were blocked.
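To illustrate the failure mode, consider a minimal stateful ruleset of the kind commonly used on such links (a simplified sketch for illustration only, not our actual rule set; port 6789, the default Ceph monitor port, merely serves as an example):

    # Accept packets belonging to connections the kernel is tracking.
    iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    # Accept new connections to the example service port.
    iptables -A INPUT -p tcp --dport 6789 --syn -j ACCEPT
    # Drop everything else.
    iptables -A INPUT -j DROP

A mid-stream packet of a connection that conntrack no longer recognizes matches neither the ESTABLISHED rule nor the --syn rule and therefore hits the final DROP.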

As we learned, this issue can be prevented by setting the sysctl flag nf_conntrack_tcp_be_liberal to 1 in order for pre-existing connections to be recognized as such again. We set this flag as part of our actions regarding this incident (see section “Steps Taken” below).
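For reference, this flag lives under the net.netfilter hierarchy. A minimal sketch of setting it at runtime and persisting it across reboots, assuming a standard sysctl.d setup (the file name is hypothetical):

    # Apply immediately on the running system:
    sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
    # Persist across reboots:
    echo 'net.netfilter.nf_conntrack_tcp_be_liberal = 1' \
        > /etc/sysctl.d/99-conntrack-liberal.conf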

While we had both kernel versions running in parallel in our lab as well, it turned out that version 4.15 was underrepresented there. Unlike our production setup, the storage cluster in our lab quickly recovered from a couple of dropped packets, so disk access of our lab clients was not noticeably impaired.

Steps Taken

First, we brought our lab to the same state as our production system (that is, with the firewall disabled). We then updated all storage servers in our lab to kernel 4.15, removing one of the variables for our ongoing and future tests. As a next step, we persistently set nf_conntrack_tcp_be_liberal to 1, and finally rebooted all systems in turn while closely monitoring the storage cluster as well as the client systems. Using this procedure, data redundancy was preserved at all times, and the intended firewall rules became active as part of the bootup process.
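The sketch below outlines one way such a rolling reboot can be performed while preserving redundancy (host names are hypothetical, and the use of the noout flag is one possible safeguard; this is a simplified illustration rather than the exact procedure used):

    # Keep Ceph from rebalancing data while a node is briefly down.
    ceph osd set noout
    for host in storage01 storage02 storage03; do   # hypothetical host names
        ssh "$host" reboot
        sleep 60    # give the node time to actually go down
        # Wait until the cluster reports full health again before
        # proceeding, so redundancy is restored between steps.
        until ceph health | grep -q HEALTH_OK; do sleep 30; done
    done
    ceph osd unset noout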

After all the steps outlined above had been successfully tested in our lab, including changing and reloading the firewall rules several times, we proceeded likewise with the production servers, thereby completing the original firewall change scheduled for 2019-02-12.

Please accept our apologies for the inconvenience this incident may have caused you and your customers. We continue to do our best to prevent such situations from happening.

February 15, 2019 · 15:45 CET
Update

Management Summary

On 2019-02-12 at 10:50 CET, we applied a small firewall change on our production Ceph storage nodes, which had previously been tested successfully in our lab environment. However, part of the traffic between the storage nodes was subsequently dropped, resulting in virtual machines’ disk IO requests being temporarily blocked.

After a quick investigation, we decided to disable firewalling on these purely internal links for the time being as an immediate measure. At this point, all blocked IO requests resumed, resolving the disruption by 11:06 CET.

Current State of Investigation

The firewall change was meant to increase a limit in a pre-existing firewall rule. This rule is a precautionary measure and did not block any real traffic, with either the old or the new, higher limit. Nevertheless, reloading the firewall rule set appeared to block part of the internal traffic between the storage nodes, which led to connections being re-established only after the previous connections had timed out.
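The exact rule is not reproduced here; purely as a hypothetical illustration, a limit of this kind could be a per-source connection cap, where only the numeric threshold changes:

    # Hypothetical rule: cap concurrent TCP connections per source address.
    iptables -A INPUT -p tcp --syn -m connlimit --connlimit-above 100 -j REJECT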

Current research suggests that some of the pre-existing connections were not recognized as such by the connection tracking mechanism when the new firewall rules were loaded. We are further investigating which circumstances led to this behavior in our particular case, and how to rule out any risk of this recurring during future firewall changes.
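Whether a given connection is currently known to the tracking mechanism can be inspected with the conntrack utility from conntrack-tools (assuming it is installed; the port shown is only an example):

    # List tracked TCP connections to an example service port:
    conntrack -L -p tcp --dport 6789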

Once we have concluded our investigation, we will follow up with a full root cause analysis.

Please accept our apologies for the inconvenience this incident may have caused you and your customers. We continue to do our best to prevent such situations from happening.

February 13, 2019 · 18:25 CET
Issue

After prior testing in our lab, we rolled out a change to our production storage cluster on 2019-02-12 at 10:50 CET. This change (increasing a limit in our firewall rule set) led to unexpected packet drops between various storage nodes. As an immediate measure, we have temporarily stopped the firewall on all storage nodes and are investigating the root cause. We will follow up with a detailed analysis.

Please accept our apologies for the inconvenience this incident may have caused.

February 12, 2019 · 10:50 CET
