On 2019-02-12 at 10:50 CET, we applied a small firewall change on our production Ceph storage servers, which had previously been tested successfully in our lab environment. However, part of the traffic between the storage servers was subsequently dropped, resulting in virtual machines' disk IO requests being temporarily blocked.
After disabling firewalling on these purely internal links as an immediate measure, all blocked IO requests were resumed, resolving the disruption by 11:06 CET.
Our investigation revealed differences in the connection tracking mechanism between the two Linux kernels in use. After validation in our lab environment, we have enabled a sysctl flag to compensate for this behavior in newer kernel versions. At the same time, we harmonized the kernel versions of all storage servers.
Root Cause Analysis
While our latest-generation NVMe-only storage servers are running Linux kernel 4.15 for hardware compatibility reasons, the rest of our SSD-only storage cluster was still running kernel 4.4.
In the course of our investigation we became aware of a series of changes to the Linux kernel affecting how connection tracking is implemented in kernel 4.10 and later. This corresponds with our own finding that, after reloading the firewall rules, some pre-existing network connections were not recognized as such if one or both of the communication partners were running kernel 4.15. In those cases, packets of the affected connections were dropped and the corresponding storage IO operations blocked.
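To illustrate the condition described above, the following is a minimal sketch (not the tooling we used) that lists which pairs of storage servers would fall into the affected category, assuming the 4.10 threshold mentioned in the text. The host names and kernel release strings in the inventory are made up for the example.

```python
import re
from itertools import combinations

# Threshold taken from the text above: the connection tracking changes
# are present in kernel 4.10 and later.
CONNTRACK_CHANGE = (4, 10)


def kernel_tuple(release):
    """Parse a release string such as '4.15.0-45-generic' into (4, 15)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def at_risk_pairs(kernels):
    """Return host pairs where at least one side runs kernel >= 4.10."""
    pairs = []
    for a, b in combinations(sorted(kernels), 2):
        newest = max(kernel_tuple(kernels[a]), kernel_tuple(kernels[b]))
        if newest >= CONNTRACK_CHANGE:
            pairs.append((a, b))
    return pairs


# Hypothetical inventory; names and versions are illustrative only.
inventory = {
    "nvme-01": "4.15.0-45-generic",
    "ssd-01": "4.4.0-142-generic",
    "ssd-02": "4.4.0-142-generic",
}
print(at_risk_pairs(inventory))
```

In this example only the connections involving the 4.15 host are flagged, while the pair of 4.4 hosts is not.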
As we learned, this issue can be prevented by setting the sysctl flag nf_conntrack_tcp_be_liberal to 1 in order for pre-existing connections to be recognized as such again. We set this flag as part of our actions regarding this incident (see section "Steps Taken" below).
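As a rough sketch of what checking and enabling this flag at runtime can look like, the example below reads and writes the corresponding procfs entry (sysctl name net.netfilter.nf_conntrack_tcp_be_liberal). The helper names are our own for illustration. Writing the file requires root, and a value set this way does not survive a reboot.

```python
from pathlib import Path

# procfs entry behind the sysctl net.netfilter.nf_conntrack_tcp_be_liberal.
# The file only exists while the nf_conntrack kernel module is loaded.
FLAG_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal")


def read_flag(path=FLAG_PATH):
    """Return True if the flag is currently set to 1."""
    return path.read_text().strip() == "1"


def enable_flag(path=FLAG_PATH):
    """Set the flag to 1 at runtime; return True if the value was changed."""
    if read_flag(path):
        return False
    path.write_text("1\n")
    return True


if __name__ == "__main__":
    if FLAG_PATH.exists():
        print("liberal mode enabled:", read_flag())
    else:
        print("nf_conntrack not loaded; flag not available")
```

The path is a parameter so that the helpers can be exercised against an ordinary file without touching a live system.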
While we had both kernel versions running in parallel in our lab as well, it turned out that version 4.15 was underrepresented there. Unlike our production setup, the storage cluster in our lab recovered quickly from a couple of dropped packets, and therefore disk access of our lab clients was not noticeably impaired.
Steps Taken
First, we brought our lab to the same state as our production system (that is, with the firewall disabled). We then updated all storage servers in our lab to kernel 4.15, removing one of the variables for our ongoing and future tests. In a next step, we persistently set nf_conntrack_tcp_be_liberal to 1, and finally rebooted all systems in turn while closely monitoring the storage cluster as well as client systems. With this procedure, data redundancy was preserved at all times, and the intended firewall rules became active as part of the bootup process.
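The "one system at a time" reboot described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the procedure we actually ran: the reboot and cluster_healthy callables are hypothetical placeholders for whatever mechanism triggers a reboot and reports cluster health (for example, a wrapper around the output of ceph health), and the polling interval and timeout are arbitrary.

```python
import time


def wait_until_healthy(cluster_healthy, poll_seconds, timeout_seconds, sleep):
    """Poll the health check until it passes or the timeout is exceeded."""
    waited = 0
    while not cluster_healthy():
        if waited >= timeout_seconds:
            raise TimeoutError("cluster did not become healthy in time")
        sleep(poll_seconds)
        waited += poll_seconds


def rolling_reboot(hosts, reboot, cluster_healthy,
                   poll_seconds=10, timeout_seconds=1800, sleep=time.sleep):
    """Reboot hosts strictly one after another.

    The cluster must report healthy before a host is taken down and again
    before moving on to the next one, so that no two storage servers are
    ever offline at the same time.
    """
    done = []
    for host in hosts:
        wait_until_healthy(cluster_healthy, poll_seconds, timeout_seconds, sleep)
        reboot(host)  # hypothetical callable; expected to return once issued
        wait_until_healthy(cluster_healthy, poll_seconds, timeout_seconds, sleep)
        done.append(host)
    return done
```

If the cluster does not return to a healthy state within the timeout, the loop stops with an error instead of proceeding to the next server, which is the property that keeps redundancy intact.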
After all the steps outlined above were successfully tested in our lab - including changing and reloading the firewall rules several times - we proceeded likewise with the production servers, thereby completing the original firewall change scheduled for 2019-02-12.
Please accept our apologies for the inconvenience this incident may have caused you and your customers. We continue to do our best to prevent such situations from happening.