On 2019-02-12 at 10:50 CET, we applied a small firewall change to our production Ceph storage nodes. The change had previously been tested successfully in our lab environment. However, part of the traffic between the storage nodes was subsequently dropped, which temporarily blocked the disk IO requests of virtual machines.
After a quick investigation, we disabled firewalling on these purely internal links as an immediate, temporary measure. All blocked IO requests then resumed, and the disruption was resolved by 11:06 CET.
Current State of Investigation
The firewall change increased a limit in a pre-existing firewall rule. This rule is a precautionary measure and did not block any real traffic, either with the old limit or with the new, higher one. Nevertheless, reloading the firewall rule set appears to have blocked part of the internal traffic between the storage nodes. The affected connections had to time out before they could be re-established, which is what delayed the IO requests.
Our current findings suggest that some of the pre-existing connections were not recognized as such by the connection tracking mechanism when the new firewall rules were loaded. We are still investigating which circumstances led to this behavior in our particular case, and how to rule out any risk of it recurring during future firewall changes.
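To illustrate the general mechanism described above, here is a minimal toy model in Python. It is only a sketch under stated assumptions and does not represent our actual firewall configuration or a confirmed root cause: the class `ToyStatefulFirewall`, the host names `osd-a` and `osd-b`, and the method `reload_losing_state` are all hypothetical. The model assumes a stateful filter that accepts packets of tracked connections, starts tracking on a connection-opening (SYN) packet, and drops everything else.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    dport: int
    syn: bool = False  # True only for the first packet of a new connection


@dataclass
class ToyStatefulFirewall:
    """Hypothetical, heavily simplified stateful filter (not a real firewall API)."""
    tracked: set = field(default_factory=set)

    def handle(self, pkt: Packet) -> str:
        key = (pkt.src, pkt.dst, pkt.dport)
        if key in self.tracked:
            return "ACCEPT"        # packet belongs to a tracked connection
        if pkt.syn:
            self.tracked.add(key)  # new connection: start tracking it
            return "ACCEPT"
        return "DROP"              # mid-stream packet of an untracked connection

    def reload_losing_state(self) -> None:
        """Assumed failure mode: existing connections are not recognized after a reload."""
        self.tracked.clear()


def demo() -> list:
    fw = ToyStatefulFirewall()
    results = [
        fw.handle(Packet("osd-a", "osd-b", 6800, syn=True)),  # connection opens
        fw.handle(Packet("osd-a", "osd-b", 6800)),            # normal traffic
    ]
    fw.reload_losing_state()
    results.append(fw.handle(Packet("osd-a", "osd-b", 6800)))            # old connection
    results.append(fw.handle(Packet("osd-a", "osd-b", 6800, syn=True)))  # reconnect
    results.append(fw.handle(Packet("osd-a", "osd-b", 6800)))            # traffic resumes
    return results


if __name__ == "__main__":
    print(demo())
```

In this model, traffic on the old connection is dropped after the reload and only flows again once the peer opens a new connection, which in practice happens after the old connection has timed out.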
Once we have concluded our investigation, we will follow up with a full root cause analysis.
Please accept our apologies for any inconvenience this incident may have caused you and your customers. We continue to do our best to prevent such situations from happening.