Today at 02:42 CET, the operating system of one of our storage nodes froze completely. As a consequence, all NVMe SSDs (Ceph OSDs) of that node stopped responding to requests and were therefore marked as
down by the storage cluster shortly after. The cluster then started rebalancing data as intended in order to restore the desired replication level. Once the affected storage node was power cycled by one of our system engineers, all Ceph OSDs came back up.
The replication level recovery process completed successfully by 03:43 CET. Thanks to our high level of redundancy, customer impact should have been minimal, if any.
Root Cause Analysis
The root cause of this freeze could not be fully determined yet: We neither see any syslog nor IPMI log entries that point towards a hardware issue and therefore have to assume that an edge case was triggered in software. We will follow up with an update if we can pinpoint the root cause more precisely.
Please accept our apologies for the inconvenience this incident may have caused for you and your customers.