Tuesday, 2nd July 2019

Incident Report Regarding Disruption of Bulk and Object Storage Service

Management Summary

On 2019-07-01 between 10:15 and 10:23 CEST, requests to our object storage were answered with HTTP error code 503. During the same period, I/O operations on bulk volumes were partially blocked as well. At 10:23 CEST, access to our bulk and object storage was fully restored.

Detailed Report

On 2019-07-01 at 09:07 CEST, we began working on the scheduled maintenance of our bulk and object storage nodes as previously announced in https://cloudscale-status.net/incident/48.

At 10:16 CEST, our monitoring system reported an issue with our object storage.

An immediate analysis showed that the Ceph cluster had blocked access to some PGs (placement groups) belonging to the bulk and object storage pools because Ceph's hard limit on the PG per OSD (object storage device) ratio had been exceeded. This triggered a bug as outlined in https://tracker.ceph.com/issues/23117.
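The failure mode lends itself to a back-of-the-envelope sketch: when a node's OSDs go down during maintenance, the surviving OSDs must carry the displaced PG replicas, and the per-OSD count can shoot past the hard limit. The limit values and PG counts below are illustrative assumptions, not our cluster's actual settings:

```python
# Illustrative sketch (not Ceph source code) of the PG-per-OSD hard limit.
# The limit values below are assumptions; actual defaults vary by Ceph release.

MON_MAX_PG_PER_OSD = 200  # assumed mon_max_pg_per_osd
HARD_RATIO = 2.0          # assumed osd_max_pg_per_osd_hard_ratio

def pgs_per_osd(total_pg_replicas: int, osds_up: int) -> float:
    """Average number of PG replicas each running OSD must carry."""
    return total_pg_replicas / osds_up

def exceeds_hard_limit(total_pg_replicas: int, osds_up: int) -> bool:
    """Past this limit, Ceph refuses to activate further PGs on an OSD."""
    return pgs_per_osd(total_pg_replicas, osds_up) > MON_MAX_PG_PER_OSD * HARD_RATIO

# Example: 12,000 PG replicas over 32 OSDs means 375 per OSD, under the
# hard limit of 400. If a node with 8 OSDs goes down for maintenance,
# the remaining 24 OSDs would need 500 each, exceeding the limit.
print(exceeds_hard_limit(12_000, 32))  # within the limit -> False
print(exceeds_hard_limit(12_000, 24))  # limit exceeded  -> True
```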

At 10:23 CEST, we decided to stop all OSDs on the recently upgraded storage node. This allowed Ceph to recover and permit full access again, resolving the issue for our users.
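For operators who hit the same lockup, stopping all OSDs on one node is a two-step operation; this is a sketch assuming a standard systemd-managed Ceph deployment:

```shell
# On the affected storage node: stop every OSD daemon at once.
sudo systemctl stop ceph-osd.target

# From a node with admin access: confirm the OSDs are marked down
# and watch the cluster recover.
ceph osd tree
ceph -s
```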

After a thorough analysis, we decided to increase the PG per OSD ratio and then restarted all OSDs on the upgraded node. After the increase, all OSDs started backfilling as expected.
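Raising the limit can be done at runtime; the following is a sketch, and the value 400 is purely illustrative (choose a value based on your own pool and OSD counts). The first form assumes a release with the centralized configuration database (Mimic or later); older releases take the setting via injectargs:

```shell
# Raise the per-OSD placement-group limit cluster-wide (Mimic or later).
ceph config set mon mon_max_pg_per_osd 400

# Equivalent on older releases such as Luminous:
ceph tell mon.* injectargs '--mon_max_pg_per_osd=400'

# Inspect the resulting PG distribution per OSD (PGS column).
ceph osd df
```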

We will keep this increased PG per OSD ratio in place for the remainder of the scheduled maintenance, while we upgrade the rest of the storage nodes.

Please accept our apologies for the inconvenience this service disruption may have caused you and your customers.