All systems are operational

Past Incidents

Saturday, 6th July 2019

No incidents reported

Friday, 5th July 2019

No incidents reported

Thursday, 4th July 2019

No incidents reported

Wednesday, 3rd July 2019

No incidents reported

Tuesday, 2nd July 2019

Incident Report Regarding Disruption of Bulk and Object Storage Service

Management Summary

On 2019-07-01 between 10:15 and 10:23 CEST, requests to our object storage were answered with HTTP status code 503. During the same period, I/O operations on bulk volumes were partially blocked. After 10:23 CEST, access to our bulk and object storage was fully restored.

Detailed Report

On 2019-07-01 at 09:07 CEST, we began working on the scheduled maintenance of our bulk and object storage nodes as previously announced in https://cloudscale-status.net/incident/48.

At 10:16 CEST, our monitoring system reported an issue with our object storage.

An immediate analysis showed that the Ceph cluster had blocked access to some PGs (placement groups) belonging to the bulk and object storage pools because Ceph's hard limit on the PG per OSD (object storage device) ratio had been exceeded. This triggered the bug described in https://tracker.ceph.com/issues/23117.
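The hard limit mentioned above can be illustrated with a minimal sketch. It assumes the limit is the product of a per-OSD PG target and a hard ratio, which in Ceph releases of that period appears to correspond to the options `mon_max_pg_per_osd` and `osd_max_pg_per_osd_hard_ratio`; this mapping is an assumption on our part. The class `PgLimits`, the function `osds_over_hard_limit`, the OSD names, and all numbers are hypothetical and do not reflect the settings or PG counts of the affected cluster.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PgLimits:
    # Illustrative values only, not the configuration of the affected cluster.
    max_pg_per_osd: int = 200   # assumed per-OSD PG target
    hard_ratio: float = 2.0     # assumed multiplier that yields the hard limit

    @property
    def hard_limit(self) -> float:
        # Above this PG count, an OSD is assumed to refuse further PGs.
        return self.max_pg_per_osd * self.hard_ratio


def osds_over_hard_limit(pgs_by_osd: Dict[str, int], limits: PgLimits) -> List[str]:
    """Return the names of OSDs whose PG count exceeds the hard limit, sorted."""
    return sorted(osd for osd, pgs in pgs_by_osd.items() if pgs > limits.hard_limit)


# Hypothetical PG counts while data is being moved between hosts:
# osd.7 temporarily holds more PGs than usual.
pgs_by_osd = {"osd.3": 180, "osd.7": 420, "osd.9": 395}

print(osds_over_hard_limit(pgs_by_osd, PgLimits()))                # ['osd.7']
print(osds_over_hard_limit(pgs_by_osd, PgLimits(hard_ratio=3.0)))  # []
```

In this sketch, raising the ratio lifts the hard limit from 400 to 600 PGs per OSD, so no OSD is over the limit any more. This mirrors, in simplified form, why a higher PG per OSD ratio lets data movement proceed during a migration.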

At 10:23 CEST, we decided to stop all OSDs on the recently upgraded storage node, allowing Ceph to recover and permit full access again, effectively resolving this issue for our users.

After a thorough analysis, we increased the PG per OSD ratio and then restarted all OSDs on the upgraded node. With the higher ratio in place, all OSDs started to backfill as expected.

We will keep this increased PG per OSD ratio for the remainder of the scheduled maintenance while upgrading the rest of the storage nodes.

Please accept our apologies for the inconvenience this service disruption may have caused you and your customers.

Monday, 1st July 2019

Object Storage: Object Storage Unavailable (503)

The root cause has been identified and a workaround is in place. Recovery is in progress. We will follow up with a detailed root cause analysis.

Linux Cloud Servers: Bulk Storage Unavailable

The root cause has been identified and a workaround is in place. Recovery is in progress. We will follow up with a detailed root cause analysis.

Object Storage: Object Storage Unavailable (503)

The object storage is operational again, but in a degraded redundancy state. We will keep investigating the root cause of this outage.

Linux Cloud Servers: Bulk Storage Unavailable

The bulk storage is operational again, but in a degraded redundancy state. We will keep investigating the root cause of this outage.

Object Storage: Object Storage Unavailable (503)

We were notified by our monitoring system that the object storage is currently not available (error 503). We are investigating the issue.

Linux Cloud Servers: Bulk Storage Unavailable

We were notified by our monitoring system that the bulk storage is currently not available. We are investigating the issue.

Bulk and Object Storage Cluster: OS Upgrade and Data Migration

From Monday 2019-07-01 to Sunday 2019-07-07, we will upgrade the operating system of our bulk and object storage cluster and migrate all data to an optimized and encrypted storage format. In order to minimize the impact on storage performance and data redundancy, we will perform the upgrade and migration one host at a time. During this maintenance, we expect periods of degraded performance on the bulk and object storage cluster (i.e. bulk volumes of your cloud servers and S3-compatible object storage). No further impact is expected.

Date / Time
Monday 2019-07-01, 09:00 CEST to Sunday 2019-07-07, 23:59 CEST

Expected Impact
We expect periods of degraded bulk and object storage performance during this maintenance.

We apologize for any inconvenience this may cause and thank you for your understanding.

Sunday, 30th June 2019

No incidents reported