Compute - Outage

Incident Report for Exonet

Postmortem

At around 12:20, one of the nodes in our storage cluster crashed unexpectedly. When this happens, an automatic failover is triggered that reroutes storage connections through another node in the cluster. The system is designed so that this should not cause an interruption.

For the large majority of servers on the platform, the failover worked as expected. However, servers on two nodes in our virtualization cluster became unreachable. We are still investigating the cause and are in direct contact with our suppliers about this issue. We will post an update once more information is available.

The servers on the affected nodes were restarted, and all of them were back online at around 12:47. All remaining monitoring notifications were resolved before 13:00.

Together with our supplier, we have identified the cause of the crash in the storage cluster: a bug in the firmware. A newer firmware version that resolves this bug is available. We will schedule maintenance as soon as possible to deploy this fix.

We apologize for the inconvenience. Please contact our support department if you have any further questions.

Posted Nov 17, 2022 - 16:14 CET

Resolved

This incident has been resolved.

Posted Nov 17, 2022 - 15:47 CET

Monitoring

All affected hosts and services are back online and we are actively monitoring the situation.

Posted Nov 17, 2022 - 13:07 CET

Identified

Some virtualization nodes crashed during an automatic failover, so our engineers performed a manual failover. Affected hosts and services are now coming back online.

Posted Nov 17, 2022 - 12:47 CET

Update

We are continuing to investigate this issue.

Posted Nov 17, 2022 - 12:28 CET

Investigating

We are experiencing a partial outage on our virtualization platform, and our engineers are investigating the issue. Customers may experience downtime on some services until the issue is resolved.

Posted Nov 17, 2022 - 12:22 CET

This incident affected: Managed Servers (Compute).