Major service performance degradation and outage
Incident Report for Cluvio
Postmortem

What happened: The primary Postgres database cluster was running at 100% CPU, adversely affecting backend response times and overall performance. The outage lasted ~6.5 hours (10:54 UTC to 17:30 UTC), with performance intermittently worsening and improving during this period.

Root cause: A maintenance job had failed to run, unnoticed, for several days, pushing the database to its limits on several high-frequency operations. Ultimately, a change in query planner strategy combined with resource exhaustion caused database performance to drop suddenly.
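
The report does not name the failed maintenance job, but the symptoms (rising CPU, a sudden query planner strategy change) are consistent with a missed VACUUM/ANALYZE-style cleanup that lets dead tuples and stale statistics accumulate. As an illustration only, a check along these lines (Python with psycopg2; the DSN and thresholds are placeholders) can surface tables whose routine maintenance looks overdue:

  # Illustration only: assumes the failed job was a VACUUM/ANALYZE-style
  # cleanup (the report does not name it). Flags tables whose routine
  # maintenance looks overdue, using Postgres' built-in statistics views.
  import psycopg2

  STALE_MAINTENANCE_SQL = """
  SELECT relname, n_dead_tup, last_autovacuum, last_autoanalyze
  FROM   pg_stat_user_tables
  WHERE  n_dead_tup > 100000                              -- example threshold
     OR  coalesce(last_autovacuum, 'epoch') < now() - interval '3 days'
  ORDER  BY n_dead_tup DESC;
  """

  def check_stale_maintenance(dsn):
      """Return tables whose routine maintenance looks overdue."""
      with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
          cur.execute(STALE_MAINTENANCE_SQL)
          return cur.fetchall()

  if __name__ == "__main__":
      for row in check_stale_maintenance("dbname=cluvio"):  # placeholder DSN
          print(row)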

A secondary major impact was our inability to notify our customers about the outage in a timely manner. Due to the unfortunate combination of a laptop hardware failure and missing access credentials for a new member of our ops team, we were unable to update status.cluvio.com with information about the outage. Additionally, a lack of support staff made it very hard to answer individual support requests while simultaneously working on the core issue. As a result, many customers experienced several hours of outage without any information or response from Cluvio. This was likely the more significant impact of the incident.

What we did to fix the issue: After determining the root cause, we manually applied the necessary maintenance cleanup in several steps, reducing the database load along the way (the cleanup itself temporarily worsened externally visible performance), increased database capacity, and performed a follow-up database optimization.
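
For illustration, again assuming a VACUUM/ANALYZE-style cleanup (the report does not specify), applying it "in several steps" could look like the sketch below: one table at a time, with a pause between steps so load can settle. The table list, pacing, and connection details are all hypothetical:

  # Illustration only: stepwise cleanup applied table by table, pausing
  # between steps to limit the extra load on an already-stressed database.
  import time
  import psycopg2

  def stepwise_cleanup(dsn, tables, pause_s=30.0):
      conn = psycopg2.connect(dsn)
      conn.autocommit = True  # VACUUM cannot run inside a transaction block
      try:
          with conn.cursor() as cur:
              for table in tables:
                  # Table names come from a trusted, hard-coded list here;
                  # never interpolate untrusted input into SQL.
                  cur.execute(f'VACUUM (ANALYZE) "{table}"')
                  time.sleep(pause_s)  # let load settle before the next step
      finally:
          conn.close()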

What we are going to implement to avoid these types of issues in the future:

  1. We will improve proactive monitoring that verifies ongoing maintenance is actually being performed, as well as earlier detection of failure symptoms, further improving the checks we already have in place (response time degradation, elevated error rates, etc.); a sketch of one such check follows this list.
  2. We will define clear (and quick to perform) steps for any outage that prioritize promptly updating the operational status on status.cluvio.com when an issue is first detected, with frequent updates during any longer-running issue.
  3. We will improve staffing redundancy for the ops team (primary/secondary on-call, with redundant ability to actually intervene: hardware, systems access) and ensure we have enough support capacity to handle spikes in support requests during business hours.
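
As a hypothetical sketch of the monitoring improvement in item 1 (our actual monitoring stack is not described here), a periodic probe can alert when routine maintenance has not completed within an expected window; the threshold, alert hook, and DSN below are illustrative:

  # Hypothetical periodic probe for item 1: alert when routine maintenance
  # has not completed within an expected window.
  import datetime
  import psycopg2

  MAX_MAINTENANCE_AGE = datetime.timedelta(days=1)  # example expectation

  def maintenance_ran_recently(dsn):
      with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
          cur.execute("SELECT max(last_autovacuum) FROM pg_stat_user_tables;")
          (last_run,) = cur.fetchone()
      if last_run is None:
          return False
      now = datetime.datetime.now(datetime.timezone.utc)
      return now - last_run < MAX_MAINTENANCE_AGE

  def alert(message):
      # Placeholder: wire this to a real paging/alerting system.
      print(f"ALERT: {message}")

  if __name__ == "__main__":
      if not maintenance_ran_recently("dbname=cluvio"):  # placeholder DSN
          alert("Database maintenance has not completed in the last 24h")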
Posted Jul 05, 2021 - 17:37 CEST

Resolved
The incident has been resolved and the Cluvio service is back to nominal.
Posted Jul 02, 2021 - 22:38 CEST
Monitoring
We have addressed the core issue of the slowdown, and the service should be back to nominal for all customers. We will continue to monitor the situation closely.
Posted Jul 02, 2021 - 19:52 CEST
Investigating
We are investigating serious performance issues that affect most customers. The cause is still unknown; we will provide an update as soon as we know more.
Posted Jul 02, 2021 - 13:30 CEST
This incident affected: EU (Germany, AWS eu-central-1) (Cluvio Web Application (EU), Query Execution (EU), REST API (EU), Emails for Sql Alerts and Dashboard Schedules (EU)).