What happened: The primary Postgres database cluster was running at 100% CPU, adversely affecting backend response times and overall performance. The outage lasted ~6.5 hours (10:54 UTC to 17:30 UTC), with performance intermittently worsening and improving during that period.
Root cause: A maintenance job had failed to run for several days without anyone noticing, which pushed the database toward its limits on several high-frequency operations. Ultimately, a change in query planner strategy combined with resource exhaustion caused database performance to drop suddenly.
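As an illustration of the kind of monitoring that would have caught this earlier, here is a minimal sketch of a periodic check against Postgres statistics views. It assumes the stalled maintenance job was vacuum-style cleanup; the connection string, thresholds, and table limits are hypothetical placeholders, not our actual configuration.

```python
# Sketch of a periodic "is maintenance actually running?" check.
# Assumes vacuum-style cleanup; DSN and thresholds are illustrative only.
from datetime import datetime, timedelta, timezone

import psycopg2

DSN = "postgresql://monitor@db-primary/cluvio"  # hypothetical connection string
MAX_AGE = timedelta(days=1)        # alert if a busy table had no cleanup for a day
MAX_DEAD_TUPLES = 1_000_000        # alert on excessive table bloat

def check_maintenance(dsn: str = DSN) -> list[str]:
    """Return human-readable warnings for tables whose cleanup looks stale."""
    warnings = []
    now = datetime.now(timezone.utc)
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT relname,
                   n_dead_tup,
                   GREATEST(last_vacuum, last_autovacuum) AS last_cleanup
            FROM pg_stat_user_tables
            ORDER BY n_dead_tup DESC
            LIMIT 20
            """
        )
        for relname, dead_tuples, last_cleanup in cur.fetchall():
            if dead_tuples > MAX_DEAD_TUPLES:
                warnings.append(f"{relname}: {dead_tuples} dead tuples")
            if last_cleanup is None or now - last_cleanup > MAX_AGE:
                warnings.append(f"{relname}: no cleanup since {last_cleanup}")
    return warnings

if __name__ == "__main__":
    for warning in check_maintenance():
        print("WARNING:", warning)
```

Run on a schedule and wired to alerting, a check like this turns a silently failing maintenance job into a page within hours instead of days.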
A secondary major impact was our inability to notify customers about the outage in a timely manner. Due to the unfortunate combination of a laptop hardware failure and missing access credentials for a new member of our ops team, we were unable to update status.cluvio.com with information about the outage. Additionally, the lack of support staff made it very hard to answer individual support requests while simultaneously working on fixing the core issue. As a result, many customers experienced several hours of outage without any information or response from Cluvio. This was likely the more significant impact of this incident.
What we did to fix the issue: After determining the root cause, we manually applied the necessary maintenance cleanup in several steps, reducing the database load along the way (which temporarily worsened externally visible performance), increased database capacity, and performed a follow-up database optimization.
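For readers curious what "cleanup in several steps" can look like in practice, the sketch below shows one hedged interpretation: working through the most bloated tables one at a time with a pause between them, so the extra load on an already struggling database stays bounded. It again assumes the backlog was vacuum-style bloat; the connection string and pause length are placeholders, not our actual values.

```python
# Illustrative stepwise cleanup, assuming vacuum-style maintenance backlog.
# DSN and pause length are placeholders, not actual production values.
import time

import psycopg2
from psycopg2 import sql

DSN = "postgresql://admin@db-primary/cluvio"  # hypothetical connection string
PAUSE_SECONDS = 30                            # let regular traffic recover between steps

def cleanup_in_steps(dsn: str = DSN) -> None:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # VACUUM cannot run inside a transaction block
    with conn.cursor() as cur:
        # Work through the most bloated tables first.
        cur.execute(
            "SELECT relname FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 20"
        )
        tables = [row[0] for row in cur.fetchall()]
        for relname in tables:
            # VACUUM reclaims dead tuples; ANALYZE refreshes the statistics the
            # query planner relies on, which addresses the planner-strategy regression.
            cur.execute(sql.SQL("VACUUM (ANALYZE) {}").format(sql.Identifier(relname)))
            time.sleep(PAUSE_SECONDS)
    conn.close()

if __name__ == "__main__":
    cleanup_in_steps()
```

Pausing between tables is a deliberate trade-off: the cleanup takes longer overall, but the database keeps serving customer queries while it runs.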
What we are going to implement to avoid these types of issues in the future: