INCIDENT REPORTS · ID: 2 · STATUS: PUBLISHED

Incident Response: The 3AM Database Outage

Author
oncall_engineer
Date
11/18/2025
Read Time
12m 05s
Tags
POST-MORTEM

> Incident Report #492

Severity: CRITICAL
Duration: 45 minutes
Impact: 15% of monitors failed to report status.

- Timeline

  • 03:15:00 - Alert fired: Connection pool exhaustion.
  • 03:18:22 - Engineer acknowledged.
  • 03:25:00 - Rolling restart initiated.
  • 03:45:00 - Services fully restored.

- Root Cause

A memory leak in the connection pooler caused gradual resource exhaustion. The issue was triggered by a spike in monitoring requests combined with a misconfigured connection timeout.
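To make this failure mode concrete, here is a minimal sketch (the pool size and request rates are hypothetical, not our production numbers) of why a long connection timeout exhausts a fixed-size pool: by Little's law, the expected number of in-use connections is roughly the arrival rate times the time each request holds a connection.

```python
# Hypothetical illustration of connection pool exhaustion.
# By Little's law, average connections in use ≈ arrival_rate * hold_time;
# once that exceeds pool_size, new requests queue and eventually fail.

def pool_saturated(arrival_rate_rps: float, hold_time_s: float, pool_size: int) -> bool:
    """Return True if the expected concurrent connections exceed the pool."""
    return arrival_rate_rps * hold_time_s > pool_size

# With a 30s timeout, a stalled request can hold a connection for up to 30s:
print(pool_saturated(arrival_rate_rps=5, hold_time_s=30, pool_size=100))  # True: 150 > 100
# Cutting the timeout to 5s releases stuck connections six times sooner:
print(pool_saturated(arrival_rate_rps=5, hold_time_s=5, pool_size=100))   # False: 25 <= 100
```

This is why the timeout misconfiguration turned a slow memory leak into a hard outage once request volume spiked.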

- What Went Wrong

  1. Monitoring Gap - Our alerting didn't catch the gradual memory increase
  2. Configuration Error - Connection timeout was set too high (30s vs recommended 5s)
  3. Lack of Circuit Breakers - No automatic failover mechanism
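The third gap above, the missing circuit breaker, works like this in miniature. This is a generic sketch, not our production implementation; the failure threshold and reset window are illustrative.

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch: opens after `max_failures`
    consecutive errors and fails fast for `reset_after` seconds,
    shielding the database from a thundering herd of retries."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: let one trial call through to probe recovery.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

With a breaker like this in front of the pool, callers would have failed fast during the exhaustion instead of piling up and holding connections.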

- Immediate Actions Taken

We implemented an emergency fix that included:

  • Reduced connection timeout to 5 seconds
  • Added memory usage alerts
  • Deployed circuit breakers
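The memory usage alerts added above reduce to threshold checks against the pooler's resident set size. A simplified sketch, with hypothetical warn/critical ratios (our actual thresholds live in the alerting config):

```python
def memory_alert(rss_bytes: int, limit_bytes: int,
                 warn_ratio: float = 0.8, crit_ratio: float = 0.9):
    """Return 'CRIT', 'WARN', or None for the given memory usage.
    Ratios are illustrative defaults, not production values."""
    usage = rss_bytes / limit_bytes
    if usage >= crit_ratio:
        return "CRIT"
    if usage >= warn_ratio:
        return "WARN"
    return None
```

A check like this, sampled every minute, would have paged us hours before the pool actually fell over.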

- Long-term Improvements

  1. Enhanced monitoring of database connections
  2. Implemented automatic scaling triggers
  3. Added chaos engineering tests
  4. Updated runbooks with specific recovery procedures
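Improvement 2, the automatic scaling triggers, boils down to a utilization rule on the pool. The thresholds below are illustrative, not our actual configuration:

```python
def scale_decision(active_connections: int, pool_size: int,
                   scale_up: float = 0.75, scale_down: float = 0.25) -> str:
    """Decide whether to grow or shrink the pool based on utilization.
    Thresholds are hypothetical defaults for illustration."""
    utilization = active_connections / pool_size
    if utilization >= scale_up:
        return "scale_up"
    if utilization <= scale_down:
        return "scale_down"
    return "hold"
```

Scaling up at 75% utilization gives headroom before exhaustion; scaling down at 25% avoids flapping between the two states.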

- Impact Analysis

During the outage:

  • 15% of active monitors couldn't report status
  • No data loss occurred
  • Customer notifications were delayed by an average of 2 minutes

- Prevention Measures

Going forward, we've implemented:

  • Weekly load testing
  • Bi-weekly chaos experiments
  • Automated connection pool monitoring
  • Enhanced alerting thresholds
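The automated pool monitoring now also watches for the slow, monotonic memory growth that the original alerting missed. A simplified sketch of such a trend check (the window and growth threshold are hypothetical):

```python
def leak_suspected(samples, min_growth_bytes: int = 50 * 1024 * 1024) -> bool:
    """Flag a suspected leak when memory grows monotonically across the
    sample window by more than `min_growth_bytes` in total.
    The 50 MB default is illustrative, not a production threshold."""
    if len(samples) < 2:
        return False
    rising = all(b >= a for a, b in zip(samples, samples[1:]))
    return rising and (samples[-1] - samples[0]) > min_growth_bytes
```

Requiring both monotonic growth and a minimum total increase keeps the check from paging on ordinary allocation noise.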

> Conclusion

This incident reinforced the importance of proactive resource monitoring and robust failover mechanisms.

End of log entry.
Filed Under:
#POST-MORTEM #DATABASE #INCIDENT