Incident Details
Regional outage: sa-east-br (🇩🇪 Munich) (Regions) Partial Outage Resolved
Regions • Started UTC • resolved UTC • Duration
- Investigating
Monitor is failing from 🇩🇪 Munich.
- Resolved Latest
Regional outage resolved
📘 Postmortem
Summary
Munich's primary monitoring server experienced a transient network issue reaching the sa-east-br endpoint, triggering a regional outage alert that lasted 6 minutes 29 seconds. A rendering bug in the status page caused all 8 monitors in the "Regions" component group to appear as "Outage", not just sa-east-br. The same network instability caused the monitor to flip between down and up repeatedly throughout the day, flooding affected users with alert emails. Three independent root causes were identified, and all fixes were shipped the same day.
Impact
- Status page: All 8 regional monitors (eu-west-uk, na-west-us, apac-east-jp, na-central-us, eu-central-de, apac-se-sg, sa-east-br, eu-south-it) incorrectly displayed as "Outage" for ~6 minutes, despite only sa-east-br being checked from Munich.
- Alert spam, initial cascade (09:08–09:14 UTC): A wave of false "Monitor Down" emails went out to roughly 15 or more distinct users across the platform, followed immediately by a simultaneous wave of "Monitor Recovered" emails once Munich's network recovered, all within a 6-minute window. The affected monitors belonged to different customers and were unrelated to each other; the cascade happened because Munich was checking all of them from a single location.
- Alert spam, ongoing flapping (09:22–14:54 UTC): One affected user received 23 emails (11× Down + 12× Recovered) over ~5.5 hours as their monitor continued to flap. Duplicate "down" notifications were suppressed only within a 10-minute window, and recovery notifications were not suppressed at all, so every down/up cycle fired independently.
- Server load: A separate nightly aggregation job ran a runaway PostgreSQL query for 1h 17min, pegging CPU at ~100% on the Munich server and compounding the network instability.
- No customer data or status pages were hosted in the affected region. Monitoring checks originating from Munich were transiently delayed or failed; all other regions continued operating normally.
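The email count in the flapping item above can be reproduced with a small simulation. This is a minimal sketch under stated assumptions, not the platform's alerting code: the function names are hypothetical, and the evenly spaced 15-minute transitions starting at 09:22 are an assumed stand-in for the real flap times. Only the 10-minute "down" suppression window and the lack of recovery suppression come from this report.

```python
DOWN_SUPPRESSION_MIN = 10  # duplicate "down" window stated in this report


def emails_under_old_rule(transitions):
    """Return the transitions that would have produced an email.

    transitions: list of (minute_of_day, state) with state "down" or "up".
    A "down" email is skipped only if another "down" email was sent less
    than DOWN_SUPPRESSION_MIN minutes earlier. "up" emails are never skipped.
    """
    sent = []
    last_down_sent = None
    for minute, state in transitions:
        if state == "down":
            recently_sent = (
                last_down_sent is not None
                and minute - last_down_sent < DOWN_SUPPRESSION_MIN
            )
            if recently_sent:
                continue
            last_down_sent = minute
        sent.append((minute, state))
    return sent


def alternating_transitions(start_minute, gap_minutes, count, first_state):
    """Hypothetical helper: evenly spaced, strictly alternating transitions."""
    states = ("up", "down") if first_state == "up" else ("down", "up")
    return [(start_minute + i * gap_minutes, states[i % 2]) for i in range(count)]


# Assumed flap pattern: 23 transitions, 15 minutes apart, starting 09:22.
flaps = alternating_transitions(9 * 60 + 22, 15, 23, "up")
sent = emails_under_old_rule(flaps)
downs = sum(1 for _, state in sent if state == "down")
ups = sum(1 for _, state in sent if state == "up")
print(len(sent), downs, ups)  # 23 11 12
```

Because consecutive "down" transitions in this pattern are 30 minutes apart, none of them falls inside the 10-minute window, so nothing is suppressed.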
Root Cause
Three independent root causes contributed to the incident:
1. Status page rendering bypassed multi-region consensus (primary cause of the cascade): The status page determined a monitor's displayed state from the single most-recent check result across all regions, rather than the consensus-confirmed state maintained by the monitoring engine. When Munich reported "down" for sa-east-br, its result became the most recent for every monitor it had recently checked, displaying all sibling monitors as "Outage". The 9 other monitoring agents' recent "up" results were not used by the rendering path, making the consensus logic invisible to end users.
2. No flap detection or recovery debounce: The alert system suppressed duplicate "down" notifications within a 10-minute window, but recovery ("up") notifications had no suppression at all. Because the gaps between flaps (12–16 minutes) were longer than the suppression window, both down and recovery emails fired on every cycle, producing 23 emails to one user over ~5.5 hours.
3. Background database job ran without a timeout: A nightly data aggregation job used unbounded database queries that ignored the 30-minute timeout configured for the job. A slow statistical aggregation over a large historical dataset ran uninterrupted for 1h 17min, saturating CPU on the Munich server and contributing to the degraded network performance that triggered the initial outage.
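The difference between the two rendering paths in root cause 1 can be illustrated with a short sketch. Everything here is hypothetical: the `CheckResult` type, the function names, the region labels, and in particular the consensus thresholds (no region down is "operational", a strict majority down is "outage", anything else is "degraded") are assumptions made for illustration. The report only states that one region down with nine up should not render as "Outage".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    region: str
    checked_at: int  # seconds since an arbitrary epoch
    is_up: bool


def state_from_latest_result(results):
    """Old behaviour (as described above): trust the single newest result."""
    if not results:
        return "unknown"
    newest = max(results, key=lambda r: r.checked_at)
    return "operational" if newest.is_up else "outage"


def state_from_consensus(results):
    """Assumed consensus rule over the newest result from each region."""
    latest = {}
    for r in results:
        current = latest.get(r.region)
        if current is None or r.checked_at > current.checked_at:
            latest[r.region] = r
    if not latest:
        return "unknown"
    down = sum(1 for r in latest.values() if not r.is_up)
    if down == 0:
        return "operational"
    if down * 2 > len(latest):  # strict majority of regions report down
        return "outage"
    return "degraded"


# Nine regions report "up"; one region reports "down" most recently.
results = [CheckResult(f"region-{i}", 100 + i, True) for i in range(9)]
results.append(CheckResult("munich", 200, False))

print(state_from_latest_result(results))  # outage
print(state_from_consensus(results))      # degraded
```

With the newest-result rule, a single failing region decides the displayed state whenever it happens to have checked last; the consensus rule makes the displayed state independent of check ordering.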
Remediation
All fixes were shipped the same day:
Immediately applied:
- Terminated the stuck database aggregation job to restore normal CPU levels.
Fixes deployed:
- Status page multi-region rendering: The status page rendering path now auto-discovers results from all monitoring agents that have checked a monitor recently (within plan limits) and applies consensus logic before displaying state. Monitors reported "down" by one region but "up" by the remaining 9 now correctly display as "degraded" rather than "Outage".
- Flap detection + recovery debounce: Added a rolling flip counter (4 flips within 30 minutes triggers a 20-minute alert suppression window) and a 5-minute stable-UP confirmation period before recovery notifications are sent. This reduces repeat alert emails for flapping monitors by ~70%.
- Aggregation job timeout enforcement: The nightly aggregation job's database queries now correctly inherit the job's 30-minute timeout, so a slow query cannot run indefinitely. An additional filter was added to exclude false-positive results from the dataset, reducing the data volume processed.
- Durable monitoring agent result queue: Agent result delivery was switched from an in-memory queue (which dropped results after ~17 minutes of server unavailability) to a disk-backed queue with no retry limit. Results now survive indefinitely during server downtime and are flushed in bulk immediately on reconnection.
- High-availability failover documentation: Documented and validated the procedure for promoting a standby server to primary using DNS-based failover, covering multiple DNS providers as an alternative to floating IPs.
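The flap detection and recovery debounce fix listed above can be sketched as a small state machine. This is an illustrative sketch, not the deployed implementation: the `AlertGate` class and its method names are hypothetical, as is the choice to send a recovery only after a "down" alert was actually delivered. The four numeric parameters are the ones given in this report.

```python
from collections import deque

FLIP_WINDOW = 30 * 60   # rolling window for counting flips (seconds)
FLIP_THRESHOLD = 4      # flips within the window that trigger suppression
SUPPRESS_FOR = 20 * 60  # alert suppression after flapping is detected
STABLE_UP = 5 * 60      # required continuous "up" time before recovery


class AlertGate:
    """Hypothetical per-monitor gate deciding which notifications go out."""

    def __init__(self):
        self.flips = deque()       # timestamps of recent state flips
        self.suppressed_until = 0  # alerts are muted before this time
        self.up_since = None       # start of the current "up" streak
        self.down_notified = False

    def on_transition(self, now, state):
        """Record a flip; return "down" if a down alert should be sent."""
        self.flips.append(now)
        while self.flips and now - self.flips[0] > FLIP_WINDOW:
            self.flips.popleft()
        if len(self.flips) >= FLIP_THRESHOLD:
            self.suppressed_until = now + SUPPRESS_FOR
        if state == "up":
            self.up_since = now  # recovery is decided later, in poll()
            return None
        self.up_since = None     # a new "down" cancels any pending recovery
        if now < self.suppressed_until or self.down_notified:
            return None
        self.down_notified = True
        return "down"

    def poll(self, now):
        """Return "recovered" once the monitor has been stably up."""
        if self.up_since is None or not self.down_notified:
            return None
        if now - self.up_since < STABLE_UP or now < self.suppressed_until:
            return None
        self.up_since = None
        self.down_notified = False
        return "recovered"


gate = AlertGate()
print(gate.on_transition(0, "down"))  # down
print(gate.on_transition(600, "up"))  # None
print(gate.poll(700))                 # None
print(gate.poll(900))                 # recovered
```

In this design a recovery is never sent at the moment of the "up" transition; a periodic `poll` call decides it, so a monitor that drops again within the confirmation period produces no recovery email at all.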
Timeline
- ~08:00–09:15 UTC: Nightly data aggregation job runs for 1h 17min without a timeout, saturating server CPU and contributing to Munich's degraded network performance. Terminated manually at ~09:15.
- 09:08 UTC: Munich monitoring agent fails check for sa-east-br endpoint. Regional outage incident auto-created.
- 09:08 UTC: Status page rendering bug causes all 8 monitors in the Regions group to display as "Outage" (cascading display).
- 09:08–09:14 UTC: "Monitor Down" notifications sent to roughly 15 or more distinct users across the platform as Munich's check failures propagated simultaneously across all monitors it was responsible for.
- 09:13–09:14 UTC: Munich network stabilizes. "Monitor Recovered" notifications sent to the same users.
- 09:15 UTC: sa-east-br incident auto-resolved. Duration: 6m 29s.
- 09:22–14:54 UTC: One affected user's monitor continues to flap (11 down/up cycles, 23 notifications total) due to missing flap detection and recovery debounce.
- 09:15–17:00 UTC: Root causes identified and all fixes deployed.
Published 19.05.2026. 16:12:01