The Monitor Crashed the Thing It Was Monitoring
OpenAI's December outage was caused by the service meant to watch for outages.
On December 11, 2024, OpenAI rolled out a new telemetry service to watch its fleet more closely. The telemetry service then took the fleet down, and the telemetry made it impossible to fix.
Per OpenAI's own incident writeup, a new telemetry deployment caused every node in each cluster to run resource-intensive Kubernetes API calls whose cost scaled with cluster size. Thousands of nodes did this at once, the API servers buckled, and the control plane fell over in most large clusters between 3:16 and 7:38 PM PST. The fix was to remove the offending service — which required reaching the control plane they could no longer reach.
The instrument installed to improve visibility became the single thing nobody could see around. This is the failure mode of observability bought as a product rather than defined as a practice: the new watcher was treated as inherently safe because its job title was 'monitoring.' Nobody wrote down what 'healthy' meant for the act of monitoring itself, so there was no SLO it could violate before it violated everything else.
It reveals how the field quietly reclassified telemetry as overhead-free. We meter latency, error rates, and freshness on every workload, then ship the metering layer with the operational caution of a font change. OpenAI notes staging stayed green because the blast radius only appeared above a certain cluster size, and DNS caching hid the failures long enough to keep the rollout going — the dashboard said fine while the floor was already gone. A green board is a claim about what you instrumented, not a claim about reality.
Watch for organizations that treat observability tooling as exempt from the change discipline they apply to the systems it observes. The tell is a runbook that monitors the database, the pipeline, and the API but never asks what happens when the monitor itself misbehaves at scale. OpenAI's remediation list now puts phased rollout and control-plane health checks on the monitoring layer too. If your observability has no failure budget of its own, you have a second production system that nobody is on call for.
Observability is a production system. If you never defined 'healthy' for the watcher, you have not added a safety net — you have added an unmonitored dependency.
A new telemetry service caused nodes to run Kubernetes API operations whose cost scaled with cluster size, overwhelming the control plane between 3:16 and 7:38 PM PST on Dec 11, 2024.
supports01Engineers could not remove the offending service because the resulting load locked them out of the Kubernetes control plane they needed to access.
supports01Staging did not surface the issue because impact only appeared above a certain cluster size, and DNS caching delayed visible failures during rollout.
supports01OpenAI publicly attributed the outage to a new telemetry service rollout, not a security incident or product launch.
No notes yet. The margin is open.
Sign in to add a note. The margin is moderated — we keep it useful, not cruel.
Anomaly detection now defines 'good' for you. It defines it as 'whatever usually happens.'
Process DebtYou can monitor a metric to the second and still not know what it counts.
Business Sense RequiredA management system can preserve discipline. It cannot supply the missing vocabulary.