249·05 · Dashboard Theater · No. 249 · 29 May 2026 · 2 min

The Monitor Crashed the Thing It Was Monitoring

OpenAI's December 2024 outage was caused by the service meant to watch for outages.

Promoted · The Editor · Signal, Not Theater

On December 11, 2024, OpenAI rolled out a new telemetry service to watch its fleet more closely. The service then took the fleet down, and its own load locked engineers out of the fix.

Per OpenAI's own incident writeup, a new telemetry deployment caused every node in each cluster to run resource-intensive Kubernetes API calls whose cost scaled with cluster size. Thousands of nodes did this at once, the API servers buckled, and the control plane fell over in most large clusters between 3:16 and 7:38 PM PST. The fix was to remove the offending service — which required reaching the control plane they could no longer reach.
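A back-of-the-envelope sketch of why that pattern punishes large clusters. The writeup says only that per-call cost scaled with cluster size; the linear cost model, the unit cost, and the function name below are assumptions for illustration, not OpenAI's numbers.

```python
def aggregate_api_load(nodes: int, unit_cost: float = 1.0) -> float:
    """Total control-plane work when every node issues one call.

    Assumption: each call costs unit_cost * nodes (linear in cluster size).
    Units are arbitrary; only the growth rate matters here.
    """
    per_call_cost = unit_cost * nodes
    return nodes * per_call_cost


# Compare a small, a medium, and a large cluster under the same model.
for n in (10, 100, 1000):
    print(f"{n:>5} nodes -> load {aggregate_api_load(n):,.0f}")
```

Under this assumed model, total load grows with the square of node count: a cluster 10 times larger carries 100 times the load, so a small test cluster can look healthy while a large one is overwhelmed.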

The instrument installed to improve visibility became the single thing nobody could see around. This is the failure mode of observability bought as a product rather than defined as a practice: the new watcher was treated as inherently safe because its job title was 'monitoring.' Nobody wrote down what 'healthy' meant for the act of monitoring itself, so there was no SLO it could violate before it violated everything else.

The outage reveals how the field quietly reclassified telemetry as overhead-free. We meter latency, error rates, and freshness on every workload, then ship the metering layer with the operational caution of a font change. OpenAI notes staging stayed green because the impact only appeared above a certain cluster size, and DNS caching hid the failures long enough to keep the rollout going. The dashboard said fine while the floor was already gone. A green board is a claim about what you instrumented, not a claim about reality.

Watch for organizations that treat observability tooling as exempt from the change discipline they apply to the systems it observes. The tell is a runbook that monitors the database, the pipeline, and the API but never asks what happens when the monitor itself misbehaves at scale. OpenAI's remediation list now puts phased rollout and control-plane health checks on the monitoring layer too. If your observability has no failure budget of its own, you have a second production system that nobody is on call for.
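One way to picture a failure budget for the watcher is a rollout gate that checks control-plane health after each wave. This is a minimal sketch, not OpenAI's remediation: `Cluster`, `phased_rollout`, the latency probe, and the 200 ms budget are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Cluster:
    name: str
    nodes: int


def phased_rollout(
    clusters: Iterable[Cluster],
    deploy: Callable[[Cluster], None],
    rollback: Callable[[Cluster], None],
    api_latency_ms: Callable[[Cluster], float],
    latency_budget_ms: float,
) -> list[str]:
    """Deploy smallest-first; stop and roll back at the first budget breach.

    Returns the names of clusters left running the new version.
    """
    done: list[Cluster] = []
    for cluster in sorted(clusters, key=lambda c: c.nodes):
        deploy(cluster)
        done.append(cluster)
        if api_latency_ms(cluster) > latency_budget_ms:
            # The monitor broke its own budget: undo every wave, newest first.
            for c in reversed(done):
                rollback(c)
            return []
    return [c.name for c in done]


fleet = [Cluster("large", 1000), Cluster("small", 10), Cluster("mid", 100)]
deployed: list[str] = []
rolled_back: list[str] = []

kept = phased_rollout(
    fleet,
    deploy=lambda c: deployed.append(c.name),
    rollback=lambda c: rolled_back.append(c.name),
    # Stand-in probe: pretend API latency grows with node count.
    api_latency_ms=lambda c: 0.5 * c.nodes,
    latency_budget_ms=200.0,
)
print(kept, rolled_back)
```

In this toy run the two smaller clusters pass, the largest breaches the budget, and every wave is rolled back. A real gate would also need a rollback path that does not depend on the control plane it is protecting, since that dependency is what blocked the fix in the incident.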

The takeaway

Observability is a production system. If you never defined 'healthy' for the watcher, you have not added a safety net — you have added an unmonitored dependency.

The claim, mapped

A new telemetry service caused nodes to run Kubernetes API operations whose cost scaled with cluster size, overwhelming the control plane between 3:16 and 7:38 PM PST on Dec 11, 2024.
supports · 01
Engineers could not remove the offending service because the resulting load locked them out of the Kubernetes control plane they needed to access.
supports · 01
Staging did not surface the issue because impact only appeared above a certain cluster size, and DNS caching delayed visible failures during rollout.
supports · 01
OpenAI publicly attributed the outage to a new telemetry service rollout, not a security incident or product launch.
supports · 01, 02

Sources

01 · OpenAI Status · API, ChatGPT & Sora Facing Issues · Incident Report · 2024-12-13 · Tier 1 · primary
A new telemetry service configuration caused every node to run resource-intensive Kubernetes API operations whose cost scaled with cluster size, overwhelming the control plane.

02 · TechCrunch · OpenAI blames its massive ChatGPT outage on a 'new telemetry service' · 2024-12-13 · Tier 2 · news
OpenAI says a newly deployed telemetry service overwhelmed its Kubernetes control plane, and engineers could not access the control plane to remove it because of the same load.

