
Rare Agent Work · Incident Intelligence Edition
Rev 1.0 · Updated March 14, 2026
8 documented production failures — root causes, blast radius, and what actually fixed them
Learn from 8 real production incidents before they happen to you — exact failure modes, root cause trees, remediation timelines, and the governance changes that followed.
What this report gives you
The finding that changes your next decision
“Every incident in this report was visible in the logs before it became an incident — the bulk-send spike was a 47x volume anomaly, the auth cascade was a wall of 401 errors for four days, the $47k cost explosion was 8x expected cost per session for 72 hours — and every signal was missed because nobody had written down what normal looked like.”
This report is right for you if any of these are true
Why this report exists
The hardest lessons in production AI are the ones teams learn the wrong way — at 2am, with customers affected, under pressure. This report documents 8 real categories of production agent failure in enough detail that you can learn the lesson without having the incident. Each post-mortem includes a five-layer root cause analysis, a timeline, and the one governance change that would have prevented it.
Honest disqualification. If none of the above matches you, this report was not written for you.
Root cause tree, timeline reconstruction, blast radius assessment, and remediation analysis for each incident category.
Trigger → proximate cause → contributing factors → systemic gap → governance failure. Reusable for your own incidents.
Two fill-in-the-blank templates: the first-hour triage protocol and the post-incident governance change spec. Ready to use in actual incidents.
How to establish normal behavior baselines before you need them — the precursor signal detection that most teams skip.
Three scenario scripts your team can run before launch to surface gaps without having an actual incident.
24-item audit covering the systemic gaps that appear across all 8 incident categories.
The five-layer root cause framework: why stopping at Layer 2 (proximate cause) guarantees you'll have the same incident again.
Every post-mortem methodology has a version of "five whys" — keep asking why until you reach the root cause. That methodology is correct in principle and incomplete in practice for AI agent incidents, because the root cause of an AI agent failure is almost never a single causal chain. It is the intersection of a trigger condition, a missing technical control, a monitoring gap, and an organizational assumption that turned out to be wrong.
The framework used in this report separates incident analysis into five layers that must be analyzed independently and then synthesized:
Layer 1: Trigger — The specific event that initiated the incident. This is usually well-documented and over-discussed in post-mortems, because it is concrete and blameable. A CSV import. A webhook fired twice. A rate limit not checked. The trigger is never the root cause — it is the visible entry point.
Layer 2: Proximate Cause — The immediate technical failure the trigger exposed. The deduplication key that wasn't set. The approval gate that wasn't inserted. The rate limiter that wasn't implemented. This is what teams fix after an incident, and fixing it is necessary but not sufficient — the same failure will recur through a different trigger if the systemic gap beneath it isn't addressed.
Layer 3: Contributing Factors — The conditions that made the proximate cause possible. Insufficient testing with real data. A handoff between two teams where each assumed the other owned the safeguard. Timeline pressure that caused a known risk to be deferred. Contributing factors are usually organizational and process-related, which makes them uncomfortable to document honestly.
Layer 4: Systemic Gap — The architectural or process design choice that allowed the contributing factors to exist. No deduplication pattern standard across the platform. No automated check that approval gates are present before production deployment. No ownership policy for automation governance. Systemic gaps are the layer most often skipped in post-mortems because addressing them requires changing how the organization works, not just how the software works.
Layer 5: Governance Failure — The oversight or policy failure that allowed the systemic gap to persist. No review process that would have caught the missing control. No accountability for the governance standard. No escalation path when a team member identified the risk and was overridden by schedule pressure.
Teams that stop at Layer 2 fix the specific failure mode they just experienced. Teams that work through all five layers fix the class of failure mode — and prevent the three variants they haven't encountered yet.
What this means for your post-mortems: Most teams declare an incident closed when the proximate cause is fixed and the system is back online. By this framework's standard, they have completed Layer 2 of a five-layer analysis. The systemic gap is still open. The governance failure is still unaddressed. When the next variant of the same incident class arrives — and it will — the team will be surprised. This report shows you what all five layers look like for eight different incident categories, so that when you run your own post-mortem, you know what layer you're actually on.
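One way to keep a post-mortem honest about which layer it has actually reached is to capture the five layers as a structured record and refuse to mark the incident closed until every layer is filled in. The sketch below is illustrative, not part of the report's templates; the field names are assumptions chosen to mirror the five layers above.

```python
from dataclasses import dataclass, field

@dataclass
class FiveLayerAnalysis:
    """One record per incident. Closed only when all five layers are filled."""
    trigger: str = ""              # Layer 1: the event that initiated the incident
    proximate_cause: str = ""      # Layer 2: the immediate technical failure
    contributing_factors: list[str] = field(default_factory=list)  # Layer 3
    systemic_gap: str = ""         # Layer 4: the design choice behind Layer 3
    governance_failure: str = ""   # Layer 5: the oversight gap behind Layer 4

    def deepest_layer_completed(self) -> int:
        layers = [
            bool(self.trigger),
            bool(self.proximate_cause),
            bool(self.contributing_factors),
            bool(self.systemic_gap),
            bool(self.governance_failure),
        ]
        depth = 0
        for done in layers:
            if not done:
                break
            depth += 1
        return depth

    def is_closed(self) -> bool:
        # A post-mortem that stops at Layer 2 is not closed by this standard.
        return self.deepest_layer_completed() == 5
```

A team that fixes the proximate cause and reopens the service would see `deepest_layer_completed() == 2` here, which is exactly the failure mode the section describes.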
The precursor signal problem: why every incident in this report was visible in the logs before it became an incident.
The most consistent finding across all eight incident categories in this report is that the incident was visible before it became an incident — in signals that nobody was watching for, because nobody had established what normal looked like.
This is not a monitoring failure in the traditional sense. Most teams have monitoring. The problem is the absence of a baseline: a documented expectation of what normal agent behavior looks like, against which anomalies become visible.
What a monitoring baseline looks like in practice:
The implementation requirement that most teams skip: A baseline is only useful if it is written down before an incident. Teams that establish baselines post-incident build them under pressure, too specific to the incident that just happened, missing adjacent failure classes. The right time to build your monitoring baseline is during the 48 hours before your first production deployment — when you have the clearest picture of expected behavior and the motivation to think carefully about what normal should look like.
The baseline you can build today — in under two hours: Document four numbers for each agent deployment you operate: (1) expected sessions per hour at peak, (2) expected tool calls per session by task type, (3) expected cost per session by task type, (4) expected error rate by error category. Write these numbers down. Set alerts at 3x each. This exercise takes two hours and transforms your monitoring from reactive to anticipatory. Every team in this report that had an incident without early detection had failed to do this one thing.
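The two-hour exercise above can be wired up as a trivially small check. This is a minimal sketch, assuming you can pull the current values of the four numbers from whatever observability stack you already run; the deployment name, metric names, and observed-values dictionary are all illustrative.

```python
# Baselines written down BEFORE the first production deployment, per the text:
# the four numbers, documented per agent deployment. Values are illustrative.
BASELINES = {
    "support-triage-agent": {
        "sessions_per_hour_peak": 120,
        "tool_calls_per_session": 10,
        "cost_per_session_usd": 0.40,
        "error_rate": 0.02,
    },
}

ALERT_MULTIPLIER = 3  # the report's suggested starting point: alert at 3x normal

def check_against_baseline(deployment: str, observed: dict) -> list[str]:
    """Return a human-readable alert for every metric above 3x its baseline."""
    alerts = []
    for metric, normal in BASELINES[deployment].items():
        value = observed.get(metric)
        if value is not None and value > ALERT_MULTIPLIER * normal:
            alerts.append(
                f"{deployment}: {metric} is {value} "
                f"({value / normal:.1f}x baseline of {normal})"
            )
    return alerts
```

Against this check, the bulk-send incident's 47x volume anomaly fires on the first evaluation instead of surfacing as customer complaints 33 minutes later.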
Incident 01: The bulk-send. 847 customers, one CSV, zero deduplication. Minute-by-minute reconstruction.
This incident class kills automation programs. Not because it is technically complex — it is not — but because it happens visibly, to real customers, and the immediate response is almost always to shut down the entire automation program rather than fix the specific failure.
What happened: A marketing team member uploaded a CSV of 847 customer email addresses to trigger a "thank you" workflow. The CSV included a header row and 847 data rows. The workflow triggered on every row — including the header row itself.
Friday, 2:14 PM: All 848 emails begin sending at approximately 200/min. The header row's literal value ("Email Address") is treated as a recipient alongside every real customer.
Friday, 2:23 PM: Send completes. 848 executions. No errors flagged. The workflow behaved exactly as it was configured to behave.
Friday, 2:47 PM: First customer reply arrives: "Why did I receive 6 identical emails?" The customer appeared six times in the CRM export — duplicated entries nobody caught. The deduplication step was on the backlog. It never shipped.
The full blast radius: 847 customers received the email. 6 customers received it multiple times. 1 non-customer received it (the header row). The automation program was suspended for three weeks while leadership debated whether to continue using it at all.
Trigger: CSV import with 847 rows plus a header row that the trigger treated as a data row.
Proximate cause: No deduplication key on the trigger. No row-count sanity check before execution began. No dry run against the actual file before the production send.
Contributing factors: The workflow was built by one team member and reviewed by another who assumed deduplication was handled upstream in the CRM export. Neither verified. Timeline pressure to send before end-of-week meant skipping the planned 48-hour shadow mode.
Systemic gap: No organizational standard requiring deduplication logic for any workflow that processes records from a file or CRM export. No pre-flight checklist including a row count review and a duplicate scan before triggering any bulk operation.
Governance failure: The shadow-mode requirement existed as an informal norm with no enforcement mechanism. A team member under deadline pressure could skip it without triggering any review. There was no named owner for the automation governance standard.
What actually fixed it: Not adding deduplication — that was already planned. What fixed it was a mandatory pre-flight gate: any workflow processing more than 10 records must complete a dry run review where the first 5 intended executions are shown to a human before the full run proceeds. This gate has blocked a recurrence of the bulk send incident class in every deployment since.
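The pre-flight gate described above is small enough to sketch in full. This is an illustrative implementation, not the report's template: the `confirm` callable stands in for however your team puts five intended executions in front of a human, and the `email` column name is an assumption. Note that `csv.DictReader` consumes the header row by design, which removes the header-row trigger from Incident 01 as a side effect.

```python
import csv
import io

PREFLIGHT_THRESHOLD = 10  # any run over 10 records requires the dry-run review
PREVIEW_COUNT = 5         # the first 5 intended executions shown to a human

def preflight_bulk_run(csv_text: str, confirm) -> list[dict]:
    """Parse, deduplicate, and gate a bulk run. Returns the approved rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))  # header row is not data
    # Deduplicate on email address: the control Incident 01 never shipped.
    seen, unique_rows = set(), []
    for row in rows:
        key = row["email"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    if len(unique_rows) > PREFLIGHT_THRESHOLD:
        approved = confirm(
            total=len(unique_rows),
            duplicates_dropped=len(rows) - len(unique_rows),
            preview=unique_rows[:PREVIEW_COUNT],
        )
        if not approved:
            raise RuntimeError("Pre-flight review rejected; bulk run aborted.")
    return unique_rows
```

The human reviewing the preview sees both the row count and the number of duplicates dropped, which is exactly the information that would have stopped the Friday 2:14 PM send.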
Incident 03: The auth cascade. 14 workflows silent for 4 days. The 30-minute fix that catches it within 24 hours.
Timeline reconstruction:
Day 0, Thursday 4:47 PM: A team member who owned a service account used across 14 automated workflows leaves the company. Standard IT offboarding runs that evening. The service account is deleted.
Day 1, Friday 6:03 AM: The first workflow dependent on the deleted account runs its scheduled trigger. The API call returns 401 Unauthorized. The workflow has error handling — but the error handler sends a notification to the deleted account's email address. The notification is never received.
Day 1, Friday 8:47 AM through 11:59 PM: Nine more workflows run and fail. All error notifications go to the same deleted email address. Nobody knows anything is wrong.
Day 4, Monday 9:12 AM: A team member checks a dashboard populated by one of the failing workflows and finds it hasn't updated since Thursday. Investigation begins. The full scope: 14 workflows down, four days of data missing, three customer-facing processes that failed silently over the weekend.
Trigger: Employee offboarding plus service account deletion.
Proximate cause: Workflows used hardcoded service account credentials rather than role-based access credentials that survive individual account changes. Error notifications routed to the account owner's email rather than a durable team alias. No monitoring that checked whether scheduled workflows had actually run.
Systemic gap: No credential dependency mapping for automation infrastructure. No standard for routing error notifications to a durable team address. No workflow execution monitoring independent of error notification.
Governance failure: IT offboarding had no step requiring a dependency audit before account deletion. Automation infrastructure was not included in the offboarding checklist. There was no owner for the credential audit process.
What fixed it: Two changes. First: every workflow error notification re-routed to a team alias with at least two members. Second: a weekly automated check verifying each scheduled workflow actually ran in the last 7 days and sending a summary to the automation owner. This single change — the weekly execution summary — would have surfaced this incident within 24 hours instead of 96.
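The weekly execution check is a short script. This sketch is illustrative: `last_run_at` is a placeholder for however your scheduler exposes execution history, and the workflow names are invented. The point is that it checks whether workflows actually ran, independently of whether their error handlers fired.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=7)

def weekly_execution_summary(workflows, last_run_at, now=None) -> str:
    """Flag every scheduled workflow that has not run in the last 7 days."""
    now = now or datetime.utcnow()
    lines = []
    for wf in workflows:
        last = last_run_at(wf)
        if last is None or now - last > STALE_AFTER:
            lines.append(f"STALE: {wf} (last run: {last or 'never'})")
        else:
            lines.append(f"OK: {wf} (last run: {last:%Y-%m-%d %H:%M})")
    # Send this summary to a team alias with at least two members, never an
    # individual inbox: the auth cascade stayed silent for four days because
    # every error notification went to a deleted account.
    return "\n".join(lines)
```

Run on the Monday after the offboarding, this check reports every one of the 14 workflows as stale, which is the 24-hour detection the section describes.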
Incident 06: The $47k cost explosion. The three controls that would have converted catastrophe into a caught anomaly.
This incident combines three compounding failure modes and produces numbers large enough to generate immediate organizational trauma.
What happened: A new multi-agent system was deployed to production after testing. The testing environment used GPT-4o-mini for all tasks. A single configuration variable — model_tier — was not updated during the production deployment. Production defaulted to GPT-4o for every task. In the 72 hours before the cost spike was detected, the system processed $47,000 in API calls — approximately 14x the monthly budget.
Why it wasn't caught:
No cost monitoring: The team had API cost visibility at the monthly billing level only. There were no daily or hourly alerts. By the time costs were reviewed, the incident was already 72 hours old.
No per-session budget: There was no hard ceiling on token spend per session enforced at the infrastructure level. Individual sessions ran uncapped.
No environment parity check: The deployment pipeline had no automated verification that production configuration matched intended values. The configuration drift between test and production was not caught before rollout.
The three controls that would have prevented this — in priority order:
Control 1: Daily cost alerts at 50%, 80%, and 100% of budget. This converts a 72-hour detection gap into same-day detection. The specific thresholds matter less than the existence of the alert. This is a 30-minute setup task in any major cloud provider and most LLM API dashboards.
Control 2: Per-session token budget enforced at the gateway layer, not the prompt layer. Prompts can be overridden by the model; gateway limits cannot. Set the per-session limit at 3x your expected maximum session cost. Anything above that is either a runaway session or a configuration error.
Control 3: Pre-deployment configuration diff. Before any production deployment, automatically compare production configuration against staging and require explicit sign-off on any differing value. This is a script, not a process — it takes 30 minutes to build and prevents a $47,000 incident.
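Control 3 really is a 30-minute script. The sketch below is one possible shape, assuming JSON configuration files per environment; the file paths and key names are illustrative, with `model_tier` standing in for the single key that caused this incident.

```python
import json

def config_diff(staging: dict, production: dict) -> dict:
    """Return every key whose value differs between the two environments."""
    keys = staging.keys() | production.keys()
    return {
        k: {"staging": staging.get(k, "<missing>"),
            "production": production.get(k, "<missing>")}
        for k in keys
        if staging.get(k) != production.get(k)
    }

def gate(staging_path: str, production_path: str) -> int:
    """Pipeline step: exit nonzero on any drift so rollout blocks for sign-off."""
    with open(staging_path) as f1, open(production_path) as f2:
        diff = config_diff(json.load(f1), json.load(f2))
    if diff:
        print("Configuration drift detected; explicit sign-off required:")
        print(json.dumps(diff, indent=2))
        return 1
    return 0
```

Against the configuration from this incident, the diff contains exactly one entry, and the deployment stops until someone explicitly approves running production on GPT-4o.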
The pattern this incident reveals: Cost explosions almost always involve a configuration gap (wrong model, wrong parameters), a missing ceiling (no per-session budget), and a detection delay (no real-time alerting). All three are required for the incident to reach the numbers that cause organizational damage. Fixing any one of the three converts a catastrophic incident into a caught-and-corrected anomaly. Fixing all three means the incident class cannot reach organizational-damage scale even if the triggering configuration error still occurs.
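Control 2, the gateway-enforced ceiling, can be sketched as a thin wrapper around whatever LLM client you use. This is an assumption-laden illustration: `call_model` is a placeholder for your client, and the response is assumed to expose a token count under `usage.total_tokens`, which you should adapt to your provider's actual response shape.

```python
class SessionBudgetExceeded(Exception):
    pass

class BudgetedGateway:
    """Per-session token ceiling enforced outside the prompt, where the
    model cannot override it. Set the cap at ~3x expected maximum session cost."""

    def __init__(self, call_model, max_session_tokens: int):
        self._call_model = call_model
        self._max = max_session_tokens
        self._used = 0

    def complete(self, prompt: str, **kwargs):
        if self._used >= self._max:
            raise SessionBudgetExceeded(
                f"session spent {self._used} tokens (cap {self._max}); "
                "likely a runaway session or a configuration error"
            )
        response = self._call_model(prompt, **kwargs)
        self._used += response["usage"]["total_tokens"]
        return response
```

With this in place, a session misconfigured onto the expensive model hits the ceiling and stops, converting a 72-hour burn into a single caught exception.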
Incident 07: The orchestration deadlock. Two agents waiting on each other — why this incident class will increase in 2026.
Orchestration deadlocks are the most technically obscure incident class in this report, and the one most likely to affect teams building multi-agent systems in 2026. The pattern is subtle enough that teams often misdiagnose it as a performance problem or an LLM quality issue before the root cause becomes clear.
What happened: A planner-executor-reviewer architecture was deployed to production. The planner agent decomposed tasks and assigned them to executor agents. The reviewer agent evaluated executor outputs and could request revisions.
Sessions 1–200: System performed as designed. Planner → Executor → Reviewer → Complete. Average 8–12 tool calls per session.
Session 201+: A specific task type — multi-document synthesis — began generating revision requests from the reviewer that the executor couldn't satisfy with its current tool access. The executor would revise. The reviewer would reject with slightly different feedback. The executor would revise again. Sessions began running 40, 80, 120+ tool calls without completing.
Day 4: Three concurrent sessions hit the context window limit during the revision loop. The system did not degrade gracefully — it produced incomplete outputs while consuming full token budgets. Cost for the day: 4x baseline. Customer-facing output quality: sharply degraded.
Trigger: A specific task type that exceeded the executor's tool-access boundary.
Proximate cause: No loop detection on the planner-executor-reviewer handoff. The reviewer could reject indefinitely without an escalation path. The executor had no mechanism to report that the reviewer's requirements exceeded its capabilities.
Systemic gap: No maximum revision count per task. No reviewer-to-escalation path when a task cannot be completed within defined tool boundaries. No test coverage for tasks near the boundary of executor capability.
What fixed it: Three changes. First: a hard maximum of 3 revision cycles per task, after which the task escalates to a human. Second: the reviewer was explicitly scoped to evaluate quality within the executor's defined tool access — if a quality improvement requires a capability the executor doesn't have, that is an escalation, not a revision request. Third: all executor capability boundaries were documented and added to the test suite as explicit boundary condition tests.
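The first of those three changes, the hard revision cap with escalation, fits in a few lines of orchestration code. This sketch is illustrative: `execute` and `review` stand in for your executor and reviewer agents, and `review` is assumed to return an `(approved, feedback)` pair.

```python
MAX_REVISIONS = 3  # the hard maximum from the fix above

class EscalateToHuman(Exception):
    """Raised when the reviewer's requirements exceed the executor's tools."""

def run_task(task, execute, review):
    """Planner-side loop: at most 3 revision cycles, then escalate."""
    output = execute(task, feedback=None)
    for cycle in range(MAX_REVISIONS + 1):
        approved, feedback = review(output)
        if approved:
            return output
        if cycle == MAX_REVISIONS:
            break  # still rejecting after 3 revisions: this is an escalation,
                   # not another revision request
        output = execute(task, feedback=feedback)
    raise EscalateToHuman(f"task {task!r} exceeded {MAX_REVISIONS} revision cycles")
```

The multi-document synthesis sessions that ran to 120+ tool calls would instead stop after four executor passes and land in a human queue with the reviewer's last feedback attached.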
Why this incident class will increase in 2026: As teams move from single-agent to multi-agent systems, the planner-executor-reviewer pattern is becoming the dominant architecture. Every team adopting it will eventually encounter a task type that falls into the gap between executor capabilities and reviewer requirements. The teams that have already defined their escalation protocol before that task type arrives will handle it in minutes. The teams that haven't will spend days debugging what looks like a model quality problem but is actually a missing architectural constraint.
Tabletop exercise script: the complete bulk-send scenario with facilitator notes, inject events, and debrief questions — run this with your team before the next launch.
A tabletop exercise is a structured walk-through of an incident scenario with the people who will actually be involved in a real incident response. It takes 90 minutes. It surfaces more gaps than any audit, because it forces the people who own the process to explain it out loud — and what people say they will do in a high-pressure situation is usually different from what the documentation says they should do.
This is the exact scenario script for the bulk-send incident class. It is structured for a team of 3–6 people and includes facilitator notes, inject events, and debrief questions.
Before the exercise:
Send participants the scenario summary 24 hours before: 'We are running a tabletop exercise on a bulk-send automation failure. No technical knowledge is required. We will walk through a scenario, ask what we would do at each decision point, and identify gaps.'
Assign roles before the exercise starts: Incident Commander (the person who will coordinate the response), Technical Lead (the person who will diagnose and fix), Communications Lead (the person who will talk to affected customers and internal stakeholders), and Observer (takes notes on gaps identified, does not participate in the scenario response).
The scenario — read aloud by the facilitator:
'It is Friday at 2:47 PM. An email arrives from a customer saying they received 6 identical emails from your company in the last 30 minutes. You search your inbox and find two more similar complaints. You do not yet know the scope of the problem.'
Inject 1 (pause and discuss): 'What do you do in the next 5 minutes? Who do you call? What do you check first?'
Facilitator note: The correct answer includes: (1) immediately check the automation run history to understand what triggered and how many times, (2) check whether the send is still running or has completed, (3) assign someone to draft a holding response for customer complaints while the scope is assessed. Teams that debate in the first 5 minutes instead of acting are revealing a gap in incident command clarity.
Inject 2 (after 10-minute discussion): 'You check the run history. A team member uploaded a CSV of 850 records 45 minutes ago. Your logs show 851 workflow executions — including one from a header row. The workflow appears to have completed. How many customers were affected and how do you find out?'
Facilitator note: Teams without a deduplication log will not be able to answer this question quickly. Document whether the team has a log of which records were processed in each bulk run. If not, flag as a gap.
Inject 3 (after 10-minute discussion): 'You determine 847 unique customers received the email. 6 customers received it multiple times. Your CEO is asking for a public statement within the hour. What does it say, and who approves it?'
Facilitator note: Watch for communication ownership gaps. Is there a named person who owns customer-facing incident communications? Is there an approval chain that can move in under an hour?
Debrief questions (facilitator reads each, team discusses):
Why the last question matters: Tabletop exercises that produce no specific, assigned action items have a near-zero impact on incident prevention. The entire value of the exercise is in the gaps it surfaces and the specific changes that follow. If the exercise ends with 'that was useful' but no named owner for a named change with a named deadline, the exercise will not prevent the incident class it was designed to address.
Every claim in this report traces to a verifiable source.
Who wrote this, what evidence shaped it, and how the recommendations are framed.
Author: Rare Agent Work · Written and maintained by the Rare Agent Work research team.
Covers 8 incident categories: bulk send, auth cascade, memory loop, cost explosion, MCP poisoning, prompt injection via retrieval, orchestration deadlock, and rollback failure.
Each post-mortem includes a root cause tree, timeline reconstruction, blast radius assessment, and the specific governance change that would have prevented it.
Includes two fill-in-the-blank incident response templates ready for use in actual incidents.