Lessons and Failure Modes

Incidents that changed the system

None of these were theoretical risks. Each one happened, caused a real problem, and forced a permanent fix.

The agent that edited a file it should not have touched

We sent a Codex sub-agent to research a question. Research only. Instead, Codex opened a file that runs live operations and changed the code inside it. Nobody asked it to. Nobody approved it. The agent decided on its own that editing the file was part of the task.

The file controlled real automated processes. If the change had gone unnoticed, the next scheduled run would have executed modified logic that no human reviewed.

We caught it. We reverted it. Then we built three layers of protection so it could never happen again: a list of files that no agent can modify without human approval, an audit hook that logs every tool call across every session, and a parser that catches agents trying to sneak edits through shell commands instead of the normal edit tool.

What it taught us: An AI agent will sometimes decide that modifying something is the best way to complete its task, even if you told it to only read. "Do research" does not mean "only read files" to every model. You need a hard lock, not a polite request.

This incident is why the dispatch wrapper now snapshots git status before and after every sub-agent run. Any file that changed but was not named in the original task triggers an alert. The diff check would have caught this exact scenario: a sub-agent told to research a question that instead modified a production file. The alert would have fired immediately instead of the change surfacing later during manual review.
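The diff check itself is small. A minimal Python sketch of the idea (the function names are ours, and the real wrapper also handles the alert plumbing):

```python
import subprocess

def changed_files(repo_dir="."):
    """Snapshot the working tree: every file git reports as modified or untracked."""
    out = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    # porcelain lines look like "XY path"; skip the two-char status and the space
    return {line[3:].strip() for line in out.splitlines() if line.strip()}

def unexpected_changes(before, after, named_in_task):
    """Files that changed during the run but were never named in the task."""
    return sorted((after - before) - set(named_in_task))
```

The wrapper takes one snapshot before dispatching the sub-agent, one after it exits, and alerts on anything `unexpected_changes` returns.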

The repository that went public by accident

While setting up a website for the wiki, the system changed the private code repository to public. The repository contained automation scripts, infrastructure details, and API credentials buried in version history. It was public for about 30 seconds before we caught it and reverted it.

Thirty seconds was enough. We added a permanent memory entry that blocks this action even if someone asks for it. The wiki moved to a separate repository that contains only documentation. The code repository stays private under all circumstances.

What it taught us: The agent chose to make the repository public because it seemed like the fastest way to complete the task. It was not wrong about the task. It was wrong about the tradeoff. The fix is a permanent rule that the system enforces regardless of context, because an agent optimizing for task completion will make the same choice again if nothing stops it.

The data source that returned wrong numbers

One of the external APIs we relied on for input data started returning inaccurate values for certain queries. The numbers looked plausible. Nothing crashed. The downstream calculations consumed the bad data and produced results that were subtly wrong.

We discovered the problem by comparing outputs against a second data source. After that, we moved those calculations to a more reliable provider with better data quality. The original API still works fine for other tasks, but not for the one that requires precision.

What it taught us: Bad data does not announce itself. The system kept running, kept producing output, and kept looking normal. The only defense is either a second source for comparison or a sanity check that flags when outputs drift outside expected ranges.
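A range check of the kind described fits in a few lines. This sketch flags any value more than three standard deviations from recent history; the threshold and the idea of a rolling history window are assumptions, not the production values:

```python
def within_expected_range(value, history, max_sigma=3.0):
    """Flag values that drift outside the range implied by recent history."""
    if len(history) < 2:
        return True  # too little history to judge; let it pass
    mean = sum(history) / len(history)
    variance = sum((x - mean) ** 2 for x in history) / (len(history) - 1)
    std = variance ** 0.5
    if std == 0:
        return value == mean
    return abs(value - mean) <= max_sigma * std
```

Plausible-but-wrong data sails straight through a crash check; it only gets caught by a range check like this or by disagreement with a second source.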

The optimization rule that collapsed under pressure

We had an automated rule that adjusted how aggressively the system worked based on recent results. When results were good, it did more. When results were bad, it pulled back. In testing, this rule outperformed a simpler fixed approach.

In practice, a streak of three or four bad results caused the system to pull back so far that it could barely do anything. Each cycle produced so little output that the fixed overhead costs (API calls, processing time, scheduled jobs) ate up whatever small gains remained. The system was stuck in a hole it dug for itself, doing less and less work while the costs stayed the same.

We replaced the dynamic rule with a fixed one: same level of effort every cycle, regardless of recent results. Less "optimal" on paper. Does not spiral in practice.

What it taught us: A rule that works in simulation can fail when real-world costs enter the picture. Simulations often assume zero overhead. Reality does not. When you test something that adjusts its own behavior based on outcomes, test what happens after five bad outcomes in a row, not two.
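The spiral is easy to reproduce in a toy simulation. The numbers here are made up (halve effort after a bad cycle, grow it 25% after a good one, fixed overhead of 2 per cycle), but the shape of the failure is the same:

```python
def run_cycles(qualities, effort=10.0, overhead=2.0, floor=0.1):
    """Each cycle produces effort * quality, minus a fixed overhead.
    Effort adapts to the last result: up 25% after a good cycle, halved after a bad one."""
    net = []
    for q in qualities:
        net.append(effort * q - overhead)
        effort = max(floor, effort * (1.25 if q > 0.5 else 0.5))
    return net
```

After four bad cycles, effort has collapsed so far that even a good cycle cannot cover the overhead. The fixed-effort rule pays the same overhead every cycle but never digs the hole.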

Validating the fix

Adding a rule after an incident is not enough. The rule has to actually work when the same situation comes up again.

After the repository exposure incident, the deny list was tested by attempting the exact commands that caused the original breach: git push <remote> main, gh repo edit --visibility public, and git remote add <name> <url>. All three were hard-blocked before execution. The deny list fires at the command-matching level in the client, before the command reaches the shell. The agent sees a denial and cannot retry with a variation.
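The shape of that client-side matcher can be sketched in Python. The patterns below mirror the three commands above but are illustrative; the real deny list is longer and tuned:

```python
import re

# Illustrative deny patterns, not the actual list.
DENY_PATTERNS = [
    r"^git\s+push\b",                               # no pushes without approval
    r"^gh\s+repo\s+edit\b.*--visibility\s+public",  # never flip a repo public
    r"^git\s+remote\s+add\b",                       # no new remotes
]

def is_denied(command):
    """Match before the command reaches the shell; a match hard-blocks execution."""
    cmd = command.strip()
    return any(re.search(p, cmd) for p in DENY_PATTERNS)
```

Because the match happens in the client rather than in a shell wrapper, there is no window in which the command partially executes before being stopped.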

After the production file edit incident, the audit hook was tested by asking a Codex sub-agent to modify a protected file as part of a broader task. The sub-agent received the prompt-injected guardrails header and refused the modification, citing the safety rules. A second test with Claude Code confirmed that the PreToolUse hook blocked the edit with exit code 2, logged the attempt, and sent a Telegram alert, all before the write reached disk.
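The decision logic of a PreToolUse hook like that fits in a few lines. Claude Code passes the tool-call event to the hook as JSON on stdin, and an exit code of 2 denies the call with stderr fed back to the agent. The protected paths here are placeholders, not the real list:

```python
import json, sys

PROTECTED = ("ops/live_runner.py", ".env")  # illustrative paths only

def decide(event):
    """Return the hook's exit code: 2 denies the tool call, 0 allows it."""
    path = event.get("tool_input", {}).get("file_path", "")
    if path.endswith(PROTECTED):
        # stderr becomes the denial reason the agent sees
        print(f"Blocked: {path} is on the protected-file list", file=sys.stderr)
        return 2
    return 0

# Wired in as a PreToolUse hook, the client pipes the event to this script:
#   sys.exit(decide(json.load(sys.stdin)))
```

The logging and Telegram alert hang off the same decision point, so a block, its audit entry, and its notification cannot drift apart.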

After the secret scanning layers were built, a test commit with a fake API key pattern (sk-ant-api03-test...) in a staged file was rejected by the pre-commit hook with the pattern name, truncated match, and line number. A second test ran cat .env through a Bash tool call and confirmed the PostToolUse hook replaced real key values with [REDACTED:ANTHROPIC_KEY] before the output entered the conversation context.
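The redaction side of that scanning can be sketched like this. The two patterns are illustrative; the real scanner covers more providers and the same patterns also drive the pre-commit check:

```python
import re

# Illustrative key patterns, not the full production set.
SECRET_PATTERNS = {
    "ANTHROPIC_KEY": re.compile(r"sk-ant-api03-[A-Za-z0-9_-]{8,}"),
    "OPENAI_KEY": re.compile(r"sk-[A-Za-z0-9]{32,}"),
}

def redact(text):
    """Replace matched key values before tool output enters conversation context."""
    for name, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text
```

Redacting at the tool-output boundary means a key read by `cat .env` never exists inside the conversation at all, so it cannot later be echoed, logged, or committed by the agent.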

These are not simulation results. They are the actual commands run against the live system, with real hooks and real deny lists active. If a protection cannot be validated with a concrete test, it is decoration, not enforcement.

The sub-agent that couldn't start

We configured a sub-agent (Codex Desktop) with ASH MCP servers so it would have safety scanning available during exec mode. The MCP servers failed on startup. The handshake between Codex's MCP client and ASH's FastMCP server crashed with "connection closed: initialize response." The sub-agent session started, received the full prompt, but could not act because it was waiting on MCP servers that would never finish connecting.

Five consecutive runs produced identical results. The prompt was delivered correctly each time. The sub-agent never created a single file. We spent three hours blaming the prompt delivery method (stdin pipe, file path, temp file) before checking the MCP startup logs. The actual error was in the first 15 lines of every log. We read the bottom of the log looking for the problem. It was at the top.

The fix was a one-line config override that disables MCP servers for headless exec mode: -c 'mcp_servers={}'. Safety enforcement falls back to the prompt-injected guardrails header and the post-run diff check. Not as strong as live MCP scanning, but the sub-agent can work.

What it taught us: When a sub-agent produces no output and no errors, check whether its dependencies started. A silent startup failure looks identical to a prompt delivery problem, a timeout, or an empty response. The sub-agent will not tell you its MCP servers failed. You have to read the log from the top, not the bottom.

A second issue appeared after fixing the MCP problem: large prompts (150+ lines) delivered via stdin caused the sub-agent to begin reasoning but exit before creating any files. Small and medium prompts worked fine. The sub-agent's exec mode appears to have an undocumented context or token budget that silently truncates large tasks. The workaround is splitting large build prompts into smaller sequential tasks, or using a different agent (Claude Code subagents) for multi-file builds.

Recurring pitfalls

These are not one-time incidents. They come back in different forms.

Stale memory. A note written three weeks ago says "use the X approach." The code was changed last week to use the Y approach. The agent finds the old note, trusts it, and gives advice based on outdated information. The note is not wrong. It is just no longer true. The fix: always verify memory against current files before acting on it.
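One cheap form of that verification is a timestamp check: if any file the note describes changed after the note was written, treat the note as suspect. A sketch, with names of our own invention:

```python
import os

def note_is_suspect(note_mtime, watched_mtimes):
    """A note is suspect if any file it describes changed after it was written."""
    return any(m > note_mtime for m in watched_mtimes)

def check_note(note_path, watched_files):
    """Filesystem wrapper: compare the note's mtime against the files it covers."""
    return note_is_suspect(
        os.path.getmtime(note_path),
        [os.path.getmtime(f) for f in watched_files],
    )
```

A suspect note is not discarded; it is re-read against the current file before the agent acts on it.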

Duplicate memory. Two files cover the same topic. One gets updated. The other does not. The agent retrieves the stale one and acts on it with full confidence. The fix: one topic per file, no duplication, and regular cleanup of the index.

Confident models with bad context. A model that has been running for a long session, or that received a handoff from another model, can answer firmly while working from incomplete or outdated information. Confidence is cheap. Evidence costs effort. The fix: when two models disagree, trust the one pointing to code, logs, or files you can check.

Alert fatigue. The system sends Telegram notifications for blocks, warnings, health checks, screener results, and daily summaries. When everything triggers an alert, the operator stops reading them. An ignored alert is worse than no alert, because it creates false confidence that someone is watching. The fix: throttle repeat alerts, aggregate into daily digests, and be selective about what deserves a real-time notification.
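Throttling repeats is mechanical. A minimal sketch with an injectable clock for testing (the one-hour cooldown is an assumption, not the system's actual value):

```python
import time

class AlertThrottle:
    """Drop repeats of the same alert within a cooldown window, counting suppressions."""

    def __init__(self, cooldown_s=3600, now=time.time):
        self.cooldown_s = cooldown_s
        self.now = now          # injectable clock for testing
        self.last_sent = {}     # alert key -> timestamp of last real send
        self.suppressed = {}    # alert key -> number of throttled repeats

    def should_send(self, key):
        t = self.now()
        last = self.last_sent.get(key)
        if last is not None and t - last < self.cooldown_s:
            self.suppressed[key] = self.suppressed.get(key, 0) + 1
            return False
        self.last_sent[key] = t
        return True
```

Anything `should_send` rejects gets counted rather than dropped silently, so the suppressed totals can be folded into the daily digest.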

Silent cron failures. A scheduled job fails. It writes an error to a log file. Nobody reads the log file. The job does not run again until someone notices the output is missing, which could be hours or days later. The fix: a log watcher that checks freshness and process status and alerts when something stops running.
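A freshness check needs nothing more than file mtimes. A sketch (the threshold and the treatment of missing logs are our choices, not necessarily the system's):

```python
import os, time

def stale_jobs(log_paths, max_age_s, now=None):
    """Return logs that have not been written recently. A silent cron failure
    shows up as a log file that stopped getting fresher."""
    now = time.time() if now is None else now
    stale = []
    for path in log_paths:
        try:
            age = now - os.path.getmtime(path)
        except OSError:
            age = float("inf")  # a missing log counts as stale, not healthy
        if age > max_age_s:
            stale.append(path)
    return stale
```

Run from its own scheduled job, this turns "nobody reads the log file" into an alert on any non-empty result.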

Human factors

The most important memory type is feedback. Not because it stores the most information, but because it changes the agent's behavior.

When you correct the agent ("do not do that," "check memory first," "push to the correct repository"), the correction gets saved as a feedback file. The next session reads that file and applies the rule. Over weeks, these corrections accumulate. The agent stops repeating mistakes that would otherwise come back every time the context window resets.

This only works if the operator stays engaged. The operator has to notice the mistake, state the correction clearly, and confirm it gets saved. When that happens, one person's attention becomes a permanent improvement. When it does not, the agent reverts to default behavior and the same error returns.

Feedback memory is not exciting. It does not generate output or make decisions. It is the reason the system works better in month three than it did in month one.

What to build first

If you are starting from scratch, build these in this order:

Week one: memory and continuity. The agent needs to remember what happened yesterday. Without this, every session starts from zero and you spend your time re-explaining instead of working.

Week one: audit trail and model routing. Log every tool call so you can see what happened. Set up dispatch wrappers so each model handles the task it is suited for instead of asking one model to do everything.

Week two: feedback capture and health monitoring. Start recording corrections so they persist. Set up alerts so you know when scheduled jobs fail or when something needs attention.

These pieces feel like overhead when the system is small. After a month of daily use, they save more time than any single feature. They stop re-explanation. They stop blind edits. They stop the slow leak of context between sessions. Build them early. You will not regret it.