# How We Got Here
The harness was not designed. It was forced into existence by a series of problems that got worse until they were fixed. Every piece of infrastructure described in this section exists because something broke badly enough that leaving it unfixed was no longer an option.
## Phase 1: Chatbot with scripts
The system started as Claude Code with a pile of Python scripts. Ask it a question, get an answer, run a script manually, check the output. It worked well enough for research and one-off tasks.
The problems showed up fast:
- Every new session started from zero. The agent had no idea what happened yesterday. You had to re-explain context, re-state preferences, and re-describe the system layout every time.
- Scripts ran manually. If you forgot to run a scheduled task, it didn't run. If you forgot to check a critical monitor, nobody checked.
- The agent would give different advice in the afternoon than it gave in the morning because it had no record of the morning conversation.
At this point, the "system" was a person babysitting scripts and an AI that couldn't remember anything.
## Phase 2: Automation pressure
The system needed things done on a schedule. Automated tasks had to run before the day started. Critical daemons had to check live state every minute. Health checks had to verify that everything was still running. Alerts had to fire when something went wrong.
Cron jobs solved the scheduling problem. But cron jobs created new problems:
- An automated task ran, but nobody reviewed the output until hours later.
- Monitors crashed silently, and nobody noticed until the operational state they were supposed to manage had drifted.
- Logs accumulated without rotation. State files grew without cleanup.
- Three different scripts all had their own Telegram alert logic, each with different formatting and error handling.
The agent could help with all of this, but only during active conversations. Between sessions, the system was on autopilot with no one watching.
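The duplicated Telegram alert logic is the kind of thing that eventually gets consolidated into one shared helper. A minimal sketch of what such a helper could look like, assuming the standard Telegram Bot API `sendMessage` endpoint (the module layout and function names are illustrative, not the actual codebase):

```python
import json
import urllib.request

# Standard Telegram Bot API endpoint; the token and chat_id come from config.
TELEGRAM_API = "https://api.telegram.org/bot{token}/sendMessage"

def format_alert(level: str, source: str, message: str) -> str:
    """One canonical format so alerts from every script look the same."""
    return f"[{level.upper()}] {source}: {message}"

def send_alert(token: str, chat_id: str, level: str, source: str, message: str) -> bool:
    """Send a formatted alert; swallow network errors so a failed alert
    never crashes the script that raised it."""
    payload = json.dumps(
        {"chat_id": chat_id, "text": format_alert(level, source, message)}
    ).encode()
    req = urllib.request.Request(
        TELEGRAM_API.format(token=token),
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except OSError:
        return False  # degraded alerting, but the caller keeps running
```

With one helper, every script alerts through the same format and the same error handling, instead of three divergent copies.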
Memory became the first real infrastructure investment. A file-based system where the agent could read what happened in prior sessions, pick up context from a thread tracker, and know what the current priorities were. Not because it was elegant, but because re-explaining everything every session was wasting hours a week.
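A file-based memory of this shape can be very small. The sketch below assumes a hypothetical layout (a `memory/` directory with per-session notes and a `threads.md` tracker); the actual file names and structure are not specified by the source:

```python
from pathlib import Path

MEMORY_DIR = Path("memory")           # hypothetical layout, not the real repo
SESSIONS = MEMORY_DIR / "sessions"    # one markdown note per session
THREADS = MEMORY_DIR / "threads.md"   # tracker for open threads / priorities

def load_context(max_sessions: int = 3) -> str:
    """Concatenate the thread tracker and the last few session notes into
    one context block handed to the agent at session start."""
    parts = []
    if THREADS.exists():
        parts.append(THREADS.read_text())
    if SESSIONS.exists():
        # Filenames sort chronologically if they are date-prefixed.
        for note in sorted(SESSIONS.glob("*.md"))[-max_sessions:]:
            parts.append(note.read_text())
    return "\n\n---\n\n".join(parts)
```

The point is not the code but the contract: every session begins by reading this block, so yesterday's context survives the session boundary for free.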
## Phase 3: The incidents that forced structure
Three incidents turned the system from "scripts with memory" into a hardened harness.
The Codex incident. A Codex sub-agent was sent to do research on a domain question. Instead of researching, it autonomously edited a production file that handles real operations. The change had to be reverted. After that, protected file lists existed. Then bash bypass detection. Then the full audit hook. One unauthorized edit created an entire security layer.
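The protected-file check plus bash bypass detection can be sketched as a single audit hook over tool calls. Everything here is illustrative (the protected paths, the tool names, and the tokenization heuristic are assumptions, not the actual hook):

```python
import shlex

# Hypothetical protected paths; the real list would live in config.
PROTECTED = {"ops/production.py", "config/live_settings.yaml"}

def check_tool_call(tool: str, args: dict) -> tuple[bool, str]:
    """Return (allowed, reason). Blocks direct edits to protected files,
    and catches bash commands that reference them indirectly (sed, mv,
    redirection, etc.) -- the 'bash bypass' case."""
    if tool in {"edit", "write"} and args.get("path") in PROTECTED:
        return False, f"blocked: {args['path']} is protected"
    if tool == "bash":
        tokens = shlex.split(args.get("command", ""))
        hit = next((t for t in tokens if t in PROTECTED), None)
        if hit is not None:
            return False, f"blocked: bash command references protected file {hit}"
    return True, "ok"
```

Token matching like this is deliberately crude; a real hook would normalize paths and inspect more shell constructs, but even the crude version would have stopped the Codex edit.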
The public repo. While setting up GitHub Pages for the wiki, the private workspace repository was accidentally made public. It contained proprietary workflows, API credentials in git history, and infrastructure details. It was reverted within 30 seconds, but the rule became permanent: the workspace repo never goes public under any circumstances. A separate public repo was created for the wiki, and a memory file was written to block the mistake from happening again.
The allocation failure. A theoretically optimal allocation rule looked strong in simulation and failed in live dry runs. A few consecutive losses shrank task size so aggressively that recovery became mathematically impractical once realistic friction entered the picture. A fixed-fraction rule replaced it. The lesson was not about one formula. It was that a good simulation is not the same as a good live system, and the harness needed to encode that kind of hard-won knowledge in memory so the mistake would never be repeated.
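The failure mode can be made concrete with a toy comparison. The formulas and numbers below are illustrative only, not the actual allocation rules; they just show why multiplicative shrinking interacts badly with fixed friction:

```python
def aggressive_size(bankroll: float, consecutive_losses: int,
                    base: float = 0.10, shrink: float = 0.5) -> float:
    """Toy version of a rule that halves task size after each loss."""
    return bankroll * base * (shrink ** consecutive_losses)

def fixed_fraction_size(bankroll: float, base: float = 0.02) -> float:
    """The replacement style: always risk the same fraction of bankroll."""
    return bankroll * base

# After 4 consecutive losses on a 1000-unit bankroll, the aggressive rule
# sizes tasks at 1000 * 0.10 * 0.5**4 = 6.25 units. If friction costs a
# flat 5 units per task, almost the entire position is consumed by costs,
# so expected recovery per task approaches zero -- the simulation never
# modeled that floor. The fixed-fraction rule keeps size proportional to
# bankroll and stays comfortably above the friction floor.
```

This is the sense in which recovery became "mathematically impractical": the rule drove size below the level where realistic per-task costs could be overcome.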
None of these caused lasting damage. Each one was caught quickly. But each one exposed a gap that manual vigilance could not cover forever. The fixes became permanent infrastructure, built from real problems rather than planning documents.
## Phase 4: What became general-purpose
Somewhere during the third round of infrastructure fixes, the harness stopped being specific to its original domain. The patterns worked for any long-running agent system:
- Persistent memory that survives session boundaries
- Multi-model routing based on task type and cost
- Audit trails that log every tool call and block dangerous ones
- Scheduled automation with health monitoring and AI-powered triage
- A development pipeline that requires evidence before claiming completion
- Feedback loops where corrections accumulate over time instead of being forgotten
The original use case drove the requirements. The harness itself transfers to any domain. The rest of this wiki shows how.