If your AI agent keeps failing, read this
Learn how harness engineering uses deterministic scripts, guardrails, and state management to turn AI coding agents into production systems that ship real code.
Most AI coding agents can write impressive demos. Few can ship production code without breaking everything around it. The difference is harness engineering: the discipline of building systems that make AI agents reliable.
Here is how I used it to ship 100+ PRs/month at Amazon
Get the free AI Agent Building Blocks ebook when you subscribe:
I’m Fran. I’m a software engineer at Amazon during the day, and I write and experiment with AI during the night.
I want to tell you about the moment I realized that prompting alone would never work for production AI agents.
I was working on an automation project at Amazon. The goal was simple: update large JSON configuration files automatically based on requirements. These configs were thousands of lines long, and the updates followed predictable patterns.
A perfect job for an AI agent, right? That’s what everyone thought. Engineers on the team opened their AI-powered IDE or CLI, typed their prompts to modify the JSONs, and watched the LLM struggle to modify the target node correctly.
It failed to implement the changes properly. Every single time.
The model wasn’t broken. We were on Opus 4.6 with a one-million-token context window.
The context window was a problem. When you feed multiple 10,000-line JSON files into an LLM, the model loses track of the surrounding structure. It edits what you asked it to edit, but it quietly breaks everything around it. No error message. No warning. Just a structurally invalid file that passes a surface-level glance but fails in production.
This is not a model quality problem. It is an environment problem. And the fix is not a better prompt.
You may think the fix is for Anthropic to release a 10M-token context window, but we already know that bigger context windows still degrade after 100k or 200k tokens.
The real fix is a harness.
Harness engineering is the discipline that turned my broken prototype at Amazon into a system that now ships over 100 PRs per month. Fully autonomous.
I wrote a 10-step guide to build that agent in this previous post:
In this post, you’ll learn
What harness engineering is and how it differs from prompt engineering, context engineering, and agent engineering
Why AI agents fail on large structured files like JSON, and how to fix it with deterministic scripts
The four pillars of a production AI harness: state management, context architecture, guardrails, and entropy management
How I built a harness at Amazon that ships 100+ PRs/month without human intervention
The mindset shift that separates engineers who demo AI from engineers who deploy it
Why AI Agents Fail on Large Files
Most engineers today interact with AI coding tools the same way: open an IDE, type a prompt, review the output, repeat. For small files and isolated tasks, this works beautifully. But the moment the problem involves a large number of files, the whole approach falls apart.
Large Language Models are probabilistic engines. They predict the next token based on patterns in their context window. When the context window is filled with thousands of lines of structured data, the model’s attention gets diluted. It correctly identifies the node you want to modify, but it loses track of sibling keys, nested brackets, and structural integrity. The result is a file that looks right at the point of change but is broken somewhere else.
We have to understand that the context window isn’t the same as context attention. As a human, I can store hundreds of items in a storage unit, but I’ll only remember a fraction of what’s in there.
Same with LLMs. Performance degrades as the context window fills up, and costs grow with it.
Did you know that every message you send resends the entire previous conversation in the API call? Yes, you’re billed for those past messages too. The servers in the cloud keep no conversational state; at most they keep a cache.
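The growth is easy to see in a toy sketch (the `call_api` stub below is a placeholder, not a real SDK call): the client keeps the history and ships all of it on every request.

```python
# Minimal sketch of why chat APIs bill you for prior turns: the client holds
# the history and sends ALL of it on every call. `call_api` is a stand-in.

conversation = []

def send(user_message, call_api=lambda messages: f"echo: {messages[-1]['content']}"):
    """Append the new turn, then send the ENTIRE history to the API."""
    conversation.append({"role": "user", "content": user_message})
    reply = call_api(conversation)  # the full list goes over the wire each time
    conversation.append({"role": "assistant", "content": reply})
    return len(conversation)        # payload grows with every turn

send("first question")
send("follow-up")
print(len(conversation))  # 4: the second call paid for the first turn again
```

Each call's cost is proportional to everything said so far, which is exactly why a harness that trims or resets context also trims your bill.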
When the model fails to make an update, the instinct is to write a better prompt.
Add more constraints.
Tell the model to “preserve the surrounding structure.”
“Make no mistakes.”
But that is like asking someone to juggle while blindfolded and then giving them more detailed instructions about hand positioning. The problem is not the instructions. The problem is the blindfold.
The context window itself becomes a liability when it’s packed with thousands of lines of repetitive structure. No prompt can fix that.
I covered how to scale AI by setting up guardrails in this post:
What Is Harness Engineering?
Harness engineering is the discipline of designing the systems, architectural constraints, execution environments, and automated feedback loops that wrap around AI agents to make them reliable in production.
The term was first coined by Mitchell Hashimoto, the founder of HashiCorp. The metaphor comes from horse riding. Think of the LLM as a powerful horse. It has raw energy, speed, and strength. But without reins, a saddle, and a bridle, that energy is undirected and potentially destructive (the horse kicks you, the LLM runs a rm -rf, and I don’t know which is worse). The harness allows the rider to direct the horse’s power productively.
To understand where harness engineering fits, here’s how it relates to the other disciplines you’ve probably heard about:
Prompt Engineering → Craft the best input to the model (single request-response interaction).
Context Engineering → Control what the model sees during a whole session (multiple interactions until clearing).
Harness Engineering → Design the environment, tools, guardrails, and feedback loops (multiple sessions).
Agent Engineering → Design the agent’s internal reasoning loop (define specialized agents).
Platform Engineering → Build the infrastructure that manages deployment, scaling, and cloud operations (where agents can run).
Prompt engineering is about what you say to the model.
Context engineering is about what the model sees.
Harness engineering is about the entire world the model operates in. It includes the tools the agent can call, the constraints it cannot violate, the documentation structure it reads, and the automated feedback loops that catch its mistakes before they reach production.
How I Built a Harness That Ships 100+ PRs/Month at Amazon
Let me walk you through the specific problem I solved, because abstract talk about agents only becomes useful when you see them applied to a real constraint.
The problem: We had large JSON configuration files that needed automated, repetitive updates. These files were too big for the LLM’s context window. Every manual update was tedious, error-prone, and time-consuming.
What everyone else tried: Engineers on the team opened their IDEs and started prompting. The LLM would correctly modify the target node, but would fail to identify which other files had to be updated, and it would fail to keep the correct JSON structure. There was no awareness of JSON structural integrity as a hard constraint. Every run was a coin flip. Sometimes it worked. Most times it broke. You can’t trust an AI like this.
The harness approach: Instead of trying to update the prompt, I narrowed the problem to one specific operation: How to read and write into our JSON files. I wasn’t trying to build a general-purpose agent. I built a scoped one. I wrote deterministic Python scripts to handle the actual JSON surgery: read the file, apply a precise modification, validate the structure, write it back. The agent’s only job was to provide the intent, the what, and the where. The script provided the execution guarantee.
The key insight was this: the agent calls the script as a tool. It does not generate JSON directly. It tells the script what to change, and the script changes it with zero ambiguity. This means the AI is the brain that chooses which steps to take, like a CEO setting direction. The AI didn’t have to do the groundwork itself.
I then added a structural validation step as a guardrail. If the resulting JSON is malformed, the agent cannot proceed. It physically cannot ship a broken config. This provides a feedback loop, which is something managers and C-level executives also want when delegating to humans.
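A minimal sketch of what such a script can look like, with illustrative names rather than the actual Amazon tooling: the agent supplies only the location and the new value, and the script refuses to write anything that fails a round-trip parse.

```python
import json

def update_config(file_path, key_path, new_value):
    """Deterministically set one node in a JSON file.
    key_path is a list like ["services", "auth", "timeout_ms"]."""
    with open(file_path) as f:
        data = json.load(f)      # fails loudly if the input is already broken

    node = data
    for key in key_path[:-1]:
        node = node[key]         # a KeyError here means the agent's intent was wrong
    node[key_path[-1]] = new_value

    # Guardrail: round-trip the result before touching disk.
    serialized = json.dumps(data, indent=2)
    json.loads(serialized)       # structural validation; raises on malformed output

    with open(file_path, "w") as f:
        f.write(serialized + "\n")
```

The agent tool-calls `update_config` with the intent; the surgery itself is deterministic, so structural corruption is impossible by construction.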
The result: 100+ PRs per month. Zero structural corruption. Fully autonomous. The system has been running for months, and after a few weeks of tweaking edge cases in the deterministic scripts, the Agent nails the updates.
At some point, we realized the only reason a PR gets rejected is that the requirement was wrong, not because the AI didn’t execute the requirement.
That’s when you know you’re onto something good.
This is what harness engineering looks like in practice. You stop asking the model to do things it’s bad at. You give it the tools for the parts that require precision, you let the agent handle the parts that require judgment, and you instruct it not to jump in to do the job itself.
The Four Pillars of Harness Engineering
My JSON automation project taught me the pattern to build a good AI agent, but the approach is generic. After studying how OpenAI, Anthropic, and other teams have built their own harnesses, I’ve identified four pillars that every production harness needs.
1. State Management
AI agents are stateless by default. Every API call starts with a blank slate. For a task that takes five minutes, this is fine. For a task that spans hours or requires tracking updates across dozens of files, statelessness is a liability. The agent forgets what it did 20 steps ago. It repeats the same mistake in a loop. It loses track of the overall architecture. This “AI amnesia” is the most common failure mode in long-running agent tasks, and it’s why Openclaw got so popular.
A harness solves this by serializing context snapshots and restoring them across sessions. Think of it as save points in a video game. The agent does work, the harness saves a snapshot, and if the agent crashes or hits a rate limit, the harness restores the snapshot and picks up exactly where it left off.
Advanced implementations use structured state objects that persist across runs. There are two main strategies here:
Context Compaction, where the harness continuously summarizes the agent’s history as it approaches the token limit
Context Resets, where the harness clears the window entirely and boots a fresh agent with a structured handoff of artifacts.
Both work. The right choice depends on your task length and coherence requirements.
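A save-point sketch, assuming the state fits in a small JSON file (the file name and state structure below are illustrative, not a standard):

```python
import json
from pathlib import Path

STATE_FILE = Path("agent_state.json")  # illustrative location

def save_snapshot(state):
    """Persist the agent's working state between runs (the 'save point')."""
    STATE_FILE.write_text(json.dumps(state, indent=2))

def load_snapshot():
    """Restore the last snapshot, or start fresh if none exists."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"completed_steps": [], "pending_files": [], "notes": ""}

# A crashed or rate-limited run resumes exactly where it left off:
state = load_snapshot()
state["completed_steps"].append("updated config A")
save_snapshot(state)
```

The same two functions support both strategies: compaction rewrites `notes` into a summary before saving, while a reset boots a fresh agent whose first input is the snapshot itself.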
2. Context Architecture (Progressive Disclosure)
The first agent-friendly codebases I saw produced gigantic AGENTS.md files. This approach fails for the same reason a 500-page employee handbook fails on someone’s first day. The agent gets confused, misses critical rules, and follows outdated instructions that were never cleaned up.
The better approach is progressive disclosure. Give the agent a short table of contents that points to a structured docs/ directory. The agent reads the table of contents first, then navigates to the specific document it needs for the task at hand.
This is the same pattern introduced with the Agent Skills standard. Instead of loading every definition ahead of the user’s first prompt, as early MCP implementations did, let the agent discover them when needed.
The agent gets a map, not an encyclopedia.
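On the harness side, progressive disclosure can be as simple as this sketch, assuming a hypothetical `docs/index.md` plus per-topic files: the agent always gets the map, and only the documents the task mentions.

```python
from pathlib import Path

DOCS = Path("docs")  # hypothetical layout: docs/index.md plus topic files

def build_context(task_keywords):
    """Return the table of contents plus only the docs the task needs."""
    index = (DOCS / "index.md").read_text()  # the short 'map', always included
    relevant = [
        p.read_text()
        for p in sorted(DOCS.glob("*.md"))
        if p.name != "index.md" and any(k in p.stem for k in task_keywords)
    ]
    return "\n\n".join([index] + relevant)   # small, task-scoped context
```

A request about authentication pulls in `docs/auth.md` and nothing else, keeping the window small regardless of how much documentation the repo holds.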
One more thing that is easy to forget: anything the agent cannot access in-context does not exist for it. Your Slack threads, Google Docs, and verbal agreements in meetings… None of that is real to the agent unless provided or instructed to fetch them.
3. Deterministic Guardrails
This is where harness engineering diverges most sharply from prompt engineering. Prompt engineering asks the agent to write clean code or make no mistakes. Harness engineering mechanically enforces it.
You need custom linters, structural tests, and CI jobs that validate architecture before merge.
The agent isn’t “discouraged” or “instructed against” skipping those. The agent is blocked.
If a file exceeds a size limit, the linter rejects it.
If a dependency flows in the wrong direction, the structural test fails.
If the JSON output is malformed, the validation script prevents merging the PR.
The error messages in your custom lints and validations should include remediation instructions. When the agent hits a linter failure, the error message itself tells the agent exactly how to fix the problem. That error message gets injected directly into the agent’s context, creating a tight feedback loop that requires zero human intervention.
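A sketch of such a guardrail, where the limits, messages, and the `docs/config-layout.md` reference are all illustrative: the point is that every failure returns instructions, not just a status code.

```python
import json

def validate_config(path, max_lines=5000):
    """Guardrail check whose error messages tell the agent how to recover.
    Returns None on success, or a remediation string on failure."""
    with open(path) as f:
        text = f.read()
    try:
        json.loads(text)
    except json.JSONDecodeError as e:
        # Remediation goes straight into the agent's context, not a bare exit code.
        return (f"INVALID JSON at line {e.lineno}: {e.msg}. "
                f"Re-run the update script instead of editing {path} by hand.")
    if text.count("\n") + 1 > max_lines:
        return (f"{path} exceeds {max_lines} lines. "
                f"Split it per the structure in docs/config-layout.md.")
    return None  # all checks passed

# In CI, any non-None result is printed and the job exits nonzero,
# which blocks the merge and feeds the message back to the agent.
```

When this runs as a pre-merge job, the returned string is exactly what lands in the agent’s next context window, closing the loop without a human in it.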
This was a lesson from my early attempts at the agent that modifies JSONs. I was using jq commands instead of Python scripts, and jq reported every failure as nothing more than a 0 or 1 exit code. Those outputs are designed for terminals, not for LLMs to recover from.
One more thing worth noting: A “boring” codebase is better for agents. Stable APIs, predictable patterns, and simple architectures are far easier for agents to model than clever abstractions. Every layer of complexity you add to your codebase is a layer the agent has to navigate.
Keep it simple.
4. Entropy Management (Garbage Collection)
This is something most people skip. AI agents replicate patterns, including bad ones. Over time, your codebase accumulates “AI slop”: redundant logic, verbose implementations, subtly hallucinated variables that the model keeps copying because they exist in the context.
Left unchecked, this entropy degrades the entire codebase. People call it context poisoning.
Some people use this as an argument that AI is bad. But whenever I face a bad AI output, instead of judging whether AI is good or bad for the task, I ask myself: how can we make AI work here? The answer is usually adding another harness.
We can have a recurring cleanup agent. Think of it as garbage collection for your repo. For any implementation task, have a separate agent that scans the codebase, looks for drift from your golden principles, and fixes things before raising the PR. You can also execute this kind of agent on a schedule. Because you already designed other harnesses, like having unit tests and linters, you can allow AI to refactor code with confidence.
It is the same concept as a “doc-gardening” agent that scans for stale documentation and updates it. Technical debt is called that because it works like money debt. If you pay it daily, you stay solvent. If you let it accumulate, you end up spending a lot more time later.
The harness should include entropy management from day one, not as an afterthought.
To know where to apply the harnesses, I covered a 3-level framework for AI-assisted coding in this previous post:
The Mindset Shift: From Prompts to Harness Engineering
The biggest change harness engineering requires is not technical. It is mental.
You stop writing prompts. You start designing environments. Your job is neither to write code nor to write the detailed prompts. It is to make the codebase legible to the agent. Every file name, every directory structure, every naming convention, every piece of documentation exists not just for human developers but for the autonomous agents that will read, modify, and extend the codebase at machine speed.
Constraints stop being restrictions and start being multipliers. A custom linter you write once applies to every line of code the agent writes, deterministically, and forever. A structural test you build today catches every future violation automatically. You invest once, and the return compounds with every agent run. That is the leverage good tooling has always given human engineers, and we need it for AI agents.
The engineers shipping the most code right now all converged on this independently. OpenAI’s internal team shipped one million lines of code and 1,500 PRs in five months using this approach. Anthropic has released 52 features in 50 days. My team at Amazon ships 100+ PRs per month. The patterns are the same: narrow the problem, use deterministic scripts at the execution boundary, enforce constraints mechanically, and make the codebase legible to the agent.
Now, to apply these harnesses, you need to know the building blocks of AI agents.
If you want the full guide, let me know your email below, and I’ll send you the free “AI Agents Building Blocks” guide inside the newsletter’s welcome email
To Recap:
What is harness engineering in AI?
Harness engineering is the discipline of designing the systems, constraints, execution environments, and feedback loops that wrap around AI agents to make them reliable in production. Unlike prompt engineering, which focuses on a single model interaction, harness engineering governs the entire agent lifecycle, from state management to automated validation.
How is harness engineering different from prompt engineering?
Prompt engineering crafts the input to the model in a single interaction. Harness engineering designs the entire environment the agent operates in: tools, guardrails, documentation structure, and automated feedback loops. The goal is reliable behavior across thousands of runs, not just one.
Why do AI agents fail on large structured files like JSON?
Large JSON files exceed or crowd out the model’s context window, causing the agent to lose track of the surrounding structure. It may correctly modify the target node but corrupt adjacent keys, producing a broken file. The fix is a deterministic script that handles the file surgery, with the agent only providing the intent.
How do you build a simple AI agent harness?
Start by narrowing the problem to one operation. Write deterministic scripts for the execution step. Wire the agent to tool-call those scripts instead of generating the output directly. Add a validation step that the agent cannot bypass (embed it in scripts if needed!). This three-part loop, intent to deterministic execution to validation, is the minimal viable harness.
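The three-part loop can be sketched end to end; `ask_agent` below is a hypothetical stand-in for a real LLM call, and everything else is deterministic:

```python
# Toy end-to-end sketch of intent -> deterministic execution -> validation.

import json

def ask_agent(requirement):
    """Hypothetical LLM step: returns intent only, never raw JSON."""
    return {"key_path": ["feature_flags", requirement], "new_value": True}

def execute(data, intent):
    """Deterministic execution: apply exactly the requested change."""
    node = data
    for key in intent["key_path"][:-1]:
        node = node.setdefault(key, {})
    node[intent["key_path"][-1]] = intent["new_value"]
    return data

def validate(data):
    """Validation the agent cannot bypass: round-trip before shipping."""
    json.loads(json.dumps(data))
    return data

config = {"feature_flags": {}}
intent = ask_agent("dark_mode")
config = validate(execute(config, intent))
print(config["feature_flags"]["dark_mode"])  # True
```

Swap the stub for a real model call and the shape stays the same: the model chooses, the script executes, the validator gates.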
What is an AGENTS.md file and why does it matter?
AGENTS.md is a file in your repository that tells an AI agent the rules, conventions, and architectural constraints of your codebase. It acts as the agent’s static context, injected at startup, so it knows your team’s norms without you having to repeat them in every prompt. Keep it short (under 100 lines) and use it as a table of contents pointing to deeper documentation.
Conclusion: The Harness IS the Product
The model is the easy part. Everyone has access to the same foundation models. GPT, Claude, Gemini, they are all remarkably capable. The harness is the hard part. The harness is what separates a demo that impresses your manager from a production system that ships real code every day without breaking things.
Here is what I want you to take away from this article:
Narrow the problem before you build the agent. A scoped agent that does one thing well beats a general-purpose agent that does everything poorly.
Use deterministic scripts at the execution boundary. Let the agent provide intent. Let the script provide the guarantee.
Enforce constraints mechanically, not verbally. If a rule matters, make it a linter, a test, or a validation step. Do not put it in a prompt and hope for the best.
Make the codebase legible to the agent, not just to humans. Progressive disclosure, structured documentation, and the repo as the single system of record.
The engineers who figure this out first will have an enormous advantage. Not because they have better models, but because they have better harnesses.
If you read this far, you have to read this other article with the AI concepts every software engineer needs to know in 2026: