Thoughts on harness engineering (whatever that means)
These days agents write 100% of my code. I plan something, they write the code, we review it, fix whatever comes back, hand off to the next session. Rinse and repeat.
It works. But it can get messy pretty quickly unless something outside the model is holding the whole thing together.
I think of that something as a harness. Building and maintaining it is what I mean by harness engineering.
A harness is literally straps and gear used to control or secure something. To harness something is to put a resource to effective use.
The model is smart, but left alone it goes all over the place. The harness is how you steer it without wrestling it on every single task.
When I think about it, there are really two layers.
The harness around the model. Cursor, Codex, Claude Code, whatever you use. That's the product wrapped around the LLM. Sessions, tools, permissions, how the agent reads files and runs commands. It's built for everyone. You pick one, you tweak settings, maybe you wire up some MCP servers. But you're not building it. You're a user. By definition it has to work for thousands of people.
The harness around the project. This is the stuff you control. What you build for yourself, your team, your codebase. Local scripts, how tests are laid out, the one command that means "done," AGENTS.md, skills, docs that point at real files. Nobody else is going to write this for you. It's how the agent actually works in your world without starting from scratch every session.
Both matter. This post is mostly about what you control: the layer you own, the part you build for your project and your team.
The model predicts. The harness enforces.
Same prompt, different run, different answer. Both can sound right. That's not a bug. That's how prediction works.
- The model predicts the next plausible token. Output shifts run to run.
- The harness enforces what counts as done. Same repo, same gates, same rules.
So what is harness engineering?
It's not picking Cursor over Codex. And it's not "write a better prompt."
It's the infrastructure you build so the agent isn't flying blind in your codebase:
- Clear intent. What does done actually look like?
- Knowledge in the repo, not in chat
- Checks that prove the work without asking the model "is this going to work?"
- A habit of turning repeated failures into something permanent
When the same thing breaks twice, "fix it, make no mistakes" might work again, but it only patches the symptom. The real question is: what's missing, and how do we make it obvious and enforceable?
Prompts are suggestions. Linters are law.
The loop I actually use
Day to day it looks like this:
Plan → Implement → Verify → Review
If review isn't happy, you go back to implement. Verify isn't vibes. It's npm run verify, or whatever your repo's version of that is. Tests, lint, typecheck. Stuff you can run.
| Step | Who | What's going on |
|---|---|---|
| Plan | You (maybe with the agent) | What we're doing, what's in scope, what counts as done |
| Implement | Agent | Code, tests, docs. This is the fuzzy part. |
| Verify | Gates + agent | Same commands every time. Did it pass or not? |
| Review | You and/or other agents | Does it actually make sense? Loop back if not. |
When the same thing breaks twice, fix the repo
There's a side loop that runs alongside all of this:
Mistakes / roadblocks / repeated tasks → capture what you learned → bake it into tooling
You'll know it's time when:
- You leave the same PR comment twice
- The agent hits the same wall twice
- Someone says "we always forget to..."
- Local passes, CI fails, and you realize the gate was never there
- A reviewer keeps fixing the same pattern
When that happens, I escalate in roughly this order:
- Docs. Write down the pattern.
- Tests. Cover the thing that broke.
- Lint. Enforce it if you can.
- Scripts. Turn the workflow into a command the agent can run and verify against on its own.
Every time you explain the same thing twice, you owe the codebase a guardrail.
The harness is the part of the system that remembers when the model forgets.
That stuff feeds back into plan and verify. Better docs mean fewer bad guesses before implement. Better gates catch drift before review. That's how you get reliable output from something that isn't reliable by nature.
What you actually need
You can set this up a bunch of different ways. These are the project-layer pieces I think matter:
Intent and knowledge. If it only lives in chat, it's gone. The codebase is the memory. AGENTS.md should be a map, not a book. Short pointers into real docs. Non-obvious architecture notes. If you're doing X, read Y first. Skills that pull in what you need for the job, not the whole docs folder.
Gates. One command that means done. Fast checks on commit. The full gate before merge. Same command locally and in CI. You want every failure at once, not one at a time. This is what your model harness calls when it runs npm run verify or make validate. You give the agent the tool. The project defines what passing means.
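Here's a minimal sketch of what "every failure at once" means in practice. The Check shape, the runGate name, and the three stand-in checks are all made up for illustration; a real gate would shell out to your actual lint, typecheck, and test commands.

```typescript
// Hypothetical sketch: the Check shape and the check names are illustrative.
type CheckResult =
  | { status: "pass" }
  | { status: "fail"; message: string };

interface Check {
  name: string;
  run: () => CheckResult;
}

interface GateReport {
  passed: boolean;
  failures: { name: string; message: string }[];
}

// Run every check, even after one fails, so a single run surfaces all failures.
function runGate(checks: Check[]): GateReport {
  const failures: { name: string; message: string }[] = [];
  for (const check of checks) {
    const result = check.run();
    if (result.status === "fail") {
      failures.push({ name: check.name, message: result.message });
    }
  }
  return { passed: failures.length === 0, failures };
}

// Stand-ins for real lint, typecheck, and test commands.
const report = runGate([
  { name: "lint", run: () => ({ status: "fail", message: "2 lint errors" }) },
  { name: "typecheck", run: () => ({ status: "pass" }) },
  { name: "test", run: () => ({ status: "fail", message: "1 failing test" }) },
]);

console.log(report.passed); // false
console.log(report.failures.map((f) => f.name).join(", ")); // lint, test
```

The point of not stopping at the first failure is that the agent gets the whole to-do list in one run instead of discovering problems one gate at a time.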
Rules. The rules agents will actually hit before they ship. Linters that tell you how to fix the thing, not just that you messed up. Custom rules when the generic stuff isn't enough. My rule: explain it twice in review, encode it.
Testing layers. A pyramid with clear rules about where tests go. Fast tests next to the code by default. High bar for full E2E. Local dev that works offline so the agent isn't waiting on some cloud thing to spin up.
Your time is the bottleneck. Let agents do setup, first-pass checks, a lot of the mechanical review. You pick what matters, what done means, and the calls that docs and gates can't make yet.
The model harness gets smarter every few months. The project harness is yours to keep building. That's the part that compounds.
Give your agents real tools
A model harness can read files and run shell commands. That's not enough. You have to give it the right tools for your codebase.
None of this is new. Linters, formatters, typecheckers, test suites, one command that means the build is good. Teams have relied on that stuff forever.
What's different now is how much code gets written. An agent can blast through a ton of files in seconds. The linters and gates you maybe skipped on a Friday afternoon? You can't really afford that anymore. Way more output than when you were typing it yourself. Same old tools. They just matter a lot more now.
What I mean:
- Something broke? The agent should find out fast, without you explaining how to run tests.
- Code doesn't match your patterns? Linters and typecheckers should tell the agent what's wrong, with enough detail to fix it.
- Local env a mess? Setup should be a command the agent can run, not a thread in Slack.
- Checks failed? The tool should tell the agent what broke and how to fix it, not just that something failed. Then what to run next (make fix, re-run the gate).
Typecheck, lint, format: these should run on autopilot and talk back. Not "something failed somewhere." File, line, rule, often a hint about what to change. The agent reads that output the same way you'd read compiler errors. You don't want it guessing whether the code matches the house style. You want the house to tell it.
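To make "file, line, rule, hint" concrete, here's a rough sketch of the shape I mean. The LintDiagnostic type, the rule name identity-from-jwt, and the line number are assumptions for illustration, not output from a real linter.

```typescript
// Hypothetical sketch: the LintDiagnostic shape, rule name, and line number are illustrative.
interface LintDiagnostic {
  file: string;
  line: number;
  rule: string;
  message: string;
  hint?: string; // how to fix it, when the rule knows
}

// One entry the agent can act on: where, which rule, what is wrong, what to change.
function formatDiagnostic(d: LintDiagnostic): string {
  const base = `${d.file}:${d.line} [${d.rule}] ${d.message}`;
  return d.hint === undefined ? base : `${base}\n  hint: ${d.hint}`;
}

// Full failure report: every diagnostic, then what to run next.
function formatReport(diagnostics: LintDiagnostic[], nextSteps: string[]): string {
  const lines = diagnostics.map((d) => formatDiagnostic(d));
  lines.push(`next: ${nextSteps.join(", then ")}`);
  return lines.join("\n");
}

const output = formatReport(
  [
    {
      file: "apps/backend/src/payments/handle-refund.ts",
      line: 12,
      rule: "identity-from-jwt",
      message: "User id is read from the request body.",
      hint: "Read the user id from the verified JWT instead.",
    },
  ],
  ["make fix", "make check-backend"]
);

console.log(output);
```

Compare that to "lint failed": the agent knows where to look, which rule it broke, what to change, and what to run afterwards.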
That's the communication layer between the model harness and the project harness. The model harness runs the command. Your AGENTS.md says when. The Makefile (or package script) says what's inside. The linter says what's wrong. The agent fixes it and runs again.
Docs matter here too. Not a wiki nobody updates. Short guidelines the agent actually loads: naming, where tests go, what "done" means, which command to run for which part of the monorepo. Skills for task-specific context so you're not pasting the same lecture every session.
A concrete example
Here's what something like this looks like in practice. Harmoney is a monorepo I work on (mobile, backend, Supabase).
Setup. make prepare-local-env creates missing .env files from examples and fills in safe local defaults. make check-local-env tells you if you're ready before the heavy gates. Agents can self-serve instead of asking you for the twelfth env var.
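A rough sketch of the logic behind those two targets, under my own assumptions: the function names, the env keys, and the default values are hypothetical, and the real targets read and write files rather than taking strings.

```typescript
// Hypothetical sketch: function names, env keys, and defaults are illustrative.
type EnvEntry = { key: string; value: string };

// Parse KEY=VALUE lines, skipping blanks and comments.
function parseEnv(text: string): EnvEntry[] {
  const entries: EnvEntry[] = [];
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.charAt(0) === "#") continue;
    const eq = line.indexOf("=");
    if (eq === -1) continue;
    entries.push({ key: line.slice(0, eq).trim(), value: line.slice(eq + 1).trim() });
  }
  return entries;
}

// If the .env already exists, leave it alone. Otherwise build it from the
// example, filling empty values with safe local defaults where we have one.
function prepareLocalEnv(
  exampleText: string,
  existingText: string | null,
  localDefaults: Record<string, string>
): string {
  if (existingText !== null) return existingText;
  const lines: string[] = [];
  for (const entry of parseEnv(exampleText)) {
    const value = entry.value !== "" ? entry.value : localDefaults[entry.key] ?? "";
    lines.push(`${entry.key}=${value}`);
  }
  return lines.join("\n") + "\n";
}

// Keys the example expects that are absent or empty in the actual env.
function missingKeys(exampleText: string, envText: string): string[] {
  const env = parseEnv(envText);
  const missing: string[] = [];
  for (const expected of parseEnv(exampleText)) {
    let found = false;
    for (const actual of env) {
      if (actual.key === expected.key && actual.value !== "") found = true;
    }
    if (!found) missing.push(expected.key);
  }
  return missing;
}

const envExample = "# backend\nAPI_URL=\nLOG_LEVEL=debug\nPAYMENT_API_KEY=\n";
const created = prepareLocalEnv(envExample, null, { API_URL: "http://localhost:3000" });

console.log(created); // API_URL and LOG_LEVEL filled, PAYMENT_API_KEY left empty
console.log(missingKeys(envExample, created).join(", ")); // PAYMENT_API_KEY
```

The split matters: prepare never overwrites what a developer already set, and check names exactly which keys still need a human.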
Guidelines. Root AGENTS.md is the map: product intent in VISION.md, boundaries in ARCHITECTURE.md, cross-cutting notes in LEARNINGS.md. Per-app AGENTS.md under apps/mobile and apps/backend. Locked rules agents can cite: all user-facing copy is pt-BR, identity comes from the JWT not the request body, bugs start with a failing test. Scoped gates: touched mobile only? make check-mobile. Backend only? make check-backend.
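The scoped-gate rule is simple enough to sketch as a function. The pickGate name is hypothetical, and falling back to make verify for mixed or unrecognized changes is my assumption here; the paths and make targets are the ones named above.

```typescript
// Hypothetical sketch: pickGate and the fallback rule are illustrative.
function pickGate(changedFiles: string[]): string {
  let touchesMobile = false;
  let touchesBackend = false;
  let touchesOther = false;

  for (const file of changedFiles) {
    if (file.indexOf("apps/mobile/") === 0) touchesMobile = true;
    else if (file.indexOf("apps/backend/") === 0) touchesBackend = true;
    else touchesOther = true;
  }

  // Only one app touched: run that app's scoped gate.
  if (touchesMobile && !touchesBackend && !touchesOther) return "make check-mobile";
  if (touchesBackend && !touchesMobile && !touchesOther) return "make check-backend";

  // Mixed, shared, or empty change sets: assume the full gate is the safe answer.
  return "make verify";
}

console.log(pickGate(["apps/mobile/src/screens/Home.tsx"])); // make check-mobile
console.log(pickGate(["apps/backend/src/payments/handle-refund.ts"])); // make check-backend
console.log(pickGate(["apps/mobile/app.json", "supabase/tests/rls.sql"])); // make verify
```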
Testing layers. The pyramid is documented so agents don't default to the slowest thing every time.
- Fast unit tests, colocated. Most handler logic gets a *.test.ts beside the file it tests. apps/backend/src/payments/handle-refund.ts next to handle-refund.test.ts. Mock the DB, run one workflow, pile on assertions. That's the default.
- DB tests when Postgres is the point. RLS policies, RPC contracts, migration behavior? pgTAP under supabase/tests/. Backend integration with a real local database? make db-test-ts. Not every change needs this. Only when the database is actually what you're proving.
- API smoke, kept tiny. A small scripted smoke path for auth and one happy-path webhook. The rule is the same as a good E2E suite: don't add a case unless you genuinely need the real HTTP transport, OAuth, and session wiring all at once. Two smoke journeys, not twenty.
- Full UI E2E, last resort. Web or mobile, same bar: one or two journeys that prove login and a critical screen actually work end to end. Slow, brittle, expensive to maintain. Default answer is no. Agents reach for colocated unit tests first.
Fewer, longer tests beat a hundred one-liners. No string-pinning on tool descriptions or UI copy. Offline by default. The harness tells the agent which layer to use so it's not guessing.
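Here's roughly what "mock the DB, run one workflow, pile on assertions" can look like. Everything in this sketch is hypothetical: the handler body, the PaymentsDb interface, and the fake are invented for illustration, the real handler is surely async, and a plain function with a hand-rolled assert stands in for whatever test runner the repo uses.

```typescript
// --- handle-refund.ts (hypothetical handler, not the real one) ---
interface PaymentRow {
  id: string;
  ownerId: string;
  amountCents: number;
  refunded: boolean;
}

interface PaymentsDb {
  findPayment(id: string): PaymentRow | undefined;
  markRefunded(id: string): void;
}

type RefundResult =
  | { status: "refunded"; amountCents: number }
  | { status: "rejected"; reason: string };

// jwtUserId comes from the verified token, never from the request body.
function handleRefund(db: PaymentsDb, jwtUserId: string, paymentId: string): RefundResult {
  const payment = db.findPayment(paymentId);
  if (payment === undefined) return { status: "rejected", reason: "not_found" };
  if (payment.ownerId !== jwtUserId) return { status: "rejected", reason: "forbidden" };
  if (payment.refunded) return { status: "rejected", reason: "already_refunded" };
  db.markRefunded(paymentId);
  return { status: "refunded", amountCents: payment.amountCents };
}

// --- handle-refund.test.ts (hypothetical colocated test) ---
function expectTrue(condition: boolean, label: string): void {
  if (!condition) throw new Error(`assertion failed: ${label}`);
}

// In-memory fake standing in for the mocked DB.
function makeFakeDb(seed: PaymentRow[]): { db: PaymentsDb; refundedIds: string[] } {
  const rows = seed.map((row) => ({ ...row }));
  const refundedIds: string[] = [];
  const db: PaymentsDb = {
    findPayment(id) {
      for (const row of rows) if (row.id === id) return row;
      return undefined;
    },
    markRefunded(id) {
      for (const row of rows) if (row.id === id) row.refunded = true;
      refundedIds.push(id);
    },
  };
  return { db, refundedIds };
}

// One workflow, many assertions. Returns the ids that were actually refunded.
function testRefundWorkflow(): string[] {
  const { db, refundedIds } = makeFakeDb([
    { id: "pay_1", ownerId: "user_a", amountCents: 2500, refunded: false },
  ]);

  const stranger = handleRefund(db, "user_b", "pay_1");
  expectTrue(stranger.status === "rejected" && stranger.reason === "forbidden", "stranger is rejected");
  expectTrue(refundedIds.length === 0, "rejected refund writes nothing");

  const owner = handleRefund(db, "user_a", "pay_1");
  expectTrue(owner.status === "refunded" && owner.amountCents === 2500, "owner gets the full refund");

  const retry = handleRefund(db, "user_a", "pay_1");
  expectTrue(retry.status === "rejected" && retry.reason === "already_refunded", "no double refund");

  const missing = handleRefund(db, "user_a", "pay_404");
  expectTrue(missing.status === "rejected" && missing.reason === "not_found", "unknown payment");

  return refundedIds;
}

console.log(testRefundWorkflow().join(",")); // pay_1
```

One test walks the whole refund story instead of four tiny tests each rebuilding the same setup, and it never touches a real database.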
Autopilot feedback. Each workspace runs format, lint, and typecheck as part of check. SQL migrations get their own lint scripts. Invariant sentinels catch architectural mistakes statically.
The gate. make verify is the definition of done locally: make check plus database checks (pgTAP, contract parity, backend DB tests). make fix across workspaces when auto-fix can help, then re-run the gate you care about.
When something fails. AGENTS.md spells it out: if checks fail, make fix, then re-run the relevant target. No guessing whether you needed the mobile gate or the full DB suite.
The shape is always the same:
- Orient: AGENTS.md + the doc for the task
- Implement: agent writes code and colocated tests
- Verify: one authoritative command (make verify)
- Review: humans, agents, preview when it helps
- Encode: repeated mistake becomes doc, test, or lint rule
The win isn't memorizing six CI jobs. It's one answer to "are we done?" Run the gate. Green locally should mean green in CI. That's the project harness doing its job.
It's not magic though. DB checks need Docker up. Scoped gates don't run the full suite. Wrong Node version still ruins your afternoon. Agents skip steps if you let them. The harness cuts down on guessing. It doesn't replace your judgment.
What I actually believe
- Probabilistic output is fine. Unguarded probabilistic output in prod is not.
- Good gates compound. Chaos drifts.
- Chat isn't memory. The repo is.
- Short enforceable rules beat a 50-page instruction doc.
- The harness is never done. You build it next to the product.
- Taste goes into tooling so you're not in every PR.
Harness engineering doesn't replace engineers. It just moves where you spend your time. Less typing every line. More designing an environment where agents can do real work.