The codebase is the harness
Previously I wrote about what I mean by harness engineering: the model harness around the LLM, and the project harness you actually own. This one is the practical follow-up. So what belongs in the codebase so the agent isn't asking you the same stuff every time, or worse, making stuff up?
No harness? Same questions, every session. What to read, what proves done, how to set up, where tests go, which commands are safe. The agent forgets. The codebase remembers.
My take is simple: the codebase is the harness. Not a prompt. Not Cursor or Codex. It's the project-owned layer that turns intent, setup, and checks into things the agent can run and verify against.
The model predicts. The codebase enforces.
- The model harness is the product: sessions, tools, permissions, MCP. You configure it. You don't own it.
- The project harness is what lives in the codebase: docs, commands, tests, scripts, CI wiring, the habits that turn repeated mistakes into guardrails.
The model predicts the next plausible token. The codebase enforces what counts as done. Same prompt, different run, different answer. Both can sound right. That's why you need gates the agent can run without asking you.
Prompts are suggestions. Linters are law.
The minimum useful harness
You can grow this over time. These are the pieces I wouldn't skip.
A map, not a book. Root AGENTS.md should answer: what is this project, what are the hard rules, where is product intent, where is architecture, which check do I run for which kind of change, what do I do when a gate fails. Pointers into real files. Not a 50-page instruction dump.
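Pointers only help while they point somewhere. A small check can fail the build when the map goes stale. This is a minimal sketch in Python, used here as a stand-in for whatever your repo scripts in. It assumes the map writes file pointers as backticked repo-relative paths. `find_pointers`, `dead_pointers` and `check_map` are hypothetical helpers, not an existing tool.

```python
import re
from pathlib import Path

# Assumption: file pointers are backticked repo-relative paths such as
# `docs/project-intent.md`. Commands like `make verify` have no extension,
# so they are not matched.
POINTER = re.compile(r"`([\w./-]+\.[A-Za-z0-9]+)`")


def find_pointers(map_text: str) -> list[str]:
    """Return every backticked path-like pointer in the map, in document order."""
    return POINTER.findall(map_text)


def dead_pointers(map_text: str, repo_root: Path) -> list[str]:
    """Return pointers that do not resolve to an existing file under repo_root."""
    return [p for p in find_pointers(map_text) if not (repo_root / p).is_file()]


def check_map(repo_root: Path, map_name: str = "AGENTS.md") -> int:
    """Print dead pointers and return a process exit code (0 means the map is clean)."""
    missing = dead_pointers((repo_root / map_name).read_text(), repo_root)
    for p in missing:
        print(f"{map_name} points at a missing file: {p}")
    return 1 if missing else 0
```

If you wire it into a hypothetical `make check-map` target, a renamed doc becomes a failing gate. Without that, it's a silent dead end for the agent.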
Project intent. A focused doc that says what the product is, who it's for, what it optimizes for, what it explicitly does not optimize for, and which assumptions are unsafe. Without this, the agent basically guesses the product direction from the code around it. Okay for a button color. Not okay for whose data is whose, billing, privacy, or what you're allowed to break in the API.
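The intent doc can also rot into a stub. Below is a sketch of a check that the required sections exist and aren't empty. It assumes `##` headings. The section names are an assumption that mirrors the project intent skeleton later in this post, so swap in your own. `section_bodies` and `intent_problems` are hypothetical names.

```python
import re

# Hypothetical required headings, mirroring the project intent skeleton.
REQUIRED_SECTIONS = [
    "What this is",
    "What this is not",
    "Optimizes for",
    "Does not optimize for",
    "Unsafe assumptions",
    "Hard invariants",
]


def section_bodies(doc: str) -> dict[str, str]:
    """Map each '## ' heading to the stripped text under it (until the next '## ')."""
    bodies: dict[str, list[str]] = {}
    current = None
    for line in doc.splitlines():
        match = re.match(r"^##\s+(.+?)\s*$", line)
        if match:
            current = match.group(1)
            bodies[current] = []
        elif current is not None:
            bodies[current].append(line)
    return {heading: "\n".join(lines).strip() for heading, lines in bodies.items()}


def intent_problems(doc: str, required=REQUIRED_SECTIONS) -> list[str]:
    """List required sections that are missing or left empty."""
    bodies = section_bodies(doc)
    problems = []
    for heading in required:
        if heading not in bodies:
            problems.append(f"missing section: {heading}")
        elif not bodies[heading]:
            problems.append(f"empty section: {heading}")
    return problems
```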
One command that means done. `make verify`, `pnpm verify`, `npm run validate`, whatever fits your stack. Documented in the map. Broad enough to catch cross-boundary regressions. The local handoff gate should be at least as strong as CI for normal work.
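Behind the one command, most of the work is ordering the gates and printing a clear failure message. Here's a minimal Python sketch of a gate runner. It assumes hypothetical `make fmt-check`, `make lint` and `make test` targets. It stops at the first failure and names the gate, so an agent knows exactly what to fix and re-run.

```python
import subprocess
import sys
from typing import Optional, Sequence, Tuple

# Hypothetical gate list. Replace the commands with your repo's own targets.
GATES = [
    ("format", ["make", "fmt-check"]),
    ("lint", ["make", "lint"]),
    ("test", ["make", "test"]),
]


def run_gates(gates: Sequence[Tuple[str, Sequence[str]]]) -> Optional[str]:
    """Run gates in order and stop at the first failure.

    Returns the name of the failing gate, or None when everything passed.
    """
    for name, cmd in gates:
        print(f"verify: {name}: {' '.join(cmd)}", flush=True)
        result = subprocess.run(list(cmd))
        if result.returncode != 0:
            print(f"verify: FAILED at {name} (exit {result.returncode})", file=sys.stderr)
            return name
    print("verify: ok")
    return None
```

You'd call `sys.exit(1 if run_gates(GATES) else 0)` from the `verify` target, which gives humans and agents one exit code to trust. Stopping at the first failure is a choice. Running every gate and reporting all failures is the other reasonable one.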
Scoped gates for iteration. Touched only the API package? Run the API check. DB migration? Run the DB gate. The final gate defines handoff; scoped gates keep the loop fast.
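Choosing the scoped gate can be code rather than something the agent has to infer. Here's a sketch built on a hypothetical manifest that maps path patterns to targets (`check-api`, `check-db` and the rest are made-up names). Any path the manifest doesn't claim falls back to the final gate.

```python
from fnmatch import fnmatchcase

# Hypothetical manifest: glob pattern -> scoped make target.
# In fnmatch, "*" also matches "/", so "packages/api/*" covers nested files.
SCOPES = {
    "packages/api/*": "check-api",
    "db/migrations/*": "check-db",
    "web/*": "check-web",
    "docs/*": "check-docs",
}
FINAL_GATE = "verify"  # anything the manifest doesn't recognize gets the full gate


def gates_for(changed_paths, scopes=SCOPES):
    """Map changed file paths to the scoped gates worth running while iterating."""
    targets = []
    for path in changed_paths:
        hits = [target for pattern, target in scopes.items() if fnmatchcase(path, pattern)]
        for target in hits or [FINAL_GATE]:
            if target not in targets:
                targets.append(target)
    # The final gate already covers every scope, so don't run both.
    return [FINAL_GATE] if FINAL_GATE in targets else targets
```

Since it's a plain function, the path filter can have its own unit tests. CI can call the same function instead of re-encoding the rules in workflow YAML.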
Local env check and prepare. Read-only check that says what's missing. Explicit prepare that creates safe local state. Real secrets never invented. Non-empty local values not overwritten. Setup should be a command, not a thread in Slack.
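You can sketch the check/prepare split as two functions over dotenv-style text. This assumes simple `KEY=value` lines (no `export`, no quoting, no multi-line values). It also assumes a committed `.env.example` whose secret entries are blank. `parse_env`, `check_env` and `prepare_env` are hypothetical names.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blanks and comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_env(example: str, local: str) -> list[str]:
    """Read-only: keys from the example that are missing or empty locally."""
    have = parse_env(local)
    return [key for key in parse_env(example) if not have.get(key)]


def prepare_env(example: str, local: str) -> tuple[str, list[str]]:
    """Return (new local text, keys a human still has to fill in).

    Non-empty local values are never overwritten. Empty or missing keys get the
    example's default when it has one. Blank defaults (secrets) stay blank.
    """
    defaults = parse_env(example)
    out, seen = [], set()
    for raw in local.splitlines():
        key, sep, value = raw.partition("=")
        key = key.strip()
        if sep and key in defaults and not raw.lstrip().startswith("#"):
            seen.add(key)
            if not value.strip() and defaults[key]:
                raw = f"{key}={defaults[key]}"  # fill an empty value with the safe default
        out.append(raw)
    for key, default in defaults.items():
        if key not in seen:
            out.append(f"{key}={default}")
    text = "\n".join(out) + "\n"
    prepared = parse_env(text)
    return text, [key for key in defaults if not prepared.get(key)]
```

`prepare_env` is idempotent, so running it on its own output changes nothing. The keys it returns are the ones it refused to invent.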
Test layer guide. Tell the agent where proof belongs: unit beside the code, persistence tests when the database is the point, smoke only when you need real transport, full E2E last. Default to the cheapest layer that proves intent.
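Write the guide as ordered data and "cheapest layer that proves intent" turns into a lookup. This is a toy sketch. The layer names come from the guide. The capability labels (`logic`, `database`, `transport`, `browser`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    proves: frozenset  # capabilities this layer can actually demonstrate


# Ordered cheapest first; capability labels are hypothetical.
LAYERS = [
    Layer("unit", frozenset({"logic"})),
    Layer("persistence", frozenset({"logic", "database"})),
    Layer("smoke", frozenset({"logic", "database", "transport"})),
    Layer("e2e", frozenset({"logic", "database", "transport", "browser"})),
]


def cheapest_layer(needs: set, layers=LAYERS) -> str:
    """Return the first (cheapest) layer whose capabilities cover everything needed."""
    for layer in layers:
        if needs <= layer.proves:
            return layer.name
    raise ValueError(f"no layer proves {sorted(needs)}")
```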
Script inventory. Scripts are part of the public API whether you meant them to be or not. Document them: where they live, read-only or mutating, who owns them, and which Make/npm target wraps them for agents. Bad name: `setup.sh`. Better: `prepare-local-env`, which says what it touches, so nobody nukes prod by accident.
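A good name helps. A guard at the top of every mutating script helps more. Here's a minimal sketch. It assumes the script receives a database URL and that CI sets the `CI` environment variable (GitHub Actions does). `assert_local_target` and `RefusedToMutate` are hypothetical names. A shell script would need the same checks ported.

```python
import os
from typing import Mapping
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


class RefusedToMutate(RuntimeError):
    """Raised when a local-only mutating script is pointed somewhere it shouldn't be."""


def assert_local_target(database_url: str, env: Mapping[str, str] = os.environ) -> None:
    """Call at the top of a mutating script, such as a local seed.

    Refuses when CI is set, or when the database host is not a local address.
    """
    if env.get("CI"):
        raise RefusedToMutate("refusing to mutate: CI is set")
    host = urlparse(database_url).hostname
    if host not in LOCAL_HOSTS:
        raise RefusedToMutate(f"refusing to mutate: host {host!r} is not local")
```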
CI calls codebase commands. GitHub Actions (or whatever) should be thin: checkout, install, run the repo's verify target. Don't duplicate "did they touch the API?" logic in three workflow files. Keep that in the codebase so local and CI stay in sync.
Cheap hooks. Format and lint on commit are fine. Hooks are hygiene, not the definition of done. Don't hang the full test suite or DB gates on pre-commit.
| Piece | Job |
|---|---|
| AGENTS.md | Route the agent to the right docs and commands |
| Project intent | Product boundaries before code gets written |
| Final gate | One answer to "are we done?" |
| Scoped gates | Fast iteration without skipping handoff |
| Env check / prepare | Self-serve setup |
| Test layers | Right proof at the right cost |
| Script inventory | No surprise mutating commands |
| CI | Same repo-owned logic as local |
Don't dump all of this on the agent every session. Make it discoverable: short map up front, focused docs it can open when the task actually needs them. The agent should be able to find the right piece without you pasting half the repo into chat.
Adoption path
Start small. Add the rest as the pain shows up. You don't need all of this on day one.
Map the codebase. Small root agent file, project intent, architecture pointers, scoped docs where domains need special handling. Make the codebase navigable before you write an encyclopedia.
Define done. One final gate. Runs locally. Documented. Deterministic enough that failures are actionable. CI invokes the same logic or a clearly documented equivalent.
Add scoped gates. Match how the codebase actually changes: frontend only, backend only, one package, contracts, docs. Fast iteration; final gate still owns handoff.
Make CI reproducible locally. Keep path filters in a manifest or a tested script. Workflows call repo commands. No duplicated target selection spread across YAML files.
Make setup explicit. Check what's missing. Prepare what's safe. Document what the gate may reset locally and why.
Document test layers and the script inventory, so the agent isn't defaulting to the slowest test or a random shell script every time.
Turn repeat review into guardrails. If you or a reviewer says the same thing twice, that's a signal: doc it, test it, lint it, or script it. Don't keep explaining it in every PR.
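A repeated review comment can become a lint in very little code. Below is a sketch of a pattern-based check. The two rules (no `console.log`, no deep `../../` imports) and the `.ts`/`.tsx` scope are hypothetical examples of comments a reviewer might keep leaving.

```python
import re
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Rule:
    pattern: re.Pattern
    message: str


# Hypothetical rules distilled from review comments that kept coming back.
RULES = [
    Rule(re.compile(r"console\.log\("), "use the project logger, not console.log"),
    Rule(re.compile(r"from ['\"](?:\.\./){2,}"), "import via the package entry point, not deep relative paths"),
]


def lint_text(path: str, text: str, rules=RULES) -> list[str]:
    """Return 'path:line: message' for every rule hit in the text."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule in rules:
            if rule.pattern.search(line):
                findings.append(f"{path}:{lineno}: {rule.message}")
    return findings


def lint_tree(root: Path, suffixes=(".ts", ".tsx"), rules=RULES) -> list[str]:
    """Lint every file under root with a matching suffix, in sorted path order."""
    findings = []
    for p in sorted(root.rglob("*")):
        if p.is_file() and p.suffix in suffixes:
            findings.extend(lint_text(p.relative_to(root).as_posix(), p.read_text(), rules))
    return findings
```

A regex rule is the cheapest guardrail. If one gets noisy, that's usually the signal to move it into your real linter's config or a custom rule.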
Skeletons you can steal
Stack-agnostic shapes. Swap Make for npm, Vitest for go test, whatever you use.
Root map (AGENTS.md):
```markdown
# Project name

Short description. Read `docs/project-intent.md` before product decisions.

## Non-negotiables

- Invariant one
- Invariant two

## Where to read

- Intent: `docs/project-intent.md`
- Architecture: `docs/architecture.md`
- Testing: `docs/testing.md`

## Checks

- Area changed only: `make check-<area>`
- Normal handoff: `make verify`

## When checks fail

Run `make fix` if available, then re-run the scoped gate or `make verify`.
```
Project intent (docs/project-intent.md):
```markdown
# Project intent

## What this is

One paragraph. Present tense. Real behavior today.

## What this is not

Explicit non-goals.

## Optimizes for

## Does not optimize for

## Unsafe assumptions

Things the agent must not infer from nearby code.

## Hard invariants

Violations are bugs, not style preferences.
```
Command matrix (in a harness or contributing doc):
| Command | When | Mutates? |
|---|---|---|
| make check-local-env | Before heavy work | No |
| make prepare-local-env | Missing local files | Yes (local only) |
| make check-api | API paths changed | No |
| make verify | Behavior-changing handoff | Maybe (documented) |
| make fix | After format/lint failures | Yes |
Final gate (Makefile or package script):
```make
# check, fix-if-needed and db-check-if-needed are your own targets, defined elsewhere.
.PHONY: verify
verify: check fix-if-needed db-check-if-needed
	@echo "verify: ok"
```
Script inventory row:
| Path | Domain | Public | Mode | Notes |
|---|---|---|---|---|
| scripts/seed-local.sh | dev | make prepare-local-env | mutating | Local only; never CI |
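An inventory like this only stays accurate if something checks it. Here's a minimal sketch that parses it and compares it with the files under `scripts/`. It assumes one Markdown table with `Path` and `Mode` columns, in the row format shown here. `parse_inventory` and `inventory_problems` are hypothetical helpers.

```python
from pathlib import Path

VALID_MODES = {"read-only", "mutating"}


def parse_inventory(markdown: str) -> dict[str, dict[str, str]]:
    """Parse a single Markdown table into {path: {lowercased column: cell}}."""
    header, rows = None, {}
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        if set(line) <= set("|-: "):
            continue  # the |---|---| separator row
        cells = [c.strip().strip("`") for c in line.strip("|").split("|")]
        if header is None:
            header = [c.lower() for c in cells]
        else:
            row = dict(zip(header, cells))
            rows[row["path"]] = row
    return rows


def inventory_problems(markdown: str, repo_root: Path, scripts_subdir: str = "scripts") -> list[str]:
    """Flag unlisted scripts, listed-but-missing scripts, and unknown modes."""
    rows = parse_inventory(markdown)
    on_disk = {
        p.relative_to(repo_root).as_posix()
        for p in (repo_root / scripts_subdir).rglob("*")
        if p.is_file()
    }
    problems = [f"unlisted script: {p}" for p in sorted(on_disk - set(rows))]
    for path, row in rows.items():
        if path not in on_disk:
            problems.append(f"listed but missing: {path}")
        if row.get("mode") not in VALID_MODES:
            problems.append(f"{path}: mode {row.get('mode')!r} is not read-only or mutating")
    return problems
```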
Local env:
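```make
.PHONY: check-local-env prepare-local-env
# Read-only: report what's missing and how to fix it.
check-local-env:
	@test -f .env.local || (echo "run: make prepare-local-env" >&2; exit 1)

# Mutating, local only: never overwrites an existing .env.local.
prepare-local-env:
	@test -f .env.local || cp .env.example .env.local
```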
Thin CI workflow:
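```yaml
# Thin on purpose: the gate logic lives in the repo, not in this file.
on: [push, pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v7
      - run: make verify
```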
The point isn't the syntax. It's that every layer points at the same contracts.
Not just theory
I'm not only writing about this in the abstract. I build this in most of my projects and work. One repo I keep on the side is harness: a CLI, callable workflows, packaged skills, and the runner that ties them together. Your product repo still needs its own map, gates, and setup. Harness is what I use to review changes, route agent work, and keep loops explicit.
If the skeletons above help you design, that repo helps you see working examples: workflow wiring, skill structure, contributor docs. Still evolving. Copy what fits. More context: /projects/harness.
What the harness doesn't do
No judgment replacement. Gates catch drift and mechanical mistakes. They don't pick your product strategy.
Stale gates lie. If verify doesn't match what CI runs, or the test guide says one thing and the code does another, the agent learns the wrong lesson.
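One cheap defense against one kind of lie, workflows that stopped calling the gate, is a check that every workflow still calls the repo's gate. This is a rough sketch. It assumes single-line `run:` steps in `.github/workflows/*.yml` and requires `make verify`. It scans text rather than parsing YAML, so multi-line `run: |` blocks aren't understood. `run_commands`, `load_workflows` and `drift_problems` are hypothetical names.

```python
import re
from pathlib import Path

# Matches single-line steps like "- run: make verify" or "run: make verify".
RUN_LINE = re.compile(r"^\s*(?:-\s*)?run:\s*(.+?)\s*$")


def run_commands(workflow_text: str) -> list[str]:
    """Return the single-line `run:` commands in a workflow file, in order."""
    commands = []
    for line in workflow_text.splitlines():
        match = RUN_LINE.match(line)
        if match:
            commands.append(match.group(1))
    return commands


def load_workflows(repo_root: Path) -> dict[str, str]:
    """Read every .yml/.yaml file under .github/workflows, keyed by file name."""
    wf_dir = repo_root / ".github" / "workflows"
    return {p.name: p.read_text() for p in sorted(wf_dir.glob("*.y*ml"))}


def drift_problems(workflows: dict[str, str], required: str = "make verify") -> list[str]:
    """Flag workflows that never call the repo-owned gate."""
    if not workflows:
        return ["no workflow files found"]
    return [
        f"{name}: no step runs `{required}`"
        for name, text in sorted(workflows.items())
        if required not in run_commands(text)
    ]
```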
Agents skip steps if you let them. Scoped gates don't run the full suite. Wrong tool versions still ruin your afternoon. DB checks need Docker up. The harness cuts down on guessing. It doesn't replace your taste.
Chat isn't memory. The codebase is.
That's the whole bet. Put intent, setup, and proof where the agent (and the next human) can run them without you in the loop. Build the harness next to the product. Every time you explain the same thing twice, the codebase probably owes you a guardrail.