Making AI agents prove their work

Casework · plate 04 of 04

The bottleneck moved from writing code to trusting it. So I built the machinery that makes “works” something an agent has to prove, not claim.

Engineering harness (generator / evaluator / reviewer, enforced gates) · self-directed · 2026

50 → 39 score inflation the harness caught in itself: a blind re-judge on identical evidence

The situation

By early 2026, AI coding agents could produce more code in a day than a small team used to ship in a sprint. That moved the bottleneck. The problem stopped being “can the agent write it” and became “can anyone trust what it wrote.” The failure mode isn’t an agent writing bad code; it’s an agent writing plausible code and then reporting that it works.

— See the whole system

What I saw

Better prompting doesn’t fix this: separation of powers does. The same principle behind test-driven development applies. Define what “done” means before the work starts, make it machine-checkable, and never let the party that writes the code also grade it. So I stopped treating agent output as the product and started treating the surrounding machinery as the product: a harness of independent roles (a generator that builds, an evaluator that judges against evidence, a reviewer that checks code quality) held together by gates that cannot be talked out of a failing verdict.
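The rule that the party writing the code never grades it can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the names Verdict, accept, and SelfGradingError are hypothetical, and the role strings are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


class SelfGradingError(Exception):
    """Raised when the role that wrote the code also tries to grade it."""


@dataclass(frozen=True)
class Verdict:
    # Hypothetical record shape; the real harness may store more or different fields.
    feature_id: str
    author: str                # role that produced the change, e.g. "generator"
    judge: str                 # role that graded it, e.g. "evaluator"
    passed: bool
    evidence: tuple[str, ...]  # pointers to test output, logs, or screenshots


def accept(verdict: Verdict) -> bool:
    """Return True only for a passing verdict from an independent judge with evidence."""
    if verdict.judge == verdict.author:
        raise SelfGradingError(f"{verdict.judge!r} cannot grade its own work")
    if not verdict.evidence:
        # A pass with nothing behind it is a claim, not a proof.
        return False
    return verdict.passed
```

A verdict where author and judge are the same role raises instead of returning False, so a self-graded pass cannot be mistaken for an ordinary failure and quietly retried.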

— Find the leverage point

What I built

A harness where every feature starts in a default-fail state and only passes when the evaluator flips it on evidence, not on the generator’s claims (the evaluator writes its own adversarial tests against a fresh ephemeral database rather than trusting the generator’s). Pre-commit hooks enforce the contract: no evaluation, no commit. I ran it on a production drug-catalog service (FastAPI, Postgres), where 45 feature stories shipped through the loop across two runs and the harness’s own self-test caught two real bugs in the verifier itself.

Then I extracted the whole thing into an open-source scaffold: one command materializes a repository with the roles, gates, and evaluation loop pre-wired. This website was built by the same harness, which forced the interesting problem: design can’t be unit-tested, so the aesthetic seat went to a vision-model judge scoring screenshots against a written rubric, with no source change landing without a scored evaluation row.
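The default-fail state and the "no evaluation, no commit" gate described above might look roughly like this. It is a sketch under stated assumptions: Feature, Ledger, flip, and commit_allowed are hypothetical names, and the sketch assumes a commit is blocked unless every feature it touches has been flipped to passing by the evaluator. A real pre-commit hook would call something like commit_allowed and exit non-zero on False.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Feature:
    feature_id: str
    passed: bool = False                       # default-fail: nothing passes until proven
    evidence: list[str] = field(default_factory=list)


class Ledger:
    """Hypothetical in-memory ledger of feature states."""

    def __init__(self, feature_ids: list[str]) -> None:
        self._features = {fid: Feature(fid) for fid in feature_ids}

    def flip(self, feature_id: str, role: str, evidence: list[str]) -> None:
        """Mark a feature as passing; only the evaluator may do so, and only with evidence."""
        if role != "evaluator":
            raise PermissionError(f"role {role!r} may not flip a verdict")
        if not evidence:
            raise ValueError("a pass requires at least one piece of evidence")
        feature = self._features[feature_id]
        feature.passed = True
        feature.evidence = list(evidence)

    def commit_allowed(self, touched: list[str]) -> bool:
        """Gate check: every touched feature must be known and already passing."""
        return all(
            fid in self._features and self._features[fid].passed
            for fid in touched
        )
```

Because the default is False, forgetting to evaluate a feature and failing its evaluation produce the same outcome: the gate stays shut.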

— Build the platform

Where this goes

The point is not that one person ships faster (although one person ships considerably faster). It’s that “works” becomes a property you can prove rather than a claim you take on faith, which is exactly what a team needs before agents can carry production work. The loop caught two failures that matter more than any feature it shipped. The first was a miscalibrated rubric: it scored a round 48/50 while outside readers saw “a high-end contractor, not a leader.” Rewriting the rubric and re-scoring the identical pixels reproduced the human critique at 29/50, unprompted. The second was judge inflation: a clean 50/50 I didn’t believe, which a blind second judge scored 39/50 on the identical frozen evidence, naming a highest-impact fix in every category. AI amplifies whatever process it lands in. This is the machinery that makes sure what it amplifies is good.
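The blind re-judge check comes down to comparing two scores on the same frozen evidence. The sketch below uses only the totals reported above (50/50 originally, 39/50 blind). The function name inflation_report and the 10% tolerance are assumptions for illustration, not the harness's actual threshold.

```python
from __future__ import annotations


def inflation_report(
    original: int,
    blind: int,
    max_score: int,
    tolerance: float = 0.10,  # assumed threshold: flag gaps above 10% of the scale
) -> dict[str, float | bool]:
    """Compare an original score with a blind re-judge score on identical evidence."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    for score in (original, blind):
        if not 0 <= score <= max_score:
            raise ValueError(f"score {score} is outside 0..{max_score}")
    gap = original - blind
    relative_gap = gap / max_score
    return {
        "gap": gap,
        "relative_gap": relative_gap,
        "inflated": relative_gap > tolerance,
    }


report = inflation_report(original=50, blind=39, max_score=50)
print(report["gap"], report["inflated"])  # prints: 11 True
```

The comparison only means something if the second judge sees the same evidence and never sees the first score, which is what "blind" and "frozen" guarantee in the account above.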

— Leave it running

What it left behind

The method now lives outside my own projects as an open-source scaffold: one command wires up the roles, gates, and evaluation loop, and it has been re-validated on a downstream project. This website was built with it.

By the numbers

45 feature stories shipped through the loop on one production service
2 real bugs the verifier caught in itself, via its own self-test
48 → 29 rubric miscalibration caught across versions
