I read OpenAI's Harness Engineering a while back and keep wandering back to it. I can't quite tell whether it describes a million lines of unmaintainable slop or a glimpse of what near-term software engineering might look like. I encourage you to keep an open mind about it. Reality probably lies somewhere in the middle.
The TL;DR is that a team at OpenAI has been building an internal product without manually typing any code. None. Zero. 5+ months and 1M lines of code in, they estimate it has taken about 1/10th of the effort it would have taken them to build by hand. Where they spent that time was in designing and refining harnesses:
We had weeks to ship what ended up being a million lines of code. To do that, we needed to understand what changes when a software engineering team’s primary job is no longer to write code, but to design environments, specify intent, and build feedback loops that allow Codex agents to do reliable work.
The article is relatively vague, but it highlights a few different kinds of harnesses:
- Context: Encode intent, system knowledge, and conventions directly in the repo so agents can work without rediscovering architecture from scratch. I assume AGENTS.md, co-located docs/, thorough specs/plans, and similar durable artifacts.
- Constraints: Limit the solution space with structural rules, automatically enforced, so agents can move quickly without violating core principles. These seem to go beyond the usual strict type checking and linting, and more into custom linters.
- Verification: Close the feedback loop with signals of success or failure so agents can test, validate, and iterate autonomously (pre-commit checks, tests, compiler errors, reproducible workflows, runtime signals).
- Maintenance: Continuously reduce entropy through small, automated cleanups and refactors that keep the codebase legible and aligned as changes accumulate.
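To make the "constraints" idea concrete, here is a minimal sketch of the kind of custom lint rule the article seems to gesture at: a check that forbids importing a low-level module outside an approved layer. Everything here is hypothetical (the `db` module, the `app/data/` layer); the OpenAI article doesn't describe their actual rules.

```python
# Hypothetical "constraints" harness: a tiny custom lint rule that
# forbids direct imports of a low-level `db` module anywhere outside
# the data-access layer. All names are illustrative.
import ast

ALLOWED_PREFIX = "app/data/"  # only this layer may import `db` directly

def forbidden_db_imports(source: str, path: str) -> list[str]:
    """Return lint violations for direct `db` imports outside the data layer."""
    if path.startswith(ALLOWED_PREFIX):
        return []  # the data layer is allowed to touch `db`
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name == "db" or name.startswith("db."):
                violations.append(f"{path}:{node.lineno}: direct db import")
    return violations

print(forbidden_db_imports("import db\n", "app/api/handlers.py"))
# → ['app/api/handlers.py:1: direct db import']
```

A rule like this gives an agent an unambiguous, machine-checkable "no" long before review, which is exactly the point: the constraint shrinks the solution space instead of relying on the agent to remember a convention.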
Taken together, that's a compelling framework: clear, thorough, and one that surfaces several gaps in our current tools.
One key concern the article barely touches on is security. How can we enforce a quality gate when we don't know what the code actually does? I expected at least a vague mention of fuzzing as a starting point, but there is none. Eventually, it'll be agents all the way down. But what about the near term? I would love to see a follow-up from that OpenAI team focused on what their current approach to securing this product looks like.
Despite that gap, these kinds of harnesses look like key future-proofing investments, with or without AI. They clarify the definition of “good” and reduce decision fatigue. That is critical for LLMs to perform, but it's useful for humans, too. It's not like they're new either. We've come around to loving strict type-checking, linting, tests, and many other kinds of correctness or quality gates. But the demand is rising. And not just the demand, but the potential impact. We need more of these, stricter, and across more layers, to reduce ambiguity as much as possible.
Ultimately, I see a near future where setting up and nurturing harnesses becomes part of most engineers' day-to-day work.
But we should also be aware that there are few quick wins in this space. That OpenAI team above spent several months improving their harnesses. That's where their “1/10th of the effort” went, and I suspect they are far from done.
“Harness” is probably a word we'll be using a lot more going forward. And fucking finally, I actually like a term that's gaining traction around AI.