Trusting AI-Generated Code

Can layered specifications, tamper-proof tests, and scoped human review build confidence in AI-generated code?

Trusting AI-Generated Code
Photo by Clark Van Der Beken / Unsplash

Companies are telling their developers to use more AI for writing code, while simultaneously requiring more reviews of that code. Trust in AI accuracy has dropped from 40% to 29% last year, but when you ask engineers what would change that number, nobody has a good answer.

I've been sitting with this contradiction for a while, motivated by my own experience with coding agents and what I see across the industry. Speaking with other developers, most of our trust seems to come from actually seeing the code, applying our experience and intuition, and judging the craftsmanship with which it was created. But that kind of review doesn't scale.

In order to understand this better, I have been asking myself what it would take for me personally to ship AI-generated code without any review. What would I need to have the necessary confidence?

Current State

I have always been a big fan of automating as much of the quality control in my projects as possible. Even before coding agents, I thought it was important to allow them to focus on the important parts and automate things like formatting and linting.

Over time, the following set of tools has made its way into every single one of my projects:

  • Test suites with high test coverage execute the code and catch regressions
  • Linters check for common mistakes and errors in code.
  • Formatters ensure the style is consistent.

A new addition are automated code reviews by AI agents like GitHub Copilot, which provide a really good first pass to catch issues before handing a pull request to a colleague for further review.

The Gap

These automated checks are the baseline. They catch surface-level problems, but they don't address two deeper questions: did the AI actually understand what I wanted, and is it gaming the verification loop? To feel confident in the AI merging its code without my review, a few things are missing for me.

Specifications

Specifications are the most important lever here. Better specs produce better code, but the ecosystem around spec-driven development is still immature. Formats, conventions, and tooling are still changing a lot.

Tools like tracey try to map requirements in specifications to code and tests, ensuring specs are implemented and verified. But they are mostly prototypes, which makes it hard to actually trust them yet.

LLM-as-Judge systems like GitHub Copilot Reviews might help verify that implementations match their specifications, but an LLM reviewing another LLM's work risks them both making the same mistake.

Tamper-Proof Tests

AI can easily cheat and game the tests. I've seen coding agents hardcode values, disable tests, or slip in an assert!(true) to make tests pass without actually exercising the code. Requiring minimum test coverage on modified code helps here: If the AI fakes a test, coverage drops.

Holdout scenarios take this further: tests that the coding agents do not know about, maintained and run by an adversarial process. The coding agent can't cheat, because it doesn't know the test scenarios. The testing agent treats the code as a black box and simply reports the failures.

But I haven't seen an implementation of this that would be easy to copy into my own projects. The tooling to separate these concerns in a real CI pipeline doesn't really exist. At least not for Rust.

Organizational Conventions

I want more deterministic checks on organizational conventions. Like custom lints that don't just check code quality, but enforce consistency in how things are structured across a project. Naming conventions for tests, commit message styles, ... Today, a lot of these rules live in my CLAUDE.md, but if I am no longer reviewing the code they must be deterministically enforceable.

Escalation to Human Review

There will still be a need for humans to review sensitive code, for example authentication logic, handling of user data, and access to secrets and infrastructure. We can gate on some of these today with CODEOWNERS files, but they're file-based, not logic-based. A new module that touches authentication code won't get flagged unless someone remembers to update the rules. Coverage will quietly degrade.

Layering Trust

None of these will close the trust gap on their own. But if we layer specifications, tamper-proof tests, coverage checks, lints, and scoped human review, we might start to create enough confidence to let AI ship its code autonomously.

The question I'm left with is about the review experience itself. I feel forced to either review all or none of a pull request. What I actually want is a code review that surfaces only the parts where automated checks ran out of confidence. Does that exist yet?

Let's Talk!

If you're thinking about trust in AI-generated code too, I'd love to hear from you! Send me an email to [email protected] or find me on Bluesky or Mastodon. 👋