The Agent Stays Up Late, Not Me

Every senior engineer knows the right way to set up a codebase. None of them do it. Here’s the four-stage framework we use — The Ratchet — to take a vibe-coded project all the way to a thing you’d trust in production, and the punchline about why this only just became worth doing.

Keith Pattison

April 24, 2026 13 min read Updated April 25, 2026
Pen-and-ink sketch of a small clockwork robot working at a tool-covered workbench late at night while a human sleeps peacefully on a couch in the background, a wall clock reading 2:00 above

Every senior engineer I have ever worked with knows their codebase needs more tests. Knows the linter config is two years out of date. Knows the dependency tree has three known CVEs in it. Knows the CSS bundle has tripled in size since the redesign and nobody can find the dead rules. Knows it. Has done none of it.

The reason isn’t ignorance. The reason is economics. Setting up the boring guardrails takes weeks of senior-engineer time and pays back, in any given quarter, maybe two saved bugs and one unbroken weekend. Across a year, maybe nine. Across a decade of careers, the math just never bent in favor of doing the work.

So we didn’t. We rolled the dice. We told ourselves “we’ll add tests after launch,” and then we didn’t add tests after launch, and three years later we hired somebody to rewrite the whole thing and we called the previous version “legacy.” This is the entire history of software engineering as a profession, and it’s the dirty secret of every shop you’ve ever worked at, including the ones with the nice landing pages.

That math just changed. Not by a little. By an order of magnitude. And almost nobody has updated their internal model.

The Ratchet: a four-stage framework

We ship two production codebases. This week we finished hardening both of them through a framework I’m going to name in this piece, partly because naming things makes them easier to argue about and partly because it’s mine and I get to. We call it The Ratchet.

Four stages. Each one answers a different question about the codebase. Each one has a recommended trigger — a moment when you stop putting it off. The order matters. The trigger conditions matter more.

| Stage | Question it answers | When to start | What goes in it |
| --- | --- | --- | --- |
| 1. Readable | Can the next person open this file and not get lost? | Day zero. Yes, day zero. | Linting (ESLint), Types (TypeScript) |
| 2. Changeable | Can I touch this code without breaking three other things? | The day you commit a feature you don’t want to rewrite. Usually week two. | Tests + coverage floor (vitest), Dead-code detection (fallow) |
| 3. Safe | Can this code hurt my users or my business? | Before the first real user. Sooner if the data is sensitive. | Security audit (npm audit), Accessibility (jsx-a11y), Destructive-action confirms |
| 4. Durable | Will this code make sense to someone who didn’t write it? | The day a second entity (human or agent) commits to the repo. | Commit discipline (commitlint), Decision histories, Style baselines (Wallace), Bundle budget |

Thirteen walls across four stages. None of them are new. Every one of these libraries has existed for years. The thing that changed isn’t the tools.

Important

The thing that changed is who pays the cost of running into the walls. Pre-agents, every wall was a tax on a human — a slowdown, a Slack ping, a tired Friday. Now every wall is a tax on tokens. The agent retries until the wall lets it through. The dollar cost is real. The cost in human attention is zero. Once you internalize that one fact, the rest of this article is obvious, and so is the framework.

Let me walk through the four stages.

Stage 1: READABLE

The question: Can someone (or something) open this file tomorrow and understand it without asking three Slack questions?

The walls: ESLint, the linter everyone in JavaScript uses. TypeScript, which checks the shape of your data across function calls.

Plain English on the libraries. A linter is fancy-speak for spellcheck-but-for-code: it reads every file and flags variables you forgot to use, comparisons that are always false, function names you typoed. TypeScript is a second pass that makes sure when one function calls another, the kinds of arguments line up. Together they catch maybe half the bugs a senior reviewer would catch in their first pass, automatically, before any human looks at the PR.
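
To make "the kinds of arguments line up" concrete, here is a minimal sketch. The names `Draft` and `summarize` are hypothetical, invented for illustration, and the commented-out lines show the kind of mistake each wall would be expected to flag, assuming a strict TypeScript setup and ESLint's `no-unused-vars` rule.

```typescript
// Hypothetical shape of a record passed between functions.
interface Draft {
  id: string;
  title: string;
  wordCount: number;
}

// The signature is the contract TypeScript checks at every call site.
function summarize(draft: Draft): string {
  return `${draft.title} (${draft.wordCount} words)`;
}

// A call whose argument matches the declared shape compiles and runs.
const summary = summarize({ id: "d1", title: "The Ratchet", wordCount: 2800 });

// Calls like these would be rejected by the type checker before any review:
//   summarize({ id: "d1", title: "The Ratchet" });           // missing wordCount
//   summarize({ id: "d1", title: "x", wordCount: "2800" });  // string, not number
//
// And a lint rule such as ESLint's no-unused-vars would flag a leftover like:
//   const tmp = summary.length; // assigned but never read
```

Neither failure needs a human to notice it. The agent gets an error message at the call site and retries.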

The irony of stage one. Linting is the most boring topic in software engineering. There has not been a single linting talk at a major conference in five years that didn’t feel like a chore to attend. And yet ESLint and TypeScript, between them, have prevented more production outages than every clever architectural decision of the last decade combined. We just don’t talk about them, because nobody high-fives you for setting up a linter.

My recommendation. Do this on day zero. The day-zero conversation that ends with “we’ll add it later” is the conversation you’ll regret in month three, when nobody can read each other’s files and the agent is generating new code in three slightly incompatible styles depending on which file it last touched. There is no cheaper, higher-leverage moment to install these two walls than the moment you run npm init.

Stage 2: CHANGEABLE

The question: Can I change something in this codebase without breaking three other things?

The walls: vitest plus a coverage floor (a rule that says: tests cover at least N% of the code, and the number can only go up). fallow, which finds dead code — files nobody imports, exports nobody uses, dependencies nobody references.

Plain English. Tests are little programs that run the real program and check the answers come out right. The coverage floor is the part most teams skip; it’s a single configuration file that says “build fails if test coverage drops below today’s number.” Dead-code detection is a tool that maps the import graph across the whole project and flags the orphaned bits.

The juxtaposition. A codebase without tests is a relationship without a memory. Every change feels like the first time you’ve touched the system, because the system can’t tell you whether you broke anything. A codebase with tests and a coverage floor is the opposite — every change is having a conversation with everyone who shipped before you. The system speaks back. The agent loves this. It tries something. The tests tell it what broke. It tries again, narrower, and the tests tell it that worked. Without those tests, the agent has to ask me. With them, it doesn’t.

My recommendation. Add this the day you ship a feature you don’t want to throw away. For us that’s usually week two, sometimes week one. Before stage two you are vibe-coding. After stage two you are building software. The transition is structural, not vibes-based, and the gate is what makes it real.

Stage 3: SAFE

The question: Can this code hurt somebody?

The walls: npm audit for known-vulnerable dependencies. jsx-a11y for accessibility issues at lint time. A type-level pattern (we use Convex’s v.literal(true) in tool definitions) that requires explicit confirmation for any irreversible action.

Plain English. Modern projects depend on hundreds of small open-source libraries. Some of those libraries turn out to have security holes — ways for attackers to get into your system through a vulnerability in code you didn’t write. npm audit checks our full dependency list against a public database of known issues. Accessibility means the site works for people who don’t see your screen the way you do — keyboard users, screen-reader users, low-vision users. Destructive-action confirms means: when the agent calls a function that publishes to production or deletes a record, the call has to include confirmPublish: true or the type system rejects it. Two bytes of effort. Free at runtime. Eliminates an entire category of disaster.

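// Import paths assume this file sits directly in the convex/ directory.
import { mutation } from "./_generated/server";
import { v } from "convex/values";

export const publish = mutation({
  args: {
    id: v.id("drafts"),
    // Only the literal `true` validates; omitting the flag or passing false is rejected.
    confirmPublish: v.literal(true),
  },
  handler: async (ctx, { id }) => {
    // ...
  },
});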
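
The same idea can be sketched without Convex, in plain TypeScript. This is an illustration under assumptions, not the code from our repo: `PublishArgs`, `publishDraft`, and the return shape are hypothetical names. The literal type `true` makes the compiler reject a call that omits the flag or passes `false`, and the runtime guard covers callers that bypass the type checker.

```typescript
// Hypothetical argument shape: confirmPublish must be the literal `true`,
// not just any boolean, so omitting it or passing `false` fails to compile.
type PublishArgs = {
  id: string;
  confirmPublish: true;
};

// Hypothetical irreversible action. The runtime check is a second line of
// defense for untyped callers (for example, JSON arriving from a tool call).
function publishDraft(args: PublishArgs): { published: string } {
  if (args.confirmPublish !== true) {
    throw new Error("publishDraft requires confirmPublish: true");
  }
  return { published: args.id };
}

// Accepted: the caller has to spell out the confirmation.
const result = publishDraft({ id: "draft-1", confirmPublish: true });

// Rejected at compile time:
//   publishDraft({ id: "draft-1" });
//   publishDraft({ id: "draft-1", confirmPublish: false });
```

The point of the design is that the confirmation lives in the function's signature, so a caller cannot reach the destructive path by accident.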

The metaphor. Stage 3 is the smoke detectors in your house. You don’t think about them most days. You forget you have them. Then one Tuesday at 3 a.m. they save your life and you remember why they were never optional in the first place. Most teams treat security and accessibility like smoke detectors with the batteries pulled out — still installed, technically there, just not actually doing the job.

My recommendation. Stage 3 happens before the first real user. If you’re building anything stakeholders see — a demo, a beta, a friends-and-family rollout — you need this in place. If your data is sensitive, you need it sooner. The day you create a real database with real customer data and don’t have stage 3 walls is the day a story starts that ends with you on a phone call you don’t want.

Stage 4: DURABLE

The question: Will this code make sense to someone who didn’t write it — in six months, in a year, after the team turns over?

The walls: commitlint enforces a convention on git commit messages (every commit starts with feat: or fix: or chore:). Decision histories: markdown files checked into the repo that explain why each major architectural choice was made. Style baselines via Project Wallace, which freezes the complexity of the compiled CSS so it can’t silently grow. A bundle-size budget so the JavaScript payload can’t silently grow either.

Plain English. Commitlint makes every git commit message follow a structure, so when somebody (or the agent) greps the git log six months from now they can actually find what they’re looking for. Decision histories are the equivalent of commit messages for things that don’t fit in a commit — “why is the admin app separate from the tenant apps” kind of questions. Style and bundle baselines work the same way the coverage floor does: a number is frozen, and the build fails if it gets worse.
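
commitlint is configured rather than hand-written, but the rule it enforces is simple enough to sketch. Below is a minimal check, assuming only the three prefixes named above plus an optional scope. `isConventional` and the regular expression are illustrative, not commitlint's implementation or its full rule set.

```typescript
// Allowed types are the three named in this article; real configs list more.
const ALLOWED_TYPES = ["feat", "fix", "chore"] as const;

// A type, an optional "(scope)", then ": " and a non-empty subject.
const HEADER = new RegExp(`^(${ALLOWED_TYPES.join("|")})(\\([a-z0-9-]+\\))?: \\S.*$`);

// Returns true when the first line of a commit message follows the convention.
function isConventional(message: string): boolean {
  const firstLine = message.split("\n")[0] ?? "";
  return HEADER.test(firstLine);
}

const examples = [
  isConventional("fix(deploy): pass the secret through the workflow"), // true
  isConventional("feat: add coverage floor"),                          // true
  isConventional("updated stuff"),                                     // false
];
```

A structured prefix is what makes the log searchable later: filtering for `fix` commits touching a given scope becomes a one-line grep.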

The irony of stage four. This is the stage nobody high-fives you for. There is no demo of “our commit messages are very consistent.” There is no investor deck slide for “we have a markdown file explaining why we picked Convex.” And yet stage four is the stage that determines whether your codebase is alive in five years or whether it’s a thing you pay somebody to gradually replace. Every legacy codebase you’ve ever encountered failed stage four years before it failed any other stage.

My recommendation. Stage four lands the day a second entity — a teammate, an agent, a contractor — commits to the repo. Before that, the codebase has one author and one author’s memory, and the durability question doesn’t bind. The moment a second pair of hands touches it, the question “will this still make sense” goes from rhetorical to operational, and you need the walls in place before the divergence starts.

What this looks like in practice

Thursday night I went to bed with the agent on its third PR for the same deploy fix. The first two had landed red and been reverted — one died on a Cloudflare token scope, one died on a secret that hadn’t been wired through the workflow. As I closed the laptop, the agent was opening attempt three.

Woke up. PR #134 green. All four tenants deploying clean.

The walls did the work. Stage 1 caught the type errors as the agent iterated. Stage 2 ran the tests against each new attempt. Stage 3 caught a missing destructive-action confirm on one of the rollout helpers. Stage 4’s commit-discipline gate forced the agent to write a clean PR description on each retry, which means the git log of those three PRs reads as a debugging journal: failed at A, hypothesis was B, B was wrong, real cause was C, fixed.

flowchart LR
  A([agent commit]) --> C["npm run check — thirteen gates, 90s ceiling"]
  C -->|all pass| M([merge])
  C -.->|any fails| A

All thirteen walls run behind one command. The agent types npm run check and either everything passes or something fails and the agent reads the failure and tries again. Ninety seconds, max. Fast gates keep the retry loop cheap; slow gates would kill it.
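
The single-command shape can be sketched as a tiny runner: a list of gates, run in order, stopping at the first failure so the agent has exactly one message to read. This is a hypothetical illustration, not our `npm run check` script. The `Gate` type, the gate names, and the budget constant are assumptions, and real gates would shell out to tools rather than call in-process functions.

```typescript
// A gate is a named check that returns an error message, or null on success.
type Gate = { name: string; run: () => string | null };

type CheckResult =
  | { ok: true; elapsedMs: number }
  | { ok: false; failedGate: string; message: string; elapsedMs: number };

// Ceiling from the article: the whole run should finish within 90 seconds.
const BUDGET_MS = 90 * 1000;

// Runs gates in order and stops at the first failure or budget overrun.
function runCheck(gates: Gate[], budgetMs: number = BUDGET_MS): CheckResult {
  const start = Date.now();
  for (const gate of gates) {
    const message = gate.run();
    const elapsedMs = Date.now() - start;
    if (message !== null) {
      return { ok: false, failedGate: gate.name, message, elapsedMs };
    }
    if (elapsedMs > budgetMs) {
      return { ok: false, failedGate: gate.name, message: "time budget exceeded", elapsedMs };
    }
  }
  return { ok: true, elapsedMs: Date.now() - start };
}

// Stand-in gates; the third never runs because the second fails.
const outcome = runCheck([
  { name: "lint", run: () => null },
  { name: "typecheck", run: () => "TS2345: argument type mismatch" },
  { name: "test", run: () => null },
]);
```

Treating a budget overrun as a failure is one way to keep the ceiling honest: a gate that gets slow breaks the build just as a gate that gets red does.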

Pen-and-ink sketch of a clockwork robot carrying a stack of paper labeled COMMIT, walking through a row of nine wooden gates each marked with a letter, toward a smiling human standing at a finish line ribbon labeled MERGE

Pen-and-ink sketch of a small clockwork robot pressing its forehead against a tall brick wall painted with the word TYPECHECK, cracks radiating from the contact point and a sliver of sunlight visible on the far side

Tip

A junior engineer doing this would burn out by Wednesday. The agent doesn’t. It hits the wall a hundred times in a perfectly even mood, logs what it learned, and tries a hundred and one. That is the entire reason walls work — the thing running into them doesn’t get tired, and never has the conversation with itself about whether the gate is being unreasonable.

The literal ratchet

Pen-and-ink close-up of a hand turning a ratchet wrench labeled RATCHET on a bolt set in a steel plate; a paper tag on the bolt reads COVERAGE FLOOR and a small note below reads no going back

A ratchet only turns one direction. That is the move. Most of the walls in stages 2, 3, and 4 share a single mechanism: each one has a baseline file checked into the repo that records the worst the codebase is allowed to be on that dimension today. Coverage can’t drop below today’s number. CSS complexity can’t grow past today’s measurement. Dead exports can’t increase. The agent cannot loosen a baseline. Only a human can. So every PR either holds the line or tightens the bolt one notch, and drift just isn’t a thing that can happen on this codebase.

A representative shape, in which each app freezes its own coverage floor and the build fails if any number drops:

{
  "coverage": {
    "thresholds": {
      "lines":      82.4,
      "functions":  79.1,
      "branches":   74.0,
      "statements": 82.4
    }
  }
}

Four numbers per app. The test suite runs, the script reads the report, the build fails if anything drops. That’s the mechanism. It is the most boring code in our entire stack and the reason the bar only moves one direction.
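
The script behind that file can be sketched in a few lines. What follows is an assumption about the shape, not our actual script: `Thresholds`, `findRegressions`, and `tighten` are hypothetical names, and the measured numbers are made up for the example. The check fails on any metric below its floor, and tightening only ever raises a floor, which is the one-way property.

```typescript
// The four coverage metrics frozen in the baseline file.
type Thresholds = {
  lines: number;
  functions: number;
  branches: number;
  statements: number;
};

const METRICS = ["lines", "functions", "branches", "statements"] as const;

// Returns one message per metric that dropped below its floor.
function findRegressions(floor: Thresholds, measured: Thresholds): string[] {
  return METRICS.filter((m) => measured[m] < floor[m]).map(
    (m) => `${m}: ${measured[m]}% is below the floor of ${floor[m]}%`,
  );
}

// Raises each floor to the measured value when coverage improved; never lowers.
function tighten(floor: Thresholds, measured: Thresholds): Thresholds {
  return {
    lines: Math.max(floor.lines, measured.lines),
    functions: Math.max(floor.functions, measured.functions),
    branches: Math.max(floor.branches, measured.branches),
    statements: Math.max(floor.statements, measured.statements),
  };
}

// Baseline from the representative file above; measured values are invented.
const floor: Thresholds = { lines: 82.4, functions: 79.1, branches: 74.0, statements: 82.4 };
const measured: Thresholds = { lines: 83.0, functions: 78.5, branches: 74.0, statements: 83.0 };

const regressions = findRegressions(floor, measured);
// regressions → ["functions: 78.5% is below the floor of 79.1%"]
const nextFloor = tighten(floor, measured);
// nextFloor → { lines: 83, functions: 79.1, branches: 74, statements: 83 }
```

In this example the build fails on `functions` even though two other metrics improved. A gain on one dimension does not buy slack on another.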

What it cost

Token bill: ~$200 across one week
Hours returned: 15–20 of human review friction
Sundays saved: the number that matters

Back-of-envelope. The token side wasn’t metered per-PR, and the hours figure is my own reconstruction, not a timesheet.

The four ironies

I want to leave you with the parts of this whole story that I find genuinely funny.

The boring stuff turned out to be the leverage point. For twenty years, the highest-status work in software engineering was architecture — picking the right framework, the right database, the right async paradigm. The boring grunt work — linting, types, tests, coverage — was junior work, the stuff you delegated. It turns out the boring stuff is what determines whether your codebase is still working at month thirty-six. The architecture doesn’t.

The agent — the thing senior engineers feared would lower quality — is the only thing that makes maximum quality affordable. Two years ago the loudest engineers in my circle were predicting the death of software craftsmanship. Their model was: agents will write more code, faster, sloppier, and we’ll all drown in it. They were half right. Agents do write more code. The other half is that agents are also the only entity capable of running into thirteen walls in a row, all night, without quitting. The walls were always the right idea. Until 2024 we couldn’t afford them. We can now.

We argued about test discipline for a decade. The agent never argued. Every team I have ever worked on had at least one Slack thread about whether to write a test for a piece of code. The thread was always long, always frustrating, and always ended in a shrug. The agent does not have this argument with itself. You configure a coverage floor and the agent writes tests, because the alternative is the build failing and the agent retrying. It is the stupidest, most boring resolution to a decade-long debate, and it is correct.

Most teams will read this article and not change anything. Because the activation energy of starting feels higher than the cost of continuing not to. They’re wrong. They were wrong about it before agents existed too. The difference is that, in 2026, they’re also wrong by an order of magnitude. The math has updated. The model in their head hasn’t.

It’s no longer my job to be the wall. It’s my job to build the walls and check the calibration. If you’re running an engineering team in 2026 and you don’t have a Ratchet (a four-stage progression of automated, agent-readable checks for every dimension of code quality), you’re charging your humans every cost those checks would charge to tokens. Wrong direction. Tokens are cheap. People are not.

One more thing

This article was drafted, formatted, illustrated, and published by the agent over the same MCP it’s been arguing for. The four images came out of one tool call each. The callouts and tables and code blocks and the little flowchart you scrolled past — all emitted from typed tool calls into the CMS.

I reviewed. Pushed back five times. Once on length. Once on the cost numbers being too confident. Once because the first draft read like a list of tools instead of an argument. Once because the rewrite that fixed the argument still read like nothing a human would actually write. Once more, because the next rewrite had a thesis but no framework — no name, no stages, no recommendation. This is the version after that.

Keystroke to live URL: under forty minutes. Most of those minutes were me typing “this still doesn’t feel right.” The agent kept rewriting. That is the article.

About the author

Keith Pattison

Founder, Black Flag Design

Keith leads Black Flag Design, a studio that ships production-ready software with AI-assisted development. He writes about the disciplines — small scope, weekly evidence, and human oversight — that keep AI-built systems reliable in the real world.
