When Being Wrong Costs Someone Their Freedom

In public systems where an error changes the course of a person's life, the goal of applied AI is not to decide faster. It is to make expensive human judgment go further — with every step visible, contestable, and owned by a person.

Eli Wood

June 24, 2026 · 5 min read

Some systems can absorb a wrong answer. A recommendation engine that misjudges your taste suggests a bad movie; you scroll past, and nobody is harmed. Public systems that touch a person's liberty are not those systems. Here a wrong answer is not a bad suggestion. It is someone held longer than they should be, someone flagged who should not have been, someone denied something they were owed. The cost of being wrong is not measured in churn. It is measured in a life.

That changes the entire calculus for applied AI. In a low-stakes setting, a model that is right ninety percent of the time is a triumph. In a high-stakes public system, the question is not the average — it is what the ten percent costs, who bears it, and whether anyone can see it happen. A tool built for these systems has to be designed around its errors, not its accuracy, because the errors are where people get hurt.

The problem: the work is judgment, and the volume is crushing

The people inside these systems — case managers, analysts, officers, administrators — are not short on judgment. They are short on time. Each of them carries a caseload that makes careful, individual attention nearly impossible, so they spend their scarcest resource — judgment — on work that does not require it: assembling the file, reconciling records that disagree, re-deriving the same summary for the hundredth person this month. The human judgment that should decide the hard call gets burned on the clerical work that precedes it.

So the pull toward automation is real and correct. There is enormous repetitive load here that does not need a human and is actively starving the parts that do. But there is a cliff next to that opportunity. Automate one step too far, letting the model make the call instead of preparing it, and you have built a system that decides a person's path with no one accountable for the decision, dressed in the false authority of a number. That is not efficiency. It is harm at scale, and it is harder to contest than a human decision ever was.

Why it is stuck: tools optimize for a score, not for being questioned

Most tooling in this space treats the problem as prediction: feed in the history, output a risk number, rank the list. The trouble is that a number trained on the past faithfully reproduces the past — including every disparity baked into who got policed, charged, and confined before. The tool does not remove the bias. It launders it into something that looks objective and is much harder to argue with.

The deeper failure is that a score cannot be interrogated. When a person's outcome turns on a model's output, the only thing that matters in the room is why — and "the system said so" is not an answer a caseworker can stand behind, a supervisor can review, or a person can appeal. The real work in these systems is judgment under uncertainty with a duty to explain it. That is exactly the shape of problem modern AI can assist with, and exactly the shape that punishes a system built with no accountability inside it.

The path: build the tool as a judgment amplifier, with receipts

The tools that belong in high-stakes public systems are not the ones with the best predictive score. They are the ones built on a few principles:

  • Keep a human in the loop wherever being wrong is costly. That is the entire premise here, not a safety bolt-on. The system reads the file, assembles the picture, and surfaces what likely matters; a person makes every decision that changes a life and signs their name to it. AI compresses the preparation, never the judgment.
  • Separate the rules engine from the judgment engine. Eligibility thresholds, statutory requirements, and policy constraints are rules — explicit, auditable, owned by the agency, changeable without a model retrain. Whether a specific person's situation warrants a specific action is judgment, and it stays with a human. Tangle the two and you can never tell whether an outcome came from a policy someone chose or a model no one understands.
  • Start where judgment is expensive and repetitive. The highest-leverage build is not the decision — it is everything that precedes it: gathering the record, reconciling conflicting data, drafting the summary an overloaded human currently builds by hand for every case. Give that time back and the human spends their judgment where it counts.
  • Earn trust with explainability, because the explanation is the product. Every output has to carry its receipts: this conclusion, from these facts, with this gap, here is the source. A caseworker has to be able to verify it, a supervisor to review it, and the person affected to contest it. In a system where decisions are appealable and ought to be, an opaque answer is not just useless — it is a liability the moment someone asks how you knew.
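
To make the separation concrete, here is a minimal sketch of what these principles could look like in code. It is an illustration under stated assumptions, not a reference design: the residency rule, the 12-month threshold, the field names, and every class and function name below are hypothetical. The point is the shape: rules are explicit checks that return a receipt, preparation never returns an action, and a decision cannot be recorded without a named person and a written rationale.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Receipt:
    """One claim, tied to the source it came from and any known gap."""
    claim: str
    source: str                 # hypothetical document ID in the case file
    gap: Optional[str] = None   # what the record does not establish


@dataclass(frozen=True)
class RuleResult:
    rule_id: str
    passed: bool
    receipt: Receipt


@dataclass(frozen=True)
class PreparedCase:
    case_id: str
    rule_results: List[RuleResult]


@dataclass(frozen=True)
class Decision:
    case_id: str
    action: str
    decided_by: str
    rationale: str


# Hypothetical policy constant: owned by the agency, changed without a model retrain.
MIN_RESIDENCY_MONTHS = 12


def check_residency(record: dict) -> RuleResult:
    """Rules engine: an explicit threshold check with no model involved."""
    months = record.get("residency_months")
    if months is None:
        # Missing data is surfaced as a gap instead of being silently guessed.
        receipt = Receipt("Residency length not established", "intake_form",
                          gap="residency_months missing from record")
        return RuleResult("min_residency", False, receipt)
    receipt = Receipt(
        f"Residency is {months} months (threshold {MIN_RESIDENCY_MONTHS})",
        "intake_form")
    return RuleResult("min_residency", months >= MIN_RESIDENCY_MONTHS, receipt)


def prepare_case(record: dict) -> PreparedCase:
    """Preparation only: assemble rule results and receipts. Returns no action."""
    return PreparedCase(record["case_id"], [check_residency(record)])


def record_decision(case: PreparedCase, action: str,
                    decided_by: str, rationale: str) -> Decision:
    """Judgment: a named person decides and signs, or nothing is recorded."""
    if not decided_by.strip():
        raise ValueError("a decision must be signed by a named person")
    if not rationale.strip():
        raise ValueError("a decision must carry a written rationale")
    return Decision(case.case_id, action, decided_by, rationale)


if __name__ == "__main__":
    case = prepare_case({"case_id": "C-1", "residency_months": 14})
    for result in case.rule_results:
        print(result.rule_id, result.passed, "|", result.receipt.claim,
              "| source:", result.receipt.source)
    print(record_decision(case, "approve", "J. Rivera",
                          "Meets residency rule; documents verified in person."))
```

Nothing in the sketch maps a prepared case to an action. The only route to a Decision runs through record_decision, which is where the signature and the rationale get enforced.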

Building for high-stakes public systems is not a prediction project with a dashboard on top. It is a focused question: which judgment in your workflow is most expensive and most repetitive right now — the intake, the file assembly, the recurring review — and what is the smallest system that helps a human make it better without making it for them? That is a two-day conversation before it is a roadmap. You sit with one real workflow, find the one place where preparation is eating judgment, and build a thin tool that does the preparation, shows its work, and hands the decision back to the person who has to own it.

The temptation in these systems is always to automate the decision, because the decision is the bottleneck. The discipline is to automate everything around it instead — to make the scarce human judgment go further, not to replace it with a number nobody can answer for. The tools that earn a place here will not be the ones that decided fastest. They will be the ones a caseworker trusted, a supervisor could audit, and a person could challenge — because where being wrong costs someone their freedom, being able to show your work is not a feature. It is the whole job.


Black Flag Design builds applied-AI products for decisions that can't afford to be wrong. If this is your world, spend two days with us — we call it a Foundation Sprint.

About the author

Eli Wood

CEO, Black Flag Design

Eli Wood leads Black Flag Design, a creative technology company focused on shipping ambitious digital products, AI systems, and design-forward software with a direct point of view on how technology changes work.
