The Judgment Engine: Building AI That Knows When to Stop and Ask

Most AI failures in health settings aren't technical failures. They're failures to recognize when the machine should hand the decision back to a human.

Eli Wood

June 24, 2026 4 min read

There is a category of problem where AI gets everything right—the parsing, the retrieval, the pattern match—and still causes harm. Not because the model hallucinated, but because no one designed the moment where it stops and says: this one needs a human.

In high-stakes health settings, that moment is everything.

The problem

Teams building AI for health workflows face a paradox. The use cases that would deliver the most operational value—flagging a care gap, surfacing a risk signal, triaging a queue—are precisely the ones where a wrong output isn't just incorrect, it's consequential. So AI gets relegated to the safer periphery: scheduling, document summarization, FAQ answering. Useful, but not transformative.

The result is a generation of health AI that is technically competent and operationally timid.

Why it stays stuck

The instinct is to solve this with accuracy. If we can get the model to 95% precision, maybe 97%, surely it earns the right to act. But precision on a benchmark doesn't translate to safety in context. A model can be highly accurate in the aggregate and still be wrong at exactly the wrong moment—for exactly the wrong patient.
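A quick back-of-the-envelope sketch makes the point concrete. Only the 95% and 97% figures come from the paragraph above; the daily volume of 400 acted-on outputs is a made-up number for illustration, and the function name is hypothetical.

```python
def expected_wrong_actions(precision: float, actions_per_day: int) -> float:
    """Expected number of incorrect actions per day if every positive output is acted on."""
    return (1.0 - precision) * actions_per_day


# Assumption: 400 acted-on outputs per day (illustrative, not from any real deployment).
for precision in (0.95, 0.97):
    wrong = expected_wrong_actions(precision, actions_per_day=400)
    print(f"precision={precision:.2f}: about {wrong:.0f} wrong actions per day")
```

Under those assumed numbers, moving from 95% to 97% takes the expected count from about 20 wrong actions a day to about 12. The aggregate improves, but nothing in the number says which patients sit inside the remainder.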

The deeper issue is architectural. Most teams build one engine that is asked to do two jobs: apply known rules and exercise judgment. Those are different cognitive tasks. Rules can be automated; judgment requires accountability. When you collapse them into a single inference call, you lose the ability to distinguish between an output the model is confident about and one it's guessing at.

Trust erodes quickly when the system can't explain itself. Clinicians don't object to AI on principle—they object to AI that can't show its work.

The path

The teams making real progress in this space have converged on a set of design principles that are simple to state and hard to execute unless you design for them deliberately:

Separate the rules engine from the judgment engine. Deterministic checks—formulary compliance, eligibility rules, standard protocol steps—belong in a rules layer the model doesn't touch. Let the model handle interpretation: ambiguous documentation, edge cases, context that doesn't fit a rule. This separation makes auditing tractable and failure modes legible.

Start where judgment is expensive and repetitive. The highest-ROI entry point is usually a workflow where a credentialed human is doing the same cognitive work dozens of times a day—reading the same type of note, making the same type of triage call. That's where AI augmentation pays off fastest and where you can instrument feedback loops from day one.

Build the hand-off before you build the inference. Before any model goes near a production workflow, design the escalation path: what triggers a human review, who receives it, what they see, how their decision gets logged. The model's job is to reduce the volume hitting that path, not to replace it. That framing changes how stakeholders think about the system—and dramatically lowers the approval barrier.

Earn trust with explainability, not just accuracy. An AI that surfaces a risk flag alongside the three specific data points that generated it will be trusted faster than one that delivers a score. Clinicians and care managers are trained to evaluate evidence. Give them evidence.
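Three of these principles (the rules/judgment split, the hand-off, and evidence alongside every output) can be sketched together in a few dozen lines. Treat this as a minimal illustration, not a reference design: every name here (`Request`, `Outcome`, `rules_layer`, `process`, `stub_judge`) is hypothetical, the confidence threshold and evidence count are arbitrary assumptions, and `stub_judge` stands in for whatever model call a real system would make.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Request:
    member_eligible: bool
    drug_on_formulary: bool
    note_text: str


@dataclass
class Outcome:
    route: str               # "auto_rule", "model_suggestion", or "human_review"
    decision: Optional[str]  # None when a human has to decide
    evidence: list[str] = field(default_factory=list)
    reason: str = ""         # why the case was escalated, if it was


def rules_layer(req: Request) -> Optional[Outcome]:
    """Deterministic checks only. The model never sees or overrides these."""
    if not req.member_eligible:
        return Outcome("auto_rule", "deny", ["eligibility rule: member not eligible"])
    if not req.drug_on_formulary:
        return Outcome("auto_rule", "deny", ["formulary rule: drug not on formulary"])
    return None  # no rule settles the case; pass it to the judgment layer


# A judgment function returns (suggestion, confidence, evidence items).
Judge = Callable[[str], tuple[str, float, list[str]]]


def process(req: Request, judge: Judge,
            min_confidence: float = 0.8, min_evidence: int = 3) -> Outcome:
    """Rules first, then judgment, with an explicit hand-off to a human."""
    ruled = rules_layer(req)
    if ruled is not None:
        return ruled
    suggestion, confidence, evidence = judge(req.note_text)
    # Assumed escalation triggers: low confidence or too little supporting evidence.
    if confidence < min_confidence:
        return Outcome("human_review", None, evidence, reason="low confidence")
    if len(evidence) < min_evidence:
        return Outcome("human_review", None, evidence, reason="insufficient evidence")
    return Outcome("model_suggestion", suggestion, evidence)


def stub_judge(note: str) -> tuple[str, float, list[str]]:
    """Stand-in for a model call; returns canned values for illustration."""
    if "unclear" in note:
        return ("approve", 0.55, ["note mentions prior therapy"])
    return ("approve", 0.92,
            ["diagnosis documented", "prior therapy failed", "dose within protocol"])


print(process(Request(True, True, "history unclear"), stub_judge).route)
print(process(Request(True, False, "complete note"), stub_judge).route)
```

The design choice worth noticing is that `route` is recorded on every outcome. An auditor can tell at a glance whether a rule, the model, or a person was responsible for a given case, and a reviewer always receives the evidence list rather than a bare score.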

The 2-day start

Pick one queue. Find the workflow where a human expert is making the same judgment call repeatedly—reviewing prior auth requests, triaging inbound messages, flagging patients due for outreach. Spend one day mapping the inputs and the decision criteria they actually use (not the policy document—the real heuristics). Spend the second day building a prototype that surfaces those inputs, generates a recommendation with a visible rationale, and puts a single-click escalation button in front of the reviewer. Don't automate the decision yet. Instrument what gets overridden and why. That data is your trust-building asset for everything that comes next.
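The instrumentation step can start very small. The sketch below assumes each review is logged as one record holding the recommendation, the reviewer's final decision, and an optional override reason; the `ReviewEvent` and `override_summary` names and the sample cases are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ReviewEvent:
    case_id: str
    recommendation: str        # what the prototype suggested
    final_decision: str        # what the reviewer actually chose
    override_reason: str = ""  # empty when the recommendation was accepted

    @property
    def overridden(self) -> bool:
        return self.final_decision != self.recommendation


def override_summary(events: list[ReviewEvent]) -> tuple[float, Counter]:
    """Return the override rate and a tally of the reasons reviewers gave."""
    overrides = [e for e in events if e.overridden]
    rate = len(overrides) / len(events) if events else 0.0
    reasons = Counter(e.override_reason or "unspecified" for e in overrides)
    return rate, reasons


# Made-up sample data, for illustration only.
events = [
    ReviewEvent("case-1", "approve", "approve"),
    ReviewEvent("case-2", "approve", "escalate", "missing lab result"),
    ReviewEvent("case-3", "outreach", "no outreach", "patient already contacted"),
    ReviewEvent("case-4", "approve", "escalate", "missing lab result"),
]
rate, reasons = override_summary(events)
print(f"override rate: {rate:.0%}")
print(reasons.most_common(1))
```

Even a tally this simple shows which heuristics the prototype is missing. A reason that keeps recurring is a candidate for a new input or a new rule.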

The goal isn't to remove the human from the loop. It's to make the human's time in the loop worth something—informed, efficient, accountable.

Black Flag Design builds applied-AI products. If this is the problem you're staring at, spend two days with us—we call it a Foundation Sprint.

About the author

Eli Wood

CEO, Black Flag Design

Eli Wood leads Black Flag Design, a creative technology company focused on shipping ambitious digital products, AI systems, and design-forward software with a direct point of view on how technology changes work.
