Scoring people fairly: the explainability burden of rating humans with AI

Composite scores that rank people — blending grades, social signals, and demonstrated skill — are some of the highest-stakes models you can ship. The hard part isn't the math; it's earning the right to be trusted with a number that changes someone's life.

Keith Pattison


June 24, 2026 4 min read

Plenty of teams now build systems that boil a person down to a single number: a composite score that fuses academic performance, social signals, and demonstrated skill into one ranking that decides who gets seen, recruited, funded, or passed over. The pitch is always the same: there's too much data and too few human hours to weigh it fairly, so let the model do it.

The pitch is right about the problem and dangerously casual about the stakes.

The problem: a number that decides a life is not a normal feature

Most product metrics are forgiving. A bad recommendation costs a click. A miscalibrated score on a human costs them an opportunity they may never know they lost — and it costs you the moment someone asks why. Because someone always asks why.

The person being rated wants to know why they ranked where they did. The people acting on the score want to defend the decision. A regulator, a journalist, or a lawyer wants to know whether the number encodes a bias you never intended. "The model said so" is not an answer to any of them. The instant a composite score touches a real decision about a real person, explainability stops being a nice-to-have and becomes the product itself.

And composite scores are uniquely good at hiding bias. When you blend GPA, social reach, and skill assessments into one figure, you also blend in every correlation those inputs carry — access to better schools, the demographics of who builds a following, who had the resources to get coached. The single number launders all of it into something that looks objective. The aggregation is exactly where the bias goes to hide.
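To make the laundering concrete, here is a toy sketch in Python. Every record, group label, and weight below is made up for illustration; none of it is real data. It compares the gap between two groups on each input with the gap on the blended score, to show how a disparity carried by one input survives aggregation while its source disappears from view.

```python
from statistics import mean

# Made-up toy records, NOT real data. All inputs are on a 0..1 scale.
RECORDS = [
    {"group": "A", "grades": 0.80, "social": 0.90, "skill": 0.70},
    {"group": "A", "grades": 0.70, "social": 0.80, "skill": 0.75},
    {"group": "B", "grades": 0.80, "social": 0.30, "skill": 0.70},
    {"group": "B", "grades": 0.70, "social": 0.20, "skill": 0.75},
]

# Hypothetical weighting policy, assumed only for this example.
WEIGHTS = {"grades": 0.4, "social": 0.3, "skill": 0.3}


def composite(record):
    """Blend the inputs into one number using the assumed weights."""
    return sum(weight * record[name] for name, weight in WEIGHTS.items())


def group_gap(records, value_of):
    """Mean of group A minus mean of group B for whatever value_of extracts."""
    a = [value_of(r) for r in records if r["group"] == "A"]
    b = [value_of(r) for r in records if r["group"] == "B"]
    return mean(a) - mean(b)


def gaps(records):
    """Return (gap per input, gap on the composite)."""
    per_input = {
        name: group_gap(records, lambda r, n=name: r[n]) for name in WEIGHTS
    }
    return per_input, group_gap(records, composite)


if __name__ == "__main__":
    per_input, composite_gap = gaps(RECORDS)
    for name, gap in per_input.items():
        print(f"{name:>7} gap: {gap:+.2f}")
    print(f"composite gap: {composite_gap:+.2f}")
```

In this toy data the two groups are identical on grades and skill, so the whole composite gap comes from the social input. Someone who sees only the composite has no way to tell.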

The insight: separate the rules engine from the judgment engine

The most useful move we know is architectural: stop building one model that does everything, and split the system in two.

The rules engine holds the parts that should be deterministic and inspectable — eligibility, hard thresholds, weighting policy, the explicit "a score below X never auto-advances" guardrails. This is policy, not prediction. It should be legible to a non-engineer, versioned, and changeable without retraining anything.
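As a rough sketch of what "policy, not prediction" could look like in Python: the policy is plain data with a version label, and applying it returns a readable trace of every rule that fired. The class name, field names, weights, and thresholds are all hypothetical, picked only to illustrate the shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringPolicy:
    """Deterministic policy as plain data. Every value here is hypothetical."""
    version: str
    weights: dict               # weighting policy per input
    auto_advance_floor: float   # a score below this never auto-advances
    min_work_samples: int       # eligibility rule


POLICY = ScoringPolicy(
    version="v1-example",
    weights={"grades": 0.3, "social": 0.1, "skill": 0.6},
    auto_advance_floor=0.70,
    min_work_samples=2,
)


def apply_policy(policy, inputs, work_samples):
    """Apply the rules in a fixed order and record each step in plain language."""
    trace = [f"policy {policy.version}"]
    if work_samples < policy.min_work_samples:
        trace.append(
            f"ineligible: {work_samples} work samples, "
            f"policy requires {policy.min_work_samples}"
        )
        return {"eligible": False, "score": None,
                "may_auto_advance": False, "trace": trace}

    score = sum(w * inputs[name] for name, w in policy.weights.items())
    trace.append(f"weighted score {score:.2f}")

    # Clearing the floor only means the guardrail does not block the score;
    # whether a person must still sign off is a separate decision.
    may_auto_advance = score >= policy.auto_advance_floor
    trace.append(
        f"floor {policy.auto_advance_floor:.2f}: "
        + ("cleared" if may_auto_advance else "blocked from auto-advance")
    )
    return {"eligible": True, "score": score,
            "may_auto_advance": may_auto_advance, "trace": trace}


if __name__ == "__main__":
    result = apply_policy(POLICY, {"grades": 0.8, "social": 0.5, "skill": 0.9}, 3)
    print("\n".join(result["trace"]))
```

Because the policy is a value and not trained parameters, changing a threshold means publishing a new version of the data. Nothing gets retrained.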

The judgment engine holds the genuinely probabilistic work — inferring skill from messy evidence, normalizing signals that don't share a scale, surfacing patterns a human would miss. This is where the model earns its keep, and where it's allowed to be uncertain.
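A production judgment engine would likely be a trained model. As a stand-in, the Python sketch below shows two of the ideas in miniature, using techniques we are assuming for illustration: z-scores to put signals on a shared scale, and a mean with a standard error so the estimate carries its uncertainty with it. The function and class names are hypothetical.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    value: float    # point estimate
    stderr: float   # standard error of that estimate


def z_normalize(raw, population):
    """Express a raw signal in standard deviations from the population mean."""
    mu = statistics.mean(population)
    sigma = statistics.pstdev(population)
    if sigma == 0:
        return 0.0  # no spread in the population, so no information in the signal
    return (raw - mu) / sigma


def estimate_skill(sample_scores):
    """Summarize noisy assessment scores as a mean plus its standard error."""
    if len(sample_scores) < 2:
        raise ValueError("need at least two samples to estimate uncertainty")
    value = statistics.mean(sample_scores)
    stderr = statistics.stdev(sample_scores) / len(sample_scores) ** 0.5
    return Estimate(value=value, stderr=stderr)


if __name__ == "__main__":
    # Placeholder numbers: a GPA against a small reference population.
    print(round(z_normalize(3.5, [2.5, 3.0, 3.5]), 3))
    # Placeholder numbers: two assessed work samples on a 0..1 scale.
    print(estimate_skill([0.6, 0.8]))
```

Returning an estimate with an error term instead of a bare number is the design choice that matters here. It lets the rest of the system treat "0.72, fairly sure" and "0.72, barely any evidence" differently.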

Keeping them apart buys you three things. You can explain a score by walking someone through the rules without exposing them to a black box. You can audit each engine for bias independently instead of arguing about the whole tangle at once. And you can keep a human in the loop precisely where being wrong is costly — let the judgment engine propose, but make a person own the call whenever the score crosses a line that changes someone's path.

That last point is the discipline that separates a defensible system from a liability. Full automation is fine where mistakes are cheap and reversible. Rating a human is neither. So the goal isn't to remove the human; it's to spend their attention well — route the routine, repetitive, low-stakes scoring to the machine, and reserve human judgment for the expensive, contested edges.
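One way to sketch that routing in Python, under assumptions of our own: the judgment engine hands over a score with a stated margin of uncertainty, and a case goes to a person whenever the downstream action is hard to reverse or the score sits close enough to the cutoff that the margin could flip it. The names `Proposal`, `route`, `cutoff`, and `reversible` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    score: float    # the judgment engine's point estimate, 0..1
    margin: float   # its stated uncertainty (half-width of an interval)


def route(proposal, cutoff, reversible):
    """Return "auto" for routine cases and "human_review" for costly or contested ones."""
    # Contested: the uncertainty band around the score reaches the cutoff.
    contested = abs(proposal.score - cutoff) <= proposal.margin
    if not reversible or contested:
        return "human_review"
    return "auto"


if __name__ == "__main__":
    print(route(Proposal(score=0.90, margin=0.05), cutoff=0.70, reversible=True))
    print(route(Proposal(score=0.72, margin=0.05), cutoff=0.70, reversible=True))
    print(route(Proposal(score=0.90, margin=0.05), cutoff=0.70, reversible=False))
```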

The path: a two-day starting point

You don't need a six-month overhaul to find out whether this works. You need two days and an honest look at one decision.

Day one — map the decision and the inputs. Pick the single highest-stakes decision your score drives. Write down every input feeding the composite and, beside each, the proxy it actually encodes — what does "social signal" really measure, and who does that systematically favor? Then write the explanation a real person would demand if the score went against them. If you can't draft that explanation today, that's your finding: the system isn't explainable yet, and you've found the gap before a regulator does.
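If the composite is a weighted sum, a first draft of that explanation can be generated from the day-one notes. The Python sketch below assumes hypothetical weights and takes its proxy notes from the examples earlier in this piece. It lists each input's contribution, largest first, next to the thing that input may also be measuring.

```python
# Hypothetical weighting policy, assumed only for this example.
WEIGHTS = {"grades": 0.3, "social": 0.1, "skill": 0.6}

# Day-one notes: what each input may really encode (illustrative wording).
PROXIES = {
    "grades": "access to better schools",
    "social": "the demographics of who builds a following",
    "skill": "who had the resources to get coached",
}


def explain(inputs):
    """Return a plain-language breakdown of a weighted-sum score, largest part first."""
    ranked = sorted(WEIGHTS, key=lambda name: WEIGHTS[name] * inputs[name], reverse=True)
    lines = []
    total = 0.0
    for name in ranked:
        contribution = WEIGHTS[name] * inputs[name]
        total += contribution
        lines.append(
            f"{name}: value {inputs[name]:.2f} x weight {WEIGHTS[name]:.2f} "
            f"= {contribution:.2f} (may also reflect: {PROXIES[name]})"
        )
    lines.append(f"composite: {total:.2f}")
    return lines


if __name__ == "__main__":
    for line in explain({"grades": 0.8, "social": 0.5, "skill": 0.9}):
        print(line)
```

If a line in that output would be embarrassing to read aloud to the person being scored, that input belongs on the day-one findings list.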

Day two — draw the two-engine line and set the human gate. Take the same decision and sort each rule into "deterministic policy" or "probabilistic judgment." Make the policy side legible and the judgment side bounded. Then define one threshold above which no score auto-acts without a person signing off, and pull a sample of recent scores to see how often that gate would have fired. You'll know immediately whether your automation is saving judgment for where it matters or quietly making life-altering calls on its own.
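The gate check at the end of day two is a few lines of Python. The scores and the threshold below are placeholders, not real results; swap in your own sample. The function name is hypothetical.

```python
def gate_fire_rate(scores, threshold):
    """Fraction of scores at or above the threshold, i.e. cases needing sign-off."""
    if not scores:
        raise ValueError("need at least one score")
    fired = sum(1 for score in scores if score >= threshold)
    return fired / len(scores)


if __name__ == "__main__":
    # Placeholder sample of recent scores, made up for illustration.
    recent_scores = [0.42, 0.55, 0.61, 0.78, 0.83, 0.91, 0.66, 0.70]
    rate = gate_fire_rate(recent_scores, threshold=0.75)
    print(f"gate would have fired on {rate:.1%} of sampled scores")
```

A rate near zero suggests the gate is decorative. A rate near one suggests the threshold is not separating routine cases from consequential ones, and your reviewers will drown.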

Two days won't give you a finished platform. It will give you something more valuable: a clear-eyed read on where your scoring system earns trust and where it's borrowing it. Start where the judgment is expensive and repetitive, make the reasoning legible, and keep a human on the costly edges. That's how a score about a person earns the right to be believed.

Black Flag Design builds applied-AI products where being wrong is costly. If this is your world, spend two days with us — we call it a Foundation Sprint.

About the author

Keith Pattison

Founder, Black Flag Design

Keith leads Black Flag Design, a studio that ships production-ready software with AI-assisted development. He writes about the disciplines — small scope, weekly evidence, and human oversight — that keep AI-built systems reliable in the real world.
