Scoring people fairly: the explainability burden of AI ratings

Plenty of teams now build systems that boil a person down to a single number: a composite score that fuses academic performance, social signals, and demonstrated skill into one ranking that decides who gets seen, recruited, funded, or passed over. The pitch is always the same: there's too much data and too few human hours to weigh it fairly, so let the model do it.

The pitch is right about the problem and dangerously casual about the stakes.

The problem: a number that decides a life is not a normal feature

Most product metrics are forgiving. A bad recommendation costs a click. A miscalibrated score on a human costs them an opportunity they may never know they lost — and it costs you the moment someone asks why. Because someone always asks why.

The person being rated wants to know why they ranked where they did. The people acting on the score want to defend the decision. A regulator, a journalist, or a lawyer wants to know whether the number encodes a bias you never intended. "The model said so" is not an answer to any of them. The instant a composite score touches a real decision about a real person, explainability stops being a nice-to-have and becomes the product itself.

And composite scores are uniquely good at hiding bias. When you blend GPA, social reach, and skill assessments into one figure, you also blend in every correlation those inputs carry — access to better schools, the demographics of who builds a following, who had the resources to get coached. The single number launders all of it into something that looks objective. The aggregation is exactly where the bias goes to hide.
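
To make the hiding concrete, here is a minimal sketch in Python. The weights, input names, and both candidates are invented for illustration and are not taken from any real scoring system. It shows two people who land on the same composite from very different inputs, along with the per-input breakdown that the single number throws away.

```python
# Hypothetical weights and inputs, chosen only to illustrate the point.
WEIGHTS = {"gpa": 0.4, "social_reach": 0.3, "skill": 0.3}


def composite(inputs: dict[str, float]) -> float:
    """Weighted sum of inputs that are assumed to be normalized to 0..1."""
    return round(sum(WEIGHTS[k] * inputs[k] for k in WEIGHTS), 4)


def contributions(inputs: dict[str, float]) -> dict[str, float]:
    """Per-input share of the score: the detail the single number discards."""
    return {k: round(WEIGHTS[k] * inputs[k], 4) for k in WEIGHTS}


# Two invented candidates with very different profiles.
candidate_a = {"gpa": 0.9, "social_reach": 0.2, "skill": 0.6}
candidate_b = {"gpa": 0.6, "social_reach": 0.8, "skill": 0.4}

for name, person in (("A", candidate_a), ("B", candidate_b)):
    print(name, composite(person), contributions(person))
```

Both candidates score 0.6, but candidate B gets four times as much of that score from social reach as candidate A does. Anyone looking only at the composite cannot see that difference, which is why the breakdown needs to travel with the number.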

The insight: separate the rules engine from the judgment engine

The most useful move we know is architectural: stop building one model that does everything, and split the system in two.

The rules engine holds the parts that should be deterministic and inspectable — eligibility, hard thresholds, weighting policy, the explicit "a score below X never auto-advances" guardrails. This is policy, not prediction. It should be legible to a non-engineer, versioned, and changeable without retraining anything.

The judgment engine holds the genuinely probabilistic work — inferring skill from messy evidence, normalizing signals that don't share a scale, surfacing patterns a human would miss. This is where the model earns its keep, and where it's allowed to be uncertain.

Keeping them apart buys you three things. You can explain a score by walking someone through the rules without exposing them to a black box. You can audit each engine for bias independently instead of arguing about the whole tangle at once. And you can keep a human in the loop precisely where being wrong is costly — let the judgment engine propose, but make a person own the call whenever the score crosses a line that changes someone's path.

That last point is the discipline that separates a defensible system from a liability. Full automation is fine where mistakes are cheap and reversible. Rating a human is neither. So the goal isn't to remove the human; it's to spend their attention well — route the routine, repetitive, low-stakes scoring to the machine, and reserve human judgment for the expensive, contested edges.
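
The split described in this section can be sketched in a few lines of Python. This is one possible shape under stated assumptions, not a reference implementation: the `Policy` fields, the threshold values, and the `judge` stand-in (a simple mean and spread in place of a real model) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Rules engine: plain, versioned policy data. All fields are hypothetical."""
    version: str
    min_evidence_items: int    # eligibility rule
    auto_advance_floor: float  # a score below this never auto-advances
    max_uncertainty: float     # anything less certain goes to a person


@dataclass(frozen=True)
class Judgment:
    score: float        # estimate in 0..1
    uncertainty: float  # the estimate's own spread in 0..1


def judge(evidence: list[float]) -> Judgment:
    """Stand-in for the judgment engine; a real one would be a trained model."""
    mean = sum(evidence) / len(evidence)
    spread = max(evidence) - min(evidence)
    return Judgment(score=mean, uncertainty=spread)


def decide(policy: Policy, evidence: list[float]) -> tuple[str, str]:
    """Return (outcome, reason). Each reason names a policy rule."""
    if len(evidence) < policy.min_evidence_items:
        return "ineligible", f"fewer than {policy.min_evidence_items} evidence items"
    judgment = judge(evidence)
    if judgment.uncertainty > policy.max_uncertainty:
        return "human_review", "model uncertainty above policy limit"
    if judgment.score < policy.auto_advance_floor:
        return "human_review", "score below auto-advance floor"
    return "auto_advance", "score and uncertainty within policy"


policy = Policy(version="v1", min_evidence_items=3,
                auto_advance_floor=0.7, max_uncertainty=0.3)
print(decide(policy, [0.8, 0.85, 0.9]))  # consistent, strong evidence
print(decide(policy, [0.2, 0.9, 0.95]))  # evidence disagrees with itself
```

The design choice worth noticing is that `decide` never explains itself in terms of the model. Every outcome points to a named rule in `Policy`, so the policy can be changed and re-versioned without retraining anything, and the model's only job is to produce a score and admit how unsure it is.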

The path: a two-day starting point

You don't need a six-month overhaul to find out whether this works. You need two days and an honest look at one decision.

Day one — map the decision and the inputs. Pick the single highest-stakes decision your score drives. Write down every input feeding the composite and, beside each, the proxy it actually encodes — what does "social signal" really measure, and who does that systematically favor? Then write the explanation a real person would demand if the score went against them. If you can't draft that explanation today, that's your finding: the system isn't explainable yet, and you've found the gap before a regulator does.
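
If a spreadsheet feels too loose for the day-one inventory, a small data structure can hold it. The sketch below is an assumption about how you might record it: the `InputAudit` class, its field names, and the sample entries are hypothetical, with the proxies borrowed from the examples earlier in this piece. It lists every input that still has no plain-language explanation.

```python
from dataclasses import dataclass


@dataclass
class InputAudit:
    """One row of the day-one inventory. Field names are hypothetical."""
    name: str
    proxy_for: str         # what the input actually measures
    may_favor: str         # who it could systematically advantage
    explanation: str = ""  # the plain-language reason you would give the person


def unexplained(inputs: list[InputAudit]) -> list[str]:
    """Names of inputs that still have no explanation drafted."""
    return [i.name for i in inputs if not i.explanation.strip()]


inventory = [
    InputAudit("gpa", "academic performance",
               "people with access to better schools",
               "Your grades count toward the academic part of the score."),
    InputAudit("social_reach", "size of following",
               "demographics that build a following more easily"),
    InputAudit("skill_assessment", "demonstrated skill",
               "people with the resources to get coached"),
]

print(unexplained(inventory))
```

Anything this prints is an input you cannot yet defend to the person it scores.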

Day two — draw the two-engine line and set the human gate. Take the same decision and sort each rule into "deterministic policy" or "probabilistic judgment." Make the policy side legible and the judgment side bounded. Then define one threshold above which no score auto-acts without a person signing off, and pull a sample of recent scores to see how often that gate would have fired. You'll know immediately whether your automation is saving judgment for where it matters or quietly making life-altering calls on its own.
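
The last step of day two, checking how often the gate would have fired, is a few lines of arithmetic. Here is a minimal sketch, assuming scores are normalized to 0..1 and that the gate fires at or above the threshold. The sample scores and the 0.75 threshold are made up for illustration.

```python
def gate_fire_rate(scores: list[float], threshold: float) -> float:
    """Fraction of scores at or above the threshold, i.e. sent to a person."""
    if not scores:
        raise ValueError("need at least one score")
    fired = sum(1 for s in scores if s >= threshold)
    return fired / len(scores)


# Invented sample standing in for a pull of recent scores.
recent_scores = [0.35, 0.62, 0.71, 0.80, 0.44, 0.93, 0.58, 0.77, 0.66, 0.89]

rate = gate_fire_rate(recent_scores, threshold=0.75)
print(f"{rate:.0%} of sampled scores would have needed sign-off")
# prints "40% of sampled scores would have needed sign-off"
```

A rate near zero suggests the gate is decorative; a rate near one suggests the automation is not saving anyone's attention. Either result tells you the threshold needs another look.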

Two days won't give you a finished platform. It will give you something more valuable: a clear-eyed read on where your scoring system earns trust and where it's borrowing it. Start where the judgment is expensive and repetitive, make the reasoning legible, and keep a human on the costly edges. That's how a score about a person earns the right to be believed.

Black Flag Design builds applied-AI products where being wrong is costly. If this is your world, spend two days with us — we call it a Foundation Sprint.

About the author

Keith Pattison

Founder, Black Flag Design

Keith leads Black Flag Design, a studio that ships production-ready software with AI-assisted development. He writes about the disciplines — small scope, weekly evidence, and human oversight — that keep AI-built systems reliable in the real world.

