The problem
Health systems are all sitting on a version of the same gap: they know that food security, housing stability, and transportation access drive health outcomes more than clinical care does. They have community health workers, patient navigators, and case managers whose entire job is bridging that gap. And they have a backlog that never clears.
The obvious move is automation. Screen patients, detect needs, route them to the right program. Some teams have tried to bolt a rules engine onto their EHR workflow and call it AI. Others have handed the whole thing to a large language model and hoped for the best.
Neither works. The rules engine breaks the moment the resource landscape changes — and it always changes. The unsupervised LLM hallucinates availability, misreads eligibility, and occasionally routes a vulnerable patient somewhere that can't serve them. When the stakes are a family losing housing or a diabetic patient going hungry, those failures aren't acceptable.
Why it's stuck
The teams building these tools are conflating two very different problems.
The first is a matching problem: given what we know about a patient's situation, what resources in the community are actually relevant? This is tractable with AI. The signal is structured. The features are learnable. The feedback loop — did the patient get connected? — is measurable.
The second is a judgment problem: of the relevant resources, which one do we actually act on, given everything we know about this specific person's circumstances, preferences, and vulnerabilities? That judgment belongs to a human. Not because AI can't pattern-match at scale, but because being wrong here has asymmetric costs that a model cannot fully price.
When teams try to collapse both problems into one system, they either over-automate (the LLM makes calls it shouldn't) or under-automate (the navigator still screens everything by hand because they don't trust the output). Either way, the backlog stays.
The path
The architecture that actually moves the needle separates these two problems cleanly, and it can be proven in two days.
Separate the matching engine from the judgment call. Let the model handle matching: surface the right resources, rank them by fit, flag eligibility gaps. Keep the human in the loop for the final decision, especially for vulnerable or complex cases. The model's job is to make the navigator faster and better-informed, not to replace their judgment.
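As a rough illustration of that split, the sketch below ranks resources and flags eligibility gaps but never picks one; the navigator still decides. Everything in it is an assumption made for the example: the `Resource`, `Patient`, and `Match` types, the eligibility tags, and the fit score (proximity, halved when an eligibility gap exists) are hypothetical stand-ins for whatever a real matching model would learn.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    category: str                      # e.g. "food", "housing", "transport"
    max_distance_km: float             # hypothetical service radius, must be > 0
    requires: frozenset = frozenset()  # hypothetical eligibility tags


@dataclass
class Patient:
    needs: set          # categories of unmet need found at screening
    attributes: set     # eligibility tags the patient is known to satisfy
    distance_km: dict   # resource name -> distance from the patient


@dataclass
class Match:
    resource: Resource
    fit: float               # 0.0 to 1.0, higher is better
    eligibility_gaps: list   # requirements not yet confirmed for this patient


def match_resources(patient, resources):
    """Return candidate resources ranked by fit. Does not choose one."""
    matches = []
    for r in resources:
        if r.category not in patient.needs:
            continue  # irrelevant to this patient's needs
        gaps = sorted(r.requires - patient.attributes)
        # Unknown distance is treated as the edge of the service radius.
        distance = patient.distance_km.get(r.name, r.max_distance_km)
        proximity = max(0.0, 1.0 - distance / r.max_distance_km)
        # Illustrative scoring only: penalize, but still surface, gap cases.
        fit = proximity * (0.5 if gaps else 1.0)
        matches.append(Match(r, round(fit, 2), gaps))
    return sorted(matches, key=lambda m: m.fit, reverse=True)
```

The output is a ranked list with the gaps attached, which is what the navigator reviews. Choosing among the candidates stays outside the function on purpose.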
Start where navigator time is expensive and the work is repetitive. Every navigator team has a class of cases that are genuinely straightforward: stable patients with a single, clearly matched need. Automate those with high confidence and a lightweight human-review step. Reserve navigator attention for the cases that actually need it.
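One way to picture that triage is a small gate in front of the navigator queue. This is a minimal sketch under stated assumptions: the `Case` fields, the queue names, and the 0.85 threshold are all hypothetical, and a real cutoff would need to be tuned against actual navigator overrides.

```python
from dataclasses import dataclass

AUTO_THRESHOLD = 0.85  # hypothetical cutoff, not a recommended value


@dataclass
class Case:
    case_id: str
    needs: list                  # categories of need detected at screening
    top_match_confidence: float  # model confidence in its best match, 0 to 1
    vulnerability_flags: list    # e.g. ["eviction_notice"]; empty if none


def triage(case, threshold=AUTO_THRESHOLD):
    """Decide which queue a case goes to. A human sees every case either way."""
    if case.vulnerability_flags:
        return "navigator_review"   # vulnerable cases always get full attention
    if len(case.needs) != 1:
        return "navigator_review"   # multiple or zero needs is not "simple"
    if case.top_match_confidence < threshold:
        return "navigator_review"   # the model is not sure enough
    return "auto_route_with_spot_check"  # lightweight human review only
```

The order of the checks matters: a vulnerability flag sends the case to a navigator no matter how confident the model is.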
Earn trust with explainability. Navigators will not use a black box. The model's recommendation needs to show its work: why this resource, what's the eligibility fit, what's the waitlist status. When the navigator can see the reasoning, they can override it intelligently — and that override data becomes the best training signal you have.
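A sketch of what "showing its work" could look like as data: the recommendation carries its own reasoning, and the navigator's decision, including any override and its reason, is stored next to it. The record shapes and field names here are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Recommendation:
    case_id: str
    resource: str
    why: str               # plain-language reason this resource was surfaced
    eligibility_fit: str   # e.g. "meets income rule; residency unconfirmed"
    waitlist_status: str   # e.g. "open" or "3 weeks"


@dataclass
class Decision:
    recommendation: Recommendation
    accepted: bool
    chosen_resource: str
    override_reason: Optional[str] = None


def record_decision(rec, chosen_resource, override_reason=None):
    """Capture what the navigator actually did with a recommendation."""
    accepted = chosen_resource == rec.resource
    if not accepted and not override_reason:
        # An override without a reason is much less useful as training signal.
        raise ValueError("override requires a reason")
    return Decision(rec, accepted, chosen_resource,
                    None if accepted else override_reason)


def to_training_row(decision):
    """Flatten a decision into one row for later model evaluation."""
    row = asdict(decision.recommendation)
    row["accepted"] = decision.accepted
    row["chosen_resource"] = decision.chosen_resource
    row["override_reason"] = decision.override_reason
    return row
```

Requiring a reason on every override is a design choice in this sketch: it costs the navigator a few seconds, but it is what turns a disagreement into something the model can learn from.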
Build the feedback loop from day one. Did the patient connect? Did the resource actually serve their need? Without a structured feedback loop, the model drifts as the resource landscape changes and nobody notices. With one, each closed referral improves the next recommendation. The compounding is the whole point.
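The two questions above can be captured as one small outcome record per referral and rolled up per resource. This is a minimal sketch assuming a hypothetical `Outcome` record; how outcomes are actually collected (follow-up call, partner report, and so on) is left open.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Outcome:
    case_id: str
    resource: str
    connected: bool   # did the patient actually reach the resource?
    need_met: bool    # did the resource serve the need?


def summarize(outcomes):
    """Per-resource referral count, connect rate, and need-met rate."""
    counts = defaultdict(lambda: {"referrals": 0, "connected": 0, "need_met": 0})
    for o in outcomes:
        c = counts[o.resource]
        c["referrals"] += 1
        c["connected"] += int(o.connected)
        # A need only counts as met if the patient connected in the first place.
        c["need_met"] += int(o.connected and o.need_met)
    return {
        resource: {
            "referrals": c["referrals"],
            "connect_rate": c["connected"] / c["referrals"],
            "need_met_rate": c["need_met"] / c["referrals"],
        }
        for resource, c in counts.items()
    }
```

A falling connect rate for a single resource is the kind of early signal that the resource landscape has changed, before it shows up as a bad recommendation.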
The two-day start looks like this: map the current navigator workflow by hand, identify the one or two resource categories with the highest volume and lowest variance, and build a prototype that routes just those cases. Instrument the feedback loop before you write another line of model code. Ship it to one navigator. Watch what breaks.
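Picking the "highest volume, lowest variance" categories can start from a simple count over past referrals. The sketch below makes a loud assumption: it treats low variance as concentration, meaning the share of a category's referrals that went to its single most common resource. That proxy, the `(category, resource)` history format, and the volume-times-concentration score are all illustrative choices, not a standard method.

```python
from collections import Counter, defaultdict


def rank_categories(history, top_n=2):
    """history: iterable of (category, resource) pairs from past referrals.

    Returns up to top_n tuples of (category, volume, concentration), ordered
    by volume * concentration, where concentration is the share of referrals
    that went to the most common resource in that category.
    """
    by_category = defaultdict(Counter)
    for category, resource in history:
        by_category[category][resource] += 1

    scored = []
    for category, resource_counts in by_category.items():
        volume = sum(resource_counts.values())
        concentration = max(resource_counts.values()) / volume
        scored.append((category, volume, round(concentration, 2)))

    # Favor categories that are both busy and predictable.
    scored.sort(key=lambda s: s[1] * s[2], reverse=True)
    return scored[:top_n]
```

The ranking is only a starting point for the conversation with navigators. The hand-mapped workflow should confirm or overrule what the counts suggest.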
What you learn in those two days will be more valuable than six months of architecture meetings.
Black Flag Design builds applied-AI products. If this is the problem you're staring at, spend two days with us — we call it a Foundation Sprint.