When AI Tutors Think Out Loud With a Student

A model that guides a child's reasoning in the moment is doing the most consequential thing software can do in a classroom. The bar for that is not engagement. It's the bar you'd hold a student teacher to.

Eli Wood

June 24, 2026 · 4 min read

The moment the stakes change

There's a difference between software that delivers content and software that shapes how a child thinks. A worksheet generator is the first kind — if it produces a bad problem, a teacher catches it and moves on. A model that sits with a student in the moment, reads a half-formed answer, and decides what to say next is the second kind. It is reaching into the most consequential thing that happens in a classroom: the live formation of a young mind's reasoning.

The instinct in ed-tech is to measure that experience by engagement: minutes on task, problems attempted, a child who looks busy. But engagement is the wrong bar for a system operating live inside a student's thinking. A confident, wrong, well-paced explanation can be more engaging than a careful one, and it does far more damage. When the cost of being wrong is a misconception lodged in a ten-year-old, you cannot grade the tool on whether kids liked it.

Hold it to the bar you'd hold a student teacher to

The useful frame is to separate the rules engine from the judgment engine. The rules engine is everything procedural: which problem comes next, whether the arithmetic checks out, what the curriculum says about prerequisites. Software is excellent at this, and it should run unattended. The judgment engine is the live call — this child, who just said this thing, needs this response right now. That is where being wrong is costly and where a human has to stay in the loop.
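One way to picture that split in code is below. This is a minimal sketch, not a real product's design: the prerequisite map, the skill names, the `JudgmentCall` type, and the proposed response text are all hypothetical. The procedural checks return plain answers and run unattended; anything that requires a call about a particular child comes back as an object the teacher can see.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Union

# Hypothetical prerequisite map; a real one would come from the curriculum.
PREREQUISITES = {
    "add_like_fractions": [],
    "equivalent_fractions": [],
    "add_unlike_fractions": ["add_like_fractions", "equivalent_fractions"],
}


def sum_is_correct(a: Fraction, b: Fraction, answer: Fraction) -> bool:
    """Rules engine: does the arithmetic check out?"""
    return a + b == answer


def ready_for(skill: str, mastered: set) -> bool:
    """Rules engine: are all prerequisites for this skill mastered?"""
    return all(p in mastered for p in PREREQUISITES[skill])


@dataclass
class JudgmentCall:
    """A live call about one child; surfaced to the teacher by default."""
    student_said: str
    proposed_response: str
    visible_to_teacher: bool = True


def handle_answer(a: Fraction, b: Fraction, answer: Fraction,
                  student_said: str) -> Union[str, JudgmentCall]:
    """Return "advance" for a correct sum, otherwise a JudgmentCall."""
    if sum_is_correct(a, b, answer):
        return "advance"  # procedural: no human needed
    # Placeholder wording; the actual move belongs to the teacher.
    return JudgmentCall(
        student_said=student_said,
        proposed_response="Can you show me how you got the bottom number?",
    )
```

The point of the shape is that the two return types are different kinds of thing: a string the system can act on by itself, and a record a person is expected to read.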

Staying in the loop does not mean a teacher hovering over every exchange; that defeats the point. It means the classroom teacher sets the boundaries, sees what the model said and why, and can intervene when it matters. The model's job in the judgment moment is narrower than it looks: not to be the authority, but to keep a student productively stuck, surface their reasoning, and hand the teacher a clear picture of where each child actually is. The teacher remains the accountable adult in the room.

And the system has to be able to show its work. When a model nudges a student down a reasoning path, a teacher needs to see the path — what the student said, what the model inferred, why it responded the way it did. A real-time tutor you cannot inspect is not a teaching aid; it is an unaccountable adult talking to children. Explainability is how the tool earns a teacher's trust, and it is the only honest basis for the claim that the thing is safe. You start where the judgment is both expensive and repetitive — the same misconception, surfacing across thirty students, that a teacher cannot be everywhere to catch — and you make every move legible.
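A legible move might look something like the record below. It is a sketch under assumptions: the `TutorMove` fields mirror the three things named above (what the student said, what the model inferred, why it responded), while the field names, the misconception labels, and the `misconception_counts` helper are invented for illustration. The helper covers the thirty-students case by counting how many distinct students hit each inferred misconception.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TutorMove:
    student: str
    student_said: str
    model_inferred: str   # hypothetical misconception label
    model_response: str
    rationale: str

    def to_plain_language(self) -> str:
        """Render the move as one sentence-style line a teacher can read."""
        return (
            f'{self.student} said: "{self.student_said}". '
            f"The model inferred: {self.model_inferred}. "
            f'It responded: "{self.model_response}" '
            f"because {self.rationale}."
        )


def misconception_counts(moves) -> Counter:
    """Count distinct students per inferred misconception."""
    # A set of (student, label) pairs so repeat turns are not double-counted.
    seen = {(m.student, m.model_inferred) for m in moves}
    return Counter(label for _, label in seen)
```

Counting students rather than turns is a deliberate choice here: a teacher deciding whether to reteach something to the whole class cares about how many children are affected, not how many times one child repeated the error.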

A two-day starting point

The trap is to build a general-purpose tutor that talks to students about everything and is accountable for nothing. The fix is to go narrow. Pick one well-understood place where students reliably get stuck — a specific misconception in fractions, a predictable error in reading inference — where good teachers already know the right move and the cost of a wrong move is visible fast.

In two days you can stand up a single live interaction that does that one job: it engages a student in the moment, applies the known-good instructional move, logs every exchange in plain language, and routes anything outside its boundary straight to the teacher. Run it against real student responses and you learn the thing that matters most — exactly where the model's judgment is trustworthy and where it has to defer. That boundary, drawn on one narrow interaction, is the pattern for the next ten. Automate the procedural relentlessly; keep the teacher in the judgment; make every move something a teacher can see.
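To make the boundary concrete, here is a sketch of what one such narrow interaction could look like for a single fractions misconception: a student who adds numerators and denominators separately. Everything in it is an assumption made for illustration, including the `NarrowTutor` class, the action names, and the prompt text, which stands in for whatever move the teacher already trusts. It assumes the problem's denominators are nonzero.

```python
from dataclasses import dataclass, field
from fractions import Fraction

# Placeholder prompt; in practice the teacher supplies the move they trust.
KNOWN_GOOD_MOVE = (
    "Is your answer bigger or smaller than the first fraction? "
    "Does that make sense when you are adding?"
)


@dataclass
class Outcome:
    action: str   # "confirm", "prompt", or "defer_to_teacher"
    message: str


@dataclass
class NarrowTutor:
    log: list = field(default_factory=list)

    def respond(self, a: int, b: int, c: int, d: int,
                num: int, den: int) -> Outcome:
        """Handle a student's answer num/den to the problem a/b + c/d."""
        if den != 0 and Fraction(num, den) == Fraction(a, b) + Fraction(c, d):
            outcome = Outcome("confirm", "That is correct.")
        elif (num, den) == (a + c, b + d):
            # The one misconception this interaction is built for.
            outcome = Outcome("prompt", KNOWN_GOOD_MOVE)
        else:
            # Outside the boundary: do not improvise, hand off.
            outcome = Outcome("defer_to_teacher",
                              "Let's check this one with your teacher.")
        # Every exchange is logged in plain language, whatever the outcome.
        self.log.append(
            f"Problem {a}/{b} + {c}/{d}; student answered {num}/{den}; "
            f"tutor action: {outcome.action}."
        )
        return outcome
```

Running real student answers through something this small shows how often the third branch fires, which is a rough first measure of where the boundary sits.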

Black Flag Design builds applied-AI products for high-stakes, judgment-heavy work. If you're putting a model in front of students in real time and want to find the safety line before you scale, spend two days with us — we call it a Foundation Sprint.

About the author

Eli Wood

CEO, Black Flag Design

Eli Wood leads Black Flag Design, a creative technology company focused on shipping ambitious digital products, AI systems, and design-forward software with a direct point of view on how technology changes work.
