Red/Green TDD When Your Dep Is a Generative API

There is a recurring complaint when teams try to TDD code that calls a generative API. The objection sounds like this: "How am I supposed to write a red test for a function that calls Gemini? The response is non-deterministic. My assertions would either be so loose they check nothing, or so tight they would fail on the next model revision."

The complaint is real. The conclusion — that TDD does not apply here — is wrong.

TDD still works. You just have to be honest about where the testable code ends and the third party begins. That boundary is called the seam, and where you place it determines whether your red/green loop is worth running.

What TDD is actually buying you

Red/green/refactor, as a practice, is not about having tests. It is about the loop:

You write a test for a behavior you have not implemented. You watch it fail for the reason you expect.
You implement the minimum code that makes it pass. You watch it pass.
You refactor without changing the test.

That loop is fast, cheap, and repeatable. It is fast and cheap because the test does not talk to anything it does not have to. The moment your red test calls a paid API, the loop becomes slow, flaky, and expensive. You stop running it. Then you stop writing it. Then you are not doing TDD.

So the goal is to preserve the loop. The loop wants pure inputs and deterministic outputs. Gemini gives you neither. The trick is not to force Gemini into the test — it is to draw the line so the test does not need to see Gemini at all.

The seam

[Figure: YOUR CODE sits above a dashed SEAM line, with arrows branching to either a green MOCK block or a red REAL GEMINI block.]

The seam is an interface — the thinnest possible abstraction over the thing you do not control. For Gemini, it looks like this:

type GeminiClient = (args: {
  prompt: string;
  layout: ImageLayout;
  seed?: number;
}) => Promise<GeminiImageResult>;

One function. One input shape. One output shape. That is the whole contract.

Everything above the seam is your code: validation, auth checks, rate limits, telemetry, persistence, error mapping. Everything below the seam is Gemini's problem. Your tests exercise the code above the seam. Gemini is represented in the test by a function you hand in that returns exactly what you want it to.
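A minimal sketch of such a hand-in function follows. The shapes of ImageLayout and GeminiImageResult are assumptions (the article names them without defining them), and createMockGemini is a hypothetical helper name used only for illustration.

```typescript
// Assumed shapes: the article names these types but does not define them.
type ImageLayout = "square" | "landscape" | "portrait";
type GeminiImageResult = {
  bytes: Uint8Array;
  mimeType: string;
  width: number;
  height: number;
};

type GeminiArgs = { prompt: string; layout: ImageLayout; seed?: number };
type GeminiClient = (args: GeminiArgs) => Promise<GeminiImageResult>;

// Hypothetical helper: builds a client that resolves with a canned result
// (or rejects with a canned error) and records every call it receives.
function createMockGemini(outcome: GeminiImageResult | Error): {
  client: GeminiClient;
  calls: GeminiArgs[];
} {
  const calls: GeminiArgs[] = [];
  const client: GeminiClient = async (args) => {
    calls.push(args);
    if (outcome instanceof Error) throw outcome;
    return outcome;
  };
  return { client, calls };
}
```

Recording the calls matters as much as the canned result: several of the most useful assertions are about the client not being called at all.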

The load-bearing design choice is that your real implementation takes the Gemini client as a dependency, not as an import. The action becomes a thin wrapper that passes the real client in; the tests pass a mock in. The same pure helper runs in both cases.

export async function generateImageImpl(
  deps: GenerateImageDeps,  // gemini, storage, db, rateLimit, telemetry, identity
  args: GenerateImageArgs,
): Promise<MediaAsset> { ... }
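One way the thin wrapper might look is sketched below, under stated assumptions: the type shapes are guesses, the deps object is trimmed to the Gemini client alone, the helper body is a stand-in, and bindGenerateImage is a hypothetical name.

```typescript
// Assumed shapes; the article's real deps also carry storage, db, rateLimit,
// telemetry, and identity.
type ImageLayout = "square" | "landscape";
type GeminiImageResult = { bytes: Uint8Array; mimeType: string };
type GeminiClient = (args: {
  prompt: string;
  layout: ImageLayout;
  seed?: number;
}) => Promise<GeminiImageResult>;
type GenerateImageDeps = { gemini: GeminiClient };
type GenerateImageArgs = { prompt: string; layout: ImageLayout };
type MediaAsset = { mimeType: string; byteLength: number };

// Stand-in body: the point is only that every collaborator arrives via `deps`.
async function generateImageImpl(
  deps: GenerateImageDeps,
  args: GenerateImageArgs,
): Promise<MediaAsset> {
  const image = await deps.gemini({ prompt: args.prompt, layout: args.layout });
  return { mimeType: image.mimeType, byteLength: image.bytes.length };
}

// Hypothetical factory: binds the deps once so callers pass args only.
function bindGenerateImage(deps: GenerateImageDeps) {
  return (args: GenerateImageArgs): Promise<MediaAsset> =>
    generateImageImpl(deps, args);
}

// Test wiring: a mock goes in. Production wiring would hand the real adapter
// to the same factory instead.
const mockGemini: GeminiClient = async () => ({
  bytes: new Uint8Array(16),
  mimeType: "image/png",
});
const generateImageForTest = bindGenerateImage({ gemini: mockGemini });
```

Nothing in generateImageImpl knows which client it received, which is what lets the production action and the test share it.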

This pattern is older than LLMs. It is the ports-and-adapters architecture that Alistair Cockburn named and that Steve Freeman and Nat Pryce built on in Growing Object-Oriented Software, Guided by Tests. Generative APIs are a new reason to apply it, not a reason to abandon it.

Three kinds of tests, in order of cost

Once the seam exists, three classes of test fall out naturally. Write them in this order; each subsequent class runs less often.

1. Pure-helper tests (fast, free, run on every save)

These are the bulk of your coverage. They drive generateImageImpl directly with an in-memory mock client. You assert the orchestration:

On the happy path, the function returns a MediaAsset with the right shape.
An unknown layout throws InvalidLayoutError and the mock is never called.
A SafetyRefusedError from the mock propagates; no media row is written; no cost event fires.
A rate-limit denial short-circuits before the mock runs; the error carries retryAfterMs.
The cost-telemetry hook fires exactly once on success, with the model name and price.

These are the tests that run during the red/green loop. They finish in milliseconds. They never cost money. They are the reason you can refactor without anxiety.
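The sketch below shows how one of the cases listed above might be driven without a network or a test framework. Everything in it is an assumption made for illustration: the type shapes, the trimmed deps (saveMedia and recordCost stand in for persistence and telemetry), the orchestration order inside generateImageImpl, and the helper names makeFakeDeps and rejectionOf.

```typescript
// Assumed shapes; the article names these types without defining them.
type ImageLayout = "square" | "landscape";
type GeminiImageResult = { bytes: Uint8Array; mimeType: string };
type MediaAsset = { id: string; mimeType: string };
type RateLimitResult = { ok: true } | { ok: false; retryAfterMs: number };

class InvalidLayoutError extends Error {}
class RateLimitedError extends Error {
  readonly retryAfterMs: number;
  constructor(retryAfterMs: number) {
    super("rate limited");
    this.retryAfterMs = retryAfterMs;
  }
}

type GenerateImageDeps = {
  gemini: (args: { prompt: string; layout: ImageLayout }) => Promise<GeminiImageResult>;
  rateLimit: () => Promise<RateLimitResult>;
  saveMedia: (image: GeminiImageResult) => Promise<MediaAsset>;
  recordCost: (event: { model: string }) => void;
};

const KNOWN_LAYOUTS: readonly ImageLayout[] = ["square", "landscape"];
const isLayout = (value: string): value is ImageLayout =>
  (KNOWN_LAYOUTS as readonly string[]).includes(value);

// Illustrative orchestration: validate, rate-limit, call the seam, persist,
// then record cost.
async function generateImageImpl(
  deps: GenerateImageDeps,
  args: { prompt: string; layout: string },
): Promise<MediaAsset> {
  const { prompt, layout } = args;
  if (!isLayout(layout)) throw new InvalidLayoutError(`unknown layout: ${layout}`);
  const limit = await deps.rateLimit();
  if (!limit.ok) throw new RateLimitedError(limit.retryAfterMs);
  const image = await deps.gemini({ prompt, layout });
  const asset = await deps.saveMedia(image); // never reached if the client throws
  deps.recordCost({ model: "placeholder-model" });
  return asset;
}

// In-memory fakes plus counters, so a check can assert what did NOT happen.
function makeFakeDeps(overrides: Partial<GenerateImageDeps> = {}) {
  const counts = { gemini: 0, saved: 0, cost: 0 };
  const deps: GenerateImageDeps = {
    gemini: async () => {
      counts.gemini += 1;
      return { bytes: new Uint8Array([1]), mimeType: "image/png" };
    },
    rateLimit: async () => ({ ok: true }),
    saveMedia: async (image) => {
      counts.saved += 1;
      return { id: "asset-1", mimeType: image.mimeType };
    },
    recordCost: () => {
      counts.cost += 1;
    },
    ...overrides,
  };
  return { deps, counts };
}

// Resolves to the rejection reason so the caller can inspect it.
async function rejectionOf(promise: Promise<unknown>): Promise<unknown> {
  try {
    await promise;
  } catch (error) {
    return error;
  }
  throw new Error("expected the promise to reject");
}

// Example check: an unknown layout is rejected before the client is touched.
async function checkUnknownLayout(): Promise<void> {
  const { deps, counts } = makeFakeDeps();
  const error = await rejectionOf(
    generateImageImpl(deps, { prompt: "a cat", layout: "hexagon" }),
  );
  if (!(error instanceof InvalidLayoutError)) throw new Error("wrong error type");
  if (counts.gemini !== 0) throw new Error("client should not have been called");
}
```

In a Jest- or Vitest-style suite, the body of checkUnknownLayout would sit inside an it(...) block and use the runner's own assertions. The rate-limit and safety-refusal cases follow the same pattern by passing an override to makeFakeDeps.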

2. Adapter-contract tests (fast, free, run on every commit)

These tests cover the real Gemini client, the module that implements the seam. You assert only the things you fully control: given these inputs, what HTTP request do we emit? Given that response, what do we return or throw?

You stub fetch and verify:

The URL, method, and auth header on the outgoing request.
The request body contains the prompt and the responseModalities: ["IMAGE"] config.
A response with finishReason: "SAFETY" maps to a typed SafetyRefusedError.
A non-2xx HTTP response throws a clear error with the status code baked in.
The returned bytes and dimensions are parsed correctly from a known-good response.

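These tests do not call Gemini. They do not assert anything about what Gemini actually returns. They assert your adapter's side of the contract. Because the responses are stubbed, they cannot notice Gemini changing its response shape; the live test in class 3 catches that. When it does, you update the stubbed fixture to match, these tests go red, and they tell you exactly which line of your adapter to touch.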
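A contract test of this kind can be sketched by injecting the fetch function rather than patching a global. The adapter below is hypothetical: the endpoint parameter, the x-goog-api-key header, and the request and response shapes are assumptions to check against the Gemini API reference, and the return value is trimmed to a MIME type and base64 payload.

```typescript
// Minimal fetch-like types so the sketch needs no DOM or Node typings.
type FetchInit = { method: string; headers: Record<string, string>; body: string };
type FetchLike = (
  url: string,
  init: FetchInit,
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Assumed subset of the response body that this adapter reads.
type GeminiResponse = {
  candidates?: {
    finishReason?: string;
    content?: { parts?: { inlineData?: { mimeType: string; data: string } }[] };
  }[];
};

class SafetyRefusedError extends Error {}

// Hypothetical adapter: builds the request, then maps the response to a
// value or a typed error.
function createGeminiAdapter(fetchImpl: FetchLike, apiKey: string, endpoint: string) {
  return async (args: { prompt: string }): Promise<{ mimeType: string; base64: string }> => {
    const res = await fetchImpl(endpoint, {
      method: "POST",
      headers: { "content-type": "application/json", "x-goog-api-key": apiKey },
      body: JSON.stringify({
        contents: [{ parts: [{ text: args.prompt }] }],
        generationConfig: { responseModalities: ["IMAGE"] },
      }),
    });
    if (!res.ok) throw new Error(`Gemini request failed with HTTP ${res.status}`);
    const data = (await res.json()) as GeminiResponse;
    const candidate = data.candidates?.[0];
    if (candidate?.finishReason === "SAFETY") {
      throw new SafetyRefusedError("refused by safety filter");
    }
    const inline = candidate?.content?.parts?.find((p) => p.inlineData)?.inlineData;
    if (!inline) throw new Error("Gemini response contained no image data");
    return { mimeType: inline.mimeType, base64: inline.data };
  };
}

// Stub that records each outgoing request and replies with a canned response.
function stubFetch(status: number, payload: unknown) {
  const requests: { url: string; init: FetchInit }[] = [];
  const fetchImpl: FetchLike = async (url, init) => {
    requests.push({ url, init });
    return { ok: status >= 200 && status < 300, status, json: async () => payload };
  };
  return { fetchImpl, requests };
}
```

A test then builds the adapter with stubFetch(200, fixture), calls it, and asserts on requests[0] and the returned value. Swapping the fixture for one with finishReason set to "SAFETY", or the status for 503, exercises the error mapping.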

3. Live integration tests (slow, paid, gated)

At some point you do need to call the real API. Put that test behind an env flag:

it("R3.3 — live: calls real Gemini and returns a fetchable image", async () => {
  // Without the flag the body returns early; most runners then report a pass, not a skip.
  if (process.env.GEMINI_LIVE_TESTS !== "1") return;
  // ...
});

This test runs on demand — locally before a risky change, in CI gated to a specific job, or as part of a nightly smoke. It is the only test that proves Gemini still returns what you think it returns. It is also the only test that spends money. Keep the assertions coarse: bytes are non-empty, the MIME type is an image, dimensions are positive. Anything finer belongs in class 1 or 2.
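The coarse assertions can be kept in one small function, sketched here. The LiveImage shape and the name coarseImageProblems are assumptions for illustration; the three checks are the ones named above.

```typescript
// Assumed shape of what the live call returns.
type LiveImage = { bytes: Uint8Array; mimeType: string; width: number; height: number };

// Hypothetical helper: returns a list of problems, empty when the image
// passes every coarse check.
function coarseImageProblems(image: LiveImage): string[] {
  const problems: string[] = [];
  if (image.bytes.length === 0) problems.push("bytes are empty");
  if (!image.mimeType.startsWith("image/")) {
    problems.push(`unexpected MIME type: ${image.mimeType}`);
  }
  if (!(image.width > 0 && image.height > 0)) {
    problems.push("dimensions are not positive");
  }
  return problems;
}
```

In a Jest- or Vitest-style live test this would reduce to a single expectation that the returned list is empty, and a failure message would name which coarse property broke.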

What you do not try to test

The most important thing about this pattern is what it excludes. These assertions do not belong in any of your tests:

"The generated image is high quality."
"The caption is compelling."
"The safety filter refuses the exact set of prompts we think it should."

Those are evaluation questions, not unit tests. They belong in a separate pipeline (hand-graded review, LLM-as-judge on a fixture set, telemetry on production outputs) and they run on a different cadence. The unit-test suite's job is to verify your code, not the model. If you let the two mix, you end up with a suite that is both flaky and misleading about your code's correctness.

Why this is worth the discipline

On a feature we shipped last week — a Convex-backed CMS action that generates blog images via Gemini — the whole vertical slice was test-driven through seven stacked PRs. The pure-helper suite has eight tests and runs in under a hundred milliseconds. The adapter suite has six. The live test has run twice: once from my laptop, once on staging after deploy. The real Gemini call costs about four cents each time.

All the red/green loops, the ones that actually shape the code, happened against mocks. I never once waited on a network call while iterating. I never once spent a cent on a failed test. When a refactor broke the schema mapping, seven tests went red in about 80 milliseconds and told me exactly where the break was. That is the loop TDD is supposed to produce, and a generative API does nothing to interrupt it, as long as the seam is in the right place.

The short version

Define a one-function interface for the third party. That is the seam.
Write the orchestration as a pure helper that takes the seam as a dependency.
Unit-test the helper with an in-memory mock. That is where red/green lives.
Contract-test the real adapter against stubbed HTTP responses.
Live-test the real API behind an env flag, with coarse assertions only.
Do not test the model's output quality in your unit suite. That is an eval, not a test.

Generative APIs are not the end of TDD. They are just one more adapter you write once and mock forever after.

About the author

Keith Pattison

Founder, Black Flag Design

Keith leads Black Flag Design, a studio that ships production-ready software with AI-assisted development. He writes about the disciplines — small scope, weekly evidence, and human oversight — that keep AI-built systems reliable in the real world.

