Migrating Mission-Critical LLM Systems from Freeplay to Arize Phoenix

Freeplay shuts down May 15. If evals and traces are in your production loop, here's how to move them to Phoenix without losing the discipline you've built.

Eli Wood

April 20, 2026 4 min read

Freeplay is shutting down on May 15, 2026. If your team was using it as a serious part of the LLM observability stack — evals in CI, traces in production, review loops that close — you have a forced migration on your hands, and not a lot of runway.

This post is for the teams where "we'll just export the JSON" isn't enough. If your prompt PRs gate on eval scores, if Support pulls traces to diagnose bad replies, if a weekly review promotes production failures into eval cases — then Freeplay isn't a nice-to-have logging tool. It's wired into how you ship. Losing it without a replacement means shipping blind.

Arize Phoenix is the most defensible landing spot for most of these teams. Here's how to think about the move.

Why Phoenix is the pragmatic target

A few things line up:

  • OpenTelemetry-native. Phoenix speaks OTel spans out of the box. If your traces are already instrumented via OpenInference or a standard OTel SDK, you're mostly re-pointing the exporter.
  • Open source, with a hosted option. You can self-host Phoenix during migration and validation, then move to Arize's hosted tier (AX) once you're confident — or stay self-hosted if compliance demands it.
  • Evals and traces in one surface. Phoenix treats eval datasets, experiments, and live traces as first-class, related objects — the same mental model Freeplay taught your team.
  • Actively maintained and well-resourced. Arize is unlikely to be the next shutdown notice. For mission-critical systems, vendor durability is a feature.

The four things you actually have to migrate

Skip the "recreate everything" urge. For a hard-deadline migration, focus on the artifacts that have real operational value:

  1. Golden-set eval cases. Every graded case you've accumulated. These are the expensive asset — the curation, not the infra.
  2. Rubrics and LLM-judge prompts. The grading logic. These often live half in Freeplay and half in a notebook; get them all into version control.
  3. Production trace instrumentation. Every service that writes traces needs a new exporter target.
  4. The review loop. Whoever runs your weekly "promote bad traces to eval cases" ritual needs the new UI in muscle memory before May 15, not after.

Everything else — dashboards, saved views, alert routes — is re-creatable from scratch in a day.
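As a rough illustration of getting the first two artifacts into version control, here is a minimal sketch that writes exported cases to a stable, diff-friendly JSON file. The case fields (id, input, expected, rubric) and the file path are assumptions for illustration only; the actual shape of a Freeplay export will differ, so adapt the keys to whatever your export contains.

```python
import hashlib
import json
from pathlib import Path


def write_golden_set(cases, path):
    """Write eval cases to `path` as stable JSON and return a SHA-256 digest.

    Assumes each case is a dict with a unique "id" key (hypothetical shape).
    """
    # Sort by id and by key so that re-exports produce minimal diffs in review.
    ordered = sorted(cases, key=lambda case: case["id"])
    text = json.dumps(ordered, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def load_golden_set(path):
    """Read the committed file back into a list of case dicts."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Because the output is sorted, the same set of cases always yields the same file and the same digest, which makes an accidental change to the golden set visible in a pull request.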

A 3-week migration plan

You have about three and a half weeks from today (April 20) to the shutdown. The plan below takes three weeks and leaves the remaining days as slack. Here's a sequence that works:

Week 1 — Parallel write, read-only Phoenix

  • Stand up Phoenix (self-hosted via docker run is fine for week one).
  • Export every eval dataset and rubric from Freeplay. Commit them to your repo as JSON or YAML. This is the one export you cannot be lazy about.
  • Add Phoenix as a second OTel exporter alongside Freeplay. Both systems receive every trace. Production is unchanged.
  • Re-run your last 30 days of eval cases against Phoenix experiments to confirm scores land where you expect.
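The parallel-write step above might look something like the following sketch, which uses the OpenTelemetry Python SDK with one OTLP/HTTP exporter per backend. The environment variable names (LEGACY_OTLP_ENDPOINT, PHOENIX_OTLP_ENDPOINT) are hypothetical, the local Phoenix endpoint is an assumed default for a self-hosted instance, and authentication headers are omitted. It assumes your existing traces already leave the service over OTLP, and it requires the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http packages.

```python
import os

# Assumed default for a local self-hosted Phoenix; confirm against your deployment.
DEFAULT_PHOENIX_ENDPOINT = "http://localhost:6006/v1/traces"


def exporter_targets(env=None):
    """Return (name, endpoint) pairs for every backend that should receive spans."""
    env = os.environ if env is None else env
    targets = []
    # Hypothetical variable holding the existing exporter endpoint.
    # Unset it in week three to stop writing to the old backend.
    legacy = env.get("LEGACY_OTLP_ENDPOINT")
    if legacy:
        targets.append(("freeplay", legacy))
    targets.append(("phoenix", env.get("PHOENIX_OTLP_ENDPOINT", DEFAULT_PHOENIX_ENDPOINT)))
    return targets


def configure_tracing(service_name, env=None):
    """Install a tracer provider that sends every span to all configured targets."""
    # Imported lazily so this module still loads where the OTel SDK is not installed.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
    for _name, endpoint in exporter_targets(env):
        # One processor per backend, so each one batches and exports on its own.
        provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint)))
    trace.set_tracer_provider(provider)
    return provider
```

Calling configure_tracing("support-bot") once at service startup would be enough in this sketch; the service name is a placeholder. Keeping the target list in configuration rather than code means the week-three cutover is a config change, not a deploy of new instrumentation.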

Week 2 — Cut over the review loop

  • Move the weekly trace-review ritual to Phoenix. Keep Freeplay open in a second tab for one cycle as a safety net.
  • Rewire the "promote trace → eval case" flow. In Phoenix, that's a span → dataset row. If you had a button in Freeplay, script the equivalent — don't leave it manual.
  • Fix whatever drifts. Trace field names, metadata conventions, and span attributes will not match 1:1. Normalize now; you'll thank yourself later.
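To make the promotion and normalization steps above concrete, here is a minimal sketch that renames legacy attribute keys and turns a reviewed span into a candidate dataset row. The legacy keys (prompt, completion, customer_id) and the row layout are hypothetical; the input.value and output.value targets follow OpenInference naming. Uploading the row to Phoenix is deliberately left out, since that depends on the client version you run.

```python
# Hypothetical legacy keys on the left, normalized keys on the right.
ATTRIBUTE_ALIASES = {
    "prompt": "input.value",
    "completion": "output.value",
    "customer_id": "metadata.customer_id",
}


def normalize_attributes(attributes):
    """Rename known legacy attribute keys and leave unknown keys unchanged."""
    return {ATTRIBUTE_ALIASES.get(key, key): value for key, value in attributes.items()}


def span_to_dataset_row(span, reviewer_note=""):
    """Turn one reviewed span (a plain dict) into a candidate eval case."""
    attrs = normalize_attributes(span.get("attributes", {}))
    if "input.value" not in attrs:
        raise ValueError(f"span {span.get('span_id')} has no input to replay")
    return {
        "input": attrs["input.value"],
        # Kept for reference only: this is the reply that was flagged,
        # not the expected answer. A reviewer still has to write that.
        "observed_output": attrs.get("output.value", ""),
        "metadata": {
            "source_span_id": span.get("span_id"),
            "reviewer_note": reviewer_note,
        },
    }
```

Keeping the alias table in one place gives the team a single file to update each time a mismatched field name turns up during the first review cycle.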

Week 3 — Cut over CI, then turn Freeplay off

  • Point your CI eval gate at Phoenix experiments. Fail the build on regression exactly as before.
  • Verify the red/green signal on a few real PRs.
  • Remove the Freeplay exporter. Archive Freeplay exports to cold storage.
  • Before May 15, do one full dry-run: prompt PR → eval gate → deploy record → prod trace → weekly review → promoted eval case. If any link is broken, you still have a few days to fix it.
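The CI gate in week three can be sketched independently of any vendor: compare per-eval mean scores from the current run against a stored baseline and return a non-zero exit code on regression. The score layout (eval name mapped to a non-empty list of numeric scores) and the max_drop threshold are assumptions for illustration; how you pull scores out of Phoenix experiments is not shown.

```python
from statistics import mean


def gate(baseline, current, max_drop=0.02):
    """Compare mean scores per eval and return (passed, failures).

    Both arguments map an eval name to a non-empty list of numeric scores.
    """
    failures = []
    for name, base_scores in baseline.items():
        if name not in current:
            # A missing eval counts as a failure, so a gap cannot pass silently.
            failures.append(f"{name}: missing from current run")
            continue
        base, cur = mean(base_scores), mean(current[name])
        if base - cur > max_drop:
            failures.append(f"{name}: {base:.3f} -> {cur:.3f}")
    return (not failures, failures)


def exit_code(baseline, current, max_drop=0.02):
    """Print any regressions and return 0 on pass, 1 on fail, for use with sys.exit."""
    passed, failures = gate(baseline, current, max_drop)
    for line in failures:
        print("REGRESSION", line)
    return 0 if passed else 1
```

A CI step would load both score files and pass the result of exit_code to sys.exit. Running the same comparison on Freeplay-era scores versus Phoenix-era scores is also one way to check the week-one expectation that scores land where you expect.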

Don't lose the discipline in the move

The temptation during a forced migration is to cut corners on the layers that aren't "broken" yet. Resist.

  • Keep the CI gate on. If you have to run with a smaller golden set for two weeks while you port cases, fine — but do not merge an LLM change with no automated signal just because the tool moved. That's how regressions reach customers.
  • Keep traces on every prod call. A week of missing traces is a week you can't debug. Parallel-writing in week one is cheap insurance.
  • Keep the weekly review on the calendar. The review loop is what compounds your eval set over time. Skipping it "until Phoenix settles" is how the golden set quietly stops growing.

Evals and observability are not infrastructure you bolt on. They're the feedback loop that makes iteration on LLM systems tractable. A vendor change is a plumbing problem. The discipline has to survive it.

If you need help

We've been doing this work — evals in CI, OTel traces in prod, review loops that actually close — across several clients, and we've built migration playbooks for exactly this situation. If Freeplay was load-bearing for your team and May 15 feels too close, reach out: [email protected].

Don't ship blind after May 15.

About the author

Eli Wood

CEO, Black Flag Design

Eli Wood leads Black Flag Design, a creative technology company focused on shipping ambitious digital products, AI systems, and design-forward software with a direct point of view on how technology changes work.
