Migrating Mission-Critical LLM Systems from Freeplay to Arize Phoenix

Freeplay is shutting down on May 15, 2026. If your team was using it as a serious part of the LLM observability stack — evals in CI, traces in production, review loops that close — you have a forced migration on your hands, and not a lot of runway.

This post is for the teams where "we'll just export the JSON" isn't enough. If your prompt PRs gate on eval scores, if Support pulls traces to diagnose bad replies, if a weekly review promotes production failures into eval cases — then Freeplay isn't a nice-to-have logging tool. It's wired into how you ship. Losing it without a replacement means shipping blind.

Arize Phoenix is the most defensible landing spot for most of these teams. Here's how to think about the move.

Why Phoenix is the pragmatic target

A few things line up:

OpenTelemetry-native. Phoenix speaks OTel spans out of the box. If your traces are already instrumented via OpenInference or a standard OTel SDK, you're mostly re-pointing the exporter.
Open source, with a hosted option. You can self-host Phoenix during migration and validation, then move to Arize's hosted tier (AX) once you're confident — or stay self-hosted if compliance demands it.
Evals and traces in one surface. Phoenix treats eval datasets, experiments, and live traces as first-class, related objects — the same mental model Freeplay taught your team.
Active, well-resourced. Phoenix is under active development, and Arize shows no sign of being the next shutdown notice. For mission-critical systems, vendor durability is a feature.

The four things you actually have to migrate

Skip the "recreate everything" urge. For a hard-deadline migration, focus on the artifacts that have real operational value:

Golden-set eval cases. Every graded case you've accumulated. These are the expensive asset — the curation, not the infra.
Rubrics and LLM-judge prompts. The grading logic. These often live half in Freeplay and half in a notebook; get them all into version control.
Production trace instrumentation. Every service that writes traces needs a new exporter target.
The review loop. Whoever runs your weekly "promote bad traces to eval cases" ritual needs the new UI in muscle memory before May 15, not after.

Everything else — dashboards, saved views, alert routes — is re-creatable from scratch in a day.

A 3-week migration plan

You have 25 days from today (April 20) to the shutdown, about three and a half weeks. The plan below uses three of them and leaves the last few days as slack. Here's a sequence that works:

Week 1 — Parallel write, read-only Phoenix

Stand up Phoenix (self-hosted via docker run is fine for week one).
Export every eval dataset and rubric from Freeplay. Commit them to your repo as JSON or YAML. This is the one export you cannot be lazy about.
Add Phoenix as a second OTel exporter alongside Freeplay. Both systems receive every trace. Production is unchanged.
Re-run your last 30 days of eval cases against Phoenix experiments to confirm scores land where you expect.
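The export step in this list can be sketched as follows. This is a minimal sketch, not Freeplay's actual export format: it assumes each exported eval case is a dict with an `input` key plus whatever grading fields you keep (the `expected` and `rubric` names here are hypothetical), and the `case_id` and `write_golden_set` helpers are illustrative names. It writes one JSON file per case with sorted keys, so a re-export produces byte-identical files and diffs in code review stay small.

```python
import hashlib
import json
from pathlib import Path


def case_id(case: dict) -> str:
    """Derive a stable ID from the case input so re-exports map to the same file."""
    canonical = json.dumps(case["input"], sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def write_golden_set(cases: list[dict], out_dir: Path) -> list[Path]:
    """Write one JSON file per eval case and return the sorted paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = set()
    for case in cases:
        path = out_dir / f"{case_id(case)}.json"
        # sort_keys plus a fixed indent keeps the file identical across exports.
        text = json.dumps(case, sort_keys=True, indent=2, ensure_ascii=False)
        path.write_text(text + "\n", encoding="utf-8")
        written.add(path)
    return sorted(written)
```

Calling `write_golden_set(cases, Path("evals/golden"))` (the directory name is an assumption) gives you a folder you can commit as-is. Hashing the input rather than using an export-order index means the file names do not shuffle when the export order changes.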

Week 2 — Cut over the review loop

Move the weekly trace-review ritual to Phoenix. Keep Freeplay open in a second tab for one cycle as a safety net.
Rewire the "promote trace → eval case" flow. In Phoenix, that's a span → dataset row. If you had a button in Freeplay, script the equivalent — don't leave it manual.
Fix whatever drifts. Trace field names, metadata conventions, and span attributes will not match 1:1. Normalize now; you'll thank yourself later.
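The promote and normalize steps above can be scripted together. The sketch below makes several assumptions: a span is represented as a plain dict with `span_id` and `attributes` keys, the legacy names on the left of `ATTRIBUTE_RENAMES` are hypothetical stand-ins for whatever your Freeplay-era spans carry, and the input and output live under OpenInference-style `input.value` and `output.value` attributes. Check each of these against your own spans. It does not call Phoenix; it only builds the row you would then upload as a dataset example.

```python
# Hypothetical mapping from legacy attribute names to the names used going forward.
ATTRIBUTE_RENAMES = {
    "freeplay.prompt_name": "prompt.name",
    "freeplay.env": "deployment.environment",
    "customer": "metadata.customer_id",
}

IO_KEYS = ("input.value", "output.value")


def normalize_attributes(attrs: dict) -> dict:
    """Rename legacy keys; pass unknown keys through unchanged."""
    normalized = {}
    for key, value in attrs.items():
        new_key = ATTRIBUTE_RENAMES.get(key, key)
        # Refuse to guess when the old and new names disagree on the same span.
        if new_key in normalized and normalized[new_key] != value:
            raise ValueError(f"conflicting values for {new_key!r}")
        normalized[new_key] = value
    return normalized


def span_to_dataset_row(span: dict, reviewer_note: str) -> dict:
    """Turn a reviewed production span into an eval-case row."""
    attrs = normalize_attributes(span["attributes"])
    metadata = {k: v for k, v in attrs.items() if k not in IO_KEYS}
    metadata["source_span_id"] = span["span_id"]
    metadata["reviewer_note"] = reviewer_note
    return {
        "input": attrs["input.value"],
        "output": attrs["output.value"],
        "metadata": metadata,
    }
```

Keeping `source_span_id` on the row preserves the link from an eval case to the production failure that produced it, which is what makes the weekly review auditable later.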

Week 3 — Cut over CI, then turn Freeplay off

Point your CI eval gate at Phoenix experiments. Fail the build on regression exactly as before.
Verify the red/green signal on a few real PRs.
Remove the Freeplay exporter. Archive Freeplay exports to cold storage.
Before May 15, do one full dry-run: prompt PR → eval gate → deploy record → prod trace → weekly review → promoted eval case. If any link is broken, the slack you kept in reserve is there to fix it.
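The CI gate in this list reduces to a comparison between a committed baseline and the scores from the current run. Here is a minimal sketch of that comparison, assuming you have already pulled per-metric mean scores out of your experiment results into plain dicts; how you fetch them is not shown, and the function names and the 0.02 tolerance are illustrative choices, not Phoenix defaults.

```python
import sys


def find_regressions(baseline: dict, current: dict, tolerance: float = 0.02) -> list[str]:
    """Return readable failures for metrics that dropped or went missing."""
    failures = []
    for metric, base_score in sorted(baseline.items()):
        if metric not in current:
            # A metric that silently disappears should fail the gate, not pass it.
            failures.append(f"{metric}: missing from current run")
            continue
        drop = base_score - current[metric]
        if drop > tolerance:
            failures.append(f"{metric}: {base_score:.3f} -> {current[metric]:.3f}")
    return failures


def gate(baseline: dict, current: dict, tolerance: float = 0.02) -> int:
    """Print failures to stderr and return an exit code (0 = pass, 1 = fail)."""
    failures = find_regressions(baseline, current, tolerance)
    for line in failures:
        print(f"REGRESSION {line}", file=sys.stderr)
    return 1 if failures else 0
```

In a CI job you would end the script with `sys.exit(gate(baseline, current))` so the build goes red on any regression. Treating a missing metric as a failure matters most during the migration itself, when a renamed evaluator can otherwise make the gate pass by measuring nothing.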

Don't lose the discipline in the move

The temptation during a forced migration is to cut corners on the layers that aren't "broken" yet. Resist.

Keep the CI gate on. If you have to run with a smaller golden set for two weeks while you port cases, fine — but do not merge an LLM change with no automated signal just because the tool moved. That's how regressions reach customers.
Keep traces on every prod call. A week of missing traces is a week you can't debug. Parallel-writing in week one is cheap insurance.
Keep the weekly review on the calendar. The review loop is what compounds your eval set over time. Skipping it "until Phoenix settles" is how the golden set quietly stops growing.

Evals and observability are not infrastructure you bolt on. They're the feedback loop that makes iteration on LLM systems tractable. A vendor change is a plumbing problem. The discipline has to survive it.

If you need help

We've been doing this work — evals in CI, OTel traces in prod, review loops that actually close — across several clients, and we've built migration playbooks for exactly this situation. If Freeplay was load-bearing for your team and May 15 feels too close, reach out: [email protected].

Don't ship blind after May 15.

About the author

Eli Wood

CEO, Black Flag Design

Eli Wood leads Black Flag Design, a creative technology company focused on shipping ambitious digital products, AI systems, and design-forward software with a direct point of view on how technology changes work.
