FluxRun Case Study: Replaying Production Backend Failures Safely
FluxRun turns production-only backend failures into replayable evidence and regression guards. The project proves the full loop: identify a hard debugging gap, design a safe replay system, and ship product-quality tooling around it.
Production-only failures
The hardest backend bugs showed up only after a real request, real upstream response, real clock, and real customer state lined up.
Evidence without reproduction
Logs and traces could prove what happened, but they still left the developer rebuilding the same inputs and mocks by hand.
Fixes without durable guards
After a fix shipped, the exact failed path often disappeared back into incident notes instead of becoming a regression check.
Why this matters
Logs and traces explain. They do not replay.
Monitoring tools show timestamps, spans, and error messages. They rarely preserve enough executable state to re-run the exact branch. FluxRun focuses on the missing step after detection: replaying the failed backend path with recorded IO and no production writes.
AI-assisted backend teams
Next.js route and webhook owners
Teams debugging payments, queues, external APIs, and tool calls
01 Capture
Wrap the backend route and record request, logs, network calls, time, random values, host calls, result, and error.
02 Replay
Run the same execution path again with recorded IO so external systems do not fire during debugging.
03 Regression Guard
Keep the repaired branch as a repeatable proof that the production failure does not return.
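The three steps can be sketched in miniature for the two simplest recorded inputs, time and random values. This is an illustrative sketch only: `Sources`, `Recording`, `recordSources`, `replaySources`, and `issueCoupon` are hypothetical names, not part of the FluxRun SDK, and real capture also covers requests, network calls, and host calls.

```typescript
// Hypothetical names throughout; this is not the FluxRun SDK API.
interface Sources {
  now(): number;
  random(): number;
}

interface Recording {
  times: number[];
  randoms: number[];
}

// Capture: wrap the real sources and remember every value they return.
function recordSources(real: Sources): { sources: Sources; recording: Recording } {
  const recording: Recording = { times: [], randoms: [] };
  const sources: Sources = {
    now() {
      const value = real.now();
      recording.times.push(value);
      return value;
    },
    random() {
      const value = real.random();
      recording.randoms.push(value);
      return value;
    },
  };
  return { sources, recording };
}

// Replay: serve the recorded values in order and fail loudly on divergence.
function replaySources(recording: Recording): Sources {
  let timeIndex = 0;
  let randomIndex = 0;
  function take(values: number[], index: number, kind: string): number {
    const value = values[index];
    if (value === undefined) {
      throw new Error(`replay diverged: no recorded ${kind} value left`);
    }
    return value;
  }
  return {
    now: () => take(recording.times, timeIndex++, "time"),
    random: () => take(recording.randoms, randomIndex++, "random"),
  };
}

// A toy handler whose output depends on the clock and a random value.
function issueCoupon(sources: Sources): { code: string; expiresAt: number } {
  const code = "C" + Math.floor(sources.random() * 1000000);
  return { code, expiresAt: sources.now() + 60000 };
}

const live = recordSources({ now: () => Date.now(), random: () => Math.random() });
const original = issueCoupon(live.sources);
const replayed = issueCoupon(replaySources(live.recording));

// Regression guard: the replayed run must reproduce the original result exactly.
const guardPasses =
  replayed.code === original.code && replayed.expiresAt === original.expiresAt;
console.log(guardPasses);
```

Throwing when the recording runs out is a deliberate choice in this sketch: a replay that silently falls back to the live clock would no longer be the same execution.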
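// `withFluxNextJs` comes from the FluxRun SDK; `db` is the application's own database client.
export const POST = withFluxNextJs(
  'orders.create',
  async (flux) => {
    // The handler reaches the database through `flux`, using the `db` host object passed below.
    const order = await flux.db.createOrder(flux.request.body);
    return { status: 201, body: order };
  },
  { host: { db } },
);

System and ownership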
Product work plus platform depth
The useful proof is not a single UI screen. It is the end-to-end system, from SDK boundary capture through ingest, query, replay, and the regression workflow.
Current tech stack
SDK
TypeScript package, Next.js adapter, framework adapters, QuickJS execution, boundary capture
Control plane
Svelte dashboard, Cloudflare Pages, Cloudflare Workers, GraphQL APIs
Data plane
Go edge and data-node services for ingest, durability, query, and payload storage
Storage
PostgreSQL for control-plane state, OCI Object Storage for execution payloads
Safety
Customer-hosted replay agent, public/private replay keys, encrypted protected payloads
What I personally built
Designed the capture -> replay -> regression product loop and kept it narrower than generic observability.
Built SDK primitives for boundary capture, host RPC, framework adapters, and deterministic replay behavior.
Shipped the Svelte dashboard flows for runs, execution detail, replay readiness, and issue investigation.
Connected live ingest, agent verification, protected payload unlock, and production smoke checks.
Reworked the public story around safe backend replay instead of logs, tracing, or frontend session replay.
Hard parts
Why the system is hard
The difficulty is not collecting another log line. It is preserving enough execution state for repeatable replay while keeping production data and side effects under control.
Replay without live side effects
Replay had to preserve the failure path while preventing DB, payment, queue, email, and external API calls from firing again.
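One way to picture this constraint is a replay host that answers every host call from the recording and never reaches a real dependency. The sketch below is an assumption about the general technique, not FluxRun's implementation: `RecordedCall`, `ReplayHost`, `checkout`, and the method names are all hypothetical.

```typescript
// Hypothetical shapes for illustration; not FluxRun's real types.
interface RecordedCall {
  method: string;
  args: unknown[];
  result: unknown;
}

class ReplayHost {
  private readonly calls: RecordedCall[];
  private cursor = 0;

  constructor(calls: RecordedCall[]) {
    this.calls = calls;
  }

  // Serve the next recorded result. No real database, queue, or API is reached.
  call(method: string, args: unknown[]): unknown {
    const expected = this.calls[this.cursor];
    if (expected === undefined) {
      throw new Error(`replay diverged: unrecorded call to ${method}`);
    }
    const sameCall =
      expected.method === method &&
      JSON.stringify(expected.args) === JSON.stringify(args);
    if (!sameCall) {
      throw new Error(
        `replay diverged at call ${this.cursor}: recorded ${expected.method}, got ${method}`,
      );
    }
    this.cursor += 1;
    return expected.result;
  }

  // True when the replay consumed every recorded call.
  isComplete(): boolean {
    return this.cursor === this.calls.length;
  }
}

// Toy handler: charges a card, then queues a receipt email.
function checkout(host: ReplayHost, orderId: string): unknown {
  const charge = host.call("payments.charge", [orderId, 4200]);
  host.call("email.queue", [orderId]);
  return charge;
}

const recordedCalls: RecordedCall[] = [
  { method: "payments.charge", args: ["ord_1", 4200], result: { status: "declined" } },
  { method: "email.queue", args: ["ord_1"], result: null },
];

const replayHost = new ReplayHost(recordedCalls);
const replayedCharge = checkout(replayHost, "ord_1");
console.log(replayedCharge, replayHost.isComplete());
```

In this sketch, a call that is missing from the recording or does not match it raises an error instead of falling through to the live system, which is what keeps payments and emails from firing a second time.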
Capturing useful boundaries
The SDK needed enough data to reproduce behavior while keeping unsupported host APIs explicit instead of silently unsafe.
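A minimal sketch of an explicit boundary, assuming an allowlist design: host calls on the list are recorded, and anything else is rejected with a clear error instead of running unrecorded. `createCapturingHost`, `CapturedCall`, and the method names are hypothetical and do not come from the FluxRun SDK.

```typescript
// Hypothetical boundary helper; names are illustrative only.
type HostFn = (...args: unknown[]) => unknown;

interface CapturedCall {
  method: string;
  args: unknown[];
  result: unknown;
}

function createCapturingHost(host: Record<string, HostFn>, supported: string[]) {
  const captured: CapturedCall[] = [];

  function call(method: string, ...args: unknown[]): unknown {
    // Reject anything outside the allowlist before it can run unrecorded.
    if (supported.indexOf(method) === -1) {
      throw new Error(`host API "${method}" is not supported for capture`);
    }
    const fn = host[method];
    if (typeof fn !== "function") {
      throw new Error(`host API "${method}" is not registered`);
    }
    const result = fn(...args);
    // Values are stored by reference here; a real recorder would snapshot them.
    captured.push({ method, args, result });
    return result;
  }

  return { call, captured };
}

const boundary = createCapturingHost(
  {
    "db.createOrder": (input) => ({ id: "ord_1", input }),
    "fs.readFile": () => "file contents",
  },
  ["db.createOrder"], // only this call is allowed and recorded
);

const createdOrder = boundary.call("db.createOrder", { sku: "A1" });
console.log(createdOrder, boundary.captured.length);
```

Calling `boundary.call("fs.readFile")` throws in this sketch even though the function exists on the host, because it is not on the supported list.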
Sensitive payload handling
Protected data needed encryption and agent-mediated unlock so the dashboard could inspect executions without owning private keys.
Operational ingest reliability
Execution batches needed durable ingest, query visibility, token rotation, and clear diagnostics when project auth drifted.
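The retry and diagnostics side of this can be sketched as a queue that keeps each batch until ingest acknowledges it and reports why a flush stopped. Everything here is a hypothetical stand-in: `IngestQueue`, `SendResult`, and the diagnostic strings are invented for illustration, the queue is in-memory (real durability would need persistent storage), and a real sender would be an asynchronous network call.

```typescript
// Hypothetical sketch; not FluxRun's ingest client.
type SendResult = "ok" | "unauthorized" | "unavailable";

interface Batch {
  id: string;
  executions: unknown[];
}

interface FlushReport {
  sent: string[];
  diagnostic: string | null;
}

class IngestQueue {
  private pending: Batch[] = [];

  enqueue(batch: Batch): void {
    this.pending.push(batch);
  }

  size(): number {
    return this.pending.length;
  }

  // Send batches in order. A batch leaves the queue only after an "ok" response.
  flush(send: (batch: Batch) => SendResult): FlushReport {
    const sent: string[] = [];
    let batch = this.pending[0];
    while (batch !== undefined) {
      const result = send(batch);
      if (result !== "ok") {
        const diagnostic =
          result === "unauthorized"
            ? "ingest rejected the project token; it may have been rotated or revoked"
            : "ingest unavailable; batches kept for retry";
        return { sent, diagnostic };
      }
      this.pending.shift();
      sent.push(batch.id);
      batch = this.pending[0];
    }
    return { sent, diagnostic: null };
  }
}

const ingestQueue = new IngestQueue();
ingestQueue.enqueue({ id: "b1", executions: [] });
ingestQueue.enqueue({ id: "b2", executions: [] });

// Simulate a token that stops working after the first batch.
let attempts = 0;
const firstFlush = ingestQueue.flush(() => (attempts++ === 0 ? "ok" : "unauthorized"));

// After the token is fixed, the retained batch goes through.
const secondFlush = ingestQueue.flush(() => "ok");
console.log(firstFlush, secondFlush, ingestQueue.size());
```

Separating an auth failure from an outage in the report is the point of the sketch: the first calls for a token fix, the second only for a retry.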
Current status
FluxRun is controlled-beta ready: SDK capture, hosted ingest, dashboard visibility, replay readiness, and live execution flow have been validated. The next milestone is repeated real-user onboarding rather than broad public launch.
Next roadmap
Tighten the first-run setup so one Next.js endpoint reaches first replay in minutes.
Expand regression guard export and CI handoff from saved failing runs.
Broaden framework coverage only after the Next.js/webhook path is repeatedly smooth.
Add sharper team workflows for incident handoff, payload access review, and alert routing.