FluxRun Case Study: Replaying Production Backend Failures Safely

FluxRun turns production-only backend failures into replayable evidence and regression guards. The project proves the full loop: identify a hard debugging gap, design a safe replay system, and ship product-quality tooling around it.

Production-only failures

The hardest backend bugs showed up only after a real request, real upstream response, real clock, and real customer state lined up.

Evidence without reproduction

Logs and traces could prove what happened, but they still left the developer rebuilding the same inputs and mocks by hand.

Fixes without durable guards

After a fix shipped, the exact failed path often disappeared back into incident notes instead of becoming a regression check.

Why this matters

Logs and traces explain. They do not replay.

Monitoring tools show timestamps, spans, and error messages. They rarely preserve enough executable state to re-run the exact branch. FluxRun focuses on the missing step after detection: replaying the failed backend path with recorded IO and no production writes.

AI-assisted backend teams

Next.js route and webhook owners

Teams debugging payments, queues, external APIs, and tool calls

Capture

Wrap the backend route and record the request, logs, network calls, clock reads, random values, host calls, result, and error.
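The capture step can be pictured as a small recorder that sits between the handler and every nondeterministic input. The following is a minimal sketch under that assumption; `Recorder`, `BoundaryEvent`, and `lookupPrice` are hypothetical names for illustration and are not the FluxRun SDK API.

```typescript
// Hypothetical sketch: not the FluxRun SDK API.
type BoundaryEvent =
  | { kind: "time"; value: number }
  | { kind: "random"; value: number }
  | { kind: "host"; name: string; args: unknown[]; result: unknown };

class Recorder {
  readonly events: BoundaryEvent[] = [];

  // Record each clock read so a later replay can return the same timestamp.
  now(): number {
    const value = Date.now();
    this.events.push({ kind: "time", value });
    return value;
  }

  // Record each random draw so a later replay takes the same branch.
  random(): number {
    const value = Math.random();
    this.events.push({ kind: "random", value });
    return value;
  }

  // Wrap a synchronous host function and record its arguments and result.
  host<A extends unknown[], R>(name: string, fn: (...args: A) => R): (...args: A) => R {
    return (...args: A): R => {
      const result = fn(...args);
      this.events.push({ kind: "host", name, args, result });
      return result;
    };
  }
}

const recorder = new Recorder();
const lookupPrice = recorder.host("pricing.lookup", (sku: string) => (sku === "A1" ? 500 : 0));
const stamp = recorder.now();
const price = lookupPrice("A1");
```

After this runs, `recorder.events` holds one `time` event followed by one `host` event, in the order the handler touched each boundary. Ordering matters because replay has to hand the values back in the same sequence.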

Replay

Run the same execution path again with recorded IO so external systems do not fire during debugging.
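One way to read "recorded IO" is as a tape that is played back in order, with replay failing loudly if the code asks for something the tape does not contain. The sketch below assumes the handler reaches the clock and upstream APIs only through an injected `io` object; `ReplaySource`, `Io`, `handleCharge`, and the example URL are hypothetical and not part of FluxRun.

```typescript
// Hypothetical sketch: not the FluxRun SDK API.
type Recorded = { kind: "time" | "random" | "http"; key: string; value: unknown };

class ReplaySource {
  private cursor = 0;
  private readonly tape: Recorded[];

  constructor(tape: Recorded[]) {
    this.tape = tape;
  }

  // Return the next recorded value, or fail loudly if replay diverges from the tape.
  next(kind: Recorded["kind"], key: string): unknown {
    const entry = this.tape[this.cursor];
    if (!entry || entry.kind !== kind || entry.key !== key) {
      throw new Error(`replay diverged at step ${this.cursor}: expected ${kind}:${key}`);
    }
    this.cursor += 1;
    return entry.value;
  }
}

// The handler under test reads the clock and the upstream API only through `io`.
interface Io {
  now(): number;
  httpGet(url: string): { status: number; body: string };
}

function handleCharge(io: Io): string {
  const startedAt = io.now();
  const res = io.httpGet("https://payments.example/charge/42");
  return res.status === 200 ? `ok@${startedAt}` : `failed@${startedAt}`;
}

// Values as they might have been recorded during the failing production run.
const tape: Recorded[] = [
  { kind: "time", key: "now", value: 1700000000000 },
  { kind: "http", key: "https://payments.example/charge/42", value: { status: 502, body: "bad gateway" } },
];

const replaySource = new ReplaySource(tape);
const replayIo: Io = {
  now: () => replaySource.next("time", "now") as number,
  httpGet: (url) => replaySource.next("http", url) as { status: number; body: string },
};

// No network request is made: the 502 comes from the tape.
const replayResult = handleCharge(replayIo);
```

Here `replayResult` is `"failed@1700000000000"`, the same branch the recorded run took, and the payment endpoint is never contacted.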

Regression Guard

Keep the repaired branch as a repeatable proof that the production failure does not return.
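A regression guard can be as small as a saved input plus the error it once produced, re-run against the current handler. The sketch below assumes that shape; `SavedRun`, `computeTotal`, `runGuard`, and the quantity bug are invented for illustration and do not describe FluxRun's export format.

```typescript
// Hypothetical sketch: not the FluxRun guard format.
interface SavedRun {
  id: string;
  input: { quantity: number; unitPriceCents: number };
  recordedError: string; // error message captured from the failing run
}

type Outcome = { ok: true; totalCents: number } | { ok: false; error: string };

// The repaired handler. The original bug (assumed here) rejected quantity 0.
function computeTotal(input: SavedRun["input"]): Outcome {
  if (input.quantity < 0) return { ok: false, error: "negative quantity" };
  return { ok: true, totalCents: input.quantity * input.unitPriceCents };
}

// A guard passes when the saved input no longer reproduces the recorded error.
function runGuard(run: SavedRun, handler: (input: SavedRun["input"]) => Outcome): boolean {
  const outcome = handler(run.input);
  return outcome.ok || outcome.error !== run.recordedError;
}

const savedRun: SavedRun = {
  id: "run_example",
  input: { quantity: 0, unitPriceCents: 1999 },
  recordedError: "invalid quantity",
};

const guardPassed = runGuard(savedRun, computeTotal);
```

A check like this can run in CI next to ordinary unit tests. If a later change reintroduces the old behavior, `runGuard` returns `false` for the saved run.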

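// Imports are not shown in the source: withFluxNextJs comes from the FluxRun SDK,
// and db is the application's own database client.
export const POST = withFluxNextJs(
  // Name under which this route's executions are captured.
  'orders.create',
  async (flux) => {
    // The database is reached through flux, so the host call and its result can be recorded.
    const order = await flux.db.createOrder(flux.request.body);
    return { status: 201, body: order };
  },
  // Host objects made available to the handler on flux.
  { host: { db } },
);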

System and ownership

Product work plus platform depth

The useful proof is not one UI screen. It is the stitched system from SDK boundary capture through ingest, query, replay, and regression workflow.

Current tech stack

SDK

TypeScript package, Next.js adapter, framework adapters, QuickJS execution, boundary capture

Control plane

Svelte dashboard, Cloudflare Pages, Cloudflare Workers, GraphQL APIs

Data plane

Go edge and data-node services for ingest, durability, query, and payload storage

Storage

PostgreSQL for control-plane state, OCI Object Storage for execution payloads

Safety

Customer-hosted replay agent, public/private replay keys, encrypted protected payloads

What I personally built

Designed the capture -> replay -> regression product loop and kept it narrower than generic observability.

Built SDK primitives for boundary capture, host RPC, framework adapters, and deterministic replay behavior.

Shipped the Svelte dashboard flows for runs, execution detail, replay readiness, and issue investigation.

Connected live ingest, agent verification, protected payload unlock, and production smoke checks.

Reworked the public story around safe backend replay instead of logs, tracing, or frontend session replay.

Hard parts

Why the system is hard

The difficulty is not collecting another log line. It is preserving enough execution state for repeatable replay while keeping production data and side effects under control.

Replay without live side effects

Replay had to preserve the failure path while preventing DB, payment, queue, email, and external API calls from firing again.

Capturing useful boundaries

The SDK needed enough data to reproduce behavior while keeping unsupported host APIs explicit instead of silently unsafe.
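"Explicit instead of silently unsafe" can be illustrated with a stand-in host object that answers only from recordings and throws a named error for anything else. This is a minimal sketch using a standard JavaScript `Proxy`; `replayHost`, `UnsupportedHostCallError`, and the `Db` interface are hypothetical and synchronous for brevity.

```typescript
// Hypothetical sketch: not the FluxRun SDK API.
class UnsupportedHostCallError extends Error {
  constructor(method: string) {
    super(`host call "${method}" has no recording; refusing to run it during replay`);
    this.name = "UnsupportedHostCallError";
  }
}

// Build a stand-in host object backed only by recorded results, keyed by method name.
function replayHost<T extends object>(recorded: Record<string, unknown[]>): T {
  const queues = new Map<string, unknown[]>();
  for (const [method, results] of Object.entries(recorded)) {
    queues.set(method, [...results]);
  }
  return new Proxy({} as T, {
    get(_target, prop) {
      const method = String(prop);
      return (..._args: unknown[]) => {
        const queue = queues.get(method);
        // No recording left for this method: fail instead of touching a real system.
        if (!queue || queue.length === 0) throw new UnsupportedHostCallError(method);
        return queue.shift();
      };
    },
  });
}

interface Db {
  createOrder(body: { sku: string }): { id: string };
  deleteOrder(id: string): void;
}

const db = replayHost<Db>({ createOrder: [{ id: "ord_1" }] });
const created = db.createOrder({ sku: "A1" }); // served from the recording

let blocked = "";
try {
  db.deleteOrder("ord_1"); // never recorded, so it must not run
} catch (err) {
  blocked = err instanceof Error ? err.name : "unknown";
}
```

The design choice here is that an unrecorded call is an error rather than a no-op or a pass-through. A no-op could hide a divergence, and a pass-through could write to production.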

Sensitive payload handling

Protected data needed encryption and agent-mediated unlock so the dashboard could inspect executions without owning private keys.
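A common way to get this property is envelope encryption: the payload is encrypted with a one-time symmetric key, and that key is wrapped with a public key whose private half stays with the agent. The sketch below uses Node's built-in `crypto` module with RSA-OAEP and AES-256-GCM. The algorithm choices, `protect`, `unlock`, and `ProtectedPayload` are assumptions for illustration, not a description of FluxRun's actual scheme.

```typescript
import {
  createCipheriv,
  createDecipheriv,
  generateKeyPairSync,
  privateDecrypt,
  publicEncrypt,
  randomBytes,
} from "node:crypto";

// Hypothetical payload envelope: not the FluxRun wire format.
interface ProtectedPayload {
  wrappedKey: Buffer; // AES key encrypted with the replay public key
  iv: Buffer;
  tag: Buffer;
  ciphertext: Buffer;
}

// Capture side: encrypt with a fresh AES-256-GCM key, then wrap that key with the public key.
function protect(plaintext: string, publicKeyPem: string): ProtectedPayload {
  const key = randomBytes(32);
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { wrappedKey: publicEncrypt(publicKeyPem, key), iv, tag: cipher.getAuthTag(), ciphertext };
}

// Agent side: only the holder of the private key can unwrap the AES key and decrypt.
function unlock(payload: ProtectedPayload, privateKeyPem: string): string {
  const key = privateDecrypt(privateKeyPem, payload.wrappedKey);
  const decipher = createDecipheriv("aes-256-gcm", key, payload.iv);
  decipher.setAuthTag(payload.tag);
  return Buffer.concat([decipher.update(payload.ciphertext), decipher.final()]).toString("utf8");
}

// Example key pair; in the scenario described above the private key would live with the agent.
const { publicKey, privateKey } = generateKeyPairSync("rsa", {
  modulusLength: 2048,
  publicKeyEncoding: { type: "spki", format: "pem" },
  privateKeyEncoding: { type: "pkcs8", format: "pem" },
});

const sealed = protect(JSON.stringify({ card: "4242" }), publicKey);
const opened = unlock(sealed, privateKey);
```

In this arrangement a dashboard that stores or forwards `sealed` never needs `privateKey`. It can only show the plaintext after the agent runs `unlock` and returns the result.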

Operational ingest reliability

Execution batches needed durable ingest, query visibility, token rotation, and clear diagnostics when project auth drifted.
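"Clear diagnostics when project auth drifted" suggests separating authentication failures from transient ones on the sender side, so a rotated token is not retried blindly. The following is a sketch under the assumption that ingest is an HTTP endpoint returning conventional status codes; `nextStep`, `IngestAction`, and the backoff numbers are hypothetical.

```typescript
// Hypothetical sketch: not the FluxRun ingest protocol.
type IngestAction =
  | { action: "done" }
  | { action: "retry"; delayMs: number }
  | { action: "reauth"; diagnostic: string }
  | { action: "drop"; diagnostic: string };

// Map an HTTP status from an ingest endpoint to the sender's next step.
function nextStep(status: number, attempt: number): IngestAction {
  if (status >= 200 && status < 300) return { action: "done" };
  if (status === 401 || status === 403) {
    // Retrying with the same token cannot succeed, so surface the cause instead.
    return {
      action: "reauth",
      diagnostic: `ingest rejected the project token (HTTP ${status}); it may have been rotated or revoked`,
    };
  }
  if (status === 429 || status >= 500) {
    // Exponential backoff capped at 30 seconds; the batch stays queued meanwhile.
    return { action: "retry", delayMs: Math.min(30_000, 500 * 2 ** attempt) };
  }
  return { action: "drop", diagnostic: `ingest refused the batch (HTTP ${status}); not retryable` };
}

const afterOutage = nextStep(503, 3);
const afterRotation = nextStep(401, 0);
```

With these inputs, `afterOutage` asks for a retry after 4000 ms, while `afterRotation` stops retrying and carries a message that points at the token.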

Current status

FluxRun is ready for a controlled beta: SDK capture, hosted ingest, dashboard visibility, replay readiness, and live execution flow have been validated. The next milestone is repeated real-user onboarding rather than a broad public launch.

Next roadmap

Tighten the first-run setup so one Next.js endpoint reaches first replay in minutes.

Expand regression guard export and CI handoff from saved failing runs.

Broaden framework coverage only after the Next.js/webhook path is repeatedly smooth.

Add sharper team workflows for incident handoff, payload access review, and alert routing.