Patent Pending

Trevi

Infrastructure that keeps the inference stream alive.

vLLM. Proxies. Load balancers. Compressed streams. pgbouncer. Databases. A continuity layer for every stage of your AI inference pipeline. We built it. We tested it. Connections survive what your infrastructure does to itself.

Trevi — uninterrupted streaming inference: a connection enters at the edge and flows continuously through proxy, gateway, connector, and vLLM workers while every layer restarts underneath. The stream survives.

The assumption is breaking

Most distributed systems are built around interruption.

We are watching this assumption break the frontier AI ecosystem in real time.

May 21, 2026
A massive global outage hits OpenAI.

Millions of users and API developers stare at "Error in message stream" and dropped generations. The culprit isn't a broken model — it's a breakdown in how long-lived HTTP/SSE streams survive backend infrastructure churn.

March 2026
A cascading infrastructure collapse paralyzes Anthropic's Claude and Claude Code.

Minor control-plane blips trigger immediate client-retry loops, multiplying traffic into an aggressive upstream thundering herd that brings down the entire platform.

The industry historically accepted these multi-million-dollar disruptions because we believed infrastructure cannot change while live state flows through it. Every proxy, gateway, and worker holds session state in process memory. Replace the process, lose the state, drop the stream — and watch the retries stampede the cluster.

Backoff became our architecture.

Trevi changes that assumption.

The pushback

Distributed systems are designed for disruption. Clients reconnect, exponential backoff kicks in, the system heals itself. It mostly works.

For a basic, stateless web app, it does.

It does not work when you lift-and-shift an enterprise database to the cloud.

It does not work when a transient streaming break triggers a reconnect storm that turns a minor blip into a cascading 10% fleet-wide outage.

And it absolutely, catastrophically fails when those dropped sessions are running frontier inference across $200,000 8×GPU hosts — or eating capacity inside a $100 billion datacenter.

We've lived this

The operational tax has an exact anatomy.

Having built and operated infrastructure at two of the world's most critical hyperscalers — AWS and OCI — we know the shape of the bill by heart.

The 1:00 AM SRE page.

The nightly ops call with a 20-engineer war room watching dashboards.

The CFO counting the literal cash burn of idle silicon in real time while models reload from disk.

A temporary patch. A few months of quiet. Then another event.

Velocity dies. Team productivity drops to zero.

Not anymore.

The fix

We are system hackers and first-principles thinkers.

We refuse the lazy industry dogma that says infrastructure must be treated as dumb cattle, forcing operators to swallow the drops. We reject the legacy enterprise trap that claims long-lived streams can't be recovered without multi-million-dollar, on-premise, active-active hardware.

We took the problem head-on. We innovated at the transport and runtime layer to build a fluid architecture that protects any stateful workload directly in the execution path.

We didn't just design a blueprint to prove this can be fixed.

We fixed it.

What happens

The infrastructure changes underneath. The stream continues.

A client opens a connection at the edge.

The stream flows through proxies, gateways, queues, connectors, databases, model servers, and distributed runtimes.

The infrastructure changes underneath it.

The stream continues.

Not via reconnect.

Not via replay.

Not via retry semantics.

The actual in-flight stream survives.

What Trevi is

A continuity layer for long-lived distributed streams.

Active streams pass through every event that traditionally breaks them, without disconnecting.

Rolling deploys
Proxy restarts
Gateway upgrades
Process replacement
SIGTERM / SIGKILL events
Traffic reshaping
Live reconfiguration
Worker reseating
Runtime migration
Infrastructure maintenance

Every layer

One continuity property, applied across the entire inference pipeline.

We didn't pick one layer and ship a point fix. We built the property at every layer where the stream lives — and measured it.

L4 / L7 proxy & load balancer
Hanger-proxy holds in-flight streams across container SIGKILL. 6 / 6 trials, mean 1.56 s container-down, invisible to the client. Vanilla nginx: 0 / 6.
Long-idle keepalive & WebSocket
HTTP/1.1 idle keepalive (≤ 60 s) and WebSocket sessions (≤ 240 s tested) survive proxy restart at every idle duration — zero RSTs, zero reconnect logic on the client.
Streaming-response compression
Compressed SSE / HTTP response streams flow through the proxy's safe-boundary detector intact across restart; encoder state is reattached without re-emission of the prefix.
pgbouncer & connection pooling
Held client TCP session, cancel key, and SET parameters survive container SIGKILL. 50 / 50 trials, mean 209 ms container-down (~36% faster than vanilla's 285 ms).
Database session continuity
Postgres prepared statements, varcache state, and the open psql socket survive pool-side restart; second query post-kill returns on the same socket.
vLLM / model server
In-flight inference is byte-perfect across SIGKILL: 50 / 50 byte-identical to the no-kill baseline, mean 28.32 s recovery, GPU pages reclaimed (no disk reload), zero missing tokens. Vanilla: 0 / 50.
End-to-end customer stack — client → hanger-proxy → vLLM
Full path survives SIGKILL on the vLLM backend mid-SSE-stream: 50 / 50 trials, mean 31.0 s end-to-end, single TCP, no application-level retry, no client reconnect.

Same property.
Every layer of your inference pipeline.

What this enables

Infrastructure can finally take a breath.

Inference clusters roll continuously.

Databases upgrade without disconnect storms.

PgBouncer restarts without breaking sessions.

Fintech, healthcare, and other long-session pipelines survive infrastructure churn.

Streaming systems stop treating maintenance as failure.

Deploys stop being events.

Infrastructure becomes fluid instead of fragile.

What your inference could look like

Three measurements that pin the property down.

Two-lane Azure test bed, real vLLM workloads serving facebook/opt-6.7b over HTTPS, single-flow end-to-end tests against shipping code.

Proxy SIGKILL — active stream
6 / 6 survived
Hanger-proxy docker kill --signal=SIGKILL mid-stream; mean 1.56 s container-down (max 1.96 s) invisible to the client, 480 / 480 tokens delivered, zero reconnects. Vanilla nginx: 0 / 6 — every SIGKILL RSTs the connection.
PgBouncer SIGKILL — held session
50 / 50 survived
Modified pgbouncer + host hanger: same TCP socket, same cancel key, same SET parameters across container SIGKILL. Mean 209 ms container-down (vs vanilla's 285 ms — ~36% faster cold-reconnect, half the variance). Vanilla pgbouncer: 0 / 50.
vLLM SIGKILL — byte-perfect continuation
50 / 50 byte-identical
In-flight inference (facebook/opt-6.7b, T4, single TCP) across docker kill --signal=SIGKILL: every trial byte-identical to the no-kill baseline, mean 28.32 s recovery (stdev 187 ms), zero missing tokens, zero duplicates, zero client reconnects. Vanilla vLLM: 0 / 50 — request lost, model reload from disk 30–100 s. Model-load phase alone ~50–200× faster via GPU page reclaim.

Full customer stack (client → hanger-proxy → vLLM) survives SIGKILL on the vLLM backend mid-SSE-stream: 50 / 50 trials, mean 31.0 s end-to-end, one TCP connection, no application-level retry. Long-idle keepalive (HTTP/1.1 to 60 s, WebSocket to 240 s) survives the proxy restart at every idle duration tested. Across the SIGKILL trials (n = 50 per stack; n = 6 for the proxy), the signal is binary: every hanger-enabled trial preserved the session, every baseline trial dropped it.
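One way to score trials like these is to compare each captured response body against the no-kill baseline, byte for byte. The sketch below is a hypothetical harness written for illustration, not the one used on the test bed; `TrialResult` and `score_trials` are assumed names.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    survived: int  # trials whose bytes match the baseline exactly
    dropped: int   # trials that ended early or diverged


def score_trials(baseline: bytes, trials: list[bytes]) -> TrialResult:
    """Count trials that are byte-identical to the no-kill baseline."""
    survived = sum(1 for body in trials if body == baseline)
    return TrialResult(survived=survived, dropped=len(trials) - survived)


if __name__ == "__main__":
    baseline = b"data: tok1\n\ndata: tok2\n\n"
    # The third captured stream was cut short, so it counts as dropped.
    trials = [baseline, baseline, b"data: tok1\n\n"]
    print(score_trials(baseline, trials))
```

An exact byte comparison catches missing tokens and duplicates with the same check, which is why a single pass/fail count per stack is enough.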

A hypothetical year — methodology note

10,000 connections per stack. 10 million across the fleet. ~100 disruptive events per stack per year.

From here forward, the numbers shift from measured to modeled. The per-event measurements above are real (Azure test bed, single TCP, shipping code). The dollar figures below translate those measurements into a representative production year, with every assumption stated inline so the model is recalculable on your own numbers. The intent is not to assert what your bill is — it is to make the cost shape visible.

~50 routine deploys per year — roughly one a week.

~50 unplanned crashes per year — roughly one a week.

~100 disruptive events per stack, times a thousand stacks.

The published industry baseline for what happens on the unprotected path:

~25% of deploys in non-elite production infra trigger a reconnect storm of some shape. The DORA 2023 State-of-DevOps survey puts medium-performing teams at a 16–30% change-failure rate; a sizeable share of those failures show up as connection-level disruption rather than logic bugs.

1–5% of the fleet is the typical blast radius when a storm fires. Cloudflare's 2022 LBR incident postmortem reports ~1.5% of traffic impacted by a single deploy-triggered cascade. Discord's pre-resilience-investment gateway rolls were documented at 5–10% client disruption windows. Stack Overflow's 2019 pgbouncer outage reached 100% for ~15 minutes — extreme tail, but it shipped.

So: 50 deploys × ¼ storm rate ≈ 13 storm-causing deploys per year on the baseline stack — plus 50 crashes a year, every one of which RSTs every in-flight session on the dying process by definition. Each storm-causing deploy drops 100k – 500k connections across the 10 M fleet; each crash drops every session on the affected stack instantly. Annual total: millions of dropped sessions per year, plus the on-call and capacity tax of absorbing them.
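The baseline arithmetic can be recomputed on your own numbers. This is a minimal sketch using only the inputs stated above; the constant and function names are illustrative, and crash-driven drops are left out because they depend on how sessions are spread across stacks.

```python
DEPLOYS_PER_YEAR = 50
STORM_RATE = 0.25               # share of deploys that trigger a reconnect storm
FLEET_CONNECTIONS = 10_000_000
BLAST_RADIUS = (0.01, 0.05)     # share of the fleet hit when a storm fires


def storm_deploys() -> float:
    """Expected storm-causing deploys per year (the text rounds this up to 13)."""
    return DEPLOYS_PER_YEAR * STORM_RATE


def dropped_per_storm() -> tuple[int, int]:
    """Connections dropped by one storm, as (low, high)."""
    low, high = BLAST_RADIUS
    return round(FLEET_CONNECTIONS * low), round(FLEET_CONNECTIONS * high)


def dropped_per_year(storms: int = 13) -> tuple[int, int]:
    """Connections dropped by storm-causing deploys in a year, as (low, high)."""
    low, high = dropped_per_storm()
    return storms * low, storms * high


if __name__ == "__main__":
    print(storm_deploys(), dropped_per_storm(), dropped_per_year())
```

Swapping in your own deploy count, storm rate, and blast radius changes the band without changing the shape of the calculation.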

With Trevi, the same ~100 events fire — but every one of them is harmless. The connection's kernel TCP state, the in-flight HTTP/SSE response, the pgbouncer session, the vLLM scheduler queue — all of it is held by a broker that outlives the crashing process and is reclaimed by the successor:

~50 deploys per year. ~50 crashes per year. ~100 disruptive events.

Zero sessions dropped.

Zero application-level retries.

Zero on-call response.

Negligible overhead on the steady-state path: fd handoff and broker publish-diff cost is sub-percent of CPU, and the kernel TCP state lives where it always did.

Same fleet. Same deploy cadence. Same crash rate.
~100 events a year. Zero losses. Sub-percent overhead.
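For readers unfamiliar with fd handoff, the general Unix mechanism is to pass an open socket's file descriptor to another process over a Unix-domain socket (SCM_RIGHTS), so the connection outlives the process that first held it. The sketch below shows that generic technique only; it is not Trevi's implementation. It assumes Python 3.9+ on a Unix system, runs both ends in one process for brevity, and uses a socket pair to stand in for a live client connection. `hand_off`, `reclaim`, and `demo` are hypothetical names.

```python
import socket


def hand_off(channel: socket.socket, conn: socket.socket) -> None:
    """Send conn's file descriptor across a Unix-domain channel."""
    socket.send_fds(channel, [b"fd"], [conn.fileno()])


def reclaim(channel: socket.socket) -> socket.socket:
    """Receive one descriptor and wrap it in a socket object."""
    _msg, fds, _flags, _addr = socket.recv_fds(channel, 1024, 1)
    return socket.socket(fileno=fds[0])


def demo() -> bytes:
    # Stand-in for a client connection held by the "old" process.
    client, old_conn = socket.socketpair()
    # Channel between the old process and its successor.
    old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    old_conn.sendall(b"tok1 ")
    hand_off(old_end, old_conn)
    old_conn.close()             # the old holder goes away

    successor = reclaim(new_end)
    successor.sendall(b"tok2")   # same connection, no reconnect by the client
    successor.close()

    received = b""
    while chunk := client.recv(1024):
        received += chunk
    for sock in (client, old_end, new_end):
        sock.close()
    return received


if __name__ == "__main__":
    print(demo())
```

The client sees one uninterrupted byte stream because the kernel keeps the connection open while the descriptor is in transit between holders.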

The cost of not being fluid

Three line items. Every one of them disappears.

An organization paying for a non-resilient stack pays in three distinct currencies. Reasonable industry midpoints for a 10 M-connection SaaS:

Operational, ~$40k – $130k / year. Each storm-causing deploy spends 2–4 hours of on-call response × 2–3 engineers, plus a 2–5× auto-scaling surge for ~5–30 minutes while the reconnect cliff drains. 13 incidents × ~$3k–$10k of burdened response cost each.

Brand & revenue, the largest line item. 1.3 M – 6.5 M visible disruptions per year on a 10 M-user product is 13–65 disruptions per 100 active users. NPS is empirically –5 to –10 points per percentage point of reliability incidents that reach the user. At $10 / user / month and a 1% churn lift from reliability frustration, the annual revenue exposure on this fleet sits in the $1 M – $10 M band before any goodwill loss.

Productivity, $500k – $1 M / year for a 20-engineer team. Fear of deploys reshapes engineering culture: deploys batch larger, push to off-hours, gate on extensive pre-deploy testing. DORA's elite cohort deploys multiple times per day; the bottom quartile deploys monthly. Feature lead time stretches 2–3×. The hidden cost isn't the deploy itself — it's everything the team didn't ship because the deploy was scary.

Trevi removes the precondition for all three.

No dropped sessions, no on-call.

No visible disruption, no NPS hit.

No deploy fear, no batching, no slowdown.

Deploys stop being events. So does the operational tax on top of them.

The same year, on frontier inference

vLLM serving Opus / Codex-class models. The same cadence, counted at exactly one deploy and one crash per week: 104 events. A completely different cost shape.

Re-run the exercise with frontier-inference economics: each replica is 8× H100 (~$30 / hour idle), ML on-call burdened comp is ~$500 k / year in 2025 (~2.5× a typical SaaS engineer), and customers pay ~$15 per million output tokens for a SOTA model and expect the response to actually arrive.

The restart numbers above hold: vanilla SIGKILL loses the in-flight request entirely (0 / 50) and forces a 30–100 s disk-reload retry; Trevi delivers the in-flight stream byte-perfect in mean 28.3 s, model-load phase itself 50–200× faster via GPU page reclaim.

The dollar figures attached to those seconds do not hold. They go up sharply.

The base scale we'll cost against: 1 000 replicas × 8× H100 = 8 000 H100s. At ~$30 k / GPU capex that's roughly $0.24 B of hardware and ~$0.26 B / year of run-cost — a small-to-mid frontier inference deployment, smaller than a single H100 row in a hyperscaler datacenter. The reason for the small base is honesty: the line-item math below uses that scale so the assumptions are easy to interrogate. The very next subsection scales it linearly to the buildouts that are actually happening in 2025–2027 ($100 B-class datacenters in the Stargate / xAI Colossus / Meta-2026 vein).

Apply the same fleet shape (1 000 replicas × 104 disruptive events / year, ¼ of deploys triggering reconnect storms, 1–5 % blast radius) and decompose by cost line:

GPU idle during the restart window, ~$15 k / year recovered. Vanilla SIGKILL holds the replica idle for ~46 s × 104 events × 1 000 replicas ≈ 1 335 replica-hours / year (and the in-flight request is dropped on the floor). Trevi cuts that to ~818 replica-hours and the in-flight request survives byte-perfect. At ~$30 / hour per 8×H100 replica, the ~517 replica-hours saved are worth about $15 k of pure GPU time recovered annually. Small on its own; it matters because every idle H100-second is capacity the operator paid for and didn't ship a token for.

Token revenue forfeited during downtime, ~$72 k → ~$44 k / year. At ~1 000 output tokens / sec per replica × $15 / million, every minute of API-down on a replica is ~$0.90 of unbillable capacity. Across the same 104 × 1 000 events: vanilla forfeits ~$72 k / year just to restart windows; Trevi cuts that ~39 % AND — the part that matters more — the in-flight inferences resume on the same connection instead of being lost and regenerated.

Capacity over-provisioning for storm absorption, ~$8 M – $13 M / year. Production inference fleets keep dedicated H100 headroom specifically to absorb reconnect-storm thundering herds during deploys — the only response to a sudden 5 % reconnection spike on a non-resilient stack is more compute. On a fleet of 1 000 replicas at $263 k / replica / year, even a 3–5 % share of total spare capacity attributable to storm absorption is $8–13 M / year of recoverable spend. With Trevi the herds don't form and that buffer stops being load-bearing.

Operational cost at 2025 ML-engineer rates, ~$13 k – $38 k / year. ML on-call at ~$500 k / year fully burdened is ~$240 / hour. 13 storm-causing deploys × 2–4 hours × 2–3 engineers per response → $13 k–$38 k of pure incident response cost per year. Small relative to the GPU lines, but every one of those hours is an engineer not shipping code.

Engineering productivity drag, direct, ~$3 M / year for a 20-engineer ML team. 20 engineers × $500 k = $10 M payroll. The DORA elite-to-bottom-quartile gap reapplied to "deploy fear from reconnect storms" is closer to a 30 % effective velocity drag, not the 10–15 % the SaaS framing above used — at frontier-AI rates of change every quarter without a major rollout is a quarter of lost ground. 30 % × $10 M = $3 M / year of features that didn't land because deploys were treated as events.

Engineering productivity, indirect: $20 M+ / year, easily. This is the line that dominates everything else. A 30 % velocity drag on a frontier-AI team doesn't just delay features — it visibly slows model rollouts, lengthens the customer-acquisition cycle, lets faster-shipping competitors close gaps the team had previously opened, and starves adjacent product lines of attention because the deploy-fear ML team is monopolizing infra time. On a frontier inference business, these compound annually. A reasonable midpoint puts this line above $20 M / year — and unlike the GPU lines, it doesn't reverse the moment you next deploy.

Orchestration software to work around the failures, ~$3 M – $5 M / year inside the operator. Every retry layer, backoff schedule, idempotency-key store, circuit breaker, fallback path, replay queue, and "did the client actually get the response" verification system exists because the underlying inference fabric drops requests. Frontier labs typically dedicate 5–10 engineers across SRE / infra / SDK teams to nothing but this reliability-workaround layer. At $500 k burdened comp that is $3–5 M / year of pure operator engineering spent compensating for the non-resilient layer — every line of which Trevi makes deletable.

Customer-side workaround cost, $1.5 B – $3 B / year across the ecosystem. Every API customer writes the same orchestration code: retry-with-exponential-backoff, request-deduplication, partial-response reassembly, "is the agent loop still alive after the 502", graceful-degradation paths. Across ~10 000 enterprise accounts on a frontier-model platform with an average of 5 engineers each spending 10–20 % of their AI time on reliability glue at ~$300 k burdened comp, that is 50 000 engineers × 10–20 % × $300 k, or roughly $1.5 B – $3 B / year of global engineering time spent doing the same workaround on every team. This is not the operator's P&L line, but it is the ecosystem cost the operator's unreliability creates.

Customer trust at SOTA pricing, the largest line on the operator's own P&L and the hardest to bound. Enterprises paying ~$0.075 per Opus-tier API call (and growing their dependency on the response shape every quarter) expect the call to succeed. A single dropped response inside a coding-agent loop, an embedded support workflow, or a sales-tool generation is the customer's lived experience of the product breaking. A 1 % account-churn lift on a $100 M ARR book of frontier-AI business is $1 M of recurring revenue — every year, compounded, indexed against the cost of the deploy event that caused it.

Sum the operator-internal lines: roughly $34 M – $41 M / year of recoverable cost on a 1 000-replica frontier-inference fleet, before counting customer trust or the ecosystem cost. The single largest line — indirect productivity drag from deploy fear — alone is >$20 M. The GPU and ops lines combined are ~$50 k; what kills you isn't the wasted compute, it's the slowdown of every other thing your engineering team was going to do and the workaround code your customers had to write in response.
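The operator-internal sum can be reproduced from the stated inputs. The sketch below recomputes the four mechanical lines (GPU idle, token revenue, storm headroom, on-call) and adds the three judgment-based lines as fixed bands. Names are illustrative. The 2 080-hour work year behind the on-call rate is an assumption, and the vanilla restart window is taken as 46 s because the text gives it only approximately.

```python
EVENTS_PER_YEAR = 104
REPLICAS = 1_000
REPLICA_USD_PER_HOUR = 30.0             # one 8x H100 replica
HOURS_PER_YEAR = 24 * 365
TOKENS_PER_SEC = 1_000                  # output tokens per replica
USD_PER_MILLION_TOKENS = 15.0
ONCALL_USD_PER_HOUR = 500_000 / 2_080   # assumed 2,080-hour work year


def restart_hours(seconds_per_event: float) -> float:
    """Fleet-wide hours spent in restart windows per year."""
    return seconds_per_event * EVENTS_PER_YEAR * REPLICAS / 3600


def gpu_idle_usd(seconds_per_event: float) -> float:
    return restart_hours(seconds_per_event) * REPLICA_USD_PER_HOUR


def forfeited_token_usd(seconds_per_event: float) -> float:
    tokens = restart_hours(seconds_per_event) * 3600 * TOKENS_PER_SEC
    return tokens / 1e6 * USD_PER_MILLION_TOKENS


def storm_headroom_usd(share: float) -> float:
    """Annual cost of the fleet share held back for storm absorption."""
    return REPLICAS * REPLICA_USD_PER_HOUR * HOURS_PER_YEAR * share


def oncall_usd(storms: int, hours: float, engineers: int) -> float:
    return storms * hours * engineers * ONCALL_USD_PER_HOUR


def operator_total_usd(vanilla_s: float = 46.0, trevi_s: float = 28.32) -> tuple[float, float]:
    """Sum the operator-internal lines as (low, high)."""
    idle = gpu_idle_usd(vanilla_s) - gpu_idle_usd(trevi_s)
    tokens = forfeited_token_usd(vanilla_s) - forfeited_token_usd(trevi_s)
    lines = [
        (idle, idle),
        (tokens, tokens),
        (storm_headroom_usd(0.03), storm_headroom_usd(0.05)),
        (oncall_usd(13, 2, 2), oncall_usd(13, 4, 3)),
        (3e6, 3e6),      # direct productivity drag
        (20e6, 20e6),    # indirect productivity drag, taken at its stated floor
        (3e6, 5e6),      # orchestration software
    ]
    return sum(low for low, _ in lines), sum(high for _, high in lines)


if __name__ == "__main__":
    low, high = operator_total_usd()
    print(f"${low / 1e6:.1f} M to ${high / 1e6:.1f} M per year")
```

With these inputs the sum lands at roughly $33.9 M to $41.2 M, and the four mechanical lines other than storm headroom contribute well under $0.1 M of it.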

Trevi makes that tax zero.

Scale this up

Same per-event math, applied to the buildouts that are actually being announced.

A "$100 B datacenter" is no longer hypothetical. Stargate is $500 B over multiple years; xAI's planned expansion sits around $100 B; Meta's 2026 GPU footprint is on a similar curve. The base 8 000-H100 fleet we costed against is a single rack-row inside one of these buildouts. The cost lines scale roughly linearly with GPU count (GPU-idle, capacity over-provisioning, token revenue, customer trust) and sub-linearly with team-size-driven lines (orchestration software, productivity drag, on-call).

The same exercise, rerun at four scales:

Scale                                           H100s    Capex     Run / yr   Recoverable / yr
Base (what we modeled)                          8 k      $0.24 B   $0.26 B    $34 M – $41 M
Mid frontier lab (~Anthropic 2024)              50 k     $1.5 B    $1.64 B    $120 M – $154 M
Large frontier lab (~xAI Colossus, Meta-2026)   500 k    $15 B     $16.4 B    $800 M – $1.0 B
$100 B-class datacenter (Stargate-tier)         2.5 M    $75 B+    $82 B      $3.5 B – $4.3 B
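The capex and run-cost columns follow directly from the per-GPU and per-replica figures used in the base case. This is a small sketch under those stated inputs (~$30 k per H100, ~$30 / hour per 8×H100 replica), with illustrative names. The recoverable column is not recomputed because it mixes linear and sub-linear lines.

```python
H100_CAPEX_USD = 30_000
GPUS_PER_REPLICA = 8
REPLICA_USD_PER_HOUR = 30.0
HOURS_PER_YEAR = 24 * 365


def capex_usd(h100s: int) -> int:
    """Hardware cost for a fleet of the given size."""
    return h100s * H100_CAPEX_USD


def run_usd_per_year(h100s: int) -> float:
    """Annual run cost at the per-replica hourly rate."""
    return h100s / GPUS_PER_REPLICA * REPLICA_USD_PER_HOUR * HOURS_PER_YEAR


if __name__ == "__main__":
    for label, h100s in [("base", 8_000), ("mid", 50_000),
                         ("large", 500_000), ("100B-class", 2_500_000)]:
        print(f"{label:<11} capex ${capex_usd(h100s) / 1e9:6.2f} B"
              f"   run ${run_usd_per_year(h100s) / 1e9:6.2f} B / yr")
```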

At a $100 B-class buildout, the recoverable cost on the operator's own P&L is between $3.5 B and $4.3 B per year, dominated by capacity over-provisioning for storm absorption (~$2.5–3.5 B) and indirect productivity drag (~$0.4–0.8 B, capped because team size doesn't scale linearly with GPU count). The numbers are recoverable; whether they are recovered is a deployment decision.

And that's still only the operator's invoice. The macro line below — the fraction of the foundation model's economic value taxed off by reliability friction — is independent of how many H100s you bought, because it is keyed to what the model itself is worth to the world.

The macro

A frontier model promises ~$100 B / year of productivity to the world. Reliability friction taxes a real fraction of that.

The numbers above are the operator's invoice. They are not the full bill. The bigger number is the value the foundation model itself was supposed to create — and didn't, every time the inference dropped, the agent loop broke, the sales-tool generation half-streamed, the coding-assistant response truncated.

The frontier-model class — Opus, Codex, the next thing — is the productivity layer for the next decade of work.

McKinsey's generative-AI productivity work and the frontier labs' own forecasts converge on the same order of magnitude: a single SOTA model line, fully adopted, delivers roughly $100 B / year of economic value across its customer base at maturity.

Apply even a 5–15 % reliability tax — an analytical estimate of the loss when a meaningful share of inferences drop mid-flow, when teams batch around fragility, when agentic workflows reset because the connection died — and the implied figure is $5 B – $15 B / year of model productivity that should have flowed to customers, and didn't.

The operator captures a fraction of that as revenue. The customer captures a fraction of it as their own competitive advantage. Both fractions shrink in lockstep with the reliability tax.

Three observations follow:

The operator's reliability layer is not a cost-of-doing-business — it is a direct discount on the model's monetizable value. Every dropped inference is a customer experiencing a slightly weaker product than the model itself is capable of.

The customer-side workaround layer is not "good engineering hygiene" — it is productivity diverted from product, paid for at $300 k–$500 k per engineer-year, multiplied across every customer of every frontier model in the world.

The teams building frontier models cannot ship faster than the deploy fabric lets them. A 30 % velocity drag on the team that ships the model translates one-to-one into a 30 % slower cadence of capability rollouts to the customers paying $100 B / year to use them. Reliability friction at the infrastructure layer is, structurally, friction on the rate of AI progress.

Trevi is a continuity layer for long-lived distributed streams. The narrow version of that claim recovers $30 M+ / year on a frontier inference fleet. The full version unlocks the fraction of the model's $100 B in promised productivity that reliability friction has been quietly costing the world.

Talk to us about Trevi

Three design partners. Through 2026.

If you operate a long-lived streaming workload — frontier inference, agentic loops, regulated streaming sessions — and the cost of dropped streams shows up in your invoice, we want to talk. Direct partnership with the founder.

The Aqua Virgo aqueduct has fed Rome since 19 BC. Empires fell, wars came and went. The water kept flowing.
— Two thousand years of uninterrupted service.

Trevi brings that property to AI inference.