The Bleed - AI Compute as a Treasury Function

2026-05-04 · 06:14 CEST

The chat opened on a routine confirmation. A balance poller had been added to the cron list the previous evening - a small bash script behind a systemd timer with one job: hit the DeepSeek and Moonshot balance endpoints once a day, write the result to a JSON file, route a line into the morning digest. The 03:30 Stockholm fire was its second-ever run. The expectation was nothing. The digest landed.

💰 Balances polled 01:30 UTC
DeepSeek  $16.39  (Δ −$0.16)
Moonshot  $96.76  (Δ −$33.35 · $96 cash + $0 voucher)

Sixteen cents on DeepSeek. Thirty-three dollars on Moonshot. In twenty-four hours. No production trading was running. The Hermes agent - the only one nominally talking to Moonshot - had not been triggered by hand in days. A screenshot from platform.moonshot.ai/console/fee-detail arrived a minute later: kimi-k2.6 calls at a seven-second cadence, all night, against the same project ID, from one API key.

The bleed was already happening when the bleed was discovered.

What follows is the May bleed - two paid-API leaks that ran quietly on a self-managed agentic stack, the forensics that found them, the two rules that fell out, and the observer-based circuit breaker that now caps every paid provider at a dollar a day or ten calls - whichever fires first. The bug was small. The lesson was not. Any organisation running a production agentic workload past the demo stage owns a treasury problem disguised as an engineering problem, and the controls that prevent the bleed from happening again look more like CFO discipline than like a fix.

The early-warning system that surfaced it

The balance tracker that fired at 01:30 UTC that morning had been live for one day. Fifty lines of bash behind a daily systemd timer. Two prepaid providers polled - DeepSeek and Moonshot - because those were the two with usable balance APIs at the time; the others on the chain either had no public endpoint or sat on flat-fee subscriptions where the question was irrelevant.

The architecture was small on purpose. The poller read its keys via a sed-based helper out of the gateway's existing systemd drop-in (eighteen secrets already lived there; standing up a separate secrets file to ship a two-key poller would have been disproportionate). It wrote state atomically to ~/.openclaw/state/balances.json, held the current balance and the previous one with a delta computed at write time, and the morning digest rendered a section from the file. Four hours of evening work. Daily was the right cadence - the relevant signal is the delta, not the resolution.
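The state-file step can be sketched in a few lines. The original poller is bash; this is a Python illustration of the same idea, not the author's script. The function name update_balances and the field names current, previous, and delta are assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import Path


def update_balances(state_path: Path, polled: dict[str, float]) -> dict:
    """Merge freshly polled balances into the state file and write it atomically."""
    try:
        state = json.loads(state_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}  # first run or unreadable file: start with no history

    for provider, balance in polled.items():
        previous = state.get(provider, {}).get("current")
        # Delta is computed at write time, so the digest only has to render it.
        delta = None if previous is None else round(balance - previous, 2)
        state[provider] = {"current": balance, "previous": previous, "delta": delta}

    # Write to a temp file in the same directory, then rename over the target.
    # The rename is atomic on POSIX, so a reader never sees a half-written file.
    state_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=state_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp, indent=2)
    os.replace(tmp_name, state_path)
    return state
```

Called once a day with the two polled balances, the second run onward yields a delta per provider, which is the only number the digest line needs.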

The fact that the tracker existed at all is what made the bleed visible inside a day rather than at the end of a month. Before it landed, the only place a paid balance lived was the provider's web dashboard - and a web dashboard nobody is looking at is no signal at all. The cost of building it was an evening. The cost of not building it would have been the entire month's run on whichever bleed happened first. The cheapest piece of observability in an agentic system is the one that says the balance moved while you were not looking.

Two bleeds, two shapes

The two leaks looked identical in the digest line - a balance number that had moved further than it should have - and could not have been more different in their internal cause. One was a chain-level bug; the other was a session-level state quirk. The reason the same observability layer caught both is that the layer asks a question - did the balance move? - that is indifferent to the cause. Any sufficiently visible cost signal will catch any sufficiently expensive bug regardless of its shape.

Moonshot - the chain bleed

Hermes' fallback chain at the time read primary ollama/kimi-k2.6:cloud, fallback-1 ollama/deepseek-v4-pro:cloud, fallback-2 moonshot/kimi-k2.6 - that last one a direct paid-API call. The chain looked designed: free cloud primary, free cloud fallback-1, paid fallback-2 as the safety net. The safety net was the problem.

The cloud provider sat on a three-slot Pro plan. When the orchestrator, the research swarm, and Hermes all called the cloud at the same minute - a regular pattern when the morning briefing, an autoresearch sweep, and the market-monitor cron converged - the slots ran out. Primary and fallback-1 both timed out. The chain fell through to fallback-2, and Moonshot answered cleanly. The market-monitor cron fired every thirty minutes during market hours; each fire that found the cloud contended ran a tool-call loop on the paid endpoint at roughly twenty cents per call until the timeout. About a hundred and thirty-eight calls in twenty-four hours. Thirty-three dollars.

The fix was structural, not surgical. The paid-direct fallback came out of every chain. Hermes' fallback-2 was replaced with another cloud variant - slower under contention but free - and the agent was allowed to fail-stop rather than fall through to a billed endpoint. The market-monitor cron was disabled until the chain was verified. The Moonshot API key was disabled at the wall, which surfaced a separate finding: the key was still loaded in a stale migration backup of .env that nothing currently read. A key that "nothing currently reads" is the precondition for the next accident.
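The difference between falling through and fail-stopping can be sketched as a chain walk. This is a minimal illustration, not the gateway's actual resolver: Link, call_chain, ChainExhausted, and the allow_paid flag are all hypothetical names, and a timeout is modelled as Python's built-in TimeoutError.

```python
from dataclasses import dataclass
from typing import Callable


class ChainExhausted(RuntimeError):
    """Raised when every allowed link has failed; the agent fail-stops."""


@dataclass(frozen=True)
class Link:
    model: str  # e.g. "ollama/kimi-k2.6:cloud"
    paid: bool  # True for a metered, direct paid-API endpoint


def call_chain(chain: list[Link], send: Callable[[str], str], allow_paid: bool = False) -> str:
    """Try each link in order; skip paid links unless explicitly allowed."""
    for link in chain:
        if link.paid and not allow_paid:
            continue  # never fall through to a billed endpoint
        try:
            return send(link.model)
        except TimeoutError:
            continue  # contended or slow: try the next link
    raise ChainExhausted("all allowed links failed")
```

With allow_paid=True the walk reproduces the bleed: two cloud timeouts, then a clean paid answer. With the default, the same contention ends in ChainExhausted and no spend.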

DeepSeek - the session bleed

The DeepSeek burn was smaller in absolute terms - about $2.55 a day extrapolated, against a $0.16 baseline - and harder to find. The chain config was clean. Every primary read as cloud. Every fallback-1 read as cloud. The paid DeepSeek endpoint sat at fallback-2, exactly where the new policy said it should sit. The chain was not the source.

The bleed lived in sessions.json. A months-old Telegram inbound session for the orchestrator had two override fields stamped on it from a prior debugging episode that nobody had cleared:

// agent:main:telegram:direct:499042523
"providerOverride": "deepseek",
"modelOverride":    "deepseek-v4-flash"

These fields override chain resolution at request time. Every Telegram DM coming in to the orchestrator was being pinned directly to the paid DeepSeek endpoint, bypassing the chain entirely. Forensics walked five agents' session manifests. The orchestrator had one stale override - the bleeder. Newton and George had dozens between them, all benign, pinning to cloud rather than paid direct. The fix was a single edit removing the rogue keys after archiving the affected session. The next DM landed on a clean cloud session. Balance delta: zero.
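A sketch of why the override wins, and of the audit that finds it. The two field names come from the manifest above; everything else is an assumption for illustration: the shape of sessions.json as a mapping from session key to fields, the PAID_DIRECT set, and the function names.

```python
PAID_DIRECT = {"deepseek", "moonshot"}  # assumed set of metered, direct providers


def resolve_model(session: dict, chain: list[str]) -> str:
    """Return the model a request would use: a session override beats the chain."""
    provider = session.get("providerOverride")
    model = session.get("modelOverride")
    if provider and model:
        return f"{provider}/{model}"
    return chain[0]  # no override: the chain's primary is tried first


def find_paid_overrides(sessions: dict[str, dict]) -> list[str]:
    """List session keys whose override pins requests to a paid-direct provider."""
    return sorted(
        key
        for key, session in sessions.items()
        if session.get("providerOverride") in PAID_DIRECT
    )
```

Run across every agent's manifest, an audit like find_paid_overrides separates the one bleeder from the benign cloud pins without reading the chain config at all.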

Two leaks, two distinct causes, one symptom. The chain bleed was architectural. The session bleed was hygienic - overrides accumulate, and manifest fields beat the chain. The fix to each was small. The fix to the class of bug they belonged to was not.

Figure: The cost curve before and after Layer 1. Left, the May 4 bleed climbing unbounded against no ceiling. Right, the same shape clipped by the $1/day cap and the provider removed from chains on trip - operator paged, audit row written, flatline through the daily reset.

Two rules that fell out

The diagnostics produced two governance rules before they produced any code. Both look obvious in retrospect. Neither was obvious before the bleed happened.

Rule 7-bis - paid-direct dormancy. Direct paid-API providers may only sit in the fallback-2 position or lower of any chain. They must never appear at primary or at fallback-1. If a paid fallback fires regularly, that is a primary-reliability problem masquerading as a fallback-working-as-designed event - investigate the contention on the primary before tolerating the spend. The rule is structurally about blast radius. A primary or fallback-1 paid endpoint is a hot path. A fallback-2 paid endpoint is a safety net, and a safety net that is being walked on every day has become a floor.
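Rule 7-bis is mechanical enough to check in code. The sketch below assumes a chain is an ordered list of "provider/model" strings, as in the Hermes chain quoted earlier, and that the set of paid-direct providers is known up front; both are assumptions for the example, as are the names.

```python
PAID_DIRECT_PROVIDERS = {"moonshot", "deepseek"}  # assumption: metered providers


def dormancy_violations(chains: dict[str, list[str]]) -> list[str]:
    """Return one message per paid-direct model found at primary or fallback-1."""
    slot_names = ("primary", "fallback-1")
    violations = []
    for agent, chain in chains.items():
        # zip stops after two entries, so fallback-2 and lower are never flagged.
        for slot, model in zip(slot_names, chain):
            provider = model.split("/", 1)[0]
            if provider in PAID_DIRECT_PROVIDERS:
                violations.append(f"{agent}: {model} at {slot}")
    return violations
```

An empty list means the chains comply. Anything else names the agent, the model, and the hot slot it should not be sitting in.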

Rule 7-ter - hard daily cap. Paid-provider spend is hard-capped at $1 a day or 10 calls a day per provider, whichever fires first. The cap is enforced at the gateway via a circuit breaker that observes journal events, attributes spend, and patches the chain on trip. The thresholds were small on purpose: the system can lose a dollar before anything bigger happens, and the principle of "better the agent goes down than the wallet does" writes itself into the runtime.

The breaker

What the rules cost to implement was Layer 1 of a circuit breaker, scoped tightly and built in a single four-hour session. The watcher already tailed the gateway journal for fallback events. The breaker extended it. Every paid-provider event observed in the journal incremented an in-memory counter for that provider, and a tokens-to-dollars estimator ran against a small pricing table. If the per-provider call count crossed ten or the estimated 24-hour spend crossed a dollar, the breaker tripped that provider only. Tripping meant removing that provider's models from every agent's chain in openclaw.json, writing the patched config back, and waiting for the gateway's hot-reload. Per-provider isolation held: tripping DeepSeek did not affect Moonshot or Zhipu, and the chains for agents that did not use DeepSeek were untouched.
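The counting and tripping logic can be sketched as follows. This is not the production breaker: the class and function names are hypothetical, the pricing numbers are placeholders rather than real provider prices, and the sketch assumes a trip fires when a counter reaches its cap. Journal parsing, the config write-back, and the hot-reload are left out.

```python
from dataclasses import dataclass, field

# Placeholder pricing table, USD per million tokens (input, output).
# These numbers are invented for the example, not real provider prices.
PRICING = {"deepseek": (0.30, 1.20), "moonshot": (0.60, 2.50)}


@dataclass
class Breaker:
    max_calls: int = 10
    max_usd: float = 1.00
    calls: dict[str, int] = field(default_factory=dict)
    usd: dict[str, float] = field(default_factory=dict)
    tripped: set[str] = field(default_factory=set)

    def observe(self, provider: str, tokens_in: int, tokens_out: int) -> bool:
        """Record one paid call; return True if this call trips the provider."""
        price_in, price_out = PRICING[provider]
        cost = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
        self.calls[provider] = self.calls.get(provider, 0) + 1
        self.usd[provider] = self.usd.get(provider, 0.0) + cost
        # Assumption: the trip fires on reaching either cap, whichever comes first.
        over = self.calls[provider] >= self.max_calls or self.usd[provider] >= self.max_usd
        if over and provider not in self.tripped:
            self.tripped.add(provider)
            return True
        return False

    def reset_counters(self) -> None:
        """Daily reset: zero the counters. Re-enabling a provider is not modelled."""
        self.calls.clear()
        self.usd.clear()


def strip_provider(chains: dict[str, list[str]], provider: str) -> dict[str, list[str]]:
    """Return a copy of the chains with the tripped provider's models removed."""
    prefix = provider + "/"
    return {
        agent: [model for model in chain if not model.startswith(prefix)]
        for agent, chain in chains.items()
    }
```

Counters are keyed by provider, so a DeepSeek trip leaves Moonshot's state alone, and strip_provider returns chains without DeepSeek models unchanged.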

A small CLI sat on top of the state file for status, audit-tail, simulate-call, reset, and cap-setting. A daily reset timer fired at UTC 00:00 and zeroed the counters with one audit row. Six synthetic test scenarios ran before any production-bound enforcement was activated - call-count trip, single-call spend trip, per-provider isolation, manual reset, state-file deletion, state-file corruption. The canary went live with a deliberately low cap on a single provider (two calls or ten cents, not ten and a dollar) and ran for twenty-four hours before being scaled. Around a hundred and thirty lines of Python, a hundred-line bash CLI, a small state file.

The cleanest design choice in the breaker is one it shares with the watcher from the previous case study - observer and actor are separate components with separate write paths. The watcher writes a pin file; the breaker writes the canonical configuration. The observability layer never touches canonical state. The control layer is the only thing allowed to. Every chain mutation is traceable to a breaker trip, and every trip is traceable to a specific stream of journal events.

The cheapest piece of observability in an agentic system is the one that says the balance moved while you were not looking. Everything else - the chain audit, the rules, the breaker, the alerts - is the controlled response. The first piece is what made any of the rest possible.

What translates to a bank

The bleed itself was footnote-sized - a quiet evening on a self-managed VPS. The pattern is not. Banks running production agentic workloads in 2026 face the same shape of problem in a different denomination: tokens that cost real money, providers billing on different curves, chains that route on automatic decisions taken faster than any committee can see them, and a treasury function that has not historically had a seat at the routing decision.

Nexus pattern	Translates to
Balance tracker as early-warning observability	Treasury-grade daily balance reconciliation against every paid AI provider, with delta thresholds wired to an alerting layer that finance and operations both watch. Most bank AI pilots track spend in the provider's web console and reconcile monthly against a procurement invoice. That cadence is the difference between a $33 footnote and a $33,000 month.
Rule 7-bis - paid-direct dormancy at fallback-2+	Provider-tier discipline as a procurement constraint. Free or fixed-fee providers occupy hot positions on the chain; metered paid providers occupy safety-net positions only. A paid-direct provider sitting at primary or fallback-1 is a billing exposure the firm has accepted without measuring. The chain shape becomes a control narrative.
Rule 7-ter - daily hard cap per provider, enforced at the gateway	Daily compute caps as a CFO-tier control, set per workload and per provider, with named Process Owners accountable for spend variance. The supervisory question is not "do you cap spend?" - it is "where, by whom, and how does the gateway enforce it when the operator is asleep?" A self-applying cap is the only answer that survives the audit. Dashboards are where the operator finds out spend already happened; the breaker is where spend gets stopped.
Observer-writes-pin / actor-writes-config persistence boundary	Separation of monitoring from configuration change, with named owners, separate code paths, and independently testable failure modes - the same three-lines-of-defence shape that already governs trading-book and credit-risk decisioning. Compute spend belongs in the same governance model, not a parallel one.

Why AI compute is now a treasury function

Production agentic workloads run continuously, scale on demand, and bill in increments that aggregate quickly into material numbers. The cost variance is not a line item in the IT budget any more - it is a treasury exposure that needs daily reconciliation, per-provider caps, and a single accountable Process Owner per workload. The familiar quarterly cycle of "AI spend report at the steering committee" is the wrong cadence for a system that can produce a month's spend in a night. The cadence has to be daily, the cap has to be self-applying, and the responsibility has to belong to someone whose job title already says "controls" rather than someone whose job title says "build."

What I would do differently at bank scale

Three things. The first is ownership. The May bleed was caught by the same engineer who built the system, watching the morning digest he had written himself. That works for one operator on one VPS. It does not scale, and it is the wrong accountability shape even when it does. Every production agentic workload needs a named Process Owner who is not on the engineering team - someone in finance or risk whose explicit deliverable is the daily reconciliation, the cap review, the variance investigation, and the sign-off when a chain shape changes. The breaker stays an engineering artefact; the governance around it lives somewhere with a different reporting line.

The second is where enforcement sits. Layer 1 of the breaker is an observer - it counts journal events, estimates spend, patches the chain on trip. Right pattern for a single-operator system. At bank scale the cap has to sit further to the left, closer to where the request originates: declared in the agent's configuration with a hard schema, validated at boot, enforced at the gateway request boundary rather than reconstructed from journal events afterward. Observer-based breakers catch what you forgot to declare; declarative caps stop you forgetting. Both belong in the stack. The first one ships in an evening, and the May bleed is the argument for shipping it first.
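A declarative cap at the request boundary might look like the sketch below. It is one possible shape under stated assumptions, not a design the stack has shipped: ProviderCap, RequestGate, and BudgetExceeded are hypothetical names, and the per-request cost is assumed to be estimated before the call is sent.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised before a request is sent when it would breach the declared cap."""


@dataclass
class ProviderCap:
    max_calls_per_day: int
    max_usd_per_day: float

    def __post_init__(self) -> None:
        # "Validated at boot": a nonsensical declaration fails immediately.
        if self.max_calls_per_day <= 0 or self.max_usd_per_day <= 0:
            raise ValueError("caps must be positive")


class RequestGate:
    def __init__(self, caps: dict[str, ProviderCap]) -> None:
        self.caps = caps
        self.calls: dict[str, int] = {}
        self.usd: dict[str, float] = {}

    def authorize(self, provider: str, estimated_usd: float) -> None:
        """Admit or refuse a paid request before it leaves the gateway."""
        cap = self.caps.get(provider)
        if cap is None:
            # An undeclared paid provider is refused rather than left uncapped.
            raise BudgetExceeded(f"{provider}: no cap declared, request refused")
        calls = self.calls.get(provider, 0) + 1
        usd = self.usd.get(provider, 0.0) + estimated_usd
        if calls > cap.max_calls_per_day or usd > cap.max_usd_per_day:
            raise BudgetExceeded(f"{provider}: daily cap reached")
        self.calls[provider], self.usd[provider] = calls, usd
```

The contrast with the observer is in the timing. The gate refuses the request that would breach the cap, while the observer only learns about a call after it has been billed.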

The third is the dormancy rule itself. Rule 7-bis says paid-direct providers may only sit at fallback-2 or lower. At bank scale that rule needs teeth - config-schema validation that refuses to load an agent with paid-direct at primary or fallback-1, a CI check that fires on the same condition, and a quarterly audit that lists every paid-direct fallback that actually fired and asks why the primary above it was unreliable enough to need it. If your safety net is being used, what is your floor doing? The rule becomes a contract the platform enforces on itself. The engineering team does not get to forget it because the loader will not let them.

Closing the Nexus series

Seven case studies, one stack, the same translation problem

From the Telegram security hardening to the May bleed, the through-line has been the same - the patterns that keep a small, self-managed agentic stack honest at 03:00 are the patterns regulated firms are now being asked to evidence under DORA, AMLR, CCD2, the AI Act, and an emerging set of cost-governance expectations that do not yet have a regulation number but already have a line in the audit programme. The internals differ. The shape of the answer is recognisable. If any of it is useful to a problem you are working on, I would like to hear about it.
