When the Watcher Couldn't Be Trusted

2026-05-20 · 14:38 CEST The Telegram alert channel had been unusually busy all afternoon. Three fallback notifications, three recoveries, three more fallbacks, three more recoveries - same agent, same primary, same provider, in a loop tight enough that the dedup window was barely holding the line. Ollama Cloud's status page said the platform was healthy. The model itself, called directly with curl, returned a clean response in under two seconds.

And yet the watcher kept flipping the pin. Primary good · primary bad · primary good · primary bad - the flap visible in fallback-state.json if you stared at the file's modification time long enough. Somewhere in the middle of the loop, an inverted spike on an otherwise quiet sine wave, the same agent on the same primary kept landing on the local fallback for a few minutes at a time before being yanked back. Nobody had pressed a button. The system was overseeing itself, and the oversight was wrong.

The watcher had become the bug.

The failover watcher is the second-line safety mechanism for a production multi-agent stack - eighty-odd Python lines that tail the gateway journal, classify model-fallback events, alert the operator on the interesting transitions, and actively probe the primary on a sixty-second cadence to flip the pin back once it is healthy again. It was supposed to be the calmest piece of code in the system. For six days it was. On the seventh day a transient throttle wave on the cloud provider exposed a flaw in the probe logic, and the story that follows is the bug, the diagnosis, and the patch - a short one by case-study standards, with weight that comes from what it surfaces about the boundary between monitoring and decision-making under AI Act Article 14.

The watcher and what it watched

Nexus runs five agents - an orchestrator, a research swarm, a coding agent, a trading-intelligence agent, and a vision agent - across roughly a dozen foundation models from four providers. Each agent has a primary, a fallback-1, and a fallback-2, and the gateway routes a request through that chain on every call. When the primary fails, a [model-fallback/decision] line lands in the journal with the requested model, the candidate that served, a short reason, and the next rung. The watcher parses those lines, classifies them into five event types - fallback fired, hard failure, paid-provider success, paid-provider failure, recovery - and emits one Telegram card per transition with a sixty-second dedup window.

The watcher's pin state lives in fallback-state.json - one record per agent with current_serving, in_fallback, primary, and last_event_ts. The pin is consulted by the gateway on every new session. If current_serving says deepseek/deepseek-v4-flash (the paid Direct fallback), the next session lands there. If it says the cloud primary, the next session goes to the cloud. The pin is a small file with large consequences.

Two failures, six days apart

The first failure of the watcher was the absence of failback. Through the first week of May, the watcher could fire RECOVERY alerts when it observed a PRIMARY_SUCCESS event in the journal - but no such event ever appeared, because once an agent was pinned to fallback, the gateway never tried the primary again. The pin sat. On the 14th of May a transient 401 wave pushed the orchestrator onto the paid Direct fallback at 16:04 UTC and left it there for three hours, bleeding paid API quietly while the operator was elsewhere. An eighty-line patch added active failback: every sixty seconds, walk the agents, probe the primary for any agent on a non-primary pin, flip after two consecutive successful probes. End-to-end validation recovered four stale pins. Finding #95 - the long-running drift mystery - closed cleanly.

The second failure arrived six days later, dressed as the same problem in a different coat. On the afternoon of the 20th of May, alerts began clustering in pairs at a tempo that was not consistent with any real upstream incident. The cloud provider was fine. The model was fine.

The probe that always said yes

The failback probe was a single function: open an HTTP connection to 127.0.0.1:11434, the local Ollama endpoint, post a one-token chat completion against the primary model, return True on a 200 with non-empty content. Five-second timeout. Never raises. That endpoint is what cloud models and local models share - the local Ollama daemon proxies cloud requests through the same socket, so a request for deepseek-v4-flash:cloud and a request for a fully local model both go through 127.0.0.1:11434. From the probe's perspective the two are indistinguishable. It asks "is there a healthy model behind this endpoint right now?" - and the answer, almost always, is yes.

The bug followed from the probe's reach. When the cloud provider rate-limited a request - returning HTTP 429 with a body like {"error":"rate_limit_exceeded"} - the gateway correctly classified that as a primary failure and fell over to the fallback model. The pin moved. The watcher logged the transition. Sixty seconds later the failback layer fired its probe - the same probe, against the same endpoint - and the local daemon, having quietly served other traffic in the meantime, answered with a healthy 200. Two such answers in a row tripped the hysteresis counter. The watcher flipped the pin back to primary. The next real request hit the same 429 ceiling on the cloud provider. The gateway fell over again. The probe ran again. The probe succeeded again. The flap closed on itself, and the agent oscillated between primary and fallback at the cadence of the probe timer for the duration of the throttle window.

The false-failback loop - left, the unbounded oscillation under cloud throttling; right, the eight-signature classifier and 300-second cooldown that interrupts the probe path

Two details made the flap quiet. The Telegram dedup window suppressed every second or third pair, so the operator saw a flap but not the full tempo. And each RECOVERY alert was technically true - the primary had answered the probe. The lie was not in any single event. It was in the inference between events.

The fix - eight signatures and a five-minute pause

The patch is small, in keeping with the bug. A new classifier function, _is_transient_throttle(), scans the reason field of every fallback event against eight signature patterns - 429, rate_limit, throttle, and five more in the same family covering capacity, quota, and provider-specific phrasings. When a match lands, the agent record gains a transient_throttle_until timestamp set to five minutes ahead. The failback pass - the loop that walks the agents every sixty seconds - checks that timestamp before probing. If the agent is inside its cooldown, the probe does not fire. The pin stays on the fallback. The loop never starts.

The five-minute window is a compromise between two costs. Too short, and a flap that outlasts the cooldown can still close on itself. Too long, and a genuine recovery is delayed past the point where the operator might prefer a flip-back. Five minutes matches the typical throttle-window length for the cloud provider in question, and bounds the worst-case loss of paid-API budget to a tractable number of minutes.

Eight signatures is not a magic number. It is the union of every throttle phrasing observed in the journal across two weeks of operation against three providers, taking adversarial liberty with substring overlap so future variants are likely to be caught. Pattern lists are a place where pragmatism is the right discipline - the classifier errs slightly on the false-positive side, and the false positives are easy to spot: a fallback that should have been a fast recovery now waits five minutes. Two days of observation since the patch landed have surfaced none.

The lie was not in any single event. It was in the inference between events - a recovery probe and a fallback decision treated as independent, when in fact one had caused the other.

Monitoring is not decision-making

The cleanest lesson lives at the persistence boundary. The watcher writes to fallback-state.json - an in-memory pin file the gateway reads on each new session. It does not write to openclaw.json, the canonical configuration file that defines every agent's primary and fallback chain. Only one component does - breaker_state.py, the cost circuit that disables a provider after its daily cap is hit. The watcher observes; the breaker decides.

That split is not a coincidence. It was a deliberate design choice on the principle that a monitoring layer's failure modes should not corrupt the foundational state that defines what the system is. The flap loop bug was real. It was also bounded. The damage was confined to the pin file - recoverable with a single jq command, and self-recovering once the classifier was in place. Had the watcher been allowed to write to openclaw.json, the same bug would have rewritten the canonical chain on every flip - and the post-mortem would have started with a corrupted source of truth instead of a misbehaving cache.

The discipline generalises. In any production AI stack with feedback loops between observation and routing, the components that observe and the components that change canonical state are different processes - different write paths, different code-review surfaces, different deployment gates. Monitoring informs decision-making; it does not short-circuit it. The boundary is invisible most of the time. The week it is not invisible, it is the entire difference between a contained bug and an unrecoverable one.

What translates to a bank

The bug was a footnote in a self-managed AI stack. The pattern is not. Any bank running an agentic workload with a model-routing layer will grow some equivalent of this watcher inside the year, by design or by the accumulation of operational tooling. The shape of the failure mode generalises almost exactly. The regulatory framing has changed since the last time banks built infrastructure like this.

Nexus pattern	Translates to
Watcher with active failback, hysteresis, and cooldown	AI Act Article 14 effective human oversight - the obligation is not "have a monitor"; it is to design the system so a natural person can correctly interpret its outputs, override its decisions, and intervene before harm. A flap loop hidden behind a dedup window fails that test even when every individual alert is technically true. Oversight has to be auditable through the loop, not just the event.
Persistence boundary - watcher writes pin file, breaker writes canonical config	DORA Article 6 ICT risk-management framework and Article 9 protection-and-prevention controls - separation of monitoring from configuration change, with named owners, separate code paths, and independently testable failure modes. Also where a bank's three-lines-of-defence model lands cleanly: monitoring is the first line; configuration change requires the second.
Transient-throttle classifier - eight signature patterns, five-minute cooldown	Provider-failure taxonomy as a first-class control. The supervisory expectation under DORA and the AI Act both push toward classifying failures by cause - throttle vs. outage vs. drift vs. content-policy refusal - because the right response is different in each case. A monolithic "fallback" event is a control-narrative weakness; a classified one is a strength.
Bounded blast radius - pin file recoverable with one command	Resilience design: the recoverability of the worst-case state, not the avoidability of every failure. The supervisor's question is no longer "did your system fail" - it is "when it did, how long was it broken for, and prove it." A bounded blast radius answers that question with evidence; an unbounded one tries to argue it.

Why AI Act Article 14 turns this from a curiosity into a control Article 14 obliges high-risk AI systems - credit decisioning, hiring, biometric identification, several others under Annex III - to be designed so they can be effectively overseen by natural persons during use. The text is explicit about two things the failback episode tests directly: the operator must be able to correctly interpret the system's output, and must be able to intervene or interrupt the operation. A flap loop hidden behind a dedup window fails the first test. A watcher that can quietly rewrite routing state without crossing a configuration-change boundary fails the second. The fix is not "add a human to the loop" - it is to make the oversight layer's failure modes visible and its decision surface bounded, by design.

What I would do differently at bank scale

Three things. First, the classifier becomes a contract, not a function. The eight-signature list is fine for one operator on one VPS who can read the journal when a new pattern shows up. At bank scale the provider-failure taxonomy belongs in a shared library with a version, a test suite, and a published change history. Every new provider integration adds entries through the same gate. The supervisor's question - "how do you know your routing layer correctly distinguishes throttle from outage" - gets answered with a test run, not a paragraph.

Second, the dedup window earns a visibility audit. The flap loop was made quieter, and harder to detect, by the same dedup logic that protects the operator's attention. Both properties are correct in isolation; together they are a hazard. At bank scale, the suppression logic publishes its own counter - alerts suppressed in the last hour, by event hash - and a separate watcher flags when suppression is doing more work than alerting. Oversight of the oversight.

Third, the persistence boundary becomes a deployment constraint, not a habit. The watcher and the breaker live in separate repositories, deploy through separate pipelines, and write to non-overlapping paths enforced by the runtime. The discipline that kept the bug contained on the 20th of May was an artefact of how the code happened to be written. At bank scale, that artefact becomes a constraint a procurement officer can point at when a vendor proposes a monitoring product that wants write access to the configuration store. The right answer is no, and the reason is on the page.

Closing the Nexus series

Six case studies, one stack, the same translation problem

From the Telegram security hardening to the false-failback bug, the through-line has been the same - the patterns that keep a small, self-managed agentic stack honest at 03:00 are the patterns regulated firms are now being asked to evidence under DORA, AMLR, CCD2, and the AI Act. The internals differ. The shape of the answer is recognisable. If any of it is useful to a problem you are working on, I would like to hear about it.

Get in touch