
Multi-Model Orchestration Without Vendor Lock-In.

Routing six production agents across five foundation-model families — and keeping the conversation alive through a mid-task provider outage. This is the vendor-independence pattern that bank procurement committees are now asking architects to design for.

Every agentic AI system eventually answers the same question: which model runs which agent, and what happens when that model goes down? The question sounds like a procurement footnote. It is not. It is the architectural decision that determines whether an outage on one provider takes the whole system off the air, or whether the system carries on — and whether the answer to that is a matter of policy or a matter of luck.

Nexus — the multi-agent system I run on a self-managed VPS in Stockholm — orchestrates six co-operating agents across five distinct foundation-model families, with a fallback chain that has been exercised in anger. The routing is not accidental. Every agent is pinned to a primary model chosen for capability fit; every agent has at least one fallback on a different provider; and the scheduling respects a hard concurrency constraint that the provider imposes.

None of the patterns below are Nexus-specific. They are the generic patterns any organisation ends up needing when it wires more than one foundation model into production. Banks are walking into this problem now, under regulatory pressure — DORA Article 28 on ICT third-party risk and Article 29 on subcontracting chains make vendor-independence a design requirement rather than a preference. The translation at the bottom of this piece is explicit.

Six agents, five model families

Nexus is an OpenClaw-based orchestrator fronting five named agents plus itself: Nexus (the main orchestrator), Hermes (trading intelligence, nightly backtests, market monitoring), Newton (deep research, with a swarm pattern that spawns up to a hundred parallel sub-agents), Leonardo (vision, chart and image analysis), George (coding fallback and nightly maintenance), and a Claude Code agent reached through ACP for heavier engineering work. Each runs on a model chosen for the shape of the job.

The underlying provider surface is an Ollama Cloud Pro account that gives me three concurrent model slots. Claude sits outside that pool entirely, reached through Anthropic's own product via the Agent-to-Agent Coordination Protocol. Those two facts — three slots for the open-weight fleet, a separate path for Claude — shape every routing decision that follows.

Agent         Primary model             Fallback (1)    Fallback (2)
Nexus         GLM-5.1                   MiniMax M2.7    GLM-5
Hermes        Kimi K2.5                 GLM-5.1         GLM-5
Newton        Kimi K2.5 (swarm)         GLM-5.1         GLM-5
Leonardo      Gemma 4 31B               Kimi K2.5       —
George        Gemma 4 31B               GLM-5.1         GLM-5
Claude Code   Claude Opus 4.6 (ACP)     Gemma 4 31B     —

Ollama Cloud Pro pool caps concurrent slot count at 3; Claude runs outside it.
Figure 1. Nexus routing — primary models sit inside a three-slot Ollama Cloud pool; every agent has at least one cross-provider fallback; Claude is reached through a separate path.

Five provider families appear in the chart: GLM (Z.ai), MiniMax, Kimi (Moonshot), Gemma (Google DeepMind), and Claude (Anthropic). Every agent can degrade at least once to a provider different from its primary. That last property is the one that matters most, and the one most systems get wrong.

Five decisions that shape the routing

The model-by-model allocation is the surface; the decisions underneath it are portable. The same five decisions show up whenever anyone has to front an agentic system across more than one foundation model. They are summarised in the cards below, then walked through in turn.

Decision 1
Capability-first assignment
Models are pinned per agent by task shape — long-context research, vision-capable, tool-calling, coding — not by whatever is cheapest at the minute.
Decision 2
Two-deep, cross-provider fallback
Every agent has at least one fallback, and the fallback is on a different provider family. One-deep fallback is a single point of failure with a second name.
Decision 3
Slot-aware concurrency
The Ollama Pro pool caps at three concurrent models. Routing is planned against that ceiling; heavy jobs are serialised rather than bursted.
Decision 4
Session continuity across swap
When a model fails mid-task the session state is preserved, the successor picks up with full context, and the user is told a swap happened.
Decision 5
Claude on a separate path
The subscription-gated provider sits outside the Ollama pool. It is reached through ACP, quota-metered separately, and never counted against the three-slot budget.

Decision 1 — capability-first assignment

The starting move is boring and important: pick the model for the shape of the job, not for the headline benchmark. Leonardo's role is visual — chart analysis, image-table validation — so Leonardo runs on Gemma 4 31B, which accepts image input. Newton runs a research swarm where a hundred sub-agents each do narrow fetch-and-summarise work; Kimi K2.5 is unusually good at that DeepSearchQA shape and degrades gracefully under swarm load. Nexus itself is the orchestrator and sits on GLM-5.1 because GLM's tool-calling reliability under MCP-style fan-out is currently best-in-class among the open-weight families. George and the Claude Code agent both cover coding, but at different tiers: Claude Opus 4.6 for architectural work through ACP, Gemma 4 as the cheap-and-local fallback when Claude is unreachable.

The anti-pattern is to pick one model and run everything on it because that is operationally simpler. It is operationally simpler right up until the day the one model goes down, at which point the whole system is down. A per-agent assignment is marginally more work to maintain and dramatically more resilient.
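Capability-first assignment reduces to a small, boring data structure. A minimal sketch, assuming a tag-set representation of the catalogue — the `CATALOGUE` dict and `pick_model` helper here are illustrative, not OpenClaw's actual config format:

```python
# Hypothetical capability-tagged catalogue; model names mirror the article,
# the tags and helper are illustrative assumptions.
CATALOGUE = {
    "glm-5.1":         {"tool_calling", "long_context"},
    "kimi-k2.5":       {"research", "long_context"},
    "gemma-4-31b":     {"vision", "code"},
    "claude-opus-4.6": {"code", "architecture"},
}

def pick_model(required: set[str]) -> str:
    """Return the first catalogue model whose tags cover the task shape."""
    for model, tags in CATALOGUE.items():
        if required <= tags:        # the model's tags must cover the job
            return model
    raise LookupError(f"no model covers {required}")
```

The point of the structure is that routing is a lookup against declared capabilities, not a per-request preference: a vision task can only ever land on a vision-tagged model.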

Decision 2 — two-deep, cross-provider fallback

OpenClaw's config supports a fallback chain per agent, expressed as an ordered array. The rule I enforce is that the chain is at least two deep, and at least one fallback is served from a different provider than the primary. One-deep fallback within the same provider is not fallback; it is rebranding.

// ~/.openclaw/openclaw.json — Hermes agent definition, excerpt
{
  "id": "hermes",
  "model": "ollama/kimi-k2.5:cloud",
  "fallbacks": [
    "ollama/glm-5.1:cloud",   // different family
    "ollama/glm-5:cloud"     // last-resort, same family
  ],
  "tools": { "deny": ["image"] }
}

Hermes is Kimi-primary because Kimi K2.5 is the strongest research-and-reasoning model in the pool for the nightly-backtest workload. Its first fallback is GLM-5.1, a different family. The second fallback, GLM-5, sits on the same family as the first but is functionally a different model — slower, longer-context, older — and exists only so that a simultaneous outage of Kimi and the GLM-5.1 endpoint still leaves something available. The chain is three models from two families, not one.
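The runtime half of that config is an ordered walk down the chain. A sketch of the resolution loop, assuming a generic `call(model, prompt)` transport — the exception type and function shape are illustrative, not OpenClaw internals:

```python
# Illustrative fallback resolution: try the primary, roll down the chain
# on provider failure, and only give up when the whole chain is exhausted.
class ProviderError(Exception):
    pass

HERMES_CHAIN = ["ollama/kimi-k2.5:cloud", "ollama/glm-5.1:cloud", "ollama/glm-5:cloud"]

def run_with_fallback(chain, call, prompt):
    last_err = None
    for model in chain:
        try:
            return model, call(model, prompt)   # report which model answered
        except ProviderError as err:
            last_err = err                      # roll to the next model
    raise last_err
```

Returning the serving model alongside the answer is what later makes the visible-swap notice of Decision 4 possible.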

Decision 3 — slot-aware concurrency

Ollama Cloud Pro caps me at three concurrently loaded models. Six agents drawing on five model families — that arithmetic does not fit inside three slots, so routing has to be planned against the ceiling rather than hoping for the best. The overlap is deliberate: Newton and Hermes both land on Kimi K2.5, Leonardo and George both land on Gemma 4. At any moment the live slot count rarely exceeds three: GLM-5.1 for Nexus orchestration, Kimi K2.5 for research and trading intelligence, and Gemma 4 for vision and cheap coding.

When a cron job needs a fourth model simultaneously — for example the weekly quality-review job which leans on GLM-5.1 while Nexus is mid-task — the fallback chain is what makes the collision survivable. The chain is not just there for vendor outage; it is also there for slot contention. Treating a slot-full response as a transient failure, not a permanent one, keeps the system smooth.
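Treating the slot ceiling as a scheduling constraint rather than an error condition can be sketched with a bounded semaphore — jobs that would be a fourth concurrent model wait instead of failing. The pool object and wrapper are illustrative, not the provider's API:

```python
# Illustrative slot-aware scheduling: the semaphore models the three-slot
# Ollama pool; a slot-full condition blocks (serialises) rather than errors.
import threading

POOL_SLOTS = threading.BoundedSemaphore(3)   # the Ollama Cloud Pro ceiling

def run_on_pool(job):
    with POOL_SLOTS:          # acquired: one of three slots is now ours
        return job()          # released on exit, even if the job raises
```

This is the "transient, not permanent" stance from the paragraph above expressed in code: contention becomes latency, not an outage.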

Decision 4 — session continuity across swap

The hardest problem is not failover; it is failover without dropping context. OpenClaw's session store is keyed per channel-sender pair and persists across model swaps. When the primary fails and a fallback promotes, the successor inherits the full conversation history, the system prompt, the SOUL.md behavioural rules, and the tool-call state. The user is told, in-line, that a swap happened — "resuming on GLM-5.1 after MiniMax M2.7 timed out" — not so they can act on it but so the trust property of the system stays honest.

The opposite pattern, silent swap, is what bank pilots often default to and it is a trap. If an answer arrives on a fallback model without the user being told, then two subtly different answers to the same question — one from each model — will eventually land and there will be no way to tell which is which. Observability is a control, not a nice-to-have.
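The continuity-plus-visibility property can be sketched in a few lines, assuming a session object keyed per channel-sender pair — the `Session` shape and `promote_fallback` helper are illustrative, not OpenClaw's session store:

```python
# Illustrative continuity-preserving swap: state survives the model change,
# and the swap is surfaced to the user in-line rather than hidden.
from dataclasses import dataclass, field

@dataclass
class Session:
    key: tuple[str, str]                    # (channel, sender)
    model: str
    history: list[str] = field(default_factory=list)

def promote_fallback(session: Session, fallback: str, reason: str) -> Session:
    notice = f"resuming on {fallback} after {session.model} {reason}"
    session.history.append(notice)          # visible swap, never a silent one
    session.model = fallback                # successor inherits everything else
    return session
```

Note what does not change: the key, the history, the behavioural rules. Only the serving model is replaced, and the replacement leaves a trace.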

Decision 5 — Claude on a separate path

Claude is treated as structurally different from the open-weight fleet. It is not in the Ollama pool; it is not counted against the three-slot budget; it is reached through ACP from inside Nexus when the coding task is heavy enough to warrant it. The trigger phrase in the orchestrator is explicit — "run this in Claude Code" — and the budget is tracked against the Anthropic Max subscription, not the Ollama Pro quota.

Keeping the subscription-gated provider on its own path is a cleanliness move. It lets me run open-weight models hot for 90% of traffic and reach for Claude only when the job clearly earns it — architecture work, security audits, complex refactors — without mixing two different billing and rate-limit regimes in the same routing table.
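Keeping the two regimes apart is, mechanically, just two meters that never borrow from each other. A toy sketch — the pool names, budget numbers, and `charge` helper are all illustrative assumptions:

```python
# Illustrative separate-path metering: the subscription-gated budget and the
# open-weight pool budget are independent, and exhaustion of one never
# silently spills traffic onto the other.
BUDGETS = {"ollama_pro": 3_000_000, "anthropic_max": 500_000}   # tokens, made up
USED = {"ollama_pro": 0, "anthropic_max": 0}

def charge(pool: str, tokens: int) -> None:
    if USED[pool] + tokens > BUDGETS[pool]:
        raise RuntimeError(f"{pool} quota exhausted")  # a named failure, not a spill
    USED[pool] += tokens
```

The design choice is that a quota breach on the premium path is an explicit error to handle, mirroring the "named incident, not a silent degradation" rule in the bank translation below.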

A production incident, told plainly

In mid-April, MiniMax M2.7 went down during a working session. I was mid-conversation with Nexus — it was the orchestrator's primary at the time — and a tool-call came back with the Ollama provider's standard rate_limit error, which in this case was not a rate limit but an upstream outage masquerading as one. OpenClaw did exactly what the fallback chain asked it to: rolled to the configured next model, resumed the session with full history, and continued the tool call. From my side, one message was re-tried; from the session's side, it was a one-second stutter.

What I did after the incident is the part worth reporting. I promoted GLM-5.1 to be Nexus's primary permanently, and demoted MiniMax M2.7 to the first fallback. The demotion was not because M2.7 was a bad model — its PinchBench and SWE-Pro scores on the orchestration workload had been strong — but because an outage on the primary is worse than the same outage on a fallback, even if the fallback is a slightly weaker model. A fallback outage is a minor correctness degradation; a primary outage is a UX interruption. If two models perform comparably and only one of them has been proven to survive an outage, the survivor gets promoted.

Fallback chains are often written and never exercised. The one that matters is the one you have watched absorb a real outage without telling the user anything more than "swapped."

There is a second lesson inside the same incident, quieter than the first. The two-deep chain did its job. If MiniMax had been the only fallback, the outage would have turned into a cascading failure the moment it was reached. Because GLM-5 sat behind it, the system kept a third option available — never used, on that day, but present. Two-deep matters most on the day nobody tells you about.

What translates to a bank

Every routing decision above has a direct analogue in a bank spinning up an internal agentic platform. The mapping is one-for-one. DORA Article 28 requires a written policy on ICT third-party risk covering concentration, exit, and substitutability. Article 29 extends the same obligations down the subcontracting chain. Together they formalise exactly the property Nexus's routing is designed for: the system must not become unavailable because one vendor becomes unavailable.

Nexus decision → Bank translation

Capability-first model assignment → Pre-approved model catalogue with capability tags (long-context, tool-calling, vision, code); routing policy selects from the catalogue per use-case, not per preference.

Two-deep cross-provider fallback → DORA Art. 28 substitutability requirement answered with at least two hot alternatives from distinct legal entities; at least one non-US, non-CN if the primary is either, to cover sovereignty risk.

Slot-aware concurrency → Capacity contract expressed against simultaneous model load, not monthly token count; routing scheduler aware of per-tenant slot budget; red alerts on slot-full events feeding into the ICT risk MI pack.

Session continuity across swap → State store (context, tool calls, behavioural policy) externalised from the model runtime; swap events logged to an immutable audit log with before/after model identifiers; user-visible indication that a fallback is active.

Claude / subscription-gated path separate → Premium-tier models (closed-weight, contractual SLA) quota-metered on a separate budget; breach of SLA on the premium path triggers a named incident, not a silent degradation to the open-weight pool.
Why this matters for procurement

The conversation bank procurement committees are having right now is not "which model do we buy." It is "what does the exit plan look like on day one." Vendor-independence is no longer a negotiating stance; under DORA it is a documented design requirement. An agentic platform that cannot name its fallbacks and demonstrate that they have been exercised has not met the bar.

None of the above is exotic. All of it is what a DORA review would expect to see for any ICT service classified as critical, which an internal LLM platform almost certainly is. A surprising number of in-flight bank pilots run single-provider, single-model, no-fallback configurations that would not pass a five-minute review on the substitutability question alone. That is the gap this case study exists to point at.

What I would do differently at bank scale

Three things. First, the per-agent model assignment would be a versioned policy artefact, signed off by the model-risk function, not a JSON file anyone with shell access can edit. The Nexus pattern — openclaw.json under the operator's home directory — is defensible for one person running one system; it is not defensible for a regulated estate. Change-control on the routing policy has to be as strict as change-control on the models themselves.

Second, the fallback chain would be continuously exercised, not discovered during an outage. A weekly synthetic job would deliberately degrade each agent onto each fallback in turn and measure correctness, latency, and cost deltas against the primary. The point is not to catch regressions in the fallback model; it is to catch regressions in the swap machinery — the routing code, the session-carry logic, the observability hooks. Unexercised fallback is assumed fallback.
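The weekly synthetic job can be sketched as a loop over chains that forces each fallback and records deltas against the primary. The `probe` callable and report shape are illustrative assumptions, not a real harness:

```python
# Illustrative fallback exercise: deliberately degrade each agent onto each
# fallback in turn and measure the delta against the primary.
def exercise_fallbacks(chains: dict[str, list[str]], probe) -> list[dict]:
    """chains maps agent -> [primary, fallback1, ...]; probe(model) -> (ok, latency_s)."""
    report = []
    for agent, chain in chains.items():
        _, primary_latency = probe(chain[0])          # baseline on the primary
        for fallback in chain[1:]:
            ok, latency = probe(fallback)             # exercises the swap machinery
            report.append({
                "agent": agent,
                "fallback": fallback,
                "ok": ok,
                "latency_delta": latency - primary_latency,
            })
    return report
```

What matters is that `probe` goes through the real routing and session-carry code, not a mock of it — the fallback model is rarely the thing that regresses.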

Third, the routing layer would emit a structured event stream to the bank's SIEM and to the ICT risk MI pack — every primary failure, every fallback promotion, every slot-contention event — so that the concentration risk DORA Article 28 demands visibility into is actually visible. Operational evidence, produced continuously, replaces the annual vendor-risk questionnaire as the primary artefact. That is the shape of ICT third-party risk management when it has grown up into something that can survive a real outage rather than merely document the possibility of one.
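The event stream itself is the simplest part: one structured record per routing decision, serialised for the SIEM. A minimal sketch — the field names and event kinds are illustrative, not a standard schema:

```python
# Illustrative structured routing event: one JSON line per primary failure,
# fallback promotion, or slot-contention event.
import json
import time

def routing_event(kind: str, agent: str, **fields) -> str:
    event = {"ts": time.time(), "kind": kind, "agent": agent, **fields}
    return json.dumps(event)    # one line, machine-parseable downstream

# e.g. routing_event("fallback_promoted", "nexus",
#                    from_model="minimax-m2.7", to_model="glm-5.1")
```

Emitted continuously, these records are the operational evidence the paragraph above describes: concentration risk becomes something you can query rather than attest to.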

Those three changes are the gap between "a production system one person runs" and "a production system a bank runs." Everything else — the capability-first assignment, the two-deep cross-provider fallback, the slot-aware concurrency, the continuity-preserving session store, the separate path for the subscription-gated provider — generalises cleanly.