Agentic KYC - Unravelling Corporate UBO Across Jurisdictions

KYC is where banks bleed time. Onboarding a mid-sized corporate customer in Europe today takes four to six weeks on average, and the bulk of that window is not account opening, risk scoring, or transaction testing - it is tracing the ownership chain. An ultimate beneficial ownership (UBO) trace starts with a legal entity and must terminate at named natural persons, and the path between the two runs through corporate layers distributed across company registers, beneficial ownership registers, shareholder agreements, trust deeds, and - increasingly - foundation structures whose transparency is a political negotiation rather than a technical fact. The work is expensive, the staff who do it are scarce, and under the EU Anti-Money Laundering Regulation (AMLR) the bar rises again.

This is precisely the class of problem agentic AI is good at. Not because language models are clever at company law - they are not - but because the problem itself is agentic-shaped. The work is embarrassingly parallel: every jurisdictional register can be queried independently. It is multi-modal: some registers return JSON, some HTML, some scanned PDF, some require a session login. It is iterative: the output of one query determines the next. And it is retry-heavy: registers go down, rate-limit, or return incomplete filings. A single-shot LLM call to "find the UBO of Acme Holdings AB" is the wrong shape. A coordinated fan-out across specialised agents, consolidated back into an ownership graph, is the right one.

One honest note before the architecture. The previous two case studies in this series were rooted in things Nexus actively does in production - the Telegram security model is live 24/7, the April multi-model outage was an actual incident. This one is different. UBO unravelling is a capability I have architected using components Nexus already runs (Newton's research swarm, Leonardo's vision model, Firecrawl's scraper), but I have not pointed the assembled pipeline at a bank's onboarding queue. What follows is an architectural demonstration of how the components compose, plus the specific additional controls a bank would need on top. The distinction matters in anti-financial-crime work. Overclaiming dies loudly.

The components assembled

A KYC query arrives at Nexus as a named entity - "Acme Holdings AB, org.nr. 556XXX-XXXX, corporate onboarding, risk tier Enhanced." The orchestrator classifies the query against a jurisdiction-policy file (which registers to hit, in which order, when to give up), decomposes it into independent sub-questions, and fans them out in parallel across four specialised workers.
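A jurisdiction-policy lookup of the kind described here could be sketched as follows. This is a minimal illustration, not the actual Nexus policy file: the `JurisdictionPolicy` fields, the register keys, and every numeric value are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JurisdictionPolicy:
    """Hypothetical policy entry: which registers to hit, in which order, when to give up."""
    registers: tuple[str, ...]  # ordered by preference
    max_attempts: int           # attempts per register before giving up

# Illustrative values only; a real policy file would be maintained under change control.
POLICIES = {
    "SE": JurisdictionPolicy(("bolagsverket",), max_attempts=3),
    "MT": JurisdictionPolicy(("mbr",), max_attempts=3),
    "KY": JurisdictionPolicy(("gcr",), max_attempts=5),
}

def plan_queries(entity: str, jurisdictions: list[str]) -> list[dict]:
    """Decompose one onboarding query into independent per-register sub-questions."""
    plan = []
    for code in jurisdictions:
        policy = POLICIES.get(code)
        if policy is None:
            # No policy means no automated path: route to a human rather than guess.
            plan.append({"entity": entity, "jurisdiction": code, "route": "human_review"})
            continue
        for register in policy.registers:
            plan.append({
                "entity": entity,
                "jurisdiction": code,
                "route": register,
                "max_attempts": policy.max_attempts,
            })
    return plan

plan = plan_queries("Acme Holdings AB", ["SE", "LI"])
```

In this sketch a jurisdiction with no policy entry produces an explicit human-review task instead of an improvised query, which keeps the "when to give up" decision in the policy file rather than in an agent.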

Newton runs the research swarm - up to a hundred parallel sub-agents, each resolving a specific sub-question: "what does the Swedish Bolagsverket list as the direct shareholders of Acme Holdings AB", "is there an adverse-media hit on each surfaced natural person." The swarm runs on kimi-k2.5:cloud, chosen for research over reasoning - cheaper tokens, longer context, tolerance for noisy input. Parallelism is the point: the jurisdiction with the slowest API does not block the chain.
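The claim that the slowest source does not block the chain can be illustrated with a small asyncio sketch. The `query_register` function is a stand-in that simulates latency and failure; the register names and timings are invented for the example, and nothing here reflects how Newton's swarm is actually implemented.

```python
import asyncio

async def query_register(register: str, entity: str, delay: float) -> dict:
    """Stand-in for a real register call; `delay` simulates source latency."""
    await asyncio.sleep(delay)
    if register == "down-register":
        raise ConnectionError(f"{register} unavailable")
    return {"register": register, "entity": entity, "status": "ok"}

async def fan_out(entity: str, sources: dict[str, float], timeout_s: float) -> list[dict]:
    """Query every source concurrently; a slow or failing source never blocks the rest."""
    async def guarded(register: str, delay: float) -> dict:
        try:
            return await asyncio.wait_for(query_register(register, entity, delay), timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            # An unresolved leg is recorded explicitly, not silently dropped.
            return {"register": register, "entity": entity,
                    "status": "unresolved", "reason": type(exc).__name__}
    return await asyncio.gather(*(guarded(r, d) for r, d in sources.items()))

results = asyncio.run(
    fan_out("Acme Holdings AB",
            {"fast": 0.01, "slow": 0.5, "down-register": 0.0},
            timeout_s=0.1)
)
```

Each leg carries its own timeout and error handling, so the consolidation step receives one record per source whether or not the source answered.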

Leonardo handles vision. A surprising number of registers still return filings as scanned PDFs, especially for historic records - parts of the Cayman Islands GCR, Dutch KVK documents pre-2015, older French INPI filings. Leonardo runs gemma4:31b-cloud and extracts a structured representation - directors, shareholdings, dates - against a fixed schema. The agent is graded on whether its extraction validates, not on the prose it generates.

Firecrawl covers the long tail of registers without a usable API - structured HTML scrapes behind rate limits, occasionally behind a reader-only service login. Every request is throttled to stay within the source's rate limits; every response is cached and timestamped so freshness can be asserted later.
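The cache-and-timestamp behaviour might look like the sketch below. It does not use Firecrawl's API: `fetch` is any callable that returns a response body, and the class and field names are hypothetical. Rate limiting is left out to keep the example short.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class CachedResponse:
    body: str
    fetched_at: float  # Unix timestamp recorded when the source was actually hit

class TimestampedCache:
    """Serve cached responses while fresh; always keep the original fetch time."""

    def __init__(self, fetch: Callable[[str], str], clock: Callable[[], float] = time.time):
        self._fetch = fetch
        self._clock = clock
        self._store: dict[str, CachedResponse] = {}

    def get(self, url: str, max_age_s: float) -> CachedResponse:
        hit = self._store.get(url)
        if hit is not None and self._clock() - hit.fetched_at <= max_age_s:
            return hit  # fresh enough: no new request to the register
        fresh = CachedResponse(self._fetch(url), self._clock())
        self._store[url] = fresh
        return fresh
```

Because every response keeps its `fetched_at` value, a later reviewer can check how old the data behind a given edge was at the time of the determination.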

Sanctions and PEP screening is an overlay, not a separate agent - the consolidation step intersects each surfaced natural person against the union list (EU, OFAC, UK OFSI, UN) and a licensed PEP feed before the graph is finalised.

A small consolidation routine - dull Python, not an agent - merges the returns into an ownership graph, de-duplicates nodes across name-spelling variants, reconciles shareholding percentages where multiple sources disagree, and flags legs where no source resolved. What comes out is a canonical graph with provenance on every edge.
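A consolidation routine in this spirit could be sketched as below. The `Claim` record, the suffix table, and the half-point tolerance are all assumptions for illustration; real name matching needs far more than two suffix rules.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Claim:
    owner: str
    owned: str
    pct: Optional[float]  # None means the source could not resolve the holding
    source: str
    filed: str            # ISO date of the filing

SUFFIXES = ((" aktiebolag", " ab"), (" limited", " ltd"))  # illustrative only

def normalise(name: str) -> str:
    """Crude spelling normalisation so name variants share one node key."""
    folded = " ".join(name.lower().replace(".", "").split())
    for long_form, short_form in SUFFIXES:
        if folded.endswith(long_form):
            folded = folded[: -len(long_form)] + short_form
    return folded

def consolidate(claims: list[Claim], tolerance: float = 0.5) -> list[dict]:
    """Merge claims into edges; keep provenance; flag conflicts and unresolved legs."""
    grouped = defaultdict(list)
    for claim in claims:
        grouped[(normalise(claim.owner), normalise(claim.owned))].append(claim)
    edges = []
    for (owner, owned), group in grouped.items():
        resolved = [c for c in group if c.pct is not None]
        if not resolved:
            pct, status = None, "unresolved"
        elif max(c.pct for c in resolved) - min(c.pct for c in resolved) <= tolerance:
            # Sources agree: take the figure from the most recent filing.
            pct, status = max(resolved, key=lambda c: c.filed).pct, "agreed"
        else:
            # Sources disagree: do not pick a winner, flag for a human.
            pct, status = None, "conflict"
        edges.append({"owner": owner, "owned": owned, "pct": pct, "status": status,
                      "provenance": [(c.source, c.filed) for c in group]})
    return edges
```

The design choice worth noting is that a disagreement between sources yields no percentage at all, only a flag and the full provenance list, so the graph never carries a silently averaged number.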

Figure 1. Agentic KYC architecture - orchestrator fans out across specialised agents, consolidator emits a canonical UBO graph with provenance.

The architecture is deliberately boring in shape. What agentic adds is not the shape; it is the ability to run the shape against dozens of heterogeneous sources without a bespoke integration per source.

The five capabilities, one per card

The work breaks down into five capabilities, each mapping to one or more of the agents above. They answer different KYC questions and they fail in different ways - which matters, because the control function needs to know which capability it is trusting for which claim.

Capability 1

Identity resolution

Collapse name variants into a single golden ID. Deterministic against the national register; LLM only disambiguates on ties.

Capability 2

Adverse media

Swarm sweep of news, court records, leak indexes, press coverage. Structured findings only - no prose.

Capability 3

UBO chain walking

Recursive shareholder descent with cycle handling, nominee detection, and trust/foundation carve-outs.

Capability 4

Structure parsing

Vision extraction from scanned filings against a fixed schema. Explicit abstention beats a confident guess.

Capability 5

Sanctions & PEP

Last-stage list intersection. Fuzzy matches produce annotations, not decisions - humans own the call.

Capability 1 - identity resolution

Before the graph can be built, the entity being onboarded must be a singular, deduplicated, stable object. "Acme Holdings AB" in one system, "Acme Holdings Aktiebolag" in another, and org.nr. 556XXX-XXXX in a third must collapse to one golden ID with a single authoritative spelling and a version history. This is boring, essential, and not a job for a language model - it is a deterministic resolver against the national register, with the LLM only involved when the resolver returns more than one candidate. The rule I hold the pipeline to: if a language model ever invents a golden ID that does not exist in the register, the entire trace is discarded and the leg routed to human review.
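The resolver rule can be sketched as follows, under stated assumptions: the register is a plain dictionary, the IDs (`REG-001` and so on) and company names are invented, and `tie_breaker` stands in for the LLM call. The guard at the end is the point: the model may only choose among candidates the register already returned.

```python
from typing import Callable, Optional

def canonical(name: str) -> str:
    """Illustrative normalisation: case, spacing, and one legal-form variant."""
    folded = " ".join(name.lower().split())
    if folded.endswith(" aktiebolag"):
        folded = folded[: -len(" aktiebolag")] + " ab"
    return folded

def resolve(
    query_name: str,
    register: dict[str, str],
    tie_breaker: Optional[Callable[[str, list[str]], str]] = None,
) -> Optional[str]:
    """Return a golden ID from the register, or None to route the leg to human review."""
    wanted = canonical(query_name)
    candidates = [gid for gid, name in register.items() if canonical(name) == wanted]
    if len(candidates) == 1:
        return candidates[0]  # deterministic path: no model involved
    if len(candidates) > 1 and tie_breaker is not None:
        choice = tie_breaker(query_name, candidates)
        # A model answer outside the candidate set is an invented ID: discard it.
        return choice if choice in candidates else None
    return None

# Hypothetical register snapshot with one deliberate tie.
register = {
    "REG-001": "Acme Holdings AB",
    "REG-002": "Nordic Freight AB",
    "REG-003": "Nordic Freight Aktiebolag",
}
```

Returning `None` here represents both outcomes the text describes: no match, or a model answer that does not exist in the register.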

Capability 2 - adverse media

Once the customer entity is resolved and directors and shareholders are known, Newton's swarm runs an adverse-media sweep against every surfaced natural person. Each sub-agent takes one person and searches across news, court records, leaked-data indexes (ICIJ's Offshore Leaks, OCCRP's Aleph), and sanctions-adjacent press coverage. The model is told to return structured findings - source URL, publication date, jurisdiction, topic category, confidence score - not prose summaries. Prose summaries are how false positives smuggle themselves into a KYC file; a hit either materialises as a structured record tied to a source, or it does not exist as far as the graph is concerned.
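One way to enforce "structured record or nothing" is to parse every sub-agent return through a strict constructor. The sketch below assumes the five fields named above plus the person's name; the category taxonomy is hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional
from urllib.parse import urlparse

CATEGORIES = {"fraud", "corruption", "money_laundering", "sanctions_evasion", "other"}  # hypothetical

@dataclass(frozen=True)
class Finding:
    person: str
    source_url: str
    published: date
    jurisdiction: str
    category: str
    confidence: float

def parse_finding(raw: dict) -> Optional[Finding]:
    """Return a Finding only if every field is present and well-formed; otherwise None."""
    try:
        url = urlparse(raw["source_url"])
        if url.scheme not in ("http", "https") or not url.netloc:
            return None
        confidence = float(raw["confidence"])
        if raw["category"] not in CATEGORIES or not 0.0 <= confidence <= 1.0:
            return None
        return Finding(
            person=raw["person"],
            source_url=raw["source_url"],
            published=date.fromisoformat(raw["published"]),
            jurisdiction=raw["jurisdiction"],
            category=raw["category"],
            confidence=confidence,
        )
    except (KeyError, TypeError, ValueError, AttributeError):
        # Prose summaries and partial records never reach the graph.
        return None
```

A return that contains only a narrative summary fails at the first missing key and is dropped, which is the behaviour the paragraph above asks for.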

Capability 3 - UBO chain walking

The chain walk is the core algorithm. Starting from the customer entity, the orchestrator recursively pulls shareholdings from each corporate layer. A corporate shareholder at 25% or more is descended into; a natural person at 25% or more is recorded and the leg terminates; anything below 25% is logged but not descended unless the aggregate for a linked party pushes it over the threshold. The recursion must handle cycles (entity A owns B owns A - real, and common in family holding structures), nominee holdings, bearer-share jurisdictions where direct-holder data is itself a pointer, and trust or foundation structures that do not map to a shareholder model and require a separate query path.
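A minimal sketch of the descent, assuming holdings are already available as an in-memory mapping from entity to `(holder, percentage)` pairs. It applies the 25% rule per layer and detects cycles; linked-party aggregation, nominee detection, and bearer shares are deliberately left out, and the founder percentages in the usage example are invented.

```python
THRESHOLD = 25.0

def walk_chain(root: str, holdings: dict[str, list[tuple[str, float]]], persons: set[str]) -> dict:
    """Recursive shareholder descent with cycle detection and explicit unresolved legs."""
    result = {"ubos": [], "logged": [], "cycles": [], "unresolved": []}

    def descend(entity: str, path: tuple[str, ...]) -> None:
        if entity not in holdings:
            # No shareholder data: e.g. a foundation that needs a separate query path.
            result["unresolved"].append(entity)
            return
        for holder, pct in holdings[entity]:
            if pct < THRESHOLD:
                result["logged"].append((holder, entity, pct))  # logged, not descended
            elif holder in persons:
                result["ubos"].append({"person": holder, "via": path})  # leg terminates
            elif holder in path:
                result["cycles"].append((holder, entity))  # A owns B owns A: stop here
            else:
                descend(holder, path + (holder,))

    descend(root, (root,))
    return result

# Shape of the composite walkthrough; the 50/50 founder split is a made-up figure.
holdings = {
    "Acme Holdings AB": [("Acme Group AB", 60.0), ("Trident Maritime Holdings Ltd", 40.0)],
    "Acme Group AB": [("Founder A", 50.0), ("Founder B", 50.0)],
    "Trident Maritime Holdings Ltd": [("Trident Holdings Ltd", 100.0)],
    "Trident Holdings Ltd": [("Trident Stiftung", 100.0)],
}
trace = walk_chain("Acme Holdings AB", holdings, persons={"Founder A", "Founder B"})
```

In this run the foundation ends up in `unresolved`, which is the signal to switch from the shareholder query path to the beneficiary query path.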

Capability 4 - corporate structure parsing

This is Leonardo's strongest contribution. When a filing returns as a scanned shareholder register - an image, not data - the vision agent is handed the page, a fixed extraction schema, and a strict instruction that any unverifiable field must return null rather than a guess. The bank's appetite for "the model was pretty sure" hallucinations in a KYC trace is zero. On clean filings the schema validates reliably; on marginal scans with handwritten amendments or faded stamps, the agent abstains and the leg routes to human review. That two-track outcome - confident extract, or explicit abstention - is the behaviour the control function wants.
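The two-track outcome can be made mechanical with a small triage step after extraction. The three-field schema below is a cut-down assumption for illustration, not Leonardo's actual schema.

```python
# Hypothetical extraction schema: field name -> accepted Python type(s).
SCHEMA = {
    "holder_name": str,
    "holding_pct": (int, float),
    "filing_date": str,
}

def triage_extraction(fields: dict) -> dict:
    """Accept only a fully populated, well-typed extraction; otherwise route to a human."""
    problems = []
    for name, expected in SCHEMA.items():
        if name not in fields:
            problems.append(f"missing:{name}")
        elif fields[name] is None:
            problems.append(f"abstained:{name}")  # explicit null, not a guess
        elif not isinstance(fields[name], expected):
            problems.append(f"wrong_type:{name}")
    problems += [f"unexpected:{name}" for name in sorted(set(fields) - set(SCHEMA))]
    pct = fields.get("holding_pct")
    if isinstance(pct, (int, float)) and not 0 <= pct <= 100:
        problems.append("out_of_range:holding_pct")
    return {"route": "accept" if not problems else "human_review", "problems": problems}
```

An abstention is treated as a first-class result with its own label, so the review queue can distinguish "the model declined" from "the model returned something malformed".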

Capability 5 - sanctions and PEP overlay

Every natural person on the surfaced graph is intersected against the EU consolidated list, OFAC SDN, UK OFSI, and UN Security Council lists, plus a licensed PEP feed. Matches are not binary: fuzzy name matches at high confidence produce graph annotations, not automatic rejections. This is deliberate. The sanctions control is the last line before a customer record is written, and it has to route a potential match to a human adjudicator who owns the decision. The LLM's contribution at this step is limited to name-spelling variants in non-Latin scripts and transliteration ambiguity.
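The annotate-do-not-decide behaviour could be sketched with the standard library's `difflib`. This is a toy matcher under stated assumptions: the names are invented, the 0.85 threshold is arbitrary, and production screening would run against licensed lists with a matcher tuned by the risk function.

```python
import unicodedata
from difflib import SequenceMatcher

def fold(name: str) -> str:
    """Strip diacritics, case, and extra whitespace before comparison."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())

def screen(person: str, listed_names: list[str], review_threshold: float = 0.85) -> list[dict]:
    """Return graph annotations for candidate matches; never an accept/reject decision."""
    annotations = []
    for entry in listed_names:
        score = SequenceMatcher(None, fold(person), fold(entry)).ratio()
        if score >= review_threshold:
            annotations.append({
                "person": person,
                "list_entry": entry,
                "score": round(score, 3),
                "match": "exact" if score == 1.0 else "fuzzy",
                "decision": "pending_human_adjudication",
            })
    return annotations

hits = screen("Iwan Petrov", ["Ivan Petrov", "Anna Svensson"])
```

The function has no code path that rejects a customer; every candidate match leaves with the same pending status, which keeps the decision with the adjudicator.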

A composite onboarding walkthrough

The clearest way to show how the components compose is to walk through a specific, composite case. No real customer; the names are invented; the shape is representative of onboarding traces a senior AFC architect would recognise.

The customer is Acme Holdings AB, a mid-sized Swedish group being onboarded as a corporate banking customer. The direct shareholder register at Bolagsverket returns cleanly as JSON: 60% held by a Swedish parent, Acme Group AB; 40% held by a Maltese corporate, Trident Maritime Holdings Ltd. The Swedish parent leg resolves in seconds - the swarm walks up one more step, finds two natural-person founders at majority holding, and terminates that leg.

The Maltese leg is harder. The MBR returns the shareholder list as a scanned PDF of a 2019 filing; Leonardo's vision extract returns a corporate shareholder, Trident Holdings Ltd, Cayman Islands, as the 100% holder. The Cayman leg is harder still: the GCR is responsive but rate-limited, and the filing returned is a multi-page scan with a handwritten amendment that Leonardo flags with low confidence - the agent abstains rather than guess.

A retry queue kicks in. The orchestrator backs off, waits out the rate-limit window, and re-queries for a different filing type at the same source - the annual return, not the incorporation document - which reports the same ownership data in machine-readable form. That succeeds. Trident Holdings Ltd is wholly owned by a Liechtenstein foundation, Trident Stiftung. The foundation is not a shareholder structure; it is a beneficiary structure, which triggers a separate query path. Newton pulls the foundation deed from the Liechtenstein commercial register, Leonardo parses the scan, and the graph surfaces a single named natural person as the primary beneficiary.

The whole trace consolidates: five entities across four jurisdictions, three natural persons at the top of the chain (the two founders on the Swedish leg and the foundation's primary beneficiary), one leg that required a retry-with-alternate-filing, one explicit abstention that required a schema-aware follow-up. The consolidator emits the graph with provenance on every edge - which register, which filing, which date, which agent, what confidence. A compliance officer opens the file and sees the chain, the sources, and two explicit flags: "retry succeeded on alternate filing type" and "manual review recommended on handwritten amendment in the Cayman filing."

What the agents add is not legal judgement. It is the ability to run the same diligence walk simultaneously across every jurisdiction a customer touches, and to surface the specific places where a human needs to look.
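The retry-with-alternate-filing step from the walkthrough can be sketched as a small loop. The exception classes, the filing-type labels, and the backoff constants are assumptions for illustration; `fetch` stands in for whatever actually queries the register and runs extraction.

```python
import time
from typing import Callable, Optional

class RateLimited(Exception):
    """The source asked us to slow down."""

class LowConfidence(Exception):
    """Extraction abstained on this filing."""

def fetch_with_fallback(
    fetch: Callable[[str], dict],
    filing_types: list[str],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[Optional[dict], list[str]]:
    """Try each filing type in order, backing off on rate limits; return (record, flags)."""
    flags: list[str] = []
    for filing_type in filing_types:
        for attempt in range(max_attempts):
            try:
                record = fetch(filing_type)
            except RateLimited:
                sleep(base_delay_s * 2 ** attempt)  # exponential backoff, then retry
                continue
            except LowConfidence:
                flags.append(f"abstained on {filing_type}")
                break  # retrying the same scan will not help; try the next filing type
            if filing_type != filing_types[0]:
                flags.append("retry succeeded on alternate filing type")
            return record, flags
        else:
            flags.append(f"rate-limit budget exhausted on {filing_type}")
    return None, flags
```

The flags list is what ends up in front of the compliance officer, so the file shows not only the answer but the route taken to reach it.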

What translates to a bank

The architectural pattern maps cleanly onto the AMLR regime coming into force and the EU beneficial ownership register interconnection under BORIS. The mapping is one-for-one; the gap between Nexus and a bank-grade pipeline is not conceptual, it is operational.

Nexus capability	Bank translation
Newton swarm - parallel jurisdictional queries	Registry-access broker layer with per-source credentials, rate budgets, and SLA monitoring; queries fan out across Bolagsverket, Companies House, KVK, MBR, GCR, and the BORIS interconnect with consistent telemetry.
Leonardo vision - scanned-filing extraction	Document-AI pipeline with an extraction schema under change control, confidence thresholds tied to a human-review queue, and explicit abstention on marginal scans rather than silent best-guess.
Firecrawl - long-tail registers without APIs	A managed scraping tier with written reader-access agreements per source, backoff-and-rotate on rate limits, and an immutable response cache retained for audit.
Consolidation with provenance graph	An ownership-graph service as the canonical record of UBO determination - per-edge source, date, and confidence - suitable for re-examination by the compliance function years later.
Sanctions & PEP overlay	Last-stage intersection against licensed lists; fuzzy-match thresholds tuned by the risk function; hard routing of candidate matches to a named human adjudicator who owns the decision.

None of this is exotic. Every line is a pattern a Tier-1 AFC function already has ambitions to run, in most cases in disconnected pockets. What the agentic assembly adds is making the fan-out-consolidate shape visible end-to-end - so the control function can see, for every UBO determination, where a human needs to be, at which confidence threshold, on which edge of the graph.

What I would do differently at bank scale

Three changes. First, the entity resolver is a per-query lookup in my lab; at bank scale it would be a service, with a golden-ID store that every downstream pipeline reads from and writes to. KYC is one consumer; CRM, sanctions, client lifecycle management, and the data warehouse are others. A UBO trace that selects a different spelling of an entity name than the customer-master spelling is the bug that silently produces dual customer records. The failure is rare. It is also catastrophic.

Second, registry access must be brokered, not ad-hoc. In my lab each agent calls each register directly. At bank scale there is one broker layer between agents and registers, with per-source credentials, explicit rate budgets, response caching, and - crucially - change-data-capture. When a shareholder register updates, downstream UBO graphs need to know; a pull-based model that only sees state when asked is insufficient for ongoing CDD refresh. The compliance function wants subscription, not polling.

Third, the UBO determination does not write to the customer record without a human gate. The pipeline produces a proposal - graph, sources, confidence, flags - and a human compliance officer approves the write. That seems obvious; it is also the step every "AI-native KYC" pitch currently wants to automate away on the grounds that human review is a cost line. In every serious bank implementation I would retain the gate, and design the UI around making the gate fast rather than around removing it. Automating the ninety-five percent of easy cases, and making the five percent of hard cases faster to adjudicate, beats automating everything and absorbing model-error risk onto the balance sheet.
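The human gate can be enforced in the write path itself rather than in process documentation. The sketch below is a hypothetical shape, not a real customer-master interface: `Proposal`, `CustomerRecordStore`, and the officer ID are invented names.

```python
from dataclasses import dataclass, field, replace
from typing import Optional

@dataclass(frozen=True)
class Proposal:
    """What the pipeline emits: a proposed determination, never a committed one."""
    customer_id: str
    ubos: tuple[str, ...]
    flags: tuple[str, ...] = ()
    approved_by: Optional[str] = None

def approve(proposal: Proposal, officer_id: str) -> Proposal:
    """Record the named human who owns the decision."""
    return replace(proposal, approved_by=officer_id)

@dataclass
class CustomerRecordStore:
    records: dict = field(default_factory=dict)

    def write(self, proposal: Proposal) -> None:
        if proposal.approved_by is None:
            # The gate: an unapproved proposal cannot reach the customer record.
            raise PermissionError("UBO proposal requires human approval before write")
        self.records[proposal.customer_id] = {
            "ubos": proposal.ubos,
            "flags": proposal.flags,
            "approved_by": proposal.approved_by,
        }
```

Making the proposal immutable and the approval a separate step means speeding up review is a UI problem, while the gate itself stays outside the pipeline's control.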

Those three changes are what separates a well-architected lab pipeline from a pipeline a regulated AFC function can put its signature on. Everything else - the fan-out, the vision extraction, the sanctions overlay, the provenance graph - generalises directly.

Next case study

Agentic AI in retail and consumer lending

Origination through servicing - PSD2/3 data aggregation, document ingestion at application time, affordability beyond traditional credit scoring, real-time decisioning, back-book repricing. A shift in scope from AFC to lending, with the same component-level proof-of-build anchoring.
