The Adversarial Stack: When Cross-Model Routing Becomes a Spec, Not a Hedge

Six papers across 2024-2026 converge on the same finding: same-family verification breaks down at scale. The reframe is technical. What it implies for foundation-model margins is not.


The Convergent Finding

One of the more interesting patterns in the multi-agent LLM literature of the past two years is not a single paper but a convergence. Between February 2024 and May 2026, six research groups working independently across academia, industry-adjacent labs and corporate research arrived at a version of the same claim: long-horizon verification by a model from the same family as the executor is unreliable, and the reliability gap widens as model capability grows.

A useful entry point is recent. Jack Lu and colleagues at the NYU Agentic Learning AI Lab, in When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers (December 2025), evaluated 37 models across 9 benchmarks covering logical reasoning, mathematics, structured puzzles and factual recall.[1] Their headline finding is that "verifier gain is often lower for self-verification and intra-family verification than for cross-family verification, particularly as model size increases or post-training is applied", and that this decrease in verifier gain "correlates with greater similarity between the solver's and verifier's solution distributions." Cross-family verification's advantage, in other words, grows with scale rather than diminishing with it.
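To make "verifier gain" concrete, here is a minimal sketch in Python. It assumes a simplified definition, namely the accuracy of the solutions a verifier accepts minus the solver's unfiltered accuracy, which may differ from the paper's exact formulation. The function name and the toy data are illustrative and do not come from Lu et al.

```python
from typing import Sequence


def verifier_gain(correct: Sequence[bool], accepted: Sequence[bool]) -> float:
    """Accuracy among verifier-accepted solutions minus unfiltered solver accuracy.

    Assumed, simplified definition for illustration only.
    """
    if len(correct) != len(accepted):
        raise ValueError("correct and accepted must have the same length")
    if not correct:
        raise ValueError("need at least one solution")
    base_accuracy = sum(correct) / len(correct)
    kept = [c for c, a in zip(correct, accepted) if a]
    if not kept:
        return 0.0  # verifier rejected everything, so filtering yields no gain
    return sum(kept) / len(kept) - base_accuracy


# Toy data (invented): 4 of 8 solver outputs are correct.
correct = [True, True, True, True, False, False, False, False]
# A verifier that resembles the solver waves most of its mistakes through.
same_family = [True, True, True, True, True, True, True, False]
# A more independent verifier rejects most of them, at the price of one false rejection.
cross_family = [True, True, True, False, True, False, False, False]

print(round(verifier_gain(correct, same_family), 3))
print(round(verifier_gain(correct, cross_family), 3))
```

In this toy setup the more independent verifier yields the larger gain even though it wrongly rejects one correct solution, because it stops sharing the solver's blind spots.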

That finding sits alongside a set of converging contributions from other groups. Akbir Khan, John Hughes and seven co-authors at UCL DARK, in Debating with More Persuasive LLMs Leads to More Truthful Answers (ICML 2024), showed that adversarial debate raises non-expert truth-recovery from a 48% baseline to 76%, with the effect strongest when the debaters argue from different positions.[2] Mert Cemri, Melissa Pan, Shuyi Yang and collaborators, in Why Do Multi-Agent LLM Systems Fail? (NeurIPS 2025 Spotlight), built a 1,600-trace failure taxonomy across seven multi-agent frameworks, and classified 13.5% of all observed failures as task-verification failures specifically.[3] The Adaptive Heterogeneous Multi-Agent Debate paper (Journal of King Saud University, 2025) reported absolute accuracy gains of four to six percentage points from agent heterogeneity, with a 3.5-point drop when specialised agents were replaced with identical copies.[4] Correlated Errors in Large Language Models (Kim, Garg, Peng and Garg, ICML 2025) provided the mechanism, namely that more capable models make more similar mistakes.[5]

Against this background, the ARIS paper that Ruofeng Yang, Yongcan Li and Shuai Li posted to arXiv on 4 May 2026 reads not as a discovery but as an unusually clear architectural articulation of an empirical pattern that had already converged in the literature.[6] Their executor-reviewer split, with Claude Code, GPT-5.4 and Gemini drawn from three different families by design, is one implementation. The convergence behind it is the news.

CHART 01 · The convergent finding: six papers, six groups, one direction (2024-2026)

Feb 2024 · ICML 2024
Debating with More Persuasive LLMs Leads to More Truthful Answers
Khan, Hughes et al. (UCL DARK and collaborators)
Adversarial debate raises non-expert truth-recovery from 48% baseline to 76%; effect strongest with opposed positions.

Mar 2025 · NeurIPS 2025
Why Do Multi-Agent LLM Systems Fail? (MAST taxonomy)
Cemri, Pan, Yang et al.
1,600-trace failure taxonomy across 7 frameworks; 13.5% of failures classified as task-verification failures.

Jun 2025 · ICML 2025
Correlated Errors in Large Language Models
Kim, Garg, Peng, Garg
350+ models evaluated; capable models make more similar mistakes, even with distinct architectures and providers.

2025 · Springer J. KSU
Adaptive Heterogeneous Multi-Agent Debate (A-HMAD)
Journal of King Saud University
4-6 point accuracy gains from heterogeneity; 3.5-point drop when specialised agents are replaced with identical copies.

Dec 2025 · arXiv preprint · EMPIRICAL PILLAR
When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers
Jack Lu et al. (NYU Agentic Learning AI Lab)
37 models, 9 benchmarks: cross-family verification beats self- and intra-family pairings, and the advantage grows with model scale.

May 2026 · arXiv preprint
ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
Yang, Li, Li (Shanghai Jiao Tong University)
Architectural articulation: executor and reviewer drawn from different model families by design.

THE CONVERGENCE
Cross-family verification is necessary for long-horizon correctness, and the advantage grows with model scale.
SOURCE · Khan et al., arXiv:2402.06782 (ICML 2024). Cemri et al., arXiv:2503.13657 (NeurIPS 2025 Spotlight). Kim et al., arXiv:2506.07962 (ICML 2025). A-HMAD, Springer JKSU CIS 2025. Lu et al., arXiv:2512.02304 (NYU, Dec 2025). Yang et al., arXiv:2605.03042 (4 May 2026).

What "Adversarial" Technically Means

The failure mode the literature targets is not classical hallucination. ARIS calls it "plausible unsupported success", namely outputs that read as well-formed and even correct but whose evidential support is incomplete or silently inherited from the executor's earlier framing. Two LLMs from the same training lineage share inductive biases. When one writes and another from the same family reviews, the reviewer arrives carrying the writer's blind spots. The Lu et al. mechanism makes this concrete: verifier effectiveness is a function of the distributional distance between solver and verifier, and within-family pairs cluster close in that distribution.

The picture has one important nuance. The Kim, Garg, Peng and Garg result documents substantial error correlation across model boundaries, not only within them. On one leaderboard dataset their study covers, two models that both answer a question incorrectly give the same wrong answer 60% of the time; on resume-screening tasks, agreement is similar. Cross-family pairing reduces the correlated-error problem; it does not eliminate it. For long-horizon agentic workloads, where the executor's framing propagates downstream unchecked, the reduction is large enough to matter, and the Lu et al. finding that the cross-family advantage grows with scale suggests it will matter more, not less, as models improve.
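The agreement statistic in that paragraph can be computed directly from paired answers. The sketch below is one plausible reading of it: the share of items both models get wrong on which they give the same wrong answer. The function name and the six-item toy data are invented for illustration. As a reference point, two models erring independently and uniformly on a four-option question would match on about one wrong answer in three.

```python
from typing import Hashable, Sequence


def agreement_when_both_err(
    answers_a: Sequence[Hashable],
    answers_b: Sequence[Hashable],
    truth: Sequence[Hashable],
) -> float:
    """Share of items both models get wrong on which they give the same wrong answer."""
    if not (len(answers_a) == len(answers_b) == len(truth)):
        raise ValueError("all inputs must have the same length")
    both_wrong = [
        (a, b) for a, b, t in zip(answers_a, answers_b, truth) if a != t and b != t
    ]
    if not both_wrong:
        raise ValueError("no items where both models err")
    return sum(a == b for a, b in both_wrong) / len(both_wrong)


# Toy four-option quiz (invented data).
truth = ["A", "B", "C", "D", "A", "B"]
model_x = ["A", "C", "D", "A", "B", "B"]
model_y = ["A", "C", "D", "B", "C", "A"]

print(agreement_when_both_err(model_x, model_y, truth))
```

A reviewer can only catch an executor's error if it does not make the same one, so the higher this number, the less a second model adds as a check.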

ARIS frames the architectural choice in bandit language. Single-family self-review behaves like a stochastic bandit, in which the reviewer's biases are predictable to the executor by construction. Cross-family review behaves like an adversarial bandit, harder to game but not ungameable. The three architectural layers of ARIS (execution, orchestration and assurance) make the design visible, and the assurance layer, namely a three-stage process running integrity verification, result-to-claim mapping and claim auditing, is the layer that depends on the cross-family pairing to function as intended.
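The result-to-claim mapping and claim-auditing stages can be pictured as a bookkeeping check. The sketch below is a hypothetical illustration of that idea, not the ARIS implementation: the Claim type, the result IDs and the rule itself (flag any claim that cites no logged result) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    result_ids: tuple[str, ...]  # IDs of results the executor cites as support


def unsupported_claims(claims: list[Claim], logged_results: set[str]) -> list[Claim]:
    """Return claims that cite no result, or cite a result that was never logged."""
    flagged = []
    for claim in claims:
        missing = [r for r in claim.result_ids if r not in logged_results]
        if not claim.result_ids or missing:
            flagged.append(claim)
    return flagged


# Hypothetical run log and executor claims.
logged = {"run-01", "run-02"}
claims = [
    Claim("Method beats baseline on task X", ("run-01",)),
    Claim("Gains hold across seeds", ("run-03",)),  # cites a run that was never logged
    Claim("Approach generalises broadly", ()),  # cites nothing
]

for claim in unsupported_claims(claims, logged):
    print("UNSUPPORTED:", claim.text)
```

A mechanical check like this only catches missing links. Judging whether a cited result actually supports the claim is the part that needs an independent reviewer.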

The Fourth Stage of Multi-Vendor Logic

Enterprise multi-LLM adoption is not new. Industry surveys have, for three years, framed it as a procurement decision running through one of three rationales. The first is risk hedge, the avoidance of single-supplier dependency. The second is cost arbitrage, the routing of cheap tasks to cheap models. The third is capability fit, the choice of the best model per task. Menlo Ventures' enterprise-spend data for late 2025, as reported by secondary trackers, puts Anthropic at roughly 40% of enterprise LLM API spend and OpenAI at roughly 27%, with the rest distributed across the next five providers.[7] Multi-vendor is the operating reality, not a future state.

What the convergent literature implies is a fourth rationale: cross-family pairing as a correctness specification. The first three are procurement decisions. They are negotiable, and they reverse when the price of the dominant vendor drops. The fourth has a different character, because the failure mode it addresses is not commercial but architectural; a price cut does not unwind it. The fourth rationale is also workload-specific. It applies to long-horizon agentic tasks and autonomous-research workloads, namely the workloads where the downstream consumer of the output will not re-verify it. For single-turn classification and most enterprise back-office workflows, the within-family critic story remains adequate. The slice the fourth rationale describes is narrower than total LLM inference. It is also the slice where margin sits.
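Treating cross-family pairing as a specification rather than a preference means it can be checked mechanically before a pipeline runs. The following is a minimal sketch under stated assumptions: the model names, family labels and workload classes are placeholders invented for the example, and a real deployment would need its own model-to-family table.

```python
from dataclasses import dataclass

# Hypothetical model-to-family table; the names are placeholders, not real model IDs.
MODEL_FAMILY = {
    "executor-a1": "family-a",
    "reviewer-a2": "family-a",
    "reviewer-b1": "family-b",
}

# Workload classes assumed to need an out-of-family reviewer.
CROSS_FAMILY_REQUIRED = {"long_horizon_agent", "autonomous_research"}


@dataclass(frozen=True)
class PipelineSpec:
    workload: str
    executor: str
    reviewer: str


def spec_violations(spec: PipelineSpec) -> list[str]:
    """Return human-readable violations; an empty list means the spec passes."""
    problems = []
    for role, model in (("executor", spec.executor), ("reviewer", spec.reviewer)):
        if model not in MODEL_FAMILY:
            problems.append(f"{role} model {model!r} has no known family")
    if problems:
        return problems
    same_family = MODEL_FAMILY[spec.executor] == MODEL_FAMILY[spec.reviewer]
    if spec.workload in CROSS_FAMILY_REQUIRED and same_family:
        problems.append(
            f"workload {spec.workload!r} requires a reviewer outside "
            f"{MODEL_FAMILY[spec.executor]!r}"
        )
    return problems


ok = PipelineSpec("long_horizon_agent", "executor-a1", "reviewer-b1")
bad = PipelineSpec("long_horizon_agent", "executor-a1", "reviewer-a2")
back_office = PipelineSpec("single_turn_classification", "executor-a1", "reviewer-a2")

for spec in (ok, bad, back_office):
    print(spec.workload, spec.reviewer, spec_violations(spec))
```

The back-office pipeline passes with a same-family reviewer, which mirrors the workload-specific scope of the fourth rationale: the constraint binds only where nobody downstream re-verifies the output.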

What This Does to the Foundation-Model Business Model

The foundation-model business model rests on a specific claim, namely that best-in-class capability combined with best-in-class ecosystem produces lock-in within the stack. For the workloads expected to justify the premium-pricing tier (long-horizon agents, autonomous research, regulated outputs) the technical literature now argues against single-vendor stacks. That argument removes the premium tier's defensibility at the model layer, and the Lu et al. finding that cross-family advantage grows with scale means the problem deepens rather than resolves as the labs improve their models.

The margin context is already compressing. Anthropic, Google and OpenAI have all cut headline API rates repeatedly across 2025 and 2026, with documented reductions in the 50% to 80% range on selected tiers, according to compilation trackers rather than primary disclosures.[8] Glen Rhodes' capability-compression note captures the dynamic in one phrase: the performance gap between tiers collapses faster than the price gap.[9] Horace Dediu, writing on Asymco on 19 February 2026, made the same point from a different vantage: foundation models commoditise; value accrues to applications and devices.[10]

CHART 02 · Value migration: model layer down, orchestration layer up (2025-2026)
A · API price cuts, 2025-2026 (headline reductions on selected tiers): Anthropic -67%; Google -75% (range 70-80%); OpenAI -50% (successive cuts).
B · LLM observability market (USD bn): $1.97B in 2025, $2.69B in 2026; 36.3% CAGR.
Martian: $9M seed (Nov 2023) → reportedly nearing $1.3B (Apr 2026).
SOURCE · Anthropic, Google and OpenAI price-cut data via CloudZero and BenchLM compilations (2025-2026). Observability market: Research and Markets, LLM Observability Platform Market Report 2026. Martian arc: TechCrunch (Nov 2023); valuation per single-source Medium reporting, treated as reported.

Three consequences follow. First, enterprise procurement for high-rigor workloads will specify multi-vendor configurations by design, not as a backup. Second, the value layer migrates one level up the stack, to orchestration, routing, evaluation and assurance. Third, the agentic tier cannot retain the same margin structure as single-shot inference, because the workload-class definition implies the need for an out-of-family verifier the executor itself cannot supply.

The funding picture is consistent with this migration without yet confirming its magnitude. Martian, the LLM router that raised a $9M seed in November 2023, is reportedly nearing a $1.3B valuation as of April 2026, though that figure rests on a single Medium piece and should be read as reported rather than confirmed.[11] The adjacent LLM-observability category, where governance and audit tooling overlap, is growing in the mid-thirty-percent range annually according to industry estimates.[12] The signal direction is consistent across the funding and market data. The magnitude is not yet measurable with high confidence.

How the Labs Will Respond

Two paths are visible in the public record, and they point in different directions.

The first is the within-family critic path, and the major labs are already shipping it. On 22 January 2026, Anthropic published a revised version of its Constitutional AI framework. The underlying mechanism continues the company's RLAIF approach, in which a model from the same family generates the critique signal that drives the harmlessness training.[13] On the agentic side, OpenAI shipped Auto-review in Codex in April 2026, a deployment mode in which a separate agent approves or denies boundary-crossing actions; OpenAI reports a 200-fold reduction in human-approval pauses, with the caveat that this is the company's own metric on its own evaluation set.[14] Google DeepMind's Gemini Deep Think and the AI co-scientist system rely on self-play scientific debate and ranking tournaments, with every step running inside the Gemini family.[15] The within-family critic path is the labs' commercial preference because every step in the chain remains inside their own pricing envelope.

CHART 03 · Two paths to adversarial review (2026 shipping examples)
WITHIN-FAMILY CRITIC
Executor and reviewer from the same model family. Commercially preferred by the labs.
Anthropic
Revised Constitutional AI (22 Jan 2026). AI self-critique replaces human feedback during training.
OpenAI
Codex Auto-review (April 2026). “A separate agent” approves boundary-crossing actions; ~200x fewer human pauses.
Google DeepMind
Gemini Deep Think + AI co-scientist. Self-play scientific debate, ranking tournaments, all within Gemini.
STRUCTURAL RISK
Correlated errors tighten as capability scales.
CROSS-VENDOR ORCHESTRATION
Executor and reviewer from different families. What the correlated-error literature implies.
Hyperscaler platforms
MS Agent Framework 1.0 (Apr 2026): connectors for OpenAI, Anthropic, Gemini, Bedrock, Foundry, Ollama. AWS, Vertex AI.
Pure-play routers
Martian (reportedly nearing $1.3B, Apr 2026), NotDiamond, Portkey, OpenRouter, Unify. Routing as a service.
Assurance incumbents
Big Four (Deloitte 2026 audit hot topics), IAPP-aligned vendors, AI Assurance Levels framework.
STRUCTURAL ADVANTAGE
No shared inductive bias between executor and reviewer.
SOURCE · Anthropic Claude's Constitution (22 Jan 2026); OpenAI Auto-review (Apr 2026); Google Research AI co-scientist. Microsoft Agent Framework 1.0 (Visual Studio Magazine, 6 Apr 2026). Deloitte Internal Audit Hot Topics 2026.

The second is the cross-vendor orchestration path, and it is being built by the hyperscalers in their cloud-platform role and by a layer of independent pure-plays. Microsoft's Agent Framework 1.0, released on 6 April 2026, ships first-party connectors for OpenAI, Anthropic, Google, AWS Bedrock, Microsoft Foundry and Ollama, with multi-agent orchestration over the A2A and MCP protocols.[16] AWS Bedrock and Google Vertex AI host equivalent multi-provider catalogues. Outside the hyperscalers, a set of independent routers offer cross-vendor routing as a service, with Martian as the most-funded example.

The tension between the two paths is the load-bearing observation of this piece. The within-family critic is the response the labs commercially want to work. Cross-family orchestration is the direction the convergent literature suggests is necessary for exactly the workloads that justify premium pricing. They cannot both be right at the same scale. Which one captures the agentic tier will be decided by the workload mix that enterprise buyers actually pay for, not by which architecture wins the next benchmark cycle.

The Strongest Objections

The real objections to the framing above are practical, not theoretical.

The first is latency and cost. Cross-family routing means every step in a long-horizon workflow hits two API providers in series, with two billing meters, two rate limits and two failure surfaces. The latency premium is meaningful for agentic workloads that already run for minutes to hours. The cost premium is meaningful because the reviewer model is, by design, comparable in capability to the executor. The within-family alternative looks worse on correctness in the literature but better on every operational metric a buyer measures.
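The operational overhead compounds over a long workflow, and a back-of-envelope model shows how. The sketch below assumes that calls within a step run in series, that failures are independent and that nothing is retried. Every number in it is invented for illustration rather than measured from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    latency_s: float  # mean seconds per call
    cost_usd: float  # mean dollars per call
    failure_rate: float  # probability that a single call fails


def workflow_totals(steps: int, calls_per_step: list[Call]) -> tuple[float, float, float]:
    """Total latency, total cost and probability of at least one failed call.

    Assumes serial calls, independent failures and no retries.
    """
    latency = steps * sum(c.latency_s for c in calls_per_step)
    cost = steps * sum(c.cost_usd for c in calls_per_step)
    p_step_ok = 1.0
    for c in calls_per_step:
        p_step_ok *= 1.0 - c.failure_rate
    p_any_failure = 1.0 - p_step_ok**steps
    return latency, cost, p_any_failure


# Invented figures: a 50-step workflow with a reviewer of comparable capability.
executor = Call(latency_s=20.0, cost_usd=0.05, failure_rate=0.01)
reviewer = Call(latency_s=15.0, cost_usd=0.04, failure_rate=0.01)

unreviewed = workflow_totals(50, [executor])
reviewed = workflow_totals(50, [executor, reviewer])
print("unreviewed:", unreviewed)
print("reviewed:  ", reviewed)
```

Under these assumptions the serial reviewer raises total latency and cost by roughly three quarters and makes at least one failed call markedly more likely over the run, which is the shape of the objection even if the real magnitudes differ.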

The second is tool integration. The within-family critic systems cited above have the advantage that the critic and the executor share the same tool surface, the same context format and the same failure conventions. Cross-family orchestration requires bridging at every one of those layers, and the bridging adds its own error surface.

The third is the empirical residual. The Kim, Garg, Peng and Garg result shows that cross-architecture and cross-provider models still produce substantially correlated errors. Cross-family pairing reduces the problem; it does not eliminate it. For the highest-rigor workloads, neither within-family critics nor cross-family orchestration is sufficient on its own. That is why the assurance layer, namely the audit-style review that runs after both the executor and the reviewer have agreed, matters as much as the orchestration layer in the ARIS architecture itself. The long-run value migration probably runs through assurance, not orchestration alone.

These objections narrow the thesis. They do not remove it. Within-family is clearly worse for the workloads in question; cross-family is clearly necessary; neither is sufficient at the top end of rigor.


Personal View

The body above is the reporting. Here is my view of what to make of it.

What interests me about this story is not the question of which player wins the agentic tier. The case for cross-family verification is converging across the literature faster than the business model around it can adjust, and the labs will publish increasingly capable within-family multi-agent frameworks as their commercial response. Those frameworks will work well enough for short tasks, classification and the bulk of enterprise back-office workloads. For the slice where the empirical advantage of cross-family verification is largest, namely the long-horizon agentic workloads that justify premium pricing, none of the three candidates for the verification layer (pure-play routers, hyperscalers, audit incumbents) is the obvious answer.

The question that I keep returning to is not who wins, but who has the right shape to win. The pure-play routers have the technical credibility but not the trust profile or the procurement reach that a Fortune-500 buyer requires before signing. The hyperscalers have the distribution but cannot be adversarial reviewers of the executors they also resell. The audit incumbents have the trust and the procurement relationships but not the technical execution to build the routing layer themselves. Each is a piece. None is the whole shape.

I do not have a confident answer to which combination of those pieces materialises, and I am not sure anyone does in 2026. What I know is that the question is open in a way that most questions in AI infrastructure are not, because the technical, economic and institutional vectors are pointing in three different directions at once. That is what makes it interesting to watch.

Footnotes

  1. Jack Lu et al., When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers, arXiv:2512.02304, 2 December 2025. NYU Agentic Learning AI Lab. Cited findings on cross-family vs intra-family verifier gain are in the abstract and Section 4.

  2. Akbir Khan, John Hughes et al., Debating with More Persuasive LLMs Leads to More Truthful Answers, ICML 2024 / arXiv:2402.06782. UCL DARK and collaborators.

  3. Mert Cemri, Melissa Z. Pan, Shuyi Yang et al., Why Do Multi-Agent LLM Systems Fail?, NeurIPS 2025 Spotlight / arXiv:2503.13657. MAST taxonomy and 1,600-trace dataset.

  4. Adaptive heterogeneous multi-agent debate for enhanced educational and factual reasoning in large language models, Journal of King Saud University Computer and Information Sciences, Springer, 2025.

  5. Elliot Kim, Avi Garg, Kenny Peng, Nikhil Garg, Correlated Errors in Large Language Models, ICML 2025 / arXiv:2506.07962.

  6. Ruofeng Yang, Yongcan Li, Shuai Li, ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration, arXiv:2605.03042v1, 4 May 2026. Shanghai Jiao Tong University and Shanghai Innovation Institute.

  7. Menlo Ventures enterprise LLM API share, late 2025, as reported via index.dev. Secondary coverage; primary Menlo publication referenced therein.

  8. LLM API price-cut compilation 2025 to 2026: CloudZero. Aggregated from public price changes rather than primary disclosures.

  9. Glen Rhodes, Claude Sonnet 4.6 capability compression, AI pricing commoditization.

  10. Horace Dediu, While Big Tech Burns Cash on AI, Apple Waits, Asymco, 19 February 2026.

  11. Martian funding: $9M seed via TechCrunch, 15 November 2023. Valuation per Medium, April 2026, single-source; treat as reported.

  12. Research and Markets, LLM Observability Platform Market Report 2026. Industry-report sizing; figures are estimates, not audited.

  13. Anthropic, Claude's Constitution (revised), 22 January 2026. Original framework: Constitutional AI: Harmlessness from AI Feedback, arXiv:2212.08073.

  14. OpenAI, Auto-review of agent actions without synchronous human oversight, April 2026. The 200-fold figure is OpenAI's own metric on its internal evaluation set.

  15. Google Research, Accelerating scientific breakthroughs with an AI co-scientist.

  16. Microsoft, Agent Framework 1.0 release, 6 April 2026, coverage: Visual Studio Magazine.
