Why One Model Isn't Enough
Part 1 of a 6-part series on agentic AI, multi-agent architectures, and the theory of LLM collaboration.
A single large language model can write code, summarize papers, answer medical questions, and generate poetry. These are impressive feats. They are also misleading, because they suggest that scaling a single model is all you need. It is not. A single model, no matter how large, has systematic failure modes that no amount of additional training data or parameters will fix. This post is about why, and about the ideas — old and new — that point toward a different architecture: systems of collaborating agents.
The argument proceeds at three levels. The first gives the intuition using everyday examples. The second formalizes the argument using ensemble theory and the bias-variance-diversity decomposition. The third goes to the foundations: Condorcet's Jury Theorem, the conditions under which aggregation provably helps, and the limits imposed by correlation.
The Case for Multiple Minds
The Panel of Experts
Imagine you are a patient with a complex medical condition. You visit a single specialist. She is brilliant — trained at the best institution, highly experienced, deeply knowledgeable. She examines you and gives a diagnosis. Do you act on it immediately?
Most people would not. Most people would seek a second opinion. Not because they doubt the first doctor's competence, but because any single expert, no matter how skilled, has blind spots. She trained in a particular school of thought. She has seen a particular distribution of cases. She has cognitive biases — anchoring, availability, confirmation bias — that are invisible to her but systematic in their effects. A second opinion from a doctor with different training, different experience, and different biases can catch errors that the first doctor cannot.
Large language models are the same. A single LLM is trained on a massive but specific corpus of text. It has absorbed particular patterns, particular biases, particular blind spots. When it hallucinates — confidently generating factually incorrect information — it does so systematically, not randomly. The errors are not noise; they are structural artifacts of the training process, the data distribution, and the architecture. A different model, trained on different data with a different architecture, will have different structural blind spots. The errors are correlated within a single model but partially independent across different models.
This is the fundamental insight: diversity of error is the raw material of collective intelligence.
A single model's errors go undetected. Multiple models with diverse errors can cross-check and correct each other.
The Wisdom of Crowds
The idea that groups outperform individuals is not new. In 1906, Francis Galton attended a livestock fair where 787 people guessed the weight of an ox; he published the analysis in Nature the following year. The individual guesses varied wildly — some were absurdly high, others absurdly low. But the median of all guesses was 1,207 pounds. The actual weight was 1,198 pounds. The crowd was off by less than 1%.
Galton's observation was not a fluke. The Marquis de Condorcet proved a version of this mathematically in 1785: if each member of a jury is more likely than not to reach the correct verdict, and their votes are independent, then the probability that the majority is correct approaches 1 as the jury grows. This is Condorcet's Jury Theorem, and it is one of the oldest and most important results in collective decision theory.
But the theorem has a crucial condition: independence. If every juror copies the same textbook, their errors are perfectly correlated, and adding more jurors does not help at all. The benefit of aggregation comes entirely from the diversity of errors. This is why a panel of specialists from different medical traditions is more informative than three copies of the same specialist.
The orchestra. A single virtuoso violinist can play beautifully, but there are pieces that require an orchestra: a violin alone cannot produce the harmonic richness of a full symphony. The orchestra does not work because each musician is better than the soloist — in fact, individually, they may be worse. It works because their contributions are complementary. The violin carries the melody, the cello provides the bass, the oboe adds a timbre that no string instrument can produce. The conductor does not play a note; they coordinate the timing and dynamics of the ensemble. A multi-agent AI system is the same: individual models contribute diverse capabilities, and an aggregator coordinates their outputs into something richer than any single model could produce alone.
Why LLMs Fail Systematically
The failures of a single LLM are not random. They cluster around specific structural weaknesses:
- Hallucination. LLMs generate plausible-sounding but factually incorrect statements. This happens because the training objective is next-token prediction, which rewards fluency and pattern-matching, not factual accuracy. A model that has seen "the capital of Australia is Sydney" in enough training contexts will reproduce it confidently, even though the actual capital is Canberra. The hallucination is a deterministic function of the training data and the model's weights — not noise.
- Bounded context. An LLM can only attend to a finite window of tokens. When the relevant information for answering a question is spread across a 100-page document, the model may miss critical passages. This is not a failure of intelligence; it is a failure of architecture.
- No grounding. LLMs have no mechanism to verify their outputs against external reality. They cannot query a database, check a fact, or run an experiment. They can only pattern-match against their training data. Any fact not in the training data, or any fact that has changed since training, is invisible to the model.
- Reasoning brittleness. While chain-of-thought prompting helps, LLMs still struggle with multi-step logical reasoning, especially when the reasoning chain is long or requires backtracking. A single error early in the chain propagates and corrupts the final answer.
Each of these failure modes suggests a different remedy: retrieval systems for grounding, tool use for verification, specialized models for different subtasks, and redundancy for error correction. But no single model can do all of these things well simultaneously. The solution is not a bigger model; it is a system of models that collaborate.
A single LLM's errors are systematic, not random. Different models make different systematic errors. When you aggregate diverse models, their errors partially cancel, producing outputs that are more accurate and more reliable than any individual model. This is the founding principle of multi-agent AI: diversity of error is the resource; aggregation is the mechanism.
Ensemble Theory and the Diversity Decomposition
Setup
Consider $K$ models $f_1, \ldots, f_K$ producing outputs for input $x$. For concreteness, think of each $f_k(x)$ as a real-valued prediction (this applies to classification via voting margins, and to language generation via log-probabilities or scalar scores). The ensemble prediction is the average:
$$\bar{f}(x) = \frac{1}{K}\sum_{k=1}^{K} f_k(x)$$

Let $y$ be the true target. The squared error of the ensemble is:

$$\mathcal{L}_{\text{ens}} = (y - \bar{f}(x))^2$$

The average squared error of the individual models is:

$$\bar{\mathcal{L}} = \frac{1}{K}\sum_{k=1}^{K} (y - f_k(x))^2$$

The Ambiguity Decomposition
Krogh and Vedelsby (1995) proved a clean identity relating the ensemble error to the individual errors. Define the ambiguity (or diversity) of the ensemble as:
$$\bar{A} = \frac{1}{K}\sum_{k=1}^{K} (f_k(x) - \bar{f}(x))^2$$

This measures how much the individual predictions disagree with each other. The identity is:

$$\boxed{\mathcal{L}_{\text{ens}} = \bar{\mathcal{L}} - \bar{A}}$$

The ensemble error equals the average individual error minus the ambiguity. This is not an inequality — it is an exact identity. It says two things simultaneously:
- The ensemble is always at least as good as the average individual (since $\bar{A} \geq 0$).
- The improvement is exactly equal to the diversity of the ensemble: the more the individual models disagree, the better the ensemble.
The ambiguity decomposition: ensemble error = average individual error minus diversity. More disagreement means more improvement.
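The identity is easy to check numerically. Below is a minimal sketch (Python with NumPy; the target and the seven predictions are synthetic, chosen only for illustration) confirming that the ensemble loss equals the average individual loss minus the ambiguity, to floating-point precision.

```python
# Numerical check of the Krogh-Vedelsby ambiguity identity on synthetic predictions.
import numpy as np

rng = np.random.default_rng(0)
y = 1.0                                            # true target (illustrative)
preds = rng.normal(loc=1.2, scale=0.5, size=7)     # K = 7 individual predictions

f_bar = preds.mean()                               # ensemble prediction
loss_ens = (y - f_bar) ** 2                        # ensemble squared error
loss_avg = np.mean((y - preds) ** 2)               # average individual squared error
ambiguity = np.mean((preds - f_bar) ** 2)          # diversity (disagreement) term

print(f"ensemble loss        : {loss_ens:.6f}")
print(f"avg individual loss  : {loss_avg:.6f}")
print(f"ambiguity            : {ambiguity:.6f}")
assert np.isclose(loss_ens, loss_avg - ambiguity)  # the identity holds exactly
```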
The Bias-Variance-Covariance Decomposition
Taking expectations over the randomness in model training (e.g., different random seeds, different data subsets, or different model architectures), the expected ensemble error decomposes further. Let $\bar{f}(x) = \frac{1}{K}\sum_k f_k(x)$ be the ensemble prediction. The expected squared error of the ensemble is:
$$\mathbb{E}[(\bar{f}(x) - y)^2] = \text{Bias}^2 + \frac{1}{K}\text{Var} + \frac{K-1}{K}\text{Covar}$$

where:
- $\text{Bias}^2 = (\mathbb{E}[\bar{f}(x)] - y)^2$ — the systematic error, unchanged by averaging.
- $\text{Var} = \frac{1}{K}\sum_k \mathbb{E}[(f_k(x) - \mathbb{E}[f_k(x)])^2]$ — the average variance of individual models.
- $\text{Covar} = \frac{1}{K(K-1)}\sum_{j \neq k} \text{Cov}(f_j(x), f_k(x))$ — the average pairwise covariance.
This reveals the precise mechanism:
- Bias is untouched. If all models are wrong in the same direction, averaging does not help. The bias term does not decrease with $K$.
- Variance decreases as $1/K$. If the models are independent (zero covariance), the variance of the ensemble shrinks linearly. With $K = 10$ independent models, you get a 10x reduction in variance.
- Covariance is the bottleneck. In practice, models are never independent. LLMs trained on overlapping data with similar architectures have high positive covariance. As $K \to \infty$, the variance term vanishes but the covariance term persists: $\lim_{K \to \infty} \frac{K-1}{K}\text{Covar} = \text{Covar}$. The irreducible ensemble error is $\text{Bias}^2 + \text{Covar}$.
This is why diversity is essential. The only way to reduce the ensemble error below the individual error is to reduce the covariance between models. And the only way to reduce covariance is to use models that make different errors — different architectures, different training data, different prompting strategies, or different specializations.
| Component | Effect of adding models | How to reduce |
|---|---|---|
| $\text{Bias}^2$ | Unchanged | Better individual models |
| $\frac{1}{K}\text{Var}$ | Decreases as $1/K$ | More models |
| $\frac{K-1}{K}\text{Covar}$ | Converges to Covar | More diverse models |
Signal averaging in MRI. When you take an MRI scan, each acquisition contains the true brain signal plus random thermal noise. If you average $K$ scans, the signal (bias) stays the same, but the noise amplitude (its standard deviation) drops by a factor of $\sqrt{K}$. This works beautifully because thermal noise across scans is independent. But if there is a systematic artifact — say, a motion ghost — that appears in every scan, averaging does nothing to remove it. The artifact is the "covariance" term: it is shared across all acquisitions and cannot be averaged away. To remove it, you need a different kind of acquisition (a different pulse sequence, a different head position). In multi-agent AI, the "different kind of acquisition" is a different model with different error patterns.
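The same arithmetic can be made concrete with a small Monte-Carlo sketch (Python with NumPy; all numbers are synthetic and chosen only to make the effect visible). Each "model" sees the true target plus a shared bias, a noise component common to every model (playing the role of the covariance, like the motion ghost above), and a private noise component (the variance). Averaging over $K$ models drives the private part toward zero but leaves the floor $\text{Bias}^2 + \text{Covar}$ untouched.

```python
# Monte-Carlo sketch of the bias-variance-covariance decomposition.
# Each model's prediction = target + bias + shared noise + private noise.
import numpy as np

rng = np.random.default_rng(1)
y = 0.0                   # true target
bias = 0.3                # systematic error, identical across models
sigma_shared = 0.4        # shared noise std  -> pairwise covariance c = 0.16
sigma_private = 0.8       # private noise std -> averages away as 1/K

n_trials = 50_000
floor = bias**2 + sigma_shared**2                              # Bias^2 + Covar
for K in (1, 2, 5, 10, 50, 200):
    shared = rng.normal(0.0, sigma_shared, size=n_trials)      # one draw per trial
    private = rng.normal(0.0, sigma_private, size=(n_trials, K))
    ensemble = y + bias + shared + private.mean(axis=1)        # average of K models
    mse = np.mean((ensemble - y) ** 2)
    print(f"K={K:4d}  ensemble MSE={mse:.4f}   floor={floor:.4f}")
```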
From Regression to Language Generation
The decomposition above applies to scalar predictions. For language generation, the output is a sequence of tokens, and "error" is harder to define. But the same principle operates through several mechanisms:
- Log-probability aggregation. Given $K$ models, you can average their next-token log-probabilities and sample from the resulting distribution. This is equivalent to a product-of-experts model and directly inherits the variance-reduction property.
- Best-of-K sampling. Generate $K$ candidate responses and select the best one (by a scoring model or by majority agreement). If the probability that any single model produces a correct response is $p$, the probability that at least one of $K$ independent models is correct is $1 - (1-p)^K$, which approaches 1 exponentially in $K$ (see the sketch after this list).
- Iterative refinement. One model generates a draft; another model critiques and revises it; a third model verifies the revision. This is not simple averaging but a sequential aggregation that can reduce both variance and bias, because each stage can catch and correct the errors of the previous one.
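As a sanity check on the best-of-$K$ argument, the sketch below (Python with NumPy; $p$ and $K$ are illustrative, and the selector is assumed to be an oracle that always recognises a correct candidate, which real scoring models only approximate) compares the closed-form success probability $1-(1-p)^K$ with a direct simulation.

```python
# Best-of-K with an oracle selector: success iff at least one candidate is correct.
import numpy as np

rng = np.random.default_rng(2)
p = 0.4                        # per-candidate probability of a correct response
n_trials = 100_000

for K in (1, 2, 5, 10, 20):
    analytic = 1 - (1 - p) ** K
    correct = rng.random((n_trials, K)) < p          # independent candidates
    empirical = correct.any(axis=1).mean()           # at least one correct
    print(f"K={K:2d}  analytic={analytic:.4f}  simulated={empirical:.4f}")
```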
Aggregation reduces variance but cannot reduce bias or covariance. The entire research program of multi-agent AI can be understood as a search for architectures that maximize diversity (minimize covariance) while maintaining or improving individual model quality (minimizing bias). The Mixture of Agents architecture (next post) is one such approach: it uses diverse models as "proposers" and a separate model as "aggregator," creating a layered system where each layer can correct the biases of the previous one.
Condorcet, Correlation, and the Limits of Aggregation
Condorcet's Jury Theorem: Formal Statement
Let $V_1, \ldots, V_K$ be binary random variables representing the votes of $K$ jurors, where $V_k = 1$ means juror $k$ votes correctly. The majority decision is:
$$M_K = \mathbf{1}\left\{\sum_{k=1}^{K} V_k > \frac{K}{2}\right\}$$

Theorem (Condorcet, 1785). If the jurors are independent and each has probability $p > 1/2$ of voting correctly ($P(V_k = 1) = p$ for all $k$), then:

$$P(M_K = 1) \to 1 \quad \text{as } K \to \infty$$

Proof sketch. By the law of large numbers, $\frac{1}{K}\sum_k V_k \to p > 1/2$ almost surely. Therefore $\sum_k V_k > K/2$ with probability approaching 1. More precisely, by Hoeffding's inequality:

$$P(M_K = 0) = P\left(\frac{1}{K}\sum_k V_k \leq \frac{1}{2}\right) \leq \exp\left(-2K\left(p - \frac{1}{2}\right)^2\right)$$

The error probability decays exponentially in $K$. With $p = 0.6$ and $K = 100$ jurors, the probability of the majority being wrong is less than $e^{-2} \approx 0.14$. With $K = 1000$, it is less than $e^{-20} \approx 10^{-9}$.
Condorcet's Jury Theorem: exponential error decay under independence. Correlation and incompetence both break the guarantee.
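A short simulation makes the exponential decay visible. The sketch below (Python with NumPy; $p$ and the jury sizes are illustrative, and odd $K$ avoids ties) estimates the majority's error rate for independent jurors and compares it with the Hoeffding bound above.

```python
# Condorcet's Jury Theorem: majority error rate for independent jurors with p > 1/2.
import numpy as np

rng = np.random.default_rng(3)
p = 0.6
n_trials = 100_000

for K in (1, 11, 101, 501):
    votes = rng.random((n_trials, K)) < p            # True = correct vote
    majority_correct = votes.sum(axis=1) > K / 2
    error_rate = 1.0 - majority_correct.mean()
    bound = np.exp(-2 * K * (p - 0.5) ** 2)
    print(f"K={K:4d}  majority error={error_rate:.5f}  Hoeffding bound={bound:.5f}")
```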
The Condorcet Converse and the Danger of Correlated Errors
The theorem has a devastating converse: if $p < 1/2$ (each juror is more likely wrong than right), the majority is also more likely wrong, and the error probability approaches 1 as $K \to \infty$. Aggregation amplifies systematic incompetence. In the LLM setting, this means: if every model in your ensemble makes the same mistake (e.g., the same hallucination appears in every model's training data), aggregation will reinforce the error, not correct it.
More subtly, correlation undermines the theorem even when $p > 1/2$. Let $\rho$ denote the average pairwise correlation between models. Berend and Paroush (1998) showed that the effective number of independent models in a correlated ensemble is approximately:
$$K_{\text{eff}} = \frac{K}{1 + (K-1)\rho}$$When $\rho = 0$ (independent models), $K_{\text{eff}} = K$. When $\rho = 1$ (identical models), $K_{\text{eff}} = 1$ — no benefit from aggregation. For intermediate correlations, the benefit is real but diminished. With $\rho = 0.5$ and $K = 100$, the effective ensemble size is only $K_{\text{eff}} \approx 2$.
This formula explains a widely observed empirical phenomenon: adding more LLMs to an ensemble quickly saturates. Going from 1 to 3 models helps a lot; going from 10 to 100 helps very little. The marginal value of each additional model decreases rapidly because the correlation with existing models is high.
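The saturation effect is easy to tabulate. Here is a minimal sketch (plain Python; the correlation values are illustrative) of the effective-size formula:

```python
# Effective ensemble size under average pairwise correlation rho.
def k_eff(K: int, rho: float) -> float:
    return K / (1 + (K - 1) * rho)

for rho in (0.0, 0.1, 0.5, 0.9):
    row = "  ".join(f"K={K:<3d}-> {k_eff(K, rho):6.2f}" for K in (3, 10, 100))
    print(f"rho={rho:.1f}   {row}")
```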
The Bias-Variance-Covariance Derivation
We now derive the full decomposition stated in the Technical section. Let $f_k(x)$ denote the prediction of model $k$ on input $x$, where the randomness is over the model training process (different random seeds, data subsets, etc.). Let $\mu_k = \mathbb{E}[f_k(x)]$ and $\bar{\mu} = \frac{1}{K}\sum_k \mu_k$. The ensemble prediction is $\bar{f}(x) = \frac{1}{K}\sum_k f_k(x)$.
The mean squared error of the ensemble is:
$$\text{MSE}(\bar{f}) = \mathbb{E}[(\bar{f}(x) - y)^2] = (\bar{\mu} - y)^2 + \text{Var}(\bar{f}(x))$$

The variance of the ensemble decomposes as:

$$\text{Var}(\bar{f}) = \text{Var}\left(\frac{1}{K}\sum_k f_k\right) = \frac{1}{K^2}\left(\sum_k \text{Var}(f_k) + \sum_{j \neq k} \text{Cov}(f_j, f_k)\right)$$

Assuming each model has the same variance $\sigma^2$ and each pair has the same covariance $c$:

$$\text{Var}(\bar{f}) = \frac{\sigma^2}{K} + \frac{K-1}{K}c$$

As $K \to \infty$:

$$\text{Var}(\bar{f}) \to c$$

The ensemble variance converges to the pairwise covariance. If the models are perfectly correlated ($c = \sigma^2$), the variance never decreases. If they are independent ($c = 0$), the variance vanishes. The entire gain from ensembling is captured by the gap between $\sigma^2$ and $c$ — that is, by the diversity of the models.
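The closed form can be verified directly by sampling equicorrelated Gaussian "predictions". In the sketch below (Python with NumPy; $\sigma^2$ and $c$ are illustrative values), the sample variance of the ensemble mean matches $\sigma^2/K + \frac{K-1}{K}c$ up to Monte-Carlo error.

```python
# Checking Var(f_bar) = sigma^2/K + (K-1)/K * c with equicorrelated Gaussians.
import numpy as np

rng = np.random.default_rng(4)
sigma2, c = 1.0, 0.4          # per-model variance and pairwise covariance
n_trials = 50_000

for K in (2, 5, 20, 100):
    cov = np.full((K, K), c) + (sigma2 - c) * np.eye(K)   # equicorrelated covariance
    preds = rng.multivariate_normal(np.zeros(K), cov, size=n_trials)
    empirical = preds.mean(axis=1).var()
    analytic = sigma2 / K + (K - 1) / K * c
    print(f"K={K:3d}  empirical={empirical:.3f}  analytic={analytic:.3f}")
```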
PAC-Bayes Bounds for Ensembles
The PAC-Bayes framework provides a more general bound on the ensemble error. Let $\pi$ be a prior distribution over models (chosen before seeing data) and $\rho$ be a posterior distribution over models (the ensemble weights). The PAC-Bayes bound (McAllester, 1999; Catoni, 2007) states that with probability at least $1 - \delta$ over the training data:
$$\mathbb{E}_{f \sim \rho}[\mathcal{L}(f)] \leq \mathbb{E}_{f \sim \rho}[\hat{\mathcal{L}}(f)] + \sqrt{\frac{\text{KL}(\rho \| \pi) + \ln(2n/\delta)}{2n}}$$

where $\mathcal{L}(f)$ is the population loss, $\hat{\mathcal{L}}(f)$ is the empirical loss on the $n$ training examples, and $\text{KL}(\rho \| \pi)$ is the Kullback-Leibler divergence between the posterior and prior.
The bound says: the ensemble's true error is close to its empirical error, with a penalty that grows with the complexity of the ensemble (measured by KL divergence from the prior). For a uniform ensemble over $K$ models with a uniform prior over a larger pool of $M$ models, the KL term is $\ln(M/K)$ — logarithmic in the model pool size. This means you can search over a large space of models and still get tight generalization bounds, provided you select a small, high-quality subset.
In the context of multi-agent systems, the PAC-Bayes bound justifies a key design principle: select a diverse subset of high-quality models rather than using all available models. The ensemble error is controlled by the empirical performance of the selected models plus a complexity penalty that grows slowly with the selection pool size.
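To see how mild the penalty is, the sketch below (plain Python; $M$, $K$, $n$, and $\delta$ are illustrative) evaluates the complexity term of the bound for a uniform posterior over $K$ models drawn from a pool of $M$, using $\text{KL}(\rho\|\pi) = \ln(M/K)$.

```python
# Complexity penalty of the PAC-Bayes bound for a uniform K-subset of an M-model pool.
import math

def pac_bayes_penalty(M: int, K: int, n: int, delta: float = 0.05) -> float:
    kl = math.log(M / K)                          # KL(uniform over K || uniform over M)
    return math.sqrt((kl + math.log(2 * n / delta)) / (2 * n))

for M, K, n in [(100, 5, 1_000), (10_000, 5, 1_000), (10_000, 5, 100_000)]:
    print(f"M={M:6d}  K={K}  n={n:7d}  penalty={pac_bayes_penalty(M, K, n):.4f}")
```

Growing the candidate pool from 100 to 10,000 models barely moves the penalty, while gathering more data shrinks it much faster, which is exactly the "search widely, select a small subset" principle described above.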
From Ensembles to Agents
Classical ensemble theory assumes a simple aggregation rule (typically averaging or voting). Multi-agent systems go beyond this in three ways:
- Heterogeneous roles. Instead of $K$ models doing the same task, different agents specialize in different subtasks. One agent retrieves information, another reasons, a third verifies. This is more like a team than a committee.
- Sequential interaction. Instead of parallel aggregation, agents interact sequentially: one agent's output becomes another's input. This allows later agents to correct the errors of earlier ones — a mechanism that can reduce bias, not just variance.
- Learned aggregation. Instead of a fixed aggregation rule, a separate model learns to combine the agents' outputs. This aggregator can learn non-linear, context-dependent weighting — giving more weight to the agent that is most reliable for the current input.
These three extensions — specialization, sequencing, and learned aggregation — transform the classical ensemble into a Mixture of Agents architecture, which is the subject of the next post.
Multi-agent systems extend classical ensembles with specialization, sequential refinement, and learned aggregation.
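To make the contrast with classical ensembling concrete, here is a minimal structural sketch of a single propose-then-aggregate step (Python; `proposers` and `aggregator` are hypothetical stand-ins for whatever model clients you use, so nothing here reflects a specific provider's API or the Mixture of Agents implementation discussed in the next post).

```python
# Structural sketch of one propose-then-aggregate layer.
# The callables are placeholders for actual model clients.
from typing import Callable

def propose_and_aggregate(
    prompt: str,
    proposers: list[Callable[[str], str]],   # diverse models queried on the same prompt
    aggregator: Callable[[str], str],        # model that synthesises the proposals
) -> str:
    proposals = [propose(prompt) for propose in proposers]
    aggregation_prompt = (
        "Several candidate responses to the same query are given below. "
        "Synthesise them into a single response, resolving disagreements.\n\n"
        f"Query: {prompt}\n\n"
        + "\n\n".join(f"Candidate {i + 1}: {text}" for i, text in enumerate(proposals))
    )
    return aggregator(aggregation_prompt)
```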
The analysis in this section establishes the theoretical case for multi-agent systems: aggregation provably reduces error, the magnitude of the reduction equals the diversity of the ensemble, and the limits are set by the correlation structure. But the classical theory assumes simple aggregation rules and homogeneous models. The next posts in this series explore what happens when you move beyond these assumptions — to layered architectures (Post 2), to the problem of optimal aggregation (Post 3), and to systems where agents reason, act, and learn from each other (Posts 4–6).
Further Reading
- Condorcet, M. (1785). Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix.
- Krogh, A. & Vedelsby, J. (1995). Neural network ensembles, cross validation, and active learning. Advances in Neural Information Processing Systems, 7.
- Galton, F. (1907). Vox Populi. Nature, 75, 450–451.
- Berend, D. & Paroush, J. (1998). When is Condorcet's Jury Theorem valid? Social Choice and Welfare, 15(4), 481–488.
- McAllester, D. A. (1999). PAC-Bayesian model averaging. Proceedings of COLT.
- Catoni, O. (2007). PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. IMS Lecture Notes Monograph Series, 56.
- Brown, G. et al. (2005). Diversity creation methods: A survey and categorisation. Information Fusion, 6(1), 5–20.
- Wang, J. et al. (2024). Mixture-of-Agents enhances large language model capabilities. arXiv:2406.04692.