Scaling Laws for Multi-Agent Systems

Part 6 of a 6-part series on agentic AI, multi-agent architectures, and the theory of LLM collaboration.

This is part of the series Agentic AI.

The previous five posts established the building blocks of multi-agent AI systems: diversity as the engine of ensemble improvement, the MoA architecture, aggregation theory, reasoning architectures, and agentic retrieval. A natural question emerges: how does performance scale with the number of agents? Adding more agents costs more compute. Does the quality improvement justify the cost? Is there a point of diminishing returns, or worse, a point where more agents actively hurt?

This post examines the scaling behavior of multi-agent systems. The answers are nuanced: scaling helps under specific conditions (diversity, communication efficiency, task decomposability) and fails under others (high correlation, communication bottlenecks, coordination overhead). Understanding when and why scaling works — and when it does not — is essential for designing practical multi-agent systems.

Intuitive

When More Agents Help (and When They Don't)

The Mythical Man-Month of AI

In 1975, Fred Brooks published The Mythical Man-Month, one of the most influential books in software engineering. His central insight: adding more programmers to a late software project makes it later. The reason is communication overhead. If $K$ programmers need to coordinate, there are $\binom{K}{2} = K(K-1)/2$ communication channels. With 5 programmers, there are 10 channels. With 50, there are 1,225. The work that goes into keeping everyone aligned grows quadratically, eventually consuming more resources than the actual programming.

Multi-agent AI systems face the same challenge. Each agent added to the system increases the total compute, the amount of information flowing between agents, and the complexity of coordination. If the task does not benefit from additional agents — or if the communication overhead exceeds the benefit — adding agents makes the system slower and worse, not better.

```mermaid
flowchart LR
    subgraph Good["Scaling Works"]
        direction TB
        G1["Task is decomposable\ninto independent subtasks"]
        G2["Agents are diverse\n(different errors)"]
        G3["Communication is\nefficient (sparse)"]
    end
    subgraph Bad["Scaling Fails"]
        direction TB
        B1["Task requires tight\ncoordination"]
        B2["Agents are similar\n(correlated errors)"]
        B3["Communication is\nexpensive (dense)"]
    end
    Good --> WIN["More agents =\nbetter + faster"]
    Bad --> LOSE["More agents =\nworse + slower"]
    style WIN fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style LOSE fill:#fce4ec,stroke:#c62828,color:#1c1917
```

Scaling helps when tasks are decomposable, agents are diverse, and communication is sparse. It hurts when tasks require coordination, agents are similar, and communication is expensive.

Three Regimes of Scaling

Empirically, multi-agent systems exhibit three distinct scaling regimes:

  1. Early gains (K = 1 to 3–5 agents). The biggest improvements come from the first few agents. Going from 1 to 3 models in an ensemble or MoA system typically produces the largest quality jump. This is because the first additional models introduce the most diversity: any second model is likely to have very different error patterns from the first.
  2. Diminishing returns (K = 5 to 10–20). Each additional agent still helps, but less and less. The marginal diversity of each new model decreases because the model space has limited diversity: most LLMs are trained on similar data with similar architectures. The quality curve flattens.
  3. Saturation or degradation (K > 20). Beyond a certain point, adding agents provides negligible benefit and may actually hurt. The communication overhead grows, the aggregation becomes harder (more opinions to reconcile), and the marginal diversity of each new agent approaches zero. In some cases, additional agents introduce noise that degrades the consensus.

Analogy

The kitchen problem. One cook makes a meal. Two cooks can divide the work and make a better meal, faster. Five cooks, each handling a different dish, can produce a feast. But fifty cooks in one kitchen? They bump into each other, argue about seasoning, use the same oven at the same time, and spend more time coordinating than cooking. The food is not fifty times better than what one cook produces — it might actually be worse, because coordination failures introduce errors that individual cooks would never make. The optimal number of cooks depends on the size of the kitchen (communication bandwidth), the complexity of the meal (task decomposability), and how different their culinary traditions are (diversity).

Communication Topology

How agents are connected matters as much as how many there are. The main topologies are:

  • Star (centralized). All agents communicate with a central coordinator, but not with each other. Communication cost: $O(K)$. The coordinator is a bottleneck but prevents the quadratic overhead of all-to-all communication.
  • Chain (sequential). Agent 1's output goes to Agent 2, whose output goes to Agent 3, and so on. Communication cost: $O(K)$. Good for sequential refinement (like MoA layers) but the last agent has no direct access to the first agent's output.
  • Fully connected (all-to-all). Every agent sees every other agent's output. Communication cost: $O(K^2)$. Maximizes information sharing but is the most expensive. This is the topology of multi-agent debate.
  • Tree (hierarchical). Agents are organized in a hierarchy: leaf agents generate, intermediate agents aggregate, and the root produces the final output. Communication cost: $O(K)$. Good for decomposable tasks where subtasks have natural hierarchical structure.

```mermaid
flowchart TD
    subgraph Star["Star"]
        SC["Coordinator"]
        SA1["A1"] --> SC
        SA2["A2"] --> SC
        SA3["A3"] --> SC
        SA4["A4"] --> SC
    end
    subgraph Chain["Chain"]
        CA1["A1"] --> CA2["A2"] --> CA3["A3"] --> CA4["A4"]
    end
    subgraph FullC["Fully Connected"]
        FA1["A1"] <--> FA2["A2"]
        FA1 <--> FA3["A3"]
        FA2 <--> FA3
    end
    subgraph Tree["Tree"]
        TR["Root"]
        TM1["M1"] --> TR
        TM2["M2"] --> TR
        TL1["L1"] --> TM1
        TL2["L2"] --> TM1
        TL3["L3"] --> TM2
        TL4["L4"] --> TM2
    end
```

Four communication topologies. Star and chain scale linearly. Fully connected scales quadratically. Tree scales linearly with logarithmic depth.
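
A few lines of Python make the channel arithmetic concrete. This is a minimal sketch: the topology names follow the list above, and a "channel" simply means one communication link.

```python
# Channel counts for the topologies above (illustrative sketch only).

def channels(k: int) -> dict:
    return {
        "star": k,                            # each agent <-> the coordinator, O(K)
        "chain": k - 1,                       # agent i -> agent i+1, O(K)
        "tree (binary)": k - 1,               # parent/child links, O(K)
        "fully connected": k * (k - 1) // 2,  # every pair of agents, O(K^2)
    }

for k in (5, 10, 50):
    print(k, channels(k))
# The fully connected count reproduces Brooks's numbers: 10 channels at K=5,
# 1225 at K=50.
```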

Key Takeaway

Scaling multi-agent systems is not about maximizing the number of agents. It is about finding the optimal tradeoff between diversity (more agents with different capabilities), communication efficiency (sparse topologies that avoid quadratic overhead), and task fit (matching the architecture to the problem's decomposition structure). The best systems use 3–10 diverse agents in a carefully chosen topology, not 100 homogeneous agents in a fully connected graph.

Technical

Formal Scaling Analysis

Ensemble Scaling Law

Recall from Post 1 the ensemble error decomposition for $K$ models with average variance $\sigma^2$ and average pairwise covariance $c$:

$$\text{MSE}(K) = \text{Bias}^2 + \frac{\sigma^2}{K} + \frac{K-1}{K}c$$

As $K$ increases, the second term ($\sigma^2/K$) decreases but the third term ($\frac{K-1}{K}c$) increases monotonically toward $c$. The total error converges to:

$$\text{MSE}(\infty) = \text{Bias}^2 + c$$

The improvement from $K = 1$ to $K = \infty$ is $\sigma^2 - c$, which equals the expected disagreement between a single model and the ensemble average. The rate of diminishing returns is governed by the derivative:

$$\frac{d}{dK}\text{MSE}(K) = -\frac{\sigma^2 - c}{K^2}$$

The marginal improvement from adding the $K$-th model decreases as $1/K^2$ — very fast. The first few models capture most of the benefit. With $\rho = c/\sigma^2$ as the average correlation, the improvement over a single model, expressed as a fraction of the single-model variance $\sigma^2$, is $\frac{K-1}{K}(1 - \rho)$:

  • $K = 2$ achieves $\frac{1}{2}(1 - \rho)$.
  • $K = 5$ achieves $\frac{4}{5}(1 - \rho)$.
  • $K = 10$ achieves $\frac{9}{10}(1 - \rho)$.

For $\rho = 0.8$ (high correlation, typical for LLMs from the same family), the total possible improvement is only $20\%$ of the individual variance, and 90% of that is captured with just $K = 10$ models. Beyond 10, additional models provide negligible benefit.
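
A short numerical sketch of this decomposition, with illustrative parameters rather than measured ones ($\rho = 0.8$ mirrors the high-correlation case just discussed):

```python
# MSE(K) = bias^2 + sigma^2/K + (K-1)/K * c, with c = rho * sigma^2.
# Illustrative parameters, not values measured on any benchmark.

bias2, sigma2, rho = 0.0, 1.0, 0.8
c = rho * sigma2

def mse(k: int) -> float:
    return bias2 + sigma2 / k + (k - 1) / k * c

total_gain = sigma2 - c                         # improvement from K=1 to K=infinity
for k in (1, 2, 3, 5, 10, 20, 100):
    captured = (mse(1) - mse(k)) / total_gain   # equals (K-1)/K of the possible gain
    print(f"K={k:3d}  MSE={mse(k):.3f}  gain captured={captured:.2f}")
# With rho=0.8 the entire possible gain is only 0.2*sigma^2, and K=10 already
# captures 90% of it, so further agents barely move the error.
```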

Collaborative Scaling Laws

Qian et al. (2025) studied scaling laws specifically for multi-agent collaborative systems, examining how performance scales with the number of agents, the number of interaction rounds, and the communication topology. Their key findings:

Agent scaling. For a fixed number of interaction rounds $T$, performance as a function of agent count $K$ follows:

$$P(K) = P_{\max} - \alpha K^{-\beta}$$

where $P_{\max}$ is the asymptotic performance, $\alpha$ captures the initial gap, and $\beta$ is the scaling exponent. They found $\beta \approx 0.3$–$0.7$ depending on the task, meaning performance improves sublinearly in $K$ (consistent with the diminishing returns predicted by ensemble theory).

Round scaling. For a fixed number of agents $K$, performance as a function of interaction rounds $T$ follows:

$$P(T) = P_{\max}(K) \cdot (1 - e^{-\gamma T})$$

Performance approaches a $K$-dependent ceiling exponentially in $T$. More agents raise the ceiling ($P_{\max}(K)$ increases with $K$); more rounds approach the ceiling faster. This suggests an optimal allocation: given a fixed compute budget $C = K \times T$, the optimal split depends on whether the task is more "diversity-limited" (invest in $K$) or "refinement-limited" (invest in $T$).

| Scaling dimension | Law | Typical exponent | Implication |
| --- | --- | --- | --- |
| Agents $K$ | Power law: $\alpha K^{-\beta}$ | $\beta \approx 0.3$–$0.7$ | Sublinear gains; 3–7 agents capture most benefit |
| Rounds $T$ | Exponential: $1 - e^{-\gamma T}$ | $\gamma \approx 0.5$–$1.5$ | Rapid saturation; 2–4 rounds usually enough |
| Compute $C = KT$ | Jointly concave | — | Optimal allocation depends on task type |
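
Written as code, the two laws look as follows. The constants are placeholders chosen inside the quoted ranges, not fitted values from Qian et al., so only the shapes are meaningful.

```python
import math

# Placeholder constants inside the quoted ranges (NOT fitted values from the paper).
P_MAX, ALPHA, BETA, GAMMA = 1.0, 0.5, 0.5, 1.0

def p_agents(k: int) -> float:
    """Agent scaling: P(K) = P_max - alpha * K^(-beta), sublinear gains in K."""
    return P_MAX - ALPHA * k ** (-BETA)

def p_rounds(t: int) -> float:
    """Round scaling: P(T) = P_max * (1 - exp(-gamma * T)), exponential saturation."""
    return P_MAX * (1.0 - math.exp(-GAMMA * t))

print([round(p_agents(k), 3) for k in (1, 3, 5, 10, 20)])   # slow power-law climb
print([round(p_rounds(t), 3) for t in (1, 2, 3, 4, 6)])     # saturates after ~3 rounds
```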

Communication Cost Analysis

The total communication cost of a multi-agent system depends on the topology. For $K$ agents, each producing outputs of length $L$ tokens:

| Topology | Edges | Total tokens communicated | Context per agent |
| --- | --- | --- | --- |
| Star | $K$ | $K \cdot L$ | $K \cdot L$ (coordinator only) |
| Chain | $K - 1$ | $(K-1) \cdot L$ | $L$ per agent (previous only) |
| Fully connected | $K(K-1)$ | $K(K-1) \cdot L$ | $(K-1) \cdot L$ per agent |
| MoA (layered) | $K_l \cdot K_{l+1}$ per layer | $\sum_l K_l \cdot K_{l+1} \cdot L$ | $K_{l-1} \cdot L$ per agent |

The practical bottleneck is often not the total communication but the context window size per agent. In a fully connected topology with $K = 10$ agents each producing $L = 1000$ tokens, each agent must process a context of 9,000 tokens from other agents plus the original prompt. With $K = 100$, the per-agent context is 99,000 tokens — pushing the limits of even long-context models. This context pressure is a hard scaling barrier that favors sparse topologies.
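
A back-of-the-envelope helper for that context pressure. This is a sketch only; it counts tokens received from peers and ignores the original prompt and any system text.

```python
# Per-agent context load in tokens received from peers.

def peer_context_tokens(k: int, tokens_per_output: int, topology: str) -> int:
    if topology == "fully_connected":
        return (k - 1) * tokens_per_output   # every peer's full output
    if topology == "chain":
        return tokens_per_output             # only the previous agent's output
    if topology == "star_coordinator":
        return k * tokens_per_output         # the coordinator reads everyone
    raise ValueError(f"unknown topology: {topology}")

for k in (10, 100):
    print(k, peer_context_tokens(k, 1000, "fully_connected"))
# -> 9000 tokens at K=10 and 99000 at K=100: the context-pressure barrier above.
```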

The Efficiency Frontier

Define the efficiency of a multi-agent system as the quality per unit of compute:

$$E(K, T) = \frac{P(K, T)}{C(K, T)}$$

where $P$ is performance and $C$ is total compute (proportional to $K \times T$ for a fixed model size). The efficiency frontier is the set of $(K, T)$ pairs that maximize quality for each compute budget.

Given the scaling laws above, the efficiency frontier has a characteristic shape:

  • At low compute budgets: $K = 1$, $T = 1$ (single model, no multi-agent overhead).
  • At moderate budgets: $K = 3$–$5$, $T = 2$–$3$ (diverse ensemble with a few refinement rounds).
  • At high budgets: the optimal $K$ grows slowly while $T$ grows faster (increasing refinement rounds has better marginal returns than adding more agents, once diversity is saturated).

```mermaid
flowchart LR
    subgraph Frontier["Efficiency Frontier"]
        direction TB
        L["Low budget:\nK=1, T=1\nSingle model"]
        M["Medium budget:\nK=3-5, T=2-3\nDiverse MoA"]
        H["High budget:\nK=5-10, T=3-5\nRefined MoA"]
        L --> M --> H
    end
    subgraph Diminishing["Above frontier"]
        D1["K=50, T=1\nToo many agents,\ntoo little refinement"]
        D2["K=2, T=20\nToo few agents,\ntoo much refinement"]
    end
    style L fill:#f3f0ec,stroke:#a0522d,color:#1c1917
    style M fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style H fill:#e3f2fd,stroke:#1565c0,color:#1c1917
    style D1 fill:#fce4ec,stroke:#c62828,color:#1c1917
    style D2 fill:#fce4ec,stroke:#c62828,color:#1c1917
```

The efficiency frontier: optimal (K, T) combinations at each compute budget. Off-frontier configurations waste compute on too many agents or too many rounds.
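
The frontier can be traced numerically. The sketch below combines the agent law and the round law from the previous section multiplicatively, which is an assumption rather than an established joint law, and it reuses the same placeholder constants, so treat the resulting splits as qualitative.

```python
import math

P_MAX, ALPHA, BETA, GAMMA = 1.0, 0.5, 0.5, 1.0   # placeholder constants

def performance(k: int, t: int) -> float:
    # Assumed joint form: the agent law sets the ceiling, the round law approaches it.
    ceiling = P_MAX - ALPHA * k ** (-BETA)
    return ceiling * (1.0 - math.exp(-GAMMA * t))

def best_config(budget: int) -> tuple:
    """Best (K, T, P) with K*T <= budget under the assumed joint law."""
    best = (1, 1, performance(1, 1))
    for k in range(1, budget + 1):
        for t in range(1, budget // k + 1):
            p = performance(k, t)
            if p > best[2]:
                best = (k, t, p)
    return best

for budget in (1, 4, 9, 16, 25):
    k, t, p = best_config(budget)
    print(f"budget={budget:3d}  K={k:2d}  T={t:2d}  P={p:.3f}")
# A single model at tiny budgets; a handful of diverse agents refined over
# roughly 2-3 rounds at moderate budgets.
```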

Practical Guidelines

Based on the scaling analysis:

  1. Start with 3 diverse models and 2 interaction rounds — this captures most of the benefit at modest cost.
  2. Use sparse topologies (star or layered) rather than fully connected.
  3. Add agents only if they bring genuine diversity; adding another version of GPT-4 helps less than adding Llama or Claude.
  4. Increase rounds before increasing agents once you have 5–7 models.
  5. Monitor the quality-per-compute ratio and stop scaling when it drops below a threshold.
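
Guideline (5) amounts to a simple stopping rule. A minimal sketch, where `evaluate(k)` is a placeholder for however you score a k-agent configuration:

```python
from typing import Callable

def scale_until_flat(evaluate: Callable[[int], float],
                     max_agents: int = 10,
                     min_gain_per_agent: float = 0.005) -> int:
    """Grow the agent count while each added agent still buys enough quality.

    `evaluate(k)` is a stand-in for your own metric (held-out accuracy,
    preference win-rate, ...); the threshold is a tunable placeholder."""
    k, prev = 1, evaluate(1)
    while k < max_agents:
        nxt = evaluate(k + 1)
        if nxt - prev < min_gain_per_agent:   # quality per extra agent has flattened
            return k
        k, prev = k + 1, nxt
    return k
```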

Advanced

Graph Theory, Coordination Games, and Open Problems

Graph-Theoretic Bounds on Communication Efficiency

Model the multi-agent system as a directed communication graph $G = (V, E)$ where vertices are agents and edges are communication channels. The information capacity of the graph is bounded by its graph-theoretic properties.

Let $\lambda_2(G)$ denote the second-smallest eigenvalue of the graph Laplacian (the algebraic connectivity or Fiedler value). For consensus problems (all agents converging to the same value), the convergence rate is $O(e^{-\lambda_2 t})$ where $t$ is the number of communication rounds. Graphs with high algebraic connectivity converge faster.

For a fully connected graph on $K$ vertices: $\lambda_2 = K$ (fastest convergence). For a chain: $\lambda_2 = 2(1 - \cos(\pi/K)) \approx \pi^2/K^2$ (slowest convergence among connected graphs). For a star: $\lambda_2 = 1$ (independent of $K$, moderate convergence).

The tradeoff: fully connected graphs converge fastest but have $O(K^2)$ edges (expensive). Expander graphs provide a middle ground: $O(K)$ edges but $\lambda_2 = \Omega(1)$ (constant algebraic connectivity independent of $K$). Random $d$-regular graphs with $d \geq 3$ are expanders with high probability, and achieve near-optimal convergence with linear communication cost.

| Graph | Edges | $\lambda_2$ | Convergence | Communication cost |
| --- | --- | --- | --- | --- |
| Complete $K_K$ | $O(K^2)$ | $K$ | $O(e^{-Kt})$ | $O(K^2 L)$ |
| Star | $O(K)$ | $1$ | $O(e^{-t})$ | $O(KL)$ |
| Chain | $O(K)$ | $O(1/K^2)$ | $O(e^{-t/K^2})$ | $O(KL)$ |
| Random 3-regular | $O(K)$ | $\Omega(1)$ | $O(e^{-\Omega(t)})$ | $O(KL)$ |

This suggests that multi-agent systems should use expander-like topologies rather than chains or fully connected graphs. Expanders provide the convergence speed of dense graphs at the communication cost of sparse graphs. In practice, this can be approximated by randomly connecting each agent to $d = 3$–$5$ other agents.
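
The $\lambda_2$ values quoted above are easy to verify by building each graph's Laplacian directly. The sketch below assumes numpy and networkx are available:

```python
import numpy as np
import networkx as nx   # assumed available

def lambda2(g: nx.Graph) -> float:
    """Algebraic connectivity: second-smallest eigenvalue of the graph Laplacian."""
    a = nx.to_numpy_array(g)
    lap = np.diag(a.sum(axis=1)) - a
    return float(np.sort(np.linalg.eigvalsh(lap))[1])

K = 12
graphs = {
    "complete": nx.complete_graph(K),
    "star": nx.star_graph(K - 1),                        # K nodes: 1 hub + K-1 leaves
    "chain (path)": nx.path_graph(K),
    "random 3-regular": nx.random_regular_graph(3, K, seed=0),
}
for name, g in graphs.items():
    print(f"{name:16s} edges={g.number_of_edges():3d}  lambda2={lambda2(g):.3f}")
# Expected roughly: complete = K, star = 1, path ~ pi^2/K^2 ~ 0.07, 3-regular = Theta(1).
```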

Coordination Games and the Price of Anarchy

Multi-agent collaboration can be modeled as a coordination game. Each agent $k$ chooses an effort level $e_k \in [0, 1]$ representing how much compute it allocates to the joint task. The payoff depends on both individual effort and the coordination of efforts:

$$u_k(e_k, e_{-k}) = v\left(\frac{1}{K}\sum_j e_j\right) - c_k \cdot e_k$$

where $v(\cdot)$ is the value of the collective output (increasing and concave in average effort) and $c_k$ is agent $k$'s cost per unit of effort. The social optimum maximizes total welfare:

$$\sum_k u_k = K \cdot v\left(\frac{1}{K}\sum_j e_j\right) - \sum_k c_k e_k$$

At the Nash equilibrium, each agent equates marginal benefit to marginal cost:

$$\frac{1}{K} v'\left(\frac{1}{K}\sum_j e_j\right) = c_k$$

But the social optimum requires:

$$v'\left(\frac{1}{K}\sum_j e_j\right) = c_k$$

The marginal benefit each agent perceives at equilibrium is therefore $K$ times smaller than the social marginal benefit, so equilibrium effort is inefficiently low — the classic free-rider problem. Each agent underinvests because it bears the full cost of its effort but receives only $1/K$ of the benefit. The price of anarchy for this game is:

$$\text{PoA} = \frac{v(\bar{e}_{\text{NE}})}{v(\bar{e}_{\text{OPT}})} \leq \frac{v(c/K)}{v(c)}$$

For concave $v$, this ratio approaches 0 as $K$ grows — larger systems have worse equilibria relative to the optimum, because the free-rider problem intensifies.

In the context of multi-agent AI, the "effort" is the quality of each agent's output. If agents are rewarded based on the collective output (not their individual contribution), they may produce lower-quality responses — the AI analogue of social loafing. Mechanism design (tying individual rewards to individual contributions, as in Shapley value attribution) can mitigate this, but at the cost of additional evaluation complexity.
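
A worked instance makes the gap concrete. Take $v(x) = \sqrt{x}$ and an identical cost $c$ for every agent (both choices are arbitrary); the optimality conditions above then solve in closed form:

```python
import math

# Toy instance of the game above: v(x) = sqrt(x), identical cost c per agent.
#   social optimum  : v'(e) = c      =>  e_opt = 1 / (4 c^2)
#   Nash equilibrium: v'(e) / K = c  =>  e_ne  = 1 / (4 K^2 c^2)
c = 0.5                                  # chosen so that e_opt = 1.0, the top of [0, 1]
e_opt = 1.0 / (4 * c ** 2)

for K in (2, 5, 10, 50):
    e_ne = 1.0 / (4 * K ** 2 * c ** 2)
    ratio = math.sqrt(e_ne) / math.sqrt(e_opt)   # v(e_ne) / v(e_opt) = 1/K here
    print(f"K={K:2d}  Nash effort={e_ne:.4f}  optimal effort={e_opt:.1f}  PoA={ratio:.3f}")
# Equilibrium effort falls as 1/K^2 and the value ratio v(e_NE)/v(e_OPT) as 1/K:
# larger collectives free-ride harder relative to the optimum.
```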

Optimal Topology Discovery

Given a task and a set of available agents, what is the optimal communication topology? This is a combinatorial optimization problem:

$$G^* = \arg\max_{G \in \mathcal{G}_K} \frac{P(G)}{C(G)}$$

where $P(G)$ is the performance of the multi-agent system with topology $G$ and $C(G)$ is its compute cost. The search space $\mathcal{G}_K$ contains all directed graphs on $K$ vertices — $2^{K(K-1)}$ of them, exponential in $K^2$.

GPTSwarm (Zhuge et al., 2024) approaches this via continuous relaxation: instead of discrete edge inclusion/exclusion, each edge has a continuous weight $w_{ij} \in [0, 1]$ representing the strength of information flow from agent $i$ to agent $j$. The weights are optimized via gradient descent on the task performance. After optimization, low-weight edges are pruned, yielding a sparse graph.

The theoretical question is whether this continuous relaxation preserves the optimality gap. For convex objectives, the optimal continuous solution can be rounded to a discrete graph with bounded loss. For non-convex objectives (typical in multi-agent systems), the gap can be arbitrarily large. Understanding this gap is an open problem.
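
GPTSwarm optimizes edge weights against actual task performance, which requires running the agents. The toy sketch below substitutes an invented differentiable proxy objective purely to illustrate the relax-optimize-prune loop; nothing about the objective, constants, or resulting topology comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 6
logits = rng.normal(size=(K, K))     # relaxed graph: one logit per directed edge

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def proxy_score(lg, cost=0.3):
    """Invented stand-in for task performance: diminishing-returns reward for
    information flowing into each agent, minus a communication cost.
    NOT GPTSwarm's objective."""
    w = sigmoid(lg)
    np.fill_diagonal(w, 0.0)                   # no self-edges
    inflow = w.sum(axis=0)                     # weighted in-degree of each agent
    return float(np.sum(np.sqrt(1.0 + inflow)) - cost * w.sum())

lr, eps = 0.5, 1e-4
for _ in range(200):                           # gradient ascent on the proxy score
    grad = np.zeros_like(logits)
    for i in range(K):
        for j in range(K):
            bump = np.zeros_like(logits)
            bump[i, j] = eps
            grad[i, j] = (proxy_score(logits + bump) -
                          proxy_score(logits - bump)) / (2 * eps)
    logits += lr * grad

weights = sigmoid(logits)
np.fill_diagonal(weights, 0.0)
kept = weights > 0.5                           # prune weak edges -> discrete topology
print(f"edges kept after pruning: {int(kept.sum())} of {K * (K - 1)}")
```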

Open Problems

The theory of multi-agent scaling is in its infancy. Several fundamental questions remain open:

  1. Theory of agent composition. When two agents are composed (one's output feeds into the other), what is the resulting system's capability? Is there a "composition theorem" that predicts the quality of a composed system from the qualities of its components? For simple averaging, the Krogh-Vedelsby decomposition provides this. For sequential composition with learned aggregation, no analogous result exists.
  2. Optimal agent specialization. Given a compute budget and a distribution of tasks, how should you divide the budget between having more general agents and fewer specialized ones? This is an exploration-exploitation tradeoff: general agents are safe (they handle any task) but mediocre; specialized agents are excellent at their specialty but useless outside it.
  3. Emergent capabilities. Single LLMs exhibit emergent capabilities — abilities that appear suddenly at a certain scale. Do multi-agent systems exhibit analogous emergent behaviors? Are there capabilities that only emerge when $K$ agents interact that no single agent possesses? The existence of such emergent multi-agent capabilities would fundamentally change how we think about scaling.
  4. Formal scaling laws. The Chinchilla scaling laws (Hoffmann et al., 2022) precisely characterize how single-model performance scales with parameters and data. No analogous law exists for multi-agent systems. A "multi-agent Chinchilla law" that predicts performance as a function of number of agents, agent size, communication topology, and interaction rounds would be transformative for system design.
  5. Robust aggregation. Current aggregation methods (voting, synthesis, debate) assume benign agents. In adversarial settings — where some agents may be compromised, biased, or malicious — robust aggregation that tolerates a fraction of adversarial agents is needed. Byzantine fault tolerance from distributed systems provides a starting point ($f < K/3$ adversarial agents can be tolerated), but the connection to LLM-based systems is unexplored.
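
For the scalar case in point 5, even a trimmed mean illustrates what robustness to a minority of adversarial agents looks like. This is a toy for numeric answers only and says nothing about aggregating free-text outputs.

```python
import numpy as np

def trimmed_mean(values, f: int) -> float:
    """Mean after discarding the f smallest and f largest values.

    With at most f adversarial agents, at most f values can lie below the
    smallest honest answer (or above the largest), so every value that
    survives trimming lies within the honest range."""
    v = np.sort(np.asarray(values, dtype=float))
    return float(v[f:len(v) - f].mean())

honest = [41.8, 42.1, 42.0, 41.9, 42.3, 42.2]   # six honest agents near 42
adversarial = [1e6, -1e6]                       # two compromised agents
answers = honest + adversarial                  # K = 8 agents, f = 2 adversaries

print("plain mean:  ", float(np.mean(answers)))       # dragged far from 42
print("trimmed mean:", trimmed_mean(answers, f=2))    # stays inside the honest range
```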

```mermaid
flowchart TD
    subgraph Known["What We Know"]
        K1["Diversity reduces ensemble error\n(Krogh-Vedelsby)"]
        K2["Diminishing returns with K\n(1/K² marginal improvement)"]
        K3["Correlation is the bottleneck\n(effective K depends on ρ)"]
        K4["Topology affects convergence\n(algebraic connectivity)"]
    end
    subgraph Open["What We Don't Know"]
        O1["Composition theorems\nfor sequential agents"]
        O2["Multi-agent scaling laws\n(the Chinchilla for agents)"]
        O3["Emergent multi-agent\ncapabilities"]
        O4["Optimal specialization\nvs. generalization"]
        O5["Robust aggregation\nunder adversarial agents"]
    end
    Known --> Bridge["Current frontier"]
    Bridge --> Open
    style Open fill:#faf8f5,stroke:#a0522d,color:#1c1917
```

The theory of multi-agent scaling: established results on diversity and diminishing returns, alongside major open questions about composition, emergence, and robustness.

Where This Is Going

The trajectory of multi-agent AI research is toward systems that are simultaneously more powerful and more principled. The first wave (2023–2024) was empirical: researchers discovered that multi-agent systems work well and catalogued which architectures perform best on which benchmarks. The second wave (2024–2025) is beginning to develop theory: scaling laws, convergence analyses, and game-theoretic models that explain why these systems work. The third wave, still ahead, will need to solve the open problems listed above — and will likely discover new ones.

The deepest question is whether multi-agent collaboration exhibits genuine emergence — capabilities of the collective that cannot be predicted from the capabilities of the individuals. If so, the analogy to human teams is apt: a well-organized team is not just faster than individuals; it can solve problems that no individual, given unlimited time, could solve alone. Whether this holds for AI agents is an empirical and theoretical question that the next generation of research will need to answer.

Series Summary

This series has traced a single thread through the theory and practice of multi-agent AI: from the ensemble error decomposition that explains why multiple models help (Post 1), through the MoE-to-MoA architectural evolution that shows how to organize them (Post 2), the social choice theory that governs how to aggregate their outputs (Post 3), the reasoning architectures that structure how agents think (Post 4), the agentic RAG systems that ground agents in external knowledge (Post 5), and finally the scaling laws that determine how many agents to use and how to connect them (this post). The unifying theme: multi-agent AI is not about making bigger models. It is about making models work together, and the mathematics of collaboration — from Condorcet to Arrow to graph theory — tells us both the promise and the limits of that approach.

Further Reading

  • Brooks, F. P. (1975). The Mythical Man-Month. Addison-Wesley.
  • Qian, C. et al. (2025). Scaling large-language-model-based multi-agent collaboration. ICLR 2025.
  • Zhuge, M. et al. (2024). GPTSwarm: Language agents as optimizable graphs. ICML 2024.
  • Hoffmann, J. et al. (2022). Training compute-optimal large language models. NeurIPS 2022.
  • Fiedler, M. (1973). Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(2), 298–305.
  • Roughgarden, T. (2005). Selfish routing and the price of anarchy. MIT Press.
  • Hong, S. et al. (2024). MetaGPT: Meta programming for a multi-agent collaborative framework. ICLR 2024.