The Constant-Width Problem: When Your Error Bars Lie
Part 3 of a 10-part series on prediction intervals, conformal prediction, and leverage scores.
In Part 2, we saw that split conformal prediction delivers a finite-sample, distribution-free coverage guarantee: compute calibration residuals, take a quantile, form an interval. The coverage is at least $1-\alpha$, guaranteed.
But look at the interval again:
$$\hat{C}(x) = [\hat{f}(x) - \hat{q}, \;\; \hat{f}(x) + \hat{q}]$$

The half-width $\hat{q}$ is a single number, the same for every test point. The interval is centered at the model's prediction, but its width does not depend on $x$. This is the constant-width problem, and it is the central limitation of vanilla conformal prediction.
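To make this concrete, here is a minimal sketch of the split conformal recipe from Part 2 on synthetic data (the generator and variable names are illustrative). Note that `q_hat` is computed once and then reused, unchanged, at every test point:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Illustrative data, split into a proper training set and a calibration set.
X = rng.uniform(-3, 3, size=(1000, 1))
y = X[:, 0] + rng.normal(size=1000)
X_train, y_train, X_cal, y_cal = X[:500], y[:500], X[500:], y[500:]

alpha = 0.1
model = LinearRegression().fit(X_train, y_train)

# Calibration residuals and the conformal quantile.
resid = np.abs(y_cal - model.predict(X_cal))
n = len(resid)
level = np.ceil((n + 1) * (1 - alpha)) / n  # finite-sample correction
q_hat = np.quantile(resid, level, method="higher")

# Every test point gets the same half-width q_hat, regardless of x.
x_test = np.array([[2.5]])
pred = model.predict(x_test)
lower, upper = pred - q_hat, pred + q_hat
```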
The Hospital Temperature Problem
The Average Is Fine. The Per-Room Distribution Is Terrible.
Imagine a hospital administrator announces: "The average temperature across all rooms is a comfortable 72 degrees Fahrenheit." That sounds reassuring. But walk through the building and you find that the ICU is a sweltering 95 degrees, the surgical suite is a frigid 50, and a handful of rooms in the middle wing happen to be exactly 72. The average is truthful. The lived experience is quite different.
This is exactly what happens with constant-width prediction intervals. The average coverage across all test points is 90%, as promised. But some test points are covered 99% of the time (the intervals are unnecessarily wide), and others are covered only 70% of the time (the intervals are unreliably narrow). The guarantee is real, but it hides significant unevenness. The marginal coverage statement is not wrong — it is just not telling you the whole story.
The House Price Example
Suppose you train a model to predict house prices. Your training data contains thousands of suburban homes in the 250K–400K range and a handful of remote luxury estates worth 2M+. The model learns the suburban pattern well — it has seen thousands of similar houses with similar square footage, lot sizes, and neighborhood characteristics. But it is essentially guessing for the luxury estates, where each property is unique: a lakefront mansion, a converted historical building, a ranch on 200 acres. There is almost no comparable data for the model to generalize from, so its predictions in that region are much less reliable.
Now conformal prediction hands you a single number — say $\hat{q} = 50{,}000$ — and tells you to use it everywhere. A suburban house predicted at 350K gets the interval [300K, 400K]. A remote estate predicted at 2.5M gets the interval [2.45M, 2.55M]. The same half-width is applied to both, even though the model's accuracy differs enormously between these two settings.
- The suburban interval is too wide. The model rarely misses by more than 20K here, because the training set is dense in this region. That extra 30K of width on each side is wasted precision — it makes the interval less informative than it could be. A homebuyer seeing a 100K-wide range for a 350K house may reasonably wonder whether the model is telling them anything useful.
- The estate interval is too narrow. The model routinely misses by 200K+ on these unusual properties, because the training set contains so few comparable examples. The 50K half-width is nowhere near enough. These intervals will fail to cover the true price far more than 10% of the time, which means a buyer relying on them could be seriously misled about the property's value.
[Diagram: a single quantile applied to an easy region yields an interval that is too wide (overcoverage, ~99%) and applied to a hard region yields an interval that is too narrow (undercoverage, ~70%); the two average out to the promised 90% marginal coverage.]
A single quantile forces a tradeoff: overcoverage in easy regions subsidizes undercoverage in hard regions.
This tradeoff is not an artifact of a bad model or insufficient data. It is a structural consequence of using a single number to summarize the uncertainty across the entire feature space. To see this more clearly, it helps to visualize what constant-width intervals look like compared to what we actually want.
Constant Width vs. Adaptive Width
The following diagram contrasts what constant-width intervals look like against what we actually want. In the constant-width case, the band is a uniform ribbon around the prediction. In the adaptive case, the band breathes — narrow where the model is confident, wide where it is uncertain.
[Diagram: a constant-width band with the same width everywhere, contrasted with an ideal adaptive band that is narrow where prediction is easy and wide where it is hard.]
Left: constant-width intervals treat all predictions equally. Right: adaptive intervals allocate width where it is needed.
It is like giving every student in a class the same grade. The struggling students pass when they should not, and the top students are undervalued. The class average looks reasonable, but no individual grade is meaningful.
The Key Question
If we know that some predictions are harder than others — if we know the model has lots of training data in one region and almost none in another — why give them all the same error bar?
The answer, frustratingly, is that vanilla conformal prediction has no mechanism for adjusting. It calibrates a single quantile against the entire calibration set, and that quantile is blind to where in the feature space each point lives.
Exact conditional coverage is mathematically impossible without structural assumptions (we will formalize this in the Advanced section below). But approximate conditional coverage is very much achievable, and the gap between constant-width intervals and well-designed adaptive intervals is enormous in practice. Closing that gap is the goal of the rest of this series.
Marginal vs. Conditional Coverage
Two Kinds of Coverage
The intuitive picture from the previous section — overcoverage in easy regions subsidizing undercoverage in hard ones — has a precise mathematical formulation. It comes down to the distinction between marginal and conditional coverage, which is the conceptual engine driving everything that follows in this series.
Marginal coverage is the guarantee that conformal prediction actually provides:
$$P(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1 - \alpha$$

This probability averages over both the random draw of the test point $X_{n+1}$ and its response $Y_{n+1}$. It says: if you draw a random test point from the same distribution, the interval covers it with probability at least $1-\alpha$. Over many test points drawn from the population, the fraction covered will be at least $1-\alpha$ (at least 90% when $\alpha = 0.1$).
Conditional coverage is what we actually want:
$$P(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} = x) \geq 1 - \alpha \quad \text{for all } x$$

This says: for each specific test point $x$, the interval covers the true response with probability at least $1-\alpha$. This is a much stronger requirement. It means the interval is correctly calibrated everywhere in the feature space, not just on average.
| Property | Marginal Coverage | Conditional Coverage |
|---|---|---|
| Averages over | All test points $X$ | Fixed at each $x$ |
| Guarantee | Overall fraction $\geq 1-\alpha$ | Per-point fraction $\geq 1-\alpha$ |
| Achieved by vanilla CP | Yes (finite-sample) | No |
| Allows constant width | Yes | Only if noise is homoscedastic |
| Practical implication | 90% coverage on average | 90% coverage at every point |
| Analogy | Average hospital temp = 72°F | Every room = 72°F |
Why Constant Width Fails: A Heteroscedastic Example
Consider a one-dimensional regression where the noise variance increases with the input:
$$Y = f(X) + \varepsilon(X), \quad \text{where } \text{Var}(\varepsilon(X)) \text{ increases with } |X|$$

Points near the center have small noise, and points far from the center have large noise. A constant-width interval uses a single quantile $\hat{q}$, calibrated to be correct on average over the calibration set — a mixture of easy (center) and hard (extremes) points.
Overcoverage at the center (99%) subsidizes undercoverage at the extremes (70%). The marginal average is 90%.
The overcoverage near the center compensates for the undercoverage at the extremes, so the marginal guarantee holds. But in high-stakes applications, the undercovered regions are often exactly where predictions matter most: unusual patients, extreme market conditions, edge cases in autonomous systems.
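A small simulation makes the subsidy visible. The noise scale $\sigma(x) = 0.1 + |x|$ below is an assumption chosen for illustration, and the point predictor is taken to be the true $f$ so that the constant width is the only thing at fault:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1

def sigma(x):
    # Assumed noise scale: small near the center, large at the extremes.
    return 0.1 + np.abs(x)

def draw(n):
    x = rng.uniform(-3, 3, size=n)
    return x, np.sin(x) + sigma(x) * rng.normal(size=n)

f = np.sin  # use the true regression function as the point predictor

# Calibrate a single half-width on the calibration residuals.
x_cal, y_cal = draw(2000)
resid = np.abs(y_cal - f(x_cal))
n = len(resid)
q_hat = np.quantile(resid, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Coverage overall, in the easy center, and at the hard extremes.
x_test, y_test = draw(100_000)
covered = np.abs(y_test - f(x_test)) <= q_hat
center = np.abs(x_test) < 1
print(f"marginal:         {covered.mean():.3f}")           # close to 0.90
print(f"center |x| < 1:   {covered[center].mean():.3f}")   # well above 0.90
print(f"extreme |x| >= 1: {covered[~center].mean():.3f}")  # well below 0.90
```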
The Root Cause
The root cause is that conformal prediction uses a single quantile for all test points. The quantile $\hat{q}$ is calibrated to be correct on average over the calibration set, which is a mixture of easy and hard points. For easy points, $\hat{q}$ is too large; for hard points, $\hat{q}$ is too small.
What we need is a way to make the interval width depend on $x$: wider where prediction is harder, narrower where it is easier. This requires knowing — or estimating — what makes some predictions harder than others. The question, then, is where this information comes from.
Two Approaches to Adaptive Intervals
There are two fundamentally different strategies for making intervals adaptive, and they differ in how they answer the question above:
- Learn the difficulty. Train an auxiliary model to estimate where predictions are hard. For example, estimate the conditional variance $\text{Var}(Y \mid X=x)$ or fit quantile regressors. This is the approach taken by Conformalized Quantile Regression (CQR) and Studentized Conformal Prediction.
- Derive the difficulty from geometry. Use structural properties of the design matrix — specifically, how far each point is from the training distribution — to determine prediction difficulty without any auxiliary model. This is the leverage-based approach.
Regression (CQR)"] L --> SCP["Studentized Conformal
Prediction"] G --> LEV["Leverage scores
from the hat matrix"] G --> LB["Leverage-based
weighting"] CQR --> PRO1["+ Captures heteroscedasticity"] CQR --> CON1["- Requires training
quantile regressors"] LEV --> PRO2["+ No auxiliary model needed"] LEV --> CON2["- Only captures
geometric difficulty"] style Q fill:#f3f0ec,stroke:#a0522d,color:#1c1917 style L fill:#e3f2fd,stroke:#1565c0,color:#1c1917 style G fill:#e8f5e9,stroke:#2e7d32,color:#1c1917 style CQR fill:#e3f2fd,stroke:#1565c0,color:#1c1917 style SCP fill:#e3f2fd,stroke:#1565c0,color:#1c1917 style LEV fill:#e8f5e9,stroke:#2e7d32,color:#1c1917 style LB fill:#e8f5e9,stroke:#2e7d32,color:#1c1917 style PRO1 fill:#faf8f5,stroke:#1565c0,color:#1c1917 style CON1 fill:#faf8f5,stroke:#1565c0,color:#1c1917 style PRO2 fill:#faf8f5,stroke:#2e7d32,color:#1c1917 style CON2 fill:#faf8f5,stroke:#2e7d32,color:#1c1917
Two paradigms for adaptive intervals: learn difficulty with auxiliary models, or derive it from the geometry of the data.
The first approach is powerful but requires extra modeling effort (and the auxiliary model can itself be wrong). The second approach is elegant and free — in linear models, the hat matrix tells you exactly how hard each prediction is. Both have merits, and we will explore each in subsequent posts.
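As a preview of the first approach, here is a minimal CQR-style sketch, using scikit-learn's quantile-loss gradient boosting as the auxiliary quantile regressors (the data generator repeats the illustrative one from earlier; this is a sketch, not a complete treatment of CQR):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
alpha = 0.1

def draw(n):  # same illustrative heteroscedastic generator as above
    x = rng.uniform(-3, 3, size=(n, 1))
    return x, np.sin(x[:, 0]) + (0.1 + np.abs(x[:, 0])) * rng.normal(size=n)

X_train, y_train = draw(2000)
X_cal, y_cal = draw(1000)

# Step 1: learn the difficulty by fitting lower and upper quantile regressors.
lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_train, y_train)
hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_train, y_train)

# Step 2: conformalize. Score each calibration point by how far it falls
# outside the estimated band (negative scores mean it falls inside).
scores = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))
n = len(scores)
q_hat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Step 3: the final band inherits the x-dependent width of the regressors.
def interval(x):
    return lo.predict(x) - q_hat, hi.predict(x) + q_hat
```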
The Impossibility Theorem and What Lies Beyond
The Impossibility Result
We have seen that constant-width intervals can fail badly on a per-point basis, and that adaptive intervals are the natural remedy. But how adaptive can we actually be? Before diving into solutions, it is worth understanding the theoretical limits. The important mathematical backdrop to this entire discussion is a result established by Vovk (2012) and formalized more precisely by Barber, Candès, Ramdas, and Tibshirani (2021):
For any distribution-free prediction interval procedure that achieves marginal coverage $1-\alpha$, there exist distributions under which the conditional coverage at some points is as low as 0 and at other points is as high as 1.
More precisely: let $\hat{C}$ be any prediction set procedure satisfying $P(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1 - \alpha$ for all distributions $P$ on $(X,Y)$. Then for any $\delta > 0$, there exists a distribution $P^*$ such that:
$$P^*\bigl(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} = x\bigr) \leq \delta \quad \text{for some } x$$

$$P^*\bigl(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} = x'\bigr) \geq 1 - \delta \quad \text{for some } x'$$

Proof Intuition
The proof works by adversarial construction. Given any distribution-free method $\hat{C}$, one constructs a distribution $P^*$ that concentrates the noise variance at points where $\hat{C}$ allocates the least width. Because $\hat{C}$ must be distribution-free (it cannot "know" where $P^*$ will place its variance), the adversary can always find a distribution that exposes its weaknesses.
Concretely: if $\hat{C}$ gives narrow intervals at some point $x_0$, set $P^*$ so that $\text{Var}(Y \mid X = x_0)$ is enormous. The interval at $x_0$ will have near-zero conditional coverage. Meanwhile, at points where $\hat{C}$ gives wide intervals, set the variance to zero; the conditional coverage there approaches 1. The marginal coverage can still be $1-\alpha$ because the overcoverage at the easy points compensates, but the conditional coverage is maximally misallocated.
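The arithmetic behind the construction is easy to check numerically. Assuming Gaussian noise and an arbitrary fixed half-width of 1:

```python
from scipy.stats import norm

q_hat = 1.0  # a fixed half-width the method happens to produce

# The adversary piles variance where the interval is narrow ...
sd_hard = 50.0
print(norm.cdf(q_hat / sd_hard) - norm.cdf(-q_hat / sd_hard))  # ~0.016

# ... and removes it where the interval is wide.
sd_easy = 0.01
print(norm.cdf(q_hat / sd_easy) - norm.cdf(-q_hat / sd_easy))  # ~1.0
```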
The impossibility result does not say that all methods are equally bad — far from it. It sets a lower bound on what is achievable in the fully distribution-free setting. In practice, real data is not adversarially constructed, and even modest structural assumptions (which we discuss next) can dramatically improve conditional coverage. The space between that lower bound and what practical methods achieve is where all the interesting work lives.
Escaping Impossibility: Structural Assumptions
So the fully distribution-free setting is a dead end for conditional coverage. But the key insight is that we rarely need to be fully distribution-free. The impossibility result applies only in that extreme setting. Under structural assumptions that are weaker than full parametric models but stronger than making no assumptions at all, near-perfect conditional coverage becomes achievable:
- Scale families: If $Y = f(X) + \sigma(X) \cdot \varepsilon$ where $\varepsilon$ is independent of $X$ and $\sigma(X)$ can be estimated, then dividing by $\hat{\sigma}(X)$ makes residuals exchangeable conditional on $X$, and conditional coverage follows (see the sketch after this list).
- Smoothness: If the conditional distribution of $Y \mid X$ varies smoothly, local calibration methods can achieve conditional coverage at rate $O(n^{-\beta/(2\beta+d)})$ where $\beta$ is the smoothness and $d$ is the dimension.
- Linear models: The hat matrix provides an exact decomposition of prediction variance, enabling precise width allocation with no auxiliary modeling.
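Here is a minimal sketch of the scale-family route, a studentized conformal procedure with a plug-in $\hat{\sigma}(x)$ fit to absolute training residuals. Using gradient boosting for both models is an illustrative assumption, and the residual-based $\hat{\sigma}$ is a deliberately crude estimator:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
alpha = 0.1

def draw(n):  # same illustrative generator as before
    x = rng.uniform(-3, 3, size=(n, 1))
    return x, np.sin(x[:, 0]) + (0.1 + np.abs(x[:, 0])) * rng.normal(size=n)

X_train, y_train = draw(2000)
X_cal, y_cal = draw(1000)

# Point predictor, then a crude plug-in sigma_hat fit on absolute residuals.
f_hat = GradientBoostingRegressor().fit(X_train, y_train)
abs_resid = np.abs(y_train - f_hat.predict(X_train))
sigma_hat = GradientBoostingRegressor().fit(X_train, abs_resid)

def sig(x):  # floor the estimate to keep the ratio well-defined
    return np.maximum(sigma_hat.predict(x), 1e-6)

# Conformalize the studentized residuals |y - f_hat(x)| / sigma_hat(x).
scores = np.abs(y_cal - f_hat.predict(X_cal)) / sig(X_cal)
n = len(scores)
q_hat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# The half-width now scales with the estimated difficulty at x.
def interval(x):
    pred = f_hat.predict(x)
    return pred - q_hat * sig(x), pred + q_hat * sig(x)
```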
[Diagram: the fully distribution-free setting makes exact conditional coverage impossible. Adding structure opens the door: scale families Y = f(X) + σ(X)·ε yield near-exact conditional coverage, smoothness of P(Y|X) yields rates O(n⁻β/⁽²β⁺d⁾), and linear-model hat-matrix geometry yields an exact variance decomposition. All interesting practical methods live in this structured regime.]
The landscape of conditional coverage: impossibility sets the floor, structural assumptions open the door to practical methods.
Quantifying the Cost: Wasted Width Ratio
Structural assumptions tell us that adaptive intervals are achievable in principle. But how much do we actually lose by using constant-width intervals instead? To make this concrete, we can define a simple metric. To measure how much a constant-width method loses relative to the ideal, define the Wasted Width Ratio:
$$\text{WWR} = \frac{\mathbb{E}[\text{width}(\hat{C}(X))]}{\mathbb{E}[\text{width}(\hat{C}^*(X))]}$$

where $\hat{C}^*$ is the oracle adaptive interval that achieves exact $1-\alpha$ conditional coverage everywhere. For symmetric noise, the oracle interval has width $2 \cdot q_{1-\alpha/2} \cdot \sigma(x)$ at each point (where $q_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of the standardized noise distribution), so it is narrow where noise is small and wide where noise is large.
For a constant-width interval, the WWR is always $\geq 1$, and it grows with the heterogeneity of the noise. In a linear model with heteroscedastic noise driven by leverage:
- If leverage scores range from 0.01 to 0.3, the WWR can exceed 2, meaning the constant-width interval uses more than twice the average width needed for the same marginal coverage.
- In high-dimensional settings where $p/n$ is appreciable, leverage heterogeneity is large, and the WWR grows further.
The WWR is not just a theoretical curiosity. In practice, excess width translates directly to less informative intervals: a doctor looking at a prediction interval twice as wide as necessary will dismiss it as useless, even if the coverage guarantee is valid. Tightness matters, not just coverage.
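To put a number on the WWR, here is a quick Monte Carlo estimate under the illustrative heteroscedastic model from earlier, assuming Gaussian noise so that the oracle half-width $z_{1-\alpha/2}\,\sigma(x)$ is available in closed form:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
alpha = 0.1

# The illustrative noise scale from earlier, over uniform X on [-3, 3].
x = rng.uniform(-3, 3, size=1_000_000)
sigma = 0.1 + np.abs(x)

# Population-level constant half-width: the (1 - alpha) quantile of |noise|.
q_const = np.quantile(sigma * np.abs(rng.normal(size=x.size)), 1 - alpha)

# Oracle half-width at each point: z_{1 - alpha/2} * sigma(x).
z = norm.ppf(1 - alpha / 2)

wwr = (2 * q_const) / (2 * z * sigma).mean()
print(f"WWR = {wwr:.2f}")  # above 1; grows with the heterogeneity of sigma
```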
The Road Ahead
The impossibility result tells us we cannot have everything. But it also tells us that the gap between "marginal coverage with constant width" and "near-conditional coverage with adaptive width" is exactly the gap that structural assumptions can close. In linear models, the hat matrix provides all the structure we need. That is where we go next.
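As a teaser, leverage scores are just the diagonal of the hat matrix and can be computed in a few lines (the design matrix here is synthetic, with one deliberately unusual row):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic design matrix with one unusual, far-out row.
X = rng.normal(size=(100, 3))
X[0] *= 8

# Leverage scores are the diagonal of the hat matrix H = X (X'X)^{-1} X'.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

print(f"typical leverage: {np.median(leverage):.3f}")  # near p/n = 0.03
print(f"unusual row:      {leverage[0]:.3f}")          # far larger
```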
Further Reading
- Vovk, V. (2012). Conditional validity of inductive conformal predictors. Asian Conference on Machine Learning, 475–490.
- Barber, R. F., Candès, E. J., Ramdas, A., & Tibshirani, R. J. (2021). The limits of distribution-free conditional predictive inference. Information and Inference, 10(2), 455–482.
- Angelopoulos, A. N., & Bates, S. (2023). Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 16(4), 494–591.