The Constant-Width Problem: When Your Error Bars Lie
Part 3 of a 10-part series on prediction intervals, conformal prediction, and leverage scores.
In Part 2, we saw that split conformal prediction delivers a finite-sample, distribution-free coverage guarantee: compute calibration residuals, take a quantile, form an interval. The coverage is at least $1-\alpha$, guaranteed.
But look at the interval again:
$$\hat{C}(x) = [\hat{f}(x) - \hat{q}, \;\; \hat{f}(x) + \hat{q}]$$

The half-width $\hat{q}$ is a single number, the same for every test point. The interval is centered at the model's prediction, but its width does not depend on $x$. This is the constant-width problem, and it is the central limitation of vanilla conformal prediction.
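To make this concrete, here is a minimal sketch of the split conformal recipe from Part 2 on synthetic data (the generator and variable names are illustrative). Note that `q_hat` is computed once and then reused, unchanged, at every test point:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Illustrative data, split into a proper training set and a calibration set.
X = rng.uniform(-3, 3, size=(1000, 1))
y = X[:, 0] + rng.normal(size=1000)
X_train, y_train, X_cal, y_cal = X[:500], y[:500], X[500:], y[500:]

alpha = 0.1
model = LinearRegression().fit(X_train, y_train)

# Calibration residuals and the conformal quantile.
resid = np.abs(y_cal - model.predict(X_cal))
n = len(resid)
level = np.ceil((n + 1) * (1 - alpha)) / n  # finite-sample correction
q_hat = np.quantile(resid, level, method="higher")

# Every test point gets the same half-width q_hat, regardless of x.
x_test = np.array([[2.5]])
pred = model.predict(x_test)
lower, upper = pred - q_hat, pred + q_hat
```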
The Hospital Temperature Problem
The Average Is Fine. The Per-Room Distribution Is Terrible.
Imagine a hospital administrator announces: "The average temperature across all rooms is a comfortable 72 degrees Fahrenheit." That sounds reassuring. But walk through the building and you find that the ICU is a sweltering 95 degrees, the surgical suite is a frigid 50, and a handful of rooms in the middle wing happen to be exactly 72. The average is truthful. The lived experience is quite different.
This is exactly what happens with constant-width prediction intervals. The average coverage across all test points is 90%, as promised. But some test points are covered 99% of the time (the intervals are unnecessarily wide), and others are covered only 70% of the time (the intervals are unreliably narrow). The guarantee is real, but it hides significant unevenness. The marginal coverage statement is not wrong — it is just not telling you the whole story.
The House Price Example
Suppose you train a model to predict house prices. Your training data contains thousands of suburban homes in the 250K–400K range and a handful of remote luxury estates worth 2M+. The model learns the suburban pattern well — it has seen thousands of similar houses with similar square footage, lot sizes, and neighborhood characteristics. But it is essentially guessing for the luxury estates, where each property is unique: a lakefront mansion, a converted historical building, a ranch on 200 acres. There is almost no comparable data for the model to generalize from, so its predictions in that region are much less reliable.
Now conformal prediction hands you a single number — say $\hat{q} = 50{,}000$ — and tells you to use it everywhere. A suburban house predicted at 350K gets the interval [300K, 400K]. A remote estate predicted at 2.5M gets the interval [2.45M, 2.55M]. The same half-width is applied to both, even though the model's accuracy differs enormously between these two settings.
- The suburban interval is too wide. The model rarely misses by more than 20K here, because the training set is dense in this region. That extra 30K of width on each side is wasted precision — it makes the interval less informative than it could be. A homebuyer seeing a 100K-wide range for a 350K house may reasonably wonder whether the model is telling them anything useful.
- The estate interval is too narrow. The model routinely misses by 200K+ on these unusual properties, because the training set contains so few comparable examples. The 50K half-width is nowhere near enough. These intervals will fail to cover the true price far more than 10% of the time, which means a buyer relying on them could be seriously misled about the property's value.
[Diagram: a single quantile applied to an easy region yields an interval that is too wide (overcoverage, ~99%) and applied to a hard region yields an interval that is too narrow (undercoverage, ~70%); the two average out to the promised 90% marginal coverage.]
A single quantile forces a tradeoff: overcoverage in easy regions subsidizes undercoverage in hard regions.
This tradeoff is not an artifact of a bad model or insufficient data. It is a structural consequence of using a single number to summarize the uncertainty across the entire feature space. To see this more clearly, it helps to visualize what constant-width intervals look like compared to what we actually want.
Constant Width vs. Adaptive Width
The following diagram contrasts what constant-width intervals look like against what we actually want. In the constant-width case, the band is a uniform ribbon around the prediction. In the adaptive case, the band breathes — narrow where the model is confident, wide where it is uncertain.
[Diagram: a constant-width band with the same width everywhere, contrasted with an ideal adaptive band that is narrow where prediction is easy and wide where it is hard.]
Left: constant-width intervals treat all predictions equally. Right: adaptive intervals allocate width where it is needed.
It is like giving every student in a class the same grade. The struggling students pass when they should not, and the top students are undervalued. The class average looks reasonable, but no individual grade is meaningful.
The Key Question
If we know that some predictions are harder than others — if we know the model has lots of training data in one region and almost none in another — why give them all the same error bar?
The answer, frustratingly, is that vanilla conformal prediction has no mechanism for adjusting. It calibrates a single quantile against the entire calibration set, and that quantile is blind to where in the feature space each point lives.
Exact conditional coverage is mathematically impossible without structural assumptions (we will formalize this in the Advanced section below). But approximate conditional coverage is very much achievable, and the gap between constant-width intervals and well-designed adaptive intervals is enormous in practice. Closing that gap is the goal of the rest of this series.
Marginal vs. Conditional Coverage
Two Kinds of Coverage
The intuitive picture from the previous section — overcoverage in easy regions subsidizing undercoverage in hard ones — has a precise mathematical formulation. It comes down to the distinction between marginal and conditional coverage, which is the conceptual engine driving everything that follows in this series.
Marginal coverage is the guarantee that conformal prediction actually provides:
$$P(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1 - \alpha$$

This probability averages over both the random draw of the test point $X_{n+1}$ and its response $Y_{n+1}$. It says: if you draw a random test point from the same distribution, the interval covers it with probability at least $1-\alpha$. Over many test points drawn from the population, the fraction covered will be at least $1-\alpha$ (at least 90% when $\alpha = 0.1$).
Conditional coverage is what we actually want:
$$P(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} = x) \geq 1 - \alpha \quad \text{for all } x$$

This says: for each specific test point $x$, the interval covers the true response with probability at least $1-\alpha$. This is a much stronger requirement. It means the interval is correctly calibrated everywhere in the feature space, not just on average.
| Property | Marginal Coverage | Conditional Coverage |
|---|---|---|
| Averages over | All test points $X$ | Fixed at each $x$ |
| Guarantee | Overall fraction $\geq 1-\alpha$ | Per-point fraction $\geq 1-\alpha$ |
| Achieved by vanilla CP | Yes (finite-sample) | No |
| Allows constant width | Yes | Only if noise is homoscedastic |
| Practical implication | 90% coverage on average | 90% coverage at every point |
| Analogy | Average hospital temp = 72°F | Every room = 72°F |
Why Constant Width Fails: A Heteroscedastic Example
Consider a one-dimensional regression where the noise variance increases with the input:
$$Y = f(X) + \varepsilon(X), \quad \text{where } \text{Var}(\varepsilon(X)) \text{ increases with } |X|$$

Points near the center have small noise, and points far from the center have large noise. A constant-width interval uses a single quantile $\hat{q}$, calibrated to be correct on average over the calibration set — a mixture of easy (center) and hard (extremes) points.
Overcoverage at the center (99%) subsidizes undercoverage at the extremes (70%). The marginal average is 90%.
The overcoverage near the center compensates for the undercoverage at the extremes, so the marginal guarantee holds. But in high-stakes applications, the undercovered regions are often exactly where predictions matter most: unusual patients, extreme market conditions, edge cases in autonomous systems.
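A small simulation makes the subsidy visible. The noise scale $\sigma(x) = 0.1 + |x|$ below is an assumption chosen for illustration, and the point predictor is taken to be the true $f$ so that the constant width is the only thing at fault:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1

def sigma(x):
    # Assumed noise scale: small near the center, large at the extremes.
    return 0.1 + np.abs(x)

def draw(n):
    x = rng.uniform(-3, 3, size=n)
    return x, np.sin(x) + sigma(x) * rng.normal(size=n)

f = np.sin  # use the true regression function as the point predictor

# Calibrate a single half-width on the calibration residuals.
x_cal, y_cal = draw(2000)
resid = np.abs(y_cal - f(x_cal))
n = len(resid)
q_hat = np.quantile(resid, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Coverage overall, in the easy center, and at the hard extremes.
x_test, y_test = draw(100_000)
covered = np.abs(y_test - f(x_test)) <= q_hat
center = np.abs(x_test) < 1
print(f"marginal:         {covered.mean():.3f}")           # close to 0.90
print(f"center |x| < 1:   {covered[center].mean():.3f}")   # well above 0.90
print(f"extreme |x| >= 1: {covered[~center].mean():.3f}")  # well below 0.90
```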
The Root Cause
The root cause is that conformal prediction uses a single quantile for all test points. The quantile $\hat{q}$ is calibrated to be correct on average over the calibration set, which is a mixture of easy and hard points. For easy points, $\hat{q}$ is too large; for hard points, $\hat{q}$ is too small.
What we need is a way to make the interval width depend on $x$: wider where prediction is harder, narrower where it is easier. This requires knowing — or estimating — what makes some predictions harder than others. The question, then, is where this information comes from.
Two Approaches to Adaptive Intervals
There are two fundamentally different strategies for making intervals adaptive, and they differ in how they answer the question above:
- Learn the difficulty. Train an auxiliary model to estimate where predictions are hard. For example, estimate the conditional variance $\text{Var}(Y \mid X=x)$ or fit quantile regressors. This is the approach taken by Conformalized Quantile Regression (CQR) and Studentized Conformal Prediction.
- Derive the difficulty from geometry. Use structural properties of the design matrix — specifically, how far each point is from the training distribution — to determine prediction difficulty without any auxiliary model. This is the leverage-based approach.
Regression (CQR)"] L --> SCP["Studentized Conformal
Prediction"] G --> LEV["Leverage scores
from the hat matrix"] G --> LB["Leverage-based
weighting"] CQR --> PRO1["+ Captures heteroscedasticity"] CQR --> CON1["- Requires training
quantile regressors"] LEV --> PRO2["+ No auxiliary model needed"] LEV --> CON2["- Only captures
geometric difficulty"] style Q fill:#f3f0ec,stroke:#a0522d,color:#1c1917 style L fill:#e3f2fd,stroke:#1565c0,color:#1c1917 style G fill:#e8f5e9,stroke:#2e7d32,color:#1c1917 style CQR fill:#e3f2fd,stroke:#1565c0,color:#1c1917 style SCP fill:#e3f2fd,stroke:#1565c0,color:#1c1917 style LEV fill:#e8f5e9,stroke:#2e7d32,color:#1c1917 style LB fill:#e8f5e9,stroke:#2e7d32,color:#1c1917 style PRO1 fill:#faf8f5,stroke:#1565c0,color:#1c1917 style CON1 fill:#faf8f5,stroke:#1565c0,color:#1c1917 style PRO2 fill:#faf8f5,stroke:#2e7d32,color:#1c1917 style CON2 fill:#faf8f5,stroke:#2e7d32,color:#1c1917
Two paradigms for adaptive intervals: learn difficulty with auxiliary models, or derive it from the geometry of the data.
The first approach is powerful but requires extra modeling effort (and the auxiliary model can itself be wrong). The second approach is elegant and free — in linear models, the hat matrix tells you exactly how hard each prediction is. Both have merits, and we will explore each in subsequent posts.
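As a preview of the first approach, here is a minimal CQR-style sketch, using scikit-learn's quantile-loss gradient boosting as the auxiliary quantile regressors (the data generator repeats the illustrative one from earlier; this is a sketch, not a complete treatment of CQR):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
alpha = 0.1

def draw(n):  # same illustrative heteroscedastic generator as above
    x = rng.uniform(-3, 3, size=(n, 1))
    return x, np.sin(x[:, 0]) + (0.1 + np.abs(x[:, 0])) * rng.normal(size=n)

X_train, y_train = draw(2000)
X_cal, y_cal = draw(1000)

# Step 1: learn the difficulty by fitting lower and upper quantile regressors.
lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_train, y_train)
hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_train, y_train)

# Step 2: conformalize. Score each calibration point by how far it falls
# outside the estimated band (negative scores mean it falls inside).
scores = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))
n = len(scores)
q_hat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Step 3: the final band inherits the x-dependent width of the regressors.
def interval(x):
    return lo.predict(x) - q_hat, hi.predict(x) + q_hat
```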
The Impossibility Theorem and What Lies Beyond
The Impossibility Result
We have seen that constant-width intervals can fail badly on a per-point basis, and that adaptive intervals are the natural remedy. But how adaptive can we actually be? Before diving into solutions, it is worth understanding the theoretical limits. The important mathematical backdrop to this entire discussion is a result established by Vovk (2012) and formalized more precisely by Barber, Candès, Ramdas, and Tibshirani (2021):
For any distribution-free prediction interval procedure that achieves marginal coverage $1-\alpha$, there exist distributions under which the conditional coverage at some points is as low as 0 and at other points is as high as 1.
More precisely: let $\hat{C}$ be any prediction set procedure satisfying $P(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1 - \alpha$ for all distributions $P$ on $(X,Y)$. Then for any $\delta > 0$, there exists a distribution $P^*$ such that:
$$P^*\bigl(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} = x\bigr) \leq \delta \quad \text{for some } x$$

$$P^*\bigl(Y_{n+1} \in \hat{C}(X_{n+1}) \mid X_{n+1} = x'\bigr) \geq 1 - \delta \quad \text{for some } x'$$

Proof Intuition
The proof works by adversarial construction. Given any distribution-free method $\hat{C}$, one constructs a distribution $P^*$ that concentrates the noise variance at points where $\hat{C}$ allocates the least width. Because $\hat{C}$ must be distribution-free (it cannot "know" where $P^*$ will place its variance), the adversary can always find a distribution that exposes its weaknesses.
Concretely: if $\hat{C}$ gives narrow intervals at some point $x_0$, set $P^*$ so that $\text{Var}(Y \mid X = x_0)$ is enormous. The interval at $x_0$ will have near-zero conditional coverage. Meanwhile, at points where $\hat{C}$ gives wide intervals, set the variance to zero; the conditional coverage there approaches 1. The marginal coverage can still be $1-\alpha$ because the overcoverage at the easy points compensates, but the conditional coverage is maximally misallocated.
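The arithmetic behind the construction is easy to check numerically. Assuming Gaussian noise and an arbitrary fixed half-width of 1:

```python
from scipy.stats import norm

q_hat = 1.0  # a fixed half-width the method happens to produce

# The adversary piles variance where the interval is narrow ...
sd_hard = 50.0
print(norm.cdf(q_hat / sd_hard) - norm.cdf(-q_hat / sd_hard))  # ~0.016

# ... and removes it where the interval is wide.
sd_easy = 0.01
print(norm.cdf(q_hat / sd_easy) - norm.cdf(-q_hat / sd_easy))  # ~1.0
```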
The impossibility result does not say that all methods are equally bad — far from it. It sets a lower bound on what is achievable in the fully distribution-free setting. In practice, real data is not adversarially constructed, and even modest structural assumptions (which we discuss next) can dramatically improve conditional coverage. The space between that lower bound and what practical methods achieve is where all the interesting work lives.
Escaping Impossibility: Structural Assumptions
So the fully distribution-free setting is a dead end for conditional coverage. But the key insight is that we rarely need to be fully distribution-free. The impossibility result applies only in that extreme setting. Under structural assumptions that are weaker than full parametric models but stronger than making no assumptions at all, near-perfect conditional coverage becomes achievable:
- Scale families: If $Y = f(X) + \sigma(X) \cdot \varepsilon$ where $\varepsilon$ is independent of $X$ and $\sigma(X)$ can be estimated, then dividing by $\hat{\sigma}(X)$ makes residuals exchangeable conditional on $X$, and conditional coverage follows (see the sketch after this list).
- Smoothness: If the conditional distribution of $Y \mid X$ varies smoothly, local calibration methods can achieve conditional coverage at rate $O(n^{-\beta/(2\beta+d)})$ where $\beta$ is the smoothness and $d$ is the dimension.
- Linear models: The hat matrix provides an exact decomposition of prediction variance, enabling precise width allocation with no auxiliary modeling.
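Here is a minimal sketch of the scale-family route, a studentized conformal procedure with a plug-in $\hat{\sigma}(x)$ fit to absolute training residuals. Using gradient boosting for both models is an illustrative assumption, and the residual-based $\hat{\sigma}$ is a deliberately crude estimator:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
alpha = 0.1

def draw(n):  # same illustrative generator as before
    x = rng.uniform(-3, 3, size=(n, 1))
    return x, np.sin(x[:, 0]) + (0.1 + np.abs(x[:, 0])) * rng.normal(size=n)

X_train, y_train = draw(2000)
X_cal, y_cal = draw(1000)

# Point predictor, then a crude plug-in sigma_hat fit on absolute residuals.
f_hat = GradientBoostingRegressor().fit(X_train, y_train)
abs_resid = np.abs(y_train - f_hat.predict(X_train))
sigma_hat = GradientBoostingRegressor().fit(X_train, abs_resid)

def sig(x):  # floor the estimate to keep the ratio well-defined
    return np.maximum(sigma_hat.predict(x), 1e-6)

# Conformalize the studentized residuals |y - f_hat(x)| / sigma_hat(x).
scores = np.abs(y_cal - f_hat.predict(X_cal)) / sig(X_cal)
n = len(scores)
q_hat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# The half-width now scales with the estimated difficulty at x.
def interval(x):
    pred = f_hat.predict(x)
    return pred - q_hat * sig(x), pred + q_hat * sig(x)
```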
[Diagram: the fully distribution-free setting makes exact conditional coverage impossible. Adding structure opens the door: scale families Y = f(X) + σ(X)·ε yield near-exact conditional coverage, smoothness of P(Y|X) yields rates O(n⁻β/⁽²β⁺d⁾), and linear-model hat-matrix geometry yields an exact variance decomposition. All interesting practical methods live in this structured regime.]
The landscape of conditional coverage: impossibility sets the floor, structural assumptions open the door to practical methods.
Quantifying the Cost: Wasted Width Ratio
Structural assumptions tell us that adaptive intervals are achievable in principle. But how much do we actually lose by using constant-width intervals instead? To make this concrete, we can define a simple metric. To measure how much a constant-width method loses relative to the ideal, define the Wasted Width Ratio:
$$\text{WWR} = \frac{\mathbb{E}[\text{width}(\hat{C}(X))]}{\mathbb{E}[\text{width}(\hat{C}^*(X))]}$$

where $\hat{C}^*$ is the oracle adaptive interval that achieves exact $1-\alpha$ conditional coverage everywhere. For symmetric noise, the oracle interval has width $2 \cdot q_{1-\alpha/2} \cdot \sigma(x)$ at each point (where $q_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of the standardized noise distribution), so it is narrow where noise is small and wide where noise is large.
For a constant-width interval, the WWR is always $\geq 1$, and it grows with the heterogeneity of the noise. In a linear model with heteroscedastic noise driven by leverage:
- If leverage scores range from 0.01 to 0.3, the WWR can exceed 2, meaning the constant-width interval uses more than twice the average width needed for the same marginal coverage.
- In high-dimensional settings where $p/n$ is appreciable, leverage heterogeneity is large, and the WWR grows further.
The WWR is not just a theoretical curiosity. In practice, excess width translates directly to less informative intervals: a doctor looking at a prediction interval twice as wide as necessary will dismiss it as useless, even if the coverage guarantee is valid. Tightness matters, not just coverage.
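To put a number on the WWR, here is a quick Monte Carlo estimate under the illustrative heteroscedastic model from earlier, assuming Gaussian noise so that the oracle half-width $z_{1-\alpha/2}\,\sigma(x)$ is available in closed form:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
alpha = 0.1

# The illustrative noise scale from earlier, over uniform X on [-3, 3].
x = rng.uniform(-3, 3, size=1_000_000)
sigma = 0.1 + np.abs(x)

# Population-level constant half-width: the (1 - alpha) quantile of |noise|.
q_const = np.quantile(sigma * np.abs(rng.normal(size=x.size)), 1 - alpha)

# Oracle half-width at each point: z_{1 - alpha/2} * sigma(x).
z = norm.ppf(1 - alpha / 2)

wwr = (2 * q_const) / (2 * z * sigma).mean()
print(f"WWR = {wwr:.2f}")  # above 1; grows with the heterogeneity of sigma
```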
The Road Ahead
The impossibility result tells us we cannot have everything. But it also tells us that the gap between "marginal coverage with constant width" and "near-conditional coverage with adaptive width" is exactly the gap that structural assumptions can close. In linear models, the hat matrix provides all the structure we need. That is where we go next.
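As a teaser, leverage scores are just the diagonal of the hat matrix and can be computed in a few lines (the design matrix here is synthetic, with one deliberately unusual row):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic design matrix with one unusual, far-out row.
X = rng.normal(size=(100, 3))
X[0] *= 8

# Leverage scores are the diagonal of the hat matrix H = X (X'X)^{-1} X'.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

print(f"typical leverage: {np.median(leverage):.3f}")  # near p/n = 0.03
print(f"unusual row:      {leverage[0]:.3f}")          # far larger
```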
Further Reading
- Vovk, V. (2012). Conditional validity of inductive conformal predictors. Asian Conference on Machine Learning, 475–490.
- Barber, R. F., Candès, E. J., Ramdas, A., & Tibshirani, R. J. (2021). The limits of distribution-free conditional predictive inference. Information and Inference, 10(2), 455–482.
- Angelopoulos, A. N., & Bates, S. (2023). Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 16(4), 494–591.