Heteroscedasticity and Variance Stabilization

Part 7 of a 10-part series on prediction intervals, conformal prediction, and leverage scores.

This is part of the series From Predictions to Prediction Intervals.

In Part 6, we saw that prediction error variance depends on leverage: $\text{Var}(Y_{\text{new}} - \hat{Y}(x)) = \sigma^2(1 + h(x))$. This means that residuals from different points in feature space are not directly comparable — some are drawn from high-variance distributions and others from low-variance distributions. If we want conformal prediction intervals that are well-calibrated across the entire feature space, we need a way to put all residuals on a common scale. In this post, we develop the standard tool for doing so: variance stabilization.

Intuitive

Grading on a Curve for Each Region

The Exam Grading Analogy

Imagine comparing test scores from two different classes. Class A had a straightforward exam: scores range from 85 to 100, and most students cluster around 93. Class B had a much harder exam: scores range from 40 to 90, and most students cluster around 65.

Now suppose a student from each class scores 85. In Class A, that is an unremarkable result — right around the middle. In Class B, it places the student well above the class average. The raw number 85 is hard to interpret on its own; you need to know the exam's difficulty to give it context. To compare students fairly, you adjust for the difficulty: divide each score by its class's standard deviation, or convert to a percentile within the class.

The same issue arises with model errors. Some predictions are inherently harder (more variable) than others, and a residual of a given size means different things depending on the local difficulty. A residual of 10 at an easy-to-predict point suggests the model performed poorly. But a residual of 10 at a hard-to-predict point may be well within the expected range of variation — not a sign of model failure at all. Without knowing the local difficulty, we cannot tell whether a particular residual is large or small in a meaningful sense.

Variance stabilization is essentially "grading on a curve" for each region of feature space. It adjusts each residual by the local difficulty so that all residuals become directly comparable. Once this adjustment is made, a residual of 10 means the same thing everywhere — whether the prediction was easy or hard.

```mermaid
flowchart LR
    A["Raw Residuals (unfair comparison): different difficulty at each point"]
    B["Divide by local difficulty √g(x)"]
    C["Stabilized Residuals (fair comparison): same scale everywhere"]
    A --> B --> C
    style A fill:#fce4ec,stroke:#c62828,color:#1c1917
    style B fill:#e3f2fd,stroke:#1565c0,color:#1c1917
    style C fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
```

Variance stabilization: divide each residual by the local difficulty to make all residuals comparable.

Why Leverage Creates Unequal Difficulty

Even when the noise in the data is perfectly constant — every observation has the same amount of random error — the model's prediction errors still have different variances at different points. This is the leverage effect from Part 6. Points far from the training data center have high leverage, and the model is more uncertain about its predictions there. Points near the center have low leverage, and the model is more confident.

The practical consequence is that raw residuals naturally fan out as leverage increases: high-leverage points produce larger residuals on average, not because the data is noisier there, but because the model is less certain. If you plotted residuals against leverage, you would see a funnel shape widening to the right. After dividing by the difficulty factor $\sqrt{1+h}$, this funnel collapses into a horizontal band where all residuals have the same spread. The stabilized residuals can then be compared on a level footing.

```mermaid
flowchart TD
    subgraph Before["Before: Raw Residuals"]
        direction TB
        B1["Low leverage: small residuals"]
        B2["Medium leverage: moderate residuals"]
        B3["High leverage: large residuals"]
        B1 --- B2 --- B3
    end
    subgraph After["After: Stabilized Residuals"]
        direction TB
        A1["Low leverage: same scale"]
        A2["Medium leverage: same scale"]
        A3["High leverage: same scale"]
        A1 --- A2 --- A3
    end
    Before -->|"Divide by √(1+h)"| After
    style Before fill:#fce4ec,stroke:#c62828,color:#1c1917
    style After fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style B1 fill:#fce4ec,stroke:#c62828,color:#1c1917
    style B2 fill:#fce4ec,stroke:#c62828,color:#1c1917
    style B3 fill:#fce4ec,stroke:#c62828,color:#1c1917
    style A1 fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style A2 fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style A3 fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
```

Raw residuals fan out with leverage (left). After dividing by the difficulty factor, they collapse into a uniform band (right).

Analogy

Comparing raw residuals across the feature space is like comparing marathon times at sea level vs. at high altitude. A 3:30 marathon at sea level is a solid recreational time. A 3:30 marathon at 10,000 feet elevation is a notably stronger performance, because the thinner air makes the same pace harder to sustain. To rank runners fairly, you need to adjust for the conditions before comparing their times.

Key Insight

Variance stabilization transforms heteroscedastic (varying difficulty) residuals into homoscedastic (equal difficulty) ones. Once stabilized, a single threshold can fairly judge all residuals — which is exactly what conformal prediction needs.

Technical

Scale Families, VSTs, and Weight Functions

Homoscedasticity vs. Heteroscedasticity

In regression, homoscedasticity means the error variance is constant across all observations:

$$\text{Var}(\varepsilon_i \mid X_i) = \sigma^2 \quad \text{for all } i$$

Heteroscedasticity means the error variance depends on the features:

$$\text{Var}(\varepsilon_i \mid X_i) = \sigma^2 \, g(X_i) \quad \text{for some function } g$$

The function $g$ captures the local difficulty: where $g(x) > 1$, errors are larger than average; where $g(x) < 1$, errors are smaller. In most real data, some form of heteroscedasticity is the rule rather than the exception.

Scale Families

A particularly clean form of heteroscedasticity is the scale family. In a scale family, the shape of the error distribution is the same everywhere — only the scale changes:

$$\varepsilon_i = \sigma \sqrt{g(X_i)} \cdot \eta_i, \quad \eta_i \stackrel{iid}{\sim} F_\eta$$

where $g$ is a known or estimable function, and the $\eta_i$ are identically distributed with mean 0 and variance 1. The key property is that $\varepsilon_i / \sqrt{g(X_i)}$ has the same distribution for all $i$ — not just the same variance, but the same shape and the same quantiles. This is a stronger condition than equal variance alone, and it is what makes scale families particularly well-suited to conformal prediction.
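
This distributional claim is easy to check by simulation. The sketch below assumes Gaussian $\eta$ and a made-up difficulty function $g(x) = 1 + x^2$ (both chosen purely for illustration): dividing by $\sqrt{g(X_i)}$ equalizes the spread between easy and hard regions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100_000, 2.0

# Scale family: eps_i = sigma * sqrt(g(X_i)) * eta_i, eta_i iid N(0, 1).
# g(x) = 1 + x^2 is a hypothetical difficulty function for illustration.
x = rng.uniform(-2, 2, size=n)
g = 1.0 + x**2
eps = sigma * np.sqrt(g) * rng.standard_normal(n)

easy, hard = np.abs(x) < 0.5, np.abs(x) > 1.5

# Raw errors: the hard region is roughly twice as spread out as the easy one.
print(eps[easy].std(), eps[hard].std())

# Stabilized errors eps / sqrt(g): same scale (and quantiles) everywhere.
eps_tilde = eps / np.sqrt(g)
print(eps_tilde[easy].std(), eps_tilde[hard].std())
```

Both stabilized standard deviations land near $\sigma = 2$, while the raw ones differ by roughly a factor of two.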

Variance-Stabilizing Transformations

Given a scale family with known $g$, we can stabilize the variance by dividing out the scale factor. If $Z_i$ has variance proportional to $g(X_i)$, define:

$$\tilde{Z}_i = \frac{Z_i}{\sqrt{g(X_i)}}$$

The transformed variables $\tilde{Z}_i$ have constant variance. More importantly, if the $Z_i$ form a scale family, the $\tilde{Z}_i$ are identically distributed — exchangeable in the sense conformal prediction requires. To give a concrete example: if the prediction error at a point with leverage $h = 0.3$ has standard deviation $\sigma\sqrt{1.3}$, dividing by $\sqrt{1.3}$ rescales it back to standard deviation $\sigma$. A point with leverage $h = 0.05$ has standard deviation $\sigma\sqrt{1.05} \approx \sigma$, so the division barely changes it. The transformation applies a large correction where the variance is large and a small correction where the variance is already close to the baseline.

VST in Conformal Prediction

Apply this to conformal prediction. Instead of raw absolute residuals $S_i = |Y_i - \hat{f}(X_i)|$, use the stabilized scores:

$$\tilde{S}_i = \frac{|Y_i - \hat{f}(X_i)|}{\sqrt{g(X_i)}}$$

If the errors form a scale family, these stabilized scores are identically distributed across the calibration set. Computing the conformal quantile on these stabilized scores yields a threshold that reflects the common scale rather than a mixture of different scales. The resulting prediction interval then adapts to local difficulty when we invert the transformation:

$$\hat{C}(x) = \left[\hat{f}(x) - \hat{q} \cdot \sqrt{g(x)}, \;\; \hat{f}(x) + \hat{q} \cdot \sqrt{g(x)}\right]$$

The interval is wider where $g(x)$ is large (harder predictions) and narrower where $g(x)$ is small (easier predictions).
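
The whole pipeline fits in a few lines. Below is a minimal sketch of split conformal prediction with stabilized scores, assuming the variance function $g$ is known exactly; the function name and the synthetic data are illustrative, not from the series.

```python
import numpy as np

def weighted_conformal_interval(predict, X_cal, y_cal, X_new, g, alpha=0.1):
    """Split-conformal intervals from variance-stabilized scores.

    `predict` is any fitted point predictor and `g` the (assumed known)
    variance function; the helper name is illustrative, not a standard API.
    """
    # Stabilized nonconformity scores: |residual| / sqrt(g(x)).
    scores = np.abs(y_cal - predict(X_cal)) / np.sqrt(g(X_cal))

    # Finite-sample conformal quantile of the stabilized scores.
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q_hat = np.quantile(scores, level)

    # Invert the transformation: intervals widen where g is large.
    center = predict(X_new)
    half = q_hat * np.sqrt(g(X_new))
    return center - half, center + half

# Demo with g(x) = 1 + x^2 treated as known, on synthetic data.
rng = np.random.default_rng(0)
g = lambda x: 1.0 + x**2
X_tr, X_cal, X_te = (rng.uniform(-2, 2, 2000) for _ in range(3))
y_tr, y_cal, y_te = (2 * x + np.sqrt(g(x)) * rng.standard_normal(len(x))
                     for x in (X_tr, X_cal, X_te))

slope = X_tr @ y_tr / (X_tr @ X_tr)          # one-feature least squares
predict = lambda x: slope * x

lo, hi = weighted_conformal_interval(predict, X_cal, y_cal, X_te, g)
print(f"coverage: {np.mean((y_te >= lo) & (y_te <= hi)):.3f}")
```

Empirical coverage comes out close to the nominal 90%, and the interval width grows with $|x|$, tracking $\sqrt{g(x)}$.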

```mermaid
flowchart LR
    R["Raw scores |Y - f̂(X)|"]
    W["Weight function w(x) = 1/√g(x)"]
    S["Stabilized scores |Y - f̂(X)| · w(x)"]
    Q["Conformal quantile q̂"]
    I["Adaptive interval f̂(x) ± q̂ / w(x)"]
    R --> W --> S --> Q --> I
    style R fill:#fce4ec,stroke:#c62828,color:#1c1917
    style W fill:#e3f2fd,stroke:#1565c0,color:#1c1917
    style S fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style Q fill:#f3f0ec,stroke:#a0522d,color:#1c1917
    style I fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
```

The variance stabilization pipeline: raw scores are multiplied by the weight function, a quantile is computed, and the interval width is scaled by 1/w(x).

The Weight Function Perspective

An equivalent way to express variance stabilization is through a weight function $w$:

$$w(x) = \frac{1}{\sqrt{g(x)}}$$

The stabilized score is $\tilde{S}_i = |Y_i - \hat{f}(X_i)| \cdot w(X_i)$, and the prediction interval width at $x$ is $2\hat{q} / w(x)$. Different weight functions correspond to different assumptions about the variance structure:

```mermaid
flowchart TD
    WF["Weight Function w(x)"]
    WF --> W1["w = 1: constant weight"]
    WF --> W2["w = 1/σ̂(x): inverse estimated standard deviation"]
    W1 --> A1["Vanilla CP: assumes constant variance"]
    W2 --> A2["Studentized CP: estimates g from data"]
    A1 --> P1["No extra model needed, but ignores heteroscedasticity"]
    A2 --> P2["Captures noise heteroscedasticity, but needs an auxiliary model and risks sign flip"]
    style WF fill:#f3f0ec,stroke:#a0522d,color:#1c1917
    style W1 fill:#e3f2fd,stroke:#1565c0,color:#1c1917
    style W2 fill:#e3f2fd,stroke:#1565c0,color:#1c1917
    style A1 fill:#faf8f5,stroke:#1565c0,color:#1c1917
    style A2 fill:#faf8f5,stroke:#1565c0,color:#1c1917
    style P1 fill:#fce4ec,stroke:#c62828,color:#1c1917
    style P2 fill:#fce4ec,stroke:#c62828,color:#1c1917
```

Two common weight functions and what they assume about the variance structure.

Comparing the Two Approaches

| Approach | Weight $w(x)$ | Variance model | Requires aux. model? | Sign-flip risk? |
|---|---|---|---|---|
| Vanilla CP | $1$ | $g(x) = 1$ (constant) | No | No |
| Studentized CP | $1/\hat{\sigma}(x)$ | $g(x)$ estimated from data | Yes | Yes (Part 6) |
Advanced

General VST Theory and Optimal Weights

The General Variance-Stabilizing Principle

The general approach to variance stabilization is straightforward: if a random variable $Z$ has variance that depends on a parameter $\theta$, specifically $\text{Var}(Z) = f(\theta)$, then the transformation

$$\tilde{Z} = \frac{Z}{\sqrt{f(\theta)}}$$

stabilizes the variance to a constant. This is a fundamental tool in statistics dating back to the early 20th century, with several classical instantiations:

  • Square root transformation for Poisson data: If $Z \sim \text{Poisson}(\lambda)$, then $\text{Var}(Z) = \lambda$, so $Z / \sqrt{\lambda}$ stabilizes. In practice, $\sqrt{Z}$ is used, since $\text{Var}(\sqrt{Z}) \approx 1/4$ for large $\lambda$. For example, if one city reports 400 events and another reports 25, the raw counts have very different standard deviations (20 vs. 5), but on the square-root scale the values $\sqrt{400} = 20$ and $\sqrt{25} = 5$ carry the same uncertainty (standard deviation $\approx 0.5$ for both).
  • Arcsine transformation for binomial proportions: If $\hat{p} = Z/n$ with $Z \sim \text{Binomial}(n,p)$, then $\text{Var}(\hat{p}) = p(1-p)/n$. The variance peaks at $p = 0.5$ and shrinks toward zero at $p = 0$ or $p = 1$. The transformation $\arcsin(\sqrt{\hat{p}})$ stabilizes the variance to approximately $1/(4n)$, making proportions near the extremes and near the center equally amenable to standard statistical comparisons.
  • Log transformation for proportional variance: If $\text{Var}(Z) \propto [\mathbb{E}(Z)]^2$, then $\log(Z)$ stabilizes the variance (the coefficient of variation is constant). This arises naturally with financial returns, gene expression levels, and other quantities that are multiplicative in nature. Taking logarithms converts multiplicative variation into additive variation with constant spread.
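
The Poisson case is quick to verify numerically. A small sketch (the particular rates and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Var(Z) = lambda on the raw scale, but Var(sqrt(Z)) ≈ 1/4 for any large lambda.
stabilized = {}
for lam in (25, 100, 400):
    z = rng.poisson(lam, size=200_000)
    stabilized[lam] = np.sqrt(z).var()
    print(f"lambda={lam:>3}  Var(Z)={z.var():8.1f}  Var(sqrt(Z))={stabilized[lam]:.3f}")
```

The raw variances span 25 to 400, while the square-root-scale variances all sit near 0.25.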

Optimal Weight Under Leverage Heteroscedasticity

In the linear model setting, the prediction error variance is exactly $\sigma^2(1+h(x))$, so the variance function is $g(h) = 1+h$. The optimal weight follows immediately:

$$w^*(x) = \frac{1}{\sqrt{1 + h(x)}}$$

This gives stabilized scores $\tilde{S}_i = |Y_i - \hat{f}(X_i)| / \sqrt{1 + h(X_i)}$, which have constant variance $\sigma^2$ regardless of leverage.
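
Computing this weight requires nothing beyond the design matrix. A short sketch on synthetic data (the helper names are mine, not from the series):

```python
import numpy as np

def leverage(X_train, X_query):
    """h(x) = x^T (X^T X)^{-1} x, evaluated for each row of X_query."""
    XtX_inv = np.linalg.inv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, XtX_inv, X_query)

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
h = leverage(X, X)

# In-sample leverages sum to p (trace of the hat matrix), so mean h = p/n.
print(h.mean())  # p/n = 0.006 up to floating point

# Stabilized scores: the exact (1 + h) weight, no auxiliary model needed.
scores = np.abs(y - X @ beta_hat) / np.sqrt(1.0 + h)
```

For held-out points, the same `leverage` function is called with the new rows as `X_query`; the `1 + h` weight then applies exactly as in the formula above.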

When Noise Is Also Heteroscedastic

The analysis requires more care when the noise itself is heteroscedastic. Suppose:

$$\text{Var}(\varepsilon \mid X) = \sigma^2 g(h(X))$$

where $g$ depends on leverage. Then the total prediction error variance combines the noise variance and the estimation error variance:

$$\text{Var}(Y_{\text{new}} - \hat{Y}(x)) = \text{Var}(\varepsilon_{\text{new}}) + \text{Var}\bigl(x^\top(\hat{\beta} - \beta^*)\bigr) = \sigma^2 g(h(x)) + \sigma^2 h(x) = \sigma^2\bigl(g(h(x)) + h(x)\bigr)$$

(Writing the estimation term as $\sigma^2 h(x)$ is exact when the training noise is homoscedastic; when the training noise is itself heteroscedastic, the OLS estimation variance involves the full pattern of noise weights, and this term is an approximation.)

The optimal weight for stabilizing the total prediction error is therefore:

$$w^*(h) = \frac{1}{\sqrt{g(h) + h}}$$

```mermaid
flowchart TD
    V["Total prediction variance σ²(g(h) + h)"]
    V --> D["Decomposition"]
    D --> N["Noise term σ² g(h)"]
    D --> E["Estimation term σ² h"]
    N --> C1["g = 1: homoscedastic noise"]
    N --> C2["g = 1 + h: noise grows with leverage"]
    C1 --> W1["Total: σ²(1 + h), w* = (1 + h)⁻¹ᐟ²"]
    C2 --> W2["Total: σ²(1 + 2h), w* = (1 + 2h)⁻¹ᐟ²"]
    W1 --> O["Optimal weight depends on g"]
    W2 --> O
    style V fill:#f3f0ec,stroke:#a0522d,color:#1c1917
    style D fill:#e3f2fd,stroke:#1565c0,color:#1c1917
    style N fill:#faf8f5,stroke:#1565c0,color:#1c1917
    style E fill:#faf8f5,stroke:#1565c0,color:#1c1917
    style C1 fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style C2 fill:#fce4ec,stroke:#c62828,color:#1c1917
    style W1 fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style W2 fill:#fce4ec,stroke:#c62828,color:#1c1917
    style O fill:#f3f0ec,stroke:#a0522d,color:#1c1917
```

The optimal weight depends on whether the noise itself varies with leverage. Two special cases are shown.

Special Cases

Two special cases illustrate the range of behavior:

Case 1: Homoscedastic noise ($g = 1$). The noise variance is constant at $\sigma^2$. The total prediction variance is $\sigma^2(1 + h)$, and the optimal weight is $(1+h)^{-1/2}$. This is the simplest and most common setting. The weight corrects only for the estimation error term, and it is exact at any sample size because the formula $\sigma^2(1+h)$ is an exact identity, not an approximation.

Case 2: Noise proportional to $(1+h)$ ($g = 1+h$). The noise itself grows with leverage — points that are geometrically unusual also happen to be noisier. The total variance is:

$$\sigma^2\bigl((1+h) + h\bigr) = \sigma^2(1 + 2h)$$

and the optimal weight is $(1+2h)^{-1/2}$. The weight must correct for both the noise heteroscedasticity and the estimation error, leading to a steeper decay with $h$.

When Does Variance Stabilization Help?

Variance stabilization improves conformal intervals when two conditions hold simultaneously:

Condition 1: The variance actually varies. If leverage scores are nearly constant across the feature space (all points are roughly equidistant from the training centroid and the intrinsic noise is constant), there is nothing to stabilize. The diagnostic $\hat{\eta} = \text{std}(h) / \text{mean}(h)$ measures leverage heterogeneity: when $\hat{\eta} > 1$, leverage variation is substantial and leverage weighting is likely to help; when $\hat{\eta} < 0.5$, leverage variation is small and vanilla CP performs comparably.
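
The diagnostic takes one line to compute. The sketch below contrasts two invented designs, one compact and one heavy-tailed, both including an intercept column (the helper names are mine):

```python
import numpy as np

def leverage_heterogeneity(h):
    """eta_hat = std(h) / mean(h): the leverage-variation diagnostic."""
    return h.std() / h.mean()

def leverages(X):
    # Diagonal of the hat matrix X (X^T X)^{-1} X^T, one value per row.
    return np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

rng = np.random.default_rng(0)
n = 500

# Compact design: leverage is nearly constant across points.
X_flat = np.column_stack([np.ones(n), rng.uniform(-1, 1, (n, 2))])
# Heavy-tailed design: a few far-out points dominate the leverage.
X_spread = np.column_stack([np.ones(n), rng.standard_cauchy((n, 2))])

for name, X in [("compact", X_flat), ("heavy-tailed", X_spread)]:
    print(name, leverage_heterogeneity(leverages(X)))
```

The compact design yields a small $\hat{\eta}$, while the heavy-tailed design pushes $\hat{\eta}$ well above 1, the regime where leverage weighting is expected to pay off.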

Condition 2: The variance function is known or well-estimated. Dividing by the wrong scale factor can make things worse — it introduces a systematic distortion rather than removing one. The leverage-based weight $w(h) = (1+h)^{-1/2}$ satisfies this condition by construction because it uses the exact variance formula from the OLS prediction error decomposition, not an estimate. No auxiliary model is needed, and there is no estimation error to worry about. Studentized CP, which estimates $g$ from training residuals, is vulnerable to the sign-flip problem (Part 6): the estimator learns $\sigma^2(1-h)$ instead of $\sigma^2(1+h)$, producing corrections that point in the wrong direction.

```mermaid
flowchart TD
    Q["Does variance stabilization help?"]
    Q --> C1["Condition 1: does the variance actually vary?"]
    Q --> C2["Condition 2: is g known or well-estimated?"]
    C1 --> Y1["η̂ > 1: leverage varies a lot"]
    C1 --> N1["η̂ < 0.5: leverage nearly constant"]
    C2 --> Y2["Leverage-based w: exact formula, no estimation"]
    C2 --> N2["Estimated σ̂(x): risk of sign flip"]
    Y1 --> G["Leverage weighting helps significantly"]
    Y2 --> G
    N1 --> B["Leverage weighting ≈ vanilla CP"]
    N2 --> B2["May hurt more than help"]
    style Q fill:#f3f0ec,stroke:#a0522d,color:#1c1917
    style C1 fill:#e3f2fd,stroke:#1565c0,color:#1c1917
    style C2 fill:#e3f2fd,stroke:#1565c0,color:#1c1917
    style Y1 fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style N1 fill:#fce4ec,stroke:#c62828,color:#1c1917
    style Y2 fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style N2 fill:#fce4ec,stroke:#c62828,color:#1c1917
    style G fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style B fill:#faf8f5,stroke:#a0522d,color:#1c1917
    style B2 fill:#fce4ec,stroke:#c62828,color:#1c1917
```

Decision diagram: variance stabilization helps when leverage varies substantially and the variance function is correctly specified.

Summary

The key ideas from this post:

  1. Heteroscedasticity — non-constant variance — arises naturally in prediction errors, even under constant noise, because of the leverage term.
  2. Scale families provide a structured model where the variance changes but the distributional shape does not.
  3. Variance-stabilizing transformations divide by the scale factor to make observations comparable.
  4. In conformal prediction, this translates to weighted nonconformity scores and adaptive interval widths.
  5. The weight function is the key design choice: $w = 1$ (vanilla) or $w = 1/\hat{\sigma}(x)$ (studentized). The choice depends on whether you can reliably estimate the variance function from data.

In the next post, we survey the existing adaptive conformal methods — CQR, Studentized CP, Localized CP — and examine their trade-offs in light of everything we have learned.

Further Reading

  • Bartlett, M. S. (1947). The use of transformations. Biometrics, 3(1), 39–52.
  • Carroll, R. J. & Ruppert, D. (1988). Transformation and Weighting in Regression. Chapman & Hall.
  • Romano, Y., Patterson, E., & Candes, E. J. (2019). Conformalized quantile regression. NeurIPS.