The Sign Flip: Why Training Residuals Lie About Prediction Uncertainty
Part 6 of a 10-part series on prediction intervals, conformal prediction, and leverage scores.
This is the conceptual core of the entire series. Everything in Parts 1–5 has been building to this moment. The conformal prediction track established that we need adaptive prediction intervals. The leverage track established that leverage scores capture the geometry of feature space. This post connects the two—and identifies a structural problem that affects an entire class of uncertainty estimation methods.
The issue is this: training residuals and prediction errors depend on leverage with opposite signs. Where the model looks most accurate on training data, it is actually most uncertain on new data. The sign flips. And any method that learns uncertainty from training residuals will get it systematically backwards.
We present this at three levels. Read the first for the core insight, or go all the way through for the full mathematical derivation.
The Rubber Band and the Lie
▾The Rubber Band Analogy
Imagine stretching a rubber band between two nails hammered into a board. The rubber band sits taut and straight between them—this is your model, a line fit through training data.
Now add a third nail far to the right, well away from the other two. The rubber band bends to touch it. At that distant nail, the fit looks perfect—the residual is tiny, the band passes right through the point. The model appears to be doing a good job.
To make this concrete, suppose the two original nails are at $x=0$ and $x=1$, and the distant nail is at $x=10$. When you fit a line through these three points, the distant point at $x=10$ has high leverage: it sits far from the centroid of the data, so it exerts strong pull on the fitted line. The model tilts to accommodate it, and the residual at that point is small—perhaps nearly zero. Meanwhile, the residuals at $x=0$ and $x=1$ may actually grow, because the line tilted away from them to reach the distant nail.
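To put numbers on this picture, here is a minimal sketch (the $y$-values are hypothetical, chosen only for illustration) that fits a line through the three nails and prints each point's leverage and residual; the distant nail ends up with leverage close to 1 and a residual close to zero:

```python
import numpy as np

# Three "nails": two at x = 0 and x = 1, plus a distant one at x = 10.
x = np.array([0.0, 1.0, 10.0])
y = np.array([0.1, 0.9, 5.0])          # hypothetical noisy observations

# Design matrix with an intercept, OLS fit, and hat matrix H = X (X^T X)^{-1} X^T.
X = np.column_stack([np.ones_like(x), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
H = X @ XtX_inv @ X.T

for xi, hi, ei in zip(x, np.diag(H), y - X @ beta_hat):
    print(f"x = {xi:4.1f}   leverage = {hi:.3f}   residual = {ei:+.4f}")
```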
But here is the catch. Remove that third nail and put a new nail nearby—even just slightly offset. Let go of the rubber band. It swings back. The prediction error at the new point is large, because the band was being held in place by the original nail, not by any genuine understanding of the pattern.
The model overfit locally. It bent itself to touch the distant point, making the training residual misleadingly small. But this bending was driven by noisy data. At prediction time, that same bending creates substantial uncertainty. The key mechanism is that fitting absorbs noise at training points—it hides the noise rather than eliminating it—and at prediction time, that absorbed noise reappears as estimation error pointing in the wrong direction.
The same high-leverage region of feature space produces tiny training residuals but enormous prediction errors.
The Sign Literally Flips
This is not a vague metaphor. The mathematics says something precise and quantifiable:
- Training residuals are small where the model was forced to pay attention (high-leverage points). The model absorbed the noise by bending to fit them.
- Prediction errors are large at those same spots, because the model's bending was based on noisy data, and new observations are not protected by the same fitting.
The sign flips. Training says: "I am certain here." Reality says: "I am most uncertain here."
To see why this happens, think about what the model does at a high-leverage point during training. It adjusts its coefficients to reduce the error at that point, effectively absorbing the noise into the fit. The residual shrinks because the model spent its degrees of freedom accommodating that point. But when a new observation arrives at the same location, it brings fresh noise that the model has no ability to absorb. Worse, the coefficients are now biased by the original noise they absorbed, so the estimation error and the new noise compound rather than cancel.
This is not a subtle or edge-case issue. It is a systematic inversion. Wherever the model is most confident about its training data, it is least reliable for new predictions. Wherever the model admits to being imprecise on training data, it is actually most trustworthy for prediction.
Why This Breaks Uncertainty Estimation
Think about how most uncertainty estimators work. They look at training residuals—the gaps between what the model predicted and what it observed during training—and try to learn a pattern: "Where are the residuals big? That is where I should be uncertain." This is a natural and seemingly reasonable strategy: if the model struggled to fit a particular region of feature space, that region is probably harder to predict, so intervals there should be wider.
But the sign flip means the residuals are big at exactly the wrong places. The estimator learns the pattern backwards. Low-leverage points, where prediction is already easy, show large training residuals and receive wide intervals they do not need. High-leverage points, where prediction is hardest, show small training residuals and receive narrow intervals that undercover.
The chain of reasoning: training residuals encode the wrong variance structure, and anything downstream inherits the error.
Judging a teacher by their own exam. It is like judging a teacher's knowledge by how well their own students do on an exam the teacher wrote. Of course the students do well—the teacher DESIGNED the test for them. The questions were tailored to what was taught. The real test is how well new students, from a different class, do on that exam. A teacher who "teaches to the test" looks brilliant by their own metric but may produce students who fail any outside evaluation. High-leverage points are the students the teacher paid the most attention to. Their performance looks great—but the next class will struggle at exactly those spots.
The sign flip is not a minor detail. It means that residual-based uncertainty estimation is systematically wrong in the opposite direction from what you need. It does not just add noise—it inverts the signal. High-leverage points, where accurate uncertainty quantification matters most, are precisely where residual-based methods are least reliable.
The Variance Decomposition and the Sign Flip
▾Setting Up the Prediction Error
Consider a linear model $Y = X^\top \beta^* + \varepsilon$ where $\mathbb{E}[\varepsilon \mid X] = 0$ and $\text{Var}(\varepsilon \mid X) = \sigma^2$. The OLS estimator is $\hat{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}$, and the prediction at a new test point $x$ is $\hat{Y}(x) = x^\top\hat{\beta}$.
The prediction error decomposes cleanly into two independent pieces:
$$Y_{\text{new}} - \hat{Y}(x) = \underbrace{\varepsilon_{\text{new}}}_{\text{new noise}} - \underbrace{x^\top(\hat{\beta} - \beta^*)}_{\text{estimation error}}$$

The new noise $\varepsilon_{\text{new}}$ is independent of $\hat{\beta}$ (the training data cannot predict the noise in a future observation). So the variances simply add.
The two independent components of prediction error variance: irreducible noise and estimation uncertainty amplified by leverage.
Specifically:
- Noise component: $\text{Var}(\varepsilon_{\text{new}}) = \sigma^2$. This is irreducible—it comes from the randomness in the new observation itself.
- Estimation error component: $\text{Var}(x^\top\hat{\beta} \mid \mathbf{X}) = \sigma^2 h(x)$, where $h(x) = x^\top(\mathbf{X}^\top\mathbf{X})^{-1}x$ is the leverage of the test point.
Adding these together:
$$\text{Var}(Y_{\text{new}} - \hat{Y}(x) \mid x, \mathbf{X}) = \sigma^2(1 + h(x))$$

The "1" is the noise floor. The "$h(x)$" is the penalty for predicting at a point that is far from the center of the training data. At the centroid ($h \approx 0$), prediction variance is just noise. At a high-leverage point, estimation error dominates.
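For intuition, in the familiar special case of simple linear regression with an intercept, the leverage has the closed form

$$h(x) = \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2},$$

so the prediction variance $\sigma^2(1 + h(x))$ grows quadratically as the test point $x$ moves away from the training mean $\bar{x}$.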
Training Residuals: The Other Side of the Coin
Now consider training residuals. The $i$-th training residual is $e_i = Y_i - \hat{Y}_i$. Using the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$, the residual vector is:
$$\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{Y} = (\mathbf{I} - \mathbf{H})\boldsymbol{\varepsilon}$$

The second equality holds because $(\mathbf{I} - \mathbf{H})\mathbf{X}\beta^* = \mathbf{0}$: the true signal lies in the column space of $\mathbf{X}$ and is annihilated by the projection $\mathbf{I} - \mathbf{H}$. The variance of the $i$-th residual is:
$$\text{Var}(e_i \mid \mathbf{X}) = \sigma^2(1 - h_i)$$

The Sign Flip, Precisely
Now place the two formulas side by side:
| Quantity | Variance | Sign of leverage |
|---|---|---|
| Prediction error $Y_{\text{new}} - \hat{Y}(x)$ | $\sigma^2(1 + h)$ | +h (amplified) |
| Training residual $e_i$ | $\sigma^2(1 - h)$ | −h (attenuated) |
The leverage contribution enters with opposite signs. Why? The core reason is that the hat matrix $\mathbf{H}$ and its complement $\mathbf{I} - \mathbf{H}$ partition the noise into two orthogonal pieces. Training residuals live in the $\mathbf{I} - \mathbf{H}$ subspace: they see only the fraction of noise that the model did not absorb. Prediction errors live in the full space: they see the new observation's noise plus the estimation error from noise that the model did absorb during training. The more noise the model absorbs (high leverage), the smaller the residual but the larger the estimation error carried forward into prediction.
- Prediction errors are amplified by leverage because the model is uncertain in high-leverage directions. The coefficient estimates $\hat{\beta}$ are imprecise, and that imprecision is magnified when projecting onto a direction $x$ that is far from the training centroid. A new observation at that point brings its own noise $\varepsilon_{\text{new}}$, and on top of that, the fitted value $\hat{Y}(x)$ carries estimation error proportional to $h(x)$. The two sources of variance add because they are independent.
- Training residuals are attenuated by leverage because the model has already seen the training point and adjusted itself to accommodate it. A high-leverage point pulls the fitted surface toward itself, absorbing some of the noise. The residual $e_i$ is smaller than the true noise $\varepsilon_i$ because a fraction $h_i$ of $\varepsilon_i$ has been absorbed into the fit. Formally, $e_i = (1 - h_i)\varepsilon_i + \text{(contributions from other points' noise)}$, and the net effect is a variance of $\sigma^2(1-h_i)$.
At low leverage, training and prediction variances are similar. At high leverage, they diverge in opposite directions.
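A short Monte Carlo makes the divergence concrete. The sketch below uses synthetic data (assumed coefficients and noise level) and repeatedly refits a line, comparing the empirical variance of the training residual and of a fresh prediction error at the highest-leverage point against $\sigma^2(1-h)$ and $\sigma^2(1+h)$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n_sims = 1.0, 20_000

# Fixed design: a cluster near the origin plus one high-leverage point at x = 8.
x_train = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 8.0])
X = np.column_stack([np.ones_like(x_train), x_train])
XtX_inv = np.linalg.inv(X.T @ X)
h = np.diag(X @ XtX_inv @ X.T)           # training leverages
k = int(np.argmax(h))                    # index of the high-leverage point

train_resid, pred_err = np.empty(n_sims), np.empty(n_sims)
for s in range(n_sims):
    y = 2.0 + 0.5 * x_train + rng.normal(0, sigma, size=x_train.size)
    beta_hat = XtX_inv @ X.T @ y
    fitted_k = X[k] @ beta_hat
    train_resid[s] = y[k] - fitted_k                        # residual the model shows in training
    y_new = 2.0 + 0.5 * x_train[k] + rng.normal(0, sigma)   # fresh observation at the same x
    pred_err[s] = y_new - fitted_k                          # error the model makes on new data

print(f"leverage h             = {h[k]:.3f}")
print(f"Var(training residual) = {train_resid.var():.3f}   theory: sigma^2 (1 - h) = {1 - h[k]:.3f}")
print(f"Var(prediction error)  = {pred_err.var():.3f}   theory: sigma^2 (1 + h) = {1 + h[k]:.3f}")
```

With this design the far point's leverage is close to 1, so the two empirical variances land near $0$ and near $2\sigma^2$, the two extremes of the table in the next subsection.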
Quantifying the Mismatch
If you estimate prediction uncertainty from training residuals, you learn $\sigma\sqrt{1-h}$ but need $\sigma\sqrt{1+h}$. The ratio between reality and the estimate is:
$$\text{Underestimation ratio} = \frac{\sigma\sqrt{1+h}}{\sigma\sqrt{1-h}} = \sqrt{\frac{1+h}{1-h}}$$

This ratio grows rapidly with leverage:
| Leverage $h$ | Training residual variance $\sigma^2(1-h)$ | Prediction error variance $\sigma^2(1+h)$ | Ratio $\sqrt{(1+h)/(1-h)}$ | Underestimation (ratio $-$ 1) |
|---|---|---|---|---|
| 0.05 | $0.95\sigma^2$ | $1.05\sigma^2$ | 1.05 | 5% |
| 0.10 | $0.90\sigma^2$ | $1.10\sigma^2$ | 1.11 | 11% |
| 0.20 | $0.80\sigma^2$ | $1.20\sigma^2$ | 1.22 | 22% |
| 0.30 | $0.70\sigma^2$ | $1.30\sigma^2$ | 1.36 | 36% |
| 0.50 | $0.50\sigma^2$ | $1.50\sigma^2$ | 1.73 | 73% |
| 0.70 | $0.30\sigma^2$ | $1.70\sigma^2$ | 2.38 | 138% |
| 0.90 | $0.10\sigma^2$ | $1.90\sigma^2$ | 4.36 | 336% |
| $\to 1$ | $\to 0$ | $\to 2\sigma^2$ | $\to \infty$ | $\to \infty$ |
At $h = 0.3$, the residual-based estimator underestimates uncertainty by 36%. At $h = 0.5$, by 73%. As $h \to 1$—which happens for outliers in feature space or when $p$ is close to $n$—the estimator predicts near-zero uncertainty while the true prediction uncertainty is large. The ratio diverges to infinity.
This is not a small correction. For any dataset with meaningful leverage variation, the mismatch is large enough to substantially distort the resulting prediction intervals.
Any method that estimates prediction uncertainty from training residuals will systematically underestimate uncertainty at high-leverage points and overestimate it at low-leverage points. The error is not random—it is structured, predictable, and exactly backwards.
Full Derivation and Consequences for Conformal Prediction
▾Derivation of Prediction Error Variance
We derive the prediction error variance from first principles. The OLS estimator is:
$$\hat{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} = \beta^* + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}$$

The estimation error in the coefficients is:
$$\hat{\beta} - \beta^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}$$

The variance of the fitted value at a test point $x$ is:
$$\text{Var}(x^\top\hat{\beta} \mid \mathbf{X}) = x^\top \text{Var}(\hat{\beta} \mid \mathbf{X}) \, x = x^\top \left[\sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}\right] x = \sigma^2 \cdot x^\top (\mathbf{X}^\top\mathbf{X})^{-1} x = \sigma^2 h(x)$$

The prediction error is $Y_{\text{new}} - \hat{Y}(x) = \varepsilon_{\text{new}} - x^\top(\hat{\beta} - \beta^*)$. Since $\varepsilon_{\text{new}} \perp \hat{\beta}$ (the new noise is independent of the training data), the variances add:
$$\boxed{\text{Var}(Y_{\text{new}} - \hat{Y}(x) \mid x, \mathbf{X}) = \sigma^2 + \sigma^2 h(x) = \sigma^2(1 + h(x))}$$

Derivation of Training Residual Variance
The training residual vector is $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{Y}$. Since $\mathbf{H}\mathbf{X}\beta^* = \mathbf{X}\beta^*$ (the hat matrix projects onto the column space of $\mathbf{X}$), we have:
$$\mathbf{e} = (\mathbf{I} - \mathbf{H})(\mathbf{X}\beta^* + \boldsymbol{\varepsilon}) = (\mathbf{I} - \mathbf{H})\boldsymbol{\varepsilon}$$

The covariance matrix of the residual vector is:
$$\text{Cov}(\mathbf{e} \mid \mathbf{X}) = (\mathbf{I} - \mathbf{H}) \, \sigma^2 \mathbf{I} \, (\mathbf{I} - \mathbf{H})^\top = \sigma^2 (\mathbf{I} - \mathbf{H})$$

where the last step uses the fact that $\mathbf{I} - \mathbf{H}$ is idempotent and symmetric: $(\mathbf{I} - \mathbf{H})^2 = \mathbf{I} - \mathbf{H}$ and $(\mathbf{I} - \mathbf{H})^\top = \mathbf{I} - \mathbf{H}$. The variance of the $i$-th residual is the $i$-th diagonal entry:
$$\text{Var}(e_i \mid \mathbf{X}) = \sigma^2 (1 - H_{ii}) = \sigma^2(1 - h_i)$$

The proof structure: both variances derive from the same OLS estimator, but the sign of leverage is opposite because one involves projection onto the column space and the other involves projection onto its complement.
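Both projection facts are easy to confirm numerically. A small sketch on a random synthetic design (sizes chosen arbitrarily) checks that $\mathbf{I} - \mathbf{H}$ is idempotent and symmetric, and that the diagonal of $\sigma^2(\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H})^\top$ reproduces $\sigma^2(1 - h_i)$:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, sigma = 40, 3, 1.0

# Random synthetic design with an intercept column.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H                                    # the residual-forming projection I - H

# I - H is idempotent and symmetric (up to floating-point error).
print(np.allclose(M @ M, M), np.allclose(M, M.T))

# Hence Cov(e) = sigma^2 (I - H)(I - H)^T = sigma^2 (I - H),
# and the i-th residual variance is sigma^2 (1 - h_i).
cov_e = sigma**2 * (M @ M.T)
print(np.allclose(np.diag(cov_e), sigma**2 * (1 - np.diag(H))))
```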
Contamination of Studentized Conformal Scores
In studentized conformal prediction, one fits a variance estimator $\hat{\sigma}(x)$ to the training absolute residuals $\{(X_i, |e_i|)\}_{i=1}^n$ and forms scores:
$$s_i = \frac{|Y_i - \hat{f}(X_i)|}{\hat{\sigma}(X_i)}$$

If the variance estimator is well-calibrated to training data, it learns $\hat{\sigma}(x) \approx \sigma\sqrt{1 - h(x)}$, because that is the standard deviation of training residuals. On the calibration set, the studentized scores are approximately:
$$s_j^{\text{stud}} = \frac{|Y_j - \hat{f}(X_j)|}{\hat{\sigma}(X_j)} \approx \frac{|Y_j - \hat{f}(X_j)|}{\sigma\sqrt{1 - h_j}}$$

The denominator carries the training-residual variance structure, but the numerator does not: a calibration point comes from a separate calibration set, took no part in the fit, and so its residual is a genuine prediction error. For calibration point $j$ with leverage $h_j$ relative to the training design matrix, that prediction error has variance $\sigma^2(1+h_j)$, so:
$$s_j^{\text{stud}} \approx \frac{\sigma\sqrt{1+h_j} \cdot |\eta_j|}{\sigma\sqrt{1-h_j}} = \sqrt{\frac{1+h_j}{1-h_j}} \cdot |\eta_j|$$

where $\eta_j$ is a standardized error term. The crucial point: these scores are not identically distributed. The factor $\sqrt{(1+h_j)/(1-h_j)}$ varies across the calibration set, introducing a systematic, leverage-dependent bias into the conformal quantile.
Points with high leverage produce inflated scores (the ratio is large), biasing the conformal quantile upward. Points with low leverage produce deflated scores (the ratio is near 1), creating a pull in the other direction. The quantile is a compromise that is too large for low-leverage points and too small for high-leverage points—exactly the opposite of what adaptive intervals need.
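A simulation shows the effect directly. The sketch below uses synthetic data and, rather than fitting a variance model, plugs in the idealized $\sigma\sqrt{1 - h(x)}$ that the text assumes a well-calibrated residual-trained estimator would learn; the average studentized score comes out visibly larger in the high-leverage half of the calibration set:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, n_train, n_cal = 1.0, 50, 2000
beta = np.array([2.0, 0.5])

# Training design: a bulk near x = 0 plus one far point, so leverage varies widely.
x_tr = np.append(rng.normal(0, 1, n_train - 1), 9.0)
X_tr = np.column_stack([np.ones(n_train), x_tr])
XtX_inv = np.linalg.inv(X_tr.T @ X_tr)
y_tr = X_tr @ beta + rng.normal(0, sigma, n_train)
beta_hat = XtX_inv @ X_tr.T @ y_tr

# Calibration set spread over the same range of x.
x_cal = rng.uniform(-2, 9, n_cal)
X_cal = np.column_stack([np.ones(n_cal), x_cal])
y_cal = X_cal @ beta + rng.normal(0, sigma, n_cal)
h_cal = np.einsum("ij,jk,ik->i", X_cal, XtX_inv, X_cal)   # leverage w.r.t. the training design

# Idealized residual-trained variance model sigma * sqrt(1 - h), as assumed above.
scores = np.abs(y_cal - X_cal @ beta_hat) / (sigma * np.sqrt(1.0 - h_cal))

low = h_cal < np.median(h_cal)
print(f"mean score, low-leverage half : {scores[low].mean():.2f}")
print(f"mean score, high-leverage half: {scores[~low].mean():.2f}")   # systematically larger
```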
Generality of the Sign Flip
The sign flip is not an artifact of a particular estimator or a particular model. It is a structural property of least-squares regression. It holds for:
- Any linear model: OLS, ridge, LASSO (to the extent that the hat matrix analog applies).
- Any variance estimator trained on residuals: Random forests, neural networks, kernel methods—if they train on $\{(X_i, |e_i|)\}$, they learn $\sigma\sqrt{1-h}$, not $\sigma\sqrt{1+h}$.
- Any sample size: The sign flip is exact, not asymptotic. It holds for $n = 20$ and $n = 20{,}000$.
The implication is broad: any method that estimates prediction uncertainty from training residuals alone will systematically underestimate uncertainty at high-leverage points and overestimate it at low-leverage points. This includes studentized conformal prediction, many Bayesian approaches that use in-sample variability as a proxy for prediction uncertainty, and ensemble methods that measure disagreement on training data.
The Way Out: Use Leverage Directly
The formula $\sigma^2(1 + h(x))$ describes the prediction variance without any reference to training residuals. It depends only on:
- The noise variance $\sigma^2$ (a single scalar, estimated from the training residuals as $\hat{\sigma}^2 = \sum_i e_i^2 / (n - p)$, where dividing by $n - p$ corrects for exactly the $1 - h_i$ shrinkage).
- The leverage score $h(x)$ (computed from the design matrix via the SVD, no residuals needed).
Because leverage is computed directly from the design matrix geometry—not from training residuals—it is immune to the sign flip. Any method that uses leverage to quantify prediction difficulty avoids the systematic bias described above. This is a fundamental advantage: leverage tells you how hard a prediction will be before you see the outcome, and it gets the sign right.
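As a sketch of what that computation might look like (function name and data are illustrative, not from the series), the thin SVD $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ gives $(\mathbf{X}^\top\mathbf{X})^{-1} = \mathbf{V}\boldsymbol{\Sigma}^{-2}\mathbf{V}^\top$, so $h(x)$ is just a squared norm and no residual model is ever fit:

```python
import numpy as np

def prediction_sd(X_train, y_train, x_new):
    """sigma_hat * sqrt(1 + h(x)), computed from design geometry alone."""
    n, p = X_train.shape
    U, s, Vt = np.linalg.svd(X_train, full_matrices=False)   # thin SVD: X = U diag(s) V^T
    beta_hat = Vt.T @ ((U.T @ y_train) / s)                   # OLS solution via the SVD
    sigma2_hat = np.sum((y_train - X_train @ beta_hat) ** 2) / (n - p)
    z = (Vt @ x_new) / s                                      # h(x) = || diag(1/s) V^T x ||^2
    return float(np.sqrt(sigma2_hat * (1.0 + z @ z)))

# Illustrative use: synthetic training data, one test point far from the bulk.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(60), rng.normal(0, 1, 60)])
y = X @ np.array([2.0, 0.5]) + rng.normal(0, 1.0, 60)
print(prediction_sd(X, y, np.array([1.0, 0.0])))   # near the centroid: roughly sigma
print(prediction_sd(X, y, np.array([1.0, 6.0])))   # far from the centroid: noticeably larger
```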
The sign flip in one line:
$$\text{Training: } \text{Var}(e_i) = \sigma^2(1 - h_i) \qquad \longleftrightarrow \qquad \text{Prediction: } \text{Var}(Y_{\text{new}} - \hat{Y}) = \sigma^2(1 + h)$$

The $+h$ and $-h$ are not negotiable. They come from the algebra of orthogonal projections: the hat matrix $\mathbf{H}$ projects onto the column space of $\mathbf{X}$, and $\mathbf{I} - \mathbf{H}$ projects onto its orthogonal complement. Training residuals live in the complement (the $1-h$ world). Prediction errors incorporate both the complement and the column space (the $1+h$ world).
Further Reading
- Cook, R. D. & Weisberg, S. (1982). Residuals and Influence in Regression. Chapman & Hall. (The foundational treatment of leverage, residual variance, and their interplay.)
- Chatterjee, S. & Hadi, A. S. (1988). Sensitivity Analysis in Linear Regression. Wiley. (Detailed analysis of how leverage affects diagnostic statistics.)
- Seber, G. A. F. & Lee, A. J. (2003). Linear Regression Analysis, 2nd ed. Wiley. (Chapter 10 on prediction intervals and the variance decomposition that underlies the sign flip.)