The Hat Matrix: A Built-In Diagnostic for Linear Regression
Part 4 of a 10-part series on prediction intervals, conformal prediction, and leverage scores.
We now shift gears. Parts 1–3 covered conformal prediction and its constant-width limitation. This post begins a new track: the geometry of data. We introduce a quantity called the leverage score that measures how unusual each data point is, and a matrix called the hat matrix that encodes these scores for every point simultaneously. By the end of this series, leverage will connect back to conformal prediction in a natural and useful way.
This post is organized into three levels of increasing difficulty. Start wherever you are comfortable.
The Big Picture: What Is Leverage?
A Social Circle Analogy
Imagine you are in a room of 100 people. Most of them are clustered in the center, chatting in groups. But you are standing alone in the far corner. Now someone asks the room a question and averages all the answers. Your answer has disproportionate influence — because you are the only one representing "the corner." If you change your answer, the average shifts noticeably. If someone in the center changes theirs, it barely matters; there are dozens of others nearby giving similar answers.
That, in essence, is leverage — how much influence a single data point has on the model's answer.
Leverage in Machine Learning
The same principle applies when fitting a model. When a model fits data, most training points are "average" — they share similar features with many other points and sit in well-populated regions of the feature space. But some points are unusual: they sit in sparse regions, far from the crowd. These unusual points have high leverage. The model is forced to pay extra attention to them because there is no one else nearby to "vote" on what the prediction should be in that region. As a result, the fitted model surface is disproportionately shaped by these isolated points.
A cluster of ordinary points (low leverage) and one isolated point (high leverage). The model "bends" toward the isolated point because it is the only representative of that region.
From Data to Leverage Scores
Leverage scores summarize which data points have the most influence on the model by virtue of their position in feature space.
Leverage is like being the only voter in your district — your vote counts much more than someone in a crowded district. The model must listen to you because there is no one else nearby to balance you out.
Leverage depends on where you are in feature space, not on your Y value. A point can have a perfectly normal outcome but high leverage. It is entirely about position — how far you are from the "center" of the training data, measured in a way that accounts for correlations between features.
Why Should You Care?
With this intuition in hand, we can ask: does leverage matter in practice? The answer is yes, and it matters in two directions. High-leverage points are not necessarily bad — a point can sit in an unusual region and still follow the true pattern perfectly. But they are influential. If a high-leverage point happens to have an unusual response (maybe it is an error, or just an outlier in Y), it can drag the entire model off course. Conversely, removing a high-leverage point can change the fitted model substantially, while removing a low-leverage point typically changes it very little. Knowing which points have high leverage lets you:
- Identify which training points the model depends on most
- Flag test points where the model is effectively extrapolating
- Understand why prediction intervals should be wider in some regions than others
This last point is the connection to Parts 1–3. Leverage will turn out to be the key to making prediction intervals adaptive — wider where the model is less certain, narrower where it is more certain. But we are getting ahead of ourselves. First, let us see where leverage comes from mathematically.
The Hat Matrix and Leverage Scores
Ordinary Least Squares
Given training data $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ with $X_i \in \mathbb{R}^p$ and $Y_i \in \mathbb{R}$, OLS finds the coefficient vector that minimizes the sum of squared residuals. Solving the normal equations gives the closed-form solution:
$$\hat{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}$$
where $\mathbf{X}$ is the $n \times p$ design matrix (rows are training feature vectors) and $\mathbf{Y}$ is the $n$-vector of responses. The fitted values are:
$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\beta} = \underbrace{\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top}_{\mathbf{H}} \mathbf{Y}$$

Defining the Hat Matrix
The matrix that transforms $\mathbf{Y}$ into fitted values $\hat{\mathbf{Y}}$ is the hat matrix:
$$\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$$
It is called the hat matrix because it "puts the hat on Y": $\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$. Geometrically, this is an orthogonal projection. The response vector $\mathbf{Y}$ lives in $\mathbb{R}^n$, and the columns of $\mathbf{X}$ span a $p$-dimensional subspace within that space. The hat matrix finds the point in this subspace that is closest (in Euclidean distance) to $\mathbf{Y}$. The residual vector $\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}}$ is perpendicular to the column space of $\mathbf{X}$, which is exactly the condition that defines OLS.
The hat matrix H projects Y onto the column space of X. The complement (I - H) projects onto the orthogonal complement, giving the residuals.
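To make the projection concrete, here is a minimal numpy sketch (synthetic data; the variable names are purely illustrative) that forms $\mathbf{H}$ explicitly and checks that $\mathbf{H}\mathbf{Y}$ reproduces the OLS fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 features
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Hat matrix H = X (X^T X)^{-1} X^T (fine at this scale; prefer QR/SVD for large problems)
H = X @ np.linalg.inv(X.T @ X) @ X.T

beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]  # OLS coefficients
assert np.allclose(H @ Y, X @ beta_hat)          # H "puts the hat on Y"
```

Note that $\mathbf{H}$ is an $n \times n$ matrix, so forming it explicitly only makes sense for small datasets; the SVD route later in this post avoids building it at all.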
Properties of the Hat Matrix
Because $\mathbf{H}$ is a projection matrix, it inherits several useful algebraic properties. These are not just formal curiosities — each one has a concrete statistical consequence, as the table below summarizes.
| Property | Statement | Meaning |
|---|---|---|
| Symmetric | $\mathbf{H} = \mathbf{H}^\top$ | The influence of point $i$ on prediction $j$ equals the influence of $j$ on $i$ |
| Idempotent | $\mathbf{H}^2 = \mathbf{H}$ | H is a projection matrix — projecting twice is the same as projecting once |
| Trace | $\text{tr}(\mathbf{H}) = p$ | The total leverage across all training points equals the number of features |
| Complement | $(\mathbf{I} - \mathbf{H})^2 = \mathbf{I} - \mathbf{H}$ | The residual-maker matrix is also a projection |
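If you would rather see these properties numerically than take them on faith, a short sketch like the following (synthetic design matrix, illustrative only) confirms all four:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 4
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(n)

assert np.allclose(H, H.T)                    # symmetric
assert np.allclose(H @ H, H)                  # idempotent: projecting twice = projecting once
assert np.isclose(np.trace(H), p)             # total leverage equals the number of columns
assert np.allclose((I - H) @ (I - H), I - H)  # the residual-maker is also a projection
```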
Leverage Scores: The Diagonal of H
The diagonal entries of $\mathbf{H}$ are the leverage scores:
$$h_i = H_{ii} = X_i^\top(\mathbf{X}^\top\mathbf{X})^{-1}X_i$$

Self-Influence
The leverage score has a direct interpretation as self-influence. Since $\hat{Y}_i = \sum_{j=1}^n H_{ij} Y_j$, we can decompose:
$$\hat{Y}_i = h_i Y_i + \sum_{j \neq i} H_{ij} Y_j$$
The leverage $h_i$ is the weight that $Y_i$ receives in its own prediction. If $h_i \approx 1$, the model essentially passes through the observed value — the point "leverages" the fit completely, and the residual at that point will be near zero regardless of the actual Y value. If $h_i \approx 0$, the point has negligible influence on its own fitted value; its prediction is almost entirely determined by the other training points. Most points fall somewhere in between, with a typical leverage near the average $p/n$.
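The decomposition is easy to check numerically. A small sketch (synthetic data; the index $i = 7$ is arbitrary) splits the fitted value at one training point into its own-weight term and the contribution of everyone else:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)
H = X @ np.linalg.inv(X.T @ X) @ X.T

i = 7
yhat_i = H[i, :] @ Y                                    # fitted value at point i
own = H[i, i] * Y[i]                                    # weight on its own response: h_i * Y_i
rest = sum(H[i, j] * Y[j] for j in range(n) if j != i)  # everyone else's contribution
assert np.isclose(yhat_i, own + rest)
print(f"h_{i} = {H[i, i]:.3f}")                         # the self-influence weight
```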
Properties of Leverage Scores
| Property | Formula | Explanation |
|---|---|---|
| Bounded | $0 \leq h_i \leq 1$ | For training points; follows from idempotence of H |
| Average | $\bar{h} = p/n$ | Since $\sum_i h_i = \text{tr}(\mathbf{H}) = p$ |
| Flagging rule | $h_i > 2p/n$ | Classical threshold: leverage more than twice the average is "high" |
| Lower bound | $h_i \geq 1/n$ | If X includes an intercept column |
| Test points | $h(x)$ can exceed 1 | Deep extrapolation — farther from centroid than any training point |
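A short sketch ties the table together: compute the training leverages as the diagonal of $\mathbf{H}$, check the bounds and the $p/n$ average, and apply the $2p/n$ flagging rule (synthetic data, with one point deliberately pushed away from the crowd):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
X[0] *= 6.0                                  # push one point far from the crowd
H = X @ np.linalg.inv(X.T @ X) @ X.T

h = np.diag(H)
assert np.all((h >= 0) & (h <= 1 + 1e-12))   # bounded for training points
assert np.isclose(h.mean(), p / n)           # average leverage is p/n
flagged = np.where(h > 2 * p / n)[0]
print("high-leverage points:", flagged)      # typically includes index 0
```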
Geometric Interpretation: Mahalanobis Distance
The formula $h(x) = x^\top(\mathbf{X}^\top\mathbf{X})^{-1}x$ is, up to a factor of $n$, the squared Mahalanobis distance from $x$ to the training centroid (assuming the features are centered). This is because $\mathbf{X}^\top\mathbf{X}/n$ is the sample covariance of the centered training features, so $(\mathbf{X}^\top\mathbf{X})^{-1}$ is $n$ times its inverse. The level sets of $h(x)$ are ellipsoids in $\mathbb{R}^p$: points on the surface of one such ellipsoid all have the same leverage, so contours of constant leverage are ellipsoids whose axes are aligned with the principal directions of the training data.
Consider a 2D example where the training data form an elongated elliptical cloud (correlated features). A point that is far along the major axis — the direction of greatest spread — has moderate leverage, because many training points already span that direction. But a point that is far along the minor axis has very high leverage, because the training data are tightly concentrated in that direction and the point lies outside the range the data cover well. The Mahalanobis distance accounts for the shape of the data cloud, not just the Euclidean distance. This is why leverage captures something that individual feature values miss: a point can look "normal" in every individual feature but have high leverage because it sits in an unusual combination of feature values.
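Here is a small sketch of that 2D picture (synthetic correlated features; the covariance values are arbitrary): two candidate points sit at the same Euclidean distance from the centroid, one along the major axis of the cloud and one along the minor axis.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
cov = np.array([[4.0, 1.8], [1.8, 1.0]])           # elongated, correlated cloud
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
def leverage(x):
    return x @ XtX_inv @ x

# principal directions of the training cloud
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
major = eigvecs[:, np.argmax(eigvals)]
minor = eigvecs[:, np.argmin(eigvals)]

r = 5.0                                            # same Euclidean distance in both cases
print("leverage along major axis:", leverage(r * major))
print("leverage along minor axis:", leverage(r * minor))  # much larger
```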
Computing Leverage via SVD
The definition of leverage involves the matrix inverse $(\mathbf{X}^\top\mathbf{X})^{-1}$, which can be numerically unstable when features are highly correlated. A more reliable approach — and one that provides additional geometric insight — is to compute leverage through the thin SVD of the design matrix:
$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$$
where $\mathbf{U}$ is $n \times p$ (orthonormal columns), $\boldsymbol{\Sigma}$ is $p \times p$ (diagonal), and $\mathbf{V}$ is $p \times p$ (orthogonal). Then:
$$h(x) = \|\boldsymbol{\Sigma}^{-1}\mathbf{V}^\top x\|^2$$

Computing leverage via SVD: rotate into principal component coordinates, scale by inverse singular values, and take the squared norm.
For training points, the leverages are even simpler: $h_i = \|u_i\|^2$, where $u_i$ is the $i$-th row of $\mathbf{U}$. The hat matrix is just $\mathbf{H} = \mathbf{U}\mathbf{U}^\top$.
Leverage tells you how far each point is from the "center" of the training data, accounting for correlations between features. Components aligned with directions of low variance in the training data get amplified; components aligned with high-variance directions get shrunk. A point that deviates along a direction the training data rarely explores will have high leverage, even if it looks ordinary in each feature individually.
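A minimal sketch of the SVD route (synthetic data; `leverage` is a hypothetical helper, not a library function): training leverages come from the row norms of $\mathbf{U}$, and a new point's leverage is computed without ever forming $(\mathbf{X}^\top\mathbf{X})^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 4
X = rng.normal(size=(n, p))

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD: X = U @ diag(s) @ Vt

# training leverages: squared row norms of U
h_train = np.sum(U**2, axis=1)
H = X @ np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(h_train, np.diag(H))

def leverage(x):
    """Leverage of an arbitrary point: ||Sigma^{-1} V^T x||^2."""
    return np.sum((Vt @ x / s) ** 2)

x_new = rng.normal(size=p)
assert np.isclose(leverage(x_new), x_new @ np.linalg.inv(X.T @ X) @ x_new)
```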
Proofs, Derivations, and the Road to Prediction Variance
Full SVD Derivation
This section fills in the algebraic details that the previous section summarized. Starting from the thin SVD $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, we derive the leverage formula step by step.
First, compute $\mathbf{X}^\top\mathbf{X}$:
$$\mathbf{X}^\top\mathbf{X} = (\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top)^\top (\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top) = \mathbf{V}\boldsymbol{\Sigma}\mathbf{U}^\top\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top = \mathbf{V}\boldsymbol{\Sigma}^2\mathbf{V}^\top$$
where we used $\mathbf{U}^\top\mathbf{U} = \mathbf{I}_p$ (orthonormal columns of the thin SVD). Inverting:
$$(\mathbf{X}^\top\mathbf{X})^{-1} = \mathbf{V}\boldsymbol{\Sigma}^{-2}\mathbf{V}^\top$$
Now the leverage of an arbitrary point $x$:
$$h(x) = x^\top(\mathbf{X}^\top\mathbf{X})^{-1}x = x^\top\mathbf{V}\boldsymbol{\Sigma}^{-2}\mathbf{V}^\top x = \|\boldsymbol{\Sigma}^{-1}\mathbf{V}^\top x\|^2$$
The last step uses the identity $a^\top a = \|a\|^2$ with $a = \boldsymbol{\Sigma}^{-1}\mathbf{V}^\top x$.
For training points, $x = X_i$ (the $i$-th row of $\mathbf{X}$, transposed). Since $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, the $i$-th row of $\mathbf{X}$ is $X_i^\top = u_i^\top \boldsymbol{\Sigma}\mathbf{V}^\top$, so $\mathbf{V}^\top X_i = \boldsymbol{\Sigma} u_i$. Substituting:
$$h_i = \|\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma} u_i\|^2 = \|u_i\|^2$$where $u_i$ is the $i$-th row of $\mathbf{U}$. This is a convenient result: all training leverages fall out of the SVD with no additional work.
Proof of Idempotence
The hat matrix is idempotent ($\mathbf{H}^2 = \mathbf{H}$):
$$\mathbf{H}^2 = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \cdot \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\underbrace{(\mathbf{X}^\top\mathbf{X})(\mathbf{X}^\top\mathbf{X})^{-1}}_{\mathbf{I}_p}\mathbf{X}^\top = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{H}$$This confirms that $\mathbf{H}$ is a projection matrix — it projects the response vector onto the column space of the design matrix. Idempotence has a direct geometric meaning: once you have projected $\mathbf{Y}$ onto the column space of $\mathbf{X}$, projecting again does nothing, because the result is already in that subspace. This is the algebraic way of saying "the closest point in a subspace is already in the subspace."
Proof of the Leverage Bounds
The bound $0 \leq h_i \leq 1$ for training points follows directly from idempotence. Since $\mathbf{H}^2 = \mathbf{H}$, looking at the $(i,i)$ entry:
$$h_i = (\mathbf{H}^2)_{ii} = \sum_{j=1}^n H_{ij}^2 \geq H_{ii}^2 = h_i^2$$
Therefore $h_i \geq h_i^2$, which gives $h_i(1 - h_i) \geq 0$. Since $h_i = \sum_j H_{ij}^2 \geq 0$, we must have $0 \leq h_i \leq 1$.
Note that both bounds can be achieved: $h_i = 0$ only if $X_i = 0$ (the point sits exactly at the origin), and $h_i = 1$ exactly when deleting point $i$ leaves a rank-deficient design matrix, i.e. the remaining training points no longer span all $p$ directions. In that case the entire $i$-th row of $\mathbf{I} - \mathbf{H}$ is zero, so the fit passes through $Y_i$ exactly and the residual at point $i$ is identically zero.
Test Points: Beyond the [0,1] Bound
For a test point $x$ not in the training set, the leverage $h(x) = x^\top(\mathbf{X}^\top\mathbf{X})^{-1}x$ can exceed 1. The bound $h_i \leq 1$ relied on $H_{ii}$ being a diagonal entry of an idempotent matrix, which only applies to the $n$ training points that define $\mathbf{H}$. A new point $x$ is not constrained by this.
When $h(x) > 1$, the model is extrapolating well beyond the training data: the test point is farther from the training centroid (in Mahalanobis distance) than any training point. Predictions at such points should be treated with caution, as the model has little basis for its estimate.
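A quick sketch makes the point (synthetic data): every training leverage stays at or below 1, but a test point placed far outside the cloud can easily exceed it.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 80, 3
X = rng.normal(size=(n, p))
XtX_inv = np.linalg.inv(X.T @ X)

h_train = np.einsum('ij,jk,ik->i', X, XtX_inv, X)      # x_i^T (X^T X)^{-1} x_i for each row
print("max training leverage:", h_train.max())         # at most 1

x_far = 30.0 * np.ones(p)                               # deep extrapolation
print("test-point leverage:", x_far @ XtX_inv @ x_far)  # typically far above 1
```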
Prediction Variance: Putting It Together
We now arrive at the result that motivates the rest of this series. The question is: how uncertain is the model's prediction at a new test point? The answer decomposes into two independent sources of randomness. Under the linear model $Y = X^\top\beta^* + \varepsilon$ with $\text{Var}(\varepsilon) = \sigma^2$ and a new test point $(x, Y_{\text{new}})$ independent of the training data:
$$\text{Var}(Y_{\text{new}} - \hat{Y}(x)) = \text{Var}(Y_{\text{new}}) + \text{Var}(\hat{Y}(x)) = \sigma^2 + \sigma^2 h(x) = \sigma^2(1 + h(x))$$
where $\text{Var}(\hat{Y}(x)) = x^\top \text{Var}(\hat{\beta}) x = \sigma^2 x^\top(\mathbf{X}^\top\mathbf{X})^{-1}x = \sigma^2 h(x)$, and the cross-term vanishes because $Y_{\text{new}}$ is independent of $\hat{\beta}$. The first term, $\sigma^2$, is irreducible noise — even with a perfect model, the new observation has random variation. The second term, $\sigma^2 h(x)$, is estimation uncertainty — the model's coefficients are themselves estimated from noisy data, and that estimation is less precise in directions where the training data are sparse.
The full mathematical pipeline: from the design matrix to the hat matrix, to leverage scores, to prediction variance, and finally to prediction interval widths.
This is the central result: leverage directly controls prediction variance. At the centroid of the training data ($h \approx 0$), the prediction variance is approximately $\sigma^2$ — just the irreducible noise. At a high-leverage point, prediction variance grows because the coefficient estimates are uncertain in that direction.
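As a sketch of how this would be used (under the post's linear-model assumptions, and with $\sigma$ treated as known for simplicity), the normal-theory prediction interval at a point $x$ has half-width $z \cdot \sigma\sqrt{1 + h(x)}$, so it automatically widens with leverage:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 100, 3, 0.5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
Y = X @ beta + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]

def prediction_interval(x, z=1.96):           # ~95% coverage under normal errors
    h = x @ XtX_inv @ x                       # leverage of the test point
    half_width = z * sigma * np.sqrt(1.0 + h)
    center = x @ beta_hat
    return center - half_width, center + half_width

print(prediction_interval(np.zeros(p)))       # near the centroid: narrow
print(prediction_interval(5.0 * np.ones(p)))  # far from the centroid: wider
```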
There is a subtlety that complicates the connection between leverage and conformal prediction. For training residuals, $\text{Var}(Y_i - \hat{Y}_i) = \sigma^2(1 - h_i)$ — high-leverage training points have smaller residuals. But for test prediction errors, $\text{Var}(Y_{\text{new}} - \hat{Y}(x)) = \sigma^2(1 + h(x))$ — high-leverage test points have larger errors. The sign flips! This training-test mismatch is the subject of Part 6.
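A small Monte Carlo sketch illustrates the flip (synthetic fixed design, true coefficients set to zero for simplicity): at a high-leverage point, the training residual is less variable than $\sigma$, while the error on a fresh response at the same features is more variable.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, sigma, reps = 60, 3, 1.0, 20000
X = rng.normal(size=(n, p))
X[0] *= 4.0                                       # make point 0 high-leverage
H = X @ np.linalg.inv(X.T @ X) @ X.T
h0 = H[0, 0]

train_resid, test_err = [], []
for _ in range(reps):
    Y = rng.normal(scale=sigma, size=n)           # responses under beta = 0
    Y_hat = H @ Y
    train_resid.append(Y[0] - Y_hat[0])           # residual at the training point
    y_new = rng.normal(scale=sigma)               # fresh response at the same features
    test_err.append(y_new - Y_hat[0])

print("h_0:", round(h0, 3))
print("Var(train residual):", np.var(train_resid), "vs sigma^2 (1 - h_0):", sigma**2 * (1 - h0))
print("Var(test error):    ", np.var(test_err), "vs sigma^2 (1 + h_0):", sigma**2 * (1 + h0))
```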
Computational Cost
A natural concern is whether computing leverage scores adds significant overhead to the regression pipeline. It does not. The thin SVD of the $n \times p$ design matrix costs $O(np^2)$ for $n > p$, or $O(n^2 p)$ for $n < p$. Once the SVD is computed, the leverage of any new test point costs $O(p)$ (one matrix-vector product with $\mathbf{V}$, one scaling by $\boldsymbol{\Sigma}^{-1}$, and one inner product). For training points, all $n$ leverages come for free as $\|u_i\|^2$ from the rows of $\mathbf{U}$. In practice, the SVD is typically already computed as part of the regression fit, so leverage scores cost essentially nothing extra.
Further Reading
- Hoaglin, D. C. & Welsch, R. E. (1978). The hat matrix in regression and ANOVA. The American Statistician, 32(1), 17–22.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapter 3 on linear regression.
- Strang, G. (2019). Linear Algebra and Its Applications, 5th ed. Cengage. For the SVD and projection matrix properties.