The Hat Matrix: A Built-In Diagnostic for Linear Regression
Part 4 of a 10-part series on prediction intervals, conformal prediction, and leverage scores.
We now shift gears. Parts 1–3 covered conformal prediction and its constant-width limitation. This post begins a new track: the geometry of data. We introduce a quantity called the leverage score that measures how unusual each data point is, and a matrix called the hat matrix that encodes these scores for every point simultaneously. By the end of this series, leverage will connect back to conformal prediction in a natural and useful way.
This post is organized into three levels of increasing difficulty. Start wherever you are comfortable.
The Big Picture: What Is Leverage?
A Social Circle Analogy
Imagine you are in a room of 100 people. Most of them are clustered in the center, chatting in groups. But you are standing alone in the far corner. Now someone asks the room a question and averages all the answers. Your answer has disproportionate influence — because you are the only one representing "the corner." If you change your answer, the average shifts noticeably. If someone in the center changes theirs, it barely matters; there are dozens of others nearby giving similar answers.
That, in essence, is leverage — how much influence a single data point has on the model's answer.
Leverage in Machine Learning
The same principle applies when fitting a model. When a model fits data, most training points are "average" — they share similar features with many other points and sit in well-populated regions of the feature space. But some points are unusual: they sit in sparse regions, far from the crowd. These unusual points have high leverage. The model is forced to pay extra attention to them because there is no one else nearby to "vote" on what the prediction should be in that region. As a result, the fitted model surface is disproportionately shaped by these isolated points.
A cluster of ordinary points (low leverage) and one isolated point (high leverage). The model "bends" toward the isolated point because it is the only representative of that region.
From Data to Leverage Scores
Leverage scores summarize which data points have the most influence on the model by virtue of their position in feature space.
Leverage is like being the only voter in your district — your vote counts much more than someone in a crowded district. The model must listen to you because there is no one else nearby to balance you out.
Leverage depends on where you are in feature space, not on your Y value. A point can have a perfectly normal outcome but high leverage. It is entirely about position — how far you are from the "center" of the training data, measured in a way that accounts for correlations between features.
Why Should You Care?
With this intuition in hand, we can ask: does leverage matter in practice? The answer is yes, and it matters in two directions. High-leverage points are not necessarily bad — a point can sit in an unusual region and still follow the true pattern perfectly. But they are influential. If a high-leverage point happens to have an unusual response (maybe it is an error, or just an outlier in Y), it can drag the entire model off course. Conversely, removing a high-leverage point can change the fitted model substantially, while removing a low-leverage point typically changes it very little. Knowing which points have high leverage lets you:
- Identify which training points the model depends on most
- Flag test points where the model is effectively extrapolating
- Understand why prediction intervals should be wider in some regions than others
This last point is the connection to Parts 1–3. Leverage will turn out to be the key to making prediction intervals adaptive — wider where the model is less certain, narrower where it is more certain. But we are getting ahead of ourselves. First, let us see where leverage comes from mathematically.
The Hat Matrix and Leverage Scores
Ordinary Least Squares
Given training data $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ with $X_i \in \mathbb{R}^p$ and $Y_i \in \mathbb{R}$, OLS finds the coefficient vector that minimizes the sum of squared residuals. Solving the normal equations gives the closed-form solution:
$$\hat{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}$$
where $\mathbf{X}$ is the $n \times p$ design matrix (rows are training feature vectors) and $\mathbf{Y}$ is the $n$-vector of responses. The fitted values are:
$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\beta} = \underbrace{\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top}_{\mathbf{H}} \mathbf{Y}$$

Defining the Hat Matrix
The matrix that transforms $\mathbf{Y}$ into fitted values $\hat{\mathbf{Y}}$ is the hat matrix:
$$\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$$
It is called the hat matrix because it "puts the hat on Y": $\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$. Geometrically, this is an orthogonal projection. The response vector $\mathbf{Y}$ lives in $\mathbb{R}^n$, and the columns of $\mathbf{X}$ span a $p$-dimensional subspace within that space. The hat matrix finds the point in this subspace that is closest (in Euclidean distance) to $\mathbf{Y}$. The residual vector $\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}}$ is perpendicular to the column space of $\mathbf{X}$, which is exactly the condition that defines OLS.
The hat matrix H projects Y onto the column space of X. The complement (I - H) projects onto the orthogonal complement, giving the residuals.
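To make the projection concrete, here is a minimal numpy sketch (synthetic data; the variable names are purely illustrative) that forms $\mathbf{H}$ explicitly and checks that $\mathbf{H}\mathbf{Y}$ reproduces the OLS fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 features
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Hat matrix H = X (X^T X)^{-1} X^T (fine at this scale; prefer QR/SVD for large problems)
H = X @ np.linalg.inv(X.T @ X) @ X.T

beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]  # OLS coefficients
assert np.allclose(H @ Y, X @ beta_hat)          # H "puts the hat on Y"
```

Note that $\mathbf{H}$ is an $n \times n$ matrix, so forming it explicitly only makes sense for small datasets; the SVD route later in this post avoids building it at all.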
Properties of the Hat Matrix
Because $\mathbf{H}$ is a projection matrix, it inherits several useful algebraic properties. These are not just formal curiosities — each one has a concrete statistical consequence, as the table below summarizes.
| Property | Statement | Meaning |
|---|---|---|
| Symmetric | $\mathbf{H} = \mathbf{H}^\top$ | The influence of point $i$ on prediction $j$ equals the influence of $j$ on $i$ |
| Idempotent | $\mathbf{H}^2 = \mathbf{H}$ | H is a projection matrix — projecting twice is the same as projecting once |
| Trace | $\text{tr}(\mathbf{H}) = p$ | The total leverage across all training points equals the number of features |
| Complement | $(\mathbf{I} - \mathbf{H})^2 = \mathbf{I} - \mathbf{H}$ | The residual-maker matrix is also a projection |
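If you would rather see these properties numerically than take them on faith, a short sketch like the following (synthetic design matrix, illustrative only) confirms all four:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 4
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(n)

assert np.allclose(H, H.T)                    # symmetric
assert np.allclose(H @ H, H)                  # idempotent: projecting twice = projecting once
assert np.isclose(np.trace(H), p)             # total leverage equals the number of columns
assert np.allclose((I - H) @ (I - H), I - H)  # the residual-maker is also a projection
```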
Leverage Scores: The Diagonal of H
The diagonal entries of $\mathbf{H}$ are the leverage scores:
$$h_i = H_{ii} = X_i^\top(\mathbf{X}^\top\mathbf{X})^{-1}X_i$$

Self-Influence
The leverage score has a direct interpretation as self-influence. Since $\hat{Y}_i = \sum_{j=1}^n H_{ij} Y_j$, we can decompose:
$$\hat{Y}_i = h_i Y_i + \sum_{j \neq i} H_{ij} Y_j$$
The leverage $h_i$ is the weight that $Y_i$ receives in its own prediction. If $h_i \approx 1$, the model essentially passes through the observed value — the point "leverages" the fit completely, and the residual at that point will be near zero regardless of the actual Y value. If $h_i \approx 0$, the point has negligible influence on its own fitted value; its prediction is almost entirely determined by the other training points. Most points fall somewhere in between, with a typical leverage near the average $p/n$.
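The decomposition is easy to check numerically. A small sketch (synthetic data; the index $i = 7$ is arbitrary) splits the fitted value at one training point into its own-weight term and the contribution of everyone else:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)
H = X @ np.linalg.inv(X.T @ X) @ X.T

i = 7
yhat_i = H[i, :] @ Y                                    # fitted value at point i
own = H[i, i] * Y[i]                                    # weight on its own response: h_i * Y_i
rest = sum(H[i, j] * Y[j] for j in range(n) if j != i)  # everyone else's contribution
assert np.isclose(yhat_i, own + rest)
print(f"h_{i} = {H[i, i]:.3f}")                         # the self-influence weight
```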
Properties of Leverage Scores
| Property | Formula | Explanation |
|---|---|---|
| Bounded | $0 \leq h_i \leq 1$ | For training points; follows from idempotence of H |
| Average | $\bar{h} = p/n$ | Since $\sum_i h_i = \text{tr}(\mathbf{H}) = p$ |
| Flagging rule | $h_i > 2p/n$ | Classical threshold: leverage more than twice the average is "high" |
| Lower bound | $h_i \geq 1/n$ | If X includes an intercept column |
| Test points | $h(x)$ can exceed 1 | Deep extrapolation — farther from centroid than any training point |
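A short sketch ties the table together: compute the training leverages as the diagonal of $\mathbf{H}$, check the bounds and the $p/n$ average, and apply the $2p/n$ flagging rule (synthetic data, with one point deliberately pushed away from the crowd):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
X[0] *= 6.0                                  # push one point far from the crowd
H = X @ np.linalg.inv(X.T @ X) @ X.T

h = np.diag(H)
assert np.all((h >= 0) & (h <= 1 + 1e-12))   # bounded for training points
assert np.isclose(h.mean(), p / n)           # average leverage is p/n
flagged = np.where(h > 2 * p / n)[0]
print("high-leverage points:", flagged)      # typically includes index 0
```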
Geometric Interpretation: Mahalanobis Distance
The formula $h(x) = x^\top(\mathbf{X}^\top\mathbf{X})^{-1}x$ is, up to a factor of $n$, the squared Mahalanobis distance from $x$ to the training centroid (assuming the features are centered). This is because $\mathbf{X}^\top\mathbf{X}/n$ is the sample covariance of the centered training features, so $(\mathbf{X}^\top\mathbf{X})^{-1}$ is $n$ times its inverse. The level sets of $h(x)$ are ellipsoids in $\mathbb{R}^p$: points on the surface of one such ellipsoid all have the same leverage, so contours of constant leverage are ellipsoids whose axes are aligned with the principal directions of the training data.
Consider a 2D example where the training data form an elongated elliptical cloud (correlated features). A point that is far along the major axis — the direction of greatest spread — has moderate leverage, because many training points already span that direction. But a point that is far along the minor axis has very high leverage, because the training data are tightly concentrated in that direction and the point lies outside the range the data cover well. The Mahalanobis distance accounts for the shape of the data cloud, not just the Euclidean distance. This is why leverage captures something that individual feature values miss: a point can look "normal" in every individual feature but have high leverage because it sits in an unusual combination of feature values.
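Here is a small sketch of that 2D picture (synthetic correlated features; the covariance values are arbitrary): two candidate points sit at the same Euclidean distance from the centroid, one along the major axis of the cloud and one along the minor axis.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
cov = np.array([[4.0, 1.8], [1.8, 1.0]])           # elongated, correlated cloud
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
def leverage(x):
    return x @ XtX_inv @ x

# principal directions of the training cloud
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
major = eigvecs[:, np.argmax(eigvals)]
minor = eigvecs[:, np.argmin(eigvals)]

r = 5.0                                            # same Euclidean distance in both cases
print("leverage along major axis:", leverage(r * major))
print("leverage along minor axis:", leverage(r * minor))  # much larger
```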
Computing Leverage via SVD
The definition of leverage involves the matrix inverse $(\mathbf{X}^\top\mathbf{X})^{-1}$, which can be numerically unstable when features are highly correlated. A more reliable approach — and one that provides additional geometric insight — is to compute leverage through the thin SVD of the design matrix:
$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$$
where $\mathbf{U}$ is $n \times p$ (orthonormal columns), $\boldsymbol{\Sigma}$ is $p \times p$ (diagonal), and $\mathbf{V}$ is $p \times p$ (orthogonal). Then:
$$h(x) = \|\boldsymbol{\Sigma}^{-1}\mathbf{V}^\top x\|^2$$

Computing leverage via SVD: rotate into principal component coordinates, scale by inverse singular values, and take the squared norm.
For training points, the leverages are even simpler: $h_i = \|u_i\|^2$, where $u_i$ is the $i$-th row of $\mathbf{U}$. The hat matrix is just $\mathbf{H} = \mathbf{U}\mathbf{U}^\top$.
Leverage tells you how far each point is from the "center" of the training data, accounting for correlations between features. Components aligned with directions of low variance in the training data get amplified; components aligned with high-variance directions get shrunk. A point that deviates along a direction the training data rarely explores will have high leverage, even if it looks ordinary in each feature individually.
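A minimal sketch of the SVD route (synthetic data; `leverage` is a hypothetical helper, not a library function): training leverages come from the row norms of $\mathbf{U}$, and a new point's leverage is computed without ever forming $(\mathbf{X}^\top\mathbf{X})^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 4
X = rng.normal(size=(n, p))

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD: X = U @ diag(s) @ Vt

# training leverages: squared row norms of U
h_train = np.sum(U**2, axis=1)
H = X @ np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(h_train, np.diag(H))

def leverage(x):
    """Leverage of an arbitrary point: ||Sigma^{-1} V^T x||^2."""
    return np.sum((Vt @ x / s) ** 2)

x_new = rng.normal(size=p)
assert np.isclose(leverage(x_new), x_new @ np.linalg.inv(X.T @ X) @ x_new)
```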
Proofs, Derivations, and the Road to Prediction Variance
Full SVD Derivation
This section fills in the algebraic details that the previous section summarized. Starting from the thin SVD $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, we derive the leverage formula step by step.
First, compute $\mathbf{X}^\top\mathbf{X}$:
$$\mathbf{X}^\top\mathbf{X} = (\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top)^\top (\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top) = \mathbf{V}\boldsymbol{\Sigma}\mathbf{U}^\top\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top = \mathbf{V}\boldsymbol{\Sigma}^2\mathbf{V}^\top$$
where we used $\mathbf{U}^\top\mathbf{U} = \mathbf{I}_p$ (orthonormal columns of the thin SVD). Inverting:
$$(\mathbf{X}^\top\mathbf{X})^{-1} = \mathbf{V}\boldsymbol{\Sigma}^{-2}\mathbf{V}^\top$$
Now the leverage of an arbitrary point $x$:
$$h(x) = x^\top(\mathbf{X}^\top\mathbf{X})^{-1}x = x^\top\mathbf{V}\boldsymbol{\Sigma}^{-2}\mathbf{V}^\top x = \|\boldsymbol{\Sigma}^{-1}\mathbf{V}^\top x\|^2$$
The last step uses the identity $a^\top a = \|a\|^2$ with $a = \boldsymbol{\Sigma}^{-1}\mathbf{V}^\top x$.
For training points, $x = X_i$ (the $i$-th row of $\mathbf{X}$, transposed). Since $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, the $i$-th row of $\mathbf{X}$ is $X_i^\top = u_i^\top \boldsymbol{\Sigma}\mathbf{V}^\top$, so $\mathbf{V}^\top X_i = \boldsymbol{\Sigma} u_i$. Substituting:
$$h_i = \|\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma} u_i\|^2 = \|u_i\|^2$$where $u_i$ is the $i$-th row of $\mathbf{U}$. This is a convenient result: all training leverages fall out of the SVD with no additional work.
Proof of Idempotence
The hat matrix is idempotent ($\mathbf{H}^2 = \mathbf{H}$):
$$\mathbf{H}^2 = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \cdot \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\underbrace{(\mathbf{X}^\top\mathbf{X})(\mathbf{X}^\top\mathbf{X})^{-1}}_{\mathbf{I}_p}\mathbf{X}^\top = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{H}$$This confirms that $\mathbf{H}$ is a projection matrix — it projects the response vector onto the column space of the design matrix. Idempotence has a direct geometric meaning: once you have projected $\mathbf{Y}$ onto the column space of $\mathbf{X}$, projecting again does nothing, because the result is already in that subspace. This is the algebraic way of saying "the closest point in a subspace is already in the subspace."
Proof of the Leverage Bounds
The bound $0 \leq h_i \leq 1$ for training points follows directly from idempotence. Since $\mathbf{H}^2 = \mathbf{H}$, looking at the $(i,i)$ entry:
$$h_i = (\mathbf{H}^2)_{ii} = \sum_{j=1}^n H_{ij}^2 \geq H_{ii}^2 = h_i^2$$
Therefore $h_i \geq h_i^2$, which gives $h_i(1 - h_i) \geq 0$. Since $h_i = \sum_j H_{ij}^2 \geq 0$, we must have $0 \leq h_i \leq 1$.
Note that both bounds can be achieved: $h_i = 0$ only if $X_i = 0$ (the point sits exactly at the origin), and $h_i = 1$ exactly when deleting point $i$ leaves a rank-deficient design matrix, i.e. the remaining training points no longer span all $p$ directions. In that case the entire $i$-th row of $\mathbf{I} - \mathbf{H}$ is zero, so the fit passes through $Y_i$ exactly and the residual at point $i$ is identically zero.
Test Points: Beyond the [0,1] Bound
For a test point $x$ not in the training set, the leverage $h(x) = x^\top(\mathbf{X}^\top\mathbf{X})^{-1}x$ can exceed 1. The bound $h_i \leq 1$ relied on $H_{ii}$ being a diagonal entry of an idempotent matrix, which only applies to the $n$ training points that define $\mathbf{H}$. A new point $x$ is not constrained by this.
When $h(x) > 1$, the model is extrapolating well beyond the training data: the test point is farther from the training centroid (in Mahalanobis distance) than any training point. Predictions at such points should be treated with caution, as the model has little basis for its estimate.
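A quick sketch makes the point (synthetic data): every training leverage stays at or below 1, but a test point placed far outside the cloud can easily exceed it.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 80, 3
X = rng.normal(size=(n, p))
XtX_inv = np.linalg.inv(X.T @ X)

h_train = np.einsum('ij,jk,ik->i', X, XtX_inv, X)      # x_i^T (X^T X)^{-1} x_i for each row
print("max training leverage:", h_train.max())         # at most 1

x_far = 30.0 * np.ones(p)                               # deep extrapolation
print("test-point leverage:", x_far @ XtX_inv @ x_far)  # typically far above 1
```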
Prediction Variance: Putting It Together
We now arrive at the result that motivates the rest of this series. The question is: how uncertain is the model's prediction at a new test point? The answer decomposes into two independent sources of randomness. Under the linear model $Y = X^\top\beta^* + \varepsilon$ with $\text{Var}(\varepsilon) = \sigma^2$ and a new test point $(x, Y_{\text{new}})$ independent of the training data:
$$\text{Var}(Y_{\text{new}} - \hat{Y}(x)) = \text{Var}(Y_{\text{new}}) + \text{Var}(\hat{Y}(x)) = \sigma^2 + \sigma^2 h(x) = \sigma^2(1 + h(x))$$
where $\text{Var}(\hat{Y}(x)) = x^\top \text{Var}(\hat{\beta}) x = \sigma^2 x^\top(\mathbf{X}^\top\mathbf{X})^{-1}x = \sigma^2 h(x)$, and the cross-term vanishes because $Y_{\text{new}}$ is independent of $\hat{\beta}$. The first term, $\sigma^2$, is irreducible noise — even with a perfect model, the new observation has random variation. The second term, $\sigma^2 h(x)$, is estimation uncertainty — the model's coefficients are themselves estimated from noisy data, and that estimation is less precise in directions where the training data are sparse.
The full mathematical pipeline: from the design matrix to the hat matrix, to leverage scores, to prediction variance, and finally to prediction interval widths.
This is the central result: leverage directly controls prediction variance. At the centroid of the training data ($h \approx 0$), the prediction variance is approximately $\sigma^2$ — just the irreducible noise. At a high-leverage point, prediction variance grows because the coefficient estimates are uncertain in that direction.
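As a sketch of how this would be used (under the post's linear-model assumptions, and with $\sigma$ treated as known for simplicity), the normal-theory prediction interval at a point $x$ has half-width $z \cdot \sigma\sqrt{1 + h(x)}$, so it automatically widens with leverage:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 100, 3, 0.5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
Y = X @ beta + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]

def prediction_interval(x, z=1.96):           # ~95% coverage under normal errors
    h = x @ XtX_inv @ x                       # leverage of the test point
    half_width = z * sigma * np.sqrt(1.0 + h)
    center = x @ beta_hat
    return center - half_width, center + half_width

print(prediction_interval(np.zeros(p)))       # near the centroid: narrow
print(prediction_interval(5.0 * np.ones(p)))  # far from the centroid: wider
```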
There is a subtlety that complicates the connection between leverage and conformal prediction. For training residuals, $\text{Var}(Y_i - \hat{Y}_i) = \sigma^2(1 - h_i)$ — high-leverage training points have smaller residuals. But for test prediction errors, $\text{Var}(Y_{\text{new}} - \hat{Y}(x)) = \sigma^2(1 + h(x))$ — high-leverage test points have larger errors. The sign flips! This training-test mismatch is the subject of Part 6.
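A small Monte Carlo sketch illustrates the flip (synthetic fixed design, true coefficients set to zero for simplicity): at a high-leverage point, the training residual is less variable than $\sigma$, while the error on a fresh response at the same features is more variable.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, sigma, reps = 60, 3, 1.0, 20000
X = rng.normal(size=(n, p))
X[0] *= 4.0                                       # make point 0 high-leverage
H = X @ np.linalg.inv(X.T @ X) @ X.T
h0 = H[0, 0]

train_resid, test_err = [], []
for _ in range(reps):
    Y = rng.normal(scale=sigma, size=n)           # responses under beta = 0
    Y_hat = H @ Y
    train_resid.append(Y[0] - Y_hat[0])           # residual at the training point
    y_new = rng.normal(scale=sigma)               # fresh response at the same features
    test_err.append(y_new - Y_hat[0])

print("h_0:", round(h0, 3))
print("Var(train residual):", np.var(train_resid), "vs sigma^2 (1 - h_0):", sigma**2 * (1 - h0))
print("Var(test error):    ", np.var(test_err), "vs sigma^2 (1 + h_0):", sigma**2 * (1 + h0))
```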
Computational Cost
A natural concern is whether computing leverage scores adds significant overhead to the regression pipeline. It does not. The thin SVD of the $n \times p$ design matrix costs $O(np^2)$ for $n > p$, or $O(n^2 p)$ for $n < p$. Once the SVD is computed, the leverage of any new test point costs $O(p)$ (one matrix-vector product with $\mathbf{V}$, one scaling by $\boldsymbol{\Sigma}^{-1}$, and one inner product). For training points, all $n$ leverages come for free as $\|u_i\|^2$ from the rows of $\mathbf{U}$. In practice, the SVD is typically already computed as part of the regression fit, so leverage scores cost essentially nothing extra.
Further Reading
- Hoaglin, D. C. & Welsch, R. E. (1978). The hat matrix in regression and ANOVA. The American Statistician, 32(1), 17–22.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapter 3 on linear regression.
- Strang, G. (2019). Linear Algebra and Its Applications, 5th ed. Cengage. For the SVD and projection matrix properties.