Leverage and Influence

Part 5 of a 10-part series on prediction intervals, conformal prediction, and leverage scores.

This is part of the series From Predictions to Prediction Intervals.

In Part 4, we defined leverage scores as the diagonal of the hat matrix and showed that they measure how far each point sits from the center of the training data in Mahalanobis distance. In this post, we explore the classical use of leverage in regression diagnostics -- a rich tradition dating back to the late 1970s that the machine learning community has largely overlooked.

The concepts here are classical but important: leverage, influence, and Cook's distance together form a diagnostic toolkit that provides a clear way to understand which data points matter most and why. We present this material at three levels of depth.

Intuitive

The Big Picture: Unusual Points and Their Power

Two Types of Unusual Data Points

There are two distinct ways a data point can be "unusual," and it is worth separating them carefully. Confusing the two is a common source of error in regression diagnostics.

Type 1: The Leverage Point. This is a point that sits in an unusual location in feature space but has a normal outcome. It is "different" in where it lives, not in what it does. Think of it as someone who lives in a remote area but behaves like everyone else. Their position is unusual, but their behavior is not. In a regression setting, such a point lies far from the centroid of the training features, yet its response falls right on (or near) the fitted line.

Type 2: The Influential Point. This is a point that sits in an unusual location and has an unusual outcome. Because of its position, it has the ability to change the fitted model. Remove it, and the fitted coefficients shift noticeably. This is the case that warrants closer inspection. A concrete example: if a dataset of house prices has one mansion priced well below market value, that single data point can tilt the entire regression surface because there are few other data points nearby to counterbalance it.

Analogy: The City Council Vote

Imagine a city council making a decision by majority vote. A council member representing a remote, sparsely populated district has high leverage -- their district is geographically unusual, giving them a unique position. If this remote council member votes with the majority (low residual), nothing changes. The decision would have been the same without them. But if this remote council member votes against everyone else (high leverage + high residual), they can swing the entire decision. That is influence.

The 2x2 Grid

Every data point falls into one of four categories, based on two questions: Is its position unusual? Is its outcome unusual? The following table summarizes the four cases. In practice, most points land in the upper-left cell; the lower-right cell is the one that demands attention.

|  | Low Residual (normal outcome) | High Residual (unusual outcome) |
|---|---|---|
| Low Leverage (normal position) | Normal. Typical point, well-fit by the model. These make up the bulk of most datasets. | Outlier. Unusual response, but its position gives it little power to distort the fit. Worth noting, but typically harmless to the overall model. |
| High Leverage (unusual position) | Leverage Point. Unusual features, but the model fits it well. Potentially helpful -- it provides information in a sparse region. | Influential Point. Unusual features AND unusual response. This point pulls the fitted surface toward itself and merits careful examination. |

```mermaid
quadrantChart
    title Data Point Classification
    x-axis "Low Leverage" --> "High Leverage"
    y-axis "Low Residual" --> "High Residual"
    quadrant-1 "INFLUENTIAL (Investigate)"
    quadrant-2 "Outlier (Noisy)"
    quadrant-3 "Normal (Typical)"
    quadrant-4 "Leverage Point (Unusual but OK)"
```

The four types of data points, classified by leverage (position unusualness) and residual (outcome unusualness).

Cook's Distance in Plain English

Cook's distance answers a natural question: "If we removed this one data point and refitted the model, how much would all the predictions change?" It quantifies the total shift in the vector of fitted values when a single observation is left out.

A point with high Cook's distance is one whose removal noticeably alters the fitted model. A point with low Cook's distance can be removed with negligible effect. What makes Cook's distance useful is that it captures both leverage and residual in a single number -- you need both an unusual position and an unusual outcome to have high influence. In practice, a common rule of thumb is that a Cook's distance exceeding $4/n$ deserves a closer look; values above 1 indicate that the point is shifting the fitted surface by roughly the width of a joint confidence region for the coefficients.

How a Data Point Gets Classified

```mermaid
flowchart TD
    A["Observe data point"] --> B["Compute leverage hᵢ"]
    B --> C["Compute residual eᵢ"]
    C --> D{"hᵢ > threshold?"}
    D -- "No" --> E{"Large |eᵢ|?"}
    D -- "Yes" --> F{"Large |eᵢ|?"}
    E -- "No" --> G["Normal Point"]
    E -- "Yes" --> H["Outlier"]
    F -- "No" --> I["Leverage Point"]
    F -- "Yes" --> J["INFLUENTIAL"]
    style G fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style H fill:#fff3e0,stroke:#e65100,color:#1c1917
    style I fill:#e3f2fd,stroke:#1565c0,color:#1c1917
    style J fill:#fce4ec,stroke:#c62828,color:#1c1917
```

Flowchart: classifying a data point by computing its leverage and residual.

A Brief History

These ideas have a long history. The key milestones:

```mermaid
timeline
    title Leverage and Influence: A Timeline
    1977 : Cook introduces Cook's distance
    1978 : Hoaglin and Welsch formalize the leverage-residual plot
    1986 : Chatterjee and Hadi publish an influential survey
    2006 : Drineas, Mahoney, and Muthukrishnan introduce randomized algorithms for leverage
    2011 : Mahoney publishes a monograph connecting leverage to randomized linear algebra
    2019+ : ML community rediscovers leverage for data valuation, active learning, and uncertainty
```

Nearly five decades of leverage and influence, from classical diagnostics to modern ML.

Why ML Forgot

The machine learning community largely set aside leverage and influence diagnostics starting in the 1990s. There are understandable reasons:

  • Nonlinear models. The hat matrix is defined for linear regression. Random forests and neural networks do not have one, and many practitioners concluded that leverage was therefore irrelevant to their work.
  • High dimensions. When the number of features $p$ is large relative to the number of data points $n$, every point has high leverage (the average leverage is $p/n$). The concept seems to lose its discriminative power.
  • Focus on prediction accuracy. ML evaluates models on held-out accuracy. Leverage is a training-set diagnostic, which feels like it belongs to a different paradigm.

Why It Still Matters

Each of these objections has a natural counterargument:

  • Leverage measures data geometry, which is model-independent. You can compute leverage on the raw feature matrix regardless of whether you fit a linear model, a random forest, or a neural network. The geometry of the data does not change with the model. In practice, even for nonlinear models, the leverage of the input features often correlates well with the difficulty of prediction at that point.
  • In high dimensions, leverage variation -- not the average level -- contains useful signal. Even when the average leverage is high, some points have much higher leverage than others, and those differences can be informative about which regions of feature space are well-covered by training data.
  • Leverage is not just a training diagnostic. As we showed in Part 4 and will develop further in Part 6, leverage directly controls prediction uncertainty. It is a key ingredient for constructing adaptive prediction intervals.

Analogy

A leverage point is like a swing voter in a swing state -- their position gives them disproportionate power, whether they use it or not. An influential point is a swing voter in a swing state who actually votes against the majority. The position creates the potential for influence; the unusual outcome activates it.

Technical

Diagnostic Tools: Plots, Distances, and Algorithms

The Leverage-Residual Plot (Hoaglin & Welsch, 1978)

The leverage-residual plot is a standard tool of regression diagnostics, introduced by Hoaglin and Welsch in 1978 and still widely used today. For each training point $i$, plot the leverage $h_i$ on the horizontal axis against the standardized residual on the vertical axis:

$$r_i = \frac{e_i}{s\sqrt{1 - h_i}}$$

where $e_i = Y_i - \hat{Y}_i$ is the raw residual and $s^2$ is the estimated residual variance. The denominator $s\sqrt{1-h_i}$ accounts for the fact that high-leverage points tend to have smaller residuals (the model is pulled toward them), so standardizing by $\sqrt{1-h_i}$ puts all residuals on a comparable scale.

Points in the upper-right corner of this plot are the influential ones: high leverage and large standardized residual. In practice, one typically overlays reference lines at $h = 2p/n$ (twice the average leverage) on the horizontal axis and at $|r_i| = 2$ or $3$ on the vertical axis. Points beyond both thresholds are candidates for further investigation. It is also common to annotate the plot with contours of constant Cook's distance, which curve from upper-left to lower-right; points lying beyond the $D = 4/n$ contour are those whose removal would noticeably shift the fitted model.
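
To make the plot concrete, here is a minimal sketch on synthetic data using statsmodels' influence diagnostics; the library choice, the planted point, and the reference lines are illustrative assumptions, not part of the series' running example.

```python
# A minimal leverage-residual plot on synthetic data; everything here is illustrative.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
X[0] += 6.0                                  # plant one high-leverage point
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
y[0] += 8.0                                  # ...and give it a large residual

fit = sm.OLS(y, sm.add_constant(X)).fit()
infl = fit.get_influence()
h = infl.hat_matrix_diag                     # leverage h_i
r = infl.resid_studentized_internal          # standardized residual r_i
k = d + 1                                    # p in the text: parameters incl. intercept

plt.scatter(h, r, s=12)
plt.axvline(2 * k / n, linestyle="--")       # rough 2p/n leverage threshold
plt.axhline(2, linestyle="--"); plt.axhline(-2, linestyle="--")
plt.xlabel("leverage h_i"); plt.ylabel("standardized residual r_i")
plt.show()
```

The planted observation should appear alone in the upper-right corner, beyond both reference lines.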

Cook's Distance: The Product of Two Terms

Cook's distance measures the total shift in all fitted values when observation $i$ is deleted:

$$D_i = \frac{(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)})^\top(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)})}{p \cdot s^2} = \frac{e_i^2}{p \cdot s^2} \cdot \frac{h_i}{(1-h_i)^2}$$

The factorization on the right is the key insight. Cook's distance is the product of two terms:

```mermaid
flowchart LR
    CD["Cook's Distance Dᵢ"] --> RT["Residual Term: eᵢ² / (p · s²)"]
    CD --> LT["Leverage Term: hᵢ / (1 − hᵢ)²"]
    RT --> RX["How far is the point from the fitted surface?"]
    LT --> LX["How much power does the point have to move the surface?"]
    RX --> I["Both large => INFLUENTIAL"]
    LX --> I
    style CD fill:#fce4ec,stroke:#c62828,color:#1c1917
    style I fill:#fce4ec,stroke:#c62828,color:#1c1917
```

Cook's distance decomposes into a residual component and a leverage component. Both must be large for true influence.

A point with a large residual but low leverage produces a moderate Cook's distance -- it is an outlier but not influential. A point with high leverage but a small residual also produces a moderate Cook's distance -- it is a leverage point but the model handles it well. Only the combination of both yields a large Cook's distance. To build concrete intuition: in a dataset with $n = 1000$ and $p = 10$, a point with leverage $h_i = 0.05$ (five times the average) and a standardized residual of $3$ would have a Cook's distance of about $0.047$ (using $D_i = r_i^2 h_i / (p(1-h_i))$), roughly $12\times$ the $4/n = 0.004$ threshold. Removing that single point would shift the entire vector of fitted values by an amount comparable to the residual noise.
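
As a quick sanity check of that arithmetic, substituting the standardized residual into the factorization gives $D_i = r_i^2 h_i / (p(1-h_i))$, which a couple of lines of Python confirm (the numbers are the illustrative ones from the example above):

```python
# Back-of-the-envelope check of the worked example above (illustrative numbers).
n, p = 1000, 10
h, r = 0.05, 3.0                            # leverage and standardized residual
D = (r**2 / p) * h / (1 - h)                # D_i = r_i^2 h_i / (p (1 - h_i))
print(round(D, 3), round(D / (4 / n), 1))   # ~0.047, about 12x the 4/n threshold
```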

Practical Flagging Rules

The classical literature developed several rules of thumb, still in active use:

| Rule | Criterion | Interpretation | Example ($n=1000$, $p=10$) |
|---|---|---|---|
| High leverage | $h_i > 2p/n$ | More than twice the average leverage | $h_i > 0.02$ |
| Very high leverage | $h_i > 0.5$ | Large effect on the fit regardless of residual size | $h_i > 0.5$ |
| Influential (Cook's) | $D_i > 4/n$ | Removal shifts predictions substantially | $D_i > 0.004$ |
| Highly influential | $D_i > 1$ | Removal changes the model by more than a joint confidence region | $D_i > 1$ |

These rules remain in every introductory regression course. Yet they are almost completely absent from the machine learning literature, where models are typically evaluated only on held-out prediction accuracy.
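
As a sketch of how these rules look in code, the helper below flags points using the thresholds in the table; it assumes a fitted statsmodels OLS results object (for example, the `fit` from the plot sketch earlier) and is illustrative rather than canonical.

```python
# Flag points by the classical rules of thumb; `fit` is any fitted statsmodels
# OLS results object (e.g. the one from the leverage-residual plot sketch).
import numpy as np

def flag_points(fit):
    """Return indices flagged by the classical leverage / Cook's distance rules."""
    infl = fit.get_influence()
    h = infl.hat_matrix_diag
    D = infl.cooks_distance[0]                    # (values, p-values) tuple
    n, k = int(fit.nobs), int(fit.df_model) + 1   # k = parameters incl. intercept
    return {
        "high_leverage": np.flatnonzero(h > 2 * k / n),
        "very_high_leverage": np.flatnonzero(h > 0.5),
        "influential": np.flatnonzero(D > 4 / n),
        "highly_influential": np.flatnonzero(D > 1.0),
    }
```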

Randomized SVD for Scalability

Computing exact leverage scores via a full SVD of the $n \times p$ design matrix costs $O(np^2)$. For large-scale datasets, this becomes prohibitive. Randomized algorithms provide a way out.
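
For reference, since the hat matrix is $\mathbf{U}\mathbf{U}^\top$ for a thin SVD $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, exact leverage scores are just squared row norms of $\mathbf{U}$. A small numpy sketch (sizes are arbitrary):

```python
# Exact leverage from a thin SVD: h_i is the squared norm of row i of U.
import numpy as np

rng = np.random.default_rng(0)
n, p = 10_000, 20
X = rng.normal(size=(n, p))

U, _, _ = np.linalg.svd(X, full_matrices=False)   # O(n p^2) when n >> p
h_exact = (U**2).sum(axis=1)
print(h_exact.mean(), p / n)                      # average leverage equals p/n
```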

The approach, developed by Drineas, Mahoney, and Muthukrishnan (2006, 2012), sketches the design matrix $\mathbf{X}$ using a random projection, computes the SVD of the (much smaller) sketch, and uses it to approximate leverage scores. The result is $(1+\varepsilon)$-approximate leverage scores in time $O(np\log p)$.

```mermaid
flowchart LR
    subgraph Exact ["Exact SVD"]
        direction TB
        E1["Full SVD of X (n × p)"] --> E2["Cost: O(np²)"]
        E2 --> E3["Exact leverage scores"]
    end
    subgraph Rand ["Randomized SVD"]
        direction TB
        R1["Random projection of X"] --> R2["SVD of sketch"]
        R2 --> R3["Cost: O(np log p)"]
        R3 --> R4["(1+ε)-approximate leverage"]
    end
    subgraph CW ["Clarkson-Woodruff"]
        direction TB
        C1["Sparse random projection"] --> C2["Exploit sparsity in X"]
        C2 --> C3["Cost: O(nnz(X) + p³)"]
        C3 --> C4["Input-sparsity time"]
    end
    style E2 fill:#fce4ec,stroke:#c62828,color:#1c1917
    style R3 fill:#e3f2fd,stroke:#1565c0,color:#1c1917
    style C3 fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
```

Computational cost comparison: exact SVD, randomized SVD, and Clarkson-Woodruff. For large sparse datasets, input-sparsity time is nearly linear.

These algorithmic advances mean that leverage scores are computable at the scale of modern datasets. The cost is comparable to a handful of passes over the data.

Leverage Beyond Linear Regression

Although leverage is defined via the hat matrix of a linear model, the concept extends naturally to other settings:

Ridge regression. When $p > n$ or the design matrix is ill-conditioned, OLS leverage is not well-defined. Ridge regression replaces $(\mathbf{X}^\top\mathbf{X})^{-1}$ with $(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}$, giving ridge leverage scores:

$$h_\lambda(x) = x^\top (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} x$$

Ridge leverage scores are always well-defined and, for a point $x$, bounded above by $\|x\|^2/\lambda$. As $\lambda \to 0$, they approach OLS leverage (when it exists); as $\lambda \to \infty$, they shrink toward zero.

Kernel methods. In kernel regression or kernel ridge regression, the design matrix lives in a (possibly infinite-dimensional) feature space. Leverage is defined in terms of the kernel matrix:

$$h_i = (\mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1})_{ii}$$

Neural networks. For a neural network, one can define feature-space leverage using the activations of the penultimate layer. If $\Phi(x) \in \mathbb{R}^d$ denotes the penultimate-layer representation, then:

$$h^{NN}(x) = \Phi(x)^\top (\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda \mathbf{I})^{-1} \Phi(x)$$

This measures how unusual $x$ is in the learned representation space, which may be more relevant than the raw feature space. In all cases, the interpretation is the same: leverage measures how unusual a point is relative to the training distribution, in a geometry adapted to the model's representation.

Advanced

Derivations: Cook's Distance, Sherman-Morrison, and Sketching

Full Derivation of Cook's Distance

We derive Cook's distance from scratch using the leave-one-out formula. The key tool is the Sherman-Morrison-Woodbury identity, which allows us to express the leave-one-out estimator $\hat{\beta}_{(i)}$ in closed form without refitting.

Step 1: The Leave-One-Out Coefficient Update

Start with the OLS estimator $\hat{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}$. When we delete observation $i$, the design matrix becomes $\mathbf{X}_{(i)}$ (with row $i$ removed) and the response becomes $\mathbf{Y}_{(i)}$. The leave-one-out estimator is:

$$\hat{\beta}_{(i)} = (\mathbf{X}_{(i)}^\top\mathbf{X}_{(i)})^{-1}\mathbf{X}_{(i)}^\top\mathbf{Y}_{(i)}$$

The Sherman-Morrison identity gives us a way to invert the rank-one-updated matrix:

$$(\mathbf{X}_{(i)}^\top\mathbf{X}_{(i)})^{-1} = (\mathbf{X}^\top\mathbf{X} - X_i X_i^\top)^{-1} = (\mathbf{X}^\top\mathbf{X})^{-1} + \frac{(\mathbf{X}^\top\mathbf{X})^{-1}X_i X_i^\top(\mathbf{X}^\top\mathbf{X})^{-1}}{1 - h_i}$$

After substitution and simplification, the leave-one-out coefficient vector becomes:

$$\hat{\beta}_{(i)} = \hat{\beta} - \frac{(\mathbf{X}^\top\mathbf{X})^{-1}X_i e_i}{1 - h_i}$$

where $e_i = Y_i - X_i^\top\hat{\beta}$ is the raw residual. This formula is worth pausing on: the leave-one-out coefficient change is proportional to the residual $e_i$, amplified by the leverage factor $1/(1-h_i)$, and directed along $(\mathbf{X}^\top\mathbf{X})^{-1}X_i$.
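
The update is easy to check numerically; the sketch below compares it with an explicit refit on random data (all sizes and the deleted index are arbitrary).

```python
# Verify the Sherman-Morrison leave-one-out update against an explicit refit.
import numpy as np

rng = np.random.default_rng(1)
n, p, i = 50, 4, 7                       # delete observation i = 7 (arbitrary)
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
h_i = X[i] @ XtX_inv @ X[i]
e_i = y[i] - X[i] @ beta

beta_loo_formula = beta - XtX_inv @ X[i] * e_i / (1 - h_i)
mask = np.arange(n) != i
beta_loo_refit = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
print(np.allclose(beta_loo_formula, beta_loo_refit))   # True
```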

```mermaid
flowchart TD
    A["Full model: β̂ = (XᵀX)⁻¹XᵀY"] --> B["Delete observation i"]
    B --> C["Sherman-Morrison identity on (XᵀX − xᵢxᵢᵀ)⁻¹"]
    C --> D["Leave-one-out: β̂₍ᵢ₎ = β̂ − (XᵀX)⁻¹xᵢeᵢ / (1 − hᵢ)"]
    D --> E["Prediction shift: Ŷⱼ − Ŷⱼ₍ᵢ₎ = Hⱼᵢeᵢ / (1 − hᵢ)"]
    E --> F["Sum of squared shifts: Σⱼ (Hⱼᵢeᵢ / (1−hᵢ))²"]
    F --> G["= eᵢ²hᵢ / (1 − hᵢ)²"]
    G --> H["Cook's distance: Dᵢ = eᵢ²hᵢ / (p·s²·(1−hᵢ)²)"]
    style A fill:#e3f2fd,stroke:#1565c0,color:#1c1917
    style D fill:#fff3e0,stroke:#e65100,color:#1c1917
    style H fill:#fce4ec,stroke:#c62828,color:#1c1917
```

The logical chain from the full model to Cook's distance, via Sherman-Morrison.

Step 2: The Prediction Shift

The change in the predicted value for observation $j$ when observation $i$ is deleted is:

$$\hat{Y}_j - \hat{Y}_{j,(i)} = X_j^\top(\hat{\beta} - \hat{\beta}_{(i)}) = X_j^\top \frac{(\mathbf{X}^\top\mathbf{X})^{-1}X_i e_i}{1-h_i} = \frac{H_{ji} \cdot e_i}{1 - h_i}$$

where $H_{ji} = X_j^\top(\mathbf{X}^\top\mathbf{X})^{-1}X_i$ is the $(j,i)$ entry of the hat matrix.

Step 3: Summing the Squared Shifts

Cook's distance sums the squared prediction shifts over all observations, normalized by $p \cdot s^2$:

$$D_i = \frac{1}{p \cdot s^2} \sum_{j=1}^n \left(\hat{Y}_j - \hat{Y}_{j,(i)}\right)^2 = \frac{1}{p \cdot s^2} \sum_{j=1}^n \frac{H_{ji}^2 \cdot e_i^2}{(1-h_i)^2}$$

The sum $\sum_j H_{ji}^2$ simplifies because $\mathbf{H}$ is symmetric and idempotent: $\sum_j H_{ji}^2 = (\mathbf{H}^2)_{ii} = H_{ii} = h_i$. Therefore:

$$D_i = \frac{e_i^2}{p \cdot s^2} \cdot \frac{h_i}{(1-h_i)^2}$$

This completes the derivation. The factorization into a residual term and a leverage term emerges naturally from the algebra.
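
The closed form can likewise be checked against brute-force deletion; here is a minimal sketch on random data (sizes are arbitrary).

```python
# Check D_i = e_i^2 h_i / (p s^2 (1 - h_i)^2) against brute-force refits.
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
e = y - X @ beta
s2 = e @ e / (n - p)
D_formula = e**2 * h / (p * s2 * (1 - h)**2)

yhat, D_brute = X @ beta, np.empty(n)
for j in range(n):
    mask = np.arange(n) != j
    beta_j = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    D_brute[j] = np.sum((yhat - X @ beta_j)**2) / (p * s2)

print(np.allclose(D_formula, D_brute))   # True
```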

Ridge Leverage and Its Bounds

When the design matrix is ill-conditioned or $p > n$, the OLS hat matrix is not well-defined. Ridge regression introduces a regularization parameter $\lambda > 0$, which modifies the leverage score to:

$$h_\lambda(x) = x^\top(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}x$$

To see that ridge leverage is bounded by $1/\lambda$, note that $\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I} \succeq \lambda\mathbf{I}$, so $(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1} \preceq \lambda^{-1}\mathbf{I}$, and therefore:

$$h_\lambda(x) = x^\top(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}x \leq \frac{\|x\|^2}{\lambda}$$

In terms of the SVD $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, the ridge leverage scores of the training data are:

$$h_{\lambda,i} = \sum_{k=1}^p \frac{\sigma_k^2}{\sigma_k^2 + \lambda} U_{ik}^2$$

Each singular component is shrunk by the factor $\sigma_k^2/(\sigma_k^2 + \lambda)$, which is close to 1 for large singular values and close to 0 for small ones. Ridge leverage effectively discounts directions in which the data have little variance.
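
The quadratic-form and SVD expressions for ridge leverage agree, which a short numpy sketch confirms (data and $\lambda$ are arbitrary).

```python
# Ridge leverage two ways: direct quadratic form vs. SVD shrinkage factors.
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 200, 5, 10.0
X = rng.normal(size=(n, p))

A_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))
h_direct = np.einsum("ij,jk,ik->i", X, A_inv, X)   # x_i^T (X^T X + lam I)^{-1} x_i

U, s, _ = np.linalg.svd(X, full_matrices=False)
h_svd = (U**2) @ (s**2 / (s**2 + lam))             # sum_k sigma_k^2/(sigma_k^2+lam) U_ik^2

print(np.allclose(h_direct, h_svd))                # True
```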

Kernel Leverage

In kernel methods, we map data to a (possibly infinite-dimensional) feature space via a kernel function $k(\cdot, \cdot)$. The kernel matrix is $\mathbf{K} \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(X_i, X_j)$. The kernel ridge leverage scores are:

$$h_i = (\mathbf{K}(\mathbf{K} + \lambda\mathbf{I})^{-1})_{ii}$$

This can be expressed via the eigendecomposition of $\mathbf{K} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^\top$ as:

$$h_i = \sum_{k=1}^n \frac{\lambda_k}{\lambda_k + \lambda} Q_{ik}^2$$

The effective dimensionality of the kernel feature space is captured by the sum $\sum_k \lambda_k/(\lambda_k + \lambda)$, which plays the role of $p$ in the linear case.
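
A small sketch with an RBF kernel makes this concrete; the bandwidth and $\lambda$ are arbitrary choices, not recommendations.

```python
# Kernel ridge leverage scores with an RBF kernel; all choices are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, lam = 150, 1.0
X = rng.normal(size=(n, 2))

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)                 # RBF kernel matrix

M = K @ np.linalg.inv(K + lam * np.eye(n))
h = np.diag(M)                              # kernel ridge leverage scores
d_eff = np.trace(M)                         # effective dimension = sum of the h_i
print(h.min(), h.max(), d_eff)
```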

Neural Network Leverage

For a neural network with a final linear layer on top of a learned representation $\Phi(x) \in \mathbb{R}^d$, the leverage score is:

$$h^{NN}(x) = \Phi(x)^\top(\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \lambda\mathbf{I})^{-1}\Phi(x)$$

where $\boldsymbol{\Phi} \in \mathbb{R}^{n \times d}$ is the matrix of training representations. This is mathematically identical to ridge leverage, but computed in the learned feature space rather than the raw input space. The key distinction is that $\Phi$ depends on the model parameters, so the leverage scores reflect the geometry that the network has learned to be relevant for the prediction task.
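
A minimal sketch of the computation follows, with a random matrix standing in for the penultimate-layer activations; the representation, sizes, and $\lambda$ are all hypothetical placeholders.

```python
# Feature-space leverage from a penultimate-layer representation Phi; here Phi
# is a random stand-in for actual network activations (purely illustrative).
import numpy as np

rng = np.random.default_rng(5)
n_train, d, lam = 500, 32, 1e-3
Phi_train = rng.normal(size=(n_train, d))          # would be Phi(X_i) in practice
A_inv = np.linalg.inv(Phi_train.T @ Phi_train + lam * np.eye(d))

def nn_leverage(phi_x):
    """Ridge leverage of a new point's representation phi_x = Phi(x)."""
    return phi_x @ A_inv @ phi_x

print(nn_leverage(rng.normal(size=d)))             # larger = more unusual to the model
```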

Randomized Leverage Computation

For very large matrices, even $O(np\log p)$ may be expensive. The Clarkson-Woodruff (2013) sketch uses a sparse random sign matrix $\mathbf{S} \in \mathbb{R}^{m \times n}$ (with exactly one nonzero entry per column) to compute $\mathbf{S}\mathbf{X}$ in $O(\text{nnz}(\mathbf{X}))$ time -- the number of nonzero entries, which is the minimum possible for reading the matrix.

The algorithm proceeds as follows:

  1. Compute the sketch $\mathbf{Y} = \mathbf{S}\mathbf{X}$ in $O(\text{nnz}(\mathbf{X}))$ time.
  2. Compute a QR factorization $\mathbf{Y} = \mathbf{Q}_Y\mathbf{R}$ in $O(mp^2)$ time.
  3. Form $\mathbf{Z} = \mathbf{X}\mathbf{R}^{-1}$ row by row, touching only the nonzeros of $\mathbf{X}$, in $O(\text{nnz}(\mathbf{X}) \cdot p)$ time; a further small random projection of $\mathbf{R}^{-1}$ brings this close to $O(\text{nnz}(\mathbf{X}))$.
  4. The approximate leverage scores are $\tilde{h}_i = \|z_i\|^2$ where $z_i$ is the $i$-th row of $\mathbf{Z}$.

With $m = O(p^2/\varepsilon^2)$, this gives $(1+\varepsilon)$-multiplicative approximations. For sparse matrices where $\text{nnz}(\mathbf{X}) \ll np$, the total cost is, up to logarithmic factors, $O(\text{nnz}(\mathbf{X}) + p^3)$ -- the input-sparsity time result of Clarkson and Woodruff.
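
Below is a minimal sketch of steps 1-4 with a CountSketch-style matrix (one random $\pm 1$ per column), applied to a small dense matrix for illustration; it makes no attempt at the careful input-sparsity bookkeeping of the full algorithm.

```python
# Approximate leverage via a sparse sketch (steps 1-4 above), vs. exact SVD values.
import numpy as np

rng = np.random.default_rng(6)
n, p, m = 20_000, 10, 2_000
X = rng.normal(size=(n, p))

# Step 1: Y = S X without forming S -- each row of X is added, with a random
# sign, into one randomly chosen row of the sketch.
rows = rng.integers(0, m, size=n)
signs = rng.choice([-1.0, 1.0], size=n)
Y = np.zeros((m, p))
np.add.at(Y, rows, signs[:, None] * X)

# Steps 2-4: QR of the sketch, then approximate leverage as row norms of X R^{-1}.
_, R = np.linalg.qr(Y)
Z = np.linalg.solve(R.T, X.T).T              # Z = X R^{-1} without explicit inversion
h_approx = (Z**2).sum(axis=1)

U, _, _ = np.linalg.svd(X, full_matrices=False)
h_exact = (U**2).sum(axis=1)
print(np.max(np.abs(h_approx - h_exact) / h_exact))   # relative error shrinks as m grows
```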

Key Insight

The ability to approximate leverage scores in nearly input-sparsity time means that these classical diagnostics scale to modern dataset sizes. Computing leverage is comparable in cost to reading the data, which substantially weakens the "computational cost" objection to leverage-based methods in ML.

The Deeper Story

We have presented leverage and influence as diagnostic tools: ways to find unusual, potentially problematic data points. This is their classical role, and it is a valuable one.

But leverage has a much deeper connection to prediction uncertainty. In Part 4, we previewed the formula $\text{Var}(Y_{\text{new}} - \hat{Y}(x)) = \sigma^2(1 + h(x))$. This says that the variance of the prediction error -- the very quantity that prediction intervals need to capture -- is directly controlled by leverage.

In the next post, we examine this connection in detail and identify a subtlety that is easy to miss: the variance of training residuals depends on leverage with the opposite sign from the variance of prediction errors. This "sign flip" has practical consequences for any method that tries to estimate prediction uncertainty from training data.

Further Reading

  • Hoaglin, D. C. & Welsch, R. E. (1978). The hat matrix in regression and ANOVA. The American Statistician, 32(1), 17-22.
  • Chatterjee, S. & Hadi, A. S. (1986). Influential observations, high leverage points, and outliers in linear regression. Statistical Science, 1(3), 379-393.
  • Drineas, P., Mahoney, M. W., & Muthukrishnan, S. (2006). Sampling algorithms for $\ell_2$ regression and applications. Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 1127-1136.
  • Mahoney, M. W. (2011). Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2), 123-224.