Leverage in High Dimensions: Ridge, Kernel, and Neural Network Leverage

Part 16 of a series on prediction intervals, conformal prediction, and leverage scores.

This is part of the series From Predictions to Prediction Intervals.

In Parts 4–5, we defined leverage for the classical setting: $n$ observations, $p$ features, $n > p$, and a well-conditioned design matrix. The leverage of a point $x$ was $h(x) = x^T(X^TX)^{-1}x$, and it told us how much influence that point has on the fitted model. The formula was clean, the intuition was geometric, and everything worked because $X^TX$ was invertible.

But modern machine learning regularly operates outside this regime. We may have more features than observations ($p > n$), making $X^TX$ singular. We may use kernel methods where the effective feature space is infinite-dimensional. Or we may train neural networks where the "features" are learned representations that change during training. In each of these settings, the classical leverage formula breaks down — not because the concept of leverage loses meaning, but because the specific matrix inverse $(X^TX)^{-1}$ ceases to exist.

Does leverage still make sense when the classical formula fails? The answer is yes — but the definition needs to be extended. This post covers three such extensions: ridge leverage, which regularizes the matrix to make it invertible; kernel leverage, which operates in a reproducing kernel Hilbert space; and neural network leverage, which uses learned representations from the last hidden layer. These three extensions turn out to be special cases of a single abstract formula, differing only in their choice of feature map and regularization strength.

Intuitive

When the Classical Formula Breaks

The Problem with $p > n$

Recall the classical leverage formula: $h(x) = x^T(X^TX)^{-1}x$. This requires inverting the $p \times p$ matrix $X^TX$. For that inverse to exist, $X^TX$ must have full rank, which means the design matrix $X$ must have at least $p$ linearly independent rows. When $n < p$ — fewer observations than features — the matrix $X^TX$ is singular, and the formula is undefined.

This is not an obscure edge case. In genomics, $p$ might be 20,000 genes and $n$ might be 200 patients. In natural language processing, a bag-of-words representation can easily have $p = 50{,}000$ features with $n = 5{,}000$ documents. In brain imaging, a single fMRI volume has hundreds of thousands of voxels. Whenever $p > n$, the classical leverage formula simply does not work.

Analogy

Think of triangulation. To determine your position on Earth's surface (a 2D problem), you need signals from at least 3 satellites. To determine your position in 3D space, you need at least 4. If you have only 2 satellites in 3D, the system is underdetermined — your position could be anywhere along a curve. The classical leverage formula is like a GPS algorithm that assumes you always have enough satellites. When $p > n$, you do not have enough "satellites" (data points) to resolve the "position" (feature-space geometry), and the algorithm fails. Regularization is the fix: it adds prior information (like assuming you are on Earth's surface) to make the problem well-determined again.

Three Extensions, One Idea

The three extensions we will cover all address this problem in different ways, but they share a common structure. Each one defines a feature map $\phi(x)$ and a regularization parameter $\lambda \geq 0$, and computes leverage via the formula:

$$h(x) = \phi(x)^T(\Phi^T\Phi + \lambda I)^{-1}\phi(x)$$

where $\Phi$ is the matrix whose rows are the feature representations of the training points. The term $\lambda I$ makes the matrix $\Phi^T\Phi + \lambda I$ invertible even when $\Phi^T\Phi$ is singular. The three extensions differ in their choice of $\phi$:

  1. Ridge leverage: The feature map is the identity, $\phi(x) = x$. The regularization $\lambda > 0$ is the ridge penalty. This is the simplest extension — you keep the original features but add a stabilizing term. It is the natural choice in any high-dimensional linear setting: genomics, text classification, or any problem where you have many features and want to prevent overfitting.
  2. Kernel leverage: The feature map $\phi(x)$ maps to a potentially infinite-dimensional reproducing kernel Hilbert space (RKHS). You never compute $\phi(x)$ explicitly; instead, you work with the kernel function $K(x, x') = \langle \phi(x), \phi(x') \rangle$. This captures nonlinear structure in the data — a point might have low leverage in input space but high leverage in the feature space induced by the kernel.
  3. Neural network leverage: The feature map is the learned representation from the last hidden layer of a neural network: $\phi(x) = \phi_\theta(x)$, where $\theta$ are the trained network parameters. This captures the geometry the network has learned from the data, rather than any pre-specified geometry.
[Diagram — Classical OLS ($p < n$, no regularization) → Ridge ($p$ can exceed $n$, regularized by $\lambda$) → Kernel (infinite-dimensional RKHS, kernel trick) → Neural Net (learned features, last hidden layer).]

The progression from classical leverage to its modern extensions. Each step generalizes the feature map while maintaining the same algebraic structure.

What Regularization Does to Leverage

In the classical setting ($\lambda = 0$, $p < n$), leverage scores satisfy $\sum_i h_i = p$ and each $h_i \in [0, 1]$. The sum counts the effective number of parameters, and no single point can have leverage exceeding 1.

When we add regularization ($\lambda > 0$), two things happen. First, all leverage scores shrink toward zero. Intuitively, regularization says "I have prior information that the coefficients should be small," which reduces the influence of any single observation. Second, the sum $\sum_i h_i^\lambda$ is no longer equal to $p$ but to a smaller quantity called the effective rank or effective dimension, denoted $p_\lambda$. This reflects the fact that regularization reduces the effective complexity of the model.

As $\lambda \to 0$, ridge leverage recovers the classical leverage (when $X^TX$ is invertible). As $\lambda \to \infty$, all leverage scores go to zero: infinitely strong regularization means the model ignores the data entirely, so no point has any influence. The parameter $\lambda$ controls the smooth transition between these extremes.

Why These Extensions Matter

Leverage scores are not just a theoretical curiosity. They appear in the variance of predictions ($\text{Var}(\hat{y}(x)) \propto 1 + h(x)$), in the construction of prediction intervals, in outlier detection, and in experimental design. When we move to high-dimensional, kernel, or neural network settings, we need leverage scores that are well-defined and computable in those settings. The three extensions in this post provide exactly that.

The key insight is that leverage measures how unusual a point is in the space where the model operates. In a linear model, this is the original feature space. In a kernel model, it is the RKHS. In a neural network, it is the learned representation space. The concept is the same; only the space changes.

The Core Principle

All three extensions — ridge, kernel, and neural network leverage — are instances of a single abstract formula: $h(x) = \phi(x)^T(\Phi^T\Phi + \lambda I)^{-1}\phi(x)$. The feature map $\phi$ determines what space leverage is measured in. The regularization $\lambda$ determines how strongly extreme leverage is dampened. Once you understand this template, you can define leverage for essentially any supervised learning method.
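
To make the template concrete, here is a minimal numpy sketch (the `leverage` helper and all variable names are my own, not from any library): it takes a feature matrix, a query feature vector, and $\lambda$, and returns $h(x)$.

```python
import numpy as np

def leverage(Phi, phi_x, lam):
    """Abstract leverage h(x) = phi(x)' (Phi'Phi + lam*I)^{-1} phi(x).

    Phi   : (n, d) array whose rows are feature vectors of the training points
    phi_x : (d,) feature vector of the query point
    lam   : ridge parameter lambda >= 0 (must be > 0 if Phi'Phi is singular)
    """
    d = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(d)          # regularized Gram matrix
    return float(phi_x @ np.linalg.solve(A, phi_x))

# Toy example: 5 observations, 3 features; identity feature map = ridge leverage.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
print(leverage(X, X[0], lam=1.0))                 # leverage of a training point
print(leverage(X, rng.normal(size=3), lam=1.0))   # leverage of a new point
```

Swapping in kernel features or last-layer network activations for the rows of `Phi` gives the other two extensions; that is the whole point of the template.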

Technical

The Three Extensions

Ridge Leverage

Ridge regression replaces the OLS objective $\|Y - X\beta\|^2$ with the penalized objective $\|Y - X\beta\|^2 + \lambda\|\beta\|^2$. The solution is $\hat{\beta}^\lambda = (X^TX + \lambda I)^{-1}X^TY$. Because $\lambda I$ is added to $X^TX$, the matrix $X^TX + \lambda I$ is always invertible, even when $p > n$.

The ridge leverage of a training point $x_i$ is:

$$h_i^\lambda = x_i^T(X^TX + \lambda I)^{-1}x_i$$

For a new (test) point $x$, the ridge leverage is:

$$h^\lambda(x) = x^T(X^TX + \lambda I)^{-1}x$$

This is well-defined for any $\lambda > 0$, regardless of the relationship between $p$ and $n$. The ridge hat matrix is $H^\lambda = X(X^TX + \lambda I)^{-1}X^T$, and the training leverage scores are its diagonal entries: $h_i^\lambda = H^\lambda_{ii}$.
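
As a rough illustration of both formulas, the following numpy sketch (variable names are my own) computes the training leverages as the diagonal of $H^\lambda$ and the leverage of a new point directly. It runs even with $p > n$, since the $\lambda I$ term keeps the matrix invertible.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 50, 200, 10.0           # more features than observations: X'X is singular
X = rng.normal(size=(n, p))

# Training leverages: diagonal of the ridge hat matrix H_lam = X (X'X + lam I)^{-1} X'.
# Solve (X'X + lam I) B = X' for B, then diag(X @ B) gives the leverages
# without ever forming the full n x n hat matrix.
A = X.T @ X + lam * np.eye(p)
B = np.linalg.solve(A, X.T)                      # (p, n)
h_train = np.einsum("ij,ji->i", X, B)            # diag(X @ B)

# Ridge leverage of a new point x: x' (X'X + lam I)^{-1} x
x_new = rng.normal(size=p)
h_new = float(x_new @ np.linalg.solve(A, x_new))

print(h_train[:5], h_new)
# Sanity check: the diagonal of H_lam matches the direct formula x_i'(X'X + lam I)^{-1} x_i
assert np.allclose(h_train, [xi @ np.linalg.solve(A, xi) for xi in X])
```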

Properties of ridge leverage:

  • Range: $h_i^\lambda \in [0, \|x_i\|^2 / \lambda]$. The upper bound decreases with $\lambda$, so strong regularization forces all leverages to be small.
  • Sum (effective rank): $\sum_{i=1}^n h_i^\lambda = \sum_{j=1}^{\min(n,p)} \frac{s_j^2}{s_j^2 + \lambda} \equiv p_\lambda$, where $s_1, \ldots, s_{\min(n,p)}$ are the singular values of $X$. Each term $s_j^2/(s_j^2 + \lambda)$ lies in $[0, 1)$ and acts as a soft threshold: directions with $s_j^2 \gg \lambda$ contribute approximately 1 (they are "kept"), while directions with $s_j^2 \ll \lambda$ contribute approximately 0 (they are "discarded"). The sum $p_\lambda$ counts the effective number of directions the data uses — the effective dimension.
  • Limiting behavior: As $\lambda \to 0^+$ (with $p < n$), $h_i^\lambda \to h_i^{\text{OLS}}$. As $\lambda \to \infty$, $h_i^\lambda \to 0$ for all $i$.

Connection to prediction variance: Under the ridge regression model with homoscedastic errors $Y = X\beta^* + \varepsilon$, $\varepsilon \sim (0, \sigma^2 I)$, the prediction variance at a new point $x$ is:

$$\text{Var}(\hat{y}^\lambda(x)) = \sigma^2\, x^T(X^TX + \lambda I)^{-1}X^TX\,(X^TX + \lambda I)^{-1}x \;\leq\; \sigma^2 h^\lambda(x)$$

and the total mean-squared prediction error includes both this variance term and a bias term (ridge regression introduces bias to reduce variance). The leverage $h^\lambda(x)$ is an upper bound on the variance contribution, and it becomes exact in the limit $\lambda \to 0$ (when $X^TX$ is invertible), where it recovers the OLS variance $\sigma^2 h(x)$.

Kernel Leverage

Kernel methods replace the original features $x \in \mathbb{R}^p$ with a feature map $\phi(x) \in \mathcal{H}$, where $\mathcal{H}$ is a (potentially infinite-dimensional) reproducing kernel Hilbert space. The kernel function is $K(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$. The kernel matrix is $K \in \mathbb{R}^{n \times n}$ with entries $K_{ij} = K(x_i, x_j)$.

Kernel ridge regression minimizes $\|Y - f\|^2 + \lambda \|f\|_{\mathcal{H}}^2$ over functions $f$ in the RKHS. By the representer theorem, the solution is $\hat{f}(x) = k_x^T(K + \lambda I)^{-1}Y$, where $k_x = (K(x, x_1), \ldots, K(x, x_n))^T$ is the vector of kernel evaluations between the test point and the training points.

The kernel leverage of a training point $x_i$ is:

$$h_i^{\text{ker}} = \left(K(K + \lambda I)^{-1}\right)_{ii}$$

For a new point $x$:

$$h^{\text{ker}}(x) = \frac{1}{\lambda}\left(K(x, x) - k_x^T(K + \lambda I)^{-1} k_x\right)$$

Note the structural similarity to ridge leverage. If we define $\Phi$ as the feature matrix whose rows are $\phi(x_1)^T, \ldots, \phi(x_n)^T$ (with possibly infinitely many columns), then $K = \Phi\Phi^T$. Using the identity $(\Phi^T\Phi + \lambda I)^{-1} = \frac{1}{\lambda}\left(I - \Phi^T(\Phi\Phi^T + \lambda I)^{-1}\Phi\right)$, together with the push-through identity $(\Phi^T\Phi + \lambda I)^{-1}\Phi^T = \Phi^T(\Phi\Phi^T + \lambda I)^{-1}$, both expressions above are algebraically equivalent to $h(x) = \phi(x)^T(\Phi^T\Phi + \lambda I)^{-1}\phi(x)$: the training formula because $\Phi(\Phi^T\Phi + \lambda I)^{-1}\Phi^T = K(K + \lambda I)^{-1}$, and the test formula by expanding $\phi(x)^T(\Phi^T\Phi + \lambda I)^{-1}\phi(x)$ with the first identity. The kernel trick allows us to compute both without ever forming $\Phi$ explicitly.

Interpretation: Kernel leverage measures how well a point $x$ is "represented" by the training data in the RKHS. If $x$ lies in a region densely populated by training points (in the kernel geometry), $h^{\text{ker}}(x)$ is small — the point is well-represented and has low influence. If $x$ is in a sparse region, $h^{\text{ker}}(x)$ is large — the prediction at $x$ relies heavily on extrapolation.
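
A minimal numpy sketch of both kernel-leverage formulas on toy data, using a Gaussian RBF kernel (the `rbf_kernel` and `kernel_leverage` helpers and all constants are my own choices): a test point near the training cloud gets a small score, while a far-away point approaches the maximum $K(x,x)/\lambda$.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian RBF kernel matrix: K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(2)
n, lam = 40, 0.1
X = rng.normal(size=(n, 2))              # training inputs clustered around the origin
K = rbf_kernel(X, X)
K_reg_inv = np.linalg.inv(K + lam * np.eye(n))

# Training kernel leverage: diagonal of K (K + lam I)^{-1}; each entry lies in (0, 1)
h_train = np.diag(K @ K_reg_inv)

# Test kernel leverage: (1/lam) * (K(x, x) - k_x' (K + lam I)^{-1} k_x)
def kernel_leverage(x):
    k_x = rbf_kernel(x[None, :], X).ravel()
    k_xx = rbf_kernel(x[None, :], x[None, :])[0, 0]
    return float((k_xx - k_x @ K_reg_inv @ k_x) / lam)

print(h_train[:3])
print(kernel_leverage(np.zeros(2)))      # near the training cloud: small
print(kernel_leverage(np.full(2, 5.0)))  # far from the data: approaches K(x,x)/lam = 1/lam
```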

Computation: Exact computation requires forming and inverting $K + \lambda I$, which costs $O(n^3)$. For large $n$, two approximation strategies are standard:

  • Nyström approximation: Select $m \ll n$ landmark points and approximate $K \approx K_{nm}K_{mm}^{-1}K_{nm}^T$, where $K_{nm}$ is the $n \times m$ kernel matrix between all points and the landmarks, and $K_{mm}$ is the $m \times m$ kernel matrix among landmarks. Cost: $O(nm^2)$.
  • Random Fourier features (Rahimi & Recht, 2007): For shift-invariant kernels like the Gaussian RBF, approximate $\phi(x)$ with a finite-dimensional random feature map $\hat{\phi}(x) \in \mathbb{R}^m$, then compute leverage using the explicit features, as in the sketch after this list. Cost: $O(nm^2)$.
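
Here is a rough sketch of the random-Fourier-features route for the Gaussian kernel $K(x, x') = \exp(-\|x - x'\|^2/2)$ (the bandwidth, the feature count $m$, and the helper names are my own choices, not prescribed by the method): leverage is computed from the explicit $m$-dimensional features, so the dominant cost is $O(nm^2)$ rather than $O(n^3)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m, lam = 500, 2, 200, 0.1
X = rng.normal(size=(n, p))

# Random Fourier features for the Gaussian kernel K(x, x') = exp(-||x - x'||^2 / 2):
# phi_hat(x) = sqrt(2/m) * cos(W x + b), with W_ij ~ N(0, 1) and b_j ~ Uniform[0, 2*pi].
W = rng.normal(size=(p, m))
b = rng.uniform(0, 2 * np.pi, size=m)

def rff(A):
    return np.sqrt(2.0 / m) * np.cos(A @ W + b)

Phi = rff(X)                                    # (n, m) explicit approximate features
A_reg = Phi.T @ Phi + lam * np.eye(m)           # (m, m): cost O(n m^2), not O(n^3)

def approx_kernel_leverage(x):
    phi_x = rff(x[None, :]).ravel()
    return float(phi_x @ np.linalg.solve(A_reg, phi_x))

print(approx_kernel_leverage(np.zeros(p)))      # dense region: low leverage
print(approx_kernel_leverage(np.full(p, 4.0)))  # sparse region: higher leverage
```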

Neural Network Leverage

A neural network $f_\theta(x)$ can be decomposed into a feature extractor and a final linear layer. Write:

$$f_\theta(x) = W^T \phi_\theta(x) + b$$

where $\phi_\theta(x) \in \mathbb{R}^d$ is the output of the last hidden layer (the "learned representation"), $W \in \mathbb{R}^{d \times 1}$ is the weight matrix of the final linear layer, and $b$ is the bias. After training, $\phi_\theta$ is fixed, and the network's predictions are linear in the learned features.

The last-layer leverage of a training point $x_i$ is:

$$h_i^{\text{NN}} = \phi_\theta(x_i)^T(\Phi^T\Phi + \lambda I)^{-1}\phi_\theta(x_i)$$

where $\Phi \in \mathbb{R}^{n \times d}$ has rows $\phi_\theta(x_1)^T, \ldots, \phi_\theta(x_n)^T$. For a new point $x$:

$$h^{\text{NN}}(x) = \phi_\theta(x)^T(\Phi^T\Phi + \lambda I)^{-1}\phi_\theta(x)$$

The regularization $\lambda > 0$ is essential here. The hidden layer dimension $d$ may be larger than $n$ (a common case for overparameterized networks), making $\Phi^T\Phi$ singular. Even when $d < n$, numerical stability benefits from small positive $\lambda$.

[Diagram — Input $x$ → Hidden Layer 1 → Hidden Layer 2 → Last Hidden Layer → Output $\hat{y}$. The last hidden layer's activations are extracted as the features $\phi(x)$, which feed the leverage computation $h(x) = \phi(x)^T(\Phi^T\Phi + \lambda I)^{-1}\phi(x)$.]

Neural network leverage uses the learned representation from the last hidden layer as the feature map. The leverage computation is a standard matrix operation on these extracted features.

What does neural network leverage capture? Unlike input-space leverage, which treats all features equally, last-layer leverage captures the geometry the network has learned from the data. If the network has learned to cluster similar inputs together in its representation space, then a point far from any cluster in representation space will have high leverage — even if it looks ordinary in input space. Conversely, a point that looks unusual in input space but maps to a dense region of representation space will have low leverage.
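
A minimal PyTorch sketch of last-layer leverage (the module names `feature_extractor` and `head` are mine, and the untrained network is a stand-in — in practice you would train the full model on your data before extracting features):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, p, d, lam = 200, 10, 64, 1e-3

# A small regression MLP; in practice it would be trained on (X, Y) before this step.
feature_extractor = nn.Sequential(
    nn.Linear(p, 64), nn.ReLU(),
    nn.Linear(64, d), nn.ReLU(),       # output of this block is phi_theta(x)
)
head = nn.Linear(d, 1)                  # final linear layer W^T phi(x) + b (used for predictions, not for leverage)

X_train = torch.randn(n, p)
with torch.no_grad():
    Phi = feature_extractor(X_train)                 # (n, d) learned features
    A = Phi.T @ Phi + lam * torch.eye(d)             # regularized Gram matrix

def last_layer_leverage(x):
    """h(x) = phi(x)^T (Phi^T Phi + lam I)^{-1} phi(x) for a single input x of shape (p,)."""
    with torch.no_grad():
        phi_x = feature_extractor(x.unsqueeze(0)).squeeze(0)
        return float(phi_x @ torch.linalg.solve(A, phi_x))

print(last_layer_leverage(X_train[0]))          # a training point
print(last_layer_leverage(3 * torch.ones(p)))   # an input outside the training range
```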

Connection to Laplace approximation: The Laplace approximation to Bayesian inference approximates the posterior of the last-layer weights as Gaussian with covariance proportional to $(\Phi^T\Phi + \lambda I)^{-1}$. The prediction uncertainty at $x$ is then proportional to $\phi_\theta(x)^T(\Phi^T\Phi + \lambda I)^{-1}\phi_\theta(x)$ — which is precisely the last-layer leverage. This connection has been formalized in the CLAPS framework (Fong et al., 2025), which uses last-layer leverage scores to construct prediction intervals for neural networks.

Practical considerations:

  • Which layer? The last hidden layer is the standard choice because the final linear layer makes the connection to leverage exact. Using earlier layers is possible but less well-justified theoretically.
  • Dependence on training: The features $\phi_\theta(x)$ depend on the training run (initialization, optimization trajectory, stopping point). Different runs may produce different leverage scores. Averaging over multiple runs or using ensembles can stabilize the estimates.
  • Scaling: Computing $\Phi^T\Phi \in \mathbb{R}^{d \times d}$ and inverting it costs $O(nd^2 + d^3)$. For large $d$, low-rank or sketching approximations may be necessary. For each new test point, computing $h^{\text{NN}}(x)$ requires a forward pass through the network (to get $\phi_\theta(x)$) plus a matrix-vector product (cost $O(d^2)$).

Unified View

The following table summarizes the three extensions alongside the classical formula.

| Method | Feature map $\phi(x)$ | Regularization | Cost | When to use |
|---|---|---|---|---|
| Classical OLS | Identity: $x$ | $\lambda = 0$ | $O(np^2)$ | $p < n$, well-conditioned $X$ |
| Ridge | Identity: $x$ | $\lambda > 0$ | $O(np^2)$ or $O(n^2p)$ | $p \geq n$, or multicollinearity |
| Kernel | RKHS map: $\phi_K(x)$ | $\lambda > 0$ | $O(n^3)$ exact; $O(nm^2)$ approx. | Nonlinear structure, kernel methods |
| Neural Net | Last layer: $\phi_\theta(x)$ | $\lambda > 0$ | $O(nd^2 + d^3)$ | Deep learning, learned features |

Practical Guidance

If you are fitting a linear or ridge regression model, use ridge leverage with whatever $\lambda$ you used for the regression. If you are using a kernel method (Gaussian process, kernel SVM, kernel ridge regression), use kernel leverage with the same kernel and $\lambda$. If you are using a neural network, extract the last-layer features after training and compute last-layer leverage. In all cases, the leverage scores tell you where the model's predictions are most uncertain due to the geometry of the data.

Advanced

Spectral Properties and the NTK Connection

Spectral Decomposition of Ridge Leverage

The spectral decomposition provides the deepest view into what ridge leverage measures. Let the singular value decomposition (SVD) of the design matrix be $X = U S V^T$, where $U \in \mathbb{R}^{n \times r}$, $S = \text{diag}(s_1, \ldots, s_r)$, $V \in \mathbb{R}^{p \times r}$, and $r = \text{rank}(X)$. Then:

$$X^TX = V S^2 V^T = \sum_{j=1}^r s_j^2 v_j v_j^T$$

and the ridge leverage at a point $x$ decomposes as:

$$h^\lambda(x) = x^T(X^TX + \lambda I)^{-1}x = \sum_{j=1}^r \frac{(v_j^T x)^2}{s_j^2 + \lambda} + \frac{1}{\lambda}\left(\|x\|^2 - \sum_{j=1}^r (v_j^T x)^2\right)$$

The first sum runs over the $r$ directions that the data spans. Each direction $v_j$ contributes $(v_j^T x)^2$ — the squared projection of $x$ onto that direction — divided by $s_j^2 + \lambda$. Directions with large singular values (strong signal) are heavily down-weighted, because the data pins the model down firmly there; directions with small singular values (weak signal) are down-weighted only by $\lambda$ and therefore contribute more. The second term captures the component of $x$ in the null space of $X$ (directions the training data does not span at all), which receives the largest weight of all, $1/\lambda$.

When $p \leq n$ and $X$ has full column rank ($r = p$), the null-space term vanishes and we get the cleaner formula:

$$h^\lambda(x) = \sum_{j=1}^p \frac{(v_j^T x)^2}{s_j^2 + \lambda}$$

This formula is revealing. It says that ridge leverage is a spectrally weighted norm: the norm of $x$ measured in a coordinate system defined by the principal components of the data, with each component inversely scaled by how well-determined that direction is (relative to the regularization strength $\lambda$). Points that load heavily on poorly-determined directions have higher leverage than points that load on well-determined directions, because the model must extrapolate along directions it has barely seen. This is exactly the "how unusual is this point?" reading of leverage, expressed in spectral coordinates.
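
The decomposition, and the effective-dimension identity discussed next, are easy to verify numerically; here is a short numpy check (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 30, 8, 0.5
X = rng.normal(size=(n, p))
x = rng.normal(size=p)

# Direct ridge leverage
h_direct = float(x @ np.linalg.solve(X.T @ X + lam * np.eye(p), x))

# Spectral form: sum_j (v_j' x)^2 / (s_j^2 + lam), plus a null-space term
# (the null-space term is ~0 here because X has full column rank)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
proj = Vt @ x                                  # coordinates of x in the right singular basis
h_spectral = np.sum(proj**2 / (s**2 + lam)) + (x @ x - np.sum(proj**2)) / lam
print(np.isclose(h_direct, h_spectral))        # True

# The training leverages sum to the effective dimension p_lambda
H_diag = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X + lam * np.eye(p), X.T))
print(np.isclose(H_diag.sum(), np.sum(s**2 / (s**2 + lam))))   # True
```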

Effective Dimension and the Bias-Variance Tradeoff

The effective dimension (or effective degrees of freedom) is:

$$p_\lambda = \text{tr}(H^\lambda) = \sum_{j=1}^r \frac{s_j^2}{s_j^2 + \lambda}$$

This quantity has a natural interpretation. Each term $s_j^2/(s_j^2 + \lambda)$ is the "fraction of freedom" used in the $j$-th direction. When $\lambda = 0$, each fraction is 1 and $p_\lambda = r$ (the rank of $X$). When $\lambda$ is large, only the top few directions contribute substantially. The effective dimension is a smooth, continuous version of model complexity — more nuanced than counting parameters.
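
A few lines of numpy make this continuous complexity knob concrete (the grid of $\lambda$ values is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 20))
s = np.linalg.svd(X, compute_uv=False)

# Effective dimension p_lambda = sum_j s_j^2 / (s_j^2 + lambda), swept over lambda
for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    p_lam = np.sum(s**2 / (s**2 + lam))
    print(f"lambda = {lam:7.1f}   p_lambda = {p_lam:6.2f}")
# lambda = 0 gives p_lambda = rank(X) = 20; larger lambda drives p_lambda smoothly toward 0.
```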

The effective dimension controls the bias-variance tradeoff in ridge regression. The integrated prediction variance scales as $\sigma^2 p_\lambda / n$: more effective parameters means more variance. The integrated squared bias scales as $\lambda^2 \sum_j (v_j^T\beta^*)^2 / (s_j^2 + \lambda)^2$: more regularization means more bias. The optimal $\lambda$ balances these two forces, and $p_\lambda$ quantifies the model complexity at that balance point.

A particularly elegant result connects $p_\lambda$ to generalization. Under certain random matrix assumptions (e.g., $X$ has i.i.d. sub-Gaussian rows), El Karoui (2013) and Dobriban & Wager (2018) showed that the out-of-sample prediction risk of ridge regression converges (as $n, p \to \infty$ with $p/n \to \gamma$) to a quantity determined entirely by $\gamma$, $\lambda$, and the spectrum of the population covariance. The leverage scores and effective dimension are the finite-sample proxies for these asymptotic quantities.

The Neural Tangent Kernel Connection

The connection between neural network leverage and kernel leverage runs deeper than analogy. In the neural tangent kernel (NTK) regime (Jacot, Gabriel, & Hongler, 2018), a neural network trained by gradient descent behaves as a linear model in its features — but the relevant features are not the last-layer activations; they are the gradients of the network output with respect to the parameters.

Specifically, consider a network $f_\theta(x)$ with parameters $\theta \in \mathbb{R}^P$ (total number of parameters). The NTK feature map at initialization $\theta_0$ is:

$$\psi_{\theta_0}(x) = \nabla_\theta f_\theta(x) \big|_{\theta = \theta_0} \in \mathbb{R}^P$$

The NTK is $\Theta(x, x') = \langle \psi_{\theta_0}(x), \psi_{\theta_0}(x') \rangle = \nabla_\theta f(x)^T \nabla_\theta f(x')$. In the infinite-width limit, Jacot et al. showed that the NTK converges to a deterministic kernel $\Theta^*$ that remains constant during training. In this regime, the network is exactly a kernel machine, and the leverage scores are the kernel leverage scores of $\Theta^*$.

This gives us two notions of leverage for neural networks:

  1. Last-layer leverage: $h^{\text{NN}}(x) = \phi_\theta(x)^T(\Phi^T\Phi + \lambda I)^{-1}\phi_\theta(x)$, which treats the last-layer features as fixed after training and computes leverage in their span.
  2. NTK leverage: $h^{\text{NTK}}(x) = \psi(x)^T(\Psi^T\Psi + \lambda I)^{-1}\psi(x)$, where $\Psi \in \mathbb{R}^{n \times P}$ has rows $\psi(x_1)^T, \ldots, \psi(x_n)^T$. This uses the gradient features and captures the sensitivity of the output to all parameters.

In the NTK (lazy training) regime where features barely change during training, these two quantities are related but not identical — last-layer leverage uses $d$-dimensional features, while NTK leverage uses $P$-dimensional features (where $P \gg d$ is the total parameter count). In the rich training (feature-learning) regime, features adapt significantly, and last-layer leverage captures the geometry of the learned representation rather than the initialization. Empirically, last-layer leverage tends to be more informative for practical neural networks, which operate in the rich regime.
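
For completeness, here is a rough PyTorch sketch of NTK leverage for a tiny untrained network (a stand-in for a trained one; all names and constants are mine). It builds the gradient features $\psi(x)$ explicitly and then uses the same Woodbury-style identity as in the kernel section to avoid inverting the $P \times P$ matrix:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, p, lam = 20, 4, 1e-2
net = nn.Sequential(nn.Linear(p, 32), nn.Tanh(), nn.Linear(32, 1))   # stand-in network
params = list(net.parameters())
X_train = torch.randn(n, p)

def ntk_features(x):
    """psi(x): gradient of the scalar output with respect to all parameters, flattened."""
    out = net(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, params)
    return torch.cat([g.reshape(-1) for g in grads]).detach()

Psi = torch.stack([ntk_features(x) for x in X_train])   # (n, P) gradient features
Theta = Psi @ Psi.T                                     # empirical NTK Gram matrix (n, n)

def ntk_leverage(x):
    """psi(x)^T (Psi^T Psi + lam I)^{-1} psi(x), computed in kernel form to avoid a P x P inverse."""
    psi_x = ntk_features(x)
    k_x = Psi @ psi_x
    k_xx = psi_x @ psi_x
    return float((k_xx - k_x @ torch.linalg.solve(Theta + lam * torch.eye(n), k_x)) / lam)

print(ntk_leverage(X_train[0]))          # a training point
print(ntk_leverage(3 * torch.ones(p)))   # an input outside the training range
```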

Double Descent and Leverage

One of the most striking phenomena in modern machine learning is double descent: as the number of parameters $p$ increases past the interpolation threshold $p = n$, the test error first spikes and then decreases again. The spike occurs exactly at the interpolation threshold, where the model has just enough parameters to fit the training data perfectly.

Leverage scores are intimately connected to this phenomenon. At the interpolation threshold, $X^TX$ (or its regularized version with $\lambda \to 0$) approaches singularity. The training leverage scores approach 1, meaning individual training points exert maximal influence on the model, and for new points the leverage $h(x) = x^T(X^TX)^{-1}x$ — which is not bounded by 1 — blows up along the barely-explored directions with tiny singular values. The variance of predictions explodes accordingly, since $\text{Var}(\hat{y}(x)) \propto 1 + h(x)$.

Ridge regularization smooths this transition. With $\lambda > 0$, the leverage scores at the interpolation threshold are bounded away from 1, and the variance spike is dampened. The effective dimension $p_\lambda$ grows smoothly through $p = n$ rather than exhibiting a singularity. This is one of the practical arguments for always using some regularization: it prevents the leverage-driven variance explosion at the interpolation threshold.
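
A small simulation makes the leverage spike visible (the dimensions and the $\lambda$ values are arbitrary choices of mine): average test-point leverage with $\lambda \approx 0$ blows up near $p = n$, while a modest ridge penalty removes the spike.

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_test = 50, 200

for p in [10, 30, 45, 50, 55, 100, 200]:
    X = rng.normal(size=(n, p))
    X_test = rng.normal(size=(n_test, p))
    G_pinv = np.linalg.pinv(X.T @ X)                     # "lambda -> 0" (minimum-norm) case
    G_ridge = np.linalg.inv(X.T @ X + 1.0 * np.eye(p))   # ridge with lambda = 1
    h0 = np.mean(np.einsum("ij,jk,ik->i", X_test, G_pinv, X_test))
    h1 = np.mean(np.einsum("ij,jk,ik->i", X_test, G_ridge, X_test))
    print(f"p = {p:3d}   mean h, lam~0: {h0:12.2f}   mean h, lam=1: {h1:8.2f}")
# The lam ~ 0 column spikes near p = n = 50 and falls again for large p; the ridge
# column grows smoothly with the unspanned (null-space) dimension but shows no spike.
```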

[Diagram — Underparameterized ($p < n$): $X^TX$ invertible, variance well-controlled. Interpolation threshold ($p \approx n$): $X^TX$ nearly singular, leverage spikes, variance explodes. Overparameterized ($p \gg n$): $X^TX$ singular, minimum-norm solution, variance decreases with more features. Ridge regularization ($\lambda > 0$): $X^TX + \lambda I$ always invertible, smooth transition with no spike.]

At the interpolation threshold ($p \approx n$), leverage scores spike and prediction variance explodes. Ridge regularization smooths the transition and prevents the variance blow-up.

Stability of Neural Network Leverage

A natural concern with neural network leverage is stability: do different training runs produce similar leverage scores? In the NTK regime (infinite width, lazy training), the answer is yes. The NTK converges to a deterministic kernel as width tends to infinity, so leverage scores computed from the NTK features are deterministic in the limit and vary by $O(1/\sqrt{\text{width}})$ at finite width.

For practical finite-width networks in the rich training regime, stability is not guaranteed. The learned features $\phi_\theta(x)$ depend on the random initialization and the stochastic optimization trajectory. However, empirical evidence suggests that leverage scores are more stable than individual features. While the features themselves may rotate or permute across runs (the representation is only identified up to invertible linear transformations), the leverage $h^{\text{NN}}(x) = \phi_\theta(x)^T(\Phi^T\Phi + \lambda I)^{-1}\phi_\theta(x)$ is invariant to orthogonal rotations of the feature space and is therefore more robust to the symmetries of the representation.

Practical recommendations:

  • Always use $\lambda > 0$. A common choice is $\lambda = 10^{-4} \cdot \text{tr}(\Phi^T\Phi) / d$, which scales the regularization to the average eigenvalue of the feature covariance (see the sketch after this list).
  • Check the diagnostic $\hat{\eta} = \text{std}(h) / \text{mean}(h)$. If $\hat{\eta}$ is very large (say, greater than 2), a few points dominate and the leverage scores may be unstable. If $\hat{\eta}$ is near zero, all points have similar leverage and the scores carry little information.
  • If stability is critical, compute leverage scores from an ensemble of networks and average. The averaged scores inherit the stability of the ensemble while preserving the geometric information of the learned representation.
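
As a convenience, the $\lambda$ heuristic and the dispersion diagnostic from the list above can be packaged into one small function; the sketch below (the function name is mine) assumes the last-layer features have already been extracted into a numpy array.

```python
import numpy as np

def leverage_diagnostics(Phi):
    """Heuristic lambda and dispersion diagnostic for last-layer features Phi of shape (n, d)."""
    n, d = Phi.shape
    lam = 1e-4 * np.trace(Phi.T @ Phi) / d           # scale lambda to the average eigenvalue
    A = Phi.T @ Phi + lam * np.eye(d)
    h = np.einsum("ij,ji->i", Phi, np.linalg.solve(A, Phi.T))   # training leverages
    eta = h.std() / h.mean()                         # dispersion diagnostic eta
    return lam, h, eta

Phi = np.random.default_rng(7).normal(size=(500, 32))
lam, h, eta = leverage_diagnostics(Phi)
print(f"lambda = {lam:.4g}   sum(h) = {h.sum():.2f}   eta = {eta:.2f}")
```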

The Comparison in Full

To put everything together, here is a detailed comparison of the four leverage variants discussed in this post and its predecessors.

| Property | Classical OLS | Ridge | Kernel | Neural Net |
|---|---|---|---|---|
| Formula | $x^T(X^TX)^{-1}x$ | $x^T(X^TX + \lambda I)^{-1}x$ | $\frac{1}{\lambda}\left(K(x,x) - k_x^T(K + \lambda I)^{-1}k_x\right)$ | $\phi(x)^T(\Phi^T\Phi + \lambda I)^{-1}\phi(x)$ |
| Requires $n > p$ | Yes | No | No | No |
| Feature space | Input $\mathbb{R}^p$ | Input $\mathbb{R}^p$ | RKHS $\mathcal{H}$ | Learned $\mathbb{R}^d$ |
| Effective dim | $p$ (fixed) | $p_\lambda \leq p$ (tunable) | $\sum_j s_j^2/(s_j^2+\lambda)$ | $\sum_j s_j^2/(s_j^2+\lambda)$ |
| Captures nonlinearity | No | No | Yes (via kernel) | Yes (via learned features) |
| Computation | $O(np^2)$ | $O(np^2)$ or $O(n^2p)$ | $O(n^3)$ | Forward pass + $O(d^3)$ |
| Stability | Deterministic | Deterministic | Deterministic | Depends on training run |

The Deeper Point

Leverage is not a formula; it is a concept. The concept is: how much does a single point influence the model's prediction? In a linear model, this is captured by $(X^TX)^{-1}$. In a regularized model, by $(X^TX + \lambda I)^{-1}$. In a kernel model, by the inverse of the regularized kernel matrix. In a neural network, by the inverse of the regularized Gram matrix of learned features. The matrix changes; the meaning does not. Whenever you have a model and a dataset, you can ask "how influential is this point?" and the answer will involve some version of leverage. The extensions in this post ensure that this question has a well-defined, computable answer even in settings where the classical formula breaks down.

Further Reading

  • Alaoui, A. & Mahoney, M. W. (2015). Fast randomized kernel ridge regression with statistical guarantees. Advances in Neural Information Processing Systems.
  • Rahimi, A. & Recht, B. (2007). Random features for large-scale kernel machines. Advances in Neural Information Processing Systems.
  • Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems.
  • Drineas, P., Magdon-Ismail, M., Mahoney, M. W., & Woodruff, D. P. (2012). Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13, 3475–3506.
  • El Karoui, N. (2013). Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators. Annals of Statistics.
  • Dobriban, E. & Wager, S. (2018). High-dimensional asymptotics of prediction: Ridge regression and classification. Annals of Statistics, 46(1), 247–279.
  • Fong, E., Yao, J., Holmes, C. C., & Sherlock, C. (2025). Conformal Laplace approximation for prediction sets. Preprint.