Adaptive Conformal Methods: CQR, Studentized CP, and Their Trade-offs
Part 8 of a 10-part series on prediction intervals, conformal prediction, and leverage scores.
We have now established the problem (constant-width intervals, Part 3), the tool for diagnosis (leverage scores, Parts 4–5), the key structural insight (the sign flip, Part 6), and the remedy framework (variance stabilization, Part 7). Before we synthesize these ideas, we need to survey the existing landscape: what methods are already available for constructing adaptive conformal prediction intervals, and where do they succeed or fall short?
The Landscape of Adaptive Methods
People Have Already Tried to Fix This
The "same width everywhere" problem from Part 3 has not gone unnoticed. Over the past decade, researchers have proposed three main strategies for making conformal intervals adaptive. Each is clever, each has real merit, and each introduces its own trade-offs. Let us walk through them before asking whether there is a simpler path.
Strategy 1: Learn the Bounds Directly (CQR)
The first idea is straightforward. Instead of predicting a single number and then wrapping a fixed error bar around it, why not train two models — one to predict the lower bound of the interval and one to predict the upper bound? Then use the conformal correction to guarantee coverage.
Instead of saying "the house costs 420K plus or minus 50K" (symmetric, constant), you say "the house costs somewhere between 370K and 480K" (asymmetric, adaptive). For a luxury estate, the range might be 1.8M to 3.2M — much wider, reflecting genuine uncertainty. For a cookie-cutter suburban home, it might be 345K to 365K — tight, because the model knows this territory well.
This is Conformalized Quantile Regression (CQR). The conformal correction guarantees the coverage, while the learned bounds provide adaptation. The catch is that you now need to train two additional models (the quantile regressors), tune their hyperparameters, and hope they learn the right shape. In practice, this means choosing an appropriate quantile regression algorithm (e.g., gradient boosted trees, neural networks), selecting hyperparameters for each, and accepting that the quality of the adaptation is only as good as these models. When the quantile regressors fit well, CQR can produce intervals that are impressively well-calibrated to local uncertainty. When they fit poorly — which is harder to diagnose than it sounds — the intervals may be no better than vanilla CP, just more expensive to compute.
Strategy 2: Estimate the Difficulty (Studentized CP)
The second idea is also intuitive. Train a separate model that estimates how wrong your main model usually is at each point. Then scale the error bars accordingly: wider where the model struggles, narrower where it is reliable.
This is Studentized (Normalized) Conformal Prediction. The problem, as we saw in Part 6, is that this difficulty estimator is trained on training mistakes — and training mistakes have the sign flipped. The model looks most accurate at exactly the points where it is least reliable for future predictions. In other words, the scale estimator learns a distorted picture of where the model struggles: it underestimates difficulty at high-leverage points and overestimates it at low-leverage points. This is not a random error that washes out on average — it is a systematic bias tied to the geometry of the data. In practical terms, a user who trusts the Studentized intervals may end up most confident precisely where the intervals are least dependable.
Strategy 3: Only Look Nearby (Localized CP)
The third strategy sidesteps model training entirely. For each test point, find the "nearby" calibration points and compute the error bar using only those neighbors. If all your neighbors have large residuals, your error bar will be wide; if they have small residuals, it will be narrow.
This is Localized Conformal Prediction. It avoids the sign flip (no auxiliary models trained on training data), but "nearby" in high dimensions is unreliable. With 30 or 100 features, almost no calibration points are truly nearby, so you end up averaging over nearly everything anyway — or using so few neighbors that the coverage guarantee breaks down. This is the familiar curse of dimensionality, now appearing in the conformal prediction setting. The method is principled in low dimensions, but as the feature space grows, the kernel weights become nearly uniform and the localization that was supposed to drive adaptation effectively vanishes. In practice, this means the method either degenerates to something close to vanilla CP, or becomes sensitive to the bandwidth parameter in ways that are difficult to tune reliably.
2 quantile regressors"] --> C2["Compute
quantile scores"] C2 --> C3["Single quantile q̂"] C3 --> C4["Adaptive asymmetric
intervals"] end subgraph S["Studentized CP"] S1["Train model +
scale estimator"] --> S2["Compute
normalized scores"] S2 --> S3["Single quantile q̂"] S3 --> S4["Scaled symmetric
intervals"] end subgraph L["Localized CP"] L1["Train model"] --> L2["Compute residuals"] L2 --> L3["Weighted quantile
per test point"] L3 --> L4["Locally adaptive
intervals"] end style V fill:#f3f0ec,stroke:#a0522d,color:#1c1917 style C fill:#e3f2fd,stroke:#1565c0,color:#1c1917 style S fill:#e3f2fd,stroke:#1565c0,color:#1c1917 style L fill:#e3f2fd,stroke:#1565c0,color:#1c1917 style V4 fill:#fce4ec,stroke:#c62828,color:#1c1917 style C4 fill:#e8f5e9,stroke:#2e7d32,color:#1c1917 style S4 fill:#fff9c4,stroke:#f9a825,color:#1c1917 style L4 fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
Four parallel pipelines for conformal prediction. Each adaptive method adds complexity at a different stage.
The Key Trade-off
These methods sit on a spectrum. On one end is vanilla CP: zero additional complexity, zero adaptation. On the other end is full conditional coverage (the theoretical ideal): perfect adaptation, but requiring essentially infinite data. Every practical method lives somewhere in between, trading complexity for adaptation quality. For a practitioner, this means every choice involves a judgment call: how much additional modeling effort and computational overhead is justified by the expected improvement in interval quality? The answer depends on the application, the sample size, the dimensionality, and how heterogeneous the prediction difficulty is across the feature space.
[Figure: the adaptation-complexity spectrum, from less adaptation (simpler, faster, fewer assumptions) to more adaptation (complex, slower, more assumptions): Vanilla CP (no extra models, constant width) → Studentized CP (1 extra model, sign flip risk) → CQR (2 extra models, depends on fit) → Localized CP (no extra models, O(n) per point).]
The adaptation-complexity spectrum. Moving right gains adaptation but at increasing cost.
Quick Comparison
| Method | Extra models needed | Speed | Sign flip problem? |
|---|---|---|---|
| Vanilla CP | None | Fastest | N/A (no adaptation) |
| CQR | 2 quantile regressors | Fast at test time | Partial (quantile regressors see attenuated residuals) |
| Studentized CP | 1 scale estimator | Fast at test time | Yes — learns $(1-h)$ when reality is $(1+h)$ |
| Localized CP | None | Slow ($O(n_2)$ per test point) | No |
The Gap
Every existing adaptive method either trains extra models (CQR, Studentized) or adds per-test-point computational cost (Localized). None uses the geometry of the data, the structure of the design matrix, directly. Yet Parts 4–6 showed that leverage scores already encode exactly the information we need: where predictions are easy, where they are hard, and by how much. And leverage scores are already computed, or cheaply computable, as a byproduct of fitting the regression. The information is sitting there, unused.
So the natural question: is there a simpler path, one that uses what the data already provides, at essentially zero additional cost?
Algorithms and Analysis
Conformalized Quantile Regression (CQR)
Reference: Romano, Patterson, and Candes, NeurIPS 2019.
CQR replaces the point prediction plus constant interval with learned quantile estimates that naturally adapt to the local difficulty of the problem.
Step 1: Train quantile regressors. Split the data into a training set $D_1$ and a calibration set $D_2$. On $D_1$, train two quantile regressors: $\hat{Q}_{\text{lo}}(x)$ targeting the $\alpha/2$ quantile and $\hat{Q}_{\text{hi}}(x)$ targeting the $1-\alpha/2$ quantile of $Y \mid X = x$.
Step 2: Compute conformity scores. On the calibration set $D_2$, compute:
$$S_i = \max\!\left(\hat{Q}_{\text{lo}}(X_i) - Y_i, \;\; Y_i - \hat{Q}_{\text{hi}}(X_i)\right)$$

This score measures how far the true response falls outside the predicted quantile range. If the quantile estimates are perfect, most scores are negative (meaning $Y$ falls within the predicted range).
Step 3: Form the interval. Compute $\hat{q}$ as the $\lceil(1-\alpha)(|D_2|+1)\rceil$-th smallest score (equivalently, the $(1-\alpha)(1+1/|D_2|)$ empirical quantile of the scores). The prediction interval is:
$$\hat{C}(x) = \left[\hat{Q}_{\text{lo}}(x) - \hat{q}, \;\; \hat{Q}_{\text{hi}}(x) + \hat{q}\right]$$

The width at $x$ is $\hat{Q}_{\text{hi}}(x) - \hat{Q}_{\text{lo}}(x) + 2\hat{q}$, which varies with $x$ through the learned quantile gap. The conformal correction $\hat{q}$ preserves marginal coverage regardless of quantile regression quality.
CQR pipeline: two quantile regressors produce an adaptive interval, corrected by a conformal quantile.
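For concreteness, here is a minimal sketch of the three steps, using scikit-learn's `GradientBoostingRegressor` with the quantile loss as an illustrative choice of quantile regressor; the data splits and hyperparameters are placeholders, not a prescribed setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def cqr_interval(X_train, y_train, X_calib, y_calib, X_test, alpha=0.1):
    """Split-conformal CQR sketch: train two quantile regressors on D1,
    conformalize on D2, return lower/upper interval bounds for X_test."""
    # Step 1: quantile regressors targeting alpha/2 and 1 - alpha/2.
    q_lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2)
    q_hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2)
    q_lo.fit(X_train, y_train)
    q_hi.fit(X_train, y_train)

    # Step 2: conformity scores on the calibration set.
    lo, hi = q_lo.predict(X_calib), q_hi.predict(X_calib)
    scores = np.maximum(lo - y_calib, y_calib - hi)

    # Step 3: conformal quantile -- the ceil((1-alpha)(n2+1))-th smallest score
    # (clamped to n2 for very small calibration sets).
    n2 = len(y_calib)
    k = int(np.ceil((1 - alpha) * (n2 + 1)))
    q_hat = np.sort(scores)[min(k, n2) - 1]

    # Adaptive interval: learned bounds widened by q_hat on both sides.
    return q_lo.predict(X_test) - q_hat, q_hi.predict(X_test) + q_hat
```

Any quantile regression model can be substituted for the gradient boosting regressors; the conformal correction in Step 3 is unchanged.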
Studentized (Normalized) Conformal Prediction
Reference: Papadopoulos et al., 2002; Lei et al., 2018.
Studentized CP normalizes residuals by an estimated standard deviation, following the classical idea of studentization.
Step 1: Train the model and a scale estimator. On $D_1$, train a point prediction model $\hat{f}$ and a scale estimator $\hat{\sigma}(x)$. The scale estimator is typically fit to the absolute training residuals: $\{(X_i, |Y_i - \hat{f}(X_i)|)\}$ for $i \in D_1$.
Step 2: Compute normalized scores. On $D_2$:
$$S_i = \frac{|Y_i - \hat{f}(X_i)|}{\hat{\sigma}(X_i)}$$

Step 3: Form the interval.
$$\hat{C}(x) = \left[\hat{f}(x) - \hat{q} \cdot \hat{\sigma}(x), \;\; \hat{f}(x) + \hat{q} \cdot \hat{\sigma}(x)\right]$$

The width at $x$ is $2\hat{q} \cdot \hat{\sigma}(x)$: wider where $\hat{\sigma}$ predicts large residuals, narrower where it predicts small ones.
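A minimal sketch of the studentized pipeline, with random forests standing in for both $\hat{f}$ and $\hat{\sigma}$ (an illustrative choice; the method does not prescribe either model):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def studentized_cp(X_train, y_train, X_calib, y_calib, X_test, alpha=0.1):
    """Normalized (studentized) split CP sketch: a point model plus a
    scale estimator fit to absolute training residuals."""
    # Step 1: point model and scale estimator, both trained on D1.
    f_hat = RandomForestRegressor().fit(X_train, y_train)
    abs_resid = np.abs(y_train - f_hat.predict(X_train))
    sigma_hat = RandomForestRegressor().fit(X_train, abs_resid)

    # Step 2: normalized scores on the calibration set D2.
    eps = 1e-8  # guard against a zero scale estimate
    scores = np.abs(y_calib - f_hat.predict(X_calib)) / (sigma_hat.predict(X_calib) + eps)

    # Step 3: conformal quantile and scaled symmetric intervals.
    n2 = len(y_calib)
    k = int(np.ceil((1 - alpha) * (n2 + 1)))
    q_hat = np.sort(scores)[min(k, n2) - 1]
    center = f_hat.predict(X_test)
    half_width = q_hat * (sigma_hat.predict(X_test) + eps)
    return center - half_width, center + half_width
```

Note that `sigma_hat` is fit to training residuals; that is exactly where the sign flip, discussed next, enters.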
The Sign Flip in Detail
Here is the central issue. The scale estimator $\hat{\sigma}$ is trained on training-set absolute residuals $|Y_i - \hat{f}(X_i)|$. From Part 6, we know:
- Training residuals: $\text{Var}(Y_i - \hat{f}(X_i) \mid X_i) = \sigma^2(1 - h_i)$. The hat matrix projects training noise away from the residual.
- Test/calibration residuals: $\text{Var}(Y_{\text{new}} - \hat{f}(x) \mid x) = \sigma^2(1 + h(x))$. The estimation error adds to the noise.
So $\hat{\sigma}$ learns $\sigma\sqrt{1-h}$, but the interval needs to account for $\sigma\sqrt{1+h}$. At high-leverage points ($h$ large), the estimator predicts small variance when the true prediction variance is large. The adaptation is systematically biased, and the bias grows with leverage — precisely at the points where accurate uncertainty quantification matters most.
[Figure: training residuals have Var = σ²(1 − h), so the scale estimator σ̂ learns σ√(1 − h); the normalized score S = |residual| / σ̂ then uses a σ̂ that is small at high-leverage points (where training residuals were small), producing narrow intervals exactly where test-time reality, Var = σ²(1 + h), requires wide ones. That is the sign flip: (1 − h) vs (1 + h).]
The sign flip in Studentized CP: the scale estimator learns the wrong variance function, leading to systematically incorrect intervals at high-leverage points.
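The flip is easy to verify numerically. The sketch below fixes a small design with one exaggerated high-leverage row, repeatedly refits OLS, and compares the spread of training residuals against the spread of errors on fresh responses drawn at the same points; all sizes and the noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, n_rep = 60, 5, 1.0, 10_000

# Fixed design with one deliberately extreme (high-leverage) row.
X = rng.normal(size=(n, p))
X[0] *= 4.0
beta = rng.normal(size=p)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
h = np.diag(H)                          # leverage scores

train_res = np.empty((n_rep, n))
fresh_res = np.empty((n_rep, n))
for r in range(n_rep):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    train_res[r] = y - X @ beta_hat                   # residuals on training data
    y_fresh = X @ beta + sigma * rng.normal(size=n)   # new responses at the same x's
    fresh_res[r] = y_fresh - X @ beta_hat             # errors an interval must cover

i = int(np.argmax(h))  # the high-leverage point
print(f"leverage h = {h[i]:.2f}")
print(f"training residual sd: {train_res[:, i].std():.3f}  vs  sigma*sqrt(1-h) = {sigma*np.sqrt(1-h[i]):.3f}")
print(f"fresh residual sd:    {fresh_res[:, i].std():.3f}  vs  sigma*sqrt(1+h) = {sigma*np.sqrt(1+h[i]):.3f}")
```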
Localized Conformal Prediction
Reference: Guan, 2023.
Localized CP re-weights the calibration set to focus on points similar to the test point, computing a local conformal quantile.
Kernel weights. For each test point $x$, assign weights to the calibration points:
$$w_i(x) = K\!\left(\frac{X_i - x}{\text{bandwidth}}\right)$$

where $K$ is a kernel function (Gaussian, Epanechnikov, etc.) and the bandwidth controls how "local" the computation is.
Weighted conformal quantile. Instead of taking the unweighted quantile of the calibration scores, take the $(1-\alpha)$-quantile of the scores under the kernel weights. Points similar to $x$ receive more influence; dissimilar points receive less.
Cost. For each test point, you must evaluate the kernel against every calibration point and compute a weighted quantile. This is $O(n_2)$ per test point, compared to $O(1)$ for vanilla CP, CQR, and Studentized CP (after the training phase). For a test set of 10,000 points and a calibration set of 5,000, this requires 50 million kernel evaluations.
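A sketch of the per-test-point computation follows, using a Gaussian kernel and one simple variant of the weighted quantile (the exact weighting scheme differs slightly across papers); the bandwidth is a placeholder.

```python
import numpy as np

def localized_quantile(x_test, X_calib, scores, alpha=0.1, bandwidth=1.0):
    """One weighted conformal quantile for a single test point (sketch)."""
    # Gaussian kernel weights in Euclidean distance: O(n2) work per call.
    d = np.linalg.norm(X_calib - x_test, axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    w = w / w.sum()

    # Weighted (1 - alpha) quantile of the calibration scores.
    order = np.argsort(scores)
    cum = np.cumsum(w[order])
    idx = np.searchsorted(cum, 1 - alpha)
    return scores[order[min(idx, len(scores) - 1)]]

# Usage (with scores = |y_calib - f_hat.predict(X_calib)|): one call per test
# point, so scoring a test set costs O(n_test * n2) kernel evaluations.
```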
Full Comparison
| Method | Auxiliary models | Hyperparameters | Cost per test point | Sign flip? | Adaptation mechanism |
|---|---|---|---|---|---|
| Vanilla CP | None | None | $O(1)$ | N/A | None (constant width) |
| CQR | 2 quantile regressors | Model hyperparams ($\times 2$) | $O(1)$ | Partial | Learned quantile gap |
| Studentized CP | 1 scale estimator | Model hyperparams ($\times 1$) | $O(1)$ | Yes | Estimated $\hat{\sigma}(x)$ |
| Localized CP | None | Kernel, bandwidth | $O(n_2)$ | No | Kernel-weighted quantile |
Three observations emerge from this table:
- Every adaptive method involves a trade-off. Either you train auxiliary models (CQR, Studentized), choose kernel parameters (Localized), or accept constant width (Vanilla). There is no existing method that adapts without introducing additional complexity.
- Methods that estimate variance from residuals inherit the sign flip. This includes Studentized CP directly and, partially, CQR, whose quantile regressors are trained on $D_1$ and therefore see attenuated residuals at high-leverage points.
- Computational costs vary widely. Vanilla CP and CQR/Studentized CP have $O(1)$ cost per test point after training. Localized CP has $O(n_2)$ cost per test point, which becomes prohibitive for large calibration or test sets.
per test point"] LCP4 --> LCP5["Width = 2q̂(x)"] end style CQR_P fill:#faf8f5,stroke:#1565c0,color:#1c1917 style SCP_P fill:#faf8f5,stroke:#f9a825,color:#1c1917 style LCP_P fill:#faf8f5,stroke:#2e7d32,color:#1c1917
Detailed pipelines for each adaptive method, highlighting where the key design choices differ.
Theoretical Properties and the Missing Piece
CQR: Theoretical Properties
CQR inherits the standard split conformal coverage guarantee: $\mathbb{P}(Y_{n+1} \in \hat{C}(X_{n+1})) \geq 1 - \alpha$. This holds for any quality of quantile regression — the coverage is guaranteed even if $\hat{Q}_{\text{lo}}$ and $\hat{Q}_{\text{hi}}$ are arbitrarily poor estimators.
However, the adaptation quality depends critically on how accurate the quantile estimates are. If $\hat{Q}_{\text{lo}}(x) \approx Q_{\alpha/2}(x)$ and $\hat{Q}_{\text{hi}}(x) \approx Q_{1-\alpha/2}(x)$, then CQR achieves approximate conditional coverage. In the opposite extreme, if both quantile regressors output constants (no adaptation), CQR degenerates to vanilla CP with an offset.
A subtlety: the conformity score $S_i = \max(\hat{Q}_{\text{lo}}(X_i) - Y_i, Y_i - \hat{Q}_{\text{hi}}(X_i))$ records only the worse of the two one-sided violations, and the single correction $\hat{q}$ is applied symmetrically. So even if one bound is well calibrated and the other is not, the conformal correction inflates both sides; basic CQR cannot exploit partial information about which side is poorly calibrated.
Studentized CP: The Score Distribution Problem
For Studentized CP in the linear model with homoscedastic noise, the normalized calibration score is:
$$S_i^{\text{stud}} = \frac{|Y_i - \hat{f}(X_i)|}{\hat{\sigma}(X_i)}$$

Now consider what $\hat{\sigma}$ has learned. It was trained on training absolute residuals, which have (conditional) mean proportional to $\sigma\sqrt{1 - h_i}$. If $\hat{\sigma}$ is a good estimator of the training residual scale, then $\hat{\sigma}(X_i) \approx c \cdot \sigma\sqrt{1 - h_i}$ for some constant $c$.
On the calibration set, the numerator $|Y_i - \hat{f}(X_i)|$ has standard deviation $\sigma\sqrt{1 + h_i}$ (calibration points are new data, not training data). So:
$$S_i^{\text{stud}} \approx \frac{\sigma\sqrt{1 + h_i}}{c \cdot \sigma\sqrt{1 - h_i}} \cdot |\eta_i| = \frac{1}{c}\sqrt{\frac{1 + h_i}{1 - h_i}} \cdot |\eta_i|$$

where $\eta_i$ is a standard noise variable. The factor $\sqrt{(1+h_i)/(1-h_i)}$ is not constant across calibration points: it depends on the leverage $h_i$. Therefore the studentized scores are not identically distributed. They are inflated at high-leverage points and deflated at low-leverage points.
The conformal quantile $\hat{q}$ is then a compromise between these heterogeneous score distributions. It overcovers at low-leverage points (where the scores were artificially deflated) and undercovers at high-leverage points (where the scores were artificially inflated). This is precisely the conditional coverage failure described in Part 3, reintroduced by the sign flip.
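The compromise can be seen directly by feeding the idealized scores $\frac{1}{c}\sqrt{(1+h)/(1-h)}\,|\eta|$ (with $c = 1$) into a single shared conformal quantile and measuring coverage separately at low- and high-leverage points; the two leverage values below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n2, n_test = 0.1, 5000, 200_000

# Calibration points: a mix of low- and high-leverage locations.
h_calib = rng.choice([0.02, 0.50], size=n2)
inflate = np.sqrt((1 + h_calib) / (1 - h_calib))
scores = inflate * np.abs(rng.normal(size=n2))   # idealized studentized scores (c = 1)

k = int(np.ceil((1 - alpha) * (n2 + 1)))
q_hat = np.sort(scores)[k - 1]                   # one shared conformal quantile

# Conditional coverage: a point with leverage h is covered when its own
# (inflated) score falls below the shared q_hat.
for h in (0.02, 0.50):
    s_test = np.sqrt((1 + h) / (1 - h)) * np.abs(rng.normal(size=n_test))
    print(f"h = {h:.2f}: coverage = {(s_test <= q_hat).mean():.3f}")
# Expected: above 0.90 at h = 0.02, below 0.90 at h = 0.50.
```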
Localized CP: Convergence and the Curse of Dimensionality
Under regularity conditions (smooth conditional distribution of $Y \mid X$, appropriate bandwidth decay as $n_2 \to \infty$), the weighted quantile in localized CP converges to the conditional quantile $Q_{1-\alpha}(|Y - \hat{f}(X)| \mid X = x)$. This provides approximate conditional coverage.
The effective sample size at a test point $x$ is:
$$n_{\text{eff}}(x) = \frac{\left(\sum_i w_i(x)\right)^2}{\sum_i w_i(x)^2}$$

In dimension $p$, if the bandwidth scales as $b \sim n_2^{-1/(4+p)}$ (optimal for nonparametric estimation), the effective sample size scales as $n_{\text{eff}} \sim n_2^{4/(4+p)}$. For $p = 30$, this is $n_{\text{eff}} \sim n_2^{4/34} \approx n_2^{0.12}$. With $n_2 = 5000$, the effective sample size is approximately $5000^{4/34} \approx 2.7$: barely enough to define a quantile, let alone a reliable one.
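The scaling argument is easy to reproduce; the sketch below simply evaluates $n_2^{4/(4+p)}$ for the calibration size used in the text and a few dimensions.

```python
n2 = 5000
for p in (1, 5, 10, 30, 100):
    # Optimal nonparametric bandwidth b ~ n2^(-1/(4+p)) gives an
    # effective sample size n_eff ~ n2 * b^p = n2^(4/(4+p)).
    n_eff = n2 ** (4 / (4 + p))
    print(f"p = {p:>3}: n_eff ~ {n_eff:.1f}")
# p = 1: ~910, p = 30: ~2.7, p = 100: ~1.4.
```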
Furthermore, the kernel in Localized CP operates in the original feature space, measuring Euclidean (or Mahalanobis) distance. But in linear regression, the prediction difficulty depends on leverage $h(x) = x^\top(\mathbf{X}^\top\mathbf{X})^{-1}x$ — a specific quadratic form in the features, not the raw distance. A kernel that localizes in Euclidean distance may assign similar weights to two calibration points that differ substantially in leverage (if they happen to be equidistant from $x$ but in different directions relative to the training data covariance).
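This mismatch is easy to exhibit numerically: with a strongly correlated training design (an arbitrary illustration), two calibration points at the same Euclidean distance from a test point at the origin receive the same kernel weight, yet their leverages differ dramatically.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 2

# Training design with strongly correlated features (illustrative).
cov = np.array([[1.0, 0.95], [0.95, 1.0]])
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
XtX_inv = np.linalg.inv(X.T @ X)

def leverage(x):
    """h(x) = x^T (X^T X)^{-1} x for a new point x."""
    return float(x @ XtX_inv @ x)

# Two points equidistant from the origin: one along the data's
# major axis, one along its minor axis.
x_major = 3.0 * np.array([1.0, 1.0]) / np.sqrt(2)
x_minor = 3.0 * np.array([1.0, -1.0]) / np.sqrt(2)

print(f"|x_major| = {np.linalg.norm(x_major):.2f}, h = {leverage(x_major):.4f}")
print(f"|x_minor| = {np.linalg.norm(x_minor):.2f}, h = {leverage(x_minor):.4f}")
# Same Euclidean distance, so the same kernel weight, but leverage
# (and hence prediction difficulty) differs by more than an order of magnitude.
```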
Formal Comparison
| | CQR | Studentized CP | Localized CP |
|---|---|---|---|
| Auxiliary models | 2 quantile regressors | 1 scale estimator | None |
| Hyperparameters | QR model hyperparams | Scale model hyperparams | Kernel, bandwidth |
| Cost per test point | $O(1)$ | $O(1)$ | $O(n_2)$ |
| Sign flip | Partial (QR trained on $D_1$) | Full ($\hat{\sigma}$ inverts $h$-dependence) | None |
| Adaptation source | Learned $\hat{Q}_{\text{hi}} - \hat{Q}_{\text{lo}}$ | Estimated $\hat{\sigma}(x)$ | Local calibration quantile |
| Score exchangeability | Exact (by construction) | Violated ($h$-dependent inflation) | Approximate (weighted) |
| High-$p$ behavior | Depends on QR model | Sign flip worsens ($h$ variation grows) | Effective sample size collapses |
An Open Question
To recap: every existing adaptive method either trains extra models (CQR, Studentized) or pays a per-test-point computational cost (Localized). Each has genuine merit, and the right choice depends on the application. But a natural question emerges: is there a simpler path? One that uses structural information the data already provides, without training auxiliary models or paying high per-test-point costs?
Parts 4–6 of this series showed that the design matrix geometry — specifically, leverage scores — directly controls prediction error variance. Whether and how that geometric information can be brought to bear on the problem of adaptive prediction intervals remains an interesting direction.
Further Reading
- Romano, Y., Patterson, E., & Candes, E. J. (2019). Conformalized quantile regression. NeurIPS.
- Papadopoulos, H., Proedrou, K., Vovk, V., & Gammerman, A. (2002). Inductive confidence machines for regression. ECML.
- Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). Distribution-free predictive inference for regression. JASA.
- Guan, L. (2023). Localized conformal prediction: A generalized inference framework for conformal prediction. Biometrika.