When Exchangeability Breaks

Part 13 of a series on prediction intervals, conformal prediction, and leverage scores.

This is part of the series From Predictions to Prediction Intervals.

Every guarantee has assumptions. For conformal prediction, the assumption is exchangeability: the calibration data and test data come from the same process, with no point being special. But real-world data frequently violates this. Time series have temporal structure. Medical models get deployed at hospitals different from where they were trained. Financial models face regime changes. When exchangeability breaks, the coverage guarantee evaporates — and you may not even notice. This post explores what goes wrong and what can be done about it.

Intuitive

When the Rules Change

The Invisible Assumption

Recall the core promise of conformal prediction: if you calibrate on data that "looks like" your test data, the prediction intervals you produce will have valid coverage. More precisely, the guarantee says that a new test point will fall inside the interval at least $1-\alpha$ of the time. This guarantee is distribution-free — it does not care whether your errors are Gaussian, heavy-tailed, or multimodal. The only thing it needs is exchangeability: the calibration scores and the test score must be drawn from the same process, with no ordering or identity mattering.

In a laboratory setting, exchangeability is easy to arrange. You randomly split your data, calibrate on one portion, and predict on the other. The split is random, so no point is systematically different from any other. Coverage holds. But laboratory settings are not where models live. Models live in the real world, and the real world has structure that exchangeability cannot accommodate.

Three Ways Exchangeability Fails

Scenario 1: Time series. Consider a weather forecasting model. You calibrate it on historical data from March and April, then deploy it to make predictions in July. The calibration residuals reflect spring weather patterns — moderate temperatures, variable precipitation. But July brings heat waves, drought conditions, and convective storms with entirely different error characteristics. Tomorrow's weather is not exchangeable with last month's weather. There is temporal structure: autocorrelation, seasonality, and trends. The calibration scores from spring are not a representative sample of the errors you will see in summer. Your 90% prediction intervals might only cover 60% of summer outcomes, and nothing in the conformal framework will warn you.

Scenario 2: Covariate shift. A medical AI is trained and calibrated on patient data from Hospital A — a large urban research hospital with a young, diverse patient population. The model is then deployed at Hospital B — a rural community hospital where patients are older, have more comorbidities, and present with later-stage conditions. The relationship between features and outcomes may be the same at both hospitals (the biology has not changed), but the distribution of features has shifted. Hospital B has far more patients in regions of feature space that were sparsely represented at Hospital A. The calibration quantile from Hospital A is based on a distribution of residuals that does not represent what the model will encounter at Hospital B. The intervals may be systematically too narrow for the sicker, older population that Hospital B serves — precisely the setting where reliable uncertainty quantification matters most.

Scenario 3: Concept drift. A model predicts housing prices based on features like square footage, neighborhood, and interest rates. The model is calibrated on data from 2018–2019 and deployed in 2020–2021. This time, it is not just the input distribution that changed — the very relationship between inputs and outputs has shifted. The pandemic altered the value of location, remote work changed demand for suburban homes, and interest rate volatility made the feature-outcome mapping unstable. Even if the covariate distribution were identical (it was not), the conditional distribution $P(Y \mid X)$ changed, and no amount of reweighting can fix that from calibration data alone.

```mermaid
flowchart TD
    ROOT["Exchangeability Violated"]
    ROOT --> TS["Time Series"]
    ROOT --> CS["Covariate Shift"]
    ROOT --> CD["Concept Drift"]
    TS --> TSF["Temporal autocorrelation\nSeasonality, trends"]
    CS --> CSF["P(X) changes\nP(Y|X) stays same"]
    CD --> CDF["P(Y|X) changes\nFeature-outcome map shifts"]
    TSF --> F1["Intervals calibrated on\npast may not cover future"]
    CSF --> F2["Intervals too narrow\nin underrepresented regions"]
    CDF --> F3["Calibration fundamentally\ninvalid"]
    style ROOT fill:#fce4ec,stroke:#c62828,color:#1c1917
    style F1 fill:#fff3e0,stroke:#e65100,color:#1c1917
    style F2 fill:#fff3e0,stroke:#e65100,color:#1c1917
    style F3 fill:#fff3e0,stroke:#e65100,color:#1c1917
```

Three distinct scenarios that violate exchangeability, each producing a different failure mode for conformal prediction.

The Weather Analogy

To make this concrete, think of calibrating a thermostat. You set it in summer: 72 degrees feels comfortable, the AC runs a certain number of hours, and you measure how well it performs. Now winter arrives. The thermostat still "works" — it turns on the heat when the temperature drops below 72 — but its calibration is completely wrong. The summer data told it that the house cools by 2 degrees per hour overnight; in winter, it cools by 8 degrees per hour. The thermostat thinks it needs to run for 20 minutes; it actually needs 80 minutes. The house ends up freezing.

Conformal prediction with stale calibration data is the same. The quantile computed from summer residuals reflects summer variability. Applying it to winter predictions gives intervals that are absurdly narrow — because summer was mild and predictable, but winter is volatile. The intervals have 90% written on the label, but the actual coverage might be 40%.

The Failure Mode You Cannot See

The most dangerous aspect of exchangeability failure is that you often cannot detect it from the inside. Conformal prediction does not come with a built-in alarm that rings when its assumptions are violated. The procedure still outputs intervals. They still have a width and a center. They still look like prediction intervals. Nothing in the output format distinguishes a valid interval from a useless one.

The failure can go in either direction. If the test distribution has larger errors than the calibration distribution, the intervals become too narrow — this is the dangerous case, because it gives false confidence. If the test distribution has smaller errors, the intervals become too wide — less dangerous, but wasteful. You are paying for uncertainty you do not have, making decisions that are overly cautious. And without monitoring actual outcomes against predictions, you cannot tell which case you are in. This is why exchangeability is not just a mathematical technicality. It is the load-bearing wall of the entire conformal guarantee.

Analogy

Using conformal prediction under broken exchangeability is like wearing a life jacket rated for a swimming pool while sailing in the North Atlantic. The label says it provides a certain level of safety. The testing was rigorous — under pool conditions. But the pool conditions are not the ocean conditions, and the label's assurance becomes meaningless the moment the environment changes. The jacket does not announce that it is no longer adequate. You only find out when you are in the water.

What Can Be Done?

The good news is that the machine learning and statistics communities have developed tools for exactly these situations. Two main strategies have emerged. The first, weighted conformal prediction, handles covariate shift by reweighting the calibration scores to account for the difference in feature distributions. The intuition is straightforward: if the test population has more patients in a certain region of feature space, then calibration scores from that region should count more when computing the quantile. The second, adaptive conformal inference, handles distribution shift that evolves over time. Instead of computing a single quantile and applying it forever, it continuously adjusts the quantile level based on recent performance — widening intervals when coverage has been too low, narrowing them when it has been too high. Both approaches relax the exchangeability assumption in controlled ways while preserving as much of the coverage guarantee as possible. The next two sections make these ideas precise.

Technical

Weighted Conformal Prediction and ACI

Weighted Conformal Prediction

The first principled fix addresses covariate shift: the setting where the distribution of features changes between calibration and test time, but the conditional distribution $P(Y \mid X)$ remains the same. This is precisely the hospital deployment scenario. The biology has not changed; only the patient demographics have.

The key idea, due to Tibshirani, Barber, Candès, and Ramdas (2019), is to replace the uniform quantile in standard conformal prediction with a weighted quantile. In standard (split) conformal prediction, each calibration score contributes equally to the empirical quantile. Under covariate shift, this is wrong: calibration points that are more "test-like" should contribute more. The importance weight for calibration point $i$ is the density ratio:

$$w_i = \frac{p_{\text{test}}(X_i)}{p_{\text{cal}}(X_i)}$$

where $p_{\text{test}}$ is the test covariate density and $p_{\text{cal}}$ is the calibration covariate density. Points in regions where the test distribution has more mass than the calibration distribution receive higher weight, and vice versa. The weighted conformal quantile is then defined as the smallest $q$ such that:

$$\frac{\sum_{i=1}^{n} w_i \, \mathbf{1}\{S_i \leq q\}}{\sum_{i=1}^{n} w_i + w_{n+1}} \geq 1 - \alpha$$

where $S_i$ are the conformal scores of the calibration points, $w_{n+1}$ is the weight assigned to the test point, and the normalization in the denominator ensures that the total probability sums correctly. When all weights are equal ($w_i = 1$), this reduces to the standard conformal quantile. When the weights correctly reflect the density ratio, the coverage guarantee extends to the test distribution.
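Here is a minimal NumPy sketch of this weighted quantile (the function name and interface are my own, not from the paper; following the construction above, the test point's weight is treated as a point mass at $+\infty$, so the quantile is conservative):

```python
import numpy as np

def weighted_conformal_quantile(scores, weights, w_test, alpha=0.1):
    """Smallest q such that the weighted fraction of calibration scores
    at or below q (with the test point's weight placed at +inf)
    reaches 1 - alpha."""
    order = np.argsort(scores)
    s, w = scores[order], weights[order]
    total = w.sum() + w_test
    cum = np.cumsum(w) / total          # weighted CDF at each sorted score
    idx = np.searchsorted(cum, 1 - alpha)
    if idx >= len(s):
        return np.inf                   # not enough weighted mass: infinite interval
    return s[idx]
```

With absolute-residual scores $S_i = |Y_i - \hat{\mu}(X_i)|$, the resulting interval at a test point $x$ is $\hat{\mu}(x) \pm q$. Setting all weights to 1 recovers the standard split-conformal quantile.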

The practical challenge is that you need to estimate the density ratio, and density ratio estimation is itself a hard statistical problem. In low dimensions, kernel density estimation works. In moderate dimensions, methods like KLIEP (Kullback-Leibler Importance Estimation Procedure) or uLSIF (unconstrained Least-Squares Importance Fitting) can be effective. In very high dimensions, the density ratio may be poorly estimated, and the weighted quantile may be unreliable. This limitation is fundamental: if the covariate shift is so severe that the calibration and test distributions have very little overlap, no reweighting scheme can rescue the situation. The support of the calibration distribution must cover the support of the test distribution.
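One practical route, sketched below, is to estimate the ratio with a probabilistic classifier: train it to distinguish test covariates (label 1) from calibration covariates (label 0), and then $p_{\text{test}}(x)/p_{\text{cal}}(x) \approx \frac{n_{\text{cal}}}{n_{\text{test}}} \cdot \frac{\hat{p}(1 \mid x)}{\hat{p}(0 \mid x)}$. This is a standard density-ratio trick rather than anything specific to the weighted-CP paper, and the clipping threshold below is an arbitrary stabilizer:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio_weights(X_cal, X_test, clip=20.0):
    """Estimate w(x) = p_test(x) / p_cal(x) from classifier odds."""
    X = np.vstack([X_cal, X_test])
    y = np.concatenate([np.zeros(len(X_cal)), np.ones(len(X_test))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p = clf.predict_proba(X_cal)[:, 1]          # P(test | x) on calibration points
    ratio = (p / (1 - p)) * (len(X_cal) / len(X_test))  # rescale by class sizes
    return np.clip(ratio, 0.0, clip)            # clip extreme weights for stability
```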

| Scenario | Exchangeability | What changes | Approach |
|---|---|---|---|
| Standard (i.i.d.) | Holds | Nothing | Standard conformal prediction |
| Covariate shift | Broken | $P(X)$ changes, $P(Y \mid X)$ fixed | Weighted conformal prediction |
| Temporal drift | Broken | Distribution evolves over time | Adaptive conformal inference (ACI) |
| Concept drift | Broken | $P(Y \mid X)$ changes | Re-calibration + ACI / no easy fix |

Adaptive Conformal Inference (ACI)

The second fix, due to Gibbs and Candès (2021), addresses a more challenging setting: distribution shift that evolves over time. In a time series or streaming setting, the distribution may change slowly or abruptly, and you do not know the density ratio. You cannot compute importance weights because you do not have access to the future distribution.

ACI takes a different approach entirely. Instead of trying to characterize the shift, it adapts to it in an online fashion. The core idea is elegantly simple: maintain a running quantile level $\alpha_t$ and update it at each time step based on whether the most recent prediction interval achieved coverage.

The update rule is:

$$\alpha_{t+1} = \alpha_t + \gamma \left( \alpha - \text{err}_t \right)$$

where $\text{err}_t = \mathbf{1}\{Y_t \notin \hat{C}_t(X_t)\}$ is the miscoverage indicator at time $t$, $\alpha$ is the target miscoverage rate (e.g., 0.1 for 90% coverage), and $\gamma > 0$ is a step size that controls how aggressively the method adapts.

The intuition is that of a thermostat. If the interval failed to cover the true value ($\text{err}_t = 1$), the update becomes $\alpha_{t+1} = \alpha_t + \gamma(\alpha - 1) = \alpha_t - \gamma(1 - \alpha)$, which decreases $\alpha_{t+1}$ — meaning the next quantile will be higher, producing a wider interval. If the interval did cover ($\text{err}_t = 0$), the update becomes $\alpha_{t+1} = \alpha_t + \gamma \alpha$, which increases $\alpha_{t+1}$ — meaning the next quantile will be lower, producing a narrower interval. Over time, this negative feedback loop drives the average miscoverage rate toward $\alpha$.
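A sketch of the full ACI loop, assuming absolute-residual scores and a trailing window for the quantile (both are my choices, one simple option among several; the update line is the rule from the formula above):

```python
import numpy as np

def aci_stream(residuals, alpha=0.1, gamma=0.01, window=500, warmup=20):
    """Run ACI over a stream of absolute residuals |y_t - yhat_t|.

    At each step: take the (1 - alpha_t)-quantile of recent residuals
    as the interval half-width, record coverage, then apply
    alpha_{t+1} = alpha_t + gamma * (alpha - err_t).
    """
    alpha_t = alpha
    history, errs = [], []
    for r in residuals:
        if len(history) >= warmup:
            level = float(np.clip(1.0 - alpha_t, 0.0, 1.0))  # alpha_t may leave (0, 1)
            q = np.quantile(history[-window:], level)
        else:
            q = np.inf                      # cover everything until calibrated
        err = 1.0 if r > q else 0.0         # miscoverage indicator err_t
        alpha_t += gamma * (alpha - err)    # the ACI update rule
        errs.append(err)
        history.append(r)
    return np.array(errs)
```

Feeding this a stream whose error scale shifts midway through shows the feedback loop in action: a burst of miscoverage after the shift, followed by re-widened intervals.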

```mermaid
flowchart LR
    A["Predict: build\ninterval at level alpha_t"] --> B["Observe: see\ntrue outcome Y_t"]
    B --> C{"Did interval\ncover Y_t?"}
    C -->|"No: err_t = 1"| D["Decrease alpha:\nwider intervals"]
    C -->|"Yes: err_t = 0"| E["Increase alpha:\nnarrower intervals"]
    D --> F["Update:\nalpha_{t+1} = alpha_t + gamma*(alpha - err_t)"]
    E --> F
    F --> A
    style C fill:#fff3e0,stroke:#e65100,color:#1c1917
    style D fill:#fce4ec,stroke:#c62828,color:#1c1917
    style E fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
```

The ACI feedback loop. The quantile level adjusts at each time step, acting like a thermostat that targets the desired coverage rate.

The remarkable property of ACI is its long-run coverage guarantee: regardless of how the distribution changes over time — arbitrarily, adversarially, with no structural assumptions whatsoever — the average coverage over time converges to $1 - \alpha$. More precisely:

$$\frac{1}{T} \sum_{t=1}^{T} \text{err}_t \to \alpha \quad \text{as } T \to \infty$$

This is a weaker guarantee than finite-sample coverage (which says that each individual interval has $1-\alpha$ probability of covering), but it is a guarantee under no assumptions at all on the data-generating process. The price of this universality is that coverage may be very uneven in the short run: there may be periods of severe undercoverage (during abrupt distribution shifts) followed by periods of overcoverage (as the method corrects). The step size $\gamma$ controls a tradeoff between adaptivity and stability: large $\gamma$ adapts quickly but oscillates more; small $\gamma$ is smoother but reacts slowly to shifts.

Choosing Between the Two

Weighted conformal prediction and ACI address different aspects of the exchangeability problem, and the choice depends on the setting.

Use weighted CP when you have a batch prediction problem with known or estimable covariate shift: you have calibration data from one distribution and test data from another, you can estimate the density ratio, and $P(Y \mid X)$ has not changed. The guarantee is finite-sample (like standard CP), conditioned on the quality of the density ratio estimate.

Use ACI when you have a sequential prediction problem where the distribution may shift over time in unknown ways: streaming data, time series, or deployment settings where the environment changes. The guarantee is asymptotic (long-run average), but requires no knowledge of how the distribution is changing.

Use both when you face covariate shift that also evolves over time. One can run weighted CP with ACI-style updates to the quantile level, getting the best of both worlds: importance weighting corrects for the current covariate shift, while ACI adapts the level when the shift itself changes.

Practical Warning

Neither method can fix concept drift from calibration data alone. If the conditional distribution $P(Y \mid X)$ has changed — if the same patient with the same features now has a different outcome distribution — then no reweighting or quantile adjustment based on old calibration scores can recover valid coverage. The only remedy is to re-calibrate with data from the new regime. ACI can partially compensate by widening intervals when it detects undercoverage, but it is fighting a losing battle if the scores themselves are based on a stale model. In practice, concept drift often requires retraining the model, not just re-calibrating the intervals.

Advanced

The Theory of Weighted Exchangeability

Formal Framework: Weighted Exchangeability

The standard conformal guarantee relies on exchangeability: the joint distribution of $(Z_1, \ldots, Z_n, Z_{n+1})$, where $Z_i = (X_i, Y_i)$, is invariant under permutations. Under covariate shift, this fails. Tibshirani et al. (2019) introduced a notion of weighted exchangeability that replaces the uniform permutation distribution with a weighted one.

Suppose that the calibration points $Z_1, \ldots, Z_n$ are drawn i.i.d. from a distribution $P$, and the test point $Z_{n+1}$ is drawn from a different distribution $Q$, where $Q$ is absolutely continuous with respect to $P$. Define the likelihood ratio (Radon-Nikodym derivative):

$$w(z) = \frac{dQ(z)}{dP(z)}$$

Under covariate shift, $P(Y \mid X) = Q(Y \mid X)$, so the likelihood ratio simplifies to the covariate density ratio: $w(z) = w(x) = p_Q(x) / p_P(x)$. The key observation is that while the data are not exchangeable, they satisfy a weighted form of symmetry: conditional on the unordered set of observed values $\{z_1, \ldots, z_{n+1}\}$, the probability that the test point is any particular element of that set is proportional to its weight. Formally:

$$P\left(Z_{n+1} = z_k \,\middle|\, \{Z_1, \ldots, Z_{n+1}\} = \{z_1, \ldots, z_{n+1}\}\right) = \frac{w(z_k)}{\sum_{j=1}^{n+1} w(z_j)}$$

This weighted rank distribution replaces the uniform rank distribution of the exchangeable case. The conformal p-value becomes:

$$p(Z_{n+1}) = \frac{\sum_{i=1}^{n} w(Z_i) \, \mathbf{1}\{S_i \geq S_{n+1}\} + w(Z_{n+1})}{\sum_{i=1}^{n} w(Z_i) + w(Z_{n+1})}$$

If we define $\hat{C}_\alpha(X_{n+1}) = \{y : p(X_{n+1}, y) > \alpha\}$, then under the covariate shift assumption and assuming the weights are known:

$$P_{Q}\left(Y_{n+1} \in \hat{C}_\alpha(X_{n+1})\right) \geq 1 - \alpha$$

This is the coverage guarantee under weighted exchangeability. It is exact and finite-sample, just like the standard conformal guarantee, but holds under the test distribution $Q$ rather than under $P$.
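In code, the weighted p-value is a direct transcription of the formula above (the names are mine):

```python
import numpy as np

def weighted_conformal_pvalue(cal_scores, cal_weights, s_test, w_test):
    """Weighted conformal p-value for a candidate test score s_test.
    Reduces to the usual rank-based conformal p-value when all
    weights are equal."""
    num = np.sum(cal_weights * (cal_scores >= s_test)) + w_test
    den = np.sum(cal_weights) + w_test
    return num / den

# The prediction set keeps every candidate y whose p-value exceeds alpha:
# C(x) = {y : weighted_conformal_pvalue(S, w, score(x, y), w(x)) > alpha}
```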

When Density Ratios Are Bounded

In practice, the density ratio $w(x) = p_Q(x) / p_P(x)$ must be estimated, introducing error. How does estimation error affect coverage? The analysis depends on the boundedness of the true weights.

Suppose the density ratio is bounded: $w(x) \leq M$ for all $x$ in the support. Then the effective sample size of the weighted calibration set is:

$$n_{\text{eff}} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2}$$

When the weights are uniform ($w_i = 1$), we get $n_{\text{eff}} = n$. When the weights are highly variable, $n_{\text{eff}} \ll n$; the bound $w(x) \leq M$ keeps this collapse in check, since it implies $n_{\text{eff}} \geq \frac{1}{M}\sum_{i=1}^{n} w_i$. The coverage guarantee degrades gracefully with the effective sample size: the finite-sample correction term scales as $1/n_{\text{eff}}$ rather than $1/n$.
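This is the Kish effective sample size from survey statistics. A quick numeric check (values are illustrative) shows how a handful of heavy weights collapses it:

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

print(effective_sample_size(np.ones(1000)))                # 1000.0: uniform weights
print(effective_sample_size(np.r_[np.ones(990),
                                  np.full(10, 50.0)]))     # ~85: ten heavy weights dominate
```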

More concretely, if the estimated weights $\hat{w}_i$ satisfy $|\hat{w}_i - w_i| \leq \delta$ for all $i$, the coverage guarantee becomes:

$$P_Q\left(Y_{n+1} \in \hat{C}_\alpha(X_{n+1})\right) \geq 1 - \alpha - O\left(\frac{n \delta}{\sum_{i=1}^{n} w_i}\right)$$

The coverage degradation is proportional to the total estimation error relative to the total true weight. When $\delta$ is small relative to the average weight, the coverage loss is small. When $\delta$ is comparable to the weights themselves, the guarantee becomes vacuous.

Conformal Prediction Beyond Exchangeability

Barber, Candès, Ramdas, and Tibshirani (2023) provided the most general treatment of conformal prediction without exchangeability. Their framework encompasses covariate shift, time series, and arbitrary dependence structures. The key result is a coverage bound that depends on the total variation distance between the actual data-generating process and the closest exchangeable process.

Specifically, let $P$ be the true joint distribution of $(Z_1, \ldots, Z_n, Z_{n+1})$ and let $P_{\text{exch}}$ be the closest exchangeable distribution (in total variation). Then:

$$\left| P(Y_{n+1} \in \hat{C}_\alpha(X_{n+1})) - (1-\alpha) \right| \leq 2 \, d_{\text{TV}}(P, P_{\text{exch}})$$

This gives a quantitative version of the informal statement that "approximate exchangeability gives approximate coverage." If the data are nearly exchangeable (small $d_{\text{TV}}$), coverage is nearly valid. If the data are far from exchangeable (large $d_{\text{TV}}$), the coverage guarantee degrades proportionally. The factor of 2 is tight.

The power of this result is its generality: it applies to any departure from exchangeability, not just covariate shift. It also has a sobering implication: without any assumption on the relationship between calibration and test data (not even approximate exchangeability), no distribution-free coverage guarantee is possible. You must assume something about the connection between past and future. The question is not whether to make assumptions, but which assumptions are both plausible and sufficient.

The Long-Run Coverage Property of ACI

The ACI guarantee is proved via a supermartingale argument. Define the regret process:

$$R_T = \sum_{t=1}^{T} \left(\text{err}_t - \alpha\right)$$

This measures the cumulative excess miscoverage. If $R_T$ grows without bound, the method is consistently undercovering. If $R_T$ is bounded, the average miscoverage converges to $\alpha$.

Consider the potential function $\Phi_t = (\alpha_t - \alpha)^2$. Using the ACI update rule $\alpha_{t+1} = \alpha_t + \gamma(\alpha - \text{err}_t)$, we can compute:

$$\begin{aligned}
\Phi_{t+1} - \Phi_t &= (\alpha_{t+1} - \alpha)^2 - (\alpha_t - \alpha)^2 \\
&= \left((\alpha_t - \alpha) - \gamma(\text{err}_t - \alpha)\right)^2 - (\alpha_t - \alpha)^2 \\
&= -2\gamma (\alpha_t - \alpha)(\text{err}_t - \alpha) + \gamma^2 (\text{err}_t - \alpha)^2
\end{aligned}$$

Since $\text{err}_t \in \{0, 1\}$, we have $(\text{err}_t - \alpha)^2 \leq 1$. The cross term $-2\gamma(\alpha_t - \alpha)(\text{err}_t - \alpha)$ is negative exactly when $(\alpha_t - \alpha)$ and $(\text{err}_t - \alpha)$ share a sign — and the feedback mechanism pushes in that direction, since an $\alpha_t$ above $\alpha$ produces narrower intervals and hence more frequent miscoverage. Telescoping the sum over $t = 1, \ldots, T$ and using $\Phi_{T+1} \geq 0$:

$$\sum_{t=1}^{T} 2\gamma (\alpha_t - \alpha)(\text{err}_t - \alpha) \leq \Phi_1 + T \gamma^2$$

This implies that the cumulative deviation is bounded by $O(\Phi_1/\gamma + T\gamma)$. Dividing by $T$, the average excess miscoverage satisfies:

$$\left|\frac{1}{T}\sum_{t=1}^{T} \text{err}_t - \alpha\right| \leq O\left(\frac{1}{T\gamma} + \gamma\right)$$

Choosing $\gamma = T^{-1/2}$ gives a rate of $O(T^{-1/2})$, and the average miscoverage converges to $\alpha$ at the $\sqrt{T}$ rate. This holds for any sequence of distributions — no stochastic assumptions are needed on the data-generating process. The guarantee is entirely deterministic, which is what makes it so powerful. The argument is essentially the same as the regret bound in online convex optimization, and ACI can be viewed as online gradient descent on the pinball loss.
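To see the deterministic guarantee in action, here is a small self-contained simulation (all settings are mine) in which the residual scale quadruples halfway through the stream; the running average miscoverage still settles near $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, alpha, gamma = 20_000, 0.1, 0.05
alpha_t, errs, history = alpha, [], []

for t in range(T):
    sigma = 1.0 if t < T // 2 else 4.0      # abrupt variance shift at t = T/2
    r = abs(rng.normal(0, sigma))           # absolute residual at time t
    if len(history) > 50:
        q = np.quantile(history[-500:], np.clip(1 - alpha_t, 0, 1))
    else:
        q = np.inf                          # cover everything until calibrated
    err = float(r > q)                      # miscoverage indicator err_t
    alpha_t += gamma * (alpha - err)        # the ACI update
    errs.append(err)
    history.append(r)

print(f"average miscoverage: {np.mean(errs):.3f} (target {alpha})")
# typically prints a value close to 0.100 despite the mid-stream shift
```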

```mermaid
flowchart TD
    A["Joint distribution P\nof (Z_1, ..., Z_n, Z_{n+1})"]
    A --> B{"Is P\nexchangeable?"}
    B -->|"Yes"| C["Standard CP\nExact coverage: 1 - alpha"]
    B -->|"No"| D{"Is the departure\ncharacterizable?"}
    D -->|"Known covariate shift"| E["Weighted CP\nExact coverage under Q"]
    D -->|"Bounded TV distance"| F["Approximate coverage\n|coverage - (1-alpha)| ≤ 2 d_TV"]
    D -->|"Sequential / arbitrary"| G["ACI\nLong-run average coverage"]
    D -->|"No structure at all"| H["No distribution-free\nguarantee possible"]
    style C fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style E fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style F fill:#fff3e0,stroke:#e65100,color:#1c1917
    style G fill:#fff3e0,stroke:#e65100,color:#1c1917
    style H fill:#fce4ec,stroke:#c62828,color:#1c1917
```

A decision tree for conformal prediction under different departures from exchangeability. The strength of the guarantee decreases as the assumptions weaken, culminating in impossibility when no structure is assumed at all.

Recent Developments and Open Frontiers

Conformal prediction for time series. Xu and Xie (2021) developed conformal prediction methods specifically for time series: their EnbPI (Ensemble Batch Prediction Intervals) method combines an ensemble of bootstrap models with a sliding window of recent residuals, effectively assuming local stationarity. Zaffran et al. (2022) extended ACI to time series, analyzing how the step size $\gamma$ affects the coverage-width tradeoff and proposing to aggregate ACI instances run with a grid of step sizes rather than committing to a single fixed $\gamma$.

Sequential conformal prediction. The sequential setting, where data arrives one point at a time and predictions must be made before the outcome is revealed, is where ACI is most natural. Recent work has connected this to the broader theory of online learning and game-theoretic probability. The conformal prediction interval can be viewed as a betting strategy: at each round, you "bet" that the outcome will fall inside your interval, and the ACI update adjusts your betting strategy based on past performance. The long-run coverage guarantee is then a consequence of standard regret bounds from online learning.

The fundamental tension. All of these methods navigate the same fundamental tension: stronger assumptions yield stronger guarantees, but are less likely to hold in practice. Standard conformal prediction assumes full exchangeability and gives exact finite-sample coverage — but exchangeability fails in most real applications. Weighted CP relaxes to covariate shift and gives exact coverage under the shifted distribution — but requires knowing the density ratio. ACI makes no distributional assumptions and gives long-run average coverage — but the guarantee is asymptotic and the short-run behavior can be poor. The Barber et al. (2023) framework quantifies this tradeoff: the coverage guarantee degrades continuously as a function of the distance from exchangeability, and this degradation is information-theoretically unavoidable.

The practitioner's task is to identify which regime they are in and choose the appropriate tool. In batch settings with identifiable covariate shift, weighted CP is the right choice. In streaming settings with unknown drift, ACI is the right choice. In settings where the departure from exchangeability is mild and hard to characterize, standard conformal prediction may be "good enough" — and the Barber et al. bound makes "good enough" precise. In settings where the departure is severe and unstructured, no conformal method will save you, and the honest answer is to collect new calibration data.

The Bottom Line

Exchangeability is not a technicality — it is the foundation. When it breaks, the conformal guarantee breaks with it. The theory of weighted exchangeability, adaptive conformal inference, and conformal prediction beyond exchangeability provides a rich toolkit for navigating this reality. But no method can conjure coverage from nothing. The deepest lesson is an impossibility result: without some assumption connecting calibration data to test data, distribution-free prediction intervals with non-trivial coverage are impossible. The art is in choosing assumptions that are both defensible and sufficient — and in monitoring your deployed system to know when even those assumptions have failed.

Further Reading

  • Tibshirani, R. J., Barber, R. F., Candès, E. J., & Ramdas, A. (2019). Conformal prediction under covariate shift. NeurIPS.
  • Gibbs, I. & Candès, E. J. (2021). Adaptive conformal inference under distribution shift. NeurIPS.
  • Xu, C. & Xie, Y. (2021). Conformal prediction interval for dynamic time-series. ICML.
  • Barber, R. F., Candès, E. J., Ramdas, A., & Tibshirani, R. J. (2023). Conformal prediction beyond exchangeability. Annals of Statistics.
  • Zaffran, M., Féron, O., Goude, Y., Josse, J., & Dieuleveut, A. (2022). Adaptive conformal predictions for time series. ICML.