Your Model Is Confident. Should You Be?

Part 1 of a 10-part series on prediction intervals, conformal prediction, and leverage scores.

A neural network predicts that a patient needs 150mg of a drug. A gradient-boosted tree estimates a house will sell for 420,000. A linear model forecasts next quarter's revenue at 2.3M. These numbers are precise. They are also incomplete. What is missing is uncertainty—and that omission can be deeply misleading.

This post is about why point predictions are not enough, what prediction intervals actually are, and why building them correctly is harder than it looks. We present the ideas at three levels of depth. Read just the first for the core intuition, or go all the way through for the full mathematical story.

Intuitive

The Problem with Naked Predictions

A Weather Forecast You Can Use

Imagine two weather forecasts for tomorrow:

  • Forecast A: "It will rain tomorrow."
  • Forecast B: "There is a 70% chance of rain tomorrow."

Forecast A is a point prediction—a single, definitive statement. Forecast B is far more useful because it tells you how sure the forecaster is. With a 70% chance of rain, you might grab an umbrella but still plan an outdoor lunch. With a 95% chance, you move the party indoors. The uncertainty changes the decision, and in many cases, the uncertainty is more important than the forecast itself.

Machine learning models almost always behave like Forecast A. They give you a single number—a prediction—without any indication of how much to trust it. A random forest might output 420,000 for a house price, but it will not tell you whether that estimate could plausibly be off by 5,000 or by 50,000. Without that information, anyone acting on the prediction is essentially flying blind.

```mermaid
flowchart LR
    A["Model"] --> B["Point Prediction\n420,000"]
    B --> C{"What is missing?"}
    C --> D["How confident\nis the model?"]
    D --> E["Prediction Interval\n370K – 470K"]
```

A point prediction alone leaves you guessing. A prediction interval tells you the range of plausible outcomes.

When the Stakes Are High

For low-stakes predictions, a single number is fine. If a movie recommendation engine is off, you lose 90 minutes. But in high-stakes settings, acting on a prediction without knowing its reliability can be costly or even harmful:

  • Medical dosing. A model predicts a patient needs 150mg of a drug. But is the true dose likely between 145mg and 155mg, or between 100mg and 200mg? The treatment decision depends entirely on how wide that range is. A clinician who sees a tight interval may proceed confidently; a wide interval signals the need for additional testing or a more conservative dose.
  • Self-driving cars. An autonomous vehicle predicts that a pedestrian will be at a specific position in two seconds. Is that prediction reliable to within 10cm or 2m? The braking decision depends on the uncertainty, not just the point estimate. A 10cm margin allows the car to maintain speed, while a 2m margin demands an immediate brake or lane change.
  • Financial risk. A portfolio model predicts 7% annual return. The difference between a range of 5% to 9% and a range of -10% to 24% is the difference between a safe investment and a gamble. An institutional investor allocating pension funds needs to know whether that 7% comes with a narrow spread or a wide one, because the downside tail determines the risk of insolvency.

Analogy

GPS without a confidence radius. Your phone shows a blue dot for your location. But it also shows a shaded circle around the dot—the confidence radius. In a clear open field with good satellite coverage, that circle is tiny: the phone knows where you are to within a few meters. In a dense urban canyon with tall buildings bouncing signals around, the circle grows much larger. The dot tells you where the GPS thinks you are. The circle tells you how much to trust that estimate. A prediction without uncertainty is like GPS with only the dot and no circle: you know the estimate, but you have no idea how far off it might be. You would never navigate a tricky intersection using only the dot.

Three Flavors of Uncertainty

So we agree that uncertainty matters. But "uncertainty" itself is an overloaded word in statistics, and people often confuse three different ways of expressing it. Here is a non-technical way to tell them apart:

```mermaid
flowchart TD
    Q["You have a model.\nWhat are you uncertain about?"]
    Q --> P["A fixed quantity?\ne.g., population average height"]
    Q --> F["A future observation?\ne.g., the NEXT person's height"]
    Q --> C["Your probability estimates?\ne.g., does 70% rain really mean 70%?"]
    P --> CI["Confidence Interval"]
    F --> PI["Prediction Interval"]
    C --> CAL["Calibration"]
    style CI fill:#e3f2fd,stroke:#1565c0,color:#1c1917
    style PI fill:#e8f5e9,stroke:#2e7d32,color:#1c1917
    style CAL fill:#fce4ec,stroke:#c62828,color:#1c1917
```

Three kinds of uncertainty: about parameters, about future data, and about probabilities themselves.

  • Confidence interval — "The average height of adults in this country is between 168cm and 172cm." This is about a fixed quantity (the population average). You are measuring how precisely you know it. If you collected a different sample, you would get a slightly different interval, but the true average would stay the same.
  • Prediction interval — "The next person you measure will be between 155cm and 185cm." This is about a future observation. It is always wider than a confidence interval because individual people vary much more than averages do. To cover where the next individual will land, you need to account for both the uncertainty in your model and the natural variability of the outcome.
  • Calibration — "When the weather app says 70% chance of rain, does it actually rain about 70% of the time?" This is about whether the probabilities themselves are trustworthy. A well-calibrated model does not need to be perfectly accurate on any single prediction—it just needs its stated probabilities to line up with observed frequencies over many predictions.

This series is about prediction intervals: given what we know about something (its features), can we give a range that the true outcome will fall into?

Key Takeaway

A point prediction tells you the model's best guess. A prediction interval tells you how much to trust that guess. In high-stakes applications—medicine, autonomy, finance—the interval is often more important than the prediction itself.

Analogy

Confidence interval vs. prediction interval, in everyday terms. A confidence interval is like saying, "The average commute in this city is 25 to 35 minutes." A prediction interval is like saying, "YOUR commute tomorrow will be 10 to 60 minutes." The first is about the average; the second is about what will actually happen to you. The second is always wider because individual outcomes are noisier than averages. On any given morning, you might hit every red light and get stuck behind a stalled bus, or you might catch greens the whole way. The average smooths all of that out, but your lived experience does not.

Technical

Formalizing Prediction Intervals

Confidence Intervals, Prediction Intervals, and Calibration

The intuitive section introduced three flavors of uncertainty using everyday language. Now let us define these three concepts more precisely, using the language of probability.

A confidence interval targets a fixed but unknown parameter $\theta$ (such as a population mean). A 95% confidence interval $[L, U]$ satisfies:

$$P(\theta \in [L, U]) \geq 0.95$$

The randomness is in the interval endpoints, which depend on the sample. The parameter $\theta$ is fixed. If you were to redraw the training data and recompute the interval many times, about 95% of those intervals would contain $\theta$.

A prediction interval targets a future random observation $Y_{\text{new}}$. A 90% prediction interval $[L(x), U(x)]$ satisfies:

$$P(Y_{\text{new}} \in [L(x), U(x)]) \geq 0.90$$

Here, both the interval endpoints and the future observation are random. The prediction interval must account for two sources of uncertainty: estimation error in the model and the intrinsic randomness of the outcome. This is why prediction intervals are always wider than confidence intervals at the same coverage level—they inherit all of the estimation uncertainty of a confidence interval, plus the additional variability of an individual outcome around the mean.
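
To make the width difference concrete, here is a minimal Python sketch, assuming an i.i.d. Gaussian sample and the textbook $t$-based formulas, that computes both intervals from the same data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=170, scale=8, size=50)   # simulated adult heights in cm

n = len(y)
ybar, s = y.mean(), y.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)

# 95% confidence interval for the population mean:
# only the estimation error of the mean enters (s / sqrt(n)).
ci = (ybar - t_crit * s / np.sqrt(n), ybar + t_crit * s / np.sqrt(n))

# 95% prediction interval for the NEXT observation:
# the individual's own variability adds the leading "1 +" term.
pi = (ybar - t_crit * s * np.sqrt(1 + 1 / n), ybar + t_crit * s * np.sqrt(1 + 1 / n))

print(f"CI for the mean:         ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"PI for next observation: ({pi[0]:.1f}, {pi[1]:.1f})")   # always wider
```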

Calibration is a property of probabilistic forecasts. A forecaster that assigns probability $p$ to an event is calibrated if the event occurs with empirical frequency $p$. More precisely, among all instances where the forecaster says $P(\text{rain}) = 0.7$, it should rain about 70% of the time.
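
A calibration check can be made concrete with a few lines of code. The sketch below, using synthetic data and illustrative names, bins predictions by their stated probability and compares each bin's average prediction to the observed frequency:

```python
import numpy as np

def calibration_table(p_pred, y_obs, n_bins=10):
    """Per-bin comparison of mean predicted probability vs. observed frequency."""
    p_pred, y_obs = np.asarray(p_pred), np.asarray(y_obs)
    edges = np.linspace(0, 1, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p_pred >= lo) & (p_pred < hi)
        if mask.any():
            rows.append((p_pred[mask].mean(), y_obs[mask].mean(), int(mask.sum())))
    return rows

# Synthetic example: a perfectly calibrated forecaster.
rng = np.random.default_rng(1)
p = rng.uniform(0, 1, size=10_000)
y = rng.binomial(1, p)                 # outcomes occur with exactly the stated probabilities
for pred, obs, count in calibration_table(p, y):
    print(f"predicted {pred:.2f}   observed {obs:.2f}   (n={count})")
```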

| Concept | Target | What varies? | Width |
| --- | --- | --- | --- |
| Confidence interval | Fixed parameter $\theta$ | Interval endpoints (random sample) | Narrower |
| Prediction interval | Future observation $Y_{\text{new}}$ | Both endpoints and target | Wider |
| Calibration | Probability estimates | Predicted probabilities vs. frequencies | N/A |

```mermaid
flowchart TD
    subgraph CI["Confidence Interval"]
        direction TB
        C1["Target: fixed parameter θ"]
        C2["Uncertainty: sampling variability"]
        C3["Width: reflects estimation precision"]
        C1 --- C2 --- C3
    end
    subgraph PI["Prediction Interval"]
        direction TB
        P1["Target: future observation Y"]
        P2["Uncertainty: estimation + noise"]
        P3["Width: always wider than CI"]
        P1 --- P2 --- P3
    end
    subgraph CAL["Calibration"]
        direction TB
        A1["Target: probability accuracy"]
        A2["Metric: predicted freq. vs. observed"]
        A3["Output: reliability diagram"]
        A1 --- A2 --- A3
    end
```

Structural comparison of the three uncertainty concepts.

The Classical Prediction Interval

With those definitions in hand, let us look at the best-known prediction interval formula: the one from ordinary least squares regression. Under the standard linear-model assumptions with Gaussian errors (spelled out below), the textbook prediction interval at a new point $x$ is:

$$\hat{y}(x) \pm t_{1-\alpha/2,\; n-p} \cdot \hat{\sigma} \cdot \sqrt{1 + h(x)}$$

Each piece has a specific role:

  • $\hat{y}(x) = x^\top \hat{\beta}$ — the fitted value (center of the interval).
  • $t_{1-\alpha/2,\; n-p}$ — the $1-\alpha/2$ quantile of the $t$-distribution with $n - p$ degrees of freedom. This plays the role of the "how many standard deviations" factor, but accounts for the fact that we estimated $\sigma$ rather than knowing it.
  • $\hat{\sigma}$ — the residual standard error, estimating the noise level.
  • $h(x) = x^\top (X^\top X)^{-1} x$ — the leverage of the test point $x$. This measures how "unusual" or "influential" the test point is relative to the training data. A point far from the center of the training distribution in feature space has high leverage, meaning the model's estimate at that point depends heavily on a small number of nearby training observations. Higher leverage means wider intervals, reflecting the greater estimation uncertainty at those locations.
  • The $\sqrt{1 + h(x)}$ factor combines two sources of variance: the irreducible noise ($\sigma^2$) and the estimation uncertainty ($\sigma^2 \cdot h(x)$).

```mermaid
flowchart LR
    subgraph Var["Prediction Variance"]
        direction TB
        V1["Noise variance\nσ²"]
        V2["Estimation variance\nσ² · h(x)"]
        V3["Total\nσ²(1 + h(x))"]
        V1 --> V3
        V2 --> V3
    end
    V3 --> W["Interval width\n∝ σ̂ · √(1 + h(x))"]
```

Two sources of uncertainty combine to determine the prediction interval width.
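
Below is a minimal NumPy/SciPy sketch of this formula. The data and test points are simulated purely for illustration, and the explicit matrix inverse is kept only to mirror the algebra (a linear solve would be preferable numerically).

```python
import numpy as np
from scipy import stats

def ols_prediction_interval(X, y, x_new, alpha=0.10):
    """Classical OLS interval: y_hat(x) ± t_{1-alpha/2, n-p} * sigma_hat * sqrt(1 + h(x))."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    residuals = y - X @ beta_hat
    sigma_hat = np.sqrt(residuals @ residuals / (n - p))   # residual standard error
    h = x_new @ XtX_inv @ x_new                            # leverage of the test point
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)
    y_hat = x_new @ beta_hat
    half_width = t_crit * sigma_hat * np.sqrt(1 + h)
    return y_hat - half_width, y_hat + half_width

# Simulated example: y = 2 + 3x + Gaussian noise, intercept column added explicitly.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x + rng.normal(0, 2, size=100)

print(ols_prediction_interval(X, y, np.array([1.0, 5.0])))   # near the training data
print(ols_prediction_interval(X, y, np.array([1.0, 25.0])))  # far away: higher leverage, wider
```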

Where Classical Intervals Break Down

This formula requires four assumptions:

  1. Linearity. The true relationship $Y = X^\top \beta^* + \varepsilon$ is linear in the features.
  2. Gaussian errors. $\varepsilon \sim N(0, \sigma^2)$.
  3. Homoscedasticity. The error variance $\sigma^2$ is constant across the feature space.
  4. Independence. Observations are independent.

Violate any of these and the $1-\alpha$ coverage guarantee no longer holds. In modern practice, models are routinely nonlinear (random forests, neural networks, boosted trees), errors are non-Gaussian, and variance is often heteroscedastic—different in different parts of the feature space. For example, predicting house prices in a city center (where there is abundant training data) may produce tighter residuals than predicting prices in rural areas (where the model has little data and outcomes are more variable). The classical formula has no mechanism to account for this. We need methods that provide valid prediction intervals without these assumptions.
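
The heteroscedastic failure mode is easy to see in simulation. The sketch below is illustrative rather than a benchmark: noise grows with $x$, the classical interval uses a single pooled $\hat{\sigma}$, and empirical coverage is checked separately in a low-noise and a high-noise region.

```python
import numpy as np
from scipy import stats

def coverage_under_heteroscedasticity(n=200, n_test=2000, alpha=0.10, seed=0):
    rng = np.random.default_rng(seed)
    # Training data: noise standard deviation grows with x (violates homoscedasticity).
    x = rng.uniform(0, 10, n)
    X = np.column_stack([np.ones(n), x])
    y = 1 + 2 * x + rng.normal(0, 0.5 + 0.5 * x)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    sigma_hat = np.sqrt(np.sum((y - X @ beta) ** 2) / (n - 2))  # one pooled noise estimate
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)

    for label, x0 in [("low-noise region (x=1)", 1.0), ("high-noise region (x=9)", 9.0)]:
        x_new = np.array([1.0, x0])
        y_new = 1 + 2 * x0 + rng.normal(0, 0.5 + 0.5 * x0, n_test)
        half = t_crit * sigma_hat * np.sqrt(1 + x_new @ XtX_inv @ x_new)
        covered = np.mean(np.abs(y_new - x_new @ beta) <= half)
        print(f"{label}: empirical coverage {covered:.2f} vs nominal {1 - alpha:.2f}")

coverage_under_heteroscedasticity()
# Typically: well above nominal where the noise is small, well below where it is large.
```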

Looking Ahead

The next post introduces conformal prediction, which achieves valid coverage without any distributional or model assumptions. The only requirement is exchangeability of the data—a much weaker condition than independence and Gaussianity.

Advanced

Derivation and Desiderata

Formal Definition

Given a feature vector $X \in \mathbb{R}^p$ and a response $Y \in \mathbb{R}$, a prediction interval at level $1 - \alpha$ is a pair of measurable functions $L, U : \mathbb{R}^p \to \mathbb{R}$ (constructed from training data). The ideal guarantee is conditional coverage:

$$P\bigl(Y_{\text{new}} \in [L(x), U(x)] \mid X_{\text{new}} = x\bigr) \geq 1 - \alpha$$

The weaker marginal coverage guarantee—which does not condition on $x$—requires only:

$$P\bigl(Y_{\text{new}} \in [L(X_{\text{new}}), U(X_{\text{new}})]\bigr) \geq 1 - \alpha$$

Marginal coverage is achievable distribution-free; conditional coverage in general is not (without further assumptions). Intuitively, marginal coverage allows the method to "average" its mistakes over the entire feature space: it may undercover in some regions and overcover in others, as long as the average is at least $1-\alpha$. Conditional coverage demands that every region of the feature space is covered at the right level, which is a much stronger requirement. This distinction will be a recurring theme throughout the series.
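
One way to make the distinction operational: marginal coverage is a single number over a held-out set, while conditional coverage has to be probed locally, for example by binning on a feature. A rough sketch, using a deliberately naive constant-width interval on heteroscedastic data purely for illustration:

```python
import numpy as np

def marginal_coverage(y, lower, upper):
    """Fraction of held-out outcomes that fall inside their intervals."""
    return np.mean((y >= lower) & (y <= upper))

def coverage_by_region(y, lower, upper, feature, n_bins=5):
    """Coverage within quantile bins of one feature: a crude probe of conditional coverage."""
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (feature >= lo) & (feature <= hi)
        out.append(((lo, hi), marginal_coverage(y[mask], lower[mask], upper[mask])))
    return out

# Constant-width intervals on data whose noise grows with x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 5000)
y = 2 * x + rng.normal(0, 0.5 + 0.5 * x)
pred = 2 * x                                   # pretend the mean is known exactly
lower, upper = pred - 5.5, pred + 5.5          # one fixed width for every x

print("marginal coverage:", round(marginal_coverage(y, lower, upper), 3))
for (lo, hi), cov in coverage_by_region(y, lower, upper, x):
    print(f"x in [{lo:.1f}, {hi:.1f}]: coverage {cov:.2f}")
```

In this setup the marginal number can look healthy even while the noisiest bin falls well short of it, which is exactly the gap between the two guarantees.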

Derivation from First Principles

Having defined what a prediction interval is, let us now derive the classical formula from scratch, making each assumption explicit. This derivation clarifies exactly where each piece of the formula comes from and, equally important, where it can go wrong.

Assume the linear model $Y = X^\top \beta^* + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I_n)$ and design matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$. The OLS estimator is $\hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top Y$.

For a new observation $(x, Y_{\text{new}})$ with $Y_{\text{new}} = x^\top \beta^* + \varepsilon_{\text{new}}$ independent of the training data, consider the prediction residual:

$$Y_{\text{new}} - \hat{Y}(x) = Y_{\text{new}} - x^\top \hat{\beta}$$

Decompose this into noise and estimation error:

$$Y_{\text{new}} - x^\top \hat{\beta} = \underbrace{\varepsilon_{\text{new}}}_{\text{future noise}} - \underbrace{x^\top(\hat{\beta} - \beta^*)}_{\text{estimation error}}$$

```mermaid
flowchart TD
    R["Prediction residual\nY(new) − x'β̂"]
    R --> N["Future noise\nϵ(new)"]
    R --> E["Estimation error\n−x'(β̂ − β*)"]
    N --> VN["Var = σ²"]
    E --> VE["Var = σ² · h(x)"]
    VN --> T["Independent, so\nVar = σ² + σ² · h(x)\n= σ²(1 + h(x))"]
    VE --> T
```

Decomposition of prediction residual variance into two independent components.

Since $\hat{\beta} - \beta^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \varepsilon$ is independent of $\varepsilon_{\text{new}}$, the two terms are independent. Compute the variance:

$$\operatorname{Var}(Y_{\text{new}} - \hat{Y}(x)) = \operatorname{Var}(\varepsilon_{\text{new}}) + \operatorname{Var}\bigl(x^\top(\hat{\beta} - \beta^*)\bigr)$$

$$= \sigma^2 + x^\top \operatorname{Var}(\hat{\beta})\, x = \sigma^2 + x^\top \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} x = \sigma^2\bigl(1 + h(x)\bigr)$$

where $h(x) = x^\top (\mathbf{X}^\top \mathbf{X})^{-1} x$ is the leverage of the test point. Since $\varepsilon_{\text{new}}$ and $\hat{\beta} - \beta^*$ are both Gaussian (the latter because it is a linear transformation of the Gaussian vector $\varepsilon$), the prediction residual is Gaussian:

$$Y_{\text{new}} - \hat{Y}(x) \sim N\bigl(0,\; \sigma^2(1 + h(x))\bigr)$$

Replacing $\sigma$ with $\hat{\sigma} = \sqrt{\frac{1}{n-p}\|\mathbf{Y} - \mathbf{X}\hat{\beta}\|^2}$ and using the fact that $\hat{\sigma}^2$ is independent of $\hat{\beta}$ (Cochran's theorem), we obtain:

$$\frac{Y_{\text{new}} - \hat{Y}(x)}{\hat{\sigma}\sqrt{1 + h(x)}} \sim t_{n-p}$$

This immediately yields the $1-\alpha$ prediction interval:

$$\hat{Y}(x) \pm t_{1-\alpha/2,\; n-p} \cdot \hat{\sigma} \cdot \sqrt{1 + h(x)}$$

```mermaid
flowchart LR
    subgraph Assumptions
        A1["Y = X'β* + ϵ"]
        A2["ϵ ~ N(0, σ²I)"]
        A3["Independence"]
        A4["Homoscedasticity"]
    end
    subgraph Derivation
        D1["Residual decomposition"]
        D2["Variance = σ²(1 + h(x))"]
        D3["Studentize with σ̂"]
        D4["Pivotal t-statistic"]
        D1 --> D2 --> D3 --> D4
    end
    Assumptions --> Derivation
    D4 --> PI["Prediction Interval\nŷ ± t · σ̂ · √(1+h)"]
```

The derivation pipeline from assumptions to the classical prediction interval.
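
The derivation can be sanity-checked by Monte Carlo: when every assumption holds, the standardized prediction residual is exactly $t_{n-p}$, so the interval should hit its nominal coverage even at small $n$. A minimal sketch with arbitrary simulation settings:

```python
import numpy as np
from scipy import stats

def classical_pi_coverage(n=30, p=3, alpha=0.10, n_reps=10_000, seed=0):
    """Empirical coverage of the classical interval when the Gaussian linear model holds."""
    rng = np.random.default_rng(seed)
    beta_true = np.ones(p)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)
    hits = 0
    for _ in range(n_reps):
        X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
        y = X @ beta_true + rng.normal(size=n)                 # sigma = 1, Gaussian errors
        x_new = np.concatenate([[1.0], rng.normal(size=p - 1)])
        y_new = x_new @ beta_true + rng.normal()

        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y
        sigma_hat = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - p))
        half = t_crit * sigma_hat * np.sqrt(1 + x_new @ XtX_inv @ x_new)
        hits += abs(y_new - x_new @ beta_hat) <= half
    return hits / n_reps

print(classical_pi_coverage())   # should land very close to 0.90, even with n = 30
```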

The Five Desiderata

The derivation above gives us a concrete formula, but it relies on strong assumptions. Before moving on, it is worth asking a broader question: what would an ideal prediction interval method look like, regardless of the model or the data distribution? We articulate five desiderata:

  1. Valid coverage. If we claim $1-\alpha$ coverage, the guarantee $P(Y_{\text{new}} \in [L(x), U(x)]) \geq 1-\alpha$ must hold as a finite-sample statement, not merely asymptotically.
  2. Distribution-free. The coverage guarantee must not depend on the distribution of $\varepsilon$ being Gaussian, or belonging to any parametric family.
  3. Model-agnostic. The method must produce valid intervals for any underlying prediction model $\hat{f}$: linear regression, random forests, neural networks, or any other. This is essential for practical use, because the best model for a given problem is often unknown in advance.
  4. Adaptive width. The interval width $U(x) - L(x)$ should vary with $x$, being narrower where prediction is easier and wider where it is harder. A constant-width interval wastes coverage budget: it is too wide in easy regions and potentially too narrow in hard ones. For instance, a house price model should produce tighter intervals in a dense urban neighborhood with many comparable sales than in a rural area with few data points.
  5. Computationally efficient. The method should not require retraining the model $O(n)$ times (as the jackknife does) or running expensive bootstrap simulations. In production settings with large datasets and complex models, the cost of computing intervals should be a small fraction of the cost of fitting the model itself.

The classical formula satisfies desiderata 1 and 5 but fails 2 and 3 by construction, and only partially satisfies 4 (the width adapts via $h(x)$, but this is limited to the leverage structure of OLS).

Preview of the Series

Conformal prediction (Part 2) achieves desiderata 1, 2, 3, and 5. It provides distribution-free, finite-sample, model-agnostic prediction intervals at essentially zero computational overhead beyond fitting the model once. But vanilla conformal prediction produces constant-width intervals, failing desideratum 4. The rest of the series explores this tension from multiple angles—including the role of leverage scores (Parts 4–6) in understanding what makes some predictions harder than others.

Why Conditional Coverage Is Hard

The marginal guarantee $P(Y_{\text{new}} \in C(X_{\text{new}})) \geq 1-\alpha$ is achievable nonparametrically. The conditional guarantee $P(Y_{\text{new}} \in C(x) \mid X_{\text{new}} = x) \geq 1-\alpha$ for all $x$ is, in general, impossible without structural assumptions. Vovk (2012) and Lei and Wasserman (2014) showed that for any distribution-free method with marginal coverage, there exist distributions where conditional coverage fails badly at specific points $x$.

The key difficulty is that the feature space is typically high-dimensional and continuous, so any finite calibration set provides only sparse coverage of the space. Without assumptions that link the distribution at nearby points (such as smoothness or parametric structure), there is simply not enough information to guarantee coverage at every individual location. This impossibility result motivates the search for approximate conditional coverage—methods that come as close to pointwise validity as possible while remaining distribution-free—which is a major theme in Parts 5–10.

Further Reading

  • Geisser, S. (1993). Predictive Inference: An Introduction. Chapman & Hall.
  • Gneiting, T. & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378.
  • Angelopoulos, A. N. & Bates, S. (2023). Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning.
  • Vovk, V. (2012). Conditional validity of inductive conformal predictors. Asian Conference on Machine Learning.
  • Lei, J. & Wasserman, L. (2014). Distribution-free prediction bands for nonparametric regression. Journal of the Royal Statistical Society: Series B, 76(1), 71–96.