Conformal Prediction: General Case and Regression

Chapter 2 of Vovk, Gammerman, and Shafer (2022)

In this chapter the authors formally introduce conformal predictors and discuss notions of efficiency and validity.

References

Barber, Rina Foygel, Emmanuel J. Candes, Aaditya Ramdas, and Ryan J. Tibshirani. 2023. “Conformal Prediction Beyond Exchangeability,” March. https://doi.org/10.48550/arXiv.2202.13415.
Serfling, Robert J. 2001. Approximation Theorems of Mathematical Statistics. Wiley Series in Probability and Statistics. Nashville, TN: John Wiley & Sons.
Vovk, Vladimir, Alexander Gammerman, and Glenn Shafer. 2022. Algorithmic Learning in a Random World. 2nd ed. Cham, Switzerland: Springer International Publishing.

Overview

  1. Formally introduce the notion of conformal predictors.
  2. Formalize the notions of validity and efficiency.
  3. Recall online and offline prediction.

Result: when a conformal predictor is used in the online mode, its output is valid.

Modification: the basic procedure of conformal prediction might look computationally inefficient when the label set is large, but there are ways of making conformal predictors much more efficient.

Confidence predictors

The conformal predictors we define in this chapter are confidence predictors – they make a range of successively more specific predictions with successively less confidence.

In this section we define precisely what we mean by a confidence predictor and its validity.

Notation and assumptions

  1. We have successive pairs called examples

\[ (x_1, y_1), (x_2, y_2), \ldots, \qquad(1)\]

where \(x_i\) is the object and \(y_i\) is the label.

  2. \(x_i \in \mathscr{X}\), the object space, and \(y_i \in \mathscr{Y}\), the label space. Typically, we assume \(\mathscr{X}\) is nonempty and \(\mathscr{Y}\) has at least two different elements.

  3. Together, they form the example space: \[\mathscr{Z} = \mathscr{X} \times \mathscr{Y}.\]

  4. The infinite data sequence (Eq. 1) is an element of the space \(\mathscr{Z}^\infty\).

  1. Standard assumption: the examples are generated independently from some distribution \(\mathscr{Q}\) on \(\mathscr{Z}\).

  2. Most of the results presented in Vovk, Gammerman, and Shafer (2022) hold under this randomness assumption.

  3. A weaker condition, exchangeability, suffices for most results!

    1. Recent work by Barber et al. (2023) extends conformal prediction beyond this assumption.
  4. We use the randomness or exchangeability assumption depending on whichever leads to stronger claims.

  5. Loosely speaking, exchangeability means that the distribution of the examples does not depend on their order.


The statement that a distribution \(\mathscr{P}\) on \(\mathscr{Z}^\infty\) is exchangeable means that for every positive integer \(n\), every permutation \(\pi\) of \(\{1,\ldots,n\}\), and every measurable set \(\mathscr{E} \subseteq \mathscr{Z}^n\), \[\begin{align} \mathscr{P}\{(z_1,z_2,\ldots) \in \mathscr{Z}^\infty &: (z_1,z_2,\ldots,z_n) \in \mathscr{E}\} \\ = \mathscr{P}\{(z_1,z_2,\ldots) \in \mathscr{Z}^\infty &: (z_{\pi(1)},z_{\pi(2)},\ldots,z_{\pi(n)}) \in \mathscr{E}\}. \end{align}\] In particular, every power distribution \(\mathscr{Q}^\infty\) (i.i.d. examples) is exchangeable, so the randomness assumption implies exchangeability.

Simple predictors and confidence predictors

Given information on \(n-1\) past examples and the \(n^{\text{th}}\) object \(x_n\), we want to predict the label \(y_n\). To do so we need a simple predictor function \[D: \mathscr{Z}^* \times \mathscr{X} \rightarrow \mathscr{Y}.\]

For any sequence of old examples \((x_1,y_1,\ldots,x_{n-1},y_{n-1}) \in \mathscr{Z}^*\) and any new object \(x_n \in \mathscr{X}\), the simple predictor outputs \(D(x_1,y_1,\ldots,x_{n-1},y_{n-1},x_n) \in \mathscr{Y}\) as its prediction for the new label \(y_n\).

Conformal predictors require a more complicated notion of prediction: we want to give a range of predictions.

For this, we are interested in an algorithm \(\Gamma\) that takes a significance level \(\epsilon \in (0,1)\) together with \((x_1,y_1,\ldots,x_{n-1},y_{n-1},x_n)\) and outputs a subset

\[ \Gamma_n^{\epsilon}(x_1,y_1,\ldots,x_{n-1},y_{n-1},x_n) \subseteq \mathscr{Y}. \qquad(2)\]

We require the prediction set in Eq. 2 to shrink as \(\epsilon\) increases, i.e., for \(\epsilon_1 \geq \epsilon_2\),

\[\Gamma^{\epsilon_1}\left(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n\right) \subseteq \Gamma^{\epsilon_2}\left(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n\right). \qquad(3)\]

Formally, a confidence predictor \(\Gamma\) is a measurable function \[\Gamma: \mathscr{Z}^* \times \mathscr{X} \times (0, 1) \rightarrow 2^\mathscr{Y}\] that satisfies Eq. 3 for all significance levels \(\epsilon_1 \geq \epsilon_2\), all positive integers \(n\), and all data sequences \((x_1,y_1,\ldots,x_{n-1},y_{n-1},x_n)\).
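
To make the definition concrete, here is a minimal Python sketch of the interface a confidence predictor exposes. The names (`ConfidencePredictor`, `predict_set`, `check_nested`) are ours, not the book's; `check_nested` simply tests the nestedness requirement of Eq. 3 for one input.

```python
from typing import Any, Protocol, Sequence, Set, Tuple

Example = Tuple[Any, Any]  # an (object, label) pair


class ConfidencePredictor(Protocol):
    """Hypothetical interface for a confidence predictor Gamma."""

    def predict_set(
        self,
        history: Sequence[Example],  # (x_1, y_1), ..., (x_{n-1}, y_{n-1})
        x_new: Any,                  # the new object x_n
        epsilon: float,              # significance level in (0, 1)
    ) -> Set[Any]:
        """Return the prediction set Gamma^epsilon(...) as a subset of the label space."""
        ...


def check_nested(gamma: ConfidencePredictor, history, x_new, eps1: float, eps2: float) -> bool:
    """Verify Eq. 3: for eps1 >= eps2, the eps1-set is contained in the eps2-set."""
    assert eps1 >= eps2
    return gamma.predict_set(history, x_new, eps1) <= gamma.predict_set(history, x_new, eps2)
```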

Validity

We are predicting \[y_n \in \Gamma_n^{\epsilon}(x_1,y_1,\ldots,x_{n-1},y_{n-1},x_n).\]

For each significance level \(\epsilon\), we want to have confidence \(1 - \epsilon\) in our prediction about \(y_n\).

  1. The probability of the prediction being in error should be \(\epsilon\).
  2. Since we are making a sequence of predictions, we would like these error events to be independent.

If these conditions are met no matter what exchangeable probability distribution \(\mathscr{P}\) governs the sequence of examples, then we say that the confidence predictor \(\Gamma\) is “exactly valid”.
If the probabilities for errors are allowed to be even less than \(\epsilon\), then we say that the confidence predictor \(\Gamma\) is “conservatively valid”.

As we will see, confidence predictors sometimes have asymptotic versions of these properties even when they are not exactly or conservatively valid:

  1. We call a confidence predictor asymptotically exact if, at each significance level \(\epsilon\), its frequency of errors converges to \(\epsilon\) with probability one.
  2. Similarly, we call a confidence predictor asymptotically conservative if, at each significance level \(\epsilon\), its frequency of errors is asymptotically at most \(\epsilon\) with probability one.

Formalizing validity

Using data sequence \(\omega = (x_1,y_1, x_2, y_2, \ldots)\) and the predictor \(\Gamma\) at significance level \(\epsilon\), we judge whether \(\Gamma\) makes an error at the \(n^{\text{th}}\) stage as follows: \[\begin{aligned} \operatorname{err}_n^\epsilon(\Gamma, \omega) & := \begin{cases}1 \quad \text { if } y_n \notin \Gamma^\epsilon\left(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n\right) \\ 0 \quad \text { otherwise }, \end{cases} \end{aligned} \qquad(4)\] and the number of errors made during the first \(n\) trials is given by \[\begin{aligned} \operatorname{Err}_n^\epsilon(\Gamma, \omega) := \sum_{i=1}^n \operatorname{err}_i^\epsilon(\Gamma, \omega). \end{aligned} \qquad(5)\]
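
Eqs. 4 and 5 translate directly into code. A minimal sketch, assuming a predictor object with the hypothetical `predict_set(history, x_new, epsilon)` interface from above:

```python
def cumulative_errors(gamma, examples, epsilon):
    """Compute Err_n for n = 1, ..., N (Eq. 5) along the sequence `examples`."""
    totals, total = [], 0
    for n, (x_n, y_n) in enumerate(examples, start=1):
        history = examples[: n - 1]                # the first n-1 examples
        # err_n (Eq. 4): 1 if the true label falls outside the prediction set
        err_n = int(y_n not in gamma.predict_set(history, x_n, epsilon))
        total += err_n
        totals.append(total)
    return totals  # totals[-1] / len(examples) is the empirical error rate
```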

If \(\omega\) is drawn from an exchangeable distribution \(\mathscr{P}\), the number \(\operatorname{err}_n^\epsilon(\Gamma, \omega)\) is the realized value of a random variable, which we designate \(\operatorname{err}_n^\epsilon(\Gamma)\).

\(\Gamma\) is exactly valid if, for each \(\epsilon \in (0,1)\), \(\operatorname{err}_1^\epsilon(\Gamma), \operatorname{err}_2^\epsilon(\Gamma), \ldots\) is a sequence of independent Bernoulli random variables with parameter \(\epsilon\).

\(\Gamma\) is asymptotically exact if, for each \(\epsilon\), \[\lim_{n \rightarrow \infty} \frac{\operatorname{Err}_n^\epsilon(\Gamma)}{n} = \epsilon \quad \text{with probability one}.\]

\(\Gamma\) is conservatively valid if, for each \(\epsilon\), \(\operatorname{err}_1^\epsilon(\Gamma), \operatorname{err}_2^\epsilon(\Gamma), \ldots\) is stochastically dominated by a sequence of independent Bernoulli random variables with parameter \(\epsilon\).

\(\Gamma\) is asymptotically conservative if, for each \(\epsilon\), \[\limsup_{n \rightarrow \infty} \frac{\operatorname{Err}_n^\epsilon(\Gamma)}{n} \leq \epsilon \quad \text{with probability one}.\]

Note: Randomized Confidence Predictors

Randomized confidence predictors depend additionally on elements of an auxiliary probability space. The main advantage of randomization in this context is that there are many randomized confidence predictors that are exactly valid.

Proposition 2.1: An exactly valid confidence predictor is asymptotically exact. A conservatively valid confidence predictor is asymptotically conservative.

This proposition is an immediate consequence of the law of large numbers.

Conformal predictors

In this section we will learn what nonconformity measures are. A nonconformity measure quantifies how strange or unexpected a specific example is, given a bag of examples.

There are many different nonconformity measures, and each one defines a conformal predictor and a smoothed conformal predictor.

Bags

  • A bag of size \(n \in \mathbb{N}\) is a collection of \(n\) elements, some of which may be identical.
  • A bag resembles a set in that the order of its elements is irrelevant, but it differs from a set in that repetition is allowed.
    • To identify a bag, we must say what elements it contains and how many times each of these elements is repeated.
  • \(\langle z_1,\ldots,z_n\rangle\) denotes the bag consisting of elements \(z_1,\ldots,z_n\), some of which may be identical with each other.
    • The empty bag \(\langle \rangle\) contains no elements.
    • We are only interested in finite bags.
  • We write \(\mathscr{Z}^{(n)}\) for the set of all bags of size \(n\) of elements of a measurable space \(\mathscr{Z}\).
  • We write \(\mathscr{Z}^{(*)}\) for the set of all bags of elements of \(\mathscr{Z}\).
  • Operations such as union \((\cup)\), intersection \((\cap)\), and difference \((\backslash)\) are defined for bags in a natural way.
    • For example, if \(A, B\) are two bags, the number of times their union \(A \cup B\) contains any \(z \in \mathscr{Z}\) is the number of times \(A\) contains \(z\) plus the number of times \(B\) contains \(z\).
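
As a concrete aside (not from the book), Python's `collections.Counter` models bags directly: keys are elements, values are multiplicities, and counter arithmetic matches the bag operations just described.

```python
from collections import Counter

A = Counter(["a", "a", "b"])   # the bag <a, a, b>
B = Counter(["a", "c"])        # the bag <a, c>

union = A + B                  # counts add: <a, a, a, b, c>
difference = A - B             # counts subtract (floored at 0): <a, b>
intersection = A & B           # pointwise minimum of counts: <a>

assert union == Counter({"a": 3, "b": 1, "c": 1})
assert difference == Counter({"a": 1, "b": 1})
assert intersection == Counter({"a": 1})
```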

Nonconformity and Conformity

Formally, a nonconformity measure is a measurable mapping \[A : \mathscr{Z}^{(*)} \times \mathscr{Z} \rightarrow \overline{\mathbb{R}}.\] \(A\) assigns a numerical score indicating how different a new example \(z\) is from the old ones \(\langle z_1,\ldots,z_n\rangle\).

  • Given a nonconformity measure \(A\), a sequence \(z_1,\ldots,z_n\) of examples, and an example \(z\), we can score how different \(z\) is from the bag.

  • \(A(\langle z_1,\ldots,z_n\rangle,z)\) is called the nonconformity score for \(z\).

    • We will sometimes refer to \(\langle z_1,\ldots,z_n\rangle\) as the comparison bag in this context.

Note: Conformity measures are interchangeable with nonconformity measures

Given a conformity measure \(B\), we can define a nonconformity measure \(A\) via any strictly decreasing transformation: \(A := -B\), or \(A := 1/B\) when \(B\) is positive. Doing so is consistent with tradition in mathematical statistics, where test statistics are usually defined so as to measure discrepancy rather than agreement.

Definition: p-values from nonconformity scores

Given a nonconformity measure \(A\) and a bag \(\langle z_1,\ldots,z_n\rangle\), we can compute the nonconformity score of every example \(z_i\) in the bag: \[\alpha_i = A(\langle z_1,\ldots,z_n\rangle, z_i).\] The numerical value of \(\alpha_i\) does not, by itself, tell us how unusual \(A\) finds \(z_i\) to be; for that we compare it with the other scores: \[\frac{\lvert\left\{j=1, \ldots, n: \alpha_j \geq \alpha_i\right\}\rvert}{n}.\] This fraction is the p-value for \(z_i\); it always lies in \(\{1/n, 2/n, \ldots, 1\}\).
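
In code the p-value is a one-liner once the scores are in hand; `p_value` is our own hypothetical helper, not from the book.

```python
def p_value(alphas, i):
    """Fraction of nonconformity scores at least as large as alpha_i."""
    return sum(a >= alphas[i] for a in alphas) / len(alphas)


scores = [0.3, 1.7, 0.3, 2.5]       # hypothetical nonconformity scores
assert p_value(scores, 3) == 1 / 4  # the strangest example gets the smallest p-value
assert p_value(scores, 0) == 1.0    # ties count: both 0.3's and everything above
```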

Conformal predictors: formal definition

  • Every nonconformity measure determines a confidence predictor.
  • Given a new object \(x_n\) and a significance level \(\epsilon\), this predictor outputs a prediction set \(\subseteq \mathscr{Y}\) that should contain the new label \(y_n\).
  • We obtain the set by supposing that \(y_n\) will have a value that makes \((x_n, y_n)\) conform with the previous examples.
  • \(\epsilon\) determines the amount of conformity (as measured by the p-value) that we require.

Definition: Conformal predictors

The conformal predictor determined by a nonconformity measure \(A\) is the confidence predictor \(\Gamma\) obtained by setting \(\Gamma_n^{\epsilon}(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n)\) equal to the set of all labels \(y \in \mathscr{Y}\) such that \[\frac{\lvert\left\{i=1, \ldots, n: \alpha_i \geq \alpha_n(y)\right\}\rvert}{n} > \epsilon,\] where \[\begin{aligned} \alpha_i & :=A\left(\sigma,\left(x_i, y_i\right)\right), \quad i=1, \ldots, n-1 \\ \alpha_n(y) & :=A\left(\sigma,\left(x_n, y\right)\right) \\ \sigma & :=\langle\left(x_1, y_1\right), \ldots,\left(x_{n-1}, y_{n-1}\right),\left(x_n, y\right)\rangle. \end{aligned}\]
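
For a finite label space this definition can be implemented directly: for each candidate label \(y\), form the bag \(\sigma\), recompute all scores, and keep \(y\) if its p-value exceeds \(\epsilon\). A minimal sketch, with `nonconformity(bag, z)` standing in for an arbitrary measure \(A\):

```python
def conformal_prediction_set(nonconformity, history, x_new, labels, epsilon):
    """Prediction set of the conformal predictor determined by `nonconformity`.

    history: list of (x, y) pairs (the first n-1 examples);
    labels:  the finite label space to scan over.
    """
    prediction_set = set()
    for y in labels:
        bag = history + [(x_new, y)]                   # the bag sigma
        alphas = [nonconformity(bag, z) for z in bag]  # alpha_1, ..., alpha_n(y)
        p = sum(a >= alphas[-1] for a in alphas) / len(bag)
        if p > epsilon:                                # keep y if its p-value > epsilon
            prediction_set.add(y)
    return prediction_set
```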

Proposition 2.3: All conformal predictors are conservatively valid

  • Proposition 2.1 then implies that every conformal predictor is asymptotically conservative.
  • To say something more concrete, we need the law of the iterated logarithm (Serfling 2001).

To move from conservative validity to exact validity, we need to define smoothed conformal predictors.

Smoothed conformal predictors

The smoothed conformal predictor determined by the nonconformity measure \(A\) is the randomized confidence predictor \(\Gamma\) for which the set \(\Gamma^{\epsilon}(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x_n, \tau_n)\) consists of the labels \(y \in \mathscr{Y}\) satisfying

\[\frac{\lvert\left\{i=1, \ldots, n: \alpha_i > \alpha_n(y)\right\}\rvert + \tau_n \lvert \left\{i=1, \ldots, n: \alpha_i = \alpha_n(y)\right\}\rvert}{n} > \epsilon,\]

where the scores \(\alpha_i\) and \(\alpha_n(y)\) are defined as before and \(\tau_1, \tau_2, \ldots\) are random numbers distributed uniformly on \([0, 1]\).

Note: Difference between smoothed and non-smoothed conformal predictors.

The main difference between smoothed conformal predictors and non-smoothed conformal predictors is that the former treat the borderline cases \(\alpha_i = \alpha_n(y)\) more carefully. Instead of increasing the p-value by \(1/n\) for each \(\alpha_i = \alpha_n(y)\), we increase it by a random amount \(\tau_n/n\) between \(0\) and \(1/n\), yielding a smoothed p-value.
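
The smoothed p-value differs only in how ties with \(\alpha_n(y)\) are counted. A sketch with our own helper name, where `alphas` contains all \(n\) scores including \(\alpha_n(y)\) itself:

```python
import random


def smoothed_p_value(alphas, alpha_new, tau=None):
    """Smoothed p-value: ties with alpha_new contribute a random fraction tau."""
    if tau is None:
        tau = random.random()                   # tau_n drawn uniformly from [0, 1)
    greater = sum(a > alpha_new for a in alphas)
    ties = sum(a == alpha_new for a in alphas)  # at least 1: alpha_new is in alphas
    return (greater + tau * ties) / len(alphas)
```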

Proposition 2.4 Any smoothed conformal predictor is exactly valid.

It immediately implies Proposition 2.3: if a smoothed conformal predictor and a (non-smoothed) conformal predictor are constructed from the same nonconformity measure, the latter’s errors never exceed the former’s.

Proof in chapter 11 (all the best to whoever presents 🙃)

Corollary 2.5 Every smoothed conformal predictor is asymptotically exact.

Implied by Proposition 2.4 together with Proposition 2.1.

Finite-horizon and one-off conformal predictors

Definition: Finite-horizon conformal predictors

The definition of conformal predictors and smoothed conformal predictors also works for the online protocol with horizon \(N\). When talking about conservative or exact validity in the online protocol with horizon \(N\), we will mean the definitions given before with \(n\) ranging over \(1,\ldots,N\) and with \(z_1,\ldots,z_N\) assumed exchangeable.

Proposition 2.6 In the online protocol with a finite horizon, any conformal predictor is conservatively valid, and any smoothed conformal predictor is exactly valid.

Proof in chapter 11 (🙃)

Definition: One-off conformal predictors

A one-off conformal predictor makes a single prediction: given a training sequence \(z_1,\ldots,z_{n-1}\) and a test object \(x_n\), it outputs the prediction set \(\Gamma^{\epsilon}(x_1,y_1,\ldots,x_{n-1},y_{n-1},x_n)\) once, rather than repeatedly in the online protocol. If \(z_1,\ldots,z_n\) are exchangeable, the probability of an error, \(y_n \notin \Gamma^{\epsilon}\), is at most \(\epsilon\) (and exactly \(\epsilon\) for smoothed one-off conformal predictors).

General Schemes for Defining Nonconformity

There are many different ways of defining nonconformity measures, but here we will only explain the most basic approach in the case of regression, \(\mathscr{Y} = \mathbb{R}\).

Conformity to a bag

  • Suppose we are given a bag \(\langle z_1,\ldots,z_n\rangle\) and we want to estimate the nonconformity of each example \(z_i\) as an element of the bag.

  • There is a natural solution if we are given a simple predictor \(D\) whose output does not depend on the order in which the old examples are presented.

  • The simple predictor \(D\) then defines a prediction rule \(D_{\langle z_1,\ldots, z_n \rangle}: \mathscr{X} \rightarrow \mathscr{Y}\) by \(D_{\langle z_1,\ldots, z_n \rangle}(x) := D(z_1,\ldots,z_n,x)\).

  • A natural measure of nonconformity of \(z_i\) is the deviation of the predicted label \(\hat{y}_i = D_{\langle z_1,\ldots, z_n \rangle}(x_i)\) from the true label \(y_i\).

Ways of measuring deviation

  • \(\alpha_i = \lvert D_{\langle z_1,\ldots, z_n \rangle}(x_i) - y_i \rvert\), using the full bag, so \(z_i\) itself helps to form the prediction \(\hat{y}_i\).

  • \(\alpha_i = \lvert D_{\langle z_1,\ldots, z_{i-1}, z_{i+1}, \ldots, z_n \rangle}(x_i) - y_i \rvert\), the deleted variant, in which \(z_i\) is removed from the bag before its label is predicted (see the sketch below).
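
A sketch of the deleted variant, with a k-nearest-neighbours rule standing in for the simple predictor \(D\). The helper names are ours, and for simplicity the objects are assumed to be real numbers:

```python
def knn_predict(examples, x, k=3):
    """Simple predictor D: average label of the k nearest objects (order-invariant)."""
    nearest = sorted(examples, key=lambda z: abs(z[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def deleted_nonconformity(bag, i, k=3):
    """alpha_i = |D_{bag without z_i}(x_i) - y_i|: deviation from a leave-one-out prediction."""
    x_i, y_i = bag[i]
    rest = bag[:i] + bag[i + 1:]                # the bag with z_i removed
    return abs(knn_predict(rest, x_i, k) - y_i)
```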


Conformity to a property

  • So far, \(A(\sigma, z)\) has measured how well \(z\) conforms to the bag \(\sigma\).
  • Another important case is where \(A(\sigma, z)\) measures how well \(z\) conforms to a property that elements of \(\sigma\) may satisfy to different degrees.

How well does \(z_i\) conform to the property of having a large label?

\[\alpha_i := y_i - \hat{y}_i.\]

How well does \(z_i\) conform to the property of having a small label?

\[\alpha_i := \hat{y}_i - y_i.\]

Up next: conformalized ridge regression and k-NN regression

  • Ridge regression and nearest neighbours regression are two of the most standard regression algorithms.

  • In the case of regression there is an obvious difficulty in implementing the idea of conformal prediction: it appears that to form a prediction set we need to examine each potential label \(y \in \mathbb{R}\), of which there are uncountably many.

  • In some important cases, including ridge regression and nearest neighbours regression, there is a feasible way to compute conformal predictors.

Thank you for your time!