Chapter 2 of Vovk, Gammerman, and Shafer (2022)
Result: when a conformal predictor is used in the online mode, its output is valid.
Modification: the basic procedure of conformal prediction might look computationally inefficient when the label set is large, but there are ways of making conformal predictors much more efficient.
The conformal predictors we define in this chapter are confidence predictors – they make a range of successively more specific predictions with successively less confidence.
In this section we define precisely what we mean by a confidence predictor and its validity.
\[ (x_1, y_1), (x_2, y_2), \ldots, \qquad(1)\]
where \(x_i\) is the object and \(y_i\) is the label.
Here \(x_i \in \mathscr{X}\), the object space, and \(y_i \in \mathscr{Y}\), the label space. We assume that \(\mathscr{X}\) is nonempty and that \(\mathscr{Y}\) contains at least two distinct elements.
Together, they form the example space: \[\mathscr{Z} = \mathscr{X} \times \mathscr{Y}.\]
The infinite data sequence (Eq. 1) is an element of the space \(\mathscr{Z}^\infty\).
Standard assumption: the examples are generated independently from some distribution \(\mathscr{Q}\) on \(\mathscr{Z}\).
Most of the results presented in Vovk, Gammerman, and Shafer (2022) hold under this randomness assumption.
However, many results also hold under the weaker assumption of exchangeability.
We use the randomness or exchangeability assumption depending on whichever leads to stronger claims.
Loosely speaking, exchangeability implies the distribution of the examples does not depend on their order.
The statement that a distribution \(\mathscr{P}\) on \(\mathscr{Z}^\infty\) is exchangeable means that for every positive integer \(n\), every permutation \(\pi\) of \(\{1,\ldots,n\}\), and every measurable set \(\mathscr{E} \subseteq \mathscr{Z}^n\), \[\begin{aligned} &\mathscr{P}\{(z_1,z_2,\ldots) \in \mathscr{Z}^\infty : (z_1,z_2,\ldots,z_n) \in \mathscr{E}\} \\ &\quad = \mathscr{P}\{(z_1,z_2,\ldots) \in \mathscr{Z}^\infty : (z_{\pi(1)},z_{\pi(2)},\ldots,z_{\pi(n)}) \in \mathscr{E}\}. \end{aligned}\]
Given information on \(n-1\) past examples and the \(n^{\text{th}}\) object \(x_n\), we want to predict the label \(y_n\). To do so we need a simple predictor function \[D: \mathscr{Z}^* \times \mathscr{X} \rightarrow \mathscr{Y}.\]
For any sequence of old examples \((x_1,y_1,\ldots,x_{n-1},y_{n-1}) \in \mathscr{Z}^*\) and any new object \(x_n \in \mathscr{X}\), the simple predictor outputs \(D(x_1,y_1,\ldots,x_{n-1},y_{n-1},x_n) \in \mathscr{Y}\) as its prediction for the new label \(y_n\).
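To make this concrete, here is a minimal Python sketch of a simple predictor; the 1-nearest-neighbour rule on real-valued objects is an illustrative assumption, not a choice prescribed by the book.

```python
from typing import Sequence, Tuple

# A simple predictor D: old examples plus a new object -> a single predicted label.
# 1-nearest-neighbour on real-valued objects is an illustrative choice.
def simple_predictor(old_examples: Sequence[Tuple[float, str]], x_new: float) -> str:
    # Predict the label of the old object closest to x_new.
    nearest_x, nearest_y = min(old_examples, key=lambda z: abs(z[0] - x_new))
    return nearest_y

# Example: predict the label of x = 2.6 from three past examples.
print(simple_predictor([(1.0, "a"), (2.5, "b"), (4.0, "a")], 2.6))  # -> "b"
```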
Conformal predictors require a more complicated notion of prediction: we want to give a range of predictions.
For this, we are interested in an algorithm \(\Gamma\) that, given a significance level \(\epsilon\) and the data \((x_1,y_1,\ldots,x_{n-1},y_{n-1},x_n)\) as inputs, outputs a subset
\[ \Gamma_n^{\epsilon}(x_1,y_1,\ldots,x_{n-1},y_{n-1},x_n) \subseteq \mathscr{Y}. \qquad(2)\]
We require the prediction sets in Eq. 2 to shrink as \(\epsilon\) increases, i.e., for \(\epsilon_1 \geq \epsilon_2\),
\[\Gamma^{\epsilon_1}\left(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n\right) \subseteq \Gamma^{\epsilon_2}\left(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n\right). \qquad(3)\]
Formally, a confidence predictor \(\Gamma\) is a measurable function \[\Gamma: \mathscr{Z}^* \times \mathscr{X} \times (0, 1) \rightarrow 2^\mathscr{Y}\] that satisfies Eq. 3 for all significance levels \(\epsilon_1 \geq \epsilon_2\), all positive integers \(n\), and all data sequences \((x_1,y_1,\ldots,x_{n-1},y_{n-1},x_n)\).
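As a sanity check on the nestedness requirement in Eq. 3, here is a minimal Python sketch in which prediction sets are obtained by thresholding per-label scores (hypothetical numbers, anticipating the p-values defined later); raising \(\epsilon\) can only remove labels.

```python
# Prediction sets obtained by thresholding per-label scores at the significance
# level eps; the scores below are hypothetical, for illustration only.
def prediction_set(scores, eps):
    return {y for y, p in scores.items() if p > eps}

scores = {"a": 0.9, "b": 0.15, "c": 0.05}
# eps1 = 0.2 >= eps2 = 0.1, so the first set must be contained in the second (Eq. 3).
assert prediction_set(scores, 0.2) <= prediction_set(scores, 0.1)
print(prediction_set(scores, 0.1))  # -> {'a', 'b'} (order may vary)
print(prediction_set(scores, 0.2))  # -> {'a'}
```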
We are predicting \[y_n \in \Gamma_n^{\epsilon}(x_1,y_1,\ldots,x_{n-1},y_{n-1},x_n).\]
For each significance level \(\epsilon\), we want to have confidence \(1 - \epsilon\) in our prediction about \(y_n\).
If, no matter what exchangeable probability distribution \(\mathscr{P}\) governs the sequence of examples, the prediction at each significance level \(\epsilon\) is wrong with probability exactly \(\epsilon\), independently at different trials, then we say that the confidence predictor \(\Gamma\) is "exactly valid".
If the error probabilities are only required to be at most \(\epsilon\), then we say that the confidence predictor \(\Gamma\) is "conservatively valid".
As we will see, confidence predictors sometimes have asymptotic versions of these properties even if they are not valid:
Using data sequence \(\omega = (x_1,y_1, x_2, y_2, \ldots)\) and the predictor \(\Gamma\) at significance level \(\epsilon\), we judge whether \(\Gamma\) makes an error at the \(n^{\text{th}}\) stage as follows: \[\begin{aligned} \operatorname{err}_n^\epsilon(\Gamma, \omega) & := \begin{cases}1 \quad \text { if } y_n \notin \Gamma^\epsilon\left(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n\right) \\ 0 \quad \text { otherwise }, \end{cases} \end{aligned} \qquad(4)\] and the number of errors made during the first \(n\) trials is given by \[\begin{aligned} \operatorname{Err}_n^\epsilon(\Gamma, \omega) := \sum_{i=1}^n \operatorname{err}_i^\epsilon(\Gamma, \omega). \end{aligned} \qquad(5)\]
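A minimal Python sketch of Eq. 4 and Eq. 5, where the prediction sets and true labels are supplied directly:

```python
# err (Eq. 4): 1 if the true label is missed by the prediction set, else 0.
def err(prediction_set, y_true):
    return 0 if y_true in prediction_set else 1

# Err (Eq. 5): cumulative number of errors over the first n trials.
def Err(prediction_sets, y_true_labels):
    return sum(err(g, y) for g, y in zip(prediction_sets, y_true_labels))

# Example: three trials; only the second prediction set misses its label.
sets = [{"a", "b"}, {"a"}, {"b"}]
labels = ["a", "b", "b"]
print([err(g, y) for g, y in zip(sets, labels)])  # [0, 1, 0]
print(Err(sets, labels))                          # 1
```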
If \(\omega\) is drawn from an exchangeable distribution \(\mathscr{P}\), the number \(\operatorname{err}_n^\epsilon(\Gamma, \omega)\) is the realized value of a random variable, which we designate \(\operatorname{err}_n^\epsilon(\Gamma)\).
\(\Gamma\) is exactly valid if, for each significance level \(\epsilon\), \(\operatorname{err}_1^\epsilon(\Gamma), \operatorname{err}_2^\epsilon(\Gamma), \ldots\) is a sequence of independent Bernoulli random variables with parameter \(\epsilon\).
\(\Gamma\) is asymptotically exact if, for each significance level \(\epsilon\), \[\lim_{n \rightarrow \infty} \frac{\operatorname{Err}_n^\epsilon(\Gamma)}{n} = \epsilon \quad \text{with probability } 1.\]
\(\Gamma\) is conservatively valid if, for each significance level \(\epsilon\), the sequence \(\operatorname{err}_1^\epsilon(\Gamma), \operatorname{err}_2^\epsilon(\Gamma), \ldots\) is dominated, in the sense of stochastic dominance, by some sequence of independent Bernoulli random variables with parameter \(\epsilon\).
\(\Gamma\) is asymptotically conservative if, for each significance level \(\epsilon\), \[\limsup_{n \rightarrow \infty} \frac{\operatorname{Err}_n^\epsilon(\Gamma)}{n} \leq \epsilon \quad \text{with probability } 1.\]
Note: Randomized Confidence Predictors
Randomized confidence predictors depend additionally on elements of an auxiliary probability space. The main advantage of randomization in this context is that there are many randomized confidence predictors that are exactly valid.
Proposition 2.1: An exactly valid confidence predictor is asymptotically exact. A conservatively valid confidence predictor is asymptotically conservative.
This proposition is an immediate consequence of the law of large numbers.
In this section we will learn what nonconformity measures are. A nonconformity measure quantifies how strange or unexpected a specific example is, given a bag of examples.
There are many different nonconformity measures, and each one defines a conformal predictor and a smoothed conformal predictor.
Formally, a nonconformity measure is a measurable mapping \[A : \mathscr{Z}^{(*)} \times \mathscr{Z} \rightarrow \overline{\mathbb{R}},\] where \(\mathscr{Z}^{(*)}\) is the set of all bags of elements of \(\mathscr{Z}\) and \(\overline{\mathbb{R}}\) is the extended real line. \(A\) assigns a numerical score indicating how different the new example \(z\) is from the old ones in the bag \(\langle z_1,\ldots,z_n\rangle\).
Given a nonconformity measure \(A\), a bag \(\langle z_1,\ldots,z_n\rangle\) of examples, and an example \(z\), we can score how different \(z\) is from the bag.
\(A(\langle z_1,\ldots,z_n\rangle,z)\) is called the nonconformity score for \(z\).
Note: Conformity measures are interchangeable with nonconformity measures
Given a conformity measure \(B\), we can define a nonconformity measure \(A\) via any strictly decreasing transformation, e.g. \(A := -B\) or, when \(B\) is positive, \(A := 1/B\). Doing so is consistent with tradition in mathematical statistics, where test statistics are usually defined so as to measure discrepancy rather than agreement.
Definition: p-values from a nonconformity measure
Given a nonconformity measure \(A\) and a bag \(\langle z_1,\ldots,z_n\rangle\), we can compute the nonconformity score of every example \(z_i\) in the bag: \(\alpha_i = A(\langle z_1,\ldots,z_n\rangle, z_i)\). The numerical value of \(\alpha_i\) does not, by itself, tell us how unusual \(A\) finds \(z_i\) to be, so we compare it with the other scores: \[\frac{\lvert\left\{j=1, \ldots, n: \alpha_j \geq \alpha_i\right\}\rvert}{n}.\] This fraction is the p-value for \(z_i\).
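A minimal Python sketch of this p-value computation, given the nonconformity scores of the bag's examples:

```python
# p-value of the i-th example: the fraction of bag members whose nonconformity
# score is at least as large as alpha_i.
def p_value(alphas, i):
    return sum(1 for a in alphas if a >= alphas[i]) / len(alphas)

# Example: the largest score gets the smallest p-value, 1/4.
alphas = [0.1, 0.3, 0.2, 0.9]
print([p_value(alphas, i) for i in range(len(alphas))])  # [1.0, 0.5, 0.75, 0.25]
```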
Definition: Conformal predictors
The conformal predictor determined by a nonconformity measure \(A\) is the confidence predictor \(\Gamma\) obtained by setting \(\Gamma_n^{\epsilon}(x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n)\) equal to the set of all labels \(y \in \mathscr{Y}\) such that \[\frac{\lvert\left\{i=1, \ldots, n: \alpha_i \geq \alpha_n(y)\right\}\rvert}{n} > \epsilon,\] where \[\begin{aligned} \alpha_i & :=A\left(\sigma,\left(x_i, y_i\right)\right), \quad i=1, \ldots, n-1 \\ \alpha_n(y) & :=A\left(\sigma,\left(x_n, y\right)\right) \\ \sigma & :=\langle\left(x_1, y_1\right), \ldots,\left(x_{n-1}, y_{n-1}\right),\left(x_n, y\right)\rangle. \end{aligned}\]
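A minimal Python sketch of this definition for a finite label set. The nonconformity measure used here (distance to the nearest other example in the bag with the same label) is an illustrative assumption, not the book's prescribed choice.

```python
# Illustrative nonconformity measure: distance from an example's object to the
# nearest other object in the bag carrying the same label (infinite if none).
def nonconformity(bag, example):
    x, y = example
    others = [abs(x - xi) for xi, yi in bag if yi == y and (xi, yi) != example]
    return min(others) if others else float("inf")

# Conformal predictor: keep every label y whose p-value exceeds eps.
def conformal_set(old_examples, x_new, label_set, eps):
    prediction = set()
    for y in label_set:
        bag = old_examples + [(x_new, y)]                 # the bag sigma
        alphas = [nonconformity(bag, z) for z in bag]     # alpha_1, ..., alpha_n
        alpha_n = alphas[-1]                              # score of (x_n, y)
        p = sum(1 for a in alphas if a >= alpha_n) / len(bag)
        if p > eps:
            prediction.add(y)
    return prediction

old = [(1.0, "a"), (1.2, "a"), (3.0, "b"), (3.3, "b")]
print(conformal_set(old, 1.1, ["a", "b"], eps=0.2))       # -> {'a'}
```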
Proposition 2.3: All conformal predictors are conservatively valid
To move from conservative validity to exact validity, we need to define smoothed conformal predictors.
The smoothed conformal predictor determined by the nonconformity measure \(A\) is the randomized confidence predictor \(\Gamma\) for which the set \(\Gamma^{\epsilon}(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x_n, \tau_n)\) consists of the labels \(y \in \mathscr{Y}\) satisfying
\[\frac{\lvert\left\{i=1, \ldots, n: \alpha_i > \alpha_n(y)\right\}\rvert + \tau_n \lvert \left\{i=1, \ldots, n: \alpha_i = \alpha_n(y)\right\}\rvert}{n} > \epsilon,\] where the \(\alpha_i\) and \(\alpha_n(y)\) are defined as before and the random numbers \(\tau_1, \tau_2, \ldots\) are independent and uniformly distributed on \([0,1]\).
Note: Difference between smoothed and non-smoothed conformal predictors.
The main difference between smoothed conformal predictors and non-smoothed conformal predictors is that the former treat the borderline cases \(\alpha_i = \alpha_n(y)\) more carefully. Instead of increasing the p-value by \(1/n\) for each \(\alpha_i = \alpha_n(y)\), we increase it by the random amount \(\tau_n/n\), with \(\tau_n\) uniform on \([0,1]\), yielding a smoothed p-value.
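A minimal Python sketch of the smoothed p-value, with \(\tau_n\) drawn uniformly from \([0,1]\):

```python
import random

# Smoothed p-value: strict exceedances count fully, ties with alpha_n count with
# the random weight tau_n in [0, 1]; alphas should include alpha_n itself.
def smoothed_p_value(alphas, alpha_n, tau_n):
    greater = sum(1 for a in alphas if a > alpha_n)
    equal = sum(1 for a in alphas if a == alpha_n)
    return (greater + tau_n * equal) / len(alphas)

# Example: one strict exceedance and two ties among four scores.
alphas = [0.1, 0.4, 0.4, 0.7]
tau_n = random.random()
print(smoothed_p_value(alphas, 0.4, tau_n))  # a value between 0.25 and 0.75
```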
Proposition 2.4 Any smoothed conformal predictor is exactly valid.
It immediately implies Proposition 2.3: if a smoothed conformal predictor and a conformal predictor are constructed from the same nonconformity measure, the smoothed p-values never exceed the ordinary p-values, so the conformal predictor's prediction sets contain the smoothed predictor's sets and its errors never exceed the smoothed predictor's errors.
Proof in chapter 11 (all the best to whoever presents 🙃)
Corollary 2.5 Every smoothed conformal predictor is asymptotically exact.
Follows from Proposition 2.4 together with Proposition 2.1.
Definition: Finite-horizon conformal predictors
The definitions of conformal predictors and smoothed conformal predictors also work for the online protocol with horizon \(N\). When talking about conservative or exact validity in this protocol, we mean the definitions given before, with \(n\) ranging over \(1,\ldots,N\) and with \(z_1,\ldots,z_N\) assumed exchangeable.
Proposition 2.6 In the online protocol with a finite horizon, any conformal predictor is conservatively valid, and any smoothed conformal predictor is exactly valid.
Proof in chapter 11 (🙃)
Definition: One-off conformal predictors
The definitions also apply in the one-off setting, where a prediction set is output just once: given training examples \(z_1,\ldots,z_{n-1}\) and a single test object \(x_n\), the predictor outputs \(\Gamma^{\epsilon}(z_1,\ldots,z_{n-1},x_n)\). Assuming \(z_1,\ldots,z_n\) are exchangeable, the probability of an error is at most \(\epsilon\) for a conformal predictor and exactly \(\epsilon\) for a smoothed conformal predictor.
There are many different ways of defining nonconformity measures, but here we will only explain the most basic approach in the case of regression, \(\mathscr{Y} = \mathbb{R}\).
Suppose we are given a bag \(\langle z_1,\ldots,z_n\rangle\) and we want to estimate the nonconformity of each example \(z_i\) as an element of the bag.
There is a natural solution if we are given a simple predictor \(D\) whose output does not depend on the order in which the old examples are presented.
The simple predictor \(D\) then defines a prediction rule \(D_{\langle z_1,\ldots, z_n \rangle}: \mathscr{X} \rightarrow \mathscr{Y}\) by the formula \(D_{\langle z_1,\ldots, z_n \rangle}(x) := D(z_1,\ldots,z_n,x)\).
A natural measure of nonconformity of \(z_i\) is the deviation of the predicted label \(\hat{y}_i = D_{\langle z_1,\ldots, z_n \rangle}(x_i)\) from the true label \(y_i\); a code sketch follows the list below.
Ways of measuring deviation
\(\alpha_i = \lvert D_{\langle z_1,\ldots, z_n \rangle}(x_i) - y_i \rvert\).
\(\alpha_i = \lvert D_{\langle z_1,\ldots, z_{i-1}, z_{i+1}, \ldots, z_n \rangle}(x_i) - y_i \rvert\) (the deleted version, which excludes \(z_i\) from the bag used to form the prediction rule).
To measure the extent to which the label of \(z_i\) is larger than predicted:
\(\alpha_i := y_i − \hat{y}_i\)
To measure the extent to which the label of \(z_i\) is smaller than predicted:
\(\alpha_i := \hat{y}_i - y_i\)
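A minimal Python sketch of the absolute-residual scores \(\alpha_i = \lvert \hat{y}_i - y_i \rvert\), using an ordinary least-squares fit as the simple predictor \(D\) (an illustrative assumption):

```python
import numpy as np

# Fit D once on the whole bag, then score each example by |predicted - true label|.
def residual_scores(x, y):
    X = np.column_stack([np.ones(len(x)), x])        # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares fit of D
    y_hat = X @ beta                                 # predicted labels
    return np.abs(y_hat - y)                         # nonconformity scores alpha_i

# Example: the third example deviates most from the fitted line.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 4.0, 3.9])
print(residual_scores(x, y))
```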
Ridge regression and nearest neighbours regression are two of the most standard regression algorithms.
In the case of regression there is an obvious difficulty in implementing the idea of conformal prediction: it appears that to form a prediction set we need to examine each potential label \(y \in \mathbb{R}\), of which there are uncountably many.
In some important cases, including ridge regression and nearest neighbours regression, there is a feasible way to compute conformal predictors.
Thank you for your time!