Introduction to conformal prediction

Chapter 1 of Vovk, Gammerman, and Shafer (2022)

In this chapter the authors highlight the machine-learning developments that motivate conformal prediction and outline the major ideas of the book.

References

Stone, Charles J. 1977. “Consistent Nonparametric Regression.” The Annals of Statistics 5 (4): 595–620. http://www.jstor.org/stable/2958783.
Vovk, Vladimir, Alexander Gammerman, and Glenn Shafer. 2022. Algorithmic Learning in a Random World. 2nd ed. Cham, Switzerland: Springer International Publishing.

Future of machine learning

  • Developments in ML have been very impactful in a wide variety of fields.
  • Focus of current and future projects is on programs that can learn.
    • Recognizing handwritten digits: it is too ambitious to write, from scratch, programs for tasks that even humans must learn to perform.
  • Adaptability: key difference between a rigid program built for a specific task and an adaptive program that learns from experience.

Learning under randomness

  • Learning happens in a stable environment generating random samples.
  • Stable environment is governed by some fixed probability distribution \(Q\) on a fixed sample space \(Z\).
    • Together, \((Q, Z)\) can describe the environment.
  • \(Z\) can be large and structured in a complex way.
    • USPS handwritten digit data: a 16×16 image with 31 shades of grey, together with the digit the image represents (an integer between 0 and 9).
    • \(31^{16\times 16} \times 10\) (this is approximately \(10^{382}\)) possible examples in the space \(Z\).
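A quick back-of-the-envelope check of this count in Python:

```python
import math

# Rough size of the USPS example space Z: a 16x16 image with 31 grey
# levels per pixel, paired with one of 10 digit labels.
log10_size = 16 * 16 * math.log10(31) + math.log10(10)
print(f"|Z| is about 10^{log10_size:.1f}")  # 10^382.8
```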

Learning under randomness

  • Most examples in book: object and its label.
    • USPS example: grey-scale matrix of image pixels and label.
  • In the problem of recognizing hand-written digits and other typical ML problems, it is the space of objects, the space of possible grey-scale images, that is large.
  • The space of labels is either a small finite set (classification problems) or the set of real numbers (regression problems).
  • Randomness assumption: when we say examples are chosen randomly from \(Q\), we mean that they are independent in the sense of probability theory and all have the same distribution \(Q\): they are independent and identically distributed (IID).

ML problems \(\neq\) learning under randomness

  • In learning with expert advice, for example, randomness is replaced by a game-theoretic setup.
    • Here a typical result is that the learner can predict almost as well as the best strategy in a pool of possible strategies.
  • In reinforcement learning, which is concerned with rational decision-making in a dynamic environment, the standard assumption is Markovian behaviour.
    • Extensions of learning under randomness in Parts III and IV of this book!

Learning under unconstrained randomness

  • We make the randomness assumption without assuming anything more about the environment: we know the space of examples \(Z\), we know that examples are drawn independently from the same distribution, and this is all we know.
  • We know nothing at the outset about the probability distribution \(Q\) from which each example is drawn. In this case, we say we are learning under unconstrained randomness.
  • The strength of modern machine-learning methods often lies in their ability to make hedged predictions under unconstrained randomness in a high-dimensional environment, where examples have a very large (or infinite) number of components.
    • USPS example: treated as a toy problem.

Shortcomings of statistical learning theory

  • Statistical learning theory guarantees that under some conditions on the prediction algorithm, as the training set becomes bigger and bigger these predictions will become more and more accurate with greater and greater probability: they are probably approximately correct.
    • How probably and how approximately?
  • Theoretical bounds on error/probability of error are too loose.
    • USPS example: the theoretical upper bounds on the probability of error are \(\geq 1\), and hence vacuous (see the sketch below).
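To see how such bounds become vacuous, here is one classic VC-type bound (Vapnik's form) evaluated with an illustrative, hypothetical VC dimension; the numbers are ours, not the book's:

```python
import math

# One classic VC-type generalization bound (Vapnik): with probability
# at least 1 - delta, the true error rate is at most the training error
# plus sqrt((d * (ln(2n/d) + 1) + ln(4/delta)) / n), where d is the VC
# dimension of the class of prediction rules.
def vc_bound(n, d, delta=0.05):
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)

# With n = 7291 training examples (the size of the USPS training set)
# and a hypothetical VC dimension d = 5000 -- easily reached by a
# flexible learner on 256-dimensional images -- the bound exceeds 1:
print(vc_bound(n=7291, d=5000))  # about 1.19, i.e. vacuous
```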

Alternative ways of evaluation: hold-out estimates of confidence

  • Hold-out estimates: we split the available examples into two parts, a training set and a hold-out sample, playing the role of a test set.
  • We apply the algorithm to the training set in order to find a prediction rule, and then we apply this prediction rule to the hold-out sample.
  • The observed rate of errors on the hold-out sample tells us how confident we should be in the prediction rule when we apply it to new examples.
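A minimal sketch of a hold-out estimate, using synthetic stand-in data and a nearest-neighbour classifier as the example algorithm; the Hoeffding bound at the end is one standard way to turn the observed rate into a confidence statement:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; in practice X, y would be the available examples.
rng = np.random.default_rng(0)
X, y = rng.random((1000, 4)), rng.integers(0, 2, 1000)

# Split into a training set and a hold-out sample.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

# Find a prediction rule on the training set, then measure its error
# rate on the hold-out sample.
rule = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
err = np.mean(rule.predict(X_hold) != y_hold)

# Hoeffding's inequality turns the observed rate into a confidence
# statement: with probability at least 1 - delta, the true error rate
# is at most err + sqrt(ln(1/delta) / (2m)), m being the hold-out size.
delta, m = 0.05, len(y_hold)
print(err, err + np.sqrt(np.log(1 / delta) / (2 * m)))
```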

Conformal predictions: a different method of evaluation

  • Hold-out estimates are one way of hedging predictions: complementing each prediction with a measure of its reliability.
  • Conformal learning develops a new approach of producing hedged predictions that enjoy advantages over hold-out sample methods.
    • No rigid separation between learning tasks and prediction tasks.
    • The hedged predictions produced by new algorithms are much more confident and accurate.
    • Online learning: the confidence with which the label of a new object is predicted is always tailored not only to the previously seen examples but also to the new object.

Online learning

  • Framework is called online because we assume that the examples are presented one by one. Each time, we observe the object and predict the label.
    • We start by observing the first object \(x_1\) and predicting its label \(y_1\).
    • Then we observe \(y_1\) and the second object \(x_2\), and predict its label \(y_2\). And so on.
    • At the \(n\)th step, we have observed the previous examples \[(x_1, y_1), (x_2, y_2), \ldots, (x_{n-1}, y_{n-1})\] as well as the new object \(x_n\). Our task is to predict \(y_n\).
  • The quality of our predictions should improve as we accumulate more and more old examples. This is the sense in which we are learning.
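A runnable sketch of the online protocol, with a toy environment and a 1-nearest-neighbour predictor standing in for the learning method (both choices are illustrative):

```python
import random

random.seed(0)

def sample_example():
    # Toy stable environment: the object is a number in [-1, 1] and the
    # label is (noisily) its sign.
    x = random.uniform(-1, 1)
    return x, int(x + random.gauss(0, 0.3) > 0)

def predict(history, x_new):
    # Illustrative choice: 1-nearest-neighbour over the examples seen so far.
    if not history:
        return 0
    _, y_nn = min(history, key=lambda xy: abs(xy[0] - x_new))
    return y_nn

history, errors = [], 0
for n in range(1, 1001):
    x_n, y_n = sample_example()
    errors += predict(history, x_n) != y_n   # predict before seeing y_n
    history.append((x_n, y_n))               # then the label is revealed
print(f"error rate after {n} steps: {errors / n:.3f}")
```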

Online/offline compromises

  • Methods presented in this book are most naturally described and are most amenable to mathematical analysis in the online framework.
  • They also extend to relaxations of the online protocol that make it close to the offline setting.
    • If we are concerned with recognizing hand-written zip codes, for example, we cannot always rely on a human grader to tell us the correct interpretation of each zip code; and if such an ideal grader were available, why not use the grader directly for prediction?
    • Relaxation of the online protocol considers ‘slow graders’ and ‘lazy graders’.
  • These relaxations allow us to replace constant supervision with occasional feedback: labels may arrive with a delay or only for some of the examples.

One-off and offline learning

  • A one-off predictor is given a training set \((x_1, y_1), (x_2, y_2), \ldots (x_{n-1}, y_{n-1})\) and a test object \(x_n\); its task is to predict the label \(y_n\) of \(x_n\). It is applied only once.
  • A one-off predictor can be embedded into both online and offline frameworks.
    • In the online framework, new examples arrive sequentially, their labels are predicted given their objects, and they are added to the training set, as described earlier.
    • In the offline framework, we have old examples \((x_1, y_1), (x_2, y_2), \ldots, (x_{l}, y_{l})\), and the one-off predictor is applied to this training set and to each of the new objects \(x_{l+1}, x_{l+2}, \ldots, x_{l+k}\) independently, to predict the labels \(y_{l+1}, y_{l+2}, \ldots, y_{l+k}\) (see the sketch below).
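A minimal sketch of the offline embedding, with stand-in data and a nearest-neighbour classifier as the hypothetical one-off predictor:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Offline embedding of a one-off predictor: a fixed training set of l
# old examples, and k new objects handled independently of one another.
rng = np.random.default_rng(0)
X_old, y_old = rng.random((200, 4)), rng.integers(0, 10, 200)   # stand-ins
X_new = rng.random((5, 4))                                      # x_{l+1}, ..., x_{l+k}

one_off = KNeighborsClassifier(n_neighbors=3).fit(X_old, y_old)
y_pred = one_off.predict(X_new)   # one predicted label per new object
```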

Induction and transduction

  • Inductive prediction:
    • Inductive step: Move from examples in hand to some more or less general rule, which we might call a prediction or decision rule, a model, or a theory.
    • Deductive step: When presented with a new object, or a test set of new objects, we derive a prediction from the general rule.
    • Examples: parameter estimation, followed by model-based prediction.
  • Transductive prediction:
    • Shortcut: moving from the old examples directly to the prediction(s) about the new object(s).
    • Examples: estimation of future observations and k-NN algorithms.

Induction and transduction

  • Mathematical results about induction typically involve two parameters.
    • \(\epsilon\) - the desired accuracy of the prediction rule.
    • \(\delta\) - the probability we are willing to tolerate of failing to achieve the accuracy \(\epsilon\).
  • Mathematical results about transduction typically involve one parameter.
    • \(\epsilon\) - the probability of error we are willing to tolerate.

Inductive and transductive (or eductive) prediction (Vovk, 2022)

Induction and transduction in the online framework

  • Want to use a general rule extracted from \[(x_1,y_1),\ldots,(x_{n-1},y_{n-1})\] only once, to predict \(y_n\) from \(x_n\). Once this is done, repeat to predict \(y_{n+1}\) from \(x_{n+1}\).
  • Since the extracted rule is used only once, induction might seem like the less natural choice; transduction is more natural here.
    • The computational cost of the transductive approach may be higher, however, and it may be sensible to compromise and follow an inductive approach.

Conformal prediction

  • We predict that a new object will have a label that makes it similar to the old examples in some specified way.
  • We use the degree to which the specified type of similarity holds within the old examples to estimate our confidence in the prediction.
  • Conformal predictors are, in other words, “confidence predictors”.
  • Purpose of these slides is to explain informally what a confidence predictor aims to do and what it means for it to be valid and efficient.

Nested prediction sets

Suppose we want to pinpoint a target that lies somewhere within a rectangular field: for each example, we predict the coordinates \(y_n \in [a_1, a_2]\times[b_1, b_2]\) of the target from a set of measurements \(x_n\).

  • Produce a subset \(\Gamma_n \subseteq [a_1, a_2]\times[b_1, b_2]\) that contains the target with a stated probability of error: \(20\%\) for a casual prediction, \(5\%\) for a confident prediction, and \(1\%\) for a highly confident prediction.

An example of a nested family of prediction sets (casual prediction in black, confident prediction in dark grey, and highly confident prediction in light grey). Vovk (2022).
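The mechanics of nesting can be shown with conformal p-values: the prediction set at significance level \(\epsilon\) keeps every candidate whose p-value exceeds \(\epsilon\), so lowering \(\epsilon\) can only enlarge the set. A sketch with made-up p-values and a discrete label space for simplicity:

```python
# Prediction set at significance level eps: every candidate label whose
# conformal p-value exceeds eps.  Shrinking eps (raising the confidence
# level) can only enlarge the set, so the sets are nested.  The p-values
# below are made up for illustration.
p_values = {0: 0.40, 1: 0.12, 2: 0.03, 3: 0.007}   # hypothetical

for eps, name in [(0.20, "casual"), (0.05, "confident"), (0.01, "highly confident")]:
    gamma = sorted(y for y, p in p_values.items() if p > eps)
    print(f"{name} ({1 - eps:.0%} confidence): {gamma}")
```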

Desiderata for confidence predictors

There are three important desiderata for confidence predictors:

  • They should be valid: in the long run, the frequency of errors does not exceed \(\epsilon\) at each chosen confidence level \(1 - \epsilon\).
  • They should be efficient: the prediction sets they output are as small as possible.
  • They should be as conditional as possible: taking full account of how difficult the particular example is to predict.

Validity

Empirical validity for one particular conformal predictor on the USPS dataset.

  • Conservative validity: the probability of making an error when predicting at a confidence level \(1 - \epsilon\) is no greater than \(\epsilon\), and there is little dependence between errors it makes when predicting successive examples.
  • Exact validity: probability of a prediction set at a confidence level \(1 - \epsilon\) being in error is exactly \(\epsilon\), errors are made independently at different trials, and the long-run frequency of errors converges to \(\epsilon\).
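A simulation of what exact validity implies, using the fact that smoothed conformal p-values are uniformly distributed on \([0,1]\) (the simulation itself is illustrative):

```python
import numpy as np

# Under exact validity, an error at confidence level 1 - eps occurs
# exactly when the true label's (uniformly distributed) p-value falls
# at or below eps.  Errors are then independent Bernoulli(eps) trials,
# so their long-run frequency converges to eps.
rng = np.random.default_rng(0)
eps, n = 0.05, 100_000
p_values = rng.uniform(size=n)
errors = p_values <= eps
print(errors.mean())   # close to 0.05
```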

Efficiency

  • Efficiency in regression: In regression problems, the prediction set output by a conformal predictor is often an interval of values. A natural measure of efficiency of such a prediction is the length of the interval, at different confidence levels.
  • Efficiency in classification: a natural measure of efficiency is the size of the prediction set—the number of labels in it, perhaps apart from the true label (which will be in the prediction set with high probability for a valid predictor). Measuring efficiency in the case of classification will be one of the major topics of Chapter 3 of Vovk (2022).

There are many conformal predictors for any particular online prediction problem. Which is most efficient—which produces the smallest prediction sets in practice—will depend on details of the environment that we may not know in advance.

Conditionality with a motivating example

  • There are two categories of objects, “easy” (easy to predict) and “hard” (hard to predict). Both categories are equally likely.
  • There is a prediction method that applies to all objects, hard and easy, and its overall error rate is \(5\%\). All the errors may be made on hard objects, in which case the error rate on hard objects alone is \(10\%\).
  • It is therefore not fair to appeal to the average error rate of \(5\%\) and claim \(95\%\) confidence when predicting the label of a hard object.

Whenever there are features of objects that we know make the prediction easier or harder, we would like to take these features into account—to condition on them. This is done by conformal predictors almost automatically!

Flexibility of Conformal Predictors

  • A conformal predictor can be built on top of almost any machine-learning algorithm.
  • The underlying algorithm may produce hedged predictions, simple predictions, or simple predictions complemented by ad hoc measures of confidence.
  • It is always possible to transform it into a conformal predictor that inherits its predictive performance but is, of course, valid, just like any other conformal predictor.

Chapters 2, 3 and 4 of Vovk (2022) discuss how to build conformal predictors using such methods as nearest neighbours, support vector machines, bootstrap, boosting, neural networks, decision trees, random forests, ridge regression, logistic regression, and any Bayesian algorithm.
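As a concrete illustration, here is a compact full conformal classifier built on top of the classic nearest-neighbour nonconformity score (distance to the nearest example of the same class over distance to the nearest example of another class); the toy data and helper names are ours:

```python
import numpy as np

def nn_nonconformity(X, y):
    """Nonconformity score of each example: distance to the nearest
    example with the same label divided by distance to the nearest
    example with a different label (large = strange)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    return np.array([D[i, y == y[i]].min() / D[i, y != y[i]].min()
                     for i in range(len(y))])

def conformal_set(X_old, y_old, x_new, labels, eps=0.05):
    """Full conformal prediction set: complete the dataset with each
    candidate label in turn, and keep the candidates whose p-value
    (the rank of the new example's nonconformity score) exceeds eps."""
    gamma = []
    for label in labels:
        X = np.vstack([X_old, x_new])
        y = np.append(y_old, label)
        alpha = nn_nonconformity(X, y)
        if np.mean(alpha >= alpha[-1]) > eps:
            gamma.append(label)
    return gamma

# Toy usage: two well-separated classes in the plane.
rng = np.random.default_rng(1)
X_old = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
                   rng.normal([3, 3], 0.5, (20, 2))])
y_old = np.array([0] * 20 + [1] * 20)
print(conformal_set(X_old, y_old, np.array([0.1, -0.2]), labels=[0, 1]))  # [0]
```

For this toy data, completing the dataset with the wrong label makes the new example maximally strange, so its p-value is \(1/41 \approx 0.024 \leq 0.05\) and that label is excluded from the set.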

Probabilistic Prediction Under Unconstrained Randomness

  • Many ways to do classification and regression under unconstrained randomness and for high-dimensional examples.
  • Conformal predictors combine good theoretical properties with high accuracy in practical problems.
  • The situation changes if we move to the harder problem of probabilistic prediction: that of guessing the probability distribution for the new object’s label.

For simplicity, we will assume that the label is binary, 0 or 1. In this case the probabilistic prediction for the label of the new object boils down to one number: the predicted probability that the label is 1.

Universally Consistent Probabilistic Predictor

  • Universally consistent: the difference between the probabilistic prediction and the true conditional probability (given the object) that the label is 1 converges to zero in probability.
  • A nearest neighbours probabilistic predictor is universally consistent (Stone 1977).
  • Stone’s actual result was more general, and it has been further extended in different directions. Chapters 6 and 7 of Vovk (2022) will state similar results in the context of conformal prediction.
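A minimal sketch of such a predictor; note that Stone's theorem requires \(k\) to grow with \(n\) while \(k/n \to 0\), a schedule this fixed-\(k\) version ignores:

```python
import numpy as np

def knn_probability(X_old, y_old, x_new, k):
    """Nearest-neighbours probabilistic prediction for a binary label:
    the fraction of 1s among the k nearest old examples."""
    nearest = np.argsort(np.linalg.norm(X_old - x_new, axis=1))[:k]
    return y_old[nearest].mean()

# Toy usage: the label depends on the first coordinate of the object.
rng = np.random.default_rng(0)
X_old = rng.random((500, 2))
y_old = (X_old[:, 0] > 0.5).astype(int)
print(knn_probability(X_old, y_old, np.array([0.9, 0.5]), k=25))  # near 1
```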

Probabilistic Prediction Using a Finite Dataset

  • The main obstacle to applying Stone's theorem is that the convergence it asserts is not uniform: it gives no guarantee for any dataset of a fixed, finite size. It is well understood that this severely limits the theorem's applicability.
  • A dataset consisting of old examples and one new object is diverse if no object in it is repeated (in particular, the new object is different from all old objects).
  • In Chapter 5 of Vovk (2022), the main result asserts that any nontrivial valid prediction interval for the conditional probability given the new object that the new label is \(1\) is inadmissible if the dataset is diverse and randomness is the only assumption.

The assumption that the dataset is diverse is related to the assumption of a high-dimensional environment. If the objects are, for example, complex images, we will not expect precise repetitions among them.

Venn prediction

  • Chapter 5 of Vovk (2022) shows that it is impossible to estimate the true conditional probabilities under the conditions stated.
  • That chapter notes that it is impossible to find conditional probabilities that are as good (in the sense of the algorithmic theory of randomness) as the true probabilities.
  • If, however, we only want probabilities that are “well calibrated”, a modification of conformal predictors, which we call Venn predictors, achieves this goal in a strong non-asymptotic sense, as shown in Chapter 6 of Vovk (2022). A sketch of the idea follows.
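To show the mechanics only, here is a Venn predictor with the simplest possible taxonomy, which puts all examples into a single category; this trivial choice ignores the objects entirely and is uninformative, but it illustrates the multiprobability output (the book's Venn predictors use taxonomies that group similar examples):

```python
import numpy as np

def trivial_venn(y_old):
    """Venn predictor with the trivial taxonomy (all examples in one
    category): for each tentative label of the new example, output the
    frequency of label 1 in the completed dataset.  The result is a
    pair of probabilities (a multiprobability prediction)."""
    n, k = len(y_old), int(np.sum(y_old))
    p_if_0 = k / (n + 1)        # new label tentatively 0
    p_if_1 = (k + 1) / (n + 1)  # new label tentatively 1
    return p_if_0, p_if_1

print(trivial_venn(np.array([0, 1, 1, 0, 1])))  # (0.5, 0.666...)
```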

Conformal predictive distributions

  • Venn predictors only work in the case of classification.
  • It turns out that in the case of regression, conformal prediction is capable of outputting predictive distribution functions, which we call conformal predictive distributions, as in Chapter 7 of Vovk (2022).

It is interesting that while our approach to probabilistic prediction in the case of classification requires a nontrivial modification of conformal prediction, no such modifications are needed for regression.

Vovk (2022) also considers testing the assumption of randomness and alternatives to this assumption, which are discussed in Chapters 11 and 12.

Thank you for your time!