Introduction to conformal prediction

Chapter 1 of Vovk, Gammerman, and Shafer (2022)

In this chapter the authors highlight the machine-learning developments that motivate conformal prediction and outline the major ideas of the book.

References

Stone, Charles J. 1977. “Consistent Nonparametric Regression.” The Annals of Statistics 5 (4): 595–620. http://www.jstor.org/stable/2958783.
Vovk, Vladimir, Alexander Gammerman, and Glenn Shafer. 2022. Algorithmic Learning in a Random World. 2nd ed. Cham, Switzerland: Springer International Publishing.

Future of machine learning

  • Developments in ML have been very impactful in a wide variety of fields.
  • Focus of current and future projects is on programs that can learn.
    • Recognizing handwritten digits: it is too ambitious to write, from scratch, programs for tasks that even humans must learn to perform.
  • Adaptability: key difference between a rigid program built for a specific task and an adaptive program that learns from experience.

Learning under randomness

  • Learning happens in a stable environment generating random samples.
  • Stable environment is governed by some fixed probability distribution \(Q\) on a fixed sample space \(Z\).
    • Together, \((Q, Z)\) can describe the environment.
  • \(Z\) can be large and structured in a complex way.
    • USPS handwritten digit data: a 16×16 image with 31 shades of grey, together with the digit the image represents (an integer between 0 and 9).
    • \(31^{16\times 16} \times 10\) (this is approximately \(10^{382}\)) possible examples in the space \(Z\).
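A quick back-of-the-envelope check of this count in Python:

```python
import math

# Rough size of the USPS example space Z: a 16x16 image with 31 grey
# levels per pixel, paired with one of 10 digit labels.
log10_size = 16 * 16 * math.log10(31) + math.log10(10)
print(f"|Z| is about 10^{log10_size:.1f}")  # 10^382.8
```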

Learning under randomness

  • Most examples in book: object and its label.
    • USPS example: grey-scale matrix of image pixels and label.
  • In the problem of recognizing hand-written digits and other typical ML problems, it is the space of objects, the space of possible grey-scale images, that is large.
  • The space of labels is either a small finite set (classification problems) or the set of real numbers (regression problems).
  • Randomness assumption: when we say examples are chosen randomly from \(Q\), we mean that they are independent in the sense of probability theory and all have the same distribution \(Q\): they are independent and identically distributed (IID).

ML problems \(\neq\) learning under randomness

  • In learning with expert advice, for example, randomness is replaced by a game-theoretic setup.
    • Here a typical result is that the learner can predict almost as well as the best strategy in a pool of possible strategies.
  • In reinforcement learning, which is concerned with rational decision-making in a dynamic environment, the standard assumption is Markovian behaviour.
    • Extensions of learning under randomness in Parts III and IV of this book!

Learning under unconstrained randomness

  • We make the randomness assumption without assuming anything more about the environment: we know the space of examples \(Z\), we know that examples are drawn independently from the same distribution, and this is all we know.
  • We know nothing at the outset about the probability distribution \(Q\) from which each example is drawn. In this case, we say we are learning under unconstrained randomness.
  • The strength of modern machine-learning methods often lies in their ability to make hedged predictions under unconstrained randomness in a high-dimensional environment, where examples have a very large (or infinite) number of components.
    • USPS example: treated as a toy problem.

Shortcomings of statistical learning theory

  • Statistical learning theory guarantees that under some conditions on the prediction algorithm, as the training set becomes bigger and bigger these predictions will become more and more accurate with greater and greater probability: they are probably approximately correct.
    • How probably and how approximately?
  • Theoretical bounds on error/probability of error are too loose.
    • USPS example: the theoretical upper bounds on the probability of error are \(\geq 1\), and hence vacuous (see the sketch below).
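To see how such bounds become vacuous, here is one classic VC-type bound (Vapnik's form) evaluated with an illustrative, hypothetical VC dimension; the numbers are ours, not the book's:

```python
import math

# One classic VC-type generalization bound (Vapnik): with probability
# at least 1 - delta, the true error rate is at most the training error
# plus sqrt((d * (ln(2n/d) + 1) + ln(4/delta)) / n), where d is the VC
# dimension of the class of prediction rules.
def vc_bound(n, d, delta=0.05):
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)

# With n = 7291 training examples (the size of the USPS training set)
# and a hypothetical VC dimension d = 5000 -- easily reached by a
# flexible learner on 256-dimensional images -- the bound exceeds 1:
print(vc_bound(n=7291, d=5000))  # about 1.19, i.e. vacuous
```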

Alternative ways of evaluation: hold-out estimates of confidence

  • Hold-out estimates: we split the available examples into two parts, a training set and a hold-out sample, playing the role of a test set.
  • We apply the algorithm to the training set in order to find a prediction rule, and then we apply this prediction rule to the hold-out sample.
  • The observed rate of errors on the hold-out sample tells us how confident we should be in the prediction rule when we apply it to new examples.
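A minimal sketch of a hold-out estimate, using synthetic stand-in data and a nearest-neighbour classifier as the example algorithm; the Hoeffding bound at the end is one standard way to turn the observed rate into a confidence statement:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; in practice X, y would be the available examples.
rng = np.random.default_rng(0)
X, y = rng.random((1000, 4)), rng.integers(0, 2, 1000)

# Split into a training set and a hold-out sample.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

# Find a prediction rule on the training set, then measure its error
# rate on the hold-out sample.
rule = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
err = np.mean(rule.predict(X_hold) != y_hold)

# Hoeffding's inequality turns the observed rate into a confidence
# statement: with probability at least 1 - delta, the true error rate
# is at most err + sqrt(ln(1/delta) / (2m)), m being the hold-out size.
delta, m = 0.05, len(y_hold)
print(err, err + np.sqrt(np.log(1 / delta) / (2 * m)))
```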

Conformal predictions: a different method of evaluation

  • Hold-out estimates are one way of hedging predictions: complementing each prediction with a measure of its reliability.
  • Conformal learning develops a new approach of producing hedged predictions that enjoy advantages over hold-out sample methods.
    • No rigid separation between learning tasks and prediction tasks.
    • The hedged predictions produced by new algorithms are much more confident and accurate.
    • Online learning: the confidence with which the label of a new object is predicted is always tailored not only to the previously seen examples but also to the new object.

Online learning

  • Framework is called online because we assume that the examples are presented one by one. Each time, we observe the object and predict the label.
    • We start by observing the first object \(x_1\) and predicting its label \(y_1\).
    • Then we observe \(y_1\) and the second object \(x_2\), and predict its label \(y_2\). And so on.
    • At the \(n\)th step, we have observed the previous examples \[(x_1, y_1), (x_2, y_2), \ldots, (x_{n-1}, y_{n-1})\] as well as the new object \(x_n\). Our task is to predict \(y_n\).
  • The quality of our predictions should improve as we accumulate more and more old examples. This is the sense in which we are learning.
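A runnable sketch of the online protocol, with a toy environment and a 1-nearest-neighbour predictor standing in for the learning method (both choices are illustrative):

```python
import random

random.seed(0)

def sample_example():
    # Toy stable environment: the object is a number in [-1, 1] and the
    # label is (noisily) its sign.
    x = random.uniform(-1, 1)
    return x, int(x + random.gauss(0, 0.3) > 0)

def predict(history, x_new):
    # Illustrative choice: 1-nearest-neighbour over the examples seen so far.
    if not history:
        return 0
    _, y_nn = min(history, key=lambda xy: abs(xy[0] - x_new))
    return y_nn

history, errors = [], 0
for n in range(1, 1001):
    x_n, y_n = sample_example()
    errors += predict(history, x_n) != y_n   # predict before seeing y_n
    history.append((x_n, y_n))               # then the label is revealed
print(f"error rate after {n} steps: {errors / n:.3f}")
```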

Online/offline compromises

  • Methods presented in this book are most naturally described and are most amenable to mathematical analysis in the online framework.
  • They also extend to relaxations of the online protocol that make it close to the offline setting.
    • If we are concerned with recognizing hand-written zip codes, for example, we cannot always rely on a human grader to tell us the correct interpretation of each zip code; and if such an ideal grader were available, why not use the grader directly for prediction?
    • Relaxation of the online protocol considers ‘slow graders’ and ‘lazy graders’.
  • These relaxations allow us to replace constant supervision with occasional feedback: labels may arrive with a delay or only for some of the examples.

One-off and offline learning

  • A one-off predictor is given a training set \((x_1, y_1), (x_2, y_2), \ldots (x_{n-1}, y_{n-1})\) and a test object \(x_n\); its task is to predict the label \(y_n\) of \(x_n\). It is applied only once.
  • A one-off predictor can be embedded into both online and offline frameworks.
    • In the online framework, new examples arrive sequentially, their labels are predicted given their objects, and they are added to the training set, as described earlier.
    • In the offline framework, we have old examples \((x_1, y_1), (x_2, y_2), \ldots, (x_{l}, y_{l})\), and the one-off predictor is applied to this training set and to each of the new objects \(x_{l+1}, x_{l+2}, \ldots, x_{l+k}\) independently, to predict the labels \(y_{l+1}, y_{l+2}, \ldots, y_{l+k}\) (see the sketch below).
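A minimal sketch of the offline embedding, with stand-in data and a nearest-neighbour classifier as the hypothetical one-off predictor:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Offline embedding of a one-off predictor: a fixed training set of l
# old examples, and k new objects handled independently of one another.
rng = np.random.default_rng(0)
X_old, y_old = rng.random((200, 4)), rng.integers(0, 10, 200)   # stand-ins
X_new = rng.random((5, 4))                                      # x_{l+1}, ..., x_{l+k}

one_off = KNeighborsClassifier(n_neighbors=3).fit(X_old, y_old)
y_pred = one_off.predict(X_new)   # one predicted label per new object
```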

Induction and transduction

  • Inductive prediction:
    • Inductive step: Move from examples in hand to some more or less general rule, which we might call a prediction or decision rule, a model, or a theory.
    • Deductive step: When presented with a new object, or a test set of new objects, we derive a prediction from the general rule.
    • Examples: parameter estimation, followed by model-based prediction.
  • Transductive prediction:
    • Shortcut: moving from the old examples directly to the prediction(s) about the new object(s).
    • Examples: estimation of future observations and k-NN algorithms.

Induction and transduction

  • Mathematical results about induction typically involve two parameters.
    • \(\epsilon\) - the desired accuracy of the prediction rule.
    • \(\delta\) - the probability we are willing to tolerate of failing to achieve the accuracy \(\epsilon\).
  • Mathematical results about transduction typically involve one parameter.
    • \(\epsilon\) - the probability of error we are willing to tolerate.

Inductive and transductive (or eductive) prediction (Vovk, 2022)

Induction and transduction in the online framework

  • Want to use a general rule extracted from \[(x_1,y_1),\ldots,(x_{n-1},y_{n-1})\] only once, to predict \(y_n\) from \(x_n\). Once this is done, repeat to predict \(y_{n+1}\) from \(x_{n+1}\).
  • Since the extracted rule is used only once, induction might seem like the less natural choice; transduction is more natural here.
    • The computational cost of the transductive approach may be higher, however, and it may be sensible to compromise and follow an inductive approach.

Conformal prediction

  • We predict that a new object will have a label that makes it similar to the old examples in some specified way.
  • We use the degree to which the specified type of similarity holds within the old examples to estimate our confidence in the prediction.
  • Conformal predictors are, in other words, “confidence predictors”.
  • Purpose of these slides is to explain informally what a confidence predictor aims to do and what it means for it to be valid and efficient.

Nested prediction sets

Suppose we want to pinpoint a target that lies somewhere within a rectangular field: for each example, we predict the coordinates \(y_n \in [a_1, a_2]\times[b_1, b_2]\) of the target from a set of measurements \(x_n\).

  • Produce a subset \(\Gamma_n \subseteq [a_1, a_2]\times[b_1, b_2]\) that contains the target with a stated probability of error: \(20\%\) for a casual prediction, \(5\%\) for a confident prediction, and \(1\%\) for a highly confident prediction.

An example of a nested family of prediction sets (casual prediction in black, confident prediction in dark grey, and highly confident prediction in light grey). Vovk (2022).
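The mechanics of nesting can be shown with conformal p-values: the prediction set at significance level \(\epsilon\) keeps every candidate whose p-value exceeds \(\epsilon\), so lowering \(\epsilon\) can only enlarge the set. A sketch with made-up p-values and a discrete label space for simplicity:

```python
# Prediction set at significance level eps: every candidate label whose
# conformal p-value exceeds eps.  Shrinking eps (raising the confidence
# level) can only enlarge the set, so the sets are nested.  The p-values
# below are made up for illustration.
p_values = {0: 0.40, 1: 0.12, 2: 0.03, 3: 0.007}   # hypothetical

for eps, name in [(0.20, "casual"), (0.05, "confident"), (0.01, "highly confident")]:
    gamma = sorted(y for y, p in p_values.items() if p > eps)
    print(f"{name} ({1 - eps:.0%} confidence): {gamma}")
```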

Desiderata for confidence predictors

There are three important desiderata for confidence predictors:

  • They should be valid: in the long run, the frequency of errors does not exceed \(\epsilon\) at each chosen confidence level \(1 - \epsilon\).
  • They should be efficient: the prediction sets they output are as small as possible.
  • They should be as conditional as possible: taking full account of how difficult the particular example is to predict.

Validity

Empirical validity for one particular conformal predictor on the USPS dataset.

  • Conservative validity: the probability of making an error when predicting at a confidence level \(1 - \epsilon\) is no greater than \(\epsilon\), and there is little dependence between errors it makes when predicting successive examples.
  • Exact validity: probability of a prediction set at a confidence level \(1 - \epsilon\) being in error is exactly \(\epsilon\), errors are made independently at different trials, and the long-run frequency of errors converges to \(\epsilon\).
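A simulation of what exact validity implies, using the fact that smoothed conformal p-values are uniformly distributed on \([0,1]\) (the simulation itself is illustrative):

```python
import numpy as np

# Under exact validity, an error at confidence level 1 - eps occurs
# exactly when the true label's (uniformly distributed) p-value falls
# at or below eps.  Errors are then independent Bernoulli(eps) trials,
# so their long-run frequency converges to eps.
rng = np.random.default_rng(0)
eps, n = 0.05, 100_000
p_values = rng.uniform(size=n)
errors = p_values <= eps
print(errors.mean())   # close to 0.05
```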

Efficiency

  • Efficiency in regression: In regression problems, the prediction set output by a conformal predictor is often an interval of values. A natural measure of efficiency of such a prediction is the length of the interval, at different confidence levels.
  • Efficiency in classification: a natural measure of efficiency is the size of the prediction set—the number of labels in it, perhaps apart from the true label (which will be in the prediction set with high probability for a valid predictor). Measuring efficiency in the case of classification will be one of the major topics of Chapter 3 of Vovk (2022).

There are many conformal predictors for any particular online prediction problem. Which is most efficient—which produces the smallest prediction sets in practice—will depend on details of the environment that we may not know in advance.

Conditionality with a motivating example

  • There are two categories of objects, “easy” (easy to predict) and “hard” (hard to predict). Both categories are equally likely.
  • There is a prediction method that applies to all objects, hard and easy, and its overall error rate is \(5\%\). All the errors may be made on hard objects, in which case the error rate on hard objects alone is \(10\%\).
  • It is therefore not fair to appeal to the average error rate of \(5\%\) and claim \(95\%\) confidence when predicting the label of a hard object.

Whenever there are features of objects that we know make the prediction easier or harder, we would like to take these features into account—to condition on them. This is done by conformal predictors almost automatically!

Flexibility of Conformal Predictors

  • A conformal predictor can be built on top of almost any machine-learning algorithm.
  • The underlying algorithm may produce hedged predictions, simple predictions, or simple predictions complemented by ad hoc measures of confidence.
  • It is always possible to transform it into a conformal predictor that inherits its predictive performance but is, of course, valid, just like any other conformal predictor.

Chapters 2, 3 and 4 of Vovk (2022) discuss how to build conformal predictors using such methods as nearest neighbours, support vector machines, bootstrap, boosting, neural networks, decision trees, random forests, ridge regression, logistic regression, and any Bayesian algorithm.
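As a concrete illustration, here is a compact full conformal classifier built on top of the classic nearest-neighbour nonconformity score (distance to the nearest example of the same class over distance to the nearest example of another class); the toy data and helper names are ours:

```python
import numpy as np

def nn_nonconformity(X, y):
    """Nonconformity score of each example: distance to the nearest
    example with the same label divided by distance to the nearest
    example with a different label (large = strange)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    return np.array([D[i, y == y[i]].min() / D[i, y != y[i]].min()
                     for i in range(len(y))])

def conformal_set(X_old, y_old, x_new, labels, eps=0.05):
    """Full conformal prediction set: complete the dataset with each
    candidate label in turn, and keep the candidates whose p-value
    (the rank of the new example's nonconformity score) exceeds eps."""
    gamma = []
    for label in labels:
        X = np.vstack([X_old, x_new])
        y = np.append(y_old, label)
        alpha = nn_nonconformity(X, y)
        if np.mean(alpha >= alpha[-1]) > eps:
            gamma.append(label)
    return gamma

# Toy usage: two well-separated classes in the plane.
rng = np.random.default_rng(1)
X_old = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
                   rng.normal([3, 3], 0.5, (20, 2))])
y_old = np.array([0] * 20 + [1] * 20)
print(conformal_set(X_old, y_old, np.array([0.1, -0.2]), labels=[0, 1]))  # [0]
```

For this toy data, completing the dataset with the wrong label makes the new example maximally strange, so its p-value is \(1/41 \approx 0.024 \leq 0.05\) and that label is excluded from the set.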

Probabilistic Prediction Under Unconstrained Randomness

  • Many ways to do classification and regression under unconstrained randomness and for high-dimensional examples.
  • Conformal predictors combine good theoretical properties with high accuracy in practical problems.
  • The situation changes if we move to the harder problem of probabilistic prediction: that of guessing the probability distribution for the new object’s label.

For simplicity, we will assume that the label is binary, 0 or 1. In this case the probabilistic prediction for the label of the new object boils down to one number: the predicted probability that the label is 1.

Universally Consistent Probabilistic Predictor

  • Universally consistent: the difference between the probabilistic prediction and the true conditional probability (given the object) that the label is 1 converges to zero in probability.
  • A nearest neighbours probabilistic predictor is universally consistent (Stone 1977).
  • Stone’s actual result was more general, and it has been further extended in different directions. Chapters 6 and 7 of Vovk (2022) will state similar results in the context of conformal prediction.
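A minimal sketch of such a predictor; note that Stone's theorem requires \(k\) to grow with \(n\) while \(k/n \to 0\), a schedule this fixed-\(k\) version ignores:

```python
import numpy as np

def knn_probability(X_old, y_old, x_new, k):
    """Nearest-neighbours probabilistic prediction for a binary label:
    the fraction of 1s among the k nearest old examples."""
    nearest = np.argsort(np.linalg.norm(X_old - x_new, axis=1))[:k]
    return y_old[nearest].mean()

# Toy usage: the label depends on the first coordinate of the object.
rng = np.random.default_rng(0)
X_old = rng.random((500, 2))
y_old = (X_old[:, 0] > 0.5).astype(int)
print(knn_probability(X_old, y_old, np.array([0.9, 0.5]), k=25))  # near 1
```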

Probabilistic Prediction Using a Finite Dataset

  • The main obstacle to applying Stone's theorem is that the convergence it asserts is not uniform: it gives no guarantee for any dataset of a fixed, finite size. It is well understood that this severely limits the theorem's applicability.
  • A dataset consisting of old examples and one new object is diverse if no object in it is repeated (in particular, the new object is different from all old objects).
  • In Chapter 5 of Vovk (2022), the main result asserts that any nontrivial valid prediction interval for the conditional probability given the new object that the new label is \(1\) is inadmissible if the dataset is diverse and randomness is the only assumption.

The assumption that the dataset is diverse is related to the assumption of a high-dimensional environment. If the objects are, for example, complex images, we will not expect precise repetitions among them.

Venn prediction

  • Chapter 5 of Vovk (2022) shows that it is impossible to estimate the true conditional probabilities under the conditions stated.
  • That chapter notes that it is impossible to find conditional probabilities that are as good (in the sense of the algorithmic theory of randomness) as the true probabilities.
  • If, however, we only want probabilities that are “well calibrated”, a modification of conformal predictors, which we call Venn predictors, achieves this goal in a strong non-asymptotic sense, as shown in Chapter 6 of Vovk (2022). A sketch of the idea follows.
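To show the mechanics only, here is a Venn predictor with the simplest possible taxonomy, which puts all examples into a single category; this trivial choice ignores the objects entirely and is uninformative, but it illustrates the multiprobability output (the book's Venn predictors use taxonomies that group similar examples):

```python
import numpy as np

def trivial_venn(y_old):
    """Venn predictor with the trivial taxonomy (all examples in one
    category): for each tentative label of the new example, output the
    frequency of label 1 in the completed dataset.  The result is a
    pair of probabilities (a multiprobability prediction)."""
    n, k = len(y_old), int(np.sum(y_old))
    p_if_0 = k / (n + 1)        # new label tentatively 0
    p_if_1 = (k + 1) / (n + 1)  # new label tentatively 1
    return p_if_0, p_if_1

print(trivial_venn(np.array([0, 1, 1, 0, 1])))  # (0.5, 0.666...)
```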

Conformal predictive distributions

  • Venn predictors only work in the case of classification.
  • It turns out that in the case of regression, conformal prediction is capable of outputting predictive distribution functions, which we call conformal predictive distributions, as in Chapter 7 of Vovk (2022).

It is interesting that while our approach to probabilistic prediction in the case of classification requires a nontrivial modification of conformal prediction, no such modifications are needed for regression.

Vovk (2022) also considers testing the assumption of randomness and alternatives to this assumption, which are discussed in Chapters 11 and 12.

Thank you for your time!