ROC class notes

Le Kang

2024-01-17

Background

The receiver operating characteristic (ROC) curve was first used during World War II for the analysis of radar signals, and it subsequently entered the scientific literature in the 1950s in connection with signal detection theory and psychophysics, where assessing how well humans and animals detect weak signals was of considerable interest.

The seminal text for the early work was that by Green and Swets ¹.

Later, in the 1970s and 1980s, it became evident that the technique was of considerable relevance to medical test evaluation and decision making.

The decades since then have seen much development and use of the technique in areas such as radiology, cardiology, clinical chemistry, and epidemiology.

The name Receiver Operating Characteristic arises from the use of such curves in signal detection theory ¹ ², where the aim is to detect the presence of a particular signal, missing as few genuine occurrences as possible while simultaneously raising as few false alarms as possible.

That is, in signal detection theory, the aim is to assign each event either to the signal class or to the non-signal class, so the abstract situation is the same as the two-class framework described below. The word “characteristic” in Receiver Operating Characteristic refers to the characteristics of behavior of the classifier over the potential range of its operation.

A huge number of situations can be described by the following abstract framework:

Each of a set of objects is known to belong to one of two classes.
An assignment procedure assigns each object to a class on the basis of information observed about that object.

Unfortunately, the assignment procedure is not perfect: errors are made, meaning that sometimes an object is assigned to an incorrect class. Because of this imperfection, we need to evaluate the quality of performance of the procedure.

Examples

conducting medical diagnosis, in which the aim is to assign each patient to disease A or disease B;
developing speech recognition systems, in which the aim is to classify spoken words;
evaluating financial credit applications, in which the aim is to assign each applicant to a “likely to default” or “not likely to default” class;
assessing applicants for a university course, on the basis of whether or not they are likely to be able to pass the final examination;
filtering incoming email messages, to decide if they are spam or genuine messages;
examining credit card transactions, to decide if they are fraudulent or not;
investigating patterns of gene expression in microarray data, to see if they correspond to cancer or not.

In some cases, of course, more than two classes might be involved, but the case of two classes is by far the most important one in practice (sick/well, yes/no, right/wrong, accept/reject, act/do not act, condition present/absent, and so on).

Classification

The information about each object which is used to assign it to a class can be regarded as a vector of descriptive variables, characteristics, or features.

Sometimes the vector of descriptive variables will be univariate, but often it will be multivariate.

Univariate: white blood cell count \(\rightarrow\) inflammation
Multivariate: a vector \(\boldsymbol{X}\) of measurements

The type of information that one obtains depends on the level of measurement of each variable:

a nominal variable is one whose “values” are categories (e.g., color of eyes);
a binary variable is a nominal variable having just two possible categories (e.g., presence or absence of pain);
an ordinal variable has categories that are ordered in some way (e.g., no pain, mild pain, moderate pain, severe pain);

a discrete (numerical) variable is one that can take only a finite or countably infinite number of distinct possible values (e.g., the number of students who wear glasses in a class of 25);
a continuous (numerical) variable is one that can take any value in either a finite or infinite range (e.g., the weight of a patient in hospital).

In general, multivariate methods are more powerful than univariate methods, if only because each component of the descriptive vector can add extra information about the class of the object.

The multiple measurements taken on each object are then reduced to a single score \(S(\boldsymbol{X})\) for that object by some appropriate function.

The majority of functions \(S\) with which we are typically concerned will convert the raw information into a continuous value (a score on a univariate continuum).

The class assignment or classification is then made by comparing this score with a threshold: if the score is above the threshold the object is assigned to one class, and if the score is below the threshold it is assigned to the other.
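As a minimal sketch of this two-step procedure, the snippet below reduces each feature vector to a score and then compares the score with a threshold. The weighted-sum form of `score`, the weights, the threshold, and the data are all hypothetical, chosen only for illustration; the convention that large scores indicate class P is also an assumption here.

```python
def score(x, weights):
    """Reduce a feature vector x to a single score via a weighted sum."""
    return sum(w * xi for w, xi in zip(weights, x))


def classify(s, t):
    """Assign class 'P' when the score exceeds the threshold t, else 'N'."""
    return "P" if s > t else "N"


# Hypothetical weights, threshold, and feature vectors (illustration only).
weights = [0.5, 2.0]
t = 3.0
objects = [[4.0, 1.0], [1.0, 0.5], [2.0, 1.5]]

scores = [score(x, weights) for x in objects]
labels = [classify(s, t) for s in scores]

print(scores)  # [4.0, 1.5, 4.0]
print(labels)  # ['P', 'N', 'P']
```

Changing `t` changes the assignments without touching the score function.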

Notations

We denote the characteristics describing objects as \(\boldsymbol{X}\), with \(\boldsymbol{x}\) denoting particular values, and the resulting scores as \(S(\boldsymbol{X})\), taking particular values \(S(\boldsymbol{x})\). The classification threshold \(T\) takes values denoted by \(t\).

In general, we denote the two classes by P (positive) and N (negative). The emphasis is often on identifying P individuals correctly.

symmetry vs asymmetry between two populations
“supervised classification”

Aim

A central aspect of developing classification rules is choosing the function \(S\) that reduces the vector \(\boldsymbol{x}\) to a single score. The aim is to construct a score function \(S(\boldsymbol{X})\) such that members of the two classes have distinctly different sets of scores, thereby enabling the classes to be clearly distinguished.

We will assume that the scores have been orientated in such a way that members of class P tend to have large scores and members of class N tend to have small scores, with a threshold that divides the scores into two groups.

Training set

The training set includes both the descriptive vectors \(\boldsymbol{X}\) and the true class (P or N) of each object in the set.

Any proposed function \(S\) will then produce a distribution of scores for the members of P in the training set, and a distribution of scores for the members of N in the training set, and the score for any particular object can then be compared with the classification threshold \(t\).

Form of the rule

a weighted sum of the raw components of \(\boldsymbol{X}\),
a partition of the multivariate \(\boldsymbol{X}\) space,
a sum of nonlinear transformations of the components of \(\boldsymbol{X}\),
any of a host of other ways of combining these components
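To make the first three forms concrete, here is a rough sketch of one possible score function of each kind. The function names, weights, cut points, and transformations are hypothetical choices for illustration; in particular, counting how many components exceed their cut points is only one simple way to turn a partition of the \(\boldsymbol{X}\) space into a score.

```python
import math


def weighted_sum(x, weights):
    """Linear score: a weighted sum of the raw components of x."""
    return sum(w * xi for w, xi in zip(weights, x))


def partition_score(x, cut_points):
    """Partition-based score: constant within each cell of a grid on the
    X space; here, the number of components above their cut point."""
    return sum(1 for xi, c in zip(x, cut_points) if xi > c)


def nonlinear_sum(x):
    """Sum of nonlinear transformations (assumed: log1p of the first
    component plus the square of the second)."""
    return math.log1p(x[0]) + x[1] ** 2


x = [3.0, 2.0]  # hypothetical feature vector
print(weighted_sum(x, [1.0, -0.5]))    # 2.0
print(partition_score(x, [2.5, 2.5]))  # 1
print(nonlinear_sum(x))
```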

Estimation of parameter

It is also important to define the criterion used to estimate any parameters in \(S\).

In a weighted sum, for example, we must choose the weights; in a partition of the \(\boldsymbol{X}\) space we must choose the positions of the cut points; and so on.
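As an illustration of estimating a parameter under an explicit criterion, the sketch below chooses a single cut point on a univariate score by minimizing the number of training errors. The criterion, the candidate grid (midpoints between adjacent distinct scores), and the data are assumptions made for this example; other criteria, such as a likelihood, could be used instead.

```python
def training_errors(scores, classes, t):
    """Count training objects misclassified by 'P if score > t else N'."""
    return sum(
        1 for s, c in zip(scores, classes)
        if ("P" if s > t else "N") != c
    )


def fit_threshold(scores, classes):
    """Pick the cut point with the fewest training errors.

    Candidates are the midpoints between adjacent distinct scores, plus
    one value below the minimum and one above the maximum. Ties are
    broken in favor of the smallest candidate.
    """
    distinct = sorted(set(scores))
    candidates = [distinct[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    candidates.append(distinct[-1] + 1.0)
    return min(candidates, key=lambda t: training_errors(scores, classes, t))


# Hypothetical training scores with their known classes.
scores = [0.2, 0.4, 0.5, 0.7, 0.9, 1.1]
classes = ["N", "N", "P", "N", "P", "P"]

t_hat = fit_threshold(scores, classes)
print(t_hat, training_errors(scores, classes, t_hat))
```

On these data no cut point separates the classes perfectly, so the best achievable count of training errors is one.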

Classifier performance assessment

Training data are used to construct the classification rule. Then, having finally settled on a rule, we want to know how effective it will be in assigning future objects to classes. To explore this, we need to actually assign some objects to classes and see, in some way, how well the rule does.

substitution
training vs testing set plus averaging
leave-one-out
resampling/bootstrap
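Two of these approaches, substitution (evaluating on the training data itself) and leave-one-out, can be sketched as follows. The toy rule used here, a threshold halfway between the two class means, is an assumption for illustration and not a rule prescribed in these notes; the data are hypothetical.

```python
def fit_midpoint_threshold(scores, classes):
    """Toy rule: threshold halfway between the class means of the scores.
    Assumes each class has at least one member."""
    p = [s for s, c in zip(scores, classes) if c == "P"]
    n = [s for s, c in zip(scores, classes) if c == "N"]
    return (sum(p) / len(p) + sum(n) / len(n)) / 2


def error_rate(scores, classes, t):
    """Proportion of objects misclassified by 'P if score > t else N'."""
    wrong = sum(
        1 for s, c in zip(scores, classes)
        if ("P" if s > t else "N") != c
    )
    return wrong / len(scores)


def substitution_error(scores, classes):
    """Fit and evaluate on the same data (tends to be optimistic)."""
    t = fit_midpoint_threshold(scores, classes)
    return error_rate(scores, classes, t)


def leave_one_out_error(scores, classes):
    """Refit with each object held out in turn, then classify that object."""
    wrong = 0
    for i in range(len(scores)):
        train_s = scores[:i] + scores[i + 1:]
        train_c = classes[:i] + classes[i + 1:]
        t = fit_midpoint_threshold(train_s, train_c)
        if ("P" if scores[i] > t else "N") != classes[i]:
            wrong += 1
    return wrong / len(scores)


# Hypothetical scores with their known classes.
scores = [0.2, 0.4, 0.5, 0.7, 0.9, 1.1]
classes = ["N", "N", "P", "N", "P", "P"]

print(substitution_error(scores, classes))
print(leave_one_out_error(scores, classes))
```

A training/testing split follows the same pattern, with the threshold fitted on one subset and `error_rate` computed on the other.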

Two-by-two classification table

Four joint probabilities,

\(p(s>t,P)\),
\(p(s>t,N)\),
\(p(s \leq t,P)\),
\(p(s \leq t,N)\).
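In practice these four joint probabilities are estimated by the cell proportions of the two-by-two table. A minimal sketch, assuming hypothetical scores, known classes, and a threshold of 0.6:

```python
def joint_table(scores, classes, t):
    """Estimate the four joint probabilities as proportions of all objects."""
    counts = {(">", "P"): 0, (">", "N"): 0, ("<=", "P"): 0, ("<=", "N"): 0}
    for s, c in zip(scores, classes):
        side = ">" if s > t else "<="
        counts[(side, c)] += 1
    n = len(scores)
    return {cell: k / n for cell, k in counts.items()}


# Hypothetical scores with their known classes.
scores = [0.2, 0.4, 0.5, 0.7, 0.9, 1.1]
classes = ["N", "N", "P", "N", "P", "P"]

table = joint_table(scores, classes, t=0.6)
for cell, prob in table.items():
    print(cell, round(prob, 3))
```

The four estimates always sum to one, since every object falls in exactly one cell.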

One very common measure is the misclassification rate, but it is far from perfect.

Conditional and marginal probabilities

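\(p(s>t|N)\): false positive rate, fp;
\(p(s>t|P)\): true positive rate, tp (sensitivity, se);
\(p(s \leq t|N)\): true negative rate, tn = 1 - fp (specificity, sp);
\(p(s \leq t|P)\): false negative rate, fn = 1 - tp;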
\(p(P)\) is the prevalence, and \(p(N)=1-p(P)\).
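These conditional rates can be estimated by conditioning on the true class, that is, by computing proportions within P and within N separately. The sketch below assumes hypothetical data and a threshold of 0.6, and it assumes both classes are present in the sample.

```python
def rates(scores, classes, t):
    """Estimate tp, fp, tn, fn, and prevalence from labelled scores."""
    pos = [s for s, c in zip(scores, classes) if c == "P"]
    neg = [s for s, c in zip(scores, classes) if c == "N"]
    tp = sum(1 for s in pos if s > t) / len(pos)  # p(s > t | P)
    fp = sum(1 for s in neg if s > t) / len(neg)  # p(s > t | N)
    return {
        "tp": tp,
        "fp": fp,
        "tn": 1 - fp,
        "fn": 1 - tp,
        "prevalence": len(pos) / len(scores),  # p(P)
    }


# Hypothetical scores with their known classes.
scores = [0.2, 0.4, 0.5, 0.7, 0.9, 1.1]
classes = ["N", "N", "P", "N", "P", "P"]

r = rates(scores, classes, t=0.6)
print(r)
```

Unlike the joint probabilities, tp and fp do not depend on the prevalence, which is why they are the natural building blocks for the ROC curve.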

Converse conditional probabilities

\(p(P|s>t)\): positive predictive value, ppv;
\(p(N|s\leq t)\): negative predictive value, npv;
misclassification rate as a weighted sum of fn and fp, \(p(P)(1-tp) + p(N)\,fp\);
Youden index, \(tp - fp = se + sp - 1\).
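The measures in this list can all be computed from tp, fp, and the prevalence: the predictive values by Bayes' theorem, the misclassification rate as the prevalence-weighted sum of fn and fp, and the Youden index as tp minus fp. The sketch below uses a hypothetical test with 90% sensitivity and specificity applied at 1% prevalence; the function name and the inputs are illustrative assumptions.

```python
def summary_measures(tp, fp, prevalence):
    """Derive ppv, npv, misclassification rate, and Youden index.

    Assumes the denominators are non-zero, i.e. some objects fall on
    each side of the threshold.
    """
    p_pos, p_neg = prevalence, 1 - prevalence
    tn, fn = 1 - fp, 1 - tp
    ppv = tp * p_pos / (tp * p_pos + fp * p_neg)  # p(P | s > t)
    npv = tn * p_neg / (tn * p_neg + fn * p_pos)  # p(N | s <= t)
    error = fn * p_pos + fp * p_neg               # misclassification rate
    youden = tp - fp                              # equals se + sp - 1
    return {"ppv": ppv, "npv": npv, "error": error, "youden": youden}


# Hypothetical test: se = sp = 0.9, prevalence = 0.01.
m = summary_measures(tp=0.9, fp=0.1, prevalence=0.01)
for name, value in m.items():
    print(name, round(value, 4))
```

With these inputs the ppv is only about 0.083 even though the misclassification rate is 0.1, which illustrates why a single summary such as the misclassification rate can be misleading when the classes are very unbalanced.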