Basic terms

In general, positive = identified and negative = rejected. Therefore:

True positive = correctly identified

False positive = incorrectly identified

True negative = correctly rejected

False negative = incorrectly rejected

Medical testing example:

True positive = Sick people correctly diagnosed as sick

False positive = Healthy people incorrectly identified as sick

True negative = Healthy people correctly identified as healthy

False negative = Sick people incorrectly identified as healthy

So, first of all we’re going to focus on the types of errors that you can make when you’re doing a binary prediction problem. In other words, you’re trying to predict things into one of two groups. In general, we’re going to be talking in terms of true positives and true negatives, and false positives and false negatives. The first thing to keep in mind is that, when we talk about positive versus negative, we’re actually talking about what the algorithm decided: whether it decided that you’re in a class or not in a class. Then true and false refer to the true state of the world; true means the algorithm’s call was correct, and false means it was wrong. So as an example, the “true” part of true positive means that there actually was something to identify, and the “positive” part means that we identified you as belonging to that class. Similarly, for a false positive, the “positive” part again refers to the fact that we identified you as being part of the positive class, and the “false” part refers to the fact that we were wrong: we didn’t actually classify you into the correct class.
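As a quick way to keep the four terms straight, here is a minimal Python sketch that simply encodes the definitions above; the function name `outcome` is hypothetical, chosen just for this illustration.

```python
def outcome(predicted_positive: bool, truly_positive: bool) -> str:
    """Label one binary prediction as TP, FP, TN, or FN."""
    if predicted_positive and truly_positive:
        return "true positive"    # identified, and correctly so
    if predicted_positive and not truly_positive:
        return "false positive"   # identified, but incorrectly
    if not predicted_positive and not truly_positive:
        return "true negative"    # rejected, and correctly so
    return "false negative"       # rejected, but incorrectly

print(outcome(True, False))  # prints "false positive"
```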

To make this a little more concrete, consider a medical testing example. In this case, we’re trying to identify people that are sick using, say, a screening test; a very common example would be mammograms used to identify whether women have breast cancer. Here, the true part refers to your status as to whether you’re sick or not. So if we say that we truly identified you, then you were truly sick; and if we falsely identified you, then you were actually healthy, not truly sick. So in this case, a true positive is somebody who is truly sick, and it’s positive in the sense that we correctly diagnosed those people as being sick.

If you’re a false positive, it means that, false, you are actually a healthy person, but positive, we still identified you as being sick, even though you weren’t. Similarly with a true negative: this is somebody who is truly negative, truly healthy, and we identified them as being negative.

And a false negative would be somebody who is sick; the false part is that we got it wrong, and the negative part is that we identified them as healthy. You can learn more about sensitivity and specificity by going to this Wikipedia link below:

Sensitivity and specificity

Key quantities

Sensitivity and specificity

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/

You can also see these in a 2 by 2 table, so called because it has two rows and two columns. The columns correspond to what your disease status is: in this particular example, positive means that you have the disease, and negative means that you don’t have the disease. That’s the real truth about your disease status. And the test is our prediction, our machine learning algorithm: a positive means we predict that you have the disease, and a negative means that we predict that you don’t have the disease.
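Sketched as a table, with the test result in the rows and the true disease status in the columns:

|               | Disease positive    | Disease negative    |
|---------------|---------------------|---------------------|
| Test positive | True positive (TP)  | False positive (FP) |
| Test negative | False negative (FN) | True negative (TN)  |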

Some of the key quantities that people talk about are: the sensitivity, which is the probability that we call you diseased, given that you really are diseased. So, if you’re really diseased, what’s the probability we get that right? And then the specificity: if you are really healthy, what’s the probability we get it right? The positive predictive value is the probability that you are diseased, given that we call you diseased. So it’s a little bit different from the sensitivity in the sense that now it’s looking at all the people we called diseased, and asking what fraction of them actually are diseased. Similarly for the negative predictive value: among the people we call healthy, what fraction actually are healthy? And the accuracy is just the probability that we classified you to the correct outcome. So in this table, it’s the terms on the diagonal, the true positives (TP) and the true negatives (TN), added up and divided by the total.
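Written as conditional probabilities, these are:

\[Sensitivity = P(positive\ test \mid disease)\]

\[Specificity = P(negative\ test \mid no\ disease)\]

\[Positive\ predictive\ value = P(disease \mid positive\ test)\]

\[Negative\ predictive\ value = P(no\ disease \mid negative\ test)\]

\[Accuracy = P(correct\ outcome)\]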

Key quantities as fractions

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/

So you can write these as fractions. For example, sensitivity is the probability, given that you are diseased, that we called you diseased. So we look at the first column of the table, which is all the people that are diseased, and ask what fraction of them we actually got right. That’s TP, the true positives, divided by (TP + FN), the true positives plus the false negatives, and that gives you the sensitivity. You can make the same sort of fractions for the specificity, the positive predictive value, the negative predictive value, and so forth. For the positive predictive value, we look at the true positives divided by the true positives plus the false positives, because we’re looking at only the positive tests and asking what fraction of the positive tests we got right: the true positives are the ones that we got right, and the true positives plus the false positives are the total of the positive tests.
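Collected in one place, the fractions are:

\[Sensitivity = \frac{TP}{TP + FN}\]

\[Specificity = \frac{TN}{FP + TN}\]

\[Positive\ predictive\ value = \frac{TP}{TP + FP}\]

\[Negative\ predictive\ value = \frac{TN}{FN + TN}\]

\[Accuracy = \frac{TP + TN}{TP + FP + FN + TN}\]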

Screening tests

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/

This is important because in many prediction problems, one of the classes will be more rare than the other. For example, in medical studies it’s very common that only a very small percentage of people will be sick. In this case, suppose that there’s a disease where only 0.1% of the people in the population are sick. And suppose we have a really good machine learning algorithm, a really good testing kit, that is 99% sensitive and 99% specific. In other words, the probability that we’ll get it right if you’re diseased is 99%, and the probability we’ll get it right if you’re healthy is 99%. So suppose that you get a positive test: what’s the probability that you actually have the disease? You can consider two different cases. One is a general population, in other words a population where there’s a very small chance that you have the disease. The other is a case where 10% of people have the disease, so the disease is much more prevalent. Let’s look at how that changes your positive predictive value.
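Before walking through the two cases on the next slides, here is a small Python sketch of the underlying calculation; `ppv` is a hypothetical helper that applies Bayes’ rule to the prevalence, sensitivity, and specificity.

```python
def ppv(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Probability of disease given a positive test (Bayes' rule)."""
    true_pos = prevalence * sensitivity                # P(test + and sick)
    false_pos = (1 - prevalence) * (1 - specificity)   # P(test + and healthy)
    return true_pos / (true_pos + false_pos)

print(ppv(0.001, 0.99, 0.99))  # general population: ~0.09
print(ppv(0.10, 0.99, 0.99))   # at-risk population: ~0.92
```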

General population

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/

In the general population, remember, only 0.1% of the people have the disease. So in a population of 100,000 people, there are only 100 people in the diseased column, and a lot more people, 99,900, that are healthy. The test is 99% sensitive, so 99 of those 100 diseased people are correctly called diseased. Similarly, among the 99,900 people that are healthy, we get 99% right, so 98,901 of them are correctly called healthy.
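Filling in the whole table for a population of 100,000 (the remaining cells and totals follow arithmetically from the numbers above):

|               | Disease positive | Disease negative | Total   |
|---------------|------------------|------------------|---------|
| Test positive | 99               | 999              | 1,098   |
| Test negative | 1                | 98,901           | 98,902  |
| Total         | 100              | 99,900           | 100,000 |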

General population as fractions

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/

But suppose that we wanted to know: if you got a positive test, what’s the probability that you actually have the disease? Suppose you got a positive test; that’s the first row of the table. What’s the probability that you actually have the disease? That’s the number of people that actually have the disease among the total number of people who had a positive test: 99 divided by 99 plus 999, so it’s only about a 9% positive predictive value. In other words, if you got a positive test, there’s only about a 9% chance that you actually have the disease. What’s the reason for that? The reason is that 99% of a small number, the 99 true positives out of 100 diseased people, is still smaller than 1% of a much bigger number, the 999 false positives out of the 99,900 people that are actually healthy.
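Written out:

\[Positive\ predictive\ value = \frac{99}{99 + 999} \approx 0.09\]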

At risk subpopulation

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/

If instead we consider the case where 10% of people are actually sick, then in a population of 100,000 you have a much larger number of people, 10,000, that are actually sick. 99% of the time we’ll get it right, so 9,900 of the people that actually are sick will be called sick, and only 900 of the 90,000 people that are healthy will be called sick. And so then, things work out how you’d expect them to.
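The corresponding table, assuming the same population of 100,000:

|               | Disease positive | Disease negative | Total   |
|---------------|------------------|------------------|---------|
| Test positive | 9,900            | 900              | 10,800  |
| Test negative | 100              | 89,100           | 89,200  |
| Total         | 10,000           | 90,000           | 100,000 |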

At risk subpopulation as fraction

http://www.biostat.jhsph.edu/~iruczins/teaching/140.615/

In other words, 9,900 out of 9,900 plus 900: the number in the top left-hand corner divided by the total of the first row, which is about 92%, so you have a high positive predictive value. What does this mean? If you’re predicting a rare event, you have to be aware of how rare that event is. This goes back to the idea of knowing what population you’re sampling from when you’re building a predictive model.
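Written out:

\[Positive\ predictive\ value = \frac{9900}{9900 + 900} \approx 0.92\]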

Key public health issue

Vast Study Casts Doubts on Value of Mammograms

This is actually a key public health issue. You’ve probably seen in the news that there have been questions about the value of mammograms in detecting life-threatening disease versus detecting cases that aren’t necessarily life threatening.

Key public health issue

Looser Guidelines Issued on Prostate Screening

Similarly, you’ve probably heard about this for prostate cancer screening. In both of these cases, you have a fairly rare disease, and even though the screening mechanisms are relatively good, it’s very hard to know how many false positives you’re getting as a fraction of the total number of positives.

For continuous data

Mean squared error (MSE):

\[\frac{1}{n} \sum_{i=1}^n (Prediction_i - Truth_i)^2\]

Root mean squared error (RMSE):

\[\sqrt{\frac{1}{n} \sum_{i=1}^n(Prediction_i - Truth_i)^2}\]

For continuous data, you don’t have quite so simple a scenario, where there are only one of two cases and one of two types of errors that you can possibly make. The goal here is to see how close you are to the truth. One common way to do that is with something called mean squared error. The idea is, you have a prediction from your model or your machine learning algorithm for every single sample that you’re trying to predict, and you also know the truth for those samples, say in a test set. So you calculate the difference between the prediction and the truth, and you square it, so the numbers are all positive, and then you average those squared distances between the prediction and the truth. The one thing that’s a little bit difficult about interpreting this number is that you squared the distances, so it’s a little bit hard to interpret on the same scale as the predictions or the truth. What people often do is take the square root of that quantity. Underneath the square root sign is the same number, the average squared distance between the prediction and the truth, and taking the square root of that number gives you the root mean squared error. This is probably the most common error measure that’s used for continuous data.
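A minimal Python sketch of both formulas (my own illustration; the function names `mse` and `rmse` are just for this example):

```python
import math

def mse(predictions, truths):
    """Mean squared error: the average squared difference."""
    n = len(truths)
    return sum((p - t) ** 2 for p, t in zip(predictions, truths)) / n

def rmse(predictions, truths):
    """Root mean squared error: back on the scale of the data."""
    return math.sqrt(mse(predictions, truths))

predictions = [2.5, 0.0, 2.1, 7.8]
truths = [3.0, -0.5, 2.0, 7.0]
print(mse(predictions, truths))   # 0.2875
print(rmse(predictions, truths))  # ~0.536
```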

Common error measures

  1. Mean squared error (or root mean squared error)
  2. Median absolute deviation
  3. Sensitivity (recall)
  4. Specificity
  5. Accuracy
  6. Concordance
  7. Predictive value of a positive (precision)

So for continuous data, people often use either the mean squared error or the root mean squared error. But it often doesn’t work well when there are a lot of outliers, or when the values of the variables can have very different scales, because it will be sensitive to those outliers. For example, if you have one really, really large value, it might really raise the mean. Instead, you can often use the median absolute deviation. In that case, you take the median of the distance between the observed value and the predicted value, and you use the absolute value instead of the squared value. So again, that makes all of the distances positive, but it’s a little bit more robust to the size of those errors. Sensitivity and specificity are very commonly used when talking about medical tests in particular, but they’re also widely used if you care about one type of error more than the other type of error. Then there’s accuracy, which weights false positives and false negatives equally; this is an important point if, again, you have a very large discrepancy in the number of positives and negatives. For multiclass cases, you might have something like concordance, and one particular measure there is kappa. But there is a whole large class of distance measures, and they all have different properties that can be used when you have multiclass data. So, those are some of the common error measures that are used when doing prediction algorithms.
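To make the robustness point concrete, here is a short Python sketch (my own illustration, using the lecture’s definition of median absolute deviation as the median of the absolute prediction errors) comparing the two measures on data with one large outlier:

```python
import math
import statistics

def rmse(predictions, truths):
    """Root mean squared error: sensitive to large misses."""
    n = len(truths)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, truths)) / n)

def median_abs_dev(predictions, truths):
    """Median of the absolute errors: robust to a few huge misses."""
    return statistics.median(abs(p - t) for p, t in zip(predictions, truths))

predictions = [1.0, 2.0, 3.0, 4.0, 100.0]  # the last prediction is wildly off
truths = [1.1, 1.9, 3.2, 3.8, 5.0]

print(rmse(predictions, truths))            # ~42.5, dominated by the outlier
print(median_abs_dev(predictions, truths))  # 0.2, barely affected
```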