In Sample Error: The error rate you get on the same data set you used to build your predictor. Sometimes called resubstitution error.
Out of Sample Error: The error rate you get on a new data set. Sometimes called generalization error.
Key ideas
So “in sample error” is the error rate you get on the same data you used to train your predictor. This is sometimes called “resubstitution error” in the machine learning literature. “In sample error” is always going to be a little bit optimistic compared to the error you would get on a new sample. The reason is that, in your specific sample, your prediction algorithm tunes itself a little bit to the noise that you collected in that particular data set. When you get a new data set there will be different noise, and so the accuracy will go down a little bit.
The “out of sample error” rate, sometimes called the “generalization error” in machine learning, is the error rate you get on new data. The idea is that once we build a model on a sample of data we have collected, we might want to test it on a new sample, perhaps collected by a different person or at a different time, to get a realistic expectation of how well that machine learning algorithm will perform on new data.
So almost always, “out of sample error” is what you care about. If you see an error rate reported only on the data the machine-learning algorithm was built on, you know it is very optimistic, and it probably won’t reflect how the model will perform in real practice.
“In sample error” is always less than “out of sample error”, so that’s something to keep in mind, and the reason is overfitting. Basically, you’re matching your algorithm to the data you have at hand, and you’re matching it a little bit too well.
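As a concrete illustration of that gap, here is a minimal sketch, not from the lecture, that splits the spam data 70/30, fits a one-variable logistic regression on capitalAve, and compares the resubstitution accuracy with the accuracy on the held-out rows; the split proportion and the glm model are assumptions chosen just for this demo.

library(kernlab); data(spam); set.seed(333)
inTrain  <- sample(nrow(spam), size = floor(0.7*nrow(spam)))   # assumed 70/30 split
training <- spam[inTrain,]
testing  <- spam[-inTrain,]
fit <- glm(type ~ capitalAve, data = training, family = binomial)   # model P(spam) as a function of capitalAve
predLabel <- function(model, newdata) {
  # call a message spam whenever the fitted probability exceeds 0.5
  ifelse(predict(model, newdata, type = "response") > 0.5, "spam", "nonspam")
}
mean(predLabel(fit, training) == training$type)   # in sample (resubstitution) accuracy, typically a bit optimistic
mean(predLabel(fit, testing) == testing$type)     # out of sample accuracy, closer to what new data would show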
library(kernlab); data(spam); set.seed(333)
smallSpam <- spam[sample(dim(spam)[1],size=10),]   # random sample of 10 messages
spamLabel <- (smallSpam$type=="spam")*1 + 1        # 1 = nonspam (black), 2 = spam (red)
plot(smallSpam$capitalAve,col=spamLabel)           # average run of capital letters, colored by type
So we might want to build a predictor, based on the average run of capital letters, for whether a message is spam or ham. One thing we could do is build a predictor that says “if you have a lot of capitals then you’re a spam message, and if you don’t then you’re a non-spam message”. Here’s what that rule could look like: if your capital average is above 2.7 we’re going to call you spam, and if it’s below 2.40 you’re classified as non-spam. Then one more thing we can do is actually try to train this algorithm very, very well so that it predicts perfectly on this data set.
So if we go back to this plot of the different values, you can see there’s one spam message down in the lower right hand corner (colored red) whose capital average is a little bit lower than the highest non-spam value (colored black). So we could build a prediction algorithm that captures that spam value as well.
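If you want to see where those cutoffs sit relative to the points, one option (a small sketch, not from the lecture) is to add horizontal reference lines to the plot above at the values that rule1 below uses:

abline(h = 2.70, col = "blue")            # above this line the rule calls a message spam
abline(h = 2.45, col = "blue", lty = 2)   # the 2.40-2.45 band is carved out as spam to catch the odd point
abline(h = 2.40, col = "blue", lty = 2)   # below this line the rule calls a message nonspam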
rule1 <- function(x){
  prediction <- rep(NA,length(x))
  prediction[x > 2.7] <- "spam"                    # high capital average: spam
  prediction[x < 2.40] <- "nonspam"                # low capital average: nonspam
  prediction[(x >= 2.40 & x <= 2.45)] <- "spam"    # narrow band carved out to catch the one odd spam point
  prediction[(x > 2.45 & x <= 2.70)] <- "nonspam"
  return(prediction)
}
table(rule1(smallSpam$capitalAve),smallSpam$type)

          nonspam spam
  nonspam       5    0
  spam          0    5
rule2 <- function(x){
  prediction <- rep(NA,length(x))
  prediction[x > 2.8] <- "spam"       # single cutoff: above 2.8 is spam
  prediction[x <= 2.8] <- "nonspam"   # everything else is nonspam
  return(prediction)
}
table(rule2(smallSpam$capitalAve),smallSpam$type)

          nonspam spam
  nonspam       5    1
  spam          0    4
table(rule1(spam$capitalAve),spam$type)

          nonspam spam
  nonspam    2141  588
  spam        647 1225

table(rule2(spam$capitalAve),spam$type)

          nonspam spam
  nonspam    2224  642
  spam        564 1171
mean(rule1(spam$capitalAve)==spam$type)
[1] 0.7315801
mean(rule2(spam$capitalAve)==spam$type)
[1] 0.7378831
So then we can apply each rule to all the spam data, in other words to all of the values, not just the ones we had in the small training set, and those are the results you get above. We can also count up the total number of messages each rule classifies correctly:
sum(rule1(spam$capitalAve)==spam$type)
[1] 3366
sum(rule2(spam$capitalAve)==spam$type)
[1] 3395
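Putting the in sample and out of sample numbers side by side makes the point of this section explicit; the short comparison below is a sketch that simply reuses smallSpam, rule1, and rule2 from above.

mean(rule1(smallSpam$capitalAve) == smallSpam$type)   # in sample accuracy: 1.0, rule1 is perfect on its own data
mean(rule2(smallSpam$capitalAve) == smallSpam$type)   # in sample accuracy: 0.9, rule2 misses one training message
mean(rule1(spam$capitalAve) == spam$type)             # accuracy on the full data set: about 0.73
mean(rule2(spam$capitalAve) == spam$type)             # accuracy on the full data set: about 0.74, the simpler rule wins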
http://en.wikipedia.org/wiki/Overfitting
So, what’s the reason that the simpler rule actually does better than the more complicated rule? The reason is overfitting.
In every data set we have two parts: the signal, which is the part we’re trying to use to predict, and the noise, which is just random variation in the data set that we get because the data are measured noisily. The goal of a predictor is to find the signal and ignore the noise.

In any small data set you can always build a perfect in-sample predictor, just like we did with that spam data set: you can always carve up the prediction space in that small data set to capture every single quirk of the data. But when you do that, you capture both the signal and the noise. For example, in that training set there was one spam value that had a slightly lower capital average than some of the non-spam values, but that was only because we randomly picked a data set where that value happened to be low. So that predictor won’t necessarily perform as well on new samples, because we’ve tuned it too tightly to the observed training set.
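To make the signal-versus-noise point concrete outside the spam example, here is a small simulated sketch; the sine-wave signal, the noise level, and the degree-3 versus degree-10 polynomial fits are all assumptions chosen just for illustration.

set.seed(333)
n <- 30
x <- runif(n); xNew <- runif(n)
signal <- function(x) sin(2*pi*x)                  # the true signal
y    <- signal(x)    + rnorm(n, sd = 0.3)          # training data = signal + noise
yNew <- signal(xNew) + rnorm(n, sd = 0.3)          # new data: same signal, different noise
simpleFit   <- lm(y ~ poly(x, 3))                  # low-flexibility model
flexibleFit <- lm(y ~ poly(x, 10))                 # high-flexibility model that can chase the noise
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(y, predict(simpleFit))                        # in sample error
rmse(y, predict(flexibleFit))                      # in sample error: the flexible model always looks at least as good
rmse(yNew, predict(simpleFit,   data.frame(x = xNew)))   # out of sample error
rmse(yNew, predict(flexibleFit, data.frame(x = xNew)))   # out of sample error: the flexible model usually does worse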