In Sample Error: The error rate you get on the same data set you used to build your predictor. Sometimes called resubstitution error.
Out of Sample Error: The error rate you get on a new data set. Sometimes called generalization error.
Key ideas
So “in sample error” is the error rate you get on the same data you used to train your predictor. This is sometimes called “resubstitution error” in the machine learning literature. “In sample error” is always going to be a little bit optimistic compared to the error you would get on a new sample. The reason is that, in your specific sample, your prediction algorithm tunes itself a little bit to the noise that you collected in that particular data set. When you get a new data set there will be different noise, and so the accuracy will go down a little bit.
The “out of sample error” rate, sometimes called the “generalization error” in machine learning, is the error rate you get on new data. The idea is that once we build a model on a sample of data we have collected, we might want to test it on a new sample, perhaps collected by a different person or at a different time, to get a realistic expectation of how well that machine learning algorithm will perform on new data.
So almost always, “out of sample error” is what you care about. If you see an error rate reported only on the data the machine-learning algorithm was built on, you know it is very optimistic, and it probably won’t reflect how the model will perform in real practice.
“In sample error” is always less than “out of sample error”, so that’s something to keep in mind, and the reason is overfitting. Basically, you’re matching your algorithm to the data you have at hand, and you’re matching it a little bit too well.
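As a concrete illustration of that gap, here is a minimal sketch, not from the lecture, that splits the spam data 70/30, fits a one-variable logistic regression on capitalAve, and compares the resubstitution accuracy with the accuracy on the held-out rows; the split proportion and the glm model are assumptions chosen just for this demo.

library(kernlab); data(spam); set.seed(333)
inTrain  <- sample(nrow(spam), size = floor(0.7*nrow(spam)))   # assumed 70/30 split
training <- spam[inTrain,]
testing  <- spam[-inTrain,]
fit <- glm(type ~ capitalAve, data = training, family = binomial)   # model P(spam) as a function of capitalAve
predLabel <- function(model, newdata) {
  # call a message spam whenever the fitted probability exceeds 0.5
  ifelse(predict(model, newdata, type = "response") > 0.5, "spam", "nonspam")
}
mean(predLabel(fit, training) == training$type)   # in sample (resubstitution) accuracy, typically a bit optimistic
mean(predLabel(fit, testing) == testing$type)     # out of sample accuracy, closer to what new data would show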
library(kernlab); data(spam); set.seed(333)
smallSpam <- spam[sample(dim(spam)[1],size=10),]   # random sample of 10 messages
spamLabel <- (smallSpam$type=="spam")*1 + 1        # 1 = nonspam (black), 2 = spam (red)
plot(smallSpam$capitalAve,col=spamLabel)           # average run of capital letters, colored by type
So we might want to build a predictor, based on the average run of capital letters, for whether a message is spam or ham. One thing we could do is build a predictor that says “if you have a lot of capitals then you’re a spam message, and if you don’t then you’re a non-spam message”. Here’s what that rule could look like: if your capital average is above 2.7 we’re going to call you spam, and if it’s below 2.40 you’re classified as non-spam. Then one more thing we can do is actually try to train this algorithm very, very well so that it predicts perfectly on this data set.
So if we go back to this plot of the different values, you can see there’s one spam message down in the lower right hand corner (colored red) whose capital average is a little bit lower than the highest non-spam value (colored black). So we could build a prediction algorithm that captures that spam value as well.
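If you want to see where those cutoffs sit relative to the points, one option (a small sketch, not from the lecture) is to add horizontal reference lines to the plot above at the values that rule1 below uses:

abline(h = 2.70, col = "blue")            # above this line the rule calls a message spam
abline(h = 2.45, col = "blue", lty = 2)   # the 2.40-2.45 band is carved out as spam to catch the odd point
abline(h = 2.40, col = "blue", lty = 2)   # below this line the rule calls a message nonspam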
rule1 <- function(x){
  prediction <- rep(NA,length(x))
  prediction[x > 2.7] <- "spam"                    # high capital average: spam
  prediction[x < 2.40] <- "nonspam"                # low capital average: nonspam
  prediction[(x >= 2.40 & x <= 2.45)] <- "spam"    # narrow band carved out to catch the one odd spam point
  prediction[(x > 2.45 & x <= 2.70)] <- "nonspam"
  return(prediction)
}
table(rule1(smallSpam$capitalAve),smallSpam$type)

          nonspam spam
  nonspam       5    0
  spam          0    5
rule2 <- function(x){
  prediction <- rep(NA,length(x))
  prediction[x > 2.8] <- "spam"       # single cutoff: above 2.8 is spam
  prediction[x <= 2.8] <- "nonspam"   # everything else is nonspam
  return(prediction)
}
table(rule2(smallSpam$capitalAve),smallSpam$type)

          nonspam spam
  nonspam       5    1
  spam          0    4
table(rule1(spam$capitalAve),spam$type)

          nonspam spam
  nonspam    2141  588
  spam        647 1225

table(rule2(spam$capitalAve),spam$type)

          nonspam spam
  nonspam    2224  642
  spam        564 1171
mean(rule1(spam$capitalAve)==spam$type)
[1] 0.7315801
mean(rule2(spam$capitalAve)==spam$type)
[1] 0.7378831
So then we can apply each rule to all the spam data, in other words to all of the values, not just the ones we had in the small training set, and those are the results you get above. We can also count up the total number of messages each rule classifies correctly:
sum(rule1(spam$capitalAve)==spam$type)
[1] 3366
sum(rule2(spam$capitalAve)==spam$type)
[1] 3395
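Putting the in sample and out of sample numbers side by side makes the point of this section explicit; the short comparison below is a sketch that simply reuses smallSpam, rule1, and rule2 from above.

mean(rule1(smallSpam$capitalAve) == smallSpam$type)   # in sample accuracy: 1.0, rule1 is perfect on its own data
mean(rule2(smallSpam$capitalAve) == smallSpam$type)   # in sample accuracy: 0.9, rule2 misses one training message
mean(rule1(spam$capitalAve) == spam$type)             # accuracy on the full data set: about 0.73
mean(rule2(spam$capitalAve) == spam$type)             # accuracy on the full data set: about 0.74, the simpler rule wins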
http://en.wikipedia.org/wiki/Overfitting
So, what’s the reason that the simpler rule actually does better than the more complicated rule? The reason is overfitting.
In every data set we have two parts: the signal, which is the part we’re trying to use to predict, and the noise, which is just random variation in the data set that we get because the data are measured noisily. The goal of a predictor is to find the signal and ignore the noise.

In any small data set you can always build a perfect in-sample predictor, just like we did with that spam data set: you can always carve up the prediction space in that small data set to capture every single quirk of the data. But when you do that, you capture both the signal and the noise. For example, in that training set there was one spam value that had a slightly lower capital average than some of the non-spam values, but that was only because we randomly picked a data set where that value happened to be low. So that predictor won’t necessarily perform as well on new samples, because we’ve tuned it too tightly to the observed training set.
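To make the signal-versus-noise point concrete outside the spam example, here is a small simulated sketch; the sine-wave signal, the noise level, and the degree-3 versus degree-10 polynomial fits are all assumptions chosen just for illustration.

set.seed(333)
n <- 30
x <- runif(n); xNew <- runif(n)
signal <- function(x) sin(2*pi*x)                  # the true signal
y    <- signal(x)    + rnorm(n, sd = 0.3)          # training data = signal + noise
yNew <- signal(xNew) + rnorm(n, sd = 0.3)          # new data: same signal, different noise
simpleFit   <- lm(y ~ poly(x, 3))                  # low-flexibility model
flexibleFit <- lm(y ~ poly(x, 10))                 # high-flexibility model that can chase the noise
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(y, predict(simpleFit))                        # in sample error
rmse(y, predict(flexibleFit))                      # in sample error: the flexible model always looks at least as good
rmse(yNew, predict(simpleFit,   data.frame(x = xNew)))   # out of sample error
rmse(yNew, predict(flexibleFit, data.frame(x = xNew)))   # out of sample error: the flexible model usually does worse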