This content was created from the Coursera course on introduction to machine learning.
Related work can be found on my website.
In-sample error: The error rate you get on the same dataset you used to build your model.
Out-of-sample error: The error rate you get on a new dataset. Sometimes called the generalization error.
Key Ideas
Sometimes we deliberately give up some accuracy on the data we have in order to gain more generalized accuracy on new data, which makes the model more robust and more widely applicable.
Let's try building a predictor on the basis of how many capital letters are in the email. We can see that there seem to be more capital letters in spam emails within this sample.
library(kernlab); data(spam); set.seed(333)
smallSpam <- spam[sample(dim(spam)[1], size = 10), ]
spamLabel <- (smallSpam$type == 'spam') * 1 + 1
plot(smallSpam$capitalAve, col = spamLabel) # Avg # of capital letters in each email
However, looking at the plot we see one point (the last index) where a spam email has slightly fewer capital letters than the highest non-spam email. We can add a prediction rule for that point to make the prediction perfect on this sample.
In code…
rule1 <- function(x) {
  pred <- rep(NA, length(x)) # empty vector, one prediction per email
  pred[x > 2.70] <- 'spam'
  pred[x < 2.40] <- 'nonspam'
  pred[x >= 2.40 & x <= 2.45] <- 'spam'
  pred[x > 2.45 & x <= 2.70] <- 'nonspam'
  return(pred)
}
table(rule1(smallSpam$capitalAve), smallSpam$type)
##
## nonspam spam
## nonspam 5 0
## spam 0 5
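Equivalently, we can check the in-sample accuracy directly (a quick check, reusing the objects defined above):
mean(rule1(smallSpam$capitalAve) == smallSpam$type) # 1: zero in-sample error on this sample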
Hurrah! A perfect prediction algorithm! But is this overfitting the data?
Yes it is…
Let's remove the third and fourth statements in rule1, which catered too closely to our sample data, and just use a single cutoff.
rule2 <- function(x) {
  pred <- rep(NA, length(x)) # empty vector, one prediction per email
  pred[x > 2.8] <- 'spam'
  pred[x <= 2.8] <- 'nonspam'
  return(pred)
}
table(rule2(smallSpam$capitalAve), smallSpam$type)
##
## nonspam spam
## nonspam 5 1
## spam 0 4
Hey! This isn't a perfect in-sample model any more, is it? Let's apply both rules to the full spam dataset and review the predictions.
sum(rule1(spam$capitalAve) == spam$type) # Count the emails rule1 classifies correctly
## [1] 3366
sum(rule2(spam$capitalAve) == spam$type) # Count the emails rule2 classifies correctly
## [1] 3395
It turns out that the simpler rule classifies slightly more of the full dataset correctly than the complex (overfitted) rule.
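To put those counts in perspective (a quick check; the spam dataset contains 4601 emails):
mean(rule1(spam$capitalAve) == spam$type) # ~0.732 accuracy for the overfitted rule
mean(rule2(spam$capitalAve) == spam$type) # ~0.738 accuracy for the simpler rule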
__OVERFITTING__
Data have two parts: signal and noise. The goal of the predictor is to find the signal. We can always create a perfect in-sample predictor, but in doing so we capture both the signal and the noise, and the predictor will not perform as well on new data.
A typical prediction study design looks like this (a code sketch of the splitting step follows the list):
Define your error rate.
Split the data into training, testing, and (optionally) validation sets.
If you have a large sample size: use separate training, testing, and validation sets.
If you have a medium sample size: use training and testing sets only.
If you have a small sample size: do cross-validation and report the caveat of the small sample size.
On the training set, pick features.
On the training set, pick the prediction function.
If no validation set exists, apply the model to the test set exactly once.
If a validation set exists, apply the model to the test set to refine it, then apply it to the validation set exactly once.
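As a sketch of the splitting step (assuming the caret package, which is not used elsewhere in these notes), a 60/40 training/testing split of the spam data might look like this:
library(caret); library(kernlab); data(spam)
set.seed(123)
inTrain <- createDataPartition(y = spam$type, p = 0.6, list = FALSE) # 60% of rows for training
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
dim(training); dim(testing) # roughly a 60/40 split of the 4601 emails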
In the scope of our prediction results, positive = “identified” and negative = “rejected”.
In a medical example: a true positive is a sick person correctly diagnosed as sick, a false positive is a healthy person incorrectly identified as sick, a true negative is a healthy person correctly identified as healthy, and a false negative is a sick person incorrectly identified as healthy.
More quantitatively, we can set up a 2x2 table of test result versus true disease status and compute the probabilities below (a small code sketch follows the definitions):
                 disease    no disease
test positive      TP           FP
test negative      FN           TN
Sensitivity = Pr( positive test | disease )
If you actually have the disease, what is the probability the test picks it up? TP / (TP + FN): true positives divided by everyone who truly has the disease.
Specificity = Pr( negative test | no disease )
If you do not have the disease, what is the probability the test clears you? TN / (FP + TN): true negatives divided by everyone who truly does not have the disease.
Positive Predictive Value = Pr( disease | positive test )
What fraction of the people we called diseased are actually diseased? TP / (TP + FP): true positives divided by all positive tests.
Negative Predictive Value = Pr( no disease | negative test )
What fraction of the people we called disease-free are actually disease-free? TN / (FN + TN): true negatives divided by all negative tests.
Accuracy = Pr( correct outcome )
True positives and true negatives added up: (TP + TN) / (TP + FP + FN + TN).
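As a minimal sketch (not course code), all five quantities can be computed from the four cell counts of the 2x2 table:
confusionMetrics <- function(TP, FP, FN, TN) {
  c(sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP),
    ppv         = TP / (TP + FP),
    npv         = TN / (TN + FN),
    accuracy    = (TP + TN) / (TP + FP + FN + TN))
}
confusionMetrics(TP = 4, FP = 0, FN = 1, TN = 5) # the rule2 counts from the small sample above, treating spam as 'positive'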
Let's look at a real example
Assume that a disease has a 0.1% prevalence in the population, and we have a test kit that works with 99% specificity and 99% sensitivity.
What is the probability of a person having the disease, given the test result is positive?
Suppose a population of 100,000 people; the counting argument is also worked through in code below.
Sensitivity -> 99 / (99 + 1) = 99%
Specificity -> 98901 / (999 + 98901) = 99%
Positive predictive value -> 99 / (99 + 999) = ~9%
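To make the arithmetic explicit, here is that counting argument as a small sketch (hypothetical numbers from the example above, not course code):
population <- 100000 # hypothetical population from the example
prevalence <- 0.001 # 0.1% of people have the disease
sensitivity <- 0.99
specificity <- 0.99
diseased <- population * prevalence # 100 people with the disease
healthy <- population - diseased # 99,900 people without it
TP <- diseased * sensitivity # 99 true positives
FN <- diseased - TP # 1 false negative
TN <- healthy * specificity # 98,901 true negatives
FP <- healthy - TN # 999 false positives
TP / (TP + FP) # positive predictive value, roughly 0.09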
If we tested an at-risk sub-population instead, the prevalence would be much higher and we would get a much better positive predictive value. The lesson here is that we have to be careful when predicting rare events, because we will always have many false positives.
Plotting sensitivity against 1 - specificity as we vary the cutoff gives a receiver operating characteristic (ROC) curve, which can help us pick cutoffs for our models. We will look more into that later.
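As a sketch (not from the course), we can trace out an ROC curve by hand for the capitalAve predictor by sweeping the cutoff:
library(kernlab); data(spam)
truth <- spam$type == 'spam'
cutoffs <- seq(min(spam$capitalAve), max(spam$capitalAve), length.out = 100)
sens <- sapply(cutoffs, function(cut) mean(spam$capitalAve[truth] > cut)) # true positive rate at each cutoff
fpr <- sapply(cutoffs, function(cut) mean(spam$capitalAve[!truth] > cut)) # false positive rate at each cutoff
plot(fpr, sens, type = 'l', xlab = '1 - specificity', ylab = 'sensitivity')
abline(0, 1, lty = 2) # the diagonal corresponds to random guessing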
Recall that error estimates on the training set are always optimistic. Ideally, we should evaluate the performance of our final model only once, on the test set, to get an honest picture of its real-world predictive power.
Cross-validation, where we repeatedly split the training set itself into sub-training and sub-testing sets, is useful for: 1. Picking variables to include in the model. 2. Picking the type of prediction function. 3. Picking the parameters of the prediction function. 4. Comparing different predictors.
There are many ways to do this, such as random subsampling, k-fold splitting, and leave-one-out.
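For example, a minimal sketch of building k-fold splits on the spam data (assuming the caret package):
library(caret); library(kernlab); data(spam)
set.seed(32323)
folds <- createFolds(y = spam$type, k = 10, list = TRUE, returnTrain = TRUE)
sapply(folds, length) # each element holds the training indices for one fold (about 90% of the rows)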
Try to use data that is exactly the same, or very similar to what you are actually predicting. For example, use previous movie ratings to predict new movie ratings.
If your data is not that similar to the quantity you are actually predicting, be sure to understand exactly how your data serves as a predictor in your model: is it just a coincidence, or does your predictor describe something fundamental about the response's underlying process?