This content was created from the Coursera course on introduction to machine learning.
Related work can be found on my website.
In-sample error: The error rate you get on the same dataset you used to build your model.
Out-of-sample error: The error rate you get on a new dataset. Sometimes called the generalization error.
Key Ideas
Sometimes we deliberately give up some accuracy on the data we have in order to gain more generalized accuracy on new data, which makes the model more robust and more widely applicable.
Let's try building a predictor on the basis of how many capital letters are in the email. We can see that there seem to be more capital letters in spam emails within this sample.
library(kernlab); data(spam); set.seed(333)
smallSpam <- spam[sample(dim(spam)[1], size = 10), ]
spamLabel <- (smallSpam$type == 'spam') * 1 + 1
plot(smallSpam$capitalAve, col = spamLabel) # Avg # of capital letters in each email
However, looking at the plot we see one point (the last index) where a spam email has slightly fewer capital letters than the highest non-spam email. We can add a prediction rule for that point to make the prediction perfect on this sample.
In code…
rule1 <- function(x) {
  pred <- rep(NA, length(x)) # empty vector, one prediction per email
  pred[x > 2.70] <- 'spam'
  pred[x < 2.40] <- 'nonspam'
  pred[x >= 2.40 & x <= 2.45] <- 'spam'
  pred[x > 2.45 & x <= 2.70] <- 'nonspam'
  return(pred)
}
table(rule1(smallSpam$capitalAve), smallSpam$type)
##
## nonspam spam
## nonspam 5 0
## spam 0 5
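Equivalently, we can check the in-sample accuracy directly (a quick check, reusing the objects defined above):
mean(rule1(smallSpam$capitalAve) == smallSpam$type) # 1: zero in-sample error on this sample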
Hurrah! A perfect prediction algorithm! But is this overfitting the data?
Yes it is…
Let's remove the third and fourth statements in rule1, which catered too closely to our sample data, and just use a single cutoff.
rule2 <- function(x) {
  pred <- rep(NA, length(x)) # empty vector, one prediction per email
  pred[x > 2.8] <- 'spam'
  pred[x <= 2.8] <- 'nonspam'
  return(pred)
}
table(rule2(smallSpam$capitalAve), smallSpam$type)
##
## nonspam spam
## nonspam 5 1
## spam 0 4
Hey! This isn't a perfect in-sample model any more, is it? Let's apply both rules to the full spam dataset and review the predictions.
sum(rule1(spam$capitalAve) == spam$type) # Count the emails rule1 classifies correctly
## [1] 3366
sum(rule2(spam$capitalAve) == spam$type) # Count the emails rule2 classifies correctly
## [1] 3395
It turns out that the simpler rule classifies slightly more of the full dataset correctly than the complex (overfitted) rule.
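To put those counts in perspective (a quick check; the spam dataset contains 4601 emails):
mean(rule1(spam$capitalAve) == spam$type) # ~0.732 accuracy for the overfitted rule
mean(rule2(spam$capitalAve) == spam$type) # ~0.738 accuracy for the simpler rule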
__OVERFITTING__
Data have two parts: signal and noise. The goal of the predictor is to find the signal. We can always create a perfect in-sample predictor, but in doing so we capture both the signal and the noise, and the predictor will not perform as well on new data.
A typical prediction study design looks like this (a code sketch of the splitting step follows the list):
Define your error rate.
Split the data into training, testing, and (optionally) validation sets.
If you have a large sample size: use separate training, testing, and validation sets.
If you have a medium sample size: use training and testing sets only.
If you have a small sample size: do cross-validation and report the caveat of the small sample size.
On the training set, pick features.
On the training set, pick the prediction function.
If no validation set exists, apply the model to the test set exactly once.
If a validation set exists, apply the model to the test set to refine it, then apply it to the validation set exactly once.
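As a sketch of the splitting step (assuming the caret package, which is not used elsewhere in these notes), a 60/40 training/testing split of the spam data might look like this:
library(caret); library(kernlab); data(spam)
set.seed(123)
inTrain <- createDataPartition(y = spam$type, p = 0.6, list = FALSE) # 60% of rows for training
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
dim(training); dim(testing) # roughly a 60/40 split of the 4601 emails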
In the scope of our prediction results, positive = “identified” and negative = “rejected”.
In a medical example: a true positive is a sick person correctly diagnosed as sick, a false positive is a healthy person incorrectly identified as sick, a true negative is a healthy person correctly identified as healthy, and a false negative is a sick person incorrectly identified as healthy.
More quantitatively, we can set up a 2x2 table of test result versus true disease status and compute the probabilities below (a small code sketch follows the definitions):
                 disease    no disease
test positive      TP           FP
test negative      FN           TN
Sensitivity = Pr( positive test | disease )
If you actually have the disease, what is the probability the test picks it up? TP / (TP + FN): true positives divided by everyone who truly has the disease.
Specificity = Pr( negative test | no disease )
If you do not have the disease, what is the probability the test clears you? TN / (FP + TN): true negatives divided by everyone who truly does not have the disease.
Positive Predictive Value = Pr( disease | positive test )
What fraction of the people we called diseased are actually diseased? TP / (TP + FP): true positives divided by all positive tests.
Negative Predictive Value = Pr( no disease | negative test )
What fraction of the people we called disease-free are actually disease-free? TN / (FN + TN): true negatives divided by all negative tests.
Accuracy = Pr( correct outcome )
True positives and true negatives added up: (TP + TN) / (TP + FP + FN + TN).
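As a minimal sketch (not course code), all five quantities can be computed from the four cell counts of the 2x2 table:
confusionMetrics <- function(TP, FP, FN, TN) {
  c(sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP),
    ppv         = TP / (TP + FP),
    npv         = TN / (TN + FN),
    accuracy    = (TP + TN) / (TP + FP + FN + TN))
}
confusionMetrics(TP = 4, FP = 0, FN = 1, TN = 5) # the rule2 counts from the small sample above, treating spam as 'positive'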
Let's look at a real example
Assume that a disease has a 0.1% prevalence in the population, and we have a test kit that works with 99% specificity and 99% sensitivity.
What is the probability of a person having the disease, given the test result is positive?
Suppose a population of 100,000 people; the counting argument is also worked through in code below.
Sensitivity -> 99 / (99 + 1) = 99%
Specificity -> 98901 / (999 + 98901) = 99%
Positive predictive value -> 99 / (99 + 999) = ~9%
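To make the arithmetic explicit, here is that counting argument as a small sketch (hypothetical numbers from the example above, not course code):
population <- 100000 # hypothetical population from the example
prevalence <- 0.001 # 0.1% of people have the disease
sensitivity <- 0.99
specificity <- 0.99
diseased <- population * prevalence # 100 people with the disease
healthy <- population - diseased # 99,900 people without it
TP <- diseased * sensitivity # 99 true positives
FN <- diseased - TP # 1 false negative
TN <- healthy * specificity # 98,901 true negatives
FP <- healthy - TN # 999 false positives
TP / (TP + FP) # positive predictive value, roughly 0.09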
If we tested an at-risk sub-population instead, the prevalence would be much higher and we would get a much better positive predictive value. The lesson here is that we have to be careful when predicting rare events, because we will always have many false positives.
Plotting sensitivity against 1 - specificity as we vary the cutoff gives a receiver operating characteristic (ROC) curve, which can help us pick cutoffs for our models. We will look more into that later.
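As a sketch (not from the course), we can trace out an ROC curve by hand for the capitalAve predictor by sweeping the cutoff:
library(kernlab); data(spam)
truth <- spam$type == 'spam'
cutoffs <- seq(min(spam$capitalAve), max(spam$capitalAve), length.out = 100)
sens <- sapply(cutoffs, function(cut) mean(spam$capitalAve[truth] > cut)) # true positive rate at each cutoff
fpr <- sapply(cutoffs, function(cut) mean(spam$capitalAve[!truth] > cut)) # false positive rate at each cutoff
plot(fpr, sens, type = 'l', xlab = '1 - specificity', ylab = 'sensitivity')
abline(0, 1, lty = 2) # the diagonal corresponds to random guessing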
Recall that error estimates on the training set are always optimistic. Ideally, we should evaluate the performance of our final model only once, on the test set, to get an honest picture of its real-world predictive power.
Cross-validation, where we repeatedly split the training set itself into sub-training and sub-testing sets, is useful for: 1. Picking variables to include in the model. 2. Picking the type of prediction function. 3. Picking the parameters of the prediction function. 4. Comparing different predictors.
There are many ways to do this, such as random subsampling, k-fold splitting, and leave-one-out.
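For example, a minimal sketch of building k-fold splits on the spam data (assuming the caret package):
library(caret); library(kernlab); data(spam)
set.seed(32323)
folds <- createFolds(y = spam$type, k = 10, list = TRUE, returnTrain = TRUE)
sapply(folds, length) # each element holds the training indices for one fold (about 90% of the rows)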
Try to use data that is exactly the same, or very similar to what you are actually predicting. For example, use previous movie ratings to predict new movie ratings.
If your data is not that similar to the quantity you are actually predicting, be sure to understand exactly how your data serves as a predictor in your model: is it just a coincidence, or does your predictor describe something fundamental about the response's underlying process?