Machine Learning: Initial Considerations

This content was created from the Coursera course on introduction to machine learning.

Related work can be found on my website.

1.0 In Sample vs Out of Sample Error

Key Ideas

  1. We should care about out of sample error only.
  2. In sample error < out of sample error.
  3. The reason is overfitting: we match the algorithm a little too closely to the data we have (which contains some noise) rather than to the underlying trends. When we then apply the model to new, real-world data, we do not capture the general pattern as well as we potentially could.

Sometimes we are willing to give up some accuracy on the data we have in order to gain accuracy on new data, which makes the predictor more robust and more broadly applicable.

Let's try building a predictor on the basis of how many capital letters are in the email. We can see that there seem to be more capital letters in spam emails within this sample.

library(kernlab); data(spam); set.seed(333)

smallSpam <- spam[sample(dim(spam)[1], size=10), ]   # random sample of 10 emails
spamLabel <- (smallSpam$type=='spam')*1 + 1          # 1 = nonspam, 2 = spam (plot colour)

plot(smallSpam$capitalAve, col=spamLabel)            # avg number of capital letters per email

(Plot: average number of capital letters per email in the small sample, with spam messages highlighted in red.)

Prediction Rule

For example: we could pick a cutoff and predict "spam" whenever capitalAve is above it, and "nonspam" whenever it is below.

However, we see that there is one point (the last index) where a spam message has slightly fewer capital letters than the highest non-spam message, so no single cutoff separates the classes perfectly. We can add extra conditions to the prediction rule for that region to get a perfect prediction on this sample.

In code…

rule1 <- function(x){
  pred <- rep(NA, length(x))              # vector of NAs, same length as x
  pred[x > 2.7] <- 'spam'
  pred[x < 2.4] <- 'nonspam'
  pred[(x >= 2.4 & x <= 2.45)] <- 'spam'
  pred[(x > 2.45 & x <= 2.70)] <- 'nonspam'
  return(pred)
}

table(rule1(smallSpam$capitalAve), smallSpam$type)
##          
##           nonspam spam
##   nonspam       5    0
##   spam          0    5

Hurrah! A perfect prediction algorithm! But is this overfitting the data?

Yes it is…

Let's just remove the third and fourth statements in rule1, which catered a bit too closely to our sample data.

rule2 <- function(x){
  pred <- rep(NA, length(x))              # vector of NAs, same length as x
  pred[x > 2.8] <- 'spam'
  pred[x <= 2.8] <- 'nonspam'
  return(pred)
}

table(rule2(smallSpam$capitalAve), smallSpam$type)
##          
##           nonspam spam
##   nonspam       5    1
##   spam          0    4

Hey! This isn't the most accurate model, is it? Let's apply our rules to the full spam dataset and review our predictions.

sum(rule1(spam$capitalAve) == spam$type) # count correct predictions on the full dataset
## [1] 3366
sum(rule2(spam$capitalAve) == spam$type) # count correct predictions on the full dataset
## [1] 3395

It turns out that the simpler rule predicts slightly better on the full dataset than the more complex (overfitted) rule.

What is happening here?

__OVERFITTING__

Data have two parts:

  1. Signal
  2. Noise

The goal of the predictor is to find the signal. We can always build a perfect in-sample predictor, but in doing so we capture both signal and noise, and the predictor will not perform as well on new data.

2.0 Prediction Study Design

  1. Define your error rate.

  2. Split the data into training, testing, and (optionally) validation sets (a rough split is sketched in code after this list).

    If you have a large sample size: 60% training, 20% testing, 20% validation.

    If you have a medium sample size: 60% training, 40% testing.

    If you have a small sample size: do cross validation and report the caveat of the small sample size.

  3. On the training set, pick features (using cross validation).

  4. On the training set, pick the prediction function (using cross validation).

  5. If no validation set exists, apply the model to the test set exactly once.

  6. If a validation set exists, apply the model to the test set and refine, then apply the final model to the validation set exactly once.
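
As a minimal sketch (my own code, not from the course), here is one way to do a 60/20/20 split of the spam data in base R, following the large-sample rule of thumb above; the object names (trainData, testData, validationData) are just placeholders, and caret::createDataPartition is a more polished alternative.

library(kernlab); data(spam); set.seed(333)

n   <- nrow(spam)
idx <- sample(n)                                               # shuffle the row indices once
trainData      <- spam[idx[1:floor(0.6*n)], ]                  # 60% training
testData       <- spam[idx[(floor(0.6*n)+1):floor(0.8*n)], ]   # 20% testing
validationData <- spam[idx[(floor(0.8*n)+1):n], ]              # 20% validation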

3.0 Types of errors

Positive = “identified”, negative = “rejected” in the scope of our prediction results.

In a medical testing example:

  1. True positive = a sick person correctly identified as sick.
  2. False positive = a healthy person incorrectly identified as sick.
  3. True negative = a healthy person correctly identified as healthy.
  4. False negative = a sick person incorrectly identified as healthy.

In a more quantitative way, we can set up a 2x2 table and compute the probabilities.

Sensitivity: Pr( positive test | disease )

If a person actually has the disease, what is the probability the test catches it? TP / (TP + FN) - true positives divided by everyone who truly has the disease.

Specificity: Pr( negative test | no disease )

TN / (FP + TN) - true negatives divided by everyone who truly does not have the disease.

Positive Predictive Value: Pr( disease | positive test )

What fraction of the people we flagged as diseased actually have the disease? TP / (TP + FP) - true positives divided by all positive test results.

Negative Predictive Value: Pr( no disease | negative test )

What fraction of the people we flagged as disease-free actually are disease-free? TN / (FN + TN) - true negatives divided by all negative test results.

Accuracy: Pr( correct outcome )

True positives and true negatives added up, divided by everything: (TP + TN) / (TP + FP + FN + TN).
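
As a quick illustration (my own sketch, not course code), here are these quantities computed in R from the rule2 table above, treating 'spam' as the positive class:

TP <- 4; FP <- 0; FN <- 1; TN <- 5               # cells of the rule2 table, 'spam' = positive

sensitivity <- TP / (TP + FN)                    # 0.8
specificity <- TN / (FP + TN)                    # 1.0
ppv         <- TP / (TP + FP)                    # 1.0
npv         <- TN / (FN + TN)                    # ~0.83
accuracy    <- (TP + TN) / (TP + FP + FN + TN)   # 0.9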

Let's look at a real example

Assume that a disease has a 0.1% prevalence in the population, and we have a test kit that works with 99% specificity and 99% sensitivity.

What is the probability of a person having the disease, given the test result is positive?

Suppose a population of 100,000 people, so 100 people have the disease and 99,900 do not.

Sensitivity –> 99 / (99 + 1) = 99%

Specificity –> 98901 / (999 + 98901) = 99%

Positive predictive value –> 99 / (99 + 999) = ~9%
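
A minimal sketch (my own code) that reproduces these numbers:

prevalence  <- 0.001
sensitivity <- 0.99
specificity <- 0.99
pop         <- 100000

diseased <- pop * prevalence              # 100 people truly have the disease
healthy  <- pop - diseased                # 99,900 people do not
TP <- diseased * sensitivity              # 99 true positives
FN <- diseased - TP                       # 1 false negative
TN <- healthy * specificity               # 98,901 true negatives
FP <- healthy - TN                        # 999 false positives
TP / (TP + FP)                            # positive predictive value, ~0.09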

If we instead tested an at-risk sub-population with a higher prevalence, the results would be closer to what we expect, and the positive predictive value would be much higher.

The lesson here is that we have to be careful when predicting rare events, because even a highly accurate test produces many false positives relative to true positives.

Using a plot of sensitivity against 1 - specificity, the receiver operating characteristic (ROC) curve, we can compare cutoffs for our models. We will look more into that later.

More here at wikipedia
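
As a rough sketch (not from the course), here is an ROC curve for the capitalAve cutoff rule on the full spam data, computed by hand over a grid of cutoffs; packages such as ROCR or pROC do this more conveniently.

library(kernlab); data(spam)

cutoffs <- seq(0, 10, by = 0.1)
roc <- t(sapply(cutoffs, function(cut) {
  pred <- ifelse(spam$capitalAve > cut, 'spam', 'nonspam')
  tp <- sum(pred == 'spam'    & spam$type == 'spam')
  fp <- sum(pred == 'spam'    & spam$type == 'nonspam')
  fn <- sum(pred == 'nonspam' & spam$type == 'spam')
  tn <- sum(pred == 'nonspam' & spam$type == 'nonspam')
  c(fpr = fp / (fp + tn), tpr = tp / (tp + fn))   # 1 - specificity, sensitivity
}))
plot(roc[, 'fpr'], roc[, 'tpr'], type = 'l',
     xlab = '1 - specificity', ylab = 'sensitivity')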

4.0 Cross validation

Recall that error estimates made on the training set are always optimistic. Ideally, we should evaluate the performance of our final model only once, on a held-out testing set, to get a realistic picture of its real-world predictive power.

  1. Use only the training set for building your algorithm.
  2. Split the training set further into sub-training and sub-testing sets.
  3. Build a model on the sub-training set.
  4. Evaluate it on the sub-testing set.
  5. Repeat the process and average the estimated errors (a minimal k-fold sketch follows this list).
  6. Use the original test set only for the final evaluation, to get a robust estimate of the out-of-sample error.
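
As a minimal k-fold sketch of steps 2-5 (my own code, using a simple logistic regression on capitalAve as a stand-in model):

library(kernlab); data(spam); set.seed(333)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(spam)))   # random fold assignment for each row

cvAccuracy <- sapply(1:k, function(i) {
  trainFold <- spam[folds != i, ]                    # build the model on k-1 folds
  testFold  <- spam[folds == i, ]                    # hold out the i-th fold
  fit  <- glm(type ~ capitalAve, data = trainFold, family = binomial)
  prob <- predict(fit, newdata = testFold, type = 'response')
  pred <- ifelse(prob > 0.5, 'spam', 'nonspam')
  mean(pred == testFold$type)                        # accuracy on the held-out fold
})
mean(cvAccuracy)                                     # averaged cross-validated accuracy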

Useful for:

  1. Picking variables to include in the model.
  2. Picking the type of prediction function.
  3. Picking the parameters in the prediction function.
  4. Comparing different predictors.

There are many ways to do the splitting: random subsampling, k-fold cross validation, and leave-one-out cross validation.

Considerations

  1. For time series data, the splits must be made in contiguous "chunks" to preserve the time ordering.
  2. For k-fold cross validation: a larger k gives less bias but more variance; a smaller k gives more bias but less variance.
  3. Random subsampling must be done without replacement; sampling with replacement is the bootstrap, which underestimates the error.
  4. If you cross-validate to pick predictors, you still need to estimate the error on completely independent data.

5.0 Some notes on the data you use

Try to use data that is exactly the same, or very similar to what you are actually predicting. For example, use previous movie ratings to predict new movie ratings.

If your data is less similar to the quantity you actually want to predict, be sure to understand exactly why it works as a predictor in your model: is it just a coincidence, or does it describe something fundamental about the response's underlying process?