Model analysis and improvement

RPub #4 in a series on data mining with R by

Karen Mazidi

See more at http://rpubs.com/kjmazidi

Further information can be found on my blog: http://www.karenmazidi.com/blog.html

Before getting to the content of this RPub, let me say a word about ggplot2: amazing. This graphics package lets you create beautiful graphs and is fairly easy to use, though there is a learning curve. There are many good tutorials on the web and it's worth your time to go through one or more. It's called “ggplot” because it implements a grammar of graphics that lets you specify the data, aesthetics and geometric objects separately.

# suppress warning messages for this demo, not a good idea in practice
options(warn=-1)
# Here are some libraries we will use in this RPub
require(ggplot2)
## Loading required package: ggplot2

In creating models such as the linear regression models we created in previous RPubs, we want to create a model that will generalize well to other data sets. That is, we don't want our model to be overly influenced by the data set on which it was built. There are two primary considerations here. First, we want to select predictors that predict well for varied data sets. Second, we want a model that is not biased by outliers or influential cases. Before getting into these two issues we look at the bias-variance tradeoff.

Bias-variance tradeoff

We expect to have more errors on our test set than on our training set. The expected test mean squared error (MSE) can be broken into 3 components:

  1. the variance of the function, which is the amount by which the function would change if it were built using different training data. Variance decreases with more training data and increases with more complicated classifiers.
  2. the squared bias of the function, where bias is understood to be error inevitably introduced by trying to model real life with mathematical assumptions, such as that X and Y have a linear relationship. Bias measures how well a model can fit the true data distribution.
  3. the variance of the error terms \( \epsilon \), often called the irreducible error
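
In symbols, the expected test MSE at a point \( x_0 \) decomposes into these three pieces:

\[ E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon) \]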

Ideally we would like a model that has low variance and low bias. Generally, more complex methods such as a polynomial fit, SVM and other techniques we will look at later, have less bias than the rather simple linear model we have examined so far. But unfortunately, more complex methods have greater variance.

The following chart illustrates this point. The curved line hits all points perfectly but we would not expect it to have good results on new points. The curved line has high variance and low bias. In contrast we expect the straight line to have lower variance but higher bias.

If you have high bias, training and test error will be high. If you have high variance, training error will be low but the test error will be high.

As we make a model more complicated, bias decreases and variance increases.

x <- 1:5                         # five points with a roughly linear trend
y <- c(4.9,6.3,6.8,8.2,9.1)
model <- predict(lm(y~x))        # fitted values from a simple linear regression
# plot a flexible smoothed curve (low bias, high variance), the horizontal mean
# line (high bias, low variance), and the fitted regression line for comparison
qplot(x, y, geom='smooth', span=0.5) + geom_point() + geom_hline(aes(yintercept=mean(y))) +
  geom_line(aes(y=model))

(Plot: the five data points with the flexible smoothed curve, the horizontal mean line, and the fitted regression line.)

Methods of selecting predictors

We've gone through quite a bit of analysis looking at adding the Gender variable, and we haven't even talked about the Age variable yet. Imagine a data set with dozens of possible predictor variables. How can we go about selecting predictors in a principled manner? There are a few popular approaches.

Hierarchical approach

In a hierarchical approach, predictors are added one at a time to a model in the order of their presumed importance. This presumed importance could be based on prior analysis.
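
As a small sketch (using the built-in mtcars data as a stand-in, with weight assumed to be the more important predictor), hierarchical entry might look like this:

# hierarchical entry: add predictors in order of presumed importance
m1 <- lm(mpg ~ wt, data=mtcars)        # most important predictor first
m2 <- lm(mpg ~ wt + hp, data=mtcars)   # then the next one
anova(m1, m2)                          # did adding hp significantly improve the fit?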

Forced entry

The forced entry approach throws all the variables into the model at once.
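
In R this is just a matter of putting every candidate predictor on the right-hand side of the formula; a sketch with the built-in mtcars data as a stand-in:

# forced entry: all candidate predictors enter the model at once
model_all <- lm(mpg ~ ., data=mtcars)
summary(model_all)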

Stepwise methods

Some statisticians frown upon the stepwise approach because it can be heavily influenced by the particular data set and therefore may not lead to replicable research. Nevertheless it is still commonly used. In forward stepwise selection, variables are added to the model one at a time; the variable chosen at each step is the one with the highest correlation with the response variable, and it is retained if it improves the model. This process continues until the AIC stops dropping. The backward method starts with all the predictors and removes them one at a time. The both method starts the same way as the forward method, but after each addition it reevaluates the predictors already in the model to see if any can be eliminated. The backward method is often preferred because the forward method has a greater risk of making a Type II error - missing a useful predictor. This happens because of suppressor effects, which occur when a predictor has an effect only when another variable is held constant.
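
A sketch of all three directions using R's built-in step() function, again with mtcars as a stand-in data set (step() uses AIC by default):

# stepwise selection with step(), which adds/drops terms until AIC stops improving
null <- lm(mpg ~ 1, data=mtcars)    # intercept-only model
full <- lm(mpg ~ ., data=mtcars)    # all candidate predictors
fwd  <- step(null, scope=formula(full), direction="forward")
bwd  <- step(full, direction="backward")
both <- step(null, scope=formula(full), direction="both")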

All subsets method

This method tries every subset of predictors to see which one gives the best fit according to a metric called Mallows' \( C_p \). The number of possible combinations can be quite large, so this technique was not widely used until PCs became powerful enough to handle it.
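
The regsubsets() function in the leaps package (which must be installed separately) performs this exhaustive search; a sketch with mtcars as a stand-in:

# all subsets regression with the leaps package
library(leaps)
subsets <- regsubsets(mpg ~ ., data=mtcars, nvmax=10)  # best model of each size
summary(subsets)$cp                                    # Mallows' Cp for each size; smaller is better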

How to choose the method

In choosing a method, be aware that a model with fewer predictors is preferable to a model with many predictors. Also keep in mind the meaning of your predictors. Do they make sense? If we had a predictor recording the number of pounds of chocolate eaten every year and it happened to correlate with brain size in our sample, would that relationship be likely to hold in general? This is what leads to those flaky science headlines you see in the news: Eating chocolate gives you a big head! We have already talked about the potential pitfalls of stepwise methods. If you do use them, cross-validating your results is a good idea.
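
Cross-validation can be done with packages such as caret or boot, but a bare-bones k-fold loop is easy to write by hand. Here is a sketch with base R and mtcars; the chosen model mpg ~ wt + hp is just an example:

# simple 5-fold cross-validation of a chosen model
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out=nrow(mtcars)))
cv_mse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit <- lm(mpg ~ wt + hp, data=train)
  mean((test$mpg - predict(fit, newdata=test))^2)
})
mean(cv_mse)    # estimated test MSE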

A related convenience is R's update() function, which lets you add or remove predictors from an existing model without respecifying the entire formula.
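
A quick sketch of update() in action, again on mtcars:

# update() tweaks an existing model instead of rebuilding it from scratch
m1 <- lm(mpg ~ wt, data=mtcars)
m2 <- update(m1, . ~ . + hp)   # add a predictor
m3 <- update(m2, . ~ . - wt)   # remove one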

Generalization

Let's suppose we get reasonably good results on our model on the data we used. Can we conclude that the model tells us something important about the correlations between the variables? Not necessarily. We need to explore whether the model would generalize to other data sets.

Two types of data points can prevent a model from generalizing well: outliers and influential cases. An outlier is a data point that differs significantly from the other data points. We can detect outliers by looking at their residuals, which quantify how far the observed outcome is from the outcome predicted by the model.

# create a synthetic data set to demo outliers and influential cases
# show how it biases the mean and the regression line
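
Here is a minimal sketch of such a demo, with arbitrary synthetic values: twenty points with a clear linear trend, plus one extreme point, comparing the mean and the fitted line with and without it.

# synthetic data: a clear linear trend plus one extreme point
set.seed(1)
x <- 1:20
y <- 2 + 0.5*x + rnorm(20, sd=0.5)
x_out <- c(x, 10)     # one extra point with an ordinary x value...
y_out <- c(y, 25)     # ...but an extreme y value
c(mean(y), mean(y_out))                            # the outlier pulls the mean upward
rbind(coef(lm(y ~ x)), coef(lm(y_out ~ x_out)))    # and changes the regression line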

Detecting outliers

A raw residual is called a normal or unstandardized residual because it is in the same units as the outcome. If we divide a residual by an estimate of its standard deviation, we get a standardized residual, which is easier to interpret. We can also convert an unstandardized residual into a z-score by subtracting the mean and then dividing by the standard deviation of all the residuals. The distribution of these z-scores will have a mean of 0 and a standard deviation of 1.

How can we use these z-scores? We know that in a normally distributed sample, 95% of z-scores should be in the range -1.96 to +1.96, and 99% of them should be in the range +/- 2.58. If more than 1% of our data points have |z| > 2.58, then our model doesn't really fit the data well. Likewise, if more than 5% have |z| > 1.96, the model may not be fitting well.
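
In R, rstandard() returns the standardized residuals of a fitted model, so the check is quick; a sketch (the model and data are stand-ins):

# flag potential outliers with standardized residuals
fit <- lm(mpg ~ wt + hp, data=mtcars)
z <- rstandard(fit)
mean(abs(z) > 1.96)    # should be no more than about 5% of cases
mean(abs(z) > 2.58)    # should be no more than about 1% of cases
which(abs(z) > 1.96)   # the specific cases worth a closer look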

Influential cases

An outlier is a data point with a y value that is extreme with respect to the other data points. A related idea is an influential case: a data point that unduly influences any part of the regression analysis. One example is a data point whose X values differ greatly from the rest of the data. Points that lie at the extreme ends of the X axis are called leverage points. A leverage point is not necessarily an influential case if it doesn't change the line.
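
Base R provides hatvalues() for leverage and cooks.distance() for influence; a sketch using the same stand-in model (the cutoffs shown are common rules of thumb, not hard rules):

# leverage and influence diagnostics
fit <- lm(mpg ~ wt + hp, data=mtcars)
lev <- hatvalues(fit)          # leverage: how extreme the X values are
cd  <- cooks.distance(fit)     # influence on the fitted coefficients
which(lev > 2*mean(lev))       # rule of thumb for high leverage
which(cd > 4/nrow(mtcars))     # rule of thumb for influential cases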