First load the required packages and the faithful dataset:

library(caret)   # loading caret also brings in lattice and ggplot2
data(faithful)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2

Partition the data into training and test sets:

inTrain<-createDataPartition(y=faithful$waiting,p=.5,list=F)
trainFaith<-faithful[inTrain,]
testFaith<-faithful[-inTrain,]

Fit a simple linear regression of eruption time on waiting time.
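The model is \(\text{eruptions}_i = \beta_0 + \beta_1\,\text{waiting}_i + \varepsilon_i\), with the coefficients estimated by least squares: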

lm1<-lm(eruptions~waiting,data=trainFaith)
summary(lm1)
## 
## Call:
## lm(formula = eruptions ~ waiting, data = trainFaith)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1681 -0.3342  0.0069  0.3158  0.9989 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.773726   0.215983  -8.212 1.55e-13 ***
## waiting      0.074998   0.002984  25.132  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4817 on 135 degrees of freedom
## Multiple R-squared:  0.8239, Adjusted R-squared:  0.8226 
## F-statistic: 631.6 on 1 and 135 DF,  p-value: < 2.2e-16

For a visual check, plot the data with the fitted line overlaid:

plot(x=trainFaith$waiting,y=trainFaith$eruptions,pch=19,col="blue")
lines(x=trainFaith$waiting,y=lm1$fitted.values,lwd=3)

Though for this particular purpose, I’d usually use ggplot2:

ggplot(data=trainFaith, aes(x=waiting,y=eruptions)) + geom_point() + geom_smooth(method="lm")

Now use the fitted model to predict. First extract the coefficients and compute a prediction by hand, then confirm with the predict function:

intercept<-coef(lm1)[1]; slope<-coef(lm1)[2]
# Check at 80: 
intercept+slope*80
## (Intercept) 
##    4.226097
# better to use predict function --> same answer
newData<-data.frame(waiting=80)
predict(lm1,newData)
##        1 
## 4.226097
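
Written out, this is just the fitted line \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\): plugging in the estimates above gives \(-1.773726 + 0.074998 \times 80 \approx 4.2261\), matching predict() up to rounding of the printed coefficients.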

Now plot the fitted line against both the training data and the test data. (Side note: learn an efficient way to plot an lm in ggplot2 without using geom_smooth; one option is sketched after these plots.)

par(mfrow=c(1,2))
plot(x=trainFaith$waiting,y=trainFaith$eruptions,pch=19,col="blue")
lines(x=trainFaith$waiting,y=predict(lm1),lwd=3)
plot(x=testFaith$waiting,y=testFaith$eruptions,pch=19,col="red")
lines(x=testFaith$waiting,y=predict(lm1,newdata=testFaith),lwd=3)
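
On the side note above: one option, sketched here rather than taken from the course, is to draw the fitted line directly from the model coefficients with geom_abline:

ggplot(data=trainFaith, aes(x=waiting,y=eruptions)) + geom_point() +
    geom_abline(intercept=coef(lm1)[1], slope=coef(lm1)[2])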

The labels in the plots above are easily changed; the purpose here is just to get a general picture of what’s happening. Next, let’s look at the prediction error on the training and test sets.

#calculate the square root of the sum of squared errors on the training set
#(a true RMSE would divide by n before taking the square root)
sqrt(sum((lm1$fitted.values-trainFaith$eruptions)^2))
## [1] 5.597283
#see how it does with the test data:
sqrt(sum((predict(lm1,newdata=testFaith)-testFaith$eruptions)^2))
## [1] 6.007774

As expected, the error on the test set is higher, since that data wasn’t used to fit the model. It’s close, though, and the test-set figure is likely the better estimate of the error the model will make on genuinely new data.
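
One caveat: because the sums above aren’t divided by the number of observations, they grow with set size. A minimal per-observation RMSE, as a sketch (the rmse helper is mine, not from the course):

rmse<-function(actual,predicted) sqrt(mean((actual-predicted)^2))
rmse(trainFaith$eruptions,lm1$fitted.values)
rmse(testFaith$eruptions,predict(lm1,newdata=testFaith))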

Now we’ll create some prediction intervals:

pred1<-predict(lm1,newdata=testFaith,interval="prediction")
ord<-order(testFaith$waiting)   #sort by waiting so the lines draw left to right
plot(testFaith$waiting, testFaith$eruptions,pch=19,col="blue")
matlines(testFaith$waiting[ord],pred1[ord,],type="l",col=c(1,2,2),lty=c(1,1,1),lwd=3)   #fit plus lower/upper bounds
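
For reference, the "prediction" interval computed here is \(\hat{y}_0 \pm t_{0.975,\,n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_i (x_i-\bar{x})^2}}\). The extra 1 under the square root, which a confidence interval for the mean lacks, accounts for the irreducible noise in a single new observation, and is why the band is wider than the uncertainty in the fitted line alone.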

And as might be expected, there’s an easier way to fit the same simple linear regression, using the caret package:

modFit<-train(eruptions~waiting, data=trainFaith,method="lm")
summary(modFit$finalModel)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1681 -0.3342  0.0069  0.3158  0.9989 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.773726   0.215983  -8.212 1.55e-13 ***
## waiting      0.074998   0.002984  25.132  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4817 on 135 degrees of freedom
## Multiple R-squared:  0.8239, Adjusted R-squared:  0.8226 
## F-statistic: 631.6 on 1 and 135 DF,  p-value: < 2.2e-16

Now let’s try it with several variables.

Use a new dataset, subsetting to just the variables we need.
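
A sketch of that loading step, assuming the Wage data from the ISLR package used in the course (logwage is dropped since it is just a transform of the outcome we’re predicting):

library(ISLR)
data(Wage)
Wage<-subset(Wage,select=-c(logwage))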

As usual, partition into training and testing datasets.

inTrain<-createDataPartition(y=Wage$wage,p=.6,list=FALSE)
training<-Wage[inTrain,]
testing<-Wage[-inTrain,]

Recall the two plots from the previous analysis (also see the Tableau visualizations). It seems like age, jobclass, and education level might be variables that help explain the splitting of this data into two clusters.

qplot(age, wage,data=Wage,col=jobclass)

qplot(age, wage,data=Wage,col=education)

Let’s try a linear regression with several regressors.

modFit<-train(wage~age+jobclass+education, method="lm", data=training)
finMod<-modFit$finalModel
modFit
## Linear Regression 
## 
## 1802 samples
##    3 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1802, 1802, 1802, 1802, 1802, 1802, ... 
## Resampling results:
## 
##   RMSE      Rsquared 
##   36.31838  0.2793854
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Cool, it handles the conversion of the factor variables to dummy variables automatically, and comes up with a linear model that accounts for about 28% of the variability (resampled R-squared of 0.279). Not an exceptional predictor, but a simple enough process that would probably work well with more linear data.
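
To peek at that dummy-variable expansion, a quick sketch using base R’s model.matrix (caret’s formula interface does something equivalent internally):

head(model.matrix(wage~age+jobclass+education,data=training))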

Let’s look at the residuals.

plot(finMod,1,pch=19,cex=.5,col="#00000010")

Here we see that it labels the outliers for us. We can also color the residuals by variables not used in the model, to see whether a different variable seems to be responsible for the outliers.

qplot(finMod$fitted.values,finMod$residuals,col=race,data=training)

Doesn’t seem like race. Next, let’s look at the residuals plotted by index (that is, row number). This result surprised me, because it differed from the class example. There, the residuals sloped clearly upward as the index increased, and the outliers clustered at high index values; the instructor suggested this meant some variable, one related to the row ordering, was missing from the data (or at least from the model). Our residuals, by contrast, are mostly centered around 0, with no linear trend and a fairly consistent spread of large residuals across index values.

plot(finMod$residuals,pch=19)

Finally, since the purpose of this class is predictive modeling, let’s get a visual sense of how well our model predicts new values.

This plot does two things. First, it shows how well the prediction works: in a perfect model, all the points would fall on the \(y=x\) line. Second, by coloring the points by year, it looks for clustering or trends in the predictions caused by an outside variable. I conclude that there does not appear to be any trend of the model being better or worse for a certain range of years.

pred<-predict(modFit,testing)
qplot(wage,pred,col=year,data=testing)
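
To put a number on the test-set performance, caret’s postResample computes RMSE and R-squared from the predictions and the observed values; a quick sketch:

postResample(pred,testing$wage)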

That concludes week 2 of practical machine learning. Thank you to Jeffrey Leek and the Johns Hopkins biostatistics department for this course.