Sec.18 - PREDICTING WITH REGRESSION, MULTIPLE COVARIATES

Based on Jeef Leek's slides for the “Practical Machine Learning” course.

This lecture is about two things.

predicting with regression and using multiple covariates, and more importantly
exploring a data set and trying to identify which predictors are the most important to include in our prediction model.

Example: predicting wages

Image Credit http://www.cahs-media.org/the-high-cost-of-low-wages

Data from: ISLR package from the book: Introduction to statistical learning

Example: Wage data

library(ISLR) 
library(ggplot2) 
library(caret)

We subset the data set for exploration purposes leaving out the variable that we are trying to predict logwage. So we select out the log wage variable, which is the variable that we're going to be predicting in this analysis.

data(Wage) 
Wage <- subset(Wage, select=-c(logwage))
summary(Wage)
##       year           age              sex                    maritl           race     
##  Min.   :2003   Min.   :18.0   1. Male  :3000   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.8   2. Female:   0   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.0                    3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.4                    4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.0                    5. Separated    :  55                  
##  Max.   :2009   Max.   :80.0                                                           
##                                                                                        
##               education                     region               jobclass               health    
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544   1. <=Good     : 858  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456   2. >=Very Good:2142  
##  3. Some College   :650   3. East North Central:   0                                              
##  4. College Grad   :685   4. West North Central:   0                                              
##  5. Advanced Degree:426   5. South Atlantic    :   0                                              
##                           6. East South Central:   0                                              
##                           (Other)              :   0                                              
##   health_ins        wage      
##  1. Yes:2083   Min.   : 20.1  
##  2. No : 917   1st Qu.: 85.4  
##                Median :104.9  
##                Mean   :111.7  
##                3rd Qu.:128.7  
##                Max.   :318.3  
##

Get training/test sets

inTrain <- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,] 
testing <- Wage[-inTrain,]
dim(training) 
## [1] 2102   11
dim(testing)
## [1] 898  11

Feature plot

featurePlot(x=training[, c("age","education","jobclass")],
            y = training$wage,
            plot="pairs")

Plot age vs. wage

qplot(age, wage, data=training)

Plot age vs. wage colour by jobclass

qplot(age, wage, colour=jobclass, data=training)

Plot age vs. wage colour by education

qplot(age, wage, colour=education, data=training)

Fit a linear model

\[ ED_i = \beta_0 + \beta_1 age + \beta_2 \mathbb{I}(\text{Jobclass}_i=\text{"Information"}) + \sum_{k=1}^5 \gamma_k \mathbb{I}(\text{education}_i= \text{level}_k) \]

modFit<- train(wage ~ age + jobclass + education, method = "lm", data=training)
print(modFit)
## Linear Regression 
## 
## 2102 samples
##   10 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ... 
## 
## Resampling results
## 
##   RMSE  Rsquared  RMSE SD  Rsquared SD
##   40    0.3       1        0.02       
## 
## 

finMod <- modFit$finalModel
summary(finMod)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -100.00  -20.11   -3.73   14.81  202.63 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    62.0310     3.8704   16.03  < 2e-16 ***
## age                             0.5249     0.0674    7.79  1.0e-14 ***
## `jobclass2. Information`        2.6795     1.6195    1.65  0.09817 .  
## `education2. HS Grad`          10.8645     2.9341    3.70  0.00022 ***
## `education3. Some College`     23.4075     3.0829    7.59  4.7e-14 ***
## `education4. College Grad`     36.6979     3.0874   11.89  < 2e-16 ***
## `education5. Advanced Degree`  63.9465     3.4054   18.78  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.3 on 2095 degrees of freedom
## Multiple R-squared:  0.259,  Adjusted R-squared:  0.257 
## F-statistic:  122 on 6 and 2095 DF,  p-value: <2e-16

Education levels:

1 = < HS Grad,
2 = HS Grad,
3 = Some College
4 = College Grad
5 = Advanced Degree

Diagnostics : plot of predicted values vs. residuals

plot(finMod, 1, pch=19, cex=0.5, col="#00000010")

Ideally one would want the line to run at about 0.0, and this is close but still with a slight tilt.
We can also see that there are a couple of outliers, automatically labeled.

Diagnostics: color by variables not used in the model (e.g. race)

qplot(finMod$fitted, finMod$residuals, colour=race, data=training)

Diagnostics: plot (residuals) by index

plot(finMod$residuals, pch=19)

There should not be any trend with index, but the residuals show both an increasing trend towards higher indices and also all points with high residuals are concentrated towards the end of the data set.

Whenever you see a trend or outliers pattern like that with respect to the row numbers, it suggests that there is a variable missing from the model because we are plotting the residuals here, i.e. the difference between true values and predicted values.

Residuals should not have any relationship with the order in which the variables appear in your data set, unless, and this is what typically you discover when you see a trend like this, or outliers like this at one end of this plot, there is a relationship with respect to time, or age, or some other continuous variable by which rows may be ordered.

Predicted vs. truth in test set

pred <- predict(modFit, testing)
qplot(wage, pred, colour=year, data=testing)

WARNING : once we look at the test data set, we can not return to the training set and make adjustments!

If you want to use all covariates (use `.` in the formula)

modFitAll <- train(wage ~ ., data=training, method="lm")
modFitAll
## Linear Regression 
## 
## 2102 samples
##   10 predictors
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ... 
## 
## Resampling results
## 
##   RMSE  Rsquared  RMSE SD  Rsquared SD
##   30    0.3       2        0.02       
## 
## 

pred <- predict(modFitAll, testing)

# GF: it returns a warning:
## Warning message:
## In predict.lm(modelFit, newdata) :
##   prediction from a rank-deficient fit may be misleading

qplot(wage, pred, data=testing)

Notes and further reading

Often useful in combination with other models
Elements of statistical learning
Modern applied statistics with S
Introduction to statistical learning