Based on Jeff Leek's slides for the “Practical Machine Learning” course.
This lecture is about two things: exploring the data with plots, and predicting with linear regression using multiple covariates.
Image credit: http://www.cahs-media.org/the-high-cost-of-low-wages
Data: the Wage data set from the ISLR package, which accompanies the book Introduction to Statistical Learning.
library(ISLR)
library(ggplot2)
library(caret)
We subset the data set for exploration, leaving out the logwage variable: wage is the outcome we are going to predict, and logwage is just its log transform, so keeping it among the predictors would give the model the answer.
data(Wage)
Wage <- subset(Wage, select=-c(logwage))
summary(Wage)
## year age sex maritl race
## Min. :2003 Min. :18.0 1. Male :3000 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.8 2. Female: 0 2. Married :2074 2. Black: 293
## Median :2006 Median :42.0 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.4 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.0 5. Separated : 55
## Max. :2009 Max. :80.0
##
## education region jobclass health
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544 1. <=Good : 858
## 2. HS Grad :971 1. New England : 0 2. Information:1456 2. >=Very Good:2142
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health_ins wage
## 1. Yes:2083 Min. : 20.1
## 2. No : 917 1st Qu.: 85.4
## Median :104.9
## Mean :111.7
## 3rd Qu.:128.7
## Max. :318.3
##
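Note that sex and region each have only one observed level here (all male, all Middle Atlantic), so they carry no information for prediction. A minimal sketch of flagging such zero-variance columns with caret's nearZeroVar() (not in the original notes):

nzv <- nearZeroVar(Wage, saveMetrics=TRUE)  # per-column frequency/uniqueness diagnostics
nzv[nzv$zeroVar, ]                          # columns taking a single value, e.g. sex, region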
inTrain <- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
dim(training)
## [1] 2102 11
dim(testing)
## [1] 898 11
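Note that the partition is random: rerunning the code gives a different split, and slightly different numbers everywhere below, unless a seed is fixed before the call to createDataPartition() above. Standard practice, not shown in the original:

set.seed(32343)  # any fixed seed makes the partition reproducible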
featurePlot(x=training[, c("age","education","jobclass")],
            y=training$wage,
            plot="pairs")
qplot(age, wage, data=training)
qplot(age, wage, colour=jobclass, data=training)
qplot(age, wage, colour=education, data=training)
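Since qplot() returns a ggplot object, a per-group linear smoother can be layered on top to make the education trends easier to compare (an illustrative addition, not in the original):

qplot(age, wage, colour=education, data=training) +
  geom_smooth(method="lm", formula=y~x)  # one regression line per education level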
We fit a linear model of the form

\[ \text{wage}_i = \beta_0 + \beta_1 \text{age}_i + \beta_2 \mathbb{I}(\text{jobclass}_i=\text{"Information"}) + \sum_{k=2}^{5} \gamma_k \mathbb{I}(\text{education}_i= \text{level}_k) + \varepsilon_i \]

where level 1 of education (1. < HS Grad) is the baseline absorbed into the intercept, so only four education indicators are estimated.
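R expands the factor variables into these indicator columns automatically; model.matrix() shows the design matrix the linear model will actually see:

head(model.matrix(wage ~ age + jobclass + education, data=training))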
modFit <- train(wage ~ age + jobclass + education, method = "lm", data=training)
print(modFit)
## Linear Regression
##
## 2102 samples
## 10 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ...
##
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 40 0.3 1 0.02
##
##
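By default train() estimates out-of-sample error with 25 bootstrap resamples. A sketch of switching to 10-fold cross-validation instead, via trainControl() (an alternative, not what these notes ran):

modFitCV <- train(wage ~ age + jobclass + education, method="lm",
                  trControl=trainControl(method="cv", number=10),
                  data=training)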
finMod <- modFit$finalModel
summary(finMod)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.00 -20.11 -3.73 14.81 202.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.0310 3.8704 16.03 < 2e-16 ***
## age 0.5249 0.0674 7.79 1.0e-14 ***
## `jobclass2. Information` 2.6795 1.6195 1.65 0.09817 .
## `education2. HS Grad` 10.8645 2.9341 3.70 0.00022 ***
## `education3. Some College` 23.4075 3.0829 7.59 4.7e-14 ***
## `education4. College Grad` 36.6979 3.0874 11.89 < 2e-16 ***
## `education5. Advanced Degree` 63.9465 3.4054 18.78 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.3 on 2095 degrees of freedom
## Multiple R-squared: 0.259, Adjusted R-squared: 0.257
## F-statistic: 122 on 6 and 2095 DF, p-value: <2e-16
Education levels: the four education coefficients correspond to levels 2-5 of the factor, each measured against the 1. < HS Grad baseline, and the estimated wage premium increases monotonically with education.
plot(finMod, 1, pch=19, cex=0.5, col="#00000010")
Ideally the smoothed line would sit flat at 0.0; here it is close, but with a slight tilt. A few outliers are also automatically labeled with their row numbers.
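Those labeled rows can be pulled out directly for inspection (an illustrative check, not in the original):

bigRes <- order(abs(finMod$residuals), decreasing=TRUE)[1:3]  # three largest residuals
training[bigRes, c("age","jobclass","education","wage")]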
qplot(finMod$fitted, finMod$residuals, colour=race, data=training)
plot(finMod$residuals, pch=19)
There should be no trend with index, but these residuals both drift upward at higher indices and concentrate their largest values at the end of the data set. A trend or outlier pattern against row number like this suggests a variable is missing from the model: the residuals (the differences between observed and predicted values) should bear no relationship to the order in which rows appear. When they do, the rows are typically sorted by time, age, or some other continuous variable, and that variable belongs in the model.
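One way to hunt for the missing variable is to colour the index plot by candidate covariates; if the colour bands line up with the trend, that covariate probably belongs in the model. A sketch using year:

qplot(seq_along(finMod$residuals), finMod$residuals,
      colour=year, data=training, xlab="index", ylab="residual")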
pred <- predict(modFit, testing)
qplot(wage, pred, colour=year, data=testing)
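To put a number on out-of-sample performance, caret's postResample() computes RMSE and R-squared on the held-out data:

postResample(pred, testing$wage)  # test-set RMSE and Rsquared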
WARNING: once we have looked at the test set, we cannot go back to the training set and make adjustments!
To use all the remaining covariates as predictors, put a period (.) in the formula:
modFitAll <- train(wage ~ ., data=training, method="lm")
modFitAll
## Linear Regression
##
## 2102 samples
## 10 predictors
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ...
##
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 30 0.3 2 0.02
##
##
pred <- predict(modFitAll, testing)
# GF: it returns a warning:
## Warning message:
## In predict.lm(modelFit, newdata) :
## prediction from a rank-deficient fit may be misleading
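The rank deficiency comes from sex and region, which are constant in this sample, so lm() cannot estimate their dummy coefficients and sets them to NA. Listing the aliased terms makes this visible (the final model here is a plain lm fit):

coefAll <- coef(modFitAll$finalModel)
names(coefAll)[is.na(coefAll)]  # dummy columns lm() could not estimate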
qplot(wage, pred, data=testing)
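Finally, the two models can be compared on the same test set with postResample() (a quick check, not in the original notes):

predSmall <- predict(modFit, testing)     # age + jobclass + education
predAll   <- predict(modFitAll, testing)  # all covariates
rbind(small=postResample(predSmall, testing$wage),
      all=postResample(predAll, testing$wage))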