library(ISLR); library(ggplot2); library(caret);
## Loading required package: lattice
data(Wage); Wage <- subset(Wage,select=-c(logwage)) # omit logwage because it is the response variable
summary(Wage)
## year age sex maritl
## Min. :2003 Min. :18.00 1. Male :3000 1. Never Married: 648
## 1st Qu.:2004 1st Qu.:33.75 2. Female: 0 2. Married :2074
## Median :2006 Median :42.00 3. Widowed : 19
## Mean :2006 Mean :42.41 4. Divorced : 204
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## race education region
## 1. White:2480 1. < HS Grad :268 2. Middle Atlantic :3000
## 2. Black: 293 2. HS Grad :971 1. New England : 0
## 3. Asian: 190 3. Some College :650 3. East North Central: 0
## 4. Other: 37 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## jobclass health health_ins
## 1. Industrial :1544 1. <=Good : 858 1. Yes:2083
## 2. Information:1456 2. >=Very Good:2142 2. No : 917
##
##
##
##
##
## wage
## Min. : 20.09
## 1st Qu.: 85.38
## Median :104.92
## Mean :111.70
## 3rd Qu.:128.68
## Max. :318.34
##
We partition the dataset into training and testing sets.
inTrain<-createDataPartition(y=Wage$wage,
p=.7,list=FALSE)
training<-Wage[inTrain,]; testing<-Wage[-inTrain,]
dim(training); dim(testing)
## [1] 2102 11
## [1] 898 11
We perform some EDA by creating a feature plot. Given the large number of covariates (features) in the datasets, we limit the feature plot to an examination of a subset of those covariates, so as to make the resulting plot more digestible.
featurePlot(x=training[,c("age","education","jobclass")],
y=training$wage,
plot="pairs")
Notice, for instance how the subplots depicting jobclass vs. wage (in the two plots in the upper right quadrant of the feature plot), demonstrate that the higher wages are earned more often by Information workers than by Industrial ones. The wage vs. age plot display a cluster or outliers floating above their age-cohorts, i.e., certain people within the same age groups earning significantly more than their cohorts, to such an extent that these “floaters” are outliers. This phenomenon calls for closer investigation.
qplot(age,wage,data=training)
Two observations: first, the plot appears, albeit not strikingly, to suggest an positive relationship between age and wage; second, the floaters might be explained by seeing the effect of another variable on those particular datapoints.
qplot(age,wage,colour=jobclass,data=training)
The floating cluster indicates that the jobclass variable can explain a large portion of the variability of these datapoints, since a significant majority of the points in the cluster are assigned a jobclass of Information.
Intuition also suggests that education (levels) would affect wages within a given age cohort, and we test such an intuition thus:
qplot(age,wage,colour=education,data=training)
We see that the floaters are overwhelmingly those with at least a college degree; and that a significant majority have advanced degrees. We now fit a linear model.
modFit<-train(wage~age+jobclass+education,
method="lm",data=training)
finMod<-modFit$finalModel
print(modFit)
## Linear Regression
##
## 2102 samples
## 10 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ...
## Resampling results
##
## RMSE Rsquared RMSE SD Rsquared SD
## 35.91696 0.2571411 1.343554 0.02041253
##
##
We create a diagnostic plot, namely a Residuals vs. Fitted Values plot. Ideally, we want the datapoints to be close to the horizontal line at y=0.
plot(finMod,1,pch=19,cex=0.4,col="#00000010")
We see in this plot that there still exists a cluster of datapoints that deviate significantly from our linear model predicts. As above, we can select another variable to determine the extent of its effect on those very datapoints.
qplot(finMod$fitted.values,finMod$residuals,colour=race,data=training)
Once again, the race variable seems to explain the outliers.
We can plot the residuals vs. the index, i.e., the order in which the datapoints were recorded. Normally, the two should have no relationship to one another, if we assume the dataset includes all relevant variables. On the other hand, if there seems to be a trend between the index and the residual, that suggests that the order in which the data are tabulated, often in chronological order, should itself be a variable.
plot(finMod$residuals,pch=19)
There doesn’t appear to be any reason to believe from this plot that the dataset is missing a relevant index-tied variable.
We make predictions on the testing set and plot them against the truth values.
pred<-predict(modFit,testing)
qplot(wage,pred,colour=year,data=testing)
Ideally the scatterplot would suggest a relationship such that a line y = x would fit the data nicely.