library(ISLR); library(ggplot2); library(caret);
## Loading required package: lattice
data(Wage); Wage <- subset(Wage,select=-c(logwage)) # omit logwage because it is the response variable
summary(Wage)
##       year           age               sex                    maritl    
##  Min.   :2003   Min.   :18.00   1. Male  :3000   1. Never Married: 648  
##  1st Qu.:2004   1st Qu.:33.75   2. Female:   0   2. Married      :2074  
##  Median :2006   Median :42.00                    3. Widowed      :  19  
##  Mean   :2006   Mean   :42.41                    4. Divorced     : 204  
##  3rd Qu.:2008   3rd Qu.:51.00                    5. Separated    :  55  
##  Max.   :2009   Max.   :80.00                                           
##                                                                         
##        race                   education                     region    
##  1. White:2480   1. < HS Grad      :268   2. Middle Atlantic   :3000  
##  2. Black: 293   2. HS Grad        :971   1. New England       :   0  
##  3. Asian: 190   3. Some College   :650   3. East North Central:   0  
##  4. Other:  37   4. College Grad   :685   4. West North Central:   0  
##                  5. Advanced Degree:426   5. South Atlantic    :   0  
##                                           6. East South Central:   0  
##                                           (Other)              :   0  
##            jobclass               health      health_ins  
##  1. Industrial :1544   1. <=Good     : 858   1. Yes:2083  
##  2. Information:1456   2. >=Very Good:2142   2. No : 917  
##                                                           
##                                                           
##                                                           
##                                                           
##                                                           
##       wage       
##  Min.   : 20.09  
##  1st Qu.: 85.38  
##  Median :104.92  
##  Mean   :111.70  
##  3rd Qu.:128.68  
##  Max.   :318.34  
## 

We partition the dataset into training and testing sets.

inTrain<-createDataPartition(y=Wage$wage,
                             p=.7,list=FALSE)
training<-Wage[inTrain,]; testing<-Wage[-inTrain,]
dim(training); dim(testing)
## [1] 2102   11
## [1] 898  11

We perform some EDA by creating a feature plot. Given the large number of covariates (features) in the datasets, we limit the feature plot to an examination of a subset of those covariates, so as to make the resulting plot more digestible.

featurePlot(x=training[,c("age","education","jobclass")],
            y=training$wage,
            plot="pairs")  

Notice, for instance how the subplots depicting jobclass vs. wage (in the two plots in the upper right quadrant of the feature plot), demonstrate that the higher wages are earned more often by Information workers than by Industrial ones. The wage vs. age plot display a cluster or outliers floating above their age-cohorts, i.e., certain people within the same age groups earning significantly more than their cohorts, to such an extent that these “floaters” are outliers. This phenomenon calls for closer investigation.

qplot(age,wage,data=training)

Two observations: first, the plot appears, albeit not strikingly, to suggest an positive relationship between age and wage; second, the floaters might be explained by seeing the effect of another variable on those particular datapoints.

qplot(age,wage,colour=jobclass,data=training)

The floating cluster indicates that the jobclass variable can explain a large portion of the variability of these datapoints, since a significant majority of the points in the cluster are assigned a jobclass of Information.

Intuition also suggests that education (levels) would affect wages within a given age cohort, and we test such an intuition thus:

qplot(age,wage,colour=education,data=training)

We see that the floaters are overwhelmingly those with at least a college degree; and that a significant majority have advanced degrees. We now fit a linear model.

modFit<-train(wage~age+jobclass+education,
              method="lm",data=training)
finMod<-modFit$finalModel
print(modFit)
## Linear Regression 
## 
## 2102 samples
##   10 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 2102, 2102, 2102, 2102, 2102, 2102, ... 
## Resampling results
## 
##   RMSE      Rsquared   RMSE SD   Rsquared SD
##   35.91696  0.2571411  1.343554  0.02041253 
## 
## 

We create a diagnostic plot, namely a Residuals vs. Fitted Values plot. Ideally, we want the datapoints to be close to the horizontal line at y=0.

plot(finMod,1,pch=19,cex=0.4,col="#00000010")

We see in this plot that there still exists a cluster of datapoints that deviate significantly from our linear model predicts. As above, we can select another variable to determine the extent of its effect on those very datapoints.

qplot(finMod$fitted.values,finMod$residuals,colour=race,data=training)

Once again, the race variable seems to explain the outliers.

We can plot the residuals vs. the index, i.e., the order in which the datapoints were recorded. Normally, the two should have no relationship to one another, if we assume the dataset includes all relevant variables. On the other hand, if there seems to be a trend between the index and the residual, that suggests that the order in which the data are tabulated, often in chronological order, should itself be a variable.

plot(finMod$residuals,pch=19)  

There doesn’t appear to be any reason to believe from this plot that the dataset is missing a relevant index-tied variable.

We make predictions on the testing set and plot them against the truth values.

pred<-predict(modFit,testing)
qplot(wage,pred,colour=year,data=testing)

Ideally the scatterplot would suggest a relationship such that a line y = x would fit the data nicely.