Load the ISLR package, which provides the Wage dataset, along with the ggplot2 and caret libraries.

library(ISLR); library(ggplot2); library(caret);
## Loading required package: lattice
data("Wage")
summary(Wage)
##       year           age               sex                    maritl    
##  Min.   :2003   Min.   :18.00   1. Male  :3000   1. Never Married: 648  
##  1st Qu.:2004   1st Qu.:33.75   2. Female:   0   2. Married      :2074  
##  Median :2006   Median :42.00                    3. Widowed      :  19  
##  Mean   :2006   Mean   :42.41                    4. Divorced     : 204  
##  3rd Qu.:2008   3rd Qu.:51.00                    5. Separated    :  55  
##  Max.   :2009   Max.   :80.00                                           
##                                                                         
##        race                   education                     region    
##  1. White:2480   1. < HS Grad      :268   2. Middle Atlantic   :3000  
##  2. Black: 293   2. HS Grad        :971   1. New England       :   0  
##  3. Asian: 190   3. Some College   :650   3. East North Central:   0  
##  4. Other:  37   4. College Grad   :685   4. West North Central:   0  
##                  5. Advanced Degree:426   5. South Atlantic    :   0  
##                                           6. East South Central:   0  
##                                           (Other)              :   0  
##            jobclass               health      health_ins      logwage     
##  1. Industrial :1544   1. <=Good     : 858   1. Yes:2083   Min.   :3.000  
##  2. Information:1456   2. >=Very Good:2142   2. No : 917   1st Qu.:4.447  
##                                                            Median :4.653  
##                                                            Mean   :4.654  
##                                                            3rd Qu.:4.857  
##                                                            Max.   :5.763  
##                                                                           
##       wage       
##  Min.   : 20.09  
##  1st Qu.: 85.38  
##  Median :104.92  
##  Mean   :111.70  
##  3rd Qu.:128.68  
##  Max.   :318.34  
## 

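Before any exploratory data analysis (EDA), we partition the dataset into training and testing sets, so that exploration and later modeling decisions rely on the training data only.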

inTrain<-createDataPartition(y=Wage$wage,p=.7,list=FALSE)
training<-Wage[inTrain,]; testing<-Wage[-inTrain,]
dim(training); dim(testing)
## [1] 2102   12
## [1] 898  12
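To make the idea behind the split concrete, here is a minimal base-R sketch of a quantile-stratified 70/30 partition. It approximates what createDataPartition does for a numeric outcome, as we understand it from caret's documentation; it is not caret's actual code. The simulated vector y, the seed, and the choice of five bins are assumptions made for illustration.

```r
# Base-R sketch of a quantile-stratified 70/30 split.
# Approximates the idea behind caret::createDataPartition; not caret's code.
set.seed(123)  # assumed seed, fixed so the split is reproducible

# Hypothetical stand-in for Wage$wage: 3000 simulated positive values.
y <- exp(rnorm(3000, mean = 4.65, sd = 0.35))

# Bin the outcome into 5 quantile groups (assumed number of groups).
breaks <- quantile(y, probs = seq(0, 1, length.out = 6))
bins <- cut(y, breaks = breaks, include.lowest = TRUE)

# Within each bin, sample about 70% of the row indices for training.
in_train <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

train_y <- y[in_train]
test_y <- y[-in_train]

length(train_y); length(test_y)
summary(train_y); summary(test_y)  # the two distributions should look alike
```

Because the sampling happens inside each quantile bin, the training and testing sets end up with similar wage distributions. Calling set.seed before createDataPartition in the code above would likewise make the same rows land in the training set on every run.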

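We begin the EDA with a pairs plot of wage against a few candidate features, followed by a scatterplot of wage against age.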

featurePlot(x=training[,c("age","education","jobclass")],
            y=training$wage,
            plot="pairs")

qplot(age,wage,data=training)  

Notice first that both the featurePlot and the qplot suggest a relationship between wage and age. Moreover, in the qplot, a cluster of observations sits well above the rest of the data: people who earn much higher wages than others of similar age.

To assess whether that group of ‘outliers’ can be explained by another feature, we can color the plotted points according to the type of job held.

qplot(age,wage,colour=jobclass,data=training)  

We see that the hovering group consists largely of “information” workers as opposed to “industrial” ones. This may explain the drastic jump in wages.

We can also assess the effect of education level on wages; intuitively, more education should be associated with higher wages.

qq<-qplot(age,wage,colour=education,data=training)
qq+geom_smooth(method='lm',formula=y~x)

Unsurprisingly, education level affects wages. In particular, the regression lines plotted for each education level are largely non-overlapping and appear, to the naked eye, roughly parallel.

We also notice from the regression lines that given an education level, wages tend to increase modestly with age.
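Whether the lines are really parallel can be checked more formally than by eye. One possible approach, sketched below, compares an additive model (one shared age slope) with an interaction model (a separate age slope per education level) using an F-test. The helper compare_slopes and the simulated data frame sim are hypothetical names introduced for illustration; the simulated data are built with truly parallel lines and say nothing about the Wage data itself.

```r
# Compare an additive model (parallel lines) with an interaction model
# (a separate age slope for each education level).
# compare_slopes is a hypothetical helper, not part of any package.
compare_slopes <- function(df) {
  additive <- lm(wage ~ age + education, data = df)
  interacted <- lm(wage ~ age * education, data = df)
  anova(additive, interacted)  # F-test for the extra slope terms
}

# Simulated data with parallel lines by construction, for illustration only.
set.seed(1)  # assumed seed
n <- 300
sim <- data.frame(
  age = runif(n, 18, 80),
  education = factor(sample(c("HS", "College", "Advanced"), n, replace = TRUE))
)
base_wage <- c(HS = 80, College = 110, Advanced = 140)
sim$wage <- unname(base_wage[as.character(sim$education)]) +
  0.5 * sim$age + rnorm(n, sd = 20)

result <- compare_slopes(sim)
print(result)
```

With the data from this walkthrough, the call would be compare_slopes(training). A large p-value in the second row would be consistent with roughly parallel lines; a small one would suggest that the effect of age differs by education level.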

We can use the cut2 function from the Hmisc library to transform the wage data into a categorical feature.

library(Hmisc)
## Loading required package: grid
## Loading required package: survival
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:caret':
## 
##     cluster
## 
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
groupWage <- cut2(training$wage,g=3)
table(groupWage)  
## groupWage
## [ 20.1, 91.7) [ 91.7,118.9) [118.9,318.3] 
##           701           727           674
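For readers without Hmisc, a similar tertile grouping can be sketched in base R with quantile and cut. This is an approximation of what cut2 with g=3 does, not its implementation, and the simulated wage vector and seed below are assumptions standing in for training$wage.

```r
# Base-R approximation of Hmisc::cut2(x, g = 3): three quantile groups.
set.seed(42)  # assumed seed

# Hypothetical stand-in for training$wage (2102 simulated positive values).
wage <- exp(rnorm(2102, mean = 4.65, sd = 0.35))

# Cut points at the 0th, 33rd, 67th and 100th percentiles.
breaks <- quantile(wage, probs = c(0, 1/3, 2/3, 1))

# right = FALSE gives [a, b) intervals; include.lowest = TRUE closes the
# last interval so the maximum value is not dropped.
group_wage <- cut(wage, breaks = breaks, right = FALSE, include.lowest = TRUE)

table(group_wage)
```

On continuous simulated values the three groups come out almost exactly equal in size. Real wage data contain tied values, which is one reason the groups produced by cut2 above are close to, but not exactly, equal.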

Treating the response as a categorical feature allows other plots that may be informative.

p1<-qplot(groupWage,age,data=training,fill=groupWage,geom=c("boxplot"))
p1

The resulting boxplot depicts the relationship between age and wage more cleanly than the scatterplots above. The boxplot does not, however, reveal whether the observations are badly imbalanced among the wage categories. In other words, the upward trend depicted by the boxplot could be misleading if, for instance, only a few workers fell in one of the three wage categories.

To check whether the observations are evenly balanced among the three groups, we can overlay the boxplots with the individual observations, drawn as jittered points so that they do not sit on top of one another.

p2<-qplot(groupWage,age,data=training,fill=groupWage,
          geom=c("boxplot","jitter"))
library(gridExtra)
grid.arrange(p1,p2,ncol=2)

We see that the observations are well distributed across the groups, so the positive relationship between wage and age is not an artifact of a sparsely populated category.

We can also make tables of the data depicted above based on the wage groups.

t1 <- table(groupWage,training$jobclass)
t1
##                
## groupWage       1. Industrial 2. Information
##   [ 20.1, 91.7)           445            256
##   [ 91.7,118.9)           376            351
##   [118.9,318.3]           255            419
prop.table(t1,1)  # row proportions: each wage group sums to 1
##                
## groupWage       1. Industrial 2. Information
##   [ 20.1, 91.7)     0.6348074      0.3651926
##   [ 91.7,118.9)     0.5171939      0.4828061
##   [118.9,318.3]     0.3783383      0.6216617
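The same table can also be normalized by column, which answers a slightly different question: of all industrial (or information) workers, what share falls in each wage group? The sketch below rebuilds t1 from the counts printed above so that it runs without the dataset; the names row_props and col_props are ours.

```r
# Rebuild the contingency table from the counts printed above.
t1 <- as.table(matrix(
  c(445, 256,
    376, 351,
    255, 419),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    groupWage = c("[ 20.1, 91.7)", "[ 91.7,118.9)", "[118.9,318.3]"),
    jobclass = c("1. Industrial", "2. Information")
  )
))

row_props <- prop.table(t1, 1)  # each row (wage group) sums to 1
col_props <- prop.table(t1, 2)  # each column (job class) sums to 1

round(row_props, 3)
round(col_props, 3)

margin.table(t1, 1)  # number of workers per wage group
margin.table(t1, 2)  # number of workers per job class
```

Row proportions describe the job mix within each wage group, while column proportions describe how each job class is spread across the wage groups. Which one to report depends on which variable is being conditioned on.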