Sec.13 - PLOTTING PREDICTORS

Based on Jeef Leek's slides for the “Practical Machine Learning” course.

Example: predicting wages

<img class=center src=../../assets/img/08_PredictionAndMachineLearning/wages.jpg height=350>

Image Credit http://www.cahs-media.org/the-high-cost-of-low-wages

Data from: ISLR package from the book: Introduction to statistical learning

Example: Wage data

library(ISLR) 
library(ggplot2) 
library(caret)
library(Hmisc)
library(gridExtra)

data(Wage)
str(Wage)
## 'data.frame':    3000 obs. of  12 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ sex       : Factor w/ 2 levels "1. Male","2. Female": 1 1 1 1 1 1 1 1 1 1 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...
summary(Wage)
##       year           age              sex                    maritl           race     
##  Min.   :2003   Min.   :18.0   1. Male  :3000   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.8   2. Female:   0   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.0                    3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.4                    4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.0                    5. Separated    :  55                  
##  Max.   :2009   Max.   :80.0                                                           
##                                                                                        
##               education                     region               jobclass               health    
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544   1. <=Good     : 858  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456   2. >=Very Good:2142  
##  3. Some College   :650   3. East North Central:   0                                              
##  4. College Grad   :685   4. West North Central:   0                                              
##  5. Advanced Degree:426   5. South Atlantic    :   0                                              
##                           6. East South Central:   0                                              
##                           (Other)              :   0                                              
##   health_ins      logwage          wage      
##  1. Yes:2083   Min.   :3.00   Min.   : 20.1  
##  2. No : 917   1st Qu.:4.45   1st Qu.: 85.4  
##                Median :4.65   Median :104.9  
##                Mean   :4.65   Mean   :111.7  
##                3rd Qu.:4.86   3rd Qu.:128.7  
##                Max.   :5.76   Max.   :318.3  
##

Get training/test sets

inTrain <- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
dim(training) 
## [1] 2102   12
dim(testing)
## [1] 898  12

Feature plot (caret package)

featurePlot(x=training[,c("age", "education", "jobclass")],
            y = training$wage,
            plot = "pairs")

Qplot (ggplot2 package)

qplot(age, wage, data=training)

Interesting, odd, pattern with a group of data points separated from the bulk of the sample at higher wages.

Qplot with color (ggplot2 package)

qplot(age, wage, colour=jobclass, data=training)

Most of the high points come from the information jobclass, an interesting clue relevant for proceeding with modeling.

Add regression smoothers (ggplot2 package)

Splitting (color-coding) data by education variable and adding a “smoother”, in this case a linear regression model to each group to see if there is a different wage ~ age relationship for different groups.

qq <- qplot(age, wage, colour=education, data=training)
qq +  geom_smooth(method='lm', formula = y ~ x)

cut2, making factors (Hmisc package)

cutWage <- cut2(training$wage, g=3)
str(cutWage)
##  Factor w/ 3 levels "[ 20.1, 93.5)",..: 1 1 1 1 1 1 1 1 1 1 ...
table(cutWage)
## cutWage
## [ 20.1, 93.5) [ 93.5,118.9) [118.9,318.3] 
##           711           718           673

Divides data in groups, creating factors based on quantiles.
These new categorical variables can be used to make different kind of plots.

Boxplots with `cut2()`

p1 <- qplot(cutWage, age, data=training, 
      fill=cutWage,
      geom=c("boxplot"))
p1

tapply(training$age, cutWage, summary)
## $`[ 20.1, 93.5)`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0    28.0    38.0    39.1    49.0    80.0 
## 
## $`[ 93.5,118.9)`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    21.0    33.0    42.5    42.9    51.0    80.0 
## 
## $`[118.9,318.3]`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    21.0    38.0    45.0    45.2    52.0    76.0

Boxplots with points overlayed

p2 <- qplot(cutWage, age, data=training, 
      fill=cutWage,
      geom=c("boxplot", "jitter"))
grid.arrange(p1, p2, ncol=2)

Cut and boxplot with wage groups

cutAge <- cut2(training$age, g=3)
p1b <- qplot(cutAge, wage, data=training, fill=cutAge, geom=c("boxplot"))
p1b

tapply(training$wage, cutAge, summary)
## $`[18,38)`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    23.3    76.9    94.1    99.5   115.0   278.0 
## 
## $`[38,49)`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.1    92.2   113.0   120.0   136.0   318.0 
## 
## $`[49,80]`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    23.0    90.5   110.0   117.0   133.0   318.0

Tables

t1 <- table(cutWage, training$jobclass)
t1
##                
## cutWage         1. Industrial 2. Information
##   [ 20.1, 93.5)           443            268
##   [ 93.5,118.9)           355            363
##   [118.9,318.3]           277            396

prop.table(t1,1)
##                
## cutWage         1. Industrial 2. Information
##   [ 20.1, 93.5)        0.6231         0.3769
##   [ 93.5,118.9)        0.4944         0.5056
##   [118.9,318.3]        0.4116         0.5884

Density plots

qplot(wage, colour=education, data=training, geom="density")

Notes and further reading

Make your plots only in the training set
- Don't use the test set for exploration!
Things you should be looking for
- Imbalance in outcomes/predictors
- Outliers
- Groups of points not explained by a predictor
- Skewed variables
ggplot2 tutorial
caret visualizations