Based on Jeef Leek's slides for the “Practical Machine Learning” course.
<img class=center src=../../assets/img/08_PredictionAndMachineLearning/wages.jpg height=350>
Image Credit http://www.cahs-media.org/the-high-cost-of-low-wages
Data from: ISLR package from the book: Introduction to statistical learning
library(ISLR)
library(ggplot2)
library(caret)
library(Hmisc)
library(gridExtra)
data(Wage)
str(Wage)
## 'data.frame': 3000 obs. of 12 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ sex : Factor w/ 2 levels "1. Male","2. Female": 1 1 1 1 1 1 1 1 1 1 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
summary(Wage)
## year age sex maritl race
## Min. :2003 Min. :18.0 1. Male :3000 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.8 2. Female: 0 2. Married :2074 2. Black: 293
## Median :2006 Median :42.0 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.4 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.0 5. Separated : 55
## Max. :2009 Max. :80.0
##
## education region jobclass health
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544 1. <=Good : 858
## 2. HS Grad :971 1. New England : 0 2. Information:1456 2. >=Very Good:2142
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health_ins logwage wage
## 1. Yes:2083 Min. :3.00 Min. : 20.1
## 2. No : 917 1st Qu.:4.45 1st Qu.: 85.4
## Median :4.65 Median :104.9
## Mean :4.65 Mean :111.7
## 3rd Qu.:4.86 3rd Qu.:128.7
## Max. :5.76 Max. :318.3
##
inTrain <- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
dim(training)
## [1] 2102 12
dim(testing)
## [1] 898 12
featurePlot(x=training[,c("age", "education", "jobclass")],
y = training$wage,
plot = "pairs")
qplot(age, wage, data=training)
Interesting, odd, pattern with a group of data points separated from the bulk of the sample at higher wages.
qplot(age, wage, colour=jobclass, data=training)
Most of the high points come from the information jobclass, an interesting clue relevant for proceeding with modeling.
Splitting (color-coding) data by education variable and adding a “smoother”, in this case a linear regression model to each group to see if there is a different wage ~ age relationship for different groups.
qq <- qplot(age, wage, colour=education, data=training)
qq + geom_smooth(method='lm', formula = y ~ x)
cutWage <- cut2(training$wage, g=3)
str(cutWage)
## Factor w/ 3 levels "[ 20.1, 93.5)",..: 1 1 1 1 1 1 1 1 1 1 ...
table(cutWage)
## cutWage
## [ 20.1, 93.5) [ 93.5,118.9) [118.9,318.3]
## 711 718 673
Divides data in groups, creating factors based on quantiles.
These new categorical variables can be used to make different kind of plots.
cut2()p1 <- qplot(cutWage, age, data=training,
fill=cutWage,
geom=c("boxplot"))
p1
tapply(training$age, cutWage, summary)
## $`[ 20.1, 93.5)`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 28.0 38.0 39.1 49.0 80.0
##
## $`[ 93.5,118.9)`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.0 33.0 42.5 42.9 51.0 80.0
##
## $`[118.9,318.3]`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.0 38.0 45.0 45.2 52.0 76.0
p2 <- qplot(cutWage, age, data=training,
fill=cutWage,
geom=c("boxplot", "jitter"))
grid.arrange(p1, p2, ncol=2)
cutAge <- cut2(training$age, g=3)
p1b <- qplot(cutAge, wage, data=training, fill=cutAge, geom=c("boxplot"))
p1b
tapply(training$wage, cutAge, summary)
## $`[18,38)`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.3 76.9 94.1 99.5 115.0 278.0
##
## $`[38,49)`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.1 92.2 113.0 120.0 136.0 318.0
##
## $`[49,80]`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.0 90.5 110.0 117.0 133.0 318.0
t1 <- table(cutWage, training$jobclass)
t1
##
## cutWage 1. Industrial 2. Information
## [ 20.1, 93.5) 443 268
## [ 93.5,118.9) 355 363
## [118.9,318.3] 277 396
prop.table(t1,1)
##
## cutWage 1. Industrial 2. Information
## [ 20.1, 93.5) 0.6231 0.3769
## [ 93.5,118.9) 0.4944 0.5056
## [118.9,318.3] 0.4116 0.5884
qplot(wage, colour=education, data=training, geom="density")