Sec.15 - COVARIATE CREATION

Based on Jeff Leek's slides for the “Practical Machine Learning” course.

Two levels of covariate creation

Level 1: From raw data to covariate

[Figure: covCreation1.png - turning raw data into covariates]

Level 2: Transforming tidy covariates

library(kernlab)
data(spam)
spam$capitalAveSq <- spam$capitalAve^2
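
As a quick illustration of how such a transformed covariate might then be used (a minimal sketch, assuming caret is loaded; the model choice and formula are mine, not from the slides):

library(caret)
# logistic regression for spam vs. nonspam using both the original and the squared covariate
modelFit <- train(type ~ capitalAve + capitalAveSq, data=spam, method="glm")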

Load example data

library(ISLR) 
library(caret)
data(Wage)
inTrain <- createDataPartition(y = Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,] 
testing <- Wage[-inTrain,]

Common covariates to add - dummy variables : dummyVars()

Basic idea - convert factor variables to indicator variables

table(training$jobclass)
## 
##  1. Industrial 2. Information 
##           1078           1024
dummies <- dummyVars(wage ~ jobclass, data=training)

head(predict(dummies, newdata=training))
##        jobclass.1. Industrial jobclass.2. Information
## 231655                      1                       0
## 305706                      1                       0
## 160191                      1                       0
## 86064                       1                       0
## 87492                       1                       0
## 86929                       0                       1

Arguments for dummyVars():

dummyVars() takes a model formula and a data frame; calling predict() on the resulting object with new data then produces the indicator columns. Other arguments include sep (the separator used in the generated column names) and fullRank (keep k-1 indicator columns per factor instead of k).
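
A minimal sketch of the fullRank argument (the call itself is illustrative, using the training split created above):

# fullRank=TRUE drops the first level of each factor,
# avoiding perfectly collinear dummies in regression models
dummiesFull <- dummyVars(wage ~ jobclass, data=training, fullRank=TRUE)
head(predict(dummiesFull, newdata=training))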


Removing (near) zero-variance covariates : nearZeroVar()

# saveMetrics=TRUE returns a data frame with variability metrics for every predictor:
# freqRatio (frequency of the most common value over that of the second most common),
# percentUnique (percent of distinct values), and the zeroVar / nzv flags
nsv <- nearZeroVar(training, saveMetrics=TRUE)
nsv
##            freqRatio percentUnique zeroVar   nzv
## year           1.031       0.33302   FALSE FALSE
## age            1.071       2.85442   FALSE FALSE
## sex            0.000       0.04757    TRUE  TRUE
## maritl         3.344       0.23787   FALSE FALSE
## race           7.885       0.19029   FALSE FALSE
## education      1.436       0.23787   FALSE FALSE
## region         0.000       0.04757    TRUE  TRUE
## jobclass       1.053       0.09515   FALSE FALSE
## health         2.509       0.09515   FALSE FALSE
## health_ins     2.214       0.09515   FALSE FALSE
## logwage        1.077      17.50714   FALSE FALSE
## wage           1.064      18.45861   FALSE FALSE

# saveMetrics=FALSE (the default) returns just the column positions of the near-zero variance predictors:
nsv <- nearZeroVar(training, saveMetrics=FALSE)
nsv
## [1] 3 7
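
The returned positions can then be used to drop those predictors before modelling (a minimal sketch using the objects above):

# remove the near-zero variance columns (here positions 3 and 7: sex and region)
trainingReduced <- training[, -nsv]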

Spline basis : bs() (package splines)

Generate the B-spline basis matrix for a polynomial spline.

This is one way to generate new variables and allow more flexibility in how we model specific variables (here, a non-linear effect of age on wage).

library(splines)
bsBasis <- bs(training$age, df=3)  # cubic B-spline basis for age with 3 degrees of freedom (3 columns)
bsBasis[1:12,]
##             1        2         3
##  [1,] 0.00000 0.000000 0.0000000
##  [2,] 0.43214 0.160795 0.0199436
##  [3,] 0.44406 0.210927 0.0333968
##  [4,] 0.43214 0.160795 0.0199436
##  [5,] 0.42215 0.310402 0.0760789
##  [6,] 0.44295 0.244786 0.0450922
##  [7,] 0.27648 0.037219 0.0016701
##  [8,] 0.21297 0.019720 0.0006086
##  [9,] 0.17675 0.012854 0.0003116
## [10,] 0.04768 0.303945 0.6458840
## [11,] 0.13742 0.007362 0.0001315
## [12,] 0.27648 0.037219 0.0016701

See also: ns(), poly()
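
For reference, a minimal sketch of those related basis functions applied to the same variable:

nsBasis   <- ns(training$age, df=3)        # natural cubic spline basis (linear beyond the boundary knots)
polyBasis <- poly(training$age, degree=3)  # orthogonal polynomial terms up to degree 3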


Fitting curves with splines

lm1 <- lm(wage ~ bsBasis, data=training)
lm1$coefficients
## (Intercept)    bsBasis1    bsBasis2    bsBasis3 
##       60.22       93.39       51.05       47.28

plot(training$age, training$wage, pch=19, cex=0.5)
points(training$age, predict(lm1, newdata=training), col="red", pch=19, cex=0.5)
[Figure: wage vs. age scatterplot with the fitted spline values from lm1 overlaid in red]

On the test set, we then have to create those same variables.
This point is critical whenever you create new covariates: they must be created on the test data set using the exact same procedure that was used on the training set, including any parameters estimated from the training data (here, the knots of the spline basis).


Splines on the test set

# evaluate the SAME training-set basis at the test-set ages;
# note: predict.bs() expects the new values via the newx argument (an age= argument
# would be absorbed by ... and the unchanged training basis would be returned)
head(predict(bsBasis, newx=testing$age), 20)
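
As an alternative (a sketch on my part, not from the slides): if bs() is written directly inside the model formula, R stores the training-set knots with the model terms, so predict() automatically evaluates the same basis at the new ages.

lm2 <- lm(wage ~ bs(age, df=3), data=training)    # basis defined inside the formula
head(predict(lm2, newdata=testing))               # training-set knots re-used at test-set ages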

Notes and further reading