Sec.15 - COVARIATE CREATION

Based on Jeff Leek's slides for the “Practical Machine Learning” course.

Two levels of covariate creation

Level 1: From raw data to covariate

[Figure: covCreation1.png - turning raw data into covariates]

Level 2: Transforming tidy covariates

library(kernlab)
data(spam)
spam$capitalAveSq <- spam$capitalAve^2
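
As a quick illustration of how such a transformed covariate might then be used (a minimal sketch, assuming caret is loaded; the model choice and formula are mine, not from the slides):

library(caret)
# logistic regression for spam vs. nonspam using both the original and the squared covariate
modelFit <- train(type ~ capitalAve + capitalAveSq, data=spam, method="glm")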

Load example data

library(ISLR) 
library(caret)
data(Wage)
inTrain <- createDataPartition(y = Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,] 
testing <- Wage[-inTrain,]

Common covariates to add - dummy variables : dummyVars()

Basic idea - convert factor variables to indicator variables

table(training$jobclass)
## 
##  1. Industrial 2. Information 
##           1078           1024
dummies <- dummyVars(wage ~ jobclass, data=training)

head(predict(dummies, newdata=training))
##        jobclass.1. Industrial jobclass.2. Information
## 231655                      1                       0
## 305706                      1                       0
## 160191                      1                       0
## 86064                       1                       0
## 87492                       1                       0
## 86929                       0                       1

Arguments for dummyVars():

dummyVars() takes a model formula and a data frame; calling predict() on the resulting object with new data then produces the indicator columns. Other arguments include sep (the separator used in the generated column names) and fullRank (keep k-1 indicator columns per factor instead of k).
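
A minimal sketch of the fullRank argument (the call itself is illustrative, using the training split created above):

# fullRank=TRUE drops the first level of each factor,
# avoiding perfectly collinear dummies in regression models
dummiesFull <- dummyVars(wage ~ jobclass, data=training, fullRank=TRUE)
head(predict(dummiesFull, newdata=training))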


Removing (near) zero-variance covariates : nearZeroVar()

# saveMetrics=TRUE returns a data frame with variability metrics for every predictor:
# freqRatio (frequency of the most common value over that of the second most common),
# percentUnique (percent of distinct values), and the zeroVar / nzv flags
nsv <- nearZeroVar(training, saveMetrics=TRUE)
nsv
##            freqRatio percentUnique zeroVar   nzv
## year           1.031       0.33302   FALSE FALSE
## age            1.071       2.85442   FALSE FALSE
## sex            0.000       0.04757    TRUE  TRUE
## maritl         3.344       0.23787   FALSE FALSE
## race           7.885       0.19029   FALSE FALSE
## education      1.436       0.23787   FALSE FALSE
## region         0.000       0.04757    TRUE  TRUE
## jobclass       1.053       0.09515   FALSE FALSE
## health         2.509       0.09515   FALSE FALSE
## health_ins     2.214       0.09515   FALSE FALSE
## logwage        1.077      17.50714   FALSE FALSE
## wage           1.064      18.45861   FALSE FALSE

# saveMetrics=FALSE (the default) returns just the column positions of the near-zero variance predictors:
nsv <- nearZeroVar(training, saveMetrics=FALSE)
nsv
## [1] 3 7
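
The returned positions can then be used to drop those predictors before modelling (a minimal sketch using the objects above):

# remove the near-zero variance columns (here positions 3 and 7: sex and region)
trainingReduced <- training[, -nsv]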

Spline basis : bs() (package splines)

Generate the B-spline basis matrix for a polynomial spline.

This is one way to generate new variables and allow more flexibility in how we model specific variables (here, a non-linear effect of age on wage).

library(splines)
bsBasis <- bs(training$age, df=3)  # cubic B-spline basis for age with 3 degrees of freedom (3 columns)
bsBasis[1:12,]
##             1        2         3
##  [1,] 0.00000 0.000000 0.0000000
##  [2,] 0.43214 0.160795 0.0199436
##  [3,] 0.44406 0.210927 0.0333968
##  [4,] 0.43214 0.160795 0.0199436
##  [5,] 0.42215 0.310402 0.0760789
##  [6,] 0.44295 0.244786 0.0450922
##  [7,] 0.27648 0.037219 0.0016701
##  [8,] 0.21297 0.019720 0.0006086
##  [9,] 0.17675 0.012854 0.0003116
## [10,] 0.04768 0.303945 0.6458840
## [11,] 0.13742 0.007362 0.0001315
## [12,] 0.27648 0.037219 0.0016701

See also: ns(), poly()
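
For reference, a minimal sketch of those related basis functions applied to the same variable:

nsBasis   <- ns(training$age, df=3)        # natural cubic spline basis (linear beyond the boundary knots)
polyBasis <- poly(training$age, degree=3)  # orthogonal polynomial terms up to degree 3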


Fitting curves with splines

lm1 <- lm(wage ~ bsBasis, data=training)
lm1$coefficients
## (Intercept)    bsBasis1    bsBasis2    bsBasis3 
##       60.22       93.39       51.05       47.28

plot(training$age, training$wage, pch=19, cex=0.5)
points(training$age, predict(lm1, newdata=training), col="red", pch=19, cex=0.5)
[Figure: wage vs. age scatterplot with the fitted spline values from lm1 overlaid in red]

On the test set, we then have to create those same variables.
This point is critical whenever you create new covariates: they must be created on the test data set using the exact same procedure that was used on the training set, including any parameters estimated from the training data (here, the knots of the spline basis).


Splines on the test set

# evaluate the SAME training-set basis at the test-set ages;
# note: predict.bs() expects the new values via the newx argument (an age= argument
# would be absorbed by ... and the unchanged training basis would be returned)
head(predict(bsBasis, newx=testing$age), 20)
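
As an alternative (a sketch on my part, not from the slides): if bs() is written directly inside the model formula, R stores the training-set knots with the model terms, so predict() automatically evaluates the same basis at the new ages.

lm2 <- lm(wage ~ bs(age, df=3), data=training)    # basis defined inside the formula
head(predict(lm2, newdata=testing))               # training-set knots re-used at test-set ages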

Notes and further reading