Based on Jeff Leek's slides for the “Practical Machine Learning” course.
<img class=center src=../../assets/img/08_PredictionAndMachineLearning/covCreation1.png height=200>
# covariate creation by transforming an existing predictor:
# add the square of capitalAve as a new covariate
library(kernlab)
data(spam)
spam$capitalAveSq <- spam$capitalAve^2
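As a quick illustration of why you might add such a term, the squared covariate can enter a model alongside the original predictor. A minimal sketch using plain glm() (not part of the original slides; in practice you would fit on a training split only):
# hypothetical example: logistic regression with both the original and the squared covariate
fit <- glm(type ~ capitalAve + capitalAveSq, data=spam, family=binomial)
summary(fit)$coefficients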
# Wage data example: split into training and test sets
library(ISLR)
library(caret)
data(Wage)
inTrain <- createDataPartition(y = Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
dummyVars(): basic idea - convert factor variables to indicator (dummy) variables.
table(training$jobclass)
##
## 1. Industrial 2. Information
## 1078 1024
dummies <- dummyVars(wage ~ jobclass, data=training)
head(predict(dummies, newdata=training))
## jobclass.1. Industrial jobclass.2. Information
## 231655 1 0
## 305706 1 0
## 160191 1 0
## 86064 1 0
## 87492 1 0
## 86929 0 1
Useful arguments for dummyVars() (a short sketch follows this list):
- sep=NULL: no separator between the variable name and the factor level, i.e. the normal behavior of model.matrix as shown in the Details section of the help page.
- fullRank=TRUE: encode factors consistently with model.matrix, so that no linear dependencies are induced between the columns.
- na.action: what to do with missing values in newdata (the default is to predict NA).
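For illustration, a minimal hedged sketch of those arguments in use (the object name dummiesFull is made up for this example):
# fullRank=TRUE drops one level per factor, so no linear dependence is induced;
# sep=NULL mimics the model.matrix style of column naming
dummiesFull <- dummyVars(wage ~ jobclass, data=training, fullRank=TRUE, sep=NULL)
head(predict(dummiesFull, newdata=training))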
nearZeroVar(): identify predictors with (near) zero variance, which carry almost no information for prediction.
# to return a data frame with full predictor information:
nsv <- nearZeroVar(training, saveMetrics=TRUE)
nsv
## freqRatio percentUnique zeroVar nzv
## year 1.031 0.33302 FALSE FALSE
## age 1.071 2.85442 FALSE FALSE
## sex 0.000 0.04757 TRUE TRUE
## maritl 3.344 0.23787 FALSE FALSE
## race 7.885 0.19029 FALSE FALSE
## education 1.436 0.23787 FALSE FALSE
## region 0.000 0.04757 TRUE TRUE
## jobclass 1.053 0.09515 FALSE FALSE
## health 2.509 0.09515 FALSE FALSE
## health_ins 2.214 0.09515 FALSE FALSE
## logwage 1.077 17.50714 FALSE FALSE
## wage 1.064 18.45861 FALSE FALSE
# to return just the positions of zero- or near-zero predictors:
nsv <- nearZeroVar(training, saveMetrics=FALSE)
nsv
## [1] 3 7
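Once the near-zero-variance predictors are identified, a common follow-up is to drop them before model fitting. A minimal sketch assuming the position vector nsv from above (the filtered object names are illustrative):
# remove the flagged columns (here sex and region) from both data sets
trainingFiltered <- training[, -nsv]
testingFiltered <- testing[, -nsv]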
bs() (package splines): generate the B-spline basis matrix for a polynomial spline.
This is one way to generate new variables and allow more flexibility in how specific variables are modeled.
library(splines)
bsBasis <- bs(training$age, df=3)
bsBasis[1:12,]
## 1 2 3
## [1,] 0.00000 0.000000 0.0000000
## [2,] 0.43214 0.160795 0.0199436
## [3,] 0.44406 0.210927 0.0333968
## [4,] 0.43214 0.160795 0.0199436
## [5,] 0.42215 0.310402 0.0760789
## [6,] 0.44295 0.244786 0.0450922
## [7,] 0.27648 0.037219 0.0016701
## [8,] 0.21297 0.019720 0.0006086
## [9,] 0.17675 0.012854 0.0003116
## [10,] 0.04768 0.303945 0.6458840
## [11,] 0.13742 0.007362 0.0001315
## [12,] 0.27648 0.037219 0.0016701
See also: ns(), poly()
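For comparison, a brief sketch of the analogous calls (object names are illustrative): ns() builds a natural cubic spline basis and poly() an orthogonal polynomial basis.
nsBasis <- ns(training$age, df=3)          # natural cubic spline basis, 3 df
polyBasis <- poly(training$age, degree=3)  # orthogonal polynomials up to degree 3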
lm1 <- lm(wage ~ bsBasis, data=training)
lm1$coefficients
## (Intercept) bsBasis1 bsBasis2 bsBasis3
## 60.22 93.39 51.05 47.28
# plot wage against age and overlay the fitted spline values in red
plot(training$age, training$wage, pch=19, cex=0.5)
points(training$age, predict(lm1, newdata=training), col="red", pch=19, cex=0.5)
On the test set we then have to construct those same covariates. This point is critical whenever you create new covariates for machine learning: they must be built on the test set using the exact same procedure, and the exact same parameters (here, the spline knots estimated from the training ages), that were used on the training set. Concretely, that means calling predict() on the bsBasis object created from the training data rather than refitting bs() on the test data.
# evaluate the training-derived basis at the test-set ages
# (predict.bs takes the new values as its second argument, newx)
head(predict(bsBasis, testing$age), 20)
## 1 2 3
## [1,] 0.00000 0.000000 0.0000000
## [2,] 0.43214 0.160795 0.0199436
## [3,] 0.44406 0.210927 0.0333968
## [4,] 0.43214 0.160795 0.0199436
## [5,] 0.42215 0.310402 0.0760789
## [6,] 0.44295 0.244786 0.0450922
## [7,] 0.27648 0.037219 0.0016701
## [8,] 0.21297 0.019720 0.0006086
## [9,] 0.17675 0.012854 0.0003116
## [10,] 0.04768 0.303945 0.6458840
## [11,] 0.13742 0.007362 0.0001315
## [12,] 0.27648 0.037219 0.0016701
## [13,] 0.21297 0.019720 0.0006086
## [14,] 0.26158 0.439938 0.2466318
## [15,] 0.42945 0.294480 0.0673097
## [16,] 0.27648 0.037219 0.0016701
## [17,] 0.29448 0.429450 0.2087604
## [18,] 0.42419 0.144611 0.0164330
## [19,] 0.43214 0.160795 0.0199436
## [20,] 0.44406 0.210927 0.0333968
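A way to make this bookkeeping automatic is to put the basis function directly into the model formula: R then stores the knots estimated from the training data inside the fitted model and reapplies them when predicting on new data. A minimal sketch (not in the original slides; lm2 is an illustrative name):
# the spline basis is rebuilt from the stored training knots at prediction time
lm2 <- lm(wage ~ bs(age, df=3), data=training)
head(predict(lm2, newdata=testing))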