Sec.14 - BASIC PREPROCESSING

Based on Jeff Leek's slides for the “Practical Machine Learning” course.

Why preprocess?

library(caret) 
library(kernlab) 
library(RANN)

data(spam)
set.seed(4567)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
hist(training$capitalAve, main="", xlab="ave. capital run length")

Very skewed distribution with few data points in a long high-value tail.

mean(training$capitalAve)
## [1] 5.444
sd(training$capitalAve)
## [1] 35.49

Huge standard deviation: more than six times the mean!

Standardizing

trainCapAve <- training$capitalAve
trainCapAveS <- (trainCapAve  - mean(trainCapAve))/sd(trainCapAve) 
mean(trainCapAveS)
## [1] -1.025e-17
sd(trainCapAveS)
## [1] 1

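NOTE: centering and scaling do not really fix the problem of a standard deviation that is huge compared with the spread of the bulk of the data. The transformation is linear, so the shape of the distribution, including the long tail, is unchanged.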
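
To see why centering and scaling leave the skew in place, it may help to compare a shape statistic before and after. The sketch below uses simulated log-normal data as a stand-in for capitalAve; the data and the `skewness` helper are illustrative assumptions, not part of the course material.

```r
# Simulated right-skewed data standing in for capitalAve (assumption: log-normal).
set.seed(1)
x <- rlnorm(1000, meanlog = 0, sdlog = 1.5)

# Sample skewness (mean of the cubed standardized values); hypothetical helper.
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

xS <- (x - mean(x)) / sd(x)     # centered and scaled

skew_raw <- skewness(x)         # strongly positive
skew_std <- skewness(xS)        # same value: a linear map keeps the shape
skew_log <- skewness(log(x))    # a nonlinear transform does change the shape

c(raw = skew_raw, standardized = skew_std, log = skew_log)
```

The first two values agree up to rounding error, while the log-transformed values are far less skewed.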

Standardizing : test set

testCapAve <- testing$capitalAve
testCapAveS <- (testCapAve  - mean(trainCapAve))/sd(trainCapAve) 
mean(testCapAveS)
## [1] -0.02848
sd(testCapAveS)
## [1] 0.443

When we apply a prediction algorithm to the test set, we can only use parameters that were estimated on the training set.
In other words, to standardize the test set values we must use the mean and the standard deviation computed from the training set.

As a result, the standardized test values will not have a mean of exactly zero or a standard deviation of exactly one (here they are -0.02848 and 0.443), because the parameters were estimated on the training set.
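
One way to make the "estimate on training, apply everywhere" rule hard to break is to separate the two steps into functions. This is a minimal base-R sketch on simulated data; `fit_standardizer` and `apply_standardizer` are hypothetical helper names, not caret functions.

```r
# Estimate the standardization parameters from the training values only.
fit_standardizer <- function(x) {
  list(mean = mean(x), sd = sd(x))
}

# Apply previously estimated parameters to any vector (training or test).
apply_standardizer <- function(params, x) {
  (x - params$mean) / params$sd
}

# Simulated data (assumption): a 75/25 split of one skewed variable.
set.seed(2)
x <- rexp(400, rate = 0.2)
idx <- sample(seq_along(x), size = 300)
train_x <- x[idx]
test_x  <- x[-idx]

params  <- fit_standardizer(train_x)
train_s <- apply_standardizer(params, train_x)
test_s  <- apply_standardizer(params, test_x)

c(mean(train_s), sd(train_s))   # 0 and 1 up to rounding error
c(mean(test_s), sd(test_s))     # close to, but not exactly, 0 and 1
```

This is the same split of responsibilities that preProcess() and predict() provide in caret.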

Standardizing : preProcess() function

preObj <- preProcess(training[,-58], method=c("center","scale"))
trainCapAveS <- predict(preObj, training[,-58])$capitalAve
mean(trainCapAveS)
## [1] -1.025e-17
sd(trainCapAveS)
## [1] 1

This standardizes (centers and scales) all variables except #58 (“type”), which is the outcome we want to predict.

Parameters for preProcess() function:

x : a matrix or data frame. All variables must be numeric.
method : a character vector specifying the type of processing.
Possible values are “BoxCox”, “YeoJohnson”, “expoTrans”, “center”, “scale”, “range”,
“knnImpute”, “bagImpute”, “medianImpute”, “pca”, “ica” and “spatialSign”.
thresh : a cutoff for the cumulative percent of variance to be retained by PCA
pcaComp : the specific number of PCA components to keep. If specified, this over-rides thresh
na.remove : a logical; should missing values be removed from the calculations?
object : an object of class preProcess (argument of predict(), not of preProcess()).
newdata : a matrix or data frame of new data to be pre-processed (argument of predict()).
k : the number of nearest neighbors from the training set to use for imputation.
knnSummary : function to average the neighbor values per column during imputation.
outcome : a numeric or factor vector for the training set outcomes.
This can be used to help estimate the Box-Cox transformation of the predictor variables.
fudge : a tolerance value: Box-Cox transformation lambda values within +/-fudge will be coerced
to 0 and within 1+/-fudge will be coerced to 1.
numUnique : how many unique values should y have to estimate the Box-Cox transformation?
verbose : a logical: prints a log as the computations proceed.
… : additional arguments to pass to fastICA, such as n.comp.
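
The rounding rule described for fudge can be sketched in a few lines. `coerce_lambda` is a hypothetical helper written for illustration, not caret's internal code, and the tolerance of 0.2 used as its default is an assumption.

```r
# Hypothetical helper illustrating the rounding rule described for `fudge`:
# lambda values near 0 or near 1 are snapped to exactly 0 or 1.
coerce_lambda <- function(lambda, fudge = 0.2) {
  if (abs(lambda) <= fudge) {
    0          # lambda = 0 corresponds to a log transform
  } else if (abs(lambda - 1) <= fudge) {
    1          # lambda = 1 leaves the shape of the data unchanged
  } else {
    lambda     # keep the estimated value
  }
}

lambdas <- c(-0.1, 0.15, 0.5, 0.9, 1.1, 2)
coerced <- sapply(lambdas, coerce_lambda)
coerced   # -> 0, 0, 0.5, 1, 1, 2
```

Snapping to 0 or 1 trades a tiny loss of fit for a transformation that is easier to interpret.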

preProcess() returns a list with the following elements:

call : the function call.
dim : the dimensions of x.
bc : Box-Cox transformation values, see BoxCoxTrans.
mean : a vector of means (if centering was requested).
std : a vector of standard deviations (if scaling or PCA was requested).
rotation : a matrix of eigenvectors if PCA was requested.
method : the value of method.
thresh : the value of thresh.
ranges : a matrix of min and max values for each predictor when method includes “range” (and NULL otherwise).
numComp : the number of principal components required to capture the specified amount of variance.
ica : contains values for the W and K matrix of the decomposition.
median : a vector of medians (if median imputation was requested).

The object created by preProcess() on the training set can also be passed to predict() to apply the same preprocessing to another data set, typically the test set.

testCapAveS <- predict(preObj, testing[,-58])$capitalAve
mean(testCapAveS)
## [1] -0.02848
sd(testCapAveS)
## [1] 0.443

Standardizing : preProcess argument to train() function

set.seed(32343)
modelFit <- train(type ~ ., data=training, preProcess=c("center","scale"), method="glm")
modelFit
## Generalized Linear Model 
## 
## 3451 samples
##   57 predictors
##    2 classes: 'nonspam', 'spam' 
## 
## Pre-processing: centered, scaled 
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ... 
## 
## Resampling results
## 
##   Accuracy  Kappa  Accuracy SD  Kappa SD
##   0.9       0.8    0.02         0.05    
## 
##

Standardizing : Box-Cox transforms

Box-Cox transforms are a set of transformations that take continuous data and try to make them look like normal data. They do that by estimating a specific set of parameters using maximum likelihood.

preObj <- preProcess(training[,-58], method=c("BoxCox"))
trainCapAveS <- predict(preObj, training[,-58])$capitalAve

par(mfrow=c(1,2)) 
hist(trainCapAveS) 
qqnorm(trainCapAveS)
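
To make the "estimate the parameter by maximum likelihood" step concrete, here is a minimal base-R sketch of a one-variable Box-Cox fit. It assumes strictly positive data, uses simulated log-normal values, and searches lambda over [-2, 2]; the helper names and the search range are assumptions, and this is not caret's implementation.

```r
# Box-Cox transform for strictly positive x.
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda, assuming the transformed data are normal.
# The last term is the log-Jacobian of the transformation.
box_cox_loglik <- function(lambda, x) {
  y <- box_cox(x, lambda)
  n <- length(x)
  sigma2 <- mean((y - mean(y))^2)
  -n / 2 * log(sigma2) + (lambda - 1) * sum(log(x))
}

set.seed(3)
x <- rlnorm(500, meanlog = 1, sdlog = 0.8)   # simulated skewed, positive data

# Maximize the profile log-likelihood over the assumed range [-2, 2].
fit <- optimize(box_cox_loglik, interval = c(-2, 2), x = x, maximum = TRUE)
lambda_hat <- fit$maximum
xT <- box_cox(x, lambda_hat)

lambda_hat   # expected to be near 0, since log(x) is exactly normal here
```

The transformed values `xT` can then be checked with hist() and qqnorm(), as done above for capitalAve.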

Standardizing : imputing data (K Nearest Neighbors)

set.seed(13343)

# Make some values NA
training$capAveFake <- training$capitalAve
selectNA <- rbinom(dim(training)[1], size=1, prob=0.05) == 1
training$capAveFake[selectNA] <- NA

# Impute and standardize
preObj <- preProcess(training[,-58], method="knnImpute")
capAveFake <- predict(preObj, training[,-58])$capAveFake
# GF: it looks like values are automatically Centered&Scaled anyway.
# GF: non-NA values I presume are not imputed, right?
#     However, because of Centering&Scaling their values change...
# GF: to do this experiment cleanly, shouldn't the existing $capitalAve data be 
#     replaced by the new ones with NAs?  
#     Otherwise the imputing can take advantage of the full $capitalAve to 
#     inform the imputing...

# Standardize true values
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth-mean(capAveTruth))/sd(capAveTruth)

quantile(capAveFake - capAveTruth)                   # all
##         0%        25%        50%        75%       100% 
## -9.2257387 -0.0087032 -0.0058498  0.0001841  4.1743895
quantile((capAveFake - capAveTruth)[selectNA])       # imputed values 
##        0%       25%       50%       75%      100% 
## -9.225739 -0.022477 -0.009016  0.005223  0.159508
quantile((capAveFake - capAveTruth)[!selectNA])      # non-NA values
##         0%        25%        50%        75%       100% 
## -1.089e-02 -8.562e-03 -5.810e-03  2.927e-05  4.174e+00
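
For intuition about what K-nearest-neighbor imputation does, the sketch below fills in one column by averaging that column over the k most similar complete rows, with similarity measured on the other (centered and scaled) columns. It is a simplified illustration on simulated data; `knn_impute_column` is a hypothetical helper, and caret's actual implementation differs in its details.

```r
# Impute missing entries of column `target` from the k nearest complete rows,
# measuring Euclidean distance on the remaining (standardized) columns.
knn_impute_column <- function(X, target, k = 5) {
  others   <- setdiff(seq_len(ncol(X)), target)
  Z        <- scale(X[, others, drop = FALSE])   # center and scale predictors
  missing  <- which(is.na(X[, target]))
  complete <- which(!is.na(X[, target]))
  for (i in missing) {
    # Distance from row i to every row with an observed target value
    d <- sqrt(rowSums(sweep(Z[complete, , drop = FALSE], 2, Z[i, ])^2))
    nearest <- complete[order(d)[seq_len(k)]]
    X[i, target] <- mean(X[nearest, target])
  }
  X
}

# Simulated data (assumption): column y depends on columns a and b.
set.seed(4)
n <- 200
X <- cbind(a = rnorm(n), b = rnorm(n))
X <- cbind(X, y = 2 * X[, "a"] - X[, "b"] + rnorm(n, sd = 0.1))
truth <- X[, "y"]

# Blank out every 20th value (5% of rows) so the example is deterministic.
isNA <- seq_len(n) %% 20 == 0
X[isNA, "y"] <- NA

imputed <- knn_impute_column(X, target = 3, k = 5)
quantile(imputed[isNA, "y"] - truth[isNA])   # imputation errors
```

Only the rows that had NA are changed; observed values pass through untouched, which is why any difference seen for the non-NA values in the experiment above must come from centering and scaling rather than from imputation.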

My cleaner NA-imputing experiment

set.seed(13343)
fake <- training

# Make some values NA
selectNA <- rbinom(dim(fake)[1], size=1, prob=0.05) == 1
fake$capitalAve[selectNA] <- NA

# Impute and standardize
preObj <- preProcess(fake[, -58], method="knnImpute")
capAve.predicted <- predict(preObj, fake[, -58])$capitalAve

# unscaling and checking what happened to the non-NA values
fake.capAve.mean <- preObj$mean[55]
fake.capAve.std <- preObj$std[55]
fake.capAve.unscaled <- fake.capAve.mean + capAve.predicted*fake.capAve.std
cbind(fake$capitalAve, fake.capAve.unscaled)[sample(1:1000,20),]
##             fake.capAve.unscaled
##  [1,] 2.988                2.988
##  [2,] 1.785                1.785
##  [3,] 1.000                1.000
##  [4,] 1.263                1.263
##  [5,] 1.266                1.266
##  [6,] 1.958                1.958
##  [7,] 1.538                1.538
##  [8,] 1.400                1.400
##  [9,] 1.535                1.535
## [10,] 1.000                1.000
## [11,] 1.375                1.375
## [12,] 1.870                1.870
## [13,] 2.000                2.000
## [14,] 1.785                1.785
## [15,] 1.466                1.466
## [16,] 5.857                5.857
## [17,] 1.611                1.611
## [18,] 2.521                2.521
## [19,] 1.746                1.746
## [20,] 3.857                3.857

# Standardize true values
capAve.truth <- training$capitalAve
capAve.truth <- (capAve.truth - mean(capAve.truth))/sd(capAve.truth)

# comparisons
quantile(capAve.predicted - capAve.truth)                 # all
##         0%        25%        50%        75%       100% 
## -1.141e+01 -8.703e-03 -5.850e-03  1.841e-04  4.174e+00
quantile((capAve.predicted - capAve.truth)[selectNA])     # imputed values 
##         0%        25%        50%        75%       100% 
## -11.407590  -0.022477  -0.009016   0.005501   0.159508
quantile((capAve.predicted - capAve.truth)[!selectNA])    # non-NA values
##         0%        25%        50%        75%       100% 
## -1.089e-02 -8.562e-03 -5.810e-03  2.927e-05  4.174e+00

Notes and further reading

Training and test must be processed in the same way
Test transformations will likely be imperfect
- Especially if the test and training sets were collected at different times
Careful when transforming factor variables!
Further reading: the preprocessing section of the caret documentation