Sec.14 - BASIC PREPROCESSING

Based on Jeff Leek's slides for the “Practical Machine Learning” course.

Why preprocess?

library(caret) 
library(kernlab) 
library(RANN) 
data(spam)
set.seed(4567)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
hist(training$capitalAve, main="", xlab="ave. capital run length")
[Figure: histogram of training$capitalAve (x-axis: ave. capital run length)]

Very skewed distribution with few data points in a long high-value tail.

mean(training$capitalAve)
## [1] 5.444
sd(training$capitalAve)
## [1] 35.49

Huge standard deviation!

Standardizing

trainCapAve <- training$capitalAve
trainCapAveS <- (trainCapAve  - mean(trainCapAve))/sd(trainCapAve) 
mean(trainCapAveS)
## [1] -1.025e-17
sd(trainCapAveS)
## [1] 1

NOTE: centering and scaling does little to fix the underlying problem. It is only a linear rescaling, so the distribution stays just as skewed, and the standard deviation remains huge compared with the spread of the bulk of the data.
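The note above can be checked numerically. The following is a minimal sketch on simulated log-normal data (not the spam data); the `skewness()` helper is defined here for illustration and does not come from any package. It shows that standardizing leaves the skewness unchanged, while a nonlinear transform such as the logarithm does change the shape.

```r
# Sample skewness as the mean of cubed z-scores (illustrative helper)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(1)
x <- rlnorm(1000, meanlog = 0, sdlog = 1)   # simulated right-skewed data
x_std <- (x - mean(x)) / sd(x)              # center and scale

skew_raw <- skewness(x)        # strongly positive
skew_std <- skewness(x_std)    # same value as skew_raw
skew_log <- skewness(log(x))   # much closer to zero
```

Centering and scaling only shifts and stretches the axis, which is why the two skewness values agree.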

Standardizing : test set

testCapAve <- testing$capitalAve
testCapAveS <- (testCapAve  - mean(trainCapAve))/sd(trainCapAve) 
mean(testCapAveS)
## [1] -0.02848
sd(testCapAveS)
## [1] 0.443

When we apply a prediction algorithm to the test set we have to be aware that we can only use parameters that we estimated in the training set.
In other words, when we apply this same standardization to the test set, we have to use the mean and the standard deviation from the training set to standardize the testing set values.

This means that when we apply this standardization to the test set, the mean will not be exactly zero and the standard deviation will not be exactly one, because we have standardized by parameters estimated in the training set.
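The same idea can be written as a pair of small functions: one estimates the parameters on the training values, the other applies them to any vector. This is only a sketch on simulated data; `fit_standardizer()` and `apply_standardizer()` are hypothetical names chosen for the example.

```r
# Estimate centering/scaling parameters from training values only
fit_standardizer <- function(x) list(mean = mean(x), sd = sd(x))

# Apply previously estimated parameters to any vector
apply_standardizer <- function(params, x) (x - params$mean) / params$sd

set.seed(1)
train_x <- rexp(300, rate = 0.2)   # simulated "training" values
test_x  <- rexp(100, rate = 0.2)   # simulated "test" values

params  <- fit_standardizer(train_x)
train_s <- apply_standardizer(params, train_x)
test_s  <- apply_standardizer(params, test_x)

c(mean(train_s), sd(train_s))   # 0 and 1 up to floating-point error
c(mean(test_s),  sd(test_s))    # close to, but generally not exactly, 0 and 1
```

Keeping the estimation and application steps separate makes it hard to accidentally use test-set statistics.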

Standardizing : preProcess() function

preObj <- preProcess(training[,-58], method=c("center","scale"))
trainCapAveS <- predict(preObj, training[,-58])$capitalAve
mean(trainCapAveS)
## [1] -1.025e-17
sd(trainCapAveS)
## [1] 1

Here preProcess() standardizes (centers and scales) all variables except column 58 (“type”), which is the outcome we want to predict.

Parameters for preProcess() function:

preProcess() results in a list with elements

The object created by preProcess() on the training set can also be used to apply the same preprocessing to another data set, typically the test set.

testCapAveS <- predict(preObj, testing[,-58])$capitalAve
mean(testCapAveS)
## [1] -0.02848
sd(testCapAveS)
## [1] 0.443

Standardizing : preProcess argument to train() function

set.seed(32343)
modelFit <- train(type ~ ., data=training, preProcess=c("center","scale"), method="glm")
modelFit
## Generalized Linear Model 
## 
## 3451 samples
##   57 predictors
##    2 classes: 'nonspam', 'spam' 
## 
## Pre-processing: centered, scaled 
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ... 
## 
## Resampling results
## 
##   Accuracy  Kappa  Accuracy SD  Kappa SD
##   0.9       0.8    0.02         0.05    
## 
## 

Standardizing : Box-Cox transforms

Box-Cox transforms are a set of transformations that take continuous data and try to make them look like normal data. They do that by estimating a specific set of parameters using maximum likelihood.

preObj <- preProcess(training[,-58], method=c("BoxCox"))
trainCapAveS <- predict(preObj, training[,-58])$capitalAve

par(mfrow=c(1,2)) 
hist(trainCapAveS) 
qqnorm(trainCapAveS)
[Figure: histogram (left) and normal Q-Q plot (right) of the Box-Cox transformed capitalAve]
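To make the Box-Cox description above more concrete, here is a rough base-R sketch of the one-parameter transform and a maximum-likelihood estimate of its parameter lambda. It is not caret's implementation: the function names, the simulated data, and the search interval of -2 to 2 are assumptions made for this example. The transform requires strictly positive values.

```r
# Box-Cox transform: (x^lambda - 1) / lambda, with log(x) as the lambda = 0 case
boxcox_transform <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda under a normal model for the transformed data
boxcox_loglik <- function(lambda, x) {
  y <- boxcox_transform(x, lambda)
  n <- length(x)
  -n / 2 * log(mean((y - mean(y))^2)) + (lambda - 1) * sum(log(x))
}

set.seed(1)
x <- rlnorm(2000)   # simulated data whose logarithm is normal

# Search interval is an assumption for this sketch
fit <- optimize(boxcox_loglik, interval = c(-2, 2), x = x, maximum = TRUE)
lambda_hat <- fit$maximum          # should land near 0 for log-normal data
x_bc <- boxcox_transform(x, lambda_hat)
```

For data like these, the estimated lambda is near zero, so the transform is close to taking logs.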

Standardizing : imputing data (K Nearest Neighbors)

set.seed(13343)

# Make some values NA
training$capAveFake <- training$capitalAve
selectNA <- rbinom(dim(training)[1], size=1, prob=0.05) == 1
training$capAveFake[selectNA] <- NA

# Impute and standardize
preObj <- preProcess(training[,-58], method="knnImpute")
capAveFake <- predict(preObj, training[,-58])$capAveFake
# GF: it looks like values are automatically Centered&Scaled anyway.
# GF: non-NA values I presume are not imputed, right?
#     However, because of Centering&Scaling their values change...
# GF: to do this experiment cleanly, shouldn't the existing $capitalAve data be 
#     replaced by the new ones with NAs?  
#     Otherwise the imputing can take advantage of the full $capitalAve to 
#     inform the imputing...

# Standardize true values
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth-mean(capAveTruth))/sd(capAveTruth)
quantile(capAveFake - capAveTruth)                   # all
##         0%        25%        50%        75%       100% 
## -9.2257387 -0.0087032 -0.0058498  0.0001841  4.1743895
quantile((capAveFake - capAveTruth)[selectNA])       # imputed values 
##        0%       25%       50%       75%      100% 
## -9.225739 -0.022477 -0.009016  0.005223  0.159508
quantile((capAveFake - capAveTruth)[!selectNA])      # non-NA values
##         0%        25%        50%        75%       100% 
## -1.089e-02 -8.562e-03 -5.810e-03  2.927e-05  4.174e+00
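For intuition about what k-nearest-neighbour imputation does, the following base-R sketch fills the missing entries of one column with the average of that column over the k most similar complete rows. It is a simplified illustration on simulated data, not caret's algorithm: `knn_impute_column()` is a hypothetical helper, and unlike the preProcess() run above it returns values on the original scale rather than standardized ones.

```r
# Hypothetical helper: impute NAs in `target` from the k nearest complete rows
knn_impute_column <- function(df, target, k = 5) {
  predictors <- setdiff(names(df), target)
  # Standardize predictors so no single variable dominates the distance
  z <- scale(as.matrix(df[, predictors, drop = FALSE]))
  missing <- is.na(df[[target]])
  donors  <- which(!missing)
  out <- df[[target]]
  for (i in which(missing)) {
    # Euclidean distance from row i to every donor row
    d <- sqrt(rowSums(sweep(z[donors, , drop = FALSE], 2, z[i, ])^2))
    nearest <- donors[order(d)[seq_len(min(k, length(donors)))]]
    out[i] <- mean(df[[target]][nearest])   # average of the neighbours' values
  }
  out
}

set.seed(1)
toy <- data.frame(a = rnorm(200), b = rnorm(200))
toy$y <- 2 * toy$a + rnorm(200, sd = 0.1)
truth <- toy$y
holes <- sample(200, 10)          # rows that will lose their y value
toy$y[holes] <- NA

imputed <- knn_impute_column(toy, "y", k = 5)
```

Only the rows in `holes` are changed; observed values pass through untouched.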

My cleaner NA-imputing experiment

set.seed(13343)
fake <- training

# Make some values NA
selectNA <- rbinom(dim(fake)[1], size=1, prob=0.05) == 1
fake$capitalAve[selectNA] <- NA

# Impute and standardize
preObj <- preProcess(fake[, -58], method="knnImpute")
capAve.predicted <- predict(preObj, fake[, -58])$capitalAve

# unscaling and checking what happened to the non-NA values
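# look the parameters up by name rather than by position (capitalAve is column 55)
fake.capAve.mean <- preObj$mean["capitalAve"]
fake.capAve.std <- preObj$std["capitalAve"]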
fake.capAve.unscaled <- fake.capAve.mean + capAve.predicted*fake.capAve.std
cbind(fake$capitalAve, fake.capAve.unscaled)[sample(1:1000,20),]
##             fake.capAve.unscaled
##  [1,] 2.988                2.988
##  [2,] 1.785                1.785
##  [3,] 1.000                1.000
##  [4,] 1.263                1.263
##  [5,] 1.266                1.266
##  [6,] 1.958                1.958
##  [7,] 1.538                1.538
##  [8,] 1.400                1.400
##  [9,] 1.535                1.535
## [10,] 1.000                1.000
## [11,] 1.375                1.375
## [12,] 1.870                1.870
## [13,] 2.000                2.000
## [14,] 1.785                1.785
## [15,] 1.466                1.466
## [16,] 5.857                5.857
## [17,] 1.611                1.611
## [18,] 2.521                2.521
## [19,] 1.746                1.746
## [20,] 3.857                3.857

# Standardize true values
capAve.truth <- training$capitalAve
capAve.truth <- (capAve.truth - mean(capAve.truth))/sd(capAve.truth)

# comparisons
quantile(capAve.predicted - capAve.truth)                 # all
##         0%        25%        50%        75%       100% 
## -1.141e+01 -8.703e-03 -5.850e-03  1.841e-04  4.174e+00
quantile((capAve.predicted - capAve.truth)[selectNA])     # imputed values 
##         0%        25%        50%        75%       100% 
## -11.407590  -0.022477  -0.009016   0.005501   0.159508
quantile((capAve.predicted - capAve.truth)[!selectNA])    # non-NA values
##         0%        25%        50%        75%       100% 
## -1.089e-02 -8.562e-03 -5.810e-03  2.927e-05  4.174e+00
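The unscaling step used above simply inverts z = (x - mean) / sd. A minimal round-trip check on simulated data, with variable names chosen only for this sketch, looks like this:

```r
set.seed(1)
x <- rexp(50, rate = 0.5)
x[c(3, 17)] <- NA                 # two artificial gaps

mu    <- mean(x, na.rm = TRUE)    # parameters estimated from observed values
sigma <- sd(x, na.rm = TRUE)

z      <- (x - mu) / sigma        # standardize; NAs stay NA
x_back <- mu + z * sigma          # invert the standardization

all.equal(x_back, x)              # observed values are recovered
```

This is why the non-NA values in the side-by-side table match: standardizing and then unscaling with the same mean and standard deviation is lossless up to floating-point error.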

Notes and further reading