Based on Jeff Leek's slides for the “Practical Machine Learning” course.
library(caret)
library(kernlab)  # provides the spam dataset
library(RANN)     # nearest-neighbour search used by knnImpute
data(spam)
set.seed(4567)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
hist(training$capitalAve, main="", xlab="ave. capital run length")
A very skewed distribution, with few data points in a long high-value tail.
mean(training$capitalAve)
## [1] 5.444
sd(training$capitalAve)
## [1] 35.49
Huge standard deviation!
trainCapAve <- training$capitalAve
trainCapAveS <- (trainCapAve - mean(trainCapAve))/sd(trainCapAve)
mean(trainCapAveS)
## [1] -1.025e-17
sd(trainCapAveS)
## [1] 1
NOTE: centering and scaling do not really fix the problem: the variable is just as skewed, and the tail values are still huge compared with the spread of the bulk of the data…
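As an aside (my sketch, not from the slides): a log transform actually compresses the long right tail, unlike centering and scaling, which only shift and rescale it.
trainCapAveLog <- log10(trainCapAve + 1)  # +1 guards against log10(0)
hist(trainCapAveLog, main="", xlab="log10(ave. capital run length + 1)")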
testCapAve <- testing$capitalAve
testCapAveS <- (testCapAve - mean(trainCapAve))/sd(trainCapAve)
mean(testCapAveS)
## [1] -0.02848
sd(testCapAveS)
## [1] 0.443
When we apply a prediction algorithm to the test set, we can only use parameters that were estimated on the training set.
In other words, when we standardize the test set we must use the mean and the standard deviation computed from the training set.
As a consequence, the mean of the standardized test-set values will not be exactly zero and their standard deviation will not be exactly one, because the standardization parameters were estimated on a different sample.
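To see the contrast (my sketch): standardizing the test set with its own mean and standard deviation would force mean 0 and sd 1, but those parameters would not be available at prediction time, and the two sets would no longer be on a comparable scale.
# WRONG, for illustration only: uses test-set parameters
testCapAveWrong <- (testCapAve - mean(testCapAve))/sd(testCapAve)
mean(testCapAveWrong)  # exactly 0, but only by construction
sd(testCapAveWrong)    # exactly 1, but only by construction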
preObj <- preProcess(training[,-58], method=c("center","scale"))
trainCapAveS <- predict(preObj, training[,-58])$capitalAve
mean(trainCapAveS)
## [1] -1.025e-17
sd(trainCapAveS)
## [1] 1
Standardizing (centering and scaling) all variables except #58 (“type”), which is the outcome we want to predict.
Some parameters of the preProcess() function: method (the transformations to apply, e.g. "center", "scale", "BoxCox", "knnImpute", "pca"); k (the number of nearest neighbours for knnImpute); numUnique (how many unique values y must have to estimate the Box-Cox transformation); n.comp (the number of ICA components). preProcess() returns an object whose elements store the parameters estimated on the training set (e.g. the per-column means and standard deviations).
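For instance (a quick check of mine), the estimated training-set parameters can be read off the returned object:
head(preObj$mean, 3)  # per-column training means
head(preObj$std, 3)   # per-column training standard deviations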
The object created by preProcess() on the training set can then be used to apply the same preprocessing to another data set, typically the test set.
testCapAveS <- predict(preObj, testing[,-58])$capitalAve
mean(testCapAveS)
## [1] -0.02848
sd(testCapAveS)
## [1] 0.443
set.seed(32343)
modelFit <- train(type ~ ., data=training, preProcess=c("center","scale"), method="glm")
modelFit
## Generalized Linear Model
##
## 3451 samples
## 57 predictors
## 2 classes: 'nonspam', 'spam'
##
## Pre-processing: centered, scaled
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
##
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.9 0.8 0.02 0.05
##
##
Box-Cox transforms are a family of transformations that take continuous data and try to make it look like normally distributed data. They do this by estimating a transformation parameter for each variable via maximum likelihood.
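For reference, the transform itself is simple to write down; a minimal sketch (mine, not from the slides):
boxCoxManual <- function(y, lambda) {
  # Box-Cox: (y^lambda - 1)/lambda for lambda != 0, log(y) for lambda = 0.
  # preProcess() estimates lambda per variable by maximum likelihood.
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1)/lambda
}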
preObj <- preProcess(training[,-58], method=c("BoxCox"))
trainCapAveS <- predict(preObj, training[,-58])$capitalAve
par(mfrow=c(1,2))
hist(trainCapAveS)
qqnorm(trainCapAveS)
set.seed(13343)
# Make some values NA
training$capAveFake <- training$capitalAve
selectNA <- rbinom(dim(training)[1], size=1, prob=0.05) == 1
training$capAveFake[selectNA] <- NA
# Impute and standardize
preObj <- preProcess(training[,-58], method="knnImpute")
capAveFake <- predict(preObj, training[,-58])$capAveFake
# GF: it looks like the values are automatically centered & scaled anyway.
# GF: non-NA values, I presume, are not imputed; however, because of the
# centering & scaling, their values do change.
# GF: to do this experiment cleanly, shouldn't the existing $capitalAve data
# be replaced by the version containing the NAs? Otherwise the imputation
# can take advantage of the full $capitalAve column. (This is re-done below.)
# Standardize true values
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth-mean(capAveTruth))/sd(capAveTruth)
quantile(capAveFake - capAveTruth) # all
## 0% 25% 50% 75% 100%
## -9.2257387 -0.0087032 -0.0058498 0.0001841 4.1743895
quantile((capAveFake - capAveTruth)[selectNA]) # imputed values
## 0% 25% 50% 75% 100%
## -9.225739 -0.022477 -0.009016 0.005223 0.159508
quantile((capAveFake - capAveTruth)[!selectNA]) # non-NA values
## 0% 25% 50% 75% 100%
## -1.089e-02 -8.562e-03 -5.810e-03 2.927e-05 4.174e+00
# Re-run the experiment cleanly: punch the NAs into $capitalAve itself, so
# the imputation cannot peek at the intact column.
set.seed(13343)
fake <- training
# Make 5% of the values NA
selectNA <- rbinom(dim(fake)[1], size=1, prob=0.05) == 1
fake$capitalAve[selectNA] <- NA
# Impute and standardize
preObj <- preProcess(fake[, -58], method="knnImpute")
capAve.predicted <- predict(preObj, fake[, -58])$capitalAve
# Un-scale (invert the centering & scaling) and check what happened to the
# non-NA values; capitalAve is column 55 of the predictors.
fake.capAve.mean <- preObj$mean[55]
fake.capAve.std <- preObj$std[55]
fake.capAve.unscaled <- fake.capAve.mean + capAve.predicted*fake.capAve.std
cbind(fake$capitalAve, fake.capAve.unscaled)[sample(1:1000,20),]
## fake.capAve.unscaled
## [1,] 2.988 2.988
## [2,] 1.785 1.785
## [3,] 1.000 1.000
## [4,] 1.263 1.263
## [5,] 1.266 1.266
## [6,] 1.958 1.958
## [7,] 1.538 1.538
## [8,] 1.400 1.400
## [9,] 1.535 1.535
## [10,] 1.000 1.000
## [11,] 1.375 1.375
## [12,] 1.870 1.870
## [13,] 2.000 2.000
## [14,] 1.785 1.785
## [15,] 1.466 1.466
## [16,] 5.857 5.857
## [17,] 1.611 1.611
## [18,] 2.521 2.521
## [19,] 1.746 1.746
## [20,] 3.857 3.857
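A programmatic version of the same check (my addition): after un-scaling, the non-NA entries should match the originals up to floating-point error.
all.equal(unname(fake$capitalAve[!selectNA]),
          unname(fake.capAve.unscaled[!selectNA]))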
# Standardize true values
capAve.truth <- training$capitalAve
capAve.truth <- (capAve.truth - mean(capAve.truth))/sd(capAve.truth)
# comparisons
quantile(capAve.predicted - capAve.truth) # all
## 0% 25% 50% 75% 100%
## -1.141e+01 -8.703e-03 -5.850e-03 1.841e-04 4.174e+00
quantile((capAve.predicted - capAve.truth)[selectNA]) # imputed values
## 0% 25% 50% 75% 100%
## -11.407590 -0.022477 -0.009016 0.005501 0.159508
quantile((capAve.predicted - capAve.truth)[!selectNA]) # non-NA values
## 0% 25% 50% 75% 100%
## -1.089e-02 -8.562e-03 -5.810e-03 2.927e-05 4.174e+00
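Finally, a sketch of how this is typically used end to end (my addition, not from the slides): imputation can be requested directly in train(), so the parameters estimated on the training set are reused automatically at prediction time. Note that spam has no real NAs, so this only demonstrates the mechanics; as observed above, knnImpute centers and scales as well.
spamTrain <- spam[inTrain,]   # fresh copies, without the experiment columns
spamTest <- spam[-inTrain,]
set.seed(32343)
modelFit2 <- train(type ~ ., data=spamTrain,
                   preProcess=c("knnImpute"), method="glm")
predictions <- predict(modelFit2, newdata=spamTest)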