Why Preprocess?

We need to plot the variables up front so we can see whether they show any strange behavior. Sometimes a predictor's distribution looks very skewed or otherwise odd, and we may need to transform it to make it more useful for prediction algorithms. This is particularly true when we're using model-based algorithms, like linear discriminant analysis, naive Bayes, linear regression, and so on. We'll talk about all of those methods later in the class, but keep in mind that preprocessing is often more useful for model-based approaches than for non-parametric approaches.

Now, let's use the same caret package and explore this in detail.

library(caret)
library(kernlab)
data(spam)
inTrain<-createDataPartition(y=spam$type, p=0.75, list=FALSE)
training<-spam[inTrain,]
testing<-spam[-inTrain,]
hist(training$capitalAve,main="",xlab="ave. capital run length")

Using this dataset, we are trying to predict whether an email is spam or ham, for example by looking at the average run length of capital letters in each email. Looking at the histogram, we can see that almost all of the emails have a very small average capital run length, while a few have a much larger one, so the variable is heavily skewed. Since we are going to build a model-based predictor, this is exactly the kind of variable we may want to preprocess.

Checking the Variables

Now let's check the mean and the standard deviation of the average capital run length (capitalAve) in the training set.

mean(training$capitalAve)
## [1] 5.130038
sd(training$capitalAve)
## [1] 30.65889

The mean of capitalAve is about 5.13, but the standard deviation is much more dramatic, at about 30.66.

a. Standardizing the Variable in the Training Set

So, we want to preprocess so that the machine learning algorithm isn't thrown off by this highly skewed, highly variable predictor. One way to do this is to standardize the variable. We create a couple of new variables: trainCapAve, and then trainCapAveS, which is the standardized form of the capitalAve variable.

trainCapAve<-training$capitalAve
trainCapAveS<-(trainCapAve-mean(trainCapAve))/sd(trainCapAve)
mean(trainCapAveS)
## [1] -6.225931e-18
sd(trainCapAveS)
## [1] 1

The usual way of standardizing a variable is to take its values, subtract the mean, and divide by the standard deviation. Once we do so, the standardized variable has mean zero and standard deviation one, which removes the extreme variability we saw in the previous calculation.
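Written as a formula, where $\bar{x}$ and $s$ are the mean and standard deviation of the variable in the training set, this is simply:

$$x_{\text{standardized}} = \frac{x - \bar{x}}{s}$$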

b. Standardizing the Testing Set

One thing to keep in mind is that when we apply a prediction algorithm to the test set, we can only use parameters that we estimated in the training set. In other words, when we apply this same standardization to the test set, we have to use the mean and the standard deviation from the training set to standardize the testing set values.

What does this mean?

It means that when we do this standardization, the mean of the standardized test-set values will not be exactly zero, and the standard deviation will not be exactly one, because we standardized using parameters estimated on the training set rather than values computed from the test set itself. Hopefully, though, they will be close to zero and one. Let's see if that's the case here:

testCapAve<-testing$capitalAve
testCapAveS<-(testCapAve - mean(trainCapAve))/sd(trainCapAve)
mean(testCapAveS)
## [1] 0.008022498
sd(testCapAveS)
## [1] 1.133708

As shown above, the mean of the standardized test-set values is about 0.008 and the standard deviation is about 1.13. They are close to, but clearly not exactly, zero and one.

Using the preProcess Function to Do the Same Thing

a. Training Set

We can also use the preProcess function, which is built into the caret package, to do a lot of this standardization for us. Here we pass it all of the training variables except the 58th column of the data set, which is the outcome variable we are trying to predict.

preObj<-preProcess(training[,-58], method=c("center", "scale"))
trainCapAveS<-predict(preObj, training[,-58])$capitalAve
mean(trainCapAveS)
## [1] -6.225931e-18
sd(trainCapAveS)
## [1] 1

Looking at the mean and standard deviation of capitalAve, just as we did before, we see that after preProcess the mean is 0 and the standard deviation is 1. So preProcess can perform a lot of the preprocessing techniques that we previously had to do by hand.

b. The Testing Set

The other thing we can do is use the object created by preProcess to apply the same preprocessing to the test set.

testCapAveS<-predict(preObj, testing[,-58])$capitalAve
mean(testCapAveS)
## [1] 0.008022498
sd(testCapAveS)
## [1] 1.133708

If we pass the testing set to the predict function, as shown above, it takes the transformations calculated in preObj (the preprocessing object) and applies them to the test set. The results are identical to the manual standardization of the test set, and, as before, the test-set mean and standard deviation are close to but not exactly 0 and 1, because we normalized using the training-set values. So preProcess gives us the same results with much less effort.
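As a quick sanity check, we can recompute the manual standardization and compare it with the preProcess-based result (manualTestCapAveS is just an illustrative name introduced here; testCapAveS now holds the preProcess output from the previous chunk):

manualTestCapAveS<-(testing$capitalAve-mean(training$capitalAve))/sd(training$capitalAve)
all.equal(manualTestCapAveS, testCapAveS)

This should report that the two are equal, since preProcess stores the training-set means and standard deviations and applies exactly the same centering and scaling.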

c. Another Example

We can also pass preprocessing commands directly to the train function in caret as an argument. For example, here we pass the parameters center and scale to the preProcess argument of train, which will center and scale all of the predictors before they are used in the prediction model.

set.seed(32343)
modelFit<-train(type~., data=training, preProcess=c("center","scale"),method="glm")
modelFit
## Generalized Linear Model 
## 
## 3451 samples
##   57 predictor
##    2 classes: 'nonspam', 'spam' 
## 
## Pre-processing: centered (57), scaled (57) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9182901  0.8285312

Standardizing: Box-Cox Transforms

Centering and scaling is one approach, and it takes care of some of the problems we see in these data, such as predictors with very different scales or very high variability. Another option is to use a different kind of transformation, such as the Box-Cox transform. Box-Cox transforms are a family of transformations that take continuous, positive data and try to make it look more like normally distributed data, estimating the transformation parameters by maximum likelihood.
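Loosely, the family is indexed by a parameter lambda, which preProcess estimates for us. The sketch below is only meant to illustrate the shape of the transform; boxCox and the lambda value are made up for illustration and are not what caret would compute:

# Box-Cox family: (x^lambda - 1)/lambda for lambda != 0, and log(x) for lambda == 0;
# x must be strictly positive. lambda = 0.1 is an arbitrary illustrative value,
# not the maximum-likelihood estimate that preProcess would find.
boxCox<-function(x, lambda) if (lambda == 0) log(x) else (x^lambda - 1)/lambda
illustrative<-boxCox(training$capitalAve, lambda=0.1)

Applying the transform through preProcess, which estimates lambda itself, looks like this: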

preObj<-preProcess(training[,-58],method=c("BoxCox"))
trainCapAveS<-predict(preObj, training[,-58])$capitalAve
par(mfrow=c(1,2))
hist(trainCapAveS)
qqnorm(trainCapAveS)

After the transform, the histogram looks somewhat more like a normal distribution than the first histogram did, but it is still far from a bell curve. The first takeaway is that Box-Cox doesn't take care of all of the problems, especially when the data are this heavily skewed. The normal Q-Q plot shows the same thing: it plots the theoretical quantiles of the normal distribution against the sample quantiles of our preprocessed data, and the points don't line up perfectly. In particular, the points at the bottom don't lie on the 45-degree line; there is still a stack of identical values there, because Box-Cox is a continuous transform and doesn't do anything about values that are repeated.

Standardizing: Imputing Data

It's very common to have missing data in a dataset, and prediction algorithms often fail when the data they are given contain missing values, because most of them are not built to handle missing data. If we have missing data, we can impute it using something called k-nearest neighbors imputation. Let's check this example:

a. Make Some Values NA

set.seed(13343)
# copy capitalAve, then set roughly 5% of the values to NA at random
training$capAve<-training$capitalAve
selectNA<-rbinom(dim(training)[1],size=1, prob=0.05)==1
training$capAve[selectNA]<-NA

First, we copy the capitalAve values into a new variable called training$capAve. We then randomly choose about 5% of the rows and set their capAve values to NA.

b. Impute and Standardize

Now, let's impute and standardize the missing values.

library(RANN)
preObj<-preProcess(training[,-58],method="knnImpute")
capAve<-predict(preObj, training[,-58])$capAve

Here we use the preProcess function and tell it to do k-nearest neighbors imputation. K-nearest neighbors imputation finds the k data vectors that look most like the data vector with the missing value, averages their values of the missing variable, and fills in (imputes) the missing value with that average. If we then call predict on our training set, we get all of the values, including the ones filled in by the k-nearest neighbors imputation algorithm.
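Conceptually, for a single row the idea looks roughly like the sketch below. This is only an illustration of the idea, not caret's actual knnImpute implementation (which also centers and scales the data first); knnImputeOne and its arguments are made-up names:

# For one row with a missing value in `target`: find the k complete rows whose
# other predictors are closest in Euclidean distance, then average their
# `target` values. Assumes every column of df is numeric.
knnImputeOne<-function(df, row, target, k=5){
  others<-setdiff(names(df), target)
  complete<-df[!is.na(df[[target]]),]
  dists<-apply(complete[,others,drop=FALSE], 1,
               function(x) sqrt(sum((x-unlist(df[row,others]))^2)))
  mean(complete[[target]][order(dists)[1:k]])
}
# e.g., impute the first missing capAve using the raw (unscaled) predictors:
knnImputeOne(training[,-58], row=which(is.na(training$capAve))[1], target="capAve")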

c. Standardizing

To compare, we also standardize the true capitalAve values (the knnImpute preprocessing has already centered and scaled the imputed ones), using the same procedure as before: subtract the mean and divide by the standard deviation.

capAveTruth<-training$capitalAve
capAveTruth<-(capAveTruth-mean(capAveTruth))/sd(capAveTruth)

Now we can look at the imputed values alongside the values that were truly there before we set them to NA, and see how close the two are to each other.

quantile(capAve-capAveTruth)
##            0%           25%           50%           75%          100% 
## -5.4017590992  0.0006407855  0.0017965784  0.0041675398  1.7916146624

The middle three quantiles of the differences are very close to zero, so for most observations the preprocessed values are essentially identical to the truth. The extremes (0% and 100%), however, are much larger, so a handful of observations end up quite far from their true values.

Now, let's check the quantiles of the differences again, but only for the values we actually set to NA.

quantile((capAve-capAveTruth)[selectNA])
##           0%          25%          50%          75%         100% 
## -5.401759099 -0.016637361  0.001061465  0.017585245  0.198181368

We see a similar pattern: the middle three quantiles of the differences are close to zero, while the extremes are comparatively larger. The differences are somewhat bigger than in the overall comparison, which makes sense, because these are the values that actually had to be imputed.

Finally, let's look at the differences for the values that we did not set to NA.

quantile((capAve-capAveTruth)[!selectNA])
##            0%           25%           50%           75%          100% 
## -0.0002303294  0.0006985344  0.0018030853  0.0039983597  1.7916146624

These differences are even closer to zero (apart from the single large value at the top), which makes sense: these values were never removed, so the only differences come from the preprocessing having standardized capAve using the mean and standard deviation of the non-missing values, which differ slightly from those of the full variable.

Overall, then, the imputed values track the true values reasonably well for most observations, so k-nearest neighbors imputation is a sensible way to handle missing data before fitting a prediction model.