This section is about preprocessing covariates with principal components analysis (PCA). Often we have multiple quantitative variables, and sometimes they are highly correlated with each other; in other words, they are very nearly the same variable. In that case it is not necessarily useful to include every variable in the model. We might instead want some summary that captures most of the information in those quantitative variables.
Let's explore this with the spam dataset.
library(caret)
library(kernlab)
data(spam)
inTrain<-createDataPartition(y=spam$type, p=0.75, list=FALSE)
training<-spam[inTrain,]
testing<-spam[-inTrain,]
M<-abs(cor(training[,-58])) # drop column 58 (type), the outcome; compute the absolute correlations among all the other variables in the training set
diag(M)<-0 # set each variable's correlation with itself to zero, since every variable trivially has correlation 1 with itself
which(M>0.8, arr.ind=T) # find which pairs of variables have correlation above 0.8
## row col
## num415 34 32
## direct 40 32
## num857 32 34
## direct 40 34
## num857 32 40
## num415 34 40
It looks like three variables, columns 32, 34, and 40 (num857, num415, and direct), have pairwise correlations above 0.80 with each other: 32 with 34, 32 with 40, and 34 with 40. num415 and num857 are likely numbers that repeat in the emails, and direct a word that repeats frequently.
Now, let's plot these columns against each other to check the correlation:
plot(spam[,32],spam[,34])
plot(spam[,32],spam[,40])
plot(spam[,34],spam[,40])
Plotting these variables against each other, we can see that num415, num857, and direct are very highly correlated. Based on the graphs, as num415 increases, num857 and direct increase at roughly the same rate, and the points fall almost on a straight line.
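To put a number on what the plots show, we can look at the pairwise correlations directly. This is just a quick sketch using the column indices identified above (output not shown here):
round(cor(spam[,c(32,34,40)]),2) # pairwise correlations among num857, num415, and direct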
So, using all of these variables in the model may not be the best idea. The basic PCA idea:
Now, let's figure out a combination of these variables that explains most of their variability. We create X and Y as new variables. In X, we multiply each of the variables of interest by 0.71 and add them together. In Y, we multiply them by the same number but take differences between them. In general, X is the addition of these three variables and Y is the subtraction.
X = 0.71 * training$num415 + 0.71 * training$num857 + 0.71 * training$direct
Y = ((0.71 * training$num415 - 0.71 * training$num857) - (0.71 * training$num857- 0.71 * training$direct) - (0.71 * training$direct - 0.71 * training$num415))
plot(X,Y)
We can see that most of the variability happens along the x-axis: there are lots of points spread out across the x-axis, but most of them are clustered right around 0 on the y-axis, so almost all of the points have a y value near 0. In other words, adding the variables together captures most of the information in them, while subtracting them captures much less. So we might want to use the sum of the variables as a single predictor; that reduces the number of predictors we have to use and removes some of the noise.
There are two related ways to state the general problem. The statistical version: find a new set of variables, based on the variables we have, that are uncorrelated and explain as much variability as possible. In terms of the previous plot, we are looking for the x variable, which has lots of variation in it, and not the y variable, which is almost always 0. The data-compression version: if we put the variables together in one matrix, create the best matrix with fewer variables, that is, lower rank if you are talking mathematically, that explains the original data. These two problems are very closely related; both ask whether we can use fewer variables to explain almost everything that is going on. The first goal is a statistical goal and the second goal is a data compression goal, but both are very useful for machine learning.
smallSpam<-spam[,c(32,34,40)]
prComp<-prcomp(smallSpam)
plot(prComp$x[,1],prComp$x[,2])
plot(prComp$x[,2],prComp$x[,3])
plot(prComp$x[,3],prComp$x[,1])
The first principal component looks like the one in which we added all the variables, with the values spread along the x-axis. The second and third principal components look a lot like differences (subtractions) of the variables.
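As a quick check of the "uncorrelated" goal described earlier, we can compute the correlations among the principal components themselves. This is just a sketch; the off-diagonal entries should come out at essentially zero:
round(cor(prComp$x),2) # the principal components are uncorrelated with each other by construction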
This begs the question: why would we do principal components, instead of just adding and subtracting variables by hand?
Well, principal components allow us to perform this operation even when we have many more variables. We may be able to reduce all of the variables down to a very small number of combinations, weighted sums and differences of the variables we have observed. So principal components let us take a large number of quantitative variables and reduce them quite a bit.
The other thing that we can look at in the PCA is the rotation matrix, which is basically how it’s summing up the variables to get each of the principal components.
prComp$rotation
## PC1 PC2 PC3
## num857 0.5742979 0.4058943 -0.71093720
## num415 0.5752832 0.4177584 0.70322625
## direct 0.5824356 -0.8128516 0.00641362
Looking at the results, principal component 1 is the one where we added all the variables (each with weight about 0.57). PC2 roughly contrasts direct against num857 and num415 (it subtracts direct from the sum of the other two), and PC3 is roughly num415 minus num857. So when we created the X and Y variables a moment ago, it would have made more sense to multiply the variables by 0.57 rather than 0.71. As the result suggests, PC1 (the weighted sum of all three variables) is the direction with the highest variability. So, if we have to make a decision, it is better to use PC1 instead of the raw num415, num857, and direct in our algorithm.
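To see how much of the variability each component actually captures, we can look at the variance of the components. A small sketch on the prcomp fit from above (prComp is still the 3-variable fit at this point):
summary(prComp) # standard deviation and proportion of variance for PC1, PC2, PC3
round(prComp$sdev^2/sum(prComp$sdev^2),3) # same information, as a proportion of the total variance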
Now let's use PCA on the whole spam data. We want to see whether more than two variables have very high correlation, so that we can reduce the number of variables used in the prediction model. We create a variable to color-code the points, i.e., purple for non-spam and blue for spam. The statement prComp<-prcomp(log10(spam[,-58]+1)) calculates the principal components on the entire data set. We apply a log10 transform to the data and add one to make the data look a little more Gaussian, because some of the variables look normal while others are skewed. We often have to do that for principal component analysis to give sensible results.
typeColor<-((spam$type=="spam")*7+6) # color code: non-spam emails get color 6, spam emails get color 13
prComp<-prcomp(log10(spam[,-58]+1)) # PCA on all predictors after the log10(x+1) transform to reduce skewness
plot(prComp$x[,1],prComp$x[,2],col=typeColor, xlab="PC1",ylab="PC2")
plot(prComp$x[,2],prComp$x[,3],col=typeColor, xlab="PC2",ylab="PC3")
plot(prComp$x[,3],prComp$x[,1],col=typeColor, xlab="PC3",ylab="PC1")
PC1 explains the most variation in the data, while PC2 and PC3 explain the second and third most. Each point in the plots represents a single observation. Purple represents the ham (non-spam) emails while blue represents the spam emails. Plot 1 shows PC1 on the x-axis and PC2 on the y-axis; plot 2 shows PC2 on the x-axis and PC3 on the y-axis; plot 3 shows PC3 on the x-axis and PC1 on the y-axis.
Looking at the first plot along the x-axis, that is, the PC1 direction, we can see a little bit of separation between the spam and ham data points. It looks like spam emails tend to have somewhat higher values than the non-spam emails. This is a way to reduce the size of our data set while still capturing a large amount of the variation, which is an important idea behind feature creation.
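To quantify how far we could compress the full predictor set, we can compute the cumulative proportion of variance explained by the leading components. A sketch using the prComp object fit on log10(spam[,-58]+1) above (varExplained is just a helper name for this sketch):
varExplained<-prComp$sdev^2/sum(prComp$sdev^2) # proportion of total variance per component
round(cumsum(varExplained)[1:10],3) # cumulative proportion explained by the first 10 components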
We can apply principal component analysis within the caret package as well, using PCA as a preprocessing step for the covariates. This is useful in exactly the situation described at the start of this section: many correlated quantitative variables that we want to summarize rather than include individually.
Let's use preProcess with PCA:
library(caret)
preProc<-preProcess(log10(spam[,-58]+1),
method="pca", pcaComp=3)
spamPC<-predict(preProc, log10(spam[,-58]+1))
plot(spamPC[,1],spamPC[,2],col=typeColor)
plot(spamPC[,2],spamPC[,3],col=typeColor)
plot(spamPC[,3],spamPC[,1],col=typeColor)
Again, we see a little bit of separation between ham and spam messages. All in all, these plots strengthen the idea we explored earlier, i.e., there is separation of values between the spam and ham messages.
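Instead of fixing pcaComp=3, we can let preProcess choose however many components are needed to capture a given fraction of the variance via its thresh argument. A sketch (preProc90 is just an illustrative name, and the number of components retained depends on the data):
preProc90<-preProcess(log10(spam[,-58]+1), method="pca", thresh=0.9) # keep enough PCs for ~90% of the variance
preProc90 # printing the object reports how many components were needed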
We can do the same thing on the training set using the preProcess function (as in the first line of the syntax below). As can be seen in the second line, we use the 'predict' function to compute the principal components for the entire training set. And, finally, we fit the model by passing the 'train' function the 'trainPC' data we created in the second line.
preProc<-preProcess(log10(training[,-58]+1),method="pca", pcaComp=3)
trainPC<-predict(preProc, log10(training[,-58]+1))
head(trainPC)
## PC1 PC2 PC3
## 1 1.4639313 -0.6081696 -0.5187293
## 2 2.6266138 -3.2030963 -0.1157695
## 3 3.1360478 -5.3240368 1.0654552
## 4 1.4347072 -0.8663120 -0.2698004
## 5 1.4356476 -0.8643261 -0.2704324
## 6 0.3829731 0.8648493 -0.1986835
This syntax fits the model. Here, we are fitting a model with the 'train' function that relates the outcome in the training set (training$type) to the principal components in 'trainPC'.
modelFit<-train(training$type ~ ., method="glm", data=trainPC)
After we conduct all these steps, we come to the point where we need to apply the same PCA to our test dataset.
This syntax provides us with the required statistics. The 'modelFit' object was fit on the training dataset, and we use it to predict on 'testPC', the principal components computed from the testing data.
testPC<-predict(preProc,log10(testing[,-58]+1)) # apply the training-set transform (log10 and PCA rotation) to the test predictors
confusionMatrix(testing$type, predict(modelFit,testPC))
Switching gears for a moment, let's also fit a couple of classification trees on these data with rpart.
library(rpart)
rp<-rpart(spam$type ~ .,
data=spam,
control=rpart.control(maxdepth=2)) # restrict the tree depth to 2 (a shallow tree)
rp
## n= 4601
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4601 1813 nonspam (0.60595523 0.39404477)
## 2) charDollar< 0.0555 3471 816 nonspam (0.76490925 0.23509075)
## 4) remove< 0.055 3141 516 nonspam (0.83572111 0.16427889) *
## 5) remove>=0.055 330 30 spam (0.09090909 0.90909091) *
## 3) charDollar>=0.0555 1130 133 spam (0.11769912 0.88230088)
## 6) hp>=0.4 70 7 nonspam (0.90000000 0.10000000) *
## 7) hp< 0.4 1060 70 spam (0.06603774 0.93396226) *
Looking at the output, we see that there is a total of 4601 emails, of which 2788 are non-spam and 1813 are spam. Now, let's visualize the tree:
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
rp1<-as.party(rp)
plot(rp1)
rpartFull<-rpart(training$type~., data=training) # a full (default-depth) tree on the training set
rpartFull
## n= 3451
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3451 1360 nonspam (0.60591133 0.39408867)
## 2) charExclamation< 0.0795 2025 330 nonspam (0.83703704 0.16296296)
## 4) remove< 0.045 1871 198 nonspam (0.89417424 0.10582576)
## 8) money< 0.01 1789 141 nonspam (0.92118502 0.07881498) *
## 9) money>=0.01 82 25 spam (0.30487805 0.69512195) *
## 5) remove>=0.045 154 22 spam (0.14285714 0.85714286) *
## 3) charExclamation>=0.0795 1426 396 spam (0.27769986 0.72230014)
## 6) capitalAve< 2.3715 511 213 nonspam (0.58317025 0.41682975)
## 12) free< 0.285 378 100 nonspam (0.73544974 0.26455026)
## 24) remove< 0.045 336 64 nonspam (0.80952381 0.19047619) *
## 25) remove>=0.045 42 6 spam (0.14285714 0.85714286) *
## 13) free>=0.285 133 20 spam (0.15037594 0.84962406) *
## 7) capitalAve>=2.3715 915 98 spam (0.10710383 0.89289617)
## 14) hp>=0.41 38 6 nonspam (0.84210526 0.15789474) *
## 15) hp< 0.41 877 66 spam (0.07525656 0.92474344) *
rp2<-as.party(rpartFull)
plot(rp2)
So the other thing we can do is not use the predict function separately; we can build the preprocessing right into our training step. If we take the train function from the caret package and pass it the training set, but tell it to preprocess with principal component analysis, it will do that preprocessing as part of the training process. And when we do the prediction on the new data set (the testing data set), it will calculate the PCs for us. The reason we went through the more elaborate way of calculating the PCs first and passing them to the model is so that we can see what is going on under the hood when we pass a command like the one below to the train function in the caret package.
modelFit<-train(training$type ~ ., method="glm", preProcess="pca", data=training)
confusionMatrix(testing$type, predict(modelFit, testing))
Here, we train modelFit on the training dataset (first line) using a generalized linear model, with the 'preProcess' argument set to principal component analysis on the training dataset. In the second line, we request the 'confusionMatrix' statistics on the 'testing' dataset, using the 'predict' function with the trained modelFit to generate predictions on the 'testing' data.
(Note: for some reason the program did not work though.)
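If that chunk fails, one possible culprit (just a guess) is the training$type-style formula combined with the curly quotes. A minimal sketch of the same idea with a plain formula and straight quotes, assuming the training and testing objects from the partition above:
modelFit<-train(type ~ ., method="glm", preProcess="pca", data=training) # PCA preprocessing done inside train
confusionMatrix(testing$type, predict(modelFit, testing)) # train applies the same PCA to the testing data automatically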
library(caret)
rpartpred<-predict(rpartFull, testing, type="class")
confusionMatrix(rpartpred, testing$type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 641 77
## spam 56 376
##
## Accuracy : 0.8843
## 95% CI : (0.8644, 0.9023)
## No Information Rate : 0.6061
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7558
##
## Mcnemar's Test P-Value : 0.08288
##
## Sensitivity : 0.9197
## Specificity : 0.8300
## Pos Pred Value : 0.8928
## Neg Pred Value : 0.8704
## Prevalence : 0.6061
## Detection Rate : 0.5574
## Detection Prevalence : 0.6243
## Balanced Accuracy : 0.8748
##
## 'Positive' Class : nonspam
##
Woohoo! It worked. We completed the predict call on the new data set. Note that rpartFull was fit on the training set, and we used it to make predictions on the test data set.
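As a quick sanity check of the reported accuracy, we can recompute it from the confusion matrix counts above:
(641+376)/(641+77+56+376) # correct predictions over all 1150 test emails; about 0.8843, matching the Accuracy line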