Sec.16 - PREPROCESSING WITH PRINCIPAL COMPONENTS ANALYSIS (PCA)

Based on Jeff Leek's slides for the “Practical Machine Learning” course.

Correlated predictors

Often you have multiple quantitative variables, and sometimes they are highly correlated with each other.
In other words, they behave almost like copies of the exact same variable. In that case it is not necessarily useful to include every variable in the model; you might instead want to include a summary that captures most of the information in those quantitative variables.

library(caret) 
library(kernlab) 
data(spam)
inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]

M <- abs(cor(training[,-58]))
# zero out the diagonal (each variable's correlation with itself)
diag(M) <- 0

# find pairs of variables with high correlation
which(M > 0.8, arr.ind=TRUE)
##        row col
## num415  34  32
## direct  40  32
## num857  32  34
## direct  40  34
## num857  32  40
## num415  34  40
names(spam)[c(34,32)]
## [1] "num415" "num857"
plot(spam[,34], spam[,32])
[Figure: scatterplot of num415 against num857]

Basic PCA idea

We could rotate the plot by defining two new variables:

\[ X = 0.71 \times {\rm num415} + 0.71 \times {\rm num857} \]

\[ Y = 0.71 \times {\rm num415} - 0.71 \times {\rm num857} \]

X <- 0.71*training$num415 + 0.71*training$num857
Y <- 0.71*training$num415 - 0.71*training$num857
plot(X, Y)
[Figure: scatterplot of the rotated variables X and Y]
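Most of the variability now lies along the X (sum) axis, while Y (the difference) varies much less. A quick check of this, reusing the X and Y just defined:

# num415 and num857 are highly correlated, so almost all of the
# spread should lie along the sum direction X
sd(X)
sd(Y)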

Related problems

You have multivariate variables \( X_1,\ldots, X_n \), so \( X_1 = (X_{11},\ldots, X_{1m}) \).

- Find a new set of multivariate variables that are uncorrelated and explain as much variance as possible.
- If you put all the variables together in one matrix, find the best matrix created with fewer variables (lower rank) that explains the original data.

The first goal is statistical and the second goal is data compression.
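As a hedged sketch of the compression goal, you can keep only the first k singular values of an SVD and rebuild a lower-rank approximation of the data matrix (k and the variable names below are illustrative choices):

# best rank-k approximation of the scaled training predictors
X <- scale(as.matrix(training[,-58]))
s <- svd(X)
k <- 2  # illustrative number of components to keep
Xk <- s$u[,1:k] %*% diag(s$d[1:k]) %*% t(s$v[,1:k])
dim(Xk)  # same shape as X, but built from far fewer numbers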


Related solutions - PCA/SVD

SVD

If \( X \) is a matrix with each variable in a column and each observation in a row then the SVD is a “matrix decomposition”

\[ X = UDV^T \]

where the columns of \( U \) are orthogonal (left singular vectors), the columns of \( V \) are orthogonal (right singular vectors) and \( D \) is a diagonal matrix (whose diagonal entries are the singular values).
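A minimal sketch in base R, verifying on a small random matrix that the three factors multiply back to the original:

X <- matrix(rnorm(20), nrow=5)  # toy 5 x 4 data matrix
s <- svd(X)
# difference should be ~0 up to floating-point rounding
max(abs(X - s$u %*% diag(s$d) %*% t(s$v)))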

PCA

The principal components are equal to the right singular vectors if you first scale (subtract the mean, divide by the standard deviation) the variables.
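A hedged check of this equivalence on the two correlated spam variables used below:

Xs <- scale(spam[, c(34,32)])   # subtract the mean, divide by the sd
svd(Xs)$v                       # right singular vectors
prcomp(Xs)$rotation             # the same, up to the sign of each column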


Principal components in R - prcomp()

smallSpam <- spam[, c(34,32)]
prComp <- prcomp(smallSpam)
plot(prComp$x[,1], prComp$x[,2])
[Figure: scatterplot of the first two principal component scores]

The rotation

prComp$rotation
##           PC1     PC2
## num415 0.7081 -0.7061
## num857 0.7061  0.7081
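Note these loadings are essentially the 0.71 weights from the hand-rolled rotation above. Each column of the rotation matrix defines one principal component as a linear combination of the centered variables, so the scores can be recomputed by hand (a sketch, reusing smallSpam and prComp):

centered <- scale(smallSpam, center=TRUE, scale=FALSE)
head(centered %*% prComp$rotation)   # should match head(prComp$x)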

PCA on SPAM data

typeColor <- ((spam$type=="spam")*1 + 1)   # 1 = black (nonspam), 2 = red (spam)
prComp <- prcomp(log10(spam[,-58]+1))

plot(prComp$x[,1], prComp$x[,2], col=typeColor, xlab="PC1", ylab="PC2")
[Figure: PC1 vs PC2 for the spam data, colored by spam/nonspam]

We applied a transformation to the data: adding 1 and then taking log10. (The +1 is needed because many of the counts are 0, and log10(0) is undefined.) This makes the data look a little more Gaussian, which helps because some of the variables are skewed while others look roughly normal.

You often have to transform the data like this for principal component analysis to give sensible results.
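A quick, hedged way to see the effect on a single skewed variable (the column choice here is illustrative):

par(mfrow=c(1,2))
hist(spam$num415, main="raw")                       # strongly right-skewed
hist(log10(spam$num415 + 1), main="log10(x + 1)")   # much less skewed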


PCA with caret

Here applied to the full spam data set.

preProc <- preProcess(log10(spam[,-58]+1), method="pca", pcaComp=2)

# create new PC variables
spamPC <- predict(preProc, log10(spam[,-58]+1))
plot(spamPC[,1], spamPC[,2], col=typeColor)
[Figure: scatterplot of the first two caret PC variables, colored by spam/nonspam]

Preprocessing with PCA: training data set

preProc <- preProcess(log10(training[,-58]+1), method="pca", pcaComp=2)
trainPC <- predict(preProc, log10(training[,-58]+1))

# train only on the PC variables; attach the outcome so the
# formula interface can find it in trainPC
trainPC$type <- training$type
modelFit <- train(type ~ ., method="glm", data=trainPC)

Adding new “PC variables” to the test data set

testPC <- predict(preProc, log10(testing[,-58]+1))
confusionMatrix(testing$type, predict(modelFit, testPC))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     653   44
##    spam         63  390
##                                         
##                Accuracy : 0.907         
##                  95% CI : (0.889, 0.923)
##     No Information Rate : 0.623         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.804         
##  Mcnemar's Test P-Value : 0.0818        
##                                         
##             Sensitivity : 0.912         
##             Specificity : 0.899         
##          Pos Pred Value : 0.937         
##          Neg Pred Value : 0.861         
##              Prevalence : 0.623         
##          Detection Rate : 0.568         
##    Detection Prevalence : 0.606         
##       Balanced Accuracy : 0.905         
##                                         
##        'Positive' Class : nonspam       
## 

Alternative (caret sets the number of PCs)

modelFit <- train(type ~ ., method="glm", preProcess="pca", data=training)
confusionMatrix(testing$type, predict(modelFit, testing))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     670   27
##    spam         47  406
##                                        
##                Accuracy : 0.936        
##                  95% CI : (0.92, 0.949)
##     No Information Rate : 0.623        
##     P-Value [Acc > NIR] : <2e-16       
##                                        
##                   Kappa : 0.864        
##  Mcnemar's Test P-Value : 0.0272       
##                                        
##             Sensitivity : 0.934        
##             Specificity : 0.938        
##          Pos Pred Value : 0.961        
##          Neg Pred Value : 0.896        
##              Prevalence : 0.623        
##          Detection Rate : 0.583        
##    Detection Prevalence : 0.606        
##       Balanced Accuracy : 0.936        
##                                        
##        'Positive' Class : nonspam      
## 
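By default this keeps enough components to capture 95% of the variance. To control the number of components in this workflow, one option (a sketch; see ?trainControl) is to pass preProcOptions through trainControl:

# e.g. force exactly 2 components, as in the manual example above
ctrl <- trainControl(preProcOptions=list(pcaComp=2))
modelFit <- train(type ~ ., method="glm", preProcess="pca",
                  data=training, trControl=ctrl)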

Final thoughts on PCs

- Most useful for linear-type models.
- Can make it harder to interpret the predictors.
- Watch out for outliers: transform the variables first (with logs / Box-Cox) and plot the predictors to identify problems.