Based on Jeff Leek's slides for the “Practical Machine Learning” course.
Often you have multiple quantitative variables, and sometimes they will be highly correlated with each other. In other words, they are nearly identical, essentially measuring the same underlying quantity. In that case it is not necessarily useful to include every variable in the model. You might instead include a summary that captures most of the information in those quantitative variables.
library(caret)
library(kernlab)
data(spam)
# split the data: 75% training, 25% testing
inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# absolute correlations between all predictors (column 58 is the outcome, type)
M <- abs(cor(training[,-58]))
# zero out the diagonal: every variable is perfectly correlated with itself
diag(M) <- 0
# find pairs of variables with absolute correlation above 0.8
which(M > 0.8, arr.ind=TRUE)
## row col
## num415 34 32
## direct 40 32
## num857 32 34
## direct 40 34
## num857 32 40
## num415 34 40
names(spam)[c(34,32)]
## [1] "num415" "num857"
plot(spam[,34], spam[,32])
We could rotate the plot, replacing the two variables by their scaled sum and difference:
\[ X = 0.71 \times \mathrm{num415} + 0.71 \times \mathrm{num857} \]
\[ Y = 0.71 \times \mathrm{num415} - 0.71 \times \mathrm{num857} \]
X <- 0.71*training$num415 + 0.71*training$num857
Y <- 0.71*training$num415 - 0.71*training$num857
plot(X, Y)
Most of the variability lies along the X axis (the sum of the two variables), while the Y direction (their difference) is mostly noise.
You have multivariate variables \( X_1,\ldots,X_n \), where each \( X_1 = (X_{11},\ldots,X_{1m}) \). There are two related goals: find a new set of multivariate variables that are uncorrelated and explain as much variance as possible, and, putting all the variables together in one matrix, find the best matrix created with fewer variables (lower rank) that still explains the original data. The first goal is statistical and the second goal is data compression.
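As a rough illustration of the compression goal (a sketch, not from the original slides), you can rebuild an approximation of the predictor matrix from just the first k singular vectors:
# low-rank approximation of the (scaled, log-transformed) predictor matrix
Xmat <- scale(as.matrix(log10(spam[,-58] + 1)))
s <- svd(Xmat)
k <- 2
Xapprox <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
# reconstruction error shrinks as k grows
mean((Xmat - Xapprox)^2)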
SVD
If \( X \) is a matrix with each variable in a column and each observation in a row, then the SVD is a “matrix decomposition”
\[ X = UDV^T \]
where the columns of \( U \) are orthogonal (left singular vectors), the columns of \( V \) are orthogonal (right singular vectors), and \( D \) is a diagonal matrix (singular values).
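A quick numerical check of the decomposition (a sketch using the two correlated variables from above):
Xs <- scale(spam[, c(34, 32)])
svdObj <- svd(Xs)
# U D V^T reproduces the original matrix up to floating-point error
max(abs(Xs - svdObj$u %*% diag(svdObj$d) %*% t(svdObj$v)))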
PCA
The principal components are equal to the right singular vectors (\( V \)) if you first scale (subtract the mean, divide by the standard deviation) the variables.
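You can verify this relationship directly (a sketch; column signs may be flipped between the two):
Xs <- scale(spam[, c(34, 32)])
svd(Xs)$v            # right singular vectors
prcomp(Xs)$rotation  # PC loadings: same columns up to sign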
prcomp()
smallSpam <- spam[, c(34,32)]
prComp <- prcomp(smallSpam)
plot(prComp$x[,1], prComp$x[,2])
prComp$rotation
## PC1 PC2
## num415 0.7081 -0.7061
## num857 0.7061 0.7081
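To see how much of the variability each component captures, look at the standard deviations stored in the prcomp object:
summary(prComp)
# proportion of variance explained by each component
prComp$sdev^2 / sum(prComp$sdev^2)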
# color points by outcome: 1 (black) for nonspam, 2 (red) for spam
typeColor <- ((spam$type=="spam")*1 + 1)
prComp <- prcomp(log10(spam[,-58]+1))
plot(prComp$x[,1], prComp$x[,2], col=typeColor, xlab="PC1", ylab="PC2")
We applied a log10 transformation to the data, adding 1 first so that zero values do not produce minus infinity. This makes the data look a little more Gaussian, which helps because some of the variables are skewed while others look roughly normal. You often have to transform like this for principal component analysis to give sensible results.
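For example, comparing one skewed variable before and after the transform (a quick check, not in the original slides):
par(mfrow = c(1, 2))
hist(training$num415, main = "raw num415")
hist(log10(training$num415 + 1), main = "log10(num415 + 1)")
par(mfrow = c(1, 1))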
Here the same thing is done with caret's preProcess() function, applied to the full spam data set:
preProc <- preProcess(log10(spam[,-58]+1), method="pca", pcaComp=2)
# create new PC variables
spamPC <- predict(preProc, log10(spam[,-58]+1))
plot(spamPC[,1], spamPC[,2], col=typeColor)
# estimate the PCA rotation on the training set only
preProc <- preProcess(log10(training[,-58]+1), method="pca", pcaComp=2)
trainPC <- predict(preProc, log10(training[,-58]+1))
# fit the model using only the PC variables
modelFit <- train(training$type ~ ., method="glm", data=trainPC)
# apply the training-set rotation to the test set; do not re-estimate the PCs
testPC <- predict(preProc, log10(testing[,-58]+1))
confusionMatrix(testing$type, predict(modelFit, testPC))
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 653 44
## spam 63 390
##
## Accuracy : 0.907
## 95% CI : (0.889, 0.923)
## No Information Rate : 0.623
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.804
## Mcnemar's Test P-Value : 0.0818
##
## Sensitivity : 0.912
## Specificity : 0.899
## Pos Pred Value : 0.937
## Neg Pred Value : 0.861
## Prevalence : 0.623
## Detection Rate : 0.568
## Detection Prevalence : 0.606
## Balanced Accuracy : 0.905
##
## 'Positive' Class : nonspam
##
Alternatively, caret will do the PCA preprocessing inside train() itself if you pass preProcess="pca"; the same rotation is then applied automatically when predicting on new data.
modelFit <- train(training$type ~ ., method="glm", preProcess="pca", data=training)
confusionMatrix(testing$type, predict(modelFit, testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 670 27
## spam 47 406
##
## Accuracy : 0.936
## 95% CI : (0.92, 0.949)
## No Information Rate : 0.623
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.864
## Mcnemar's Test P-Value : 0.0272
##
## Sensitivity : 0.934
## Specificity : 0.938
## Pos Pred Value : 0.961
## Neg Pred Value : 0.896
## Prevalence : 0.623
## Detection Rate : 0.583
## Detection Prevalence : 0.606
## Balanced Accuracy : 0.936
##
## 'Positive' Class : nonspam
##
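Instead of fixing the number of components with pcaComp, preProcess() can choose however many components are needed to capture a target share of the variance via the thresh argument (the default is 0.95). A sketch:
preProc <- preProcess(log10(training[,-58]+1), method="pca", thresh=0.8)
# printing the object reports how many components were retained
preProc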