Machine Learning: Preprocessing and PCA

Continuing week 2 of “Practical Machine Learning” with Jeffrey Leek. Topics include principal component analysis (PCA), predicting with regression, and regression prediction with multiple covariates.

First we load the required packages, then create training and test sets from our data (the spam data set from the kernlab package).

## Loading required package: caret

## Loading required package: lattice

## Loading required package: ggplot2

## Loading required package: kernlab

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
## 
##     alpha
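The code for this setup step is not shown in the output above. A minimal sketch of what it might look like follows. The 75/25 split is an assumption (it is consistent with the 3451 training samples reported later), and the seed value is hypothetical.

```r
library(caret)    # createDataPartition(), preProcess(), train()
library(kernlab)  # provides the spam data set

data(spam)
set.seed(32343)   # hypothetical seed, only for reproducibility

# Assumption: a stratified 75/25 split on the outcome column `type`
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

dim(training)
dim(testing)
```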

Check the correlations between the predictors. Column 58 is the outcome (type), so it is dropped. The diagonal is set to zero so that each variable's perfect correlation with itself is ignored.

M<-abs(cor(training[,-58]))
diag(M)<-0
which(M >.85,arr.ind=TRUE)

##        row col
## num415  34  32
## num857  32  34

We see a high correlation between num415 and num857, so let’s take a look.

names(spam)[c(34,32)]

## [1] "num415" "num857"

# So the correlation is between the presence of the numbers 415 and 857. Phone number?
qplot(data=spam,x=num415,y=num857)

So we probably don’t need both variables in this model. How do we deal with that?

Goal:

Explain as much variance as possible with as few uncorrelated variables as possible.

SVD (Singular Value Decomposition)

Let X be a matrix with each variable in a column and each observation in a row.
Then the SVD is a matrix decomposition s.t. \(X=UDV^T\) where:
- The columns of U are orthogonal (left singular vectors)
- The columns of V are orthogonal (right singular vectors)
- And D is a diagonal matrix of singular values.
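As a quick check of these properties, here is a minimal sketch using base R's svd() on a small random matrix. The matrix and seed are made up for illustration.

```r
set.seed(1)  # arbitrary seed so the toy matrix is reproducible
X <- matrix(rnorm(20), nrow = 5, ncol = 4)  # 5 observations, 4 variables

s <- svd(X)      # list with d (singular values), u and v
U <- s$u
D <- diag(s$d)
V <- s$v

# Reconstruction error: X should equal U D V^T up to floating-point error
max(abs(X - U %*% D %*% t(V)))

# Orthogonality: t(U) %*% U and t(V) %*% V should both be identity matrices
round(crossprod(U), 10)
round(crossprod(V), 10)
```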

PCA (Principal Component Analysis)

The principal components are given by the right singular vectors (the columns of V) if you first normalize the variables (subtract the mean, divide by the sd).
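The link between PCA and the SVD can be checked numerically. The sketch below uses a made-up random matrix, normalizes it with scale(), and compares the output of svd() with prcomp(). Individual columns can differ in sign, so the comparison uses absolute values.

```r
set.seed(2)  # arbitrary seed for a reproducible toy matrix
X <- matrix(rnorm(200), nrow = 50, ncol = 4)
X[, 2] <- X[, 1] + 0.1 * X[, 2]   # make two columns strongly correlated

Z  <- scale(X)                    # subtract column means, divide by column sds
sv <- svd(Z)
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Loadings match the right singular vectors, up to the sign of each column
max(abs(abs(pc$rotation) - abs(sv$v)))

# Component standard deviations are the singular values over sqrt(n - 1)
max(abs(pc$sdev - sv$d / sqrt(nrow(X) - 1)))
```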

First, look at just the two related variables from above.

smallSpam<-spam[,c(34,32)]
prComp<-prcomp(smallSpam)
plot(x=prComp$x[,1],y=prComp$x[,2])

prComp$rotation

##              PC1        PC2
## num415 0.7080625  0.7061498
## num857 0.7061498 -0.7080625
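Each column of the rotation matrix holds the weights of one component. Here PC1 is roughly the (scaled) sum of the two variables and PC2 roughly their difference. The sketch below illustrates this on synthetic data: v1 and v2 are hypothetical stand-ins for two highly correlated variables, not the actual spam columns.

```r
set.seed(3)  # arbitrary seed
a <- rnorm(100)
toy <- data.frame(v1 = a + rnorm(100, sd = 0.1),  # hypothetical stand-ins for
                  v2 = a + rnorm(100, sd = 0.1))  # two correlated predictors

pc <- prcomp(toy)  # centers the columns by default, does not scale them
centered <- scale(toy, center = TRUE, scale = FALSE)

# The scores in pc$x are the centered data times the rotation matrix
manual <- centered %*% pc$rotation
max(abs(manual - pc$x))

# Share of the total variance carried by each component
pc$sdev^2 / sum(pc$sdev^2)
```

With two variables this strongly correlated, nearly all of the variance ends up in the first component, which is why one combined variable can stand in for both.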

Now look at the whole data set. We run a PCA on the spam data set and plot the first two principal components against each other. Note the log10(x + 1) transform, which is often applied before PCA to make skewed variables look more Gaussian; the +1 avoids taking the log of zero. Also note that, unlike above, each of PC1 and PC2 is a combination of many variables in the spam data set.

typeColor<-((spam$type=="spam")*1+1)
prComp<-prcomp(log10(spam[,-58]+1))
plot(x=prComp$x[,1],y=prComp$x[,2],col=typeColor,xlab="PC1",ylab="PC2")

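Next we’ll do (roughly) the same thing using the caret package. preProcess() fits the PCA on the log-transformed predictors (caret also centers and scales them), and predict() then applies that transformation to produce a new data frame holding the scores on the top 2 principal components.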

preProc<-preProcess(log10(spam[,-58]+1),method="pca",pcaComp=2)
spamPC<-predict(preProc,log10(spam[,-58]+1))
plot(spamPC[,1],spamPC[,2],col=typeColor)

Finally, we can fit a model based on the principal components.

preProc<-preProcess(log10(training[,-58]+1),method="pca",pcaComp=2)
trainPC<-predict(preProc,log10(training[,-58]+1))
modelFit<-train(x=trainPC, y=training$type, method="glm")
modelFit

## Generalized Linear Model 
## 
## 3451 samples
##    2 predictor
##    2 classes: 'nonspam', 'spam' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9003345  0.7892013

Using only two principal components, we were able to create a model with around 90% accuracy under bootstrap resampling of our training set. This is pretty good considering that a model fit without PCA came in just under 92%.
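The accuracy above comes from resampling the training set. A minimal sketch of checking the two-component model on held-out data follows. It assumes a 75/25 split (the original split code is not shown) and uses a hypothetical seed. The key point is that the PCA fitted on the training data is reused on the test data rather than refit.

```r
library(caret)
library(kernlab)

data(spam)
set.seed(32343)  # hypothetical seed, not taken from the original analysis

# Assumed 75/25 split; the original split code is not shown
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

# Fit the PCA on the training predictors only
preProc  <- preProcess(log10(training[, -58] + 1), method = "pca", pcaComp = 2)
trainPC  <- predict(preProc, log10(training[, -58] + 1))
modelFit <- train(x = trainPC, y = training$type, method = "glm")

# Reuse the training-set PCA on the test predictors; do not refit it
testPC <- predict(preProc, log10(testing[, -58] + 1))

# confusionMatrix() takes the predictions first, the true labels second
cm <- confusionMatrix(data = predict(modelFit, testPC), reference = testing$type)
cm$overall["Accuracy"]
```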

Now we do essentially the same thing in a friendlier way: do the preprocessing within the train function and eliminate the intermediate steps in the analysis. Notice that this model yields a higher accuracy than above. This is because we haven’t restricted it to two PCs; by default caret keeps as many components as are needed to capture 95% of the variance. (This call also skips the log10 transform.) We could restrict it to two by including the argument trControl=trainControl(preProcOptions=list(pcaComp=2)) in the train function.

modelFit<-train(type~.,method="glm",preProcess="pca",data=training)
modelFit

## Generalized Linear Model 
## 
## 3451 samples
##   57 predictor
##    2 classes: 'nonspam', 'spam' 
## 
## Pre-processing: principal component signal extraction (57), centered
##  (57), scaled (57) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9225658  0.8367994

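# Note: confusionMatrix() expects the predictions first and the true labels second.
# The arguments here are in the opposite order, so in the table below the rows labeled
# "Prediction" hold the true labels and the columns labeled "Reference" hold the model's
# predictions. Accuracy and Kappa are unaffected; sensitivity and specificity are not.
confusionMatrix(testing$type,predict(modelFit,testing))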

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction nonspam spam
##    nonspam     665   32
##    spam         51  402
##                                           
##                Accuracy : 0.9278          
##                  95% CI : (0.9113, 0.9421)
##     No Information Rate : 0.6226          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8477          
##  Mcnemar's Test P-Value : 0.04818         
##                                           
##             Sensitivity : 0.9288          
##             Specificity : 0.9263          
##          Pos Pred Value : 0.9541          
##          Neg Pred Value : 0.8874          
##              Prevalence : 0.6226          
##          Detection Rate : 0.5783          
##    Detection Prevalence : 0.6061          
##       Balanced Accuracy : 0.9275          
##                                           
##        'Positive' Class : nonspam         
##
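For comparison, here is a sketch of the two-component variant of the one-step model, passing pcaComp through trainControl() as described in the text. The 75/25 split and the seed are assumptions, and the name modelFit2 is introduced here only for illustration.

```r
library(caret)
library(kernlab)

data(spam)
set.seed(32343)  # hypothetical seed
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)  # assumed split
training <- spam[inTrain, ]

# Options for the PCA step are passed through trainControl();
# pcaComp = 2 asks for exactly two components instead of the variance threshold
ctrl <- trainControl(preProcOptions = list(pcaComp = 2))

modelFit2 <- train(type ~ ., data = training, method = "glm",
                   preProcess = "pca", trControl = ctrl)
modelFit2
```

Because this call works on the raw predictors rather than the log10-transformed ones, its resampled accuracy need not match the earlier two-component model exactly.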

DRWatkins

June 11, 2017
