Continuing week 2 of “Practical Machine Learning” with Jeffrey Leek. Topics include principal component analysis (PCA), predicting with regression, and regression prediction with multiple covariates.
First we load required packages and libraries, then create training and test sets for our data (the spam data set from the kernlab package).
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
## Loading required package: kernlab
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
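The setup code itself is not shown above, only its console messages. A minimal sketch of what it could look like follows. The 75/25 split is an assumption, chosen because it is consistent with the 3451 training samples reported later, and the seed is arbitrary.

```r
library(caret)    # createDataPartition(), train(), preProcess()
library(kernlab)  # provides the spam data set
data(spam)

# Assumption: 75% of rows go to training, stratified on the outcome `type`.
# The seed value is arbitrary and only makes the split reproducible.
set.seed(32343)
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

dim(training)
dim(testing)
```

Column 58 of `spam` is the outcome `type`, which is why the later code drops it with `[,-58]` before computing correlations or principal components.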
Check the correlation between predictors. We take absolute values and set the diagonal to zero (every variable has correlation 1 with itself), then list the pairs with a correlation above 0.85.
M<-abs(cor(training[,-58]))
diag(M)<-0
which(M >.85,arr.ind=TRUE)
##        row col
## num415  34  32
## num857  32  34
We see a high correlation between num415 and num857, so let’s take a closer look.
names(spam)[c(34,32)]
## [1] "num415" "num857"
# So the correlation is between the presence of the numbers 415 and 857. Phone number?
qplot(data=spam,x=num415,y=num857)
So we probably don’t need both predictors in this model. How do we cut them down?
The idea behind PCA is to explain as much variance as possible with as few uncorrelated variables as possible.
First, look at just the two related variables from above.
smallSpam<-spam[,c(34,32)]
prComp<-prcomp(smallSpam)
plot(x=prComp$x[,1],y=prComp$x[,2])
prComp$rotation
##              PC1        PC2
## num415 0.7080625  0.7061498
## num857 0.7061498 -0.7080625
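The rotation matrix says that PC1 is roughly the sum of the two variables (both weights near 0.71) and PC2 is roughly their difference. A small sketch on made-up data shows the same pattern and how the scores in `prComp$x` relate to the rotation. The variables `a` and `b` are hypothetical stand-ins for two highly correlated predictors.

```r
set.seed(1)
a <- rnorm(200)
b <- a + rnorm(200, sd = 0.1)   # b is almost a copy of a
toy <- data.frame(a = a, b = b)

pc <- prcomp(toy)               # centers by default, does not scale
pc$rotation                     # weights of a and b in each component

# The scores are the centered data multiplied by the rotation matrix
centered <- scale(toy, center = TRUE, scale = FALSE)
manual   <- centered %*% pc$rotation
all.equal(unname(manual), unname(pc$x))

# Share of the total variance carried by each component
pc$sdev^2 / sum(pc$sdev^2)
```

When two variables are this strongly correlated, almost all of the variance ends up in the first component, which is why one combined variable can stand in for both.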
Now look at the whole data set. We run a PCA on the spam data set and plot the first two principal components against each other. Note the log10(x + 1) transform, which is often applied before PCA to make skewed variables look more Gaussian. Also note that, unlike above, each of PC1 and PC2 is a combination of many variables in the spam data set.
typeColor<-((spam$type=="spam")*1+1)
prComp<-prcomp(log10(spam[,-58]+1))
plot(x=prComp$x[,1],y=prComp$x[,2],col=typeColor,xlab="PC1",ylab="PC2")
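A plot of PC1 against PC2 does not say how much of the data those two components actually summarize. One way to check is the share of variance per component, computed from `sdev`. The sketch below uses made-up right-skewed data (the data frame `skewed` and the helper `var_share` are hypothetical) and compares the raw and log10-transformed versions.

```r
set.seed(2)
n <- 500

# Hypothetical count-like predictors with a long right tail;
# x1 and x2 share a common factor, x3 is independent.
base <- rexp(n, rate = 0.2)
skewed <- data.frame(
  x1 = base * rexp(n),
  x2 = base * rexp(n),
  x3 = rexp(n, rate = 0.5)
)

raw_pc <- prcomp(skewed)
log_pc <- prcomp(log10(skewed + 1))   # same transform as used on spam

# Proportion of total variance explained by each component
var_share <- function(p) p$sdev^2 / sum(p$sdev^2)

round(var_share(raw_pc), 3)
round(var_share(log_pc), 3)
cumsum(var_share(log_pc))             # cumulative share, ends at 1
```

The same two lines (`var_share(prComp)` and its `cumsum`) can be applied to the `prComp` object fitted on the spam predictors.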
Next we’ll do (roughly) the same thing using the caret package. We call preProcess on the spam predictors, then use predict to compute the scores of the top two principal components for each observation.
preProc<-preProcess(log10(spam[,-58]+1),method="pca",pcaComp=2)
spamPC<-predict(preProc,log10(spam[,-58]+1))
plot(spamPC[,1],spamPC[,2],col=typeColor)
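The “roughly” matters: caret’s PCA preprocessing centers and scales the predictors before rotating them, whereas `prcomp` only centers by default. A small sketch on the built-in `iris` data (used here only as a convenient numeric example, not part of the original analysis) shows the shape of what `predict` returns and the `prcomp` call that should come closest to it.

```r
library(caret)

num <- iris[, 1:4]                      # four numeric predictors

pp     <- preProcess(num, method = "pca", pcaComp = 2)
scores <- predict(pp, num)              # data frame with columns PC1, PC2
head(scores)

# Closest base-R equivalent: center AND scale before the rotation.
ref <- prcomp(num, center = TRUE, scale. = TRUE)

# Component signs are arbitrary, so compare absolute values.
all.equal(abs(unname(as.matrix(scores))), abs(unname(ref$x[, 1:2])))
```

Because the `preProcess` object stores the centering, scaling and rotation, the same `predict` call can later be applied to new data without refitting anything.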
Finally, we can fit a model based on principal components.
preProc<-preProcess(log10(training[,-58]+1),method="pca",pcaComp=2)
trainPC<-predict(preProc,log10(training[,-58]+1))
modelFit<-train(x=trainPC, y=training$type, method="glm")
modelFit
## Generalized Linear Model
##
## 3451 samples
## 2 predictor
## 2 classes: 'nonspam', 'spam'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9003345 0.7892013
Using just two principal components, we were able to create a model with around 90% accuracy, as estimated by bootstrap resampling of the training set. This is pretty good considering that without the PCA reduction we were just under 92%.
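The accuracy above comes from resampling the training set. To check the two-component model on held-out data, the test predictors have to go through the same log10 transform and the same `preProc` object that was fitted on the training set. A self-contained sketch follows; the split and seed are assumptions, as the original partition code is not shown.

```r
library(caret)
library(kernlab)
data(spam)

# Assumed 75/25 split with an arbitrary seed
set.seed(32343)
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

# Learn centering, scaling and rotation from the training predictors only
preProc  <- preProcess(log10(training[, -58] + 1), method = "pca", pcaComp = 2)
trainPC  <- predict(preProc, log10(training[, -58] + 1))
modelFit <- train(x = trainPC, y = training$type, method = "glm")

# Reuse the training rotation on the test predictors; do not refit it
testPC <- predict(preProc, log10(testing[, -58] + 1))

# Predictions go first, true labels second
cm <- confusionMatrix(predict(modelFit, testPC), testing$type)
cm$overall["Accuracy"]
```

Fitting a fresh PCA on the test set would produce components that do not line up with the ones the model was trained on, so the test scores must always come from the training `preProc`.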
Now we do essentially the same thing in a friendlier way: do the preprocessing within the train function and eliminate the intermediate steps in the analysis. Notice that this model yields a higher accuracy than above. This is because we haven’t restricted the model to two PCs; by default caret keeps as many components as are needed to capture 95% of the variance (this call also skips the log10 transform). We could restrict it to two by including the argument trControl = trainControl(preProcOptions = list(pcaComp = 2)) in the train function.
modelFit<-train(type~.,method="glm",preProcess="pca",data=training)
modelFit
## Generalized Linear Model
##
## 3451 samples
## 57 predictor
## 2 classes: 'nonspam', 'spam'
##
## Pre-processing: principal component signal extraction (57), centered
## (57), scaled (57)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9225658 0.8367994
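# Note: confusionMatrix() takes the predictions first and the reference second. The arguments below are reversed, so the rows labelled "Prediction" in the output are the true test labels. Accuracy and Kappa are unaffected.
confusionMatrix(testing$type,predict(modelFit,testing))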
## Confusion Matrix and Statistics
##
##            Reference
## Prediction nonspam spam
##    nonspam     665   32
##    spam         51  402
##
## Accuracy : 0.9278
## 95% CI : (0.9113, 0.9421)
## No Information Rate : 0.6226
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8477
## Mcnemar's Test P-Value : 0.04818
##
## Sensitivity : 0.9288
## Specificity : 0.9263
## Pos Pred Value : 0.9541
## Neg Pred Value : 0.8874
## Prevalence : 0.6226
## Detection Rate : 0.5783
## Detection Prevalence : 0.6061
## Balanced Accuracy : 0.9275
##
## 'Positive' Class : nonspam
##
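The `pcaComp` option mentioned earlier can be sketched as follows. This is a minimal example under the same assumed 75/25 split and arbitrary seed as before; `ctrl` and `fit2` are names introduced here for illustration.

```r
library(caret)
library(kernlab)
data(spam)

# Assumed 75/25 split with an arbitrary seed
set.seed(32343)
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

# Ask train() to keep exactly two principal components during preprocessing
ctrl <- trainControl(preProcOptions = list(pcaComp = 2))
fit2 <- train(type ~ ., data = training, method = "glm",
              preProcess = "pca", trControl = ctrl)

# The stored rotation matrix has one column per retained component
ncol(fit2$preProcess$rotation)
fit2$results$Accuracy
```

Unlike the manual two-component model, this version does not apply the log10 transform, so its resampled accuracy need not match the earlier figure exactly.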