Based on Jeff Leek's slides for the “Practical Machine Learning” course.
Often you have multiple quantitative variables, and sometimes they will be highly correlated with each other. In other words, they are nearly identical, essentially measuring the same underlying quantity. In that case it is not necessarily useful to include every variable in the model. You might instead include a summary that captures most of the information in those quantitative variables.
library(caret)
library(kernlab)
data(spam)
# split the data: 75% training, 25% testing
inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
# absolute correlations between all predictors (column 58 is the outcome, type)
M <- abs(cor(training[,-58]))
# zero out the diagonal: every variable is perfectly correlated with itself
diag(M) <- 0
# find pairs of variables with absolute correlation above 0.8
which(M > 0.8, arr.ind=TRUE)
## row col
## num415 34 32
## direct 40 32
## num857 32 34
## direct 40 34
## num857 32 40
## num415 34 40
names(spam)[c(34,32)]
## [1] "num415" "num857"
plot(spam[,34], spam[,32])
We could rotate the plot, replacing the two variables by their scaled sum and difference:
\[ X = 0.71 \times \mathrm{num415} + 0.71 \times \mathrm{num857} \]
\[ Y = 0.71 \times \mathrm{num415} - 0.71 \times \mathrm{num857} \]
X <- 0.71*training$num415 + 0.71*training$num857
Y <- 0.71*training$num415 - 0.71*training$num857
plot(X, Y)
Most of the variability lies along the X axis (the sum of the two variables), while the Y direction (their difference) is mostly noise.
You have multivariate variables \( X_1,\ldots,X_n \), where each \( X_1 = (X_{11},\ldots,X_{1m}) \). There are two related goals: find a new set of multivariate variables that are uncorrelated and explain as much variance as possible, and, putting all the variables together in one matrix, find the best matrix created with fewer variables (lower rank) that still explains the original data. The first goal is statistical and the second goal is data compression.
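As a rough illustration of the compression goal (a sketch, not from the original slides), you can rebuild an approximation of the predictor matrix from just the first k singular vectors:
# low-rank approximation of the (scaled, log-transformed) predictor matrix
Xmat <- scale(as.matrix(log10(spam[,-58] + 1)))
s <- svd(Xmat)
k <- 2
Xapprox <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
# reconstruction error shrinks as k grows
mean((Xmat - Xapprox)^2)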
SVD
If \( X \) is a matrix with each variable in a column and each observation in a row, then the SVD is a “matrix decomposition”
\[ X = UDV^T \]
where the columns of \( U \) are orthogonal (left singular vectors), the columns of \( V \) are orthogonal (right singular vectors), and \( D \) is a diagonal matrix (singular values).
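A quick numerical check of the decomposition (a sketch using the two correlated variables from above):
Xs <- scale(spam[, c(34, 32)])
svdObj <- svd(Xs)
# U D V^T reproduces the original matrix up to floating-point error
max(abs(Xs - svdObj$u %*% diag(svdObj$d) %*% t(svdObj$v)))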
PCA
The principal components are equal to the right singular vectors (\( V \)) if you first scale (subtract the mean, divide by the standard deviation) the variables.
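You can verify this relationship directly (a sketch; column signs may be flipped between the two):
Xs <- scale(spam[, c(34, 32)])
svd(Xs)$v            # right singular vectors
prcomp(Xs)$rotation  # PC loadings: same columns up to sign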
prcomp()
smallSpam <- spam[, c(34,32)]
prComp <- prcomp(smallSpam)
plot(prComp$x[,1], prComp$x[,2])
prComp$rotation
## PC1 PC2
## num415 0.7081 -0.7061
## num857 0.7061 0.7081
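To see how much of the variability each component captures, look at the standard deviations stored in the prcomp object:
summary(prComp)
# proportion of variance explained by each component
prComp$sdev^2 / sum(prComp$sdev^2)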
# color points by outcome: 1 (black) for nonspam, 2 (red) for spam
typeColor <- ((spam$type=="spam")*1 + 1)
prComp <- prcomp(log10(spam[,-58]+1))
plot(prComp$x[,1], prComp$x[,2], col=typeColor, xlab="PC1", ylab="PC2")
We applied a log10 transformation to the data, adding 1 first so that zero values do not produce minus infinity. This makes the data look a little more Gaussian, which helps because some of the variables are skewed while others look roughly normal. You often have to transform like this for principal component analysis to give sensible results.
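For example, comparing one skewed variable before and after the transform (a quick check, not in the original slides):
par(mfrow = c(1, 2))
hist(training$num415, main = "raw num415")
hist(log10(training$num415 + 1), main = "log10(num415 + 1)")
par(mfrow = c(1, 1))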
Here the same thing is done with caret's preProcess() function, applied to the full spam data set:
preProc <- preProcess(log10(spam[,-58]+1), method="pca", pcaComp=2)
# create new PC variables
spamPC <- predict(preProc, log10(spam[,-58]+1))
plot(spamPC[,1], spamPC[,2], col=typeColor)
# estimate the PCA rotation on the training set only
preProc <- preProcess(log10(training[,-58]+1), method="pca", pcaComp=2)
trainPC <- predict(preProc, log10(training[,-58]+1))
# fit the model using only the PC variables
modelFit <- train(training$type ~ ., method="glm", data=trainPC)
# apply the training-set rotation to the test set; do not re-estimate the PCs
testPC <- predict(preProc, log10(testing[,-58]+1))
confusionMatrix(testing$type, predict(modelFit, testPC))
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 653 44
## spam 63 390
##
## Accuracy : 0.907
## 95% CI : (0.889, 0.923)
## No Information Rate : 0.623
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.804
## Mcnemar's Test P-Value : 0.0818
##
## Sensitivity : 0.912
## Specificity : 0.899
## Pos Pred Value : 0.937
## Neg Pred Value : 0.861
## Prevalence : 0.623
## Detection Rate : 0.568
## Detection Prevalence : 0.606
## Balanced Accuracy : 0.905
##
## 'Positive' Class : nonspam
##
Alternatively, caret will do the PCA preprocessing inside train() itself if you pass preProcess="pca"; the same rotation is then applied automatically when predicting on new data.
modelFit <- train(training$type ~ ., method="glm", preProcess="pca", data=training)
confusionMatrix(testing$type, predict(modelFit, testing))
## Confusion Matrix and Statistics
##
## Reference
## Prediction nonspam spam
## nonspam 670 27
## spam 47 406
##
## Accuracy : 0.936
## 95% CI : (0.92, 0.949)
## No Information Rate : 0.623
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.864
## Mcnemar's Test P-Value : 0.0272
##
## Sensitivity : 0.934
## Specificity : 0.938
## Pos Pred Value : 0.961
## Neg Pred Value : 0.896
## Prevalence : 0.623
## Detection Rate : 0.583
## Detection Prevalence : 0.606
## Balanced Accuracy : 0.936
##
## 'Positive' Class : nonspam
##
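Instead of fixing the number of components with pcaComp, preProcess() can choose however many components are needed to capture a target share of the variance via the thresh argument (the default is 0.95). A sketch:
preProc <- preProcess(log10(training[,-58]+1), method="pca", thresh=0.8)
# printing the object reports how many components were retained
preProc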