Load the Alzheimer’s disease data using the commands:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(AppliedPredictiveModeling)
set.seed(3433)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]
Create a training data set consisting of only the predictors with variable names beginning with IL and the diagnosis. Build two predictive models, one using the predictors as they are and one using PCA with principal components explaining 80% of the variance in the predictors. Use method = "glm" in the train function.
What is the accuracy of each method in the test set? Which is more accurate?
We start by selecting the columns we want in the model, the IL predictors and the diagnosis, from both the training and the test set.
library(dplyr)
ILtraining <- select(training, starts_with("IL"), diagnosis)
ILtesting <- select(testing, starts_with("IL"), diagnosis)
We then create principal-component-preprocessed training and test sets from the selected data. To do so, we first fit a PCA preprocessing model on the training predictors, and then use that model to transform both sets.
We leave out the diagnosis, the variable we are predicting, when we preprocess. As a result, we need to add that column back onto the transformed training and test data afterwards.
PCA <- preProcess(select(ILtraining, -diagnosis), method = "pca", thresh = .8)
PCAtraining <- predict(PCA, select(ILtraining, -diagnosis))
PCAtesting <- predict(PCA, select(ILtesting, -diagnosis))
PCAtraining <- cbind(PCAtraining, diagnosis = ILtraining$diagnosis)
PCAtesting <- cbind(PCAtesting, diagnosis = ILtesting$diagnosis)
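To make the meaning of thresh = .8 concrete, here is a minimal base R sketch of the idea behind it: keep the smallest number of leading principal components whose cumulative share of variance reaches 80%. It uses prcomp on the built-in USArrests data purely as a stand-in, and the names pc, cum_var, n_comp and scores are our own. It illustrates the concept only and is not caret's internal code; caret may differ in details such as how it treats the exact cutoff.

```r
# Stand-in data: USArrests ships with base R (not the Alzheimer's data)
pc <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component, and the running total
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative variance reaches 80%
n_comp <- which(cum_var >= 0.8)[1]

# Scores on the retained components, analogous to predict(PCA, ...) above
scores <- pc$x[, seq_len(n_comp), drop = FALSE]

print(round(cum_var, 3))
print(n_comp)
```

The columns of scores play the same role as the PC columns in PCAtraining: they replace the original predictors as inputs to the model.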
We fit a generalized linear model to each training set, one with and one without the PCA preprocessing, and use each model to predict the diagnoses in the corresponding test set.
We then compare the predictions with the true test-set diagnoses using confusion matrices.
PCAmodel <- train(diagnosis ~ ., data = PCAtraining, method = "glm")
FULLmodel <- train(diagnosis ~ ., data = ILtraining, method = "glm")
PCAresult <- predict(PCAmodel, select(PCAtesting, -diagnosis))
FULLresult <- predict(FULLmodel, select(ILtesting, -diagnosis))
confusionMatrix(PCAresult, PCAtesting$diagnosis)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction Impaired Control
##   Impaired        3       4
##   Control        19      56
##
## Accuracy : 0.7195
## 95% CI : (0.6094, 0.8132)
## No Information Rate : 0.7317
## P-Value [Acc > NIR] : 0.651780
##
## Kappa : 0.0889
## Mcnemar's Test P-Value : 0.003509
##
## Sensitivity : 0.13636
## Specificity : 0.93333
## Pos Pred Value : 0.42857
## Neg Pred Value : 0.74667
## Prevalence : 0.26829
## Detection Rate : 0.03659
## Detection Prevalence : 0.08537
## Balanced Accuracy : 0.53485
##
## 'Positive' Class : Impaired
##
confusionMatrix(FULLresult, ILtesting$diagnosis)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction Impaired Control
##   Impaired        2       9
##   Control        20      51
##
## Accuracy : 0.6463
## 95% CI : (0.533, 0.7488)
## No Information Rate : 0.7317
## P-Value [Acc > NIR] : 0.96637
##
## Kappa : -0.0702
## Mcnemar's Test P-Value : 0.06332
##
## Sensitivity : 0.09091
## Specificity : 0.85000
## Pos Pred Value : 0.18182
## Neg Pred Value : 0.71831
## Prevalence : 0.26829
## Detection Rate : 0.02439
## Detection Prevalence : 0.13415
## Balanced Accuracy : 0.47045
##
## 'Positive' Class : Impaired
##
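As an aside, the manual preProcess, predict and cbind steps can also be folded into train itself. The sketch below assumes the caret and AppliedPredictiveModeling packages are installed, repeats the data setup so that it runs on its own, and uses base R column selection instead of dplyr. The names keep, ctrl, PCAmodel2, PCAresult2 and cm are ours. We have not verified that it reproduces the figures above exactly, so treat it as an alternative formulation rather than the method used here.

```r
library(caret)
library(AppliedPredictiveModeling)

# Same data setup as at the top of this write-up
set.seed(3433)
data(AlzheimerDisease)
adData <- data.frame(diagnosis, predictors)
inTrain <- createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training <- adData[inTrain, ]
testing <- adData[-inTrain, ]

# Keep the IL predictors plus the outcome
keep <- c(grep("^IL", names(training), value = TRUE), "diagnosis")
ILtraining <- training[, keep]
ILtesting <- testing[, keep]

# Ask train() to run the PCA; the 80% threshold goes through preProcOptions
ctrl <- trainControl(preProcOptions = list(thresh = 0.8))
PCAmodel2 <- train(diagnosis ~ ., data = ILtraining, method = "glm",
                   preProcess = "pca", trControl = ctrl)

# predict() applies the stored preprocessing to the new data automatically
PCAresult2 <- predict(PCAmodel2, newdata = ILtesting)
cm <- confusionMatrix(PCAresult2, ILtesting$diagnosis)
print(cm$overall["Accuracy"])
```

With this form the PCA is re-estimated inside each resample that train draws, and there is no need to remove and re-attach the diagnosis column by hand.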
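The model built on the PCA-preprocessed data reaches an accuracy of 71.95% on the test set, while the model built on the unprocessed IL predictors reaches 64.63%, so the PCA model is the more accurate of the two. Both figures fall below the no-information rate of 73.17%, which means neither model beats always predicting Control.
One possible explanation for the gap is that the full set of IL predictors carries more noise, and that keeping only the leading principal components filters some of it out.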
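The accuracy figures can be checked by hand from the confusion matrix counts. The sketch below re-enters the counts printed above (rows are predictions, columns are the reference) and recomputes the accuracy and the no-information rate in base R. The helper names accuracy and nir are ours, not caret functions.

```r
# Counts copied from the two confusion matrices above, filled column by column
labels <- list(Prediction = c("Impaired", "Control"),
               Reference  = c("Impaired", "Control"))
pca_cm  <- matrix(c(3, 19, 4, 56), nrow = 2, dimnames = labels)
full_cm <- matrix(c(2, 20, 9, 51), nrow = 2, dimnames = labels)

# Accuracy: correctly classified cases (the diagonal) over all cases
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# No-information rate: share of the most common reference class
nir <- function(cm) max(colSums(cm)) / sum(cm)

round(accuracy(pca_cm), 4)   # 0.7195
round(accuracy(full_cm), 4)  # 0.6463
round(nir(pca_cm), 4)        # 0.7317
```

Both matrices cover the same 82 test cases, of which 60 are Control, which is where the shared no-information rate of 60/82 comes from.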