Question 1

Load the Alzheimer’s disease data using the commands:

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version
## 3.5.1
data(AlzheimerDisease)

Which of the following commands will create non-overlapping training and test sets with about 50% of the observations assigned to each?
Answer:

adData = data.frame(diagnosis,predictors)
testIndex = createDataPartition(diagnosis, p = 0.50,list=FALSE)
training = adData[-testIndex,]
testing = adData[testIndex,]
# check that the two sets are about the same size
sapply(c(testing[1], training[1]), length)
## diagnosis diagnosis 
##       167       166
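
The size check above does not by itself show that the two sets are disjoint. A minimal base-R sketch of that check follows. It assumes a plain random half of the 333 row numbers as a stand-in for the indices from createDataPartition() (which stratifies by diagnosis), so the index values are hypothetical.

```r
# Hypothetical stand-in for createDataPartition(): a plain random half of
# the 333 row numbers (sample() does not stratify by the outcome).
set.seed(1)
n <- 333
testIndex <- sample(n, size = ceiling(n / 2))

# adData[-testIndex, ] keeps exactly the rows that are not in testIndex
trainIndex <- setdiff(seq_len(n), testIndex)

overlap <- intersect(trainIndex, testIndex)
length(overlap)                          # 0 when the sets are disjoint
length(trainIndex) + length(testIndex)   # n when every row is used once
```

Because the training set is defined by negative indexing on the same index vector, the two sets cannot share a row.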

Question 2

Load the cement data using the commands:

library(AppliedPredictiveModeling)
data(concrete)
library(caret)
set.seed(1000)
inTrain = createDataPartition(mixtures$CompressiveStrength, p = 3/4)[[1]]
training = mixtures[ inTrain,]
testing = mixtures[-inTrain,]

Make a plot of the outcome (CompressiveStrength) versus the index of the samples. Color by each of the variables in the data set (you may find the cut2() function in the Hmisc package useful for turning continuous covariates into factors). What do you notice in these plots?

library(Hmisc)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
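# bin the outcome into 5 quantile groups
training$cp <- cut2(training$CompressiveStrength, g = 5)
# scatterplot matrix of all variables, including the binned outcome
pairs(training)

# binned outcome versus sample index
qplot(seq_along(training$cp), training$cp, color = training$cp)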

Answer:
There is a non-random pattern in the plot of the outcome versus index that is not fully explained by any single predictor, which suggests that a variable may be missing.
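
One way to color the outcome-versus-index plot by each variable in turn is sketched below in base R. It runs on synthetic data rather than the concrete data, and quantile_bins() is a hypothetical helper that only roughly imitates cut2(x, g = g); its break points may differ.

```r
# Synthetic stand-in data (NOT the concrete data): one outcome, two predictors.
set.seed(1)
toy <- data.frame(outcome = rnorm(100), x1 = runif(100), x2 = rexp(100))

# Hypothetical helper: bin a continuous variable into g quantile groups.
quantile_bins <- function(x, g = 4) {
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = g + 1)))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

# One plot of outcome versus index per predictor, colored by that
# predictor's quantile bin.
for (v in setdiff(names(toy), "outcome")) {
  bins <- quantile_bins(toy[[v]])
  plot(seq_along(toy$outcome), toy$outcome, col = as.integer(bins), pch = 19,
       xlab = "index", ylab = "outcome", main = paste("colored by", v))
}
```

If one predictor explained the pattern along the index, its colors would line up with the steps in the plot.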

Question 3

Load the cement data using the commands:

library(AppliedPredictiveModeling)
data(concrete)
library(caret)
set.seed(1000)
inTrain = createDataPartition(mixtures$CompressiveStrength, p = 3/4)[[1]]
training = mixtures[ inTrain,]
testing = mixtures[-inTrain,]

Make a histogram and confirm the SuperPlasticizer variable is skewed. Normally you might use the log transform to try to make the data more symmetric. Why would that be a poor choice for this variable?

class(training$Superplasticizer)
## [1] "numeric"
hist(training$Superplasticizer)

sum(training$Superplasticizer == 0)
## [1] 288
min(training$Superplasticizer)
## [1] 0
hist(log(training$Superplasticizer + 1))

There are 288 zero values in training$Superplasticizer, so one cannot apply the log transform to it directly. One can instead apply the log transform to training$Superplasticizer + 1 (the minimum is 0, so the shifted values are all positive). But even the log(x + 1) transform does not help: the data are still highly asymmetric because of the large number of zero values.

Answer:
There are values of zero, so when you take the log() transform those values will be -Inf.
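
A small base-R illustration of the point, using made-up values rather than the concrete data:

```r
# Hypothetical values, including zeros like Superplasticizer has
x <- c(0, 0, 0, 0.5, 3, 11)

log(x)     # the zeros map to -Inf
log1p(x)   # log(1 + x): finite everywhere, the zeros map to 0

# The shift removes the -Inf values but not the spike: every zero still
# lands on one and the same transformed value.
sum(log1p(x) == 0)
```

log1p(x) is the base-R function for log(1 + x), the same transform as log(training$Superplasticizer + 1) above.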

Question 4

Load the Alzheimer’s disease data using the commands:

library(caret)
library(AppliedPredictiveModeling)
set.seed(3433);data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]] 
training = adData[ inTrain,]
testing = adData[-inTrain,]

Find all the predictor variables in the training set that begin with IL. Perform principal components on these variables with the preProcess() function from the caret package. Calculate the number of principal components needed to capture 80% of the variance. How many are there?

trainingIL <- training[,grep("^IL", names(training))]
preProc <- preProcess(trainingIL, method = 'pca', thresh = 0.8)
print(paste0('the number of principal components needed to capture 80% of the variance = ', preProc$numComp))
## [1] "the number of principal components needed to capture 80% of the variance = 7"

Answer:
7 principal components are needed to capture 80% of the variance.
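
For intuition, the count reported by preProcess() can be approximated by hand from the cumulative proportion of variance. The sketch below uses base-R prcomp() on synthetic data, not the IL predictors, and caret's internal rule may differ in detail (for example, in how it treats the threshold boundary).

```r
set.seed(1)
# Synthetic stand-in: 6 correlated columns built from 2 latent signals.
latent   <- matrix(rnorm(200 * 2), ncol = 2)
loadings <- matrix(runif(2 * 6, -1, 1), nrow = 2)
X <- latent %*% loadings + 0.3 * matrix(rnorm(200 * 6), ncol = 6)

# Center and scale before PCA
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the first k components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest k whose cumulative proportion reaches the 80% threshold
n_comp <- which(cum_var >= 0.8)[1]
n_comp
```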

Question 5

Load the Alzheimer’s disease data using the commands:

library(caret)
library(AppliedPredictiveModeling)
set.seed(3433)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]

Create a training data set consisting of only the predictors with variable names beginning with IL and the diagnosis. Build two predictive models, one using the predictors as they are and one using PCA with principal components explaining 80% of the variance in the predictors. Use method="glm" in the train function.

What is the accuracy of each method in the test set? Which is more accurate?

set.seed(3433)
# keep the IL predictors together with the outcome
trainingIL <- training[, grep("^IL|diagnosis", names(training))]
testingIL <- testing[, grep("^IL|diagnosis", names(testing))]
# non-PCA fit
fit <- train(diagnosis~., data=trainingIL, method="glm")
pred <- predict(fit, testingIL)
cm <- confusionMatrix(pred, testingIL$diagnosis)
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Impaired Control
##   Impaired        2       9
##   Control        20      51
##                                          
##                Accuracy : 0.6463         
##                  95% CI : (0.533, 0.7488)
##     No Information Rate : 0.7317         
##     P-Value [Acc > NIR] : 0.96637        
##                                          
##                   Kappa : -0.0702        
##  Mcnemar's Test P-Value : 0.06332        
##                                          
##             Sensitivity : 0.09091        
##             Specificity : 0.85000        
##          Pos Pred Value : 0.18182        
##          Neg Pred Value : 0.71831        
##              Prevalence : 0.26829        
##          Detection Rate : 0.02439        
##    Detection Prevalence : 0.13415        
##       Balanced Accuracy : 0.47045        
##                                          
##        'Positive' Class : Impaired       
## 
print(paste0("overall glm accuracy without PCA = ", cm$overall['Accuracy']))
## [1] "overall glm accuracy without PCA = 0.646341463414634"
#PCA fit
fitPC <- train(diagnosis~., method="glm", data=trainingIL, preProcess="pca", trControl = trainControl(preProcOptions = list(thresh = 0.8)))
predPC <- predict(fitPC, testingIL)
cmPCA <- confusionMatrix(predPC, testingIL$diagnosis)
cmPCA
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Impaired Control
##   Impaired        3       4
##   Control        19      56
##                                           
##                Accuracy : 0.7195          
##                  95% CI : (0.6094, 0.8132)
##     No Information Rate : 0.7317          
##     P-Value [Acc > NIR] : 0.651780        
##                                           
##                   Kappa : 0.0889          
##  Mcnemar's Test P-Value : 0.003509        
##                                           
##             Sensitivity : 0.13636         
##             Specificity : 0.93333         
##          Pos Pred Value : 0.42857         
##          Neg Pred Value : 0.74667         
##              Prevalence : 0.26829         
##          Detection Rate : 0.03659         
##    Detection Prevalence : 0.08537         
##       Balanced Accuracy : 0.53485         
##                                           
##        'Positive' Class : Impaired        
## 
print(paste0("overall glm accuracy with PCA = ", cmPCA$overall['Accuracy']))
## [1] "overall glm accuracy with PCA = 0.719512195121951"

Answer:
PCA accuracy: 0.72, non-PCA accuracy: 0.65. The PCA model is more accurate on the test set.
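
As a cross-check, both accuracies can be recomputed by hand from the counts in the two confusion matrices above. This is a base-R sketch; accuracy() is a helper defined here, not a caret function.

```r
# Counts copied from the confusion matrices above
# (rows = Prediction, columns = Reference; matrix() fills column by column).
labels <- c("Impaired", "Control")
cm_raw <- matrix(c(2, 20, 9, 51), nrow = 2,
                 dimnames = list(Prediction = labels, Reference = labels))
cm_pca <- matrix(c(3, 19, 4, 56), nrow = 2,
                 dimnames = list(Prediction = labels, Reference = labels))

# Accuracy = correct predictions (the diagonal) / all predictions
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

round(accuracy(cm_raw), 4)   # 53 / 82
round(accuracy(cm_pca), 4)   # 59 / 82

# No-information rate: the share of the larger reference class
nir <- max(colSums(cm_raw)) / sum(cm_raw)
round(nir, 4)                # 60 / 82
```

Both accuracies are below the no-information rate of 0.7317 reported in the output, so neither model beats always predicting Control on this test set.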