Question 1

Load the Alzheimer’s disease data using the commands:

library(AppliedPredictiveModeling, verbose=FALSE); library(caret, verbose=FALSE); library(gridExtra, verbose=FALSE)
## Loading required package: lattice
## Loading required package: ggplot2
data(AlzheimerDisease)

Which of the following commands will create non-overlapping training and test sets with about 50% of the observations assigned to each?

Answer is:

adData <- data.frame(diagnosis, predictors)
trainIndex <- createDataPartition(diagnosis, p=0.50, list=FALSE)
training <- adData[trainIndex, ]; testing <- adData[-trainIndex, ]

An equivalent variant assigns the sampled indices to the test set instead; the split is still non-overlapping with about 50% of observations on each side:

adData <- data.frame(diagnosis, predictors)
testIndex <- createDataPartition(diagnosis, p=0.50, list=FALSE)
training <- adData[-testIndex, ]; testing <- adData[testIndex, ]
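A quick sanity check (not part of the quiz) that either split is non-overlapping and roughly even:

length(intersect(rownames(training), rownames(testing))) # 0: no shared rows
c(nrow(training), nrow(testing)) / nrow(adData)          # both close to 0.5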

Question 2

Load the cement data using the commands:

library(AppliedPredictiveModeling, verbose=FALSE)
data(concrete)
library(caret, verbose=FALSE)
set.seed(1000)
inTrain = createDataPartition(mixtures$CompressiveStrength, p = 3/4)[[1]]
training = mixtures[ inTrain,]
testing = mixtures[-inTrain,]

Make a plot of the outcome (CompressiveStrength) versus the index of the samples. Color by each of the variables in the data set (you may find the cut2() function in the Hmisc package useful for turning continuous covariates into factors). What do you notice in these plots?

Answer is: There is a non-random pattern in the plot of the outcome versus index that does not appear to be perfectly explained by any predictor, suggesting a variable may be missing.

library(Hmisc, verbose=FALSE)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
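Since the question points at cut2(), here is a minimal illustration on toy values (not the quiz data): it cuts a continuous vector into a factor with g quantile groups.

cut2(c(1, 3, 7, 20, 50, 90), g=3) # factor with 3 roughly equal-sized levels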
# Outcome versus sample index, coloured by quartiles of each predictor
# (Cement included so that every variable in the data set is covered)
idx <- seq_along(training$CompressiveStrength)

p1 <- qplot(idx, training$CompressiveStrength, colour=cut2(training$Cement, g=4))

p2 <- qplot(idx, training$CompressiveStrength, colour=cut2(training$BlastFurnaceSlag, g=4))

p3 <- qplot(idx, training$CompressiveStrength, colour=cut2(training$FlyAsh, g=4))

p4 <- qplot(idx, training$CompressiveStrength, colour=cut2(training$Water, g=4))

p5 <- qplot(idx, training$CompressiveStrength, colour=cut2(training$Superplasticizer, g=4))

p6 <- qplot(idx, training$CompressiveStrength, colour=cut2(training$CoarseAggregate, g=4))

p7 <- qplot(idx, training$CompressiveStrength, colour=cut2(training$FineAggregate, g=4))

p8 <- qplot(idx, training$CompressiveStrength, colour=cut2(training$Age, g=4))

grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, nrow=4)

There is a clear step-like, non-random pattern in the outcome plotted against the index. The Age variable tracks part of it, but no single predictor explains it perfectly, which is what suggests a variable is missing.
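One rough, optional way to confirm that no single predictor tracks the index is to correlate each predictor with the row number:

idx <- seq_along(training$CompressiveStrength)
sort(sapply(training[, names(training) != "CompressiveStrength"], function(v) cor(idx, v)))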

Question 3

Load the cement data using the commands:

library(AppliedPredictiveModeling)
data(concrete)
library(caret)
set.seed(1000)
inTrain = createDataPartition(mixtures$CompressiveStrength, p = 3/4)[[1]]
training = mixtures[ inTrain,]
testing = mixtures[-inTrain,]

Make a histogram and confirm the SuperPlasticizer variable is skewed. Normally you might use the log transform to try to make the data more symmetric. Why would that be a poor choice for this variable?

qplot(training$Superplasticizer, geom="histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
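The skew is also visible in the numeric summary, a quick optional check:

summary(training$Superplasticizer)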

The histogram shows a large spike of the data at zero. A log transform does not accept negative values, but there are none in this dataset; the real problem is that the log of zero is not finite:

log(0)
## [1] -Inf
# no negative values
min(training$Superplasticizer)
## [1] 0

If we histogram the log transform we don't get an error, but the result looks suspiciously like a histogram of the log-transformed data with the zeroes removed:

a <- qplot(log(training$Superplasticizer), geom="histogram")
b <- qplot(log(subset(training, Superplasticizer > 0)$Superplasticizer), geom="histogram")
grid.arrange(a, b, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 288 rows containing non-finite values (stat_bin).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

So R is simply dropping the -Inf values, and clearly this is not a good transformation, because it merely discards the zeroes. We could instead take the log of Superplasticizer + n for some constant n > 0, so the zeroes map to log(n) rather than -Inf. However, no monotone transform can separate identical values: all the tied zeroes map to a single point, so the spike, and the asymmetry, remain:

qplot(log(training$Superplasticizer+2), geom="histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Answer is: There are a large number of values that are the same, and even if you took log(SuperPlasticizer + 1) they would still all be identical, so the distribution would not be symmetric.
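As a last check against the training set above, the tied zeroes remain a single identical value under a shifted log:

z <- training$Superplasticizer
sum(z == 0)                         # how many observations are tied at exactly zero
length(unique(log(z + 1)[z == 0]))  # 1: the ties survive the transform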

Question 4

Load the Alzheimer’s disease data using the commands:

library(caret)
library(AppliedPredictiveModeling)
set.seed(3433)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]

Find all the predictor variables in the training set that begin with IL. Perform principal components on these variables with the preProcess() function from the caret package. Calculate the number of principal components needed to capture 90% of the variance. How many are there?

# All the predictor variables in the training set that begin with IL
training2 <- training[colnames(training)[grepl("^IL", colnames(training))]]
training2$diagnosis <- training$diagnosis

# PreProcess keeping 90% of variance; preProcess() ignores the factor
# outcome (diagnosis), so the whole data frame can be passed
preProcessParams <- preProcess(training2, method = c("center", "scale", "pca"), thresh = 0.90)
print(preProcessParams)
## Created from 251 samples and 13 variables
## 
## Pre-processing:
##   - centered (12)
##   - ignored (1)
##   - principal component signal extraction (12)
##   - scaled (12)
## 
## PCA needed 9 components to capture 90 percent of the variance

Answer is: 9 principal components.
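As an optional cross-check, the same count can be read off the cumulative variance of prcomp() run on the scaled IL predictors:

ilCols <- grep("^IL", names(training), value=TRUE)
pca <- prcomp(training[, ilCols], center=TRUE, scale.=TRUE)
which(cumsum(pca$sdev^2) / sum(pca$sdev^2) >= 0.90)[1] # should agree with preProcess() above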

Question 5

Load the Alzheimer’s disease data using the commands:

# Load necessary libraries
library(caret, verbose=FALSE); library("e1071", verbose=FALSE); library(AppliedPredictiveModeling, verbose=FALSE)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
## 
##     impute
# Load Data and create training/testing sets
set.seed(3433); data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4, list=FALSE)
training = adData[ inTrain,]
testing = adData[-inTrain,]

Create a training data set consisting of only the predictors with variable names beginning with IL and the diagnosis. Build two predictive models, one using the predictors as they are and one using PCA with principal components explaining 80% of the variance in the predictors. Use method=“glm” in the train function.

# Dataset consisting only of predictors with variable names beginning with IL and the diagnosis
training2 <- training[colnames(training)[grepl("^IL", colnames(training))]]
training2$diagnosis <- training$diagnosis

testing2 <- testing[colnames(testing)[grepl("^IL", colnames(testing))]]
testing2$diagnosis <- testing$diagnosis

# Predictive Model 1 (non-PCA)
model1Fit <- train(diagnosis ~ ., data=training2, method="glm")
predictions1 <- predict(model1Fit, newdata = testing2)
confusionMatrix(predictions1, testing2$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Impaired Control
##   Impaired        2       9
##   Control        20      51
##                                          
##                Accuracy : 0.6463         
##                  95% CI : (0.533, 0.7488)
##     No Information Rate : 0.7317         
##     P-Value [Acc > NIR] : 0.96637        
##                                          
##                   Kappa : -0.0702        
##  Mcnemar's Test P-Value : 0.06332        
##                                          
##             Sensitivity : 0.09091        
##             Specificity : 0.85000        
##          Pos Pred Value : 0.18182        
##          Neg Pred Value : 0.71831        
##              Prevalence : 0.26829        
##          Detection Rate : 0.02439        
##    Detection Prevalence : 0.13415        
##       Balanced Accuracy : 0.47045        
##                                          
##        'Positive' Class : Impaired       
## 
# Predictive Model 2 (PCA); again preProcess() ignores the factor outcome
preProcessParams <- preProcess(training2, method = "pca", thresh = 0.80) # preProc params from training only
trainPC <- predict(preProcessParams, training2) # apply to training (diagnosis passes through unchanged)
testPC <- predict(preProcessParams, testing2) # apply the training preProc params to testing
model2Fit <- train(diagnosis~., method="glm", data=trainPC) # fit your model
predictions2 <- predict(model2Fit, newdata = testPC) # Apply model to your testing data
confusionMatrix(predictions2, testing2$diagnosis) # Get stats of final model
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Impaired Control
##   Impaired        3       4
##   Control        19      56
##                                           
##                Accuracy : 0.7195          
##                  95% CI : (0.6094, 0.8132)
##     No Information Rate : 0.7317          
##     P-Value [Acc > NIR] : 0.651780        
##                                           
##                   Kappa : 0.0889          
##  Mcnemar's Test P-Value : 0.003509        
##                                           
##             Sensitivity : 0.13636         
##             Specificity : 0.93333         
##          Pos Pred Value : 0.42857         
##          Neg Pred Value : 0.74667         
##              Prevalence : 0.26829         
##          Detection Rate : 0.03659         
##    Detection Prevalence : 0.08537         
##       Balanced Accuracy : 0.53485         
##                                           
##        'Positive' Class : Impaired        
## 

Answer is: Non-PCA accuracy is 0.65; PCA accuracy is 0.72.
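The two accuracies can also be pulled straight out of the confusionMatrix objects rather than read off the printed tables:

cm1 <- confusionMatrix(predictions1, testing2$diagnosis)
cm2 <- confusionMatrix(predictions2, testing2$diagnosis)
round(c(nonPCA=cm1$overall["Accuracy"], PCA=cm2$overall["Accuracy"]), 2)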