Quiz 2

  1. Load the Alzheimer’s disease data using the commands:
library(AppliedPredictiveModeling)
suppressMessages(library(caret))
data(AlzheimerDisease)

Which of the following commands will create non-overlapping training and test sets with about 50% of the observations assigned to each?

adData <- data.frame(diagnosis,predictors)
trainIndex <- createDataPartition(diagnosis, p = 0.50,list=FALSE)
training <- adData[trainIndex,]
testing <- adData[-trainIndex,]

Please remind that p = 0.75 to 0.8 is prefered in the field.

  1. Load the cement data using the commands: (I would omit the library command from now)
data(concrete)
set.seed(1000)
inTrain = createDataPartition(mixtures$CompressiveStrength, p = 3/4)[[1]]
training = mixtures[ inTrain,]
testing = mixtures[-inTrain,]

Make a plot of the outcome (CompressiveStrength) versus the index of the samples. Color by each of the variables in the data set (you may find the cut2() function in the Hmisc package useful for turning continuous covariates into factors). What do you notice in these plots?

With this command, you can add the index on the training set

training$index <- 1:nrow(training)

Let’s make the color index first. First, bring the Hmisc package first to use cut2() function. And… Rather than put the index manually, it’s cool to use for loop to make the col index.

library(Hmisc)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
Tdata <- training[,-9] ## -9 is to subtract the outcome variable, "CompressiveStrength"
colInd <- list()

for(i in 1:ncol(Tdata)) {
        colInd[[i]] <- cut2(Tdata[,i], g = 10)
}

Also, it’s double cool to use loop to make the plot. Well, you can make the loop to make both color index and plot simultaneously.

par(mfrow = c(2,5), mar = c(2,1,2,1))
names <- names(Tdata)
for(i in 1:ncol(Tdata)) {
        plot(Tdata[,i], training$CompressiveStrength, col = colInd[[i]])
        title(names[i])
}

Given this plot, you can infer that the plots for each variable except for index does not explain fully the relationship between index and CompressiveStrength. Something is missing to explain the move on the index.

  1. Load the cement data using the commands:
library(AppliedPredictiveModeling)
data(concrete)
library(caret)
set.seed(1000)
inTrain <- createDataPartition(mixtures$CompressiveStrength, p = 3/4)[[1]]
training <- mixtures[ inTrain,]
testing <- mixtures[-inTrain,]

Make a histogram and confirm the SuperPlasticizer variable is skewed. Normally you might use the log transform to try to make the data more symmetric. Why would that be a poor choice for this variable?

Let’s make the histogram to understand what the instruction says!

attach(training)
range(Superplasticizer)
## [1] 0.0000000 0.0131493

First, the range shows there is no negative value in the variable “Superplasticizer”

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:Hmisc':
## 
##     combine
g1 <- qplot(Superplasticizer, data= training, geom = "histogram", main = "Without Log")
g2 <- qplot(log(Superplasticizer+1), data= training, geom = "histogram", main = "With Log")
grid.arrange(g1,g2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

detach(training)

There is not much change in the histogram even after adapting the log. The reason is the frequency on near 0 is so high that log could not take care of it.

  1. Load the Alzheimer’s disease data using the commands:
library(caret)
library(AppliedPredictiveModeling)
set.seed(3433)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]

Find all the predictor variables in the training set that begin with IL. Perform principal components on these variables with the preProcess() function from the caret package. Calculate the number of principal components needed to capture 90% of the variance. How many are there?

You can do it with precomp in r base package(stats), or preProcess from caret package. I tried both, then it turns out precomp function is better.(The reason? Simple)

First, let’s grap the variables which begin with “IL”

n <- names(training)
g <- grep("^IL", n)
tr.IL <- training[,g]

With carat first.

suppressMessages(library(knitr))
c.pr <- preProcess(tr.IL,method = c("center","scale", "pca"), pcaComp = 12) 
## Only 12, the number of variables, give you the correct answer
pr.pred <- predict(c.pr, tr.IL)
pve <- apply(pr.pred,2,var)
a <- data.frame(cumsum(pve/sum(pve))); colnames(a) <- "Var. Explained"
kable(a, header =TRUE)
Var. Explained
PC1 0.3544031
PC2 0.4685630
PC3 0.5593108
PC4 0.6461080
PC5 0.7157456
PC6 0.7780067
PC7 0.8318726
PC8 0.8794419
PC9 0.9207146
PC10 0.9555449
PC11 0.9832193
PC12 1.0000000

You can do it more easily, actually it depends, with precomp function. The cool thing about this function is you don’t need to calculate the cumulative proportion.

o.pr <- prcomp(tr.IL,scale. = TRUE)
s <- summary(o.pr)
kable(t(s$importance))
Standard deviation Proportion of Variance Cumulative Proportion
PC1 2.0622407 0.35440 0.35440
PC2 1.1704354 0.11416 0.46856
PC3 1.0435393 0.09075 0.55931
PC4 1.0205712 0.08680 0.64611
PC5 0.9141397 0.06964 0.71575
PC6 0.8643691 0.06226 0.77801
PC7 0.8039837 0.05387 0.83187
PC8 0.7555341 0.04757 0.87944
PC9 0.7037559 0.04127 0.92071
PC10 0.6465014 0.03483 0.95554
PC11 0.5762751 0.02767 0.98322
PC12 0.4487404 0.01678 1.00000
The ans wer is as you see… 9.
  1. Load the Alzheimer’s disease data using the commands:
library(caret)
library(AppliedPredictiveModeling)
set.seed(3433)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[inTrain,]
testing = adData[-inTrain,]

In this problem, you should not transform the data with precomp, then put it into train function. Train function works, but you cannot predict the data with the model coming out of this process. Please put the data untransformed, then transform it by using built in option in the Train function. Let’s see what it means.

IL <- grep("^IL", names(training))
train.IL <- training[,IL];train.IL$diagnosis <- training$diagnosis
tr.train.IL <- train(diagnosis~., data = train.IL, method = "glm")
ctrl <- trainControl(preProcOptions = list(thresh = 0.8)) ## Built-in option
tr.pca.train.IL <- train(diagnosis~., data = train.IL, trControl = ctrl, preProc = c("center","scale","pca"), method = "glm" ) ## Transfrom the data to PCA with preProc option

pred.train.IL <- predict(tr.train.IL, testing)
pred.pca.train.IL <- predict(tr.pca.train.IL, testing)

Let’s see the result with confusion matrix.

confusionMatrix(pred.train.IL, testing$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Impaired Control
##   Impaired        2       9
##   Control        20      51
##                                          
##                Accuracy : 0.6463         
##                  95% CI : (0.533, 0.7488)
##     No Information Rate : 0.7317         
##     P-Value [Acc > NIR] : 0.96637        
##                                          
##                   Kappa : -0.0702        
##  Mcnemar's Test P-Value : 0.06332        
##                                          
##             Sensitivity : 0.09091        
##             Specificity : 0.85000        
##          Pos Pred Value : 0.18182        
##          Neg Pred Value : 0.71831        
##              Prevalence : 0.26829        
##          Detection Rate : 0.02439        
##    Detection Prevalence : 0.13415        
##       Balanced Accuracy : 0.47045        
##                                          
##        'Positive' Class : Impaired       
## 
confusionMatrix(pred.pca.train.IL, testing$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Impaired Control
##   Impaired        3       4
##   Control        19      56
##                                           
##                Accuracy : 0.7195          
##                  95% CI : (0.6094, 0.8132)
##     No Information Rate : 0.7317          
##     P-Value [Acc > NIR] : 0.651780        
##                                           
##                   Kappa : 0.0889          
##  Mcnemar's Test P-Value : 0.003509        
##                                           
##             Sensitivity : 0.13636         
##             Specificity : 0.93333         
##          Pos Pred Value : 0.42857         
##          Neg Pred Value : 0.74667         
##              Prevalence : 0.26829         
##          Detection Rate : 0.03659         
##    Detection Prevalence : 0.08537         
##       Balanced Accuracy : 0.53485         
##                                           
##        'Positive' Class : Impaired        
## 

Well… the answer is… you can check. right?