As a first step you must check whether the packages are installed; if not, install them with these commands:
install.packages("caret")
install.packages("AppliedPredictiveModeling")
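If you are not sure whether they are already installed, a small check like this can be used (a sketch, not part of the original quiz):
pkgs <- c("caret", "AppliedPredictiveModeling")
for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)   # install only if missing
}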
Which of the following commands will create non-overlapping training and test sets with about 50% of the observations assigned to each?
library(caret)
## Warning: package 'caret' was built under R version 3.3.1
## Loading required package: lattice
## Loading required package: ggplot2
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version
## 3.3.1
data(AlzheimerDisease)
library(Hmisc)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
This first option calls createDataPartition() twice, so the two 50% samples are drawn independently and will overlap:
adData = data.frame(diagnosis,predictors)
train = createDataPartition(diagnosis, p = 0.50,list=FALSE)
test = createDataPartition(diagnosis, p = 0.50,list=FALSE)
The correct commands call createDataPartition() only once and use the resulting index and its complement, so the two sets are non-overlapping and each holds about 50% of the observations:
adData = data.frame(diagnosis,predictors)
testIndex = createDataPartition(diagnosis, p = 0.50,list=FALSE)
training = adData[-testIndex,]
testing = adData[testIndex,]
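As a quick sanity check (not part of the quiz options), we can verify that the two sets are disjoint and together cover all of the observations:
length(intersect(rownames(training), rownames(testing)))  # expect 0 shared rows
nrow(training) + nrow(testing) == nrow(adData)            # expect TRUE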
Load the cement data using the commands:
library(AppliedPredictiveModeling)
data(concrete)
library(caret)
set.seed(1000)
inTrain = createDataPartition(mixtures$CompressiveStrength, p = 3/4)[[1]]
training = mixtures[ inTrain,]
testing = mixtures[-inTrain,]
Make a plot of the outcome (CompressiveStrength) versus the index of the samples. Color by each of the variables in the data set (you may find the cut2() function in the Hmisc package useful for turning continuous covariates into factors). What do you notice in these plots?
First we need to get the names of the columns so we can subset the predictors:
names <- colnames(concrete)
names <- names[-length(names)]   # drop the last column, the outcome CompressiveStrength
Now let’s make a quick feature plot to see if there is any relation between the outcome CompressiveStrength and the rest of the parameters in the data:
featurePlot(x = training[, names], y = training$CompressiveStrength, plot = "pairs")
We can observe in this plot that there is no clear relation between the outcome and any of the other variables in the data set.
Now we will make a plot of the outcome as a function of the index
index <- seq_len(nrow(training))   # row index of each training sample
ggplot(data = training, aes(x = index, y = CompressiveStrength)) + geom_point() + theme_bw()
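The question also asks to color this plot by each of the variables in the data set; as one example (a sketch only, using Age as the coloring variable, discretized with cut2() from Hmisc, which is already loaded), such a colored version could be made with:
cutAge <- cut2(training$Age, g = 4)   # Age is just one example covariate
ggplot(data = training, aes(x = index, y = CompressiveStrength, colour = cutAge)) + geom_point() + theme_bw()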
In the index plot we see a step-like pattern in the data that could be explained by one or more of the variables in the data set. From this plot it seems reasonable to cut the outcome into 4 categories.
cutCS <- cut2(training$CompressiveStrength, g = 4)
summary(cutCS)
## [ 2.33,23.7) [23.74,34.6) [34.56,46.2) [46.23,82.6]
## 194 193 194 193
Then we make a plot of the categorized outcome against the index:
ggplot(data = training, aes(y = index, x = cutCS)) + geom_boxplot() + geom_jitter(col = "blue") + theme_bw()
The step-like pattern is much clearer in this plot.
Now we’ll make a plot of the categorized outcome as a function of the rest of the variables:
featurePlot(x = training[, names], y = cutCS, plot = "box")
Once more, none of the variables in the data set can explain the step-like behaviour in the outcome.
The answer is: There is a step-like pattern in the plot of outcome versus index in the training set that isn’t explained by any of the predictor variables so there may be a variable missing.
Load the cement data using the commands:
library(AppliedPredictiveModeling)
data(concrete)
library(caret)
set.seed(1000)
inTrain = createDataPartition(mixtures$CompressiveStrength, p = 3/4)[[1]]
training = mixtures[ inTrain,]
testing = mixtures[-inTrain,]
Make a histogram and confirm the SuperPlasticizer variable is skewed. Normally you might use the log transform to try to make the data more symmetric. Why would that be a poor choice for this variable?
ggplot(data = training, aes(x = Superplasticizer)) + geom_histogram() + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are values of zero, so when you take the log() transform those values become -Inf.
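A quick illustration (a sketch, not part of the original answer):
sum(training$Superplasticizer == 0)   # count of zero values in the training set
log(0)                                # zeros map to -Inf under the log transform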
Load the Alzheimer’s disease data using the commands:
library(caret)
library(AppliedPredictiveModeling)
set.seed(3433)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]
Find all the predictor variables in the training set that begin with IL. Perform principal components on these variables with the preProcess() function from the caret package. Calculate the number of principal components needed to capture 90% of the variance. How many are there?
library(ggplot2)
library(caret)
ncol(training)
## [1] 131
which(sapply(adData,class)=="factor")
## diagnosis Genotype
## 1 131
summary(training$diagnosis)
## Impaired Control
## 69 182
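The question first asks to find the predictor variables whose names begin with IL; the same grep() pattern used below can list them directly (a small sketch):
ilNames <- grep('^IL', names(training), value = TRUE)   # names of the IL_* predictors
ilNames
length(ilNames)   # 12 predictors, matching the preProcess() summary further down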
training$diagnosis = as.numeric(training$diagnosis)   # not needed for the PCA below, which uses only the IL_* columns
p <- prcomp(training[,grep('^IL',names(training))])   # quick exploratory PCA (unscaled)
p$rotation[,1:7]
## PC1 PC2 PC3 PC4
## IL_11 -0.8398033242 0.10046706 0.5267885354 0.054607502
## IL_13 -0.0003310431 -0.00423693 0.0009998705 0.003115891
## IL_16 0.0892407592 -0.36481166 0.1855061534 0.572050454
## IL_17E -0.5020336689 -0.49325988 -0.7089120144 0.017554761
## IL_1alpha 0.0484008786 -0.11322325 0.0569829210 0.118655960
## IL_3 0.1040261291 -0.27300040 0.1699381882 0.263855139
## IL_4 0.0584766359 -0.22364552 0.0977670307 0.213823138
## IL_5 0.0548133353 -0.19764601 0.1141032393 0.278567853
## IL_6 0.0582671357 -0.02256691 -0.0265035194 0.154607278
## IL_6_Receptor 0.0104851858 -0.06814260 0.0281536855 0.189136237
## IL_7 0.1078232519 -0.65711127 0.3595689403 -0.634318046
## IL_8 0.0030980876 -0.01084788 0.0059321249 0.020098675
## PC5 PC6 PC7
## IL_11 0.056894814 0.020519857 -0.0150319566
## IL_13 -0.001033979 0.003533855 -0.0008470324
## IL_16 -0.046567992 -0.326626203 0.0691010888
## IL_17E -0.003123557 -0.007561311 -0.0139867230
## IL_1alpha -0.043818431 0.191980683 -0.9210081633
## IL_3 -0.035022229 -0.036092140 0.0283654620
## IL_4 -0.069325249 0.892445093 0.2686333752
## IL_5 -0.103292061 -0.126252574 -0.0883555683
## IL_6 0.980892603 0.032313850 -0.0085628907
## IL_6_Receptor -0.077162495 -0.188524245 0.2550082237
## IL_7 0.087778186 -0.075199934 0.0257831347
## IL_8 0.012343696 -0.005177812 -0.0020527268
qplot(1:length(p$sdev),p$sdev / sum(p$sdev))
which(cumsum(p$sdev) / sum(p$sdev) <= .9)
## [1] 1 2 3 4 5 6 7
(cumsum(p$sdev) / sum(p$sdev))[8]
## [1] 0.9186485
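Note that this quick check is only exploratory: it uses raw standard deviations from an unscaled prcomp(), whereas the fraction of variance explained should be computed from the squared sdev values on centered and scaled data, which is what caret's preProcess() does. A sketch of the equivalent manual check, which should agree with the preProcess() result below:
ilCols <- grep('^IL', names(training))
pScaled <- prcomp(training[, ilCols], center = TRUE, scale. = TRUE)
varExplained <- cumsum(pScaled$sdev^2) / sum(pScaled$sdev^2)
which(varExplained >= 0.9)[1]   # number of components needed to reach 90% of the variance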
preProc <- preProcess(training[,grep('^IL',names(training))],method="pca",thres=.9)
# See the result here
preProc
## Created from 251 samples and 12 variables
##
## Pre-processing:
## - centered (12)
## - ignored (0)
## - principal component signal extraction (12)
## - scaled (12)
##
## PCA needed 9 components to capture 90 percent of the variance
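To actually use these components, the fitted preProcess object is applied to the data with predict(); a short sketch (variable names here are illustrative):
ilTrain <- training[, grep('^IL', names(training))]
trainPCs <- predict(preProc, ilTrain)   # principal component scores for the training set
dim(trainPCs)                           # one column per retained component (9, per the summary above)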
Load the Alzheimer’s disease data using the commands:
library(caret)
library(AppliedPredictiveModeling)
set.seed(3433)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]
Create a training data set consisting of only the predictors with variable names beginning with IL and the diagnosis. Build two predictive models, one using the predictors as they are and one using PCA with principal components explaining 80% of the variance in the predictors. Use method=“glm” in the train function.
What is the accuracy of each method in the test set? Which is more accurate?
trainSmall <- data.frame(training[,grep('^IL',names(training))],training$diagnosis)   # 12 IL predictors plus the diagnosis
testSmall <- data.frame(testing[,grep('^IL',names(testing))],testing$diagnosis)
preProc <- preProcess(trainSmall[-13],method="pca",thres=.8)   # column 13 is the diagnosis, so PCA runs on the IL predictors only
trainPC <- predict(preProc,trainSmall[-13])
testPC <- predict(preProc,testSmall[-13])
PCFit <- train(trainSmall$training.diagnosis~.,data=trainPC,method="glm")         # glm on the principal components
NotPCFit <- train(trainSmall$training.diagnosis~.,data=trainSmall,method="glm")   # glm on the raw IL predictors
PCTestPredict <- predict(PCFit,newdata=testPC)
NotPCTestPredict <- predict(NotPCFit,newdata=testSmall)
confusionMatrix(PCTestPredict,testSmall$testing.diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Impaired Control
## Impaired 3 4
## Control 19 56
##
## Accuracy : 0.7195
## 95% CI : (0.6094, 0.8132)
## No Information Rate : 0.7317
## P-Value [Acc > NIR] : 0.651780
##
## Kappa : 0.0889
## Mcnemar's Test P-Value : 0.003509
##
## Sensitivity : 0.13636
## Specificity : 0.93333
## Pos Pred Value : 0.42857
## Neg Pred Value : 0.74667
## Prevalence : 0.26829
## Detection Rate : 0.03659
## Detection Prevalence : 0.08537
## Balanced Accuracy : 0.53485
##
## 'Positive' Class : Impaired
##
confusionMatrix(NotPCTestPredict,testSmall$testing.diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Impaired Control
## Impaired 2 9
## Control 20 51
##
## Accuracy : 0.6463
## 95% CI : (0.533, 0.7488)
## No Information Rate : 0.7317
## P-Value [Acc > NIR] : 0.96637
##
## Kappa : -0.0702
## Mcnemar's Test P-Value : 0.06332
##
## Sensitivity : 0.09091
## Specificity : 0.85000
## Pos Pred Value : 0.18182
## Neg Pred Value : 0.71831
## Prevalence : 0.26829
## Detection Rate : 0.02439
## Detection Prevalence : 0.13415
## Balanced Accuracy : 0.47045
##
## 'Positive' Class : Impaired
##
The accuracies are 0.72 for the PCA model and 0.65 for the model on the raw IL predictors, so the PCA-based model is more accurate.
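For reference, the two accuracies can also be pulled directly from the confusionMatrix objects:
confusionMatrix(PCTestPredict, testSmall$testing.diagnosis)$overall["Accuracy"]      # PCA model
confusionMatrix(NotPCTestPredict, testSmall$testing.diagnosis)$overall["Accuracy"]   # non-PCA model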