library(AppliedPredictiveModeling)
suppressMessages(library(caret))
data(AlzheimerDisease)
Which of the following commands will create non-overlapping training and test sets with about 50% of the observations assigned to each?
adData <- data.frame(diagnosis,predictors)
trainIndex <- createDataPartition(diagnosis, p = 0.50,list=FALSE)
training <- adData[trainIndex,]
testing <- adData[-trainIndex,]
Note that p = 0.75 to 0.8 is generally preferred in practice.
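As a quick sketch of that preferred split (the *75 variable names here are my own):
trainIndex75 <- createDataPartition(diagnosis, p = 0.75, list = FALSE)
training75 <- adData[trainIndex75, ]
testing75 <- adData[-trainIndex75, ]
nrow(training75) / nrow(adData)  ## roughly 0.75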
data(concrete)
set.seed(1000)
inTrain = createDataPartition(mixtures$CompressiveStrength, p = 3/4)[[1]]
training = mixtures[ inTrain,]
testing = mixtures[-inTrain,]
Make a plot of the outcome (CompressiveStrength) versus the index of the samples. Color by each of the variables in the data set (you may find the cut2() function in the Hmisc package useful for turning continuous covariates into factors). What do you notice in these plots?
You can add an index column to the training set with this command:
training$index <- 1:nrow(training)
Let’s build the color indices first. Load the Hmisc package to get the cut2() function. And rather than creating each color index by hand, it’s convenient to build them in a for loop.
suppressMessages(library(Hmisc))
Tdata <- training[,-9] ## drop column 9, the outcome variable "CompressiveStrength"
colInd <- list()
for (i in 1:ncol(Tdata)) {
  colInd[[i]] <- cut2(Tdata[, i], g = 10)  ## cut each covariate into 10 quantile groups
}
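As a quick sanity check, each element of colInd should be a factor with roughly ten equal-sized quantile groups (the exact cut points depend on your partition):
table(colInd[[1]])  ## counts per cut2() group for the first covariate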
It’s also handy to use a loop for the plots. In fact, you can build the color index and draw the plot in a single loop; a combined version is shown right after the plot loop below.
par(mfrow = c(2, 5), mar = c(2, 1, 2, 1))
names <- names(Tdata)
for (i in 1:ncol(Tdata)) {
  plot(Tdata[, i], training$CompressiveStrength, col = colInd[[i]])
  title(names[i])
}
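The combined version mentioned above builds each color index and draws the plot in one pass (same panels as above):
par(mfrow = c(2, 5), mar = c(2, 1, 2, 1))
for (i in 1:ncol(Tdata)) {
  ci <- cut2(Tdata[, i], g = 10)  ## color index for this covariate
  plot(Tdata[, i], training$CompressiveStrength, col = ci)
  title(names(Tdata)[i])
}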
Given these plots, you can see a clear step-like pattern in CompressiveStrength when plotted against the index, but none of the covariates shows a corresponding pattern that would explain it. There is a non-random trend over the index that no predictor in the data set accounts for, which suggests a variable is missing.
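To look at that unexplained trend on its own, you can plot the outcome against the index without any coloring; the step pattern is visible by itself:
plot(training$index, training$CompressiveStrength,
     xlab = "index", ylab = "CompressiveStrength")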
library(AppliedPredictiveModeling)
data(concrete)
library(caret)
set.seed(1000)
inTrain <- createDataPartition(mixtures$CompressiveStrength, p = 3/4)[[1]]
training <- mixtures[ inTrain,]
testing <- mixtures[-inTrain,]
Make a histogram and confirm the SuperPlasticizer variable is skewed. Normally you might use the log transform to try to make the data more symmetric. Why would that be a poor choice for this variable?
Let’s make the histogram to see what the question is getting at.
attach(training)
range(Superplasticizer)
## [1] 0.0000000 0.0131493
The range shows there are no negative values in Superplasticizer; the minimum is exactly zero.
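You can also count how many observations sit exactly at zero (the exact count depends on your partition):
sum(Superplasticizer == 0)  ## number of exact zeros in the training set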
suppressMessages(library(gridExtra))
g1 <- qplot(Superplasticizer, data= training, geom = "histogram", main = "Without Log")
g2 <- qplot(log(Superplasticizer+1), data= training, geom = "histogram", main = "With Log")
grid.arrange(g1,g2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
detach(training)
There is not much change in the histogram even after applying the log. Many values are exactly zero: log(0) is -Inf, and even with log(Superplasticizer + 1) all of those tied zeros map to the same value, so the spike near zero stays put and the distribution remains skewed.
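A one-line check of why a plain log transform would be a poor choice here:
log(0)  ## -Inf: the zeros in Superplasticizer break a plain log transform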
library(caret)
library(AppliedPredictiveModeling)
set.seed(3433)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]
Find all the predictor variables in the training set that begin with IL. Perform principal components on these variables with the preProcess() function from the caret package. Calculate the number of principal components needed to capture 90% of the variance. How many are there?
You can do this with prcomp() from the base stats package, or with preProcess() from the caret package. I tried both, and prcomp() turns out to be more convenient, for a simple reason we’ll see below: its summary() already reports the cumulative proportion of variance.
First, let’s grab the variables whose names begin with “IL”.
n <- names(training)
g <- grep("^IL", n)
tr.IL <- training[,g]
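To confirm the pattern matched what we want, print the captured names; there should be 12 of them:
n[g]  ## the 12 predictor names beginning with "IL"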
With caret first.
suppressMessages(library(knitr))
c.pr <- preProcess(tr.IL,method = c("center","scale", "pca"), pcaComp = 12)
## pcaComp = 12, the number of variables, keeps all the components so the cumulative sum below is complete
pr.pred <- predict(c.pr, tr.IL)
pve <- apply(pr.pred,2,var)
a <- data.frame(cumsum(pve/sum(pve))); colnames(a) <- "Var. Explained"
kable(a, header =TRUE)
| | Var. Explained |
|---|---|
| PC1 | 0.3544031 |
| PC2 | 0.4685630 |
| PC3 | 0.5593108 |
| PC4 | 0.6461080 |
| PC5 | 0.7157456 |
| PC6 | 0.7780067 |
| PC7 | 0.8318726 |
| PC8 | 0.8794419 |
| PC9 | 0.9207146 |
| PC10 | 0.9555449 |
| PC11 | 0.9832193 |
| PC12 | 1.0000000 |
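Alternatively, preProcess() can pick the number of components for you through its thresh argument; the numComp field of the returned object should then match the answer (a quick sketch, assuming your caret version stores numComp as below):
c.pr90 <- preProcess(tr.IL, method = c("center", "scale", "pca"), thresh = 0.9)
c.pr90$numComp  ## components needed to capture 90% of the variance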
You can do it more easily (well, it depends) with the prcomp() function. The nice thing about prcomp() is that you don’t need to compute the cumulative proportion yourself: summary() reports it.
o.pr <- prcomp(tr.IL,scale. = TRUE)
s <- summary(o.pr)
kable(t(s$importance))
| | Standard deviation | Proportion of Variance | Cumulative Proportion |
|---|---|---|---|
| PC1 | 2.0622407 | 0.35440 | 0.35440 |
| PC2 | 1.1704354 | 0.11416 | 0.46856 |
| PC3 | 1.0435393 | 0.09075 | 0.55931 |
| PC4 | 1.0205712 | 0.08680 | 0.64611 |
| PC5 | 0.9141397 | 0.06964 | 0.71575 |
| PC6 | 0.8643691 | 0.06226 | 0.77801 |
| PC7 | 0.8039837 | 0.05387 | 0.83187 |
| PC8 | 0.7555341 | 0.04757 | 0.87944 |
| PC9 | 0.7037559 | 0.04127 | 0.92071 |
| PC10 | 0.6465014 | 0.03483 | 0.95554 |
| PC11 | 0.5762751 | 0.02767 | 0.98322 |
| PC12 | 0.4487404 | 0.01678 | 1.00000 |
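If you’d rather compute the answer from the summary object than read it off the table:
which(s$importance["Cumulative Proportion", ] >= 0.9)[1]  ## PC9, i.e. 9 components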
The answer, as the Cumulative Proportion column shows, is 9.
library(caret)
library(AppliedPredictiveModeling)
set.seed(3433)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[inTrain,]
testing = adData[-inTrain,]
In this problem, you should not transform the data with prcomp() yourself and then feed the scores into the train() function. train() will run, but you won’t be able to predict on the test set with the resulting model, because the raw test predictors would first need the same PCA rotation. Instead, pass the untransformed data to train() and let its built-in preProcess option do the transformation. Let’s see what that means.
IL <- grep("^IL", names(training))
train.IL <- training[,IL];train.IL$diagnosis <- training$diagnosis
tr.train.IL <- train(diagnosis~., data = train.IL, method = "glm")
ctrl <- trainControl(preProcOptions = list(thresh = 0.8)) ## Built-in option
tr.pca.train.IL <- train(diagnosis~., data = train.IL, trControl = ctrl, preProc = c("center","scale","pca"), method = "glm") ## Transform the data with PCA via the preProc option
pred.train.IL <- predict(tr.train.IL, testing)
pred.pca.train.IL <- predict(tr.pca.train.IL, testing)
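For reference, the manual route warned about above looks like the sketch below; it only works if you remember to project the test predictors with the training-set rotation, which is exactly what the preProc option handles for you (using the IL index from above):
pc <- prcomp(training[, IL], center = TRUE, scale. = TRUE)
test.scores <- predict(pc, newdata = testing[, IL])  ## must reuse the training rotation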
Let’s look at the results with confusion matrices.
confusionMatrix(pred.train.IL, testing$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Impaired Control
## Impaired 2 9
## Control 20 51
##
## Accuracy : 0.6463
## 95% CI : (0.533, 0.7488)
## No Information Rate : 0.7317
## P-Value [Acc > NIR] : 0.96637
##
## Kappa : -0.0702
## Mcnemar's Test P-Value : 0.06332
##
## Sensitivity : 0.09091
## Specificity : 0.85000
## Pos Pred Value : 0.18182
## Neg Pred Value : 0.71831
## Prevalence : 0.26829
## Detection Rate : 0.02439
## Detection Prevalence : 0.13415
## Balanced Accuracy : 0.47045
##
## 'Positive' Class : Impaired
##
confusionMatrix(pred.pca.train.IL, testing$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Impaired Control
## Impaired 3 4
## Control 19 56
##
## Accuracy : 0.7195
## 95% CI : (0.6094, 0.8132)
## No Information Rate : 0.7317
## P-Value [Acc > NIR] : 0.651780
##
## Kappa : 0.0889
## Mcnemar's Test P-Value : 0.003509
##
## Sensitivity : 0.13636
## Specificity : 0.93333
## Pos Pred Value : 0.42857
## Neg Pred Value : 0.74667
## Prevalence : 0.26829
## Detection Rate : 0.03659
## Detection Prevalence : 0.08537
## Balanced Accuracy : 0.53485
##
## 'Positive' Class : Impaired
##
So, reading the two confusion matrices: the accuracy is 0.65 without PCA and 0.72 with PCA. The PCA model wins.
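To pull the two accuracies out programmatically instead of reading them off the printouts:
c(no_pca = confusionMatrix(pred.train.IL, testing$diagnosis)$overall["Accuracy"],
  pca = confusionMatrix(pred.pca.train.IL, testing$diagnosis)$overall["Accuracy"])
## no_pca ~ 0.65, pca ~ 0.72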