Question 1

For this quiz we will be using several R packages. R package versions change over time, so the right answers have been checked using the following versions of the packages: AppliedPredictiveModeling v1.1.6, caret v6.0.47, ElemStatLearn v2012.04-0, pgmm v1.1, rpart v4.1.8.

If you aren’t using these versions of the packages, your answers may not exactly match the right answer, but hopefully should be close.

library(caret)  #packageDescription("caret") 6.0-78
## Loading required package: lattice
## Loading required package: ggplot2
library(AppliedPredictiveModeling)  #1.1-6
library(ElemStatLearn) #packageDescription("ElemStatLearn") #2015.6.26
library(rpart) #4.1-13
library(pgmm)  #1.2.2

Load the cell segmentation data from the AppliedPredictiveModeling package using the commands:

library(AppliedPredictiveModeling)
data(segmentationOriginal)
library(caret)
  1. Subset the data to a training set and testing set based on the Case variable in the data set.
training<-segmentationOriginal[which(segmentationOriginal$Case=='Train'),]
testing<-segmentationOriginal[which(segmentationOriginal$Case=='Test'),]
dim(training)
## [1] 1009  119
dim(testing)
## [1] 1010  119
table(segmentationOriginal$Case)
## 
##  Test Train 
##  1010  1009
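
The same split can also be written with base subset(); a minimal equivalent sketch:

training <- subset(segmentationOriginal, Case == "Train")  # same rows as above
testing  <- subset(segmentationOriginal, Case == "Test")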
  2. Set the seed to 125 and fit a CART model with the rpart method using all predictor variables and default caret settings.
set.seed(125)
CART<-train(Class ~ .,method="rpart",data=training)
CART$finalModel
## n= 1009 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1009 373 PS (0.63032706 0.36967294)  
##   2) TotalIntenCh2< 45323.5 454  34 PS (0.92511013 0.07488987) *
##   3) TotalIntenCh2>=45323.5 555 216 WS (0.38918919 0.61081081)  
##     6) FiberWidthCh1< 9.673245 154  47 PS (0.69480519 0.30519481) *
##     7) FiberWidthCh1>=9.673245 401 109 WS (0.27182045 0.72817955) *
  3. In the final model, what would be the prediction for cases with the following variable values?
  a. TotalIntench2 = 23,000; FiberWidthCh1 = 10; PerimStatusCh1 = 2
  b. TotalIntench2 = 50,000; FiberWidthCh1 = 10; VarIntenCh4 = 100
  c. TotalIntench2 = 57,000; FiberWidthCh1 = 8; VarIntenCh4 = 100
  d. FiberWidthCh1 = 8; VarIntenCh4 = 100; PerimStatusCh1 = 2
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(CART$finalModel)

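If rattle is not available, the same tree can be drawn with the rpart.plot package or with base graphics; a minimal fallback sketch, assuming rpart.plot is installed:

library(rpart.plot)
rpart.plot(CART$finalModel)
# or, with base graphics only:
plot(CART$finalModel, uniform = TRUE, margin = 0.1)
text(CART$finalModel, use.n = TRUE, cex = 0.8)
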
newdata <- subset(segmentationOriginal[1,], select=-c(Class))
# Note: "TotalIntench2" below is misspelled (the model's variable is TotalIntenCh2),
# so predict() ignores it and falls back on row 1's original TotalIntenCh2 value.
newdata$TotalIntench2 = 23000
newdata$FiberWidthCh1 = 10
newdata$PerimStatusCh1 = 2
predict(CART, newdata=newdata)
## [1] PS
## Levels: PS WS
newdata$TotalIntench2 = 50000
newdata$FiberWidthCh1 = 10
newdata$VarIntenCh4 = 100
predict(CART, newdata=newdata)
## [1] PS
## Levels: PS WS
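
Because of that typo the printed predictions above should not be taken at face value; the answers are easier to read directly off the two splits in the printed tree. A minimal sketch that applies those splits by hand (the thresholds 45323.5 and 9.673245 are copied from the model output, and classifyCase is a hypothetical helper, not part of the quiz):

classifyCase <- function(TotalIntenCh2 = NA, FiberWidthCh1 = NA) {
  if (is.na(TotalIntenCh2)) return(NA)        # root split variable missing
  if (TotalIntenCh2 < 45323.5) return("PS")   # node 2
  if (is.na(FiberWidthCh1)) return(NA)
  if (FiberWidthCh1 < 9.673245) return("PS")  # node 6
  "WS"                                        # node 7
}
classifyCase(23000, 10)            # a. PS
classifyCase(50000, 10)            # b. WS
classifyCase(57000, 8)             # c. PS
classifyCase(FiberWidthCh1 = 8)    # d. NA -> not possible to predict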

Answer: a. PS, b. WS, c. PS, d. Not possible to predict. Reading the splits: TotalIntenCh2 < 45323.5 gives PS (a); TotalIntenCh2 >= 45323.5 with FiberWidthCh1 >= 9.673 gives WS (b); TotalIntenCh2 >= 45323.5 with FiberWidthCh1 < 9.673 gives PS (c); and with no TotalIntenCh2 value the root split cannot be evaluated (d).

Question 2

If K is small in K-fold cross-validation, is the bias in the estimate of out-of-sample (test set) accuracy smaller or bigger? If K is small, is the variance in the estimate of out-of-sample (test set) accuracy smaller or bigger? Is K large or small in leave-one-out cross-validation?

Answer: The bias is larger and the variance is smaller. With small K each model is trained on a smaller fraction of the data, so the accuracy estimate is more pessimistic (larger bias), while the fold estimates come from less-overlapping training sets and so their average varies less (smaller variance). Under leave-one-out cross-validation K is equal to the sample size.
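
For reference, K is set in caret through trainControl (method and number are standard arguments; the particular values below are only illustrative):

# Small K: each training fold is a smaller share of the data -> larger bias, smaller variance
ctrl_cv3 <- trainControl(method = "cv", number = 3)
# Leave-one-out CV: K equals the number of observations
ctrl_loo <- trainControl(method = "LOOCV")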

Question 3

Load the olive oil data using the commands:

library(pgmm)
data(olive)
olive = olive[,-1]

These data contain information on 572 different Italian olive oils from multiple regions in Italy. Fit a classification tree where Area is the outcome variable. Then predict the value of Area for the following data frame using the tree command with all defaults.

modFit <- train(Area ~ .,method="rpart",data=olive)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
newdata = as.data.frame(t(colMeans(olive)))
predict(modFit,newdata=newdata)
##        1 
## 2.783282
t(colMeans(olive))
##         Area Palmitic Palmitoleic  Stearic    Oleic Linoleic Linolenic
## [1,] 4.59965 1231.741    126.0944 228.8654 7311.748  980.528  31.88811
##      Arachidic Eicosenoic
## [1,]   58.0979   16.28147
fancyRpartPlot(modFit$finalModel)

head(olive)
##   Area Palmitic Palmitoleic Stearic Oleic Linoleic Linolenic Arachidic
## 1    1     1075          75     226  7823      672        36        60
## 2    1     1088          73     224  7709      781        31        61
## 3    1      911          54     246  8113      549        31        63
## 4    1      966          57     240  7952      619        50        78
## 5    1     1051          67     259  7771      672        50        80
## 6    1      911          49     268  7924      678        51        70
##   Eicosenoic
## 1         29
## 2         29
## 3         29
## 4         35
## 5         46
## 6         44
table(olive$Area)
## 
##   1   2   3   4   5   6   7   8   9 
##  25  56 206  36  65  33  50  50  51

Answer: 2.783. This is strange because Area should be a qualitative variable, but the tree is reporting the average value of Area as a numeric variable in the leaf predicted for newdata.
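
For comparison, a sketch of the classification version, converting Area to a factor before fitting (not what the quiz asks for, just an illustration of why the numeric prediction looks odd):

oliveF <- olive
oliveF$Area <- as.factor(oliveF$Area)          # treat Area as a class label
modFitF <- train(Area ~ ., method = "rpart", data = oliveF)
predict(modFitF, newdata = newdata)            # now returns one of the nine Area classes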

Question 4

Load the South Africa Heart Disease Data and create training and test sets with the following code:

library(ElemStatLearn)
data(SAheart)
set.seed(8484)
train = sample(1:dim(SAheart)[1],size=dim(SAheart)[1]/2,replace=F)
trainSA = SAheart[train,]
testSA = SAheart[-train,]

Then set the seed to 13234 and fit a logistic regression model (method=“glm”, be sure to specify family=“binomial”) with Coronary Heart Disease (chd) as the outcome and age at onset, current alcohol consumption, obesity levels, cumulative tobacco, type-A behavior, and low density lipoprotein cholesterol as predictors. Calculate the misclassification rate for your model using this function and a prediction on the “response” scale: missClass = function(values,prediction){sum(((prediction > 0.5)*1) != values)/length(values)}

set.seed(13234)
modelFit<-train(chd ~ age+alcohol+obesity+tobacco+typea+ldl, data=trainSA, method="glm", family="binomial")
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to
## do classification? If so, use a 2 level factor as your outcome column.
missClass = function(values,prediction){sum(((prediction > 0.5)*1) != values)/length(values)}
predictTrain<-predict(modelFit,trainSA)
predictTest<-predict(modelFit,testSA)
missClassTrain<-missClass(trainSA$chd,predictTrain)
missClassTrain
## [1] 0.2727273
missClassTest<-missClass(testSA$chd,predictTest)
missClassTest
## [1] 0.3116883
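
Answer: the training set misclassification rate is about 0.27 and the test set misclassification rate is about 0.31. The same rates can be checked without caret by fitting glm() directly and predicting on the response (probability) scale, as the question suggests; a minimal sketch:

glmFit <- glm(chd ~ age + alcohol + obesity + tobacco + typea + ldl,
              data = trainSA, family = "binomial")
missClass(trainSA$chd, predict(glmFit, newdata = trainSA, type = "response"))
missClass(testSA$chd, predict(glmFit, newdata = testSA, type = "response"))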

Question 5

Load the vowel.train and vowel.test data sets:

library(ElemStatLearn)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
## 
##     importance
## The following object is masked from 'package:ggplot2':
## 
##     margin
data(vowel.train)
data(vowel.test)

Set the variable y to be a factor variable in both the training and test sets. Then set the seed to 33833. Fit a random forest predictor relating the factor variable y to the remaining variables. Read about variable importance in random forests here: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr The caret package uses Gini importance by default.

Calculate the variable importance using the varImp function in the caret package. What is the order of variable importance?

[NOTE: Use randomForest() specifically, not caret, as there have been some issues reported with that approach. 11/6/2016]

# Convert y to a factor in both data sets without disturbing the other columns
vowel.train_f <- vowel.train
vowel.test_f  <- vowel.test
vowel.train_f$y <- as.factor(vowel.train_f$y)
vowel.test_f$y  <- as.factor(vowel.test_f$y)
set.seed(33833)
modFit <- randomForest(y ~ .,data=vowel.train_f)
order(varImp(modFit),decreasing = TRUE)
##  [1]  2  1  5  6  8  4  9  3  7 10
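
The indices above refer to the columns x.1 through x.10. To read the importance order as variable names directly, a small sketch:

imp <- varImp(modFit)    # for a randomForest fit this is a data frame with an "Overall" column
rownames(imp)[order(imp$Overall, decreasing = TRUE)]
# From the indices printed above: x.2, x.1, x.5, x.6, x.8, x.4, x.9, x.3, x.7, x.10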