For this quiz we will be using several R packages. R package versions change over time, the right answers have been checked using the following versions of the packages. AppliedPredictiveModeling: v1.1.6 caret: v6.0.47 ElemStatLearn: v2012.04-0 pgmm: v1.1 rpart: v4.1.8
If you aren’t using these versions of the packages, your answers may not exactly match the right answer, but hopefully should be close.
library(caret) #packageDescription("caret") 6.0-78
## Loading required package: lattice
## Loading required package: ggplot2
library(AppliedPredictiveModeling) #1.1-6
library(ElemStatLearn) #packageDescription("ElemStatLearn") #2015.6.26
library(rpart) #4.1-13
library(pgmm) #1.2.2
Load the cell segmentation data from the AppliedPredictiveModeling package using the commands:
library(AppliedPredictiveModeling)
data(segmentationOriginal)
library(caret)
training<-segmentationOriginal[which(segmentationOriginal$Case=='Train'),]
testing<-segmentationOriginal[which(segmentationOriginal$Case=='Test'),]
dim(training)
## [1] 1009 119
dim(testing)
## [1] 1010 119
table(segmentationOriginal$Case)
##
## Test Train
## 1010 1009
set.seed(125)
CART<-train(Class ~ .,method="rpart",data=training)
CART$finalModel
## n= 1009
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1009 373 PS (0.63032706 0.36967294)
## 2) TotalIntenCh2< 45323.5 454 34 PS (0.92511013 0.07488987) *
## 3) TotalIntenCh2>=45323.5 555 216 WS (0.38918919 0.61081081)
## 6) FiberWidthCh1< 9.673245 154 47 PS (0.69480519 0.30519481) *
## 7) FiberWidthCh1>=9.673245 401 109 WS (0.27182045 0.72817955) *
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(CART$finalModel)
print(CART$finalModel)
## n= 1009
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1009 373 PS (0.63032706 0.36967294)
## 2) TotalIntenCh2< 45323.5 454 34 PS (0.92511013 0.07488987) *
## 3) TotalIntenCh2>=45323.5 555 216 WS (0.38918919 0.61081081)
## 6) FiberWidthCh1< 9.673245 154 47 PS (0.69480519 0.30519481) *
## 7) FiberWidthCh1>=9.673245 401 109 WS (0.27182045 0.72817955) *
newdata <- subset(segmentationOriginal[1,], select=-c(Class))
newdata$TotalIntench2 = 23000
newdata$FiberWidthCh1 = 10
newdata$PerimStatusCh1 = 2
predict(CART, newdata=newdata)
## [1] PS
## Levels: PS WS
newdata$TotalIntench2 = 50000
newdata$FiberWidthCh1 = 10
newdata$VarIntenCh4 = 100
predict(CART, newdata=newdata)
## [1] PS
## Levels: PS WS
Answer: a. PS b. WS c. PS d. Not possible to predict
If K is small in a K-fold cross validation is the bias in the estimate of out-of-sample (test set) accuracy smaller or bigger? If K is small is the variance in the estimate of out-of-sample (test set) accuracy smaller or bigger. Is K large or small in leave one out cross validation?
Answer: The bias is larger and the variance is smaller. Under leave one out cross validation K is equal to the sample size.
Load the olive oil data using the commands:
library(pgmm)
data(olive)
olive = olive[,-1]
These data contain information on 572 different Italian olive oils from multiple regions in Italy. Fit a classification tree where Area is the outcome variable. Then predict the value of area for the following data frame using the tree command with all defaults
modFit <- train(Area ~ .,method="rpart",data=olive)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
newdata = as.data.frame(t(colMeans(olive)))
predict(modFit,newdata=newdata)
## 1
## 2.783282
t(colMeans(olive))
## Area Palmitic Palmitoleic Stearic Oleic Linoleic Linolenic
## [1,] 4.59965 1231.741 126.0944 228.8654 7311.748 980.528 31.88811
## Arachidic Eicosenoic
## [1,] 58.0979 16.28147
fancyRpartPlot(modFit$finalModel)
head(olive)
## Area Palmitic Palmitoleic Stearic Oleic Linoleic Linolenic Arachidic
## 1 1 1075 75 226 7823 672 36 60
## 2 1 1088 73 224 7709 781 31 61
## 3 1 911 54 246 8113 549 31 63
## 4 1 966 57 240 7952 619 50 78
## 5 1 1051 67 259 7771 672 50 80
## 6 1 911 49 268 7924 678 51 70
## Eicosenoic
## 1 29
## 2 29
## 3 29
## 4 35
## 5 46
## 6 44
table(olive$Area)
##
## 1 2 3 4 5 6 7 8 9
## 25 56 206 36 65 33 50 50 51
Answer: 2.783. It is strange because Area should be a qualitative variable - but tree is reporting the average value of Area as a numeric variable in the leaf predicted for newdata
Load the South Africa Heart Disease Data and create training and test sets with the following code:
library(ElemStatLearn)
data(SAheart)
set.seed(8484)
train = sample(1:dim(SAheart)[1],size=dim(SAheart)[1]/2,replace=F)
trainSA = SAheart[train,]
testSA = SAheart[-train,]
Then set the seed to 13234 and fit a logistic regression model (method=“glm”, be sure to specify family=“binomial”) with Coronary Heart Disease (chd) as the outcome and age at onset, current alcohol consumption, obesity levels, cumulative tabacco, type-A behavior, and low density lipoprotein cholesterol as predictors. Calculate the misclassification rate for your model using this function and a prediction on the “response” scale: missClass = function(values,prediction){sum(((prediction > 0.5)*1) != values)/length(values)}
set.seed(13234)
modelFit<-train(chd ~ age+alcohol+obesity+tobacco+typea+ldl, data=trainSA, method="glm", family="binomial")
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to
## do classification? If so, use a 2 level factor as your outcome column.
missClass = function(values,prediction){sum(((prediction > 0.5)*1) != values)/length(values)}
predictTrain<-predict(modelFit,trainSA)
predictTest<-predict(modelFit,testSA)
missClassTrain<-missClass(trainSA$chd,predictTrain)
missClassTrain
## [1] 0.2727273
missClassTest<-missClass(testSA$chd,predictTest)
missClassTest
## [1] 0.3116883
Load the vowel.train and vowel.test data sets:
library(ElemStatLearn)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
data(vowel.train)
data(vowel.test)
Set the variable y to be a factor variable in both the training and test set. Then set the seed to 33833. Fit a random forest predictor relating the factor variable y to the remaining variables. Read about variable importance in random forests here: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr The caret package uses by default the Gini importance.
Calculate the variable importance using the varImp function in the caret package. What is the order of variable importance?
[NOTE: Use randomForest() specifically, not caret, as there’s been some issues reported with that approach. 11/6/2016]
vowel.test_f<-cbind(as.factor(vowel.test$y), vowel.test[,-1])
vowel.train_f<-cbind(as.factor(vowel.train$y), vowel.train[,-1])
names(vowel.test_f)<-names(vowel.test)
names(vowel.train_f)<-names(vowel.train)
set.seed(33833)
modFit <- randomForest(y ~ .,data=vowel.train_f)
order(varImp(modFit),decreasing = TRUE)
## [1] 2 1 5 6 8 4 9 3 7 10