Question 1

Load the cell segmentation data from the AppliedPredictiveModeling package using the commands:

library(AppliedPredictiveModeling)
data(segmentationOriginal)
suppressMessages(library(caret))
  1. Subset the data to a training set and testing set based on the Case variable in the data set.

  2. Set the seed to 125 and fit a CART model with the rpart method using all predictor variables and default caret settings.

  3. In the final model what would be the final model prediction for cases with the following variable values:

  1. TotalIntench2 = 23,000; FiberWidthCh1 = 10; PerimStatusCh1=2

  2. TotalIntench2 = 50,000; FiberWidthCh1 = 10;VarIntenCh4 = 100

  3. TotalIntench2 = 57,000; FiberWidthCh1 = 8;VarIntenCh4 = 100

  4. FiberWidthCh1 = 8;VarIntenCh4 = 100; PerimStatusCh1=2

Answer 1

inTrain <- createDataPartition(y = segmentationOriginal$Case, p = 0.75, list = FALSE)
training <- segmentationOriginal[inTrain, ]
testing <- segmentationOriginal[-inTrain, ]
set.seed(125)
modFit <- train(Class ~ ., method = "rpart", data = training)
modFit$finalModel
## n= 1515 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1515 532 PS (0.64884488 0.35115512)  
##   2) TotalIntenCh2< 47257 736  61 PS (0.91711957 0.08288043) *
##   3) TotalIntenCh2>=47257 779 308 WS (0.39537869 0.60462131)  
##     6) ConvexHullAreaRatioCh1>=1.070989 450 207 PS (0.54000000 0.46000000) *
##     7) ConvexHullAreaRatioCh1< 1.070989 329  65 WS (0.19756839 0.80243161) *
suppressMessages(library(rattle))
library(rpart.plot)
## Loading required package: rpart
fancyRpartPlot(modFit$finalModel)

Question 2

If \(K\) is small in a \(K\)-fold cross validation is the bias in the estimate of out-of-sample (test set) accuracy smaller or bigger? If \(K\) is small is the variance in the estimate of out-of-sample (test set) accuracy smaller or bigger. Is \(K\) large or small in leave one out cross validation?

Answer 2

The bias is larger and the variance is smaller. Under leave one out cross validation K is equal to the sample size.

Question 3

Load the olive oil data using the commands:

library(pgmm)
data(olive)
olive = olive[,-1]
head(olive)
##   Area Palmitic Palmitoleic Stearic Oleic Linoleic Linolenic Arachidic
## 1    1     1075          75     226  7823      672        36        60
## 2    1     1088          73     224  7709      781        31        61
## 3    1      911          54     246  8113      549        31        63
## 4    1      966          57     240  7952      619        50        78
## 5    1     1051          67     259  7771      672        50        80
## 6    1      911          49     268  7924      678        51        70
##   Eicosenoic
## 1         29
## 2         29
## 3         29
## 4         35
## 5         46
## 6         44

These data contain information on 572 different Italian olive oils from multiple regions in Italy. Fit a classification tree where Area is the outcome variable. Then predict the value of area for the following data frame using the tree command with all defaults

newdata = as.data.frame(t(colMeans(olive)))
head(newdata)
##      Area Palmitic Palmitoleic  Stearic    Oleic Linoleic Linolenic Arachidic
## 1 4.59965 1231.741    126.0944 228.8654 7311.748  980.528  31.88811   58.0979
##   Eicosenoic
## 1   16.28147

What is the resulting prediction? Is the resulting prediction strange? Why or why not?

Answer 3

fitCT <- train(Area ~ ., method = "rpart", data = olive)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
fitCT$finalModel
## n= 572 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 572 3171.32000 4.599650  
##   2) Eicosenoic>=6.5 323  176.82970 2.783282 *
##   3) Eicosenoic< 6.5 249  546.51410 6.955823  
##     6) Linoleic>=1053.5 98   21.88776 5.336735 *
##     7) Linoleic< 1053.5 151  100.99340 8.006623 *
fancyRpartPlot(fitCT$finalModel)

predCT <- predict(fitCT, newdata)
predCT
##        1 
## 2.783282

It is strange because Area should be a qualitative variable - but tree is reporting the average value of Area as a numeric variable in the leaf predicted for newdata.

Question 4

Load the South Africa Heart Disease Data and create training and test sets with the following code:

suppressMessages(library(ElemStatLearn))
data(SAheart)
set.seed(8484)
train = sample(1:dim(SAheart)[1],size=dim(SAheart)[1]/2,replace=F)
trainSA = SAheart[train,]
testSA = SAheart[-train,]

Then set the seed to 13234 and fit a logistic regression model (method=“glm”, be sure to specify family=“binomial”) with Coronary Heart Disease (chd) as the outcome and age at onset, current alcohol consumption, obesity levels, cumulative tabacco, type-A behavior, and low density lipoprotein cholesterol as predictors. Calculate the misclassification rate for your model using this function and a prediction on the “response” scale:

missClass = function(values,prediction){sum(((prediction > 0.5)*1) != values)/length(values)}

What is the misclassification rate on the training set? What is the misclassification rate on the test set?

Answer 4

names(SAheart)
##  [1] "sbp"       "tobacco"   "ldl"       "adiposity" "famhist"   "typea"    
##  [7] "obesity"   "alcohol"   "age"       "chd"
set.seed(13234)
fitLogReg <- train(chd ~ age + alcohol + obesity + tobacco + typea + ldl, 
                   method = "glm", 
                   family = 'binomial',
                   data = trainSA)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to do
## classification? If so, use a 2 level factor as your outcome column.
missClass(trainSA$chd, predict(fitLogReg, trainSA))
## [1] 0.3116883
missClass(testSA$chd, predict(fitLogReg, testSA))
## [1] 0.2813853

Question 5

Load the vowel.train and vowel.test data sets:

library(ElemStatLearn)
data(vowel.train)
data(vowel.test)

Set the variable y to be a factor variable in both the training and test set. Then set the seed to 33833. Fit a random forest predictor relating the factor variable y to the remaining variables.

The caret package uses by default the Gini importance.

Calculate the variable importance using the varImp function in the caret package. What is the order of variable importance?

Answer 5

suppressMessages(library(randomForest))
vowel.train$y <- factor(vowel.train$y)
vowel.test$y <- factor(vowel.test$y)
set.seed(33833)

fitRFC <- train(y ~ .,
                data = vowel.train,
                method = 'rf')
varImp(fitRFC$finalModel)
##       Overall
## x.1  78.70925
## x.2  79.98741
## x.3  36.48760
## x.4  36.85415
## x.5  51.67477
## x.6  46.44626
## x.7  33.82075
## x.8  43.26819
## x.9  37.63431
## x.10 34.19102
order(varImp(fitRFC$finalModel), decreasing = T)
## Warning in xtfrm.data.frame(x): cannot xtfrm data frames
##  [1]  2  1  5  6  8  9  4  3 10  7
fitRF <- randomForest(y ~ ., data = vowel.train)
varImp(fitRF)
##       Overall
## x.1  89.49678
## x.2  88.67203
## x.3  32.12628
## x.4  35.26271
## x.5  51.28104
## x.6  45.46902
## x.7  31.42451
## x.8  42.38164
## x.9  33.07339
## x.10 29.93491
order(varImp(fitRF), decreasing = T)
## Warning in xtfrm.data.frame(x): cannot xtfrm data frames
##  [1]  1  2  5  6  8  4  9  3  7 10