Quiz 1

For these quizzes we will be using several R packages. Package versions change over time; the right answers have been checked using the following versions of the packages.

If you aren't using these versions, your answers may not match the right answers exactly, but they should be close.
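
If you need to reproduce a specific package version, one option (a sketch, assuming the remotes package is available; it is not part of the original setup) is remotes::install_version:

## optional: pin a package to one of the versions listed below
## (assumes the remotes package is installed)
## install.packages("remotes")
remotes::install_version("caret", version = "6.0-80")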

Load the cell segmentation data from the AppliedPredictiveModeling package using the commands:

library(AppliedPredictiveModeling)
data(segmentationOriginal)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] caret_6.0-80                    ggplot2_2.2.1                  
## [3] lattice_0.20-35                 AppliedPredictiveModeling_1.1-7
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.19       lubridate_1.7.4    tidyr_0.8.1       
##  [4] class_7.3-14       assertthat_0.2.0   rprojroot_1.3-2   
##  [7] digest_0.6.18      ipred_0.9-7        foreach_1.4.4     
## [10] R6_2.2.2           plyr_1.8.4         backports_1.1.2   
## [13] magic_1.5-9        stats4_3.4.3       ellipse_0.4.1     
## [16] evaluate_0.11      pillar_1.3.0       rlang_0.2.2       
## [19] lazyeval_0.2.1     data.table_1.11.4  kernlab_0.9-27    
## [22] rpart_4.1-11       Matrix_1.2-12      rmarkdown_1.10    
## [25] splines_3.4.3      CVST_0.2-2         ddalpha_1.3.4     
## [28] gower_0.1.2        stringr_1.2.0      munsell_0.4.3     
## [31] broom_0.5.0        compiler_3.4.3     pkgconfig_2.0.2   
## [34] dimRed_0.1.0       htmltools_0.3.6    nnet_7.3-12       
## [37] tidyselect_0.2.5   prodlim_2018.04.18 tibble_1.4.2      
## [40] DRR_0.0.3          CORElearn_1.53.1   codetools_0.2-15  
## [43] RcppRoll_0.3.0     withr_2.1.2        crayon_1.3.4      
## [46] dplyr_0.7.6        MASS_7.3-47        recipes_0.1.3     
## [49] ModelMetrics_1.2.0 grid_3.4.3         nlme_3.1-131      
## [52] gtable_0.2.0       magrittr_1.5       scales_0.5.0      
## [55] stringi_1.1.6      reshape2_1.4.2     bindrcpp_0.2.2    
## [58] timeDate_3043.102  robustbase_0.93-3  geometry_0.3-6    
## [61] pls_2.7-0          lava_1.6.3         iterators_1.0.10  
## [64] tools_3.4.3        glue_1.3.0         DEoptimR_1.0-8    
## [67] purrr_0.2.5        sfsmisc_1.1-2      abind_1.4-5       
## [70] survival_2.41-3    yaml_2.1.15        colorspace_1.3-2  
## [73] cluster_2.0.6      knitr_1.20         bindr_0.1.1
packageDescription("AppliedPredictiveModeling")$Version
## [1] "1.1-7"
packageDescription("caret")$Version
## [1] "6.0-80"
packageDescription("ElemStatLearn")$Version
## [1] "2015.6.26"
packageDescription("pgmm")$Version
## [1] "1.2.2"

1. Subset the data to a training set and testing set based on the Case variable in the data set.

2. Set the seed to 125 and fit a CART model with the rpart method using all predictor variables and default caret settings.

3. In the final model, what would be the prediction for cases with the following variable values?

Answer: a. PS; b. WS; c. PS; d. Not possible to predict

head(colnames(segmentationOriginal), 10)
##  [1] "Cell"           "Case"           "Class"          "AngleCh1"      
##  [5] "AngleStatusCh1" "AreaCh1"        "AreaStatusCh1"  "AvgIntenCh1"   
##  [9] "AvgIntenCh2"    "AvgIntenCh3"
table(segmentationOriginal$Case)
## 
##  Test Train 
##  1010  1009
training <- subset(segmentationOriginal, Case=="Train")
testing <-  subset(segmentationOriginal, Case=="Test")
## Classification and Regression Trees
set.seed(125)
## drop the Cell identifier and Case columns (1 and 2) before fitting
modFit <- train(Class~., method="rpart", data=training[-c(1,2)])
print(modFit$finalModel)
## n= 1009 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1009 373 PS (0.63032706 0.36967294)  
##   2) TotalIntenCh2< 45323.5 454  34 PS (0.92511013 0.07488987) *
##   3) TotalIntenCh2>=45323.5 555 216 WS (0.38918919 0.61081081)  
##     6) FiberWidthCh1< 9.673245 154  47 PS (0.69480519 0.30519481) *
##     7) FiberWidthCh1>=9.673245 401 109 WS (0.27182045 0.72817955) *
##plot(modFit$finalModel, uniform=TRUE, main="Classification Tree")
##text(modFit$finalModel, use.n=TRUE, all=TRUE, cex=0.8)
library(rattle)
## Warning: package 'rattle' was built under R version 3.4.4
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(modFit$finalModel)

head(predict(modFit, newdata=testing), 20)
##  [1] PS PS WS WS PS PS PS WS PS PS WS WS PS PS PS WS WS PS WS PS
## Levels: PS WS
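
Reading the printed tree: each case is routed from the root by the two split variables, TotalIntenCh2 and FiberWidthCh1, so the answers can be read straight off the splits. A minimal sketch with hypothetical values (the quiz's exact values are not reproduced here); the remaining predictors are copied from the first training row and do not affect the path, because the fitted tree never splits on them:

## hypothetical cases, for illustration only
illus <- training[rep(1, 2), -c(1, 2)]
illus$TotalIntenCh2 <- c(23000, 50000)  # below / above the 45323.5 split
illus$FiberWidthCh1 <- c(10, 10)        # at or above the 9.673245 split
predict(modFit, newdata = illus)        # routes to PS (node 2) and WS (node 7)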

Quiz 2

If K is small in K-fold cross-validation, is the bias in the estimate of out-of-sample (test set) accuracy smaller or bigger? If K is small, is the variance in the estimate of out-of-sample (test set) accuracy smaller or bigger? Is K large or small in leave-one-out cross-validation?

With a small K, each model is trained on a smaller fraction of the data, so the bias is bigger; the variance of the accuracy estimate is smaller. In leave-one-out cross-validation K is large (K equals the number of observations).
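
In caret the number of folds is set through trainControl; a minimal sketch of the three settings (the data set and model here are arbitrary illustrations, not part of the quiz):

library(caret)
data(iris)
set.seed(1)
## K = 2: each model trains on only half the data, so the accuracy estimate
## is pessimistically biased, but it varies little between resamples
fit2 <- train(Species ~ ., data = iris, method = "rpart",
              trControl = trainControl(method = "cv", number = 2))
## K = 10: training folds are close to the full data set, so less bias, more variance
fit10 <- train(Species ~ ., data = iris, method = "rpart",
               trControl = trainControl(method = "cv", number = 10))
## leave-one-out cross-validation is the K = n extreme
fitLOO <- train(Species ~ ., data = iris, method = "rpart",
                trControl = trainControl(method = "LOOCV"))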

Quiz 3

(NOTE: If you have trouble installing the pgmm package, you can download the olive dataset here: olive_data.zip. After unzipping the archive, you can load the file using the load() function in R.)

These data contain information on 572 different Italian olive oils from multiple regions in Italy. Fit a classification tree where Area is the outcome variable. Then predict the value of Area for the following data frame using the tree command with all defaults.

library(pgmm)
data(olive)
olive = olive[,-1]
newdata = as.data.frame(t(colMeans(olive)))

What is the resulting prediction? Is the resulting prediction strange? Why or why not?

head(olive)
##   Area Palmitic Palmitoleic Stearic Oleic Linoleic Linolenic Arachidic
## 1    1     1075          75     226  7823      672        36        60
## 2    1     1088          73     224  7709      781        31        61
## 3    1      911          54     246  8113      549        31        63
## 4    1      966          57     240  7952      619        50        78
## 5    1     1051          67     259  7771      672        50        80
## 6    1      911          49     268  7924      678        51        70
##   Eicosenoic
## 1         29
## 2         29
## 3         29
## 4         35
## 5         46
## 6         44
table(olive$Area)
## 
##   1   2   3   4   5   6   7   8   9 
##  25  56 206  36  65  33  50  50  51
dim(olive)
## [1] 572   9
## Area is numeric here, so train() fits a regression tree
modFit <- train(Area ~ ., method = "rpart", data = olive)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
predict(modFit, newdata = newdata)
##        1 
## 2.783282
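
The prediction is 2.783. It is strange: Area is a region code, i.e. a qualitative variable, but because it is stored as numeric, train() fit a regression tree, and the leaf reports an average of the region codes rather than one of the nine regions. A sketch of the classification version (coercing Area to a factor, which is not what the quiz asks for):

## not the quiz's setup: coerce Area to a factor so rpart classifies instead
oliveF <- olive
oliveF$Area <- as.factor(oliveF$Area)
modFitF <- train(Area ~ ., method = "rpart", data = oliveF)
predict(modFitF, newdata = newdata)  # now returns one of the nine region labels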

Quiz 4

Load the South Africa Heart Disease Data and create training and test sets with the following code:

library(ElemStatLearn)
data(SAheart)
set.seed(8484)
train = sample(1:dim(SAheart)[1],size=dim(SAheart)[1]/2,replace=F)
trainSA = SAheart[train,]
testSA = SAheart[-train,]

Then set the seed to 13234 and fit a logistic regression model (method=“glm”, be sure to specify family=“binomial”) with Coronary Heart Disease (chd) as the outcome and age at onset, current alcohol consumption, obesity levels, cumulative tobacco, type-A behavior, and low-density lipoprotein cholesterol as predictors. Calculate the misclassification rate for your model using this function and a prediction on the “response” scale:

missClass = function(values,prediction){sum(((prediction > 0.5)*1) != values)/length(values)}
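
The same function, unpacked with comments:

missClass <- function(values, prediction) {
  ## threshold the predicted probabilities at 0.5 to get 0/1 class calls
  classCall <- (prediction > 0.5) * 1
  ## fraction of calls that disagree with the observed 0/1 outcomes
  sum(classCall != values) / length(values)
}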

What is the misclassification rate on the training set? What is the misclassification rate on the test set?

set.seed(13234)
colnames(trainSA)
##  [1] "sbp"       "tobacco"   "ldl"       "adiposity" "famhist"  
##  [6] "typea"     "obesity"   "alcohol"   "age"       "chd"
table(trainSA$chd)
## 
##   0   1 
## 147  84
modFit <- train(chd ~ age + alcohol + obesity + tobacco + typea + ldl, 
                method="glm", family="binomial", data=trainSA)
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to
## do classification? If so, use a 2 level factor as your outcome column.
## caret returns the glm predictions on the response (probability) scale,
## as the output below shows
prediction <- predict(modFit, testSA)
head(prediction)
##         1         2         4         8         9        11 
## 0.6559293 0.4784618 0.5232072 0.7286715 0.1132549 0.6652056
missClass(trainSA$chd, predict(modFit, trainSA)) ##0.2727273
## [1] 0.2727273
missClass(testSA$chd, prediction) ##0.3116883
## [1] 0.3116883
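
For reference, an equivalent fit with base glm (a sketch; it avoids caret's two-level-outcome warning and makes the “response” scale explicit):

## the same model without caret
glmFit <- glm(chd ~ age + alcohol + obesity + tobacco + typea + ldl,
              family = binomial, data = trainSA)
missClass(trainSA$chd, predict(glmFit, trainSA, type = "response"))
missClass(testSA$chd, predict(glmFit, testSA, type = "response"))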

Quiz 5

Load the vowel.train and vowel.test data sets:

library(ElemStatLearn)
data(vowel.train)
data(vowel.test)

Set the variable y to be a factor variable in both the training and test sets. Then set the seed to 33833. Fit a random forest predictor relating the factor variable y to the remaining variables. Read about variable importance in random forests here: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr. By default, the caret package uses the Gini importance.

Calculate the variable importance using the varImp function in the caret package. What is the order of variable importance?

[NOTE: Use randomForest() specifically, not caret, as there have been some issues reported with that approach. 11/6/2016]

set.seed(33833)
head(vowel.train)
##   y    x.1   x.2    x.3   x.4    x.5   x.6    x.7    x.8    x.9   x.10
## 1 1 -3.639 0.418 -0.670 1.779 -0.168 1.627 -0.388  0.529 -0.874 -0.814
## 2 2 -3.327 0.496 -0.694 1.365 -0.265 1.933 -0.363  0.510 -0.621 -0.488
## 3 3 -2.120 0.894 -1.576 0.147 -0.707 1.559 -0.579  0.676 -0.809 -0.049
## 4 4 -2.287 1.809 -1.498 1.012 -1.053 1.060 -0.567  0.235 -0.091 -0.795
## 5 5 -2.598 1.938 -0.846 1.062 -1.633 0.764  0.394 -0.150  0.277 -0.396
## 6 6 -2.852 1.914 -0.755 0.825 -1.588 0.855  0.217 -0.246  0.238 -0.365
class(vowel.train$y)
## [1] "integer"
vowel.train$y <- as.factor(vowel.train$y)
vowel.test$y <- as.factor(vowel.test$y)  # the quiz asks for both sets
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.4.4
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
## 
##     importance
## The following object is masked from 'package:ggplot2':
## 
##     margin
modFit <- randomForest(y ~ ., data=vowel.train, importance=TRUE)
imp <- varImp(modFit); imp
##             1        2        3        4        5        6        7
## x.1  29.11895 29.44922 33.18261 48.42164 42.93955 35.09701 35.19359
## x.2  54.80527 42.99639 37.58728 31.51004 32.84094 24.78946 28.79998
## x.3  15.87398 15.59555 19.74431 19.93527 13.92606 14.86718 16.14087
## x.4  18.54148 16.73341 19.41532 20.31802 18.89242 18.34816 14.87841
## x.5  21.58627 26.27623 28.69133 27.81236 30.09398 21.83958 24.61554
## x.6  23.13863 27.33672 22.52596 27.69695 17.23596 23.47240 22.94971
## x.7  15.97967 18.02095 16.66341 13.52179 17.73142 15.97313 17.11697
## x.8  17.83590 20.32252 19.31248 23.83211 16.76017 19.56210 22.32204
## x.9  12.67408 17.92742 16.27996 16.02322 12.98966 15.46911 16.18884
## x.10 15.41072 14.56767 16.84268 15.16088 14.82269 11.24023 15.44459
##             8        9       10       11
## x.1  44.18992 43.62221 52.36179 40.32181
## x.2  33.13660 27.27830 30.81691 36.33539
## x.3  15.45219 18.05612 18.19880 17.66303
## x.4  22.11390 19.67952 19.24777 19.18863
## x.5  33.57581 24.45729 27.33320 26.67657
## x.6  22.81997 21.13448 21.65237 25.87921
## x.7  12.75196 10.53571 16.09083 15.85205
## x.8  24.26107 25.56857 21.55164 21.04051
## x.9  16.93566 14.77635 19.09838 18.87731
## x.10 14.29647 13.12904 12.48161 16.67645
## note: this orders by the class-1 importance column, not the overall importance
rownames(imp[order(imp[, 1], decreasing=TRUE), ])
##  [1] "x.2"  "x.1"  "x.6"  "x.5"  "x.4"  "x.8"  "x.7"  "x.3"  "x.10" "x.9"
imp <- importance(modFit); imp
##             1        2        3        4        5        6        7
## x.1  29.11895 29.44922 33.18261 48.42164 42.93955 35.09701 35.19359
## x.2  54.80527 42.99639 37.58728 31.51004 32.84094 24.78946 28.79998
## x.3  15.87398 15.59555 19.74431 19.93527 13.92606 14.86718 16.14087
## x.4  18.54148 16.73341 19.41532 20.31802 18.89242 18.34816 14.87841
## x.5  21.58627 26.27623 28.69133 27.81236 30.09398 21.83958 24.61554
## x.6  23.13863 27.33672 22.52596 27.69695 17.23596 23.47240 22.94971
## x.7  15.97967 18.02095 16.66341 13.52179 17.73142 15.97313 17.11697
## x.8  17.83590 20.32252 19.31248 23.83211 16.76017 19.56210 22.32204
## x.9  12.67408 17.92742 16.27996 16.02322 12.98966 15.46911 16.18884
## x.10 15.41072 14.56767 16.84268 15.16088 14.82269 11.24023 15.44459
##             8        9       10       11 MeanDecreaseAccuracy
## x.1  44.18992 43.62221 52.36179 40.32181             74.54433
## x.2  33.13660 27.27830 30.81691 36.33539             66.48362
## x.3  15.45219 18.05612 18.19880 17.66303             36.98281
## x.4  22.11390 19.67952 19.24777 19.18863             35.11464
## x.5  33.57581 24.45729 27.33320 26.67657             48.98575
## x.6  22.81997 21.13448 21.65237 25.87921             43.85545
## x.7  12.75196 10.53571 16.09083 15.85205             37.69351
## x.8  24.26107 25.56857 21.55164 21.04051             39.89072
## x.9  16.93566 14.77635 19.09838 18.87731             36.51970
## x.10 14.29647 13.12904 12.48161 16.67645             37.95938
##      MeanDecreaseGini
## x.1          90.53672
## x.2          88.32345
## x.3          32.51581
## x.4          35.77079
## x.5          51.22808
## x.6          44.87199
## x.7          30.96095
## x.8          42.92778
## x.9          32.48635
## x.10         29.43556
rownames(imp[order(imp[, 1], decreasing=TRUE), ])
##  [1] "x.2"  "x.1"  "x.6"  "x.5"  "x.4"  "x.8"  "x.7"  "x.3"  "x.10" "x.9"
## alternative option: without importance=TRUE, randomForest() records only
## the Gini importance, so varImp() returns a single Overall column
modFit <- randomForest(y ~ ., data=vowel.train)
imp <- varImp(modFit); imp
##       Overall
## x.1  86.52665
## x.2  88.49897
## x.3  32.62066
## x.4  36.21606
## x.5  51.69616
## x.6  46.07759
## x.7  31.00170
## x.8  43.14226
## x.9  32.83070
## x.10 30.50637
order(imp[, 1], decreasing=TRUE)
##  [1]  2  1  5  6  8  4  9  3  7 10
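
order() returns column indices; mapping them back to variable names gives the quiz's ordering, x.2, x.1, x.5, x.6, x.8, x.4, x.9, x.3, x.7, x.10:

rownames(imp)[order(imp[, 1], decreasing=TRUE)]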