library(AppliedPredictiveModeling)
data(segmentationOriginal)
library(caret)
library(randomForest)

1.

Create non-overlapping training and test sets with about 50% of the observations assigned to each.

inTrain <- segmentationOriginal$Case == "Train"
training <- segmentationOriginal[inTrain,]
testing <- segmentationOriginal[!inTrain,]
set.seed(125)
CART <- train(Class~., method="rpart", data=training)
print(CART$finalModel)
## n= 1009 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1009 373 PS (0.63032706 0.36967294)  
##   2) TotalIntenCh2< 45323.5 454  34 PS (0.92511013 0.07488987) *
##   3) TotalIntenCh2>=45323.5 555 216 WS (0.38918919 0.61081081)  
##     6) FiberWidthCh1< 9.673245 154  47 PS (0.69480519 0.30519481) *
##     7) FiberWidthCh1>=9.673245 401 109 WS (0.27182045 0.72817955) *
suppressMessages(library(rattle))
## Warning: package 'rattle' was built under R version 3.4.3
fancyRpartPlot(CART$finalModel)

2.

If K is small in a K-fold cross validation is the bias in the estimate of out-of-sample (test set) accuracy smaller or bigger? If K is small is the variance in the estimate of out-of-sample (test set) accuracy smaller or bigger. Is K large or small in leave one out cross validation? The bias is larger and the variance is smaller. Under leave one out cross validation K is equal to the sample size.

3.

(NOTE: If you have trouble installing the pgmm package, you can download the -code-olive-/code- dataset here: olive_data.zip. After unzipping the archive, you can load the file using the -code-load()-/code- function in R.)

These data contain information on 572 different Italian olive oils from multiple regions in Italy. Fit a classification tree where Area is the outcome variable. Then predict the value of area for the following data frame using the tree command with all defaults

load("olive.rda")
olive = olive[,-1]
newdata = as.data.frame(t(colMeans(olive)))
olivemod <- train(Area~.,method="rpart", data=olive)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
predict(olivemod, newdata=newdata)
##        1 
## 2.783282

4.

library(ElemStatLearn)
## Warning: package 'ElemStatLearn' was built under R version 3.4.3
data(SAheart)
set.seed(8484)
train = sample(1:dim(SAheart)[1],size=dim(SAheart)[1]/2,replace=F)
trainSA = SAheart[train,]
testSA = SAheart[-train,]

Then set the seed to 13234 and fit a logistic regression model (method=“glm”, be sure to specify family=“binomial”) with Coronary Heart Disease (chd) as the outcome and age at onset, current alcohol consumption, obesity levels, cumulative tabacco, type-A behavior, and low density lipoprotein cholesterol as predictors. Calculate the misclassification rate for your model using this function and a prediction on the “response” scale:

missClass = function(values,prediction){sum(((prediction > 0.5)*1) != values)/length(values)}

What is the misclassification rate on the training set? What is the misclassification rate on the test set?

set.seed(13234)
mdl4 <- train(chd~age+alcohol+obesity+tobacco+typea+ldl,method="glm", family="binomial", data=trainSA) 
## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to
## do classification? If so, use a 2 level factor as your outcome column.
#notice: data set is trainSA, not all data
missClass(trainSA$chd, predict(mdl4,trainSA)) #don't need: trainSA[,-10]
## [1] 0.2727273
missClass(testSA$chd, predict(mdl4,testSA))
## [1] 0.3116883
library(ElemStatLearn)
data(vowel.train)
data(vowel.test)

Set the variable y to be a factor variable in both the training and test set. Then set the seed to 33833. Fit a random forest predictor relating the factor variable y to the remaining variables. Read about variable importance in random forests here: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr The caret package uses by default the Gini importance.

Calculate the variable importance using the varImp function in the caret package. What is the order of variable importance?

[NOTE: Use randomForest() specifically, not caret, as there’s been some issues reported with that approach. 11/6/2016]

set.seed(33833)
vowel.train$y <- factor(vowel.train$y)
vowel.test$y <- factor(vowel.test$y)
mdl5 <- randomForest(y~., data=vowel.train) 
order(varImp(mdl5),decreasing = TRUE)
##  [1]  2  1  5  6  8  4  9  3  7 10