Coursera: Machine Learning Quiz 3

  1. Load the cell segmentation data from the AppliedPredictiveModeling package using the commands:
library(rpart.plot)
library(AppliedPredictiveModeling)
data(segmentationOriginal)
library(caret)
set.seed(125)


training = subset(segmentationOriginal, Case=="Train")
testing = subset(segmentationOriginal, Case=="Test")
crtmd <- train(Class~., data= training, method= "rpart")
rpart.plot(crtmd$finalModel, type=4, extra=104)

crtmd$finalModel
## n= 1009 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1009 373 PS (0.63032706 0.36967294)  
##   2) TotalIntenCh2< 45323.5 454  34 PS (0.92511013 0.07488987) *
##   3) TotalIntenCh2>=45323.5 555 216 WS (0.38918919 0.61081081)  
##     6) FiberWidthCh1< 9.673245 154  47 PS (0.69480519 0.30519481) *
##     7) FiberWidthCh1>=9.673245 401 109 WS (0.27182045 0.72817955) *

Predict the classification for the following cases:

a) TotalIntenCh2 = 23,000; FiberWidthCh1 = 10; PerimStatusCh1 = 2
b) TotalIntenCh2 = 50,000; FiberWidthCh1 = 10; VarIntenCh4 = 100
c) TotalIntenCh2 = 57,000; FiberWidthCh1 = 8; VarIntenCh4 = 100
d) FiberWidthCh1 = 8; VarIntenCh4 = 100; PerimStatusCh1 = 2

Answer:

a) PS, b) WS, c) PS, d) Not possible to predict
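
predict() cannot be applied directly to these cases because they are missing most of the predictors the model was trained on, so the answers come from walking the printed splits by hand. A minimal sketch of that manual walk, using only the two split variables that appear in the output above:

# Manual walk of the fitted tree; the only splits used are
# TotalIntenCh2 < 45323.5 and FiberWidthCh1 < 9.673245.
classify <- function(TotalIntenCh2 = NA, FiberWidthCh1 = NA) {
  if (is.na(TotalIntenCh2)) return("Not possible to predict")  # root split needs TotalIntenCh2
  if (TotalIntenCh2 < 45323.5) return("PS")                    # node 2, terminal
  if (is.na(FiberWidthCh1)) return("Not possible to predict")
  if (FiberWidthCh1 < 9.673245) return("PS")                   # node 6, terminal
  "WS"                                                         # node 7, terminal
}
classify(TotalIntenCh2 = 23000, FiberWidthCh1 = 10)  # a) PS
classify(TotalIntenCh2 = 50000, FiberWidthCh1 = 10)  # b) WS
classify(TotalIntenCh2 = 57000, FiberWidthCh1 = 8)   # c) PS
classify(FiberWidthCh1 = 8)                          # d) Not possible to predict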

2. If K is small in K-fold cross-validation, is the bias in the estimate of out-of-sample (test set) accuracy smaller or bigger? If K is small, is the variance in the estimate of out-of-sample (test set) accuracy smaller or bigger? Is K large or small in leave-one-out cross-validation?

If K is small, the bias in the estimate of out-of-sample accuracy is bigger, because each model is trained on a smaller fraction of the data. The variance in the estimate of out-of-sample accuracy is smaller. In leave-one-out cross-validation (LOOCV), K is large: it equals the number of observations.
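
In caret, the choice of K is just a resampling setting in trainControl; a small illustrative sketch of the two extremes (3-fold CV here is an arbitrary small K chosen for the example, not something the quiz asks for):

# Small K: each fold trains on a smaller share of the data,
# so the accuracy estimate is more biased but less variable.
ctrl_small_k <- trainControl(method = "cv", number = 3)
# LOOCV: K equals the number of rows, giving lower bias but higher variance
# (and a much larger computational cost).
ctrl_loocv <- trainControl(method = "LOOCV")
# e.g. train(Class ~ ., data = training, method = "rpart", trControl = ctrl_small_k)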

3. These data contain information on 572 different Italian olive oils from multiple regions in Italy. Fit a classification tree where Area is the outcome variable. Then predict the value of Area for the following data frame using the tree command with all defaults:

library(pgmm)
data(olive)
olive = olive[,-1]
mdl <- train(Area~., data= olive, method= "rpart")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
nd = as.data.frame(t(colMeans(olive)))   # new data: column means of the olive data
pred = predict(mdl, nd)                  # predicted value: 2.783

Answer: 2.783. This is strange because Area should be a qualitative variable, but the tree reports the average value of Area as a numeric variable in the leaf predicted for newdata.
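
If the goal were a genuine classification, Area would have to be converted to a factor before fitting; a hedged sketch of that variant (not part of the graded answer; the data are copied so the quiz code above is left untouched):

olive_f <- olive                          # work on a copy of the olive data
olive_f$Area <- as.factor(olive_f$Area)   # treat Area as a categorical outcome
mdl_factor <- train(Area ~ ., data = olive_f, method = "rpart")
predict(mdl_factor, newdata = nd)         # now returns a class label, not an average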

4. Load the South Africa Heart Disease Data and create training and test sets with the following code. Then set the seed to 13234 and fit a logistic regression model (method=“glm”, be sure to specify family=“binomial”) with Coronary Heart Disease (chd) as the outcome and age at onset, current alcohol consumption, obesity levels, cumulative tobacco, type-A behavior, and low density lipoprotein cholesterol as predictors. Calculate the misclassification rate for your model using this function and a prediction on the “response” scale:

library(ElemStatLearn)
data(SAheart)
set.seed(8484)
train = sample(1:dim(SAheart)[1],size=dim(SAheart)[1]/2,replace=F)
SAheart$chd= as.factor(SAheart$chd)
trainSA = SAheart[train,]
testSA = SAheart[-train,]
set.seed(13234)   # the question asks for this seed before fitting the model
mdl <- glm(chd ~ age + alcohol + obesity + tobacco + typea + ldl, data = trainSA, family = 'binomial')

pred = predict(mdl, newdata=testSA, type ='response')
predt = ifelse(pred > 0.5,1 ,0)

# misclassification rate on the test set
conf_test <- table(predt, testSA$chd)
round(1 - sum(diag(conf_test)) / sum(conf_test), 2)
## [1] 0.31
# misclassification rate on the training set
train_pred <- predict(mdl, newdata = trainSA, type = 'response')
train_predt <- ifelse(train_pred > 0.5, 1, 0)
conf_train <- table(train_predt, trainSA$chd)
round(1 - sum(diag(conf_train)) / sum(conf_train), 2)
## [1] 0.27
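
The quiz describes this step as applying a misclassification helper to response-scale predictions; that helper is not reproduced above, so the version below is an assumed reconstruction that gives the same rates as the confusion tables:

# Assumed misclassification helper: fraction of 0.5-thresholded predictions
# that disagree with the observed 0/1 outcome.
missClass <- function(values, prediction) {
  sum(((prediction > 0.5) * 1) != values) / length(values)
}
missClass(as.numeric(as.character(testSA$chd)), pred)         # test error, ~0.31
missClass(as.numeric(as.character(trainSA$chd)), train_pred)  # train error, ~0.27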

5. Set the variable y to be a factor variable in both the training and test set. Then set the seed to 33833. Fit a random forest predictor relating the factor variable y to the remaining variables. Read about variable importance in random forests here: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr The caret package uses the Gini importance by default.

Calculate the variable importance using the varImp function in the caret package. What is the order of variable importance?

library(randomForest)
data(vowel.train)
data(vowel.test)
set.seed(33833)
vowel.train$y = as.factor(vowel.train$y)
vowel.test$y = as.factor(vowel.test$y)   # the question asks for y to be a factor in both sets
mdl <- randomForest(y~., data= vowel.train)
varImp(mdl)
##       Overall
## x.1  89.12864
## x.2  91.24009
## x.3  33.08111
## x.4  34.24433
## x.5  50.25539
## x.6  43.33148
## x.7  31.88132
## x.8  42.92470
## x.9  33.37031
## x.10 29.59956

Answer: x.2, x.1, x.5, x.6, x.8, x.4, x.9, x.3, x.7, x.10
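
The question points at caret's varImp and its default Gini-based importance; the same ranking can also be checked by fitting the forest through caret's train interface. A minimal sketch (this refits the model with resampling, so it is slower, and the importance values, which caret rescales, may differ from the raw numbers above):

set.seed(33833)
mdl_caret <- train(y ~ ., data = vowel.train, method = "rf")   # random forest via caret
varImp(mdl_caret)   # x.2 and x.1 should again rank highest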