PratMachLearnQuiz3

Question 1 - For this quiz we will be using several R packages. R package versions change over time; the right answers have been checked using the following versions of the packages.

AppliedPredictiveModeling: v1.1.6
caret: v6.0.47 (used: 6.0.76)
ElemStatLearn: v2012.04-0 (used: 2015.6.26)
pgmm: v1.1 (used: 1.2)
rpart: v4.1.8 (used: 4.1.10)

If you aren’t using these versions of the packages, your answers may not exactly match the right answer, but hopefully should be close.

Load the cell segmentation data from the AppliedPredictiveModeling package using the commands:

library(AppliedPredictiveModeling)  
data(segmentationOriginal)    
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

#names(segmentationOriginal)

1. Subset the data to a training set and testing set based on the Case variable in the data set.

# split() partitions the data by the Case variable, giving Train and Test sets with about 50% of the rows each.
subset<-split(segmentationOriginal, segmentationOriginal$Case)  
dim(subset$Train)

## [1] 1009  119

dim(subset$Test)

## [1] 1010  119
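As a minimal sketch of what split() returns, the toy data frame below (made-up values, not the segmentation data) is partitioned by a Case column. The result is a named list with one data frame per level, which is why the Train and Test pieces can be reached with the $ operator.

```r
# Toy data frame standing in for segmentationOriginal (values are made up).
toy <- data.frame(
  Case  = c("Train", "Test", "Train", "Test", "Train"),
  Value = c(1.2, 3.4, 5.6, 7.8, 9.0)
)

# split() returns a named list with one data frame per level of Case.
parts <- split(toy, toy$Case)

names(parts)       # "Test" "Train" (levels are sorted alphabetically)
nrow(parts$Train)  # 3
nrow(parts$Test)   # 2
```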

2. Set the seed to 125 and fit a CART model with the rpart method using all predictor variables and default caret settings.

set.seed(125)
modFit<- train(Class ~ ., method="rpart", data=subset$Train)

## Loading required package: rpart

modFit$finalModel

## n= 1009 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1009 373 PS (0.63032706 0.36967294)  
##   2) TotalIntenCh2< 45323.5 454  34 PS (0.92511013 0.07488987) *
##   3) TotalIntenCh2>=45323.5 555 216 WS (0.38918919 0.61081081)  
##     6) FiberWidthCh1< 9.673245 154  47 PS (0.69480519 0.30519481) *
##     7) FiberWidthCh1>=9.673245 401 109 WS (0.27182045 0.72817955) *

library(rattle)

## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

fancyRpartPlot(modFit$finalModel)

3. In the final model what would be the final model prediction for cases with the following variable values:

a. TotalIntenCh2 = 23,000; FiberWidthCh1 = 10; PerimStatusCh1 = 2

b. TotalIntenCh2 = 50,000; FiberWidthCh1 = 10; VarIntenCh4 = 100

c. TotalIntenCh2 = 57,000; FiberWidthCh1 = 8; VarIntenCh4 = 100

d. FiberWidthCh1 = 8; VarIntenCh4 = 100; PerimStatusCh1 = 2

a. PS

b. WS

c. PS

d. Not possible to predict

a. PS; b. Not possible to predict; c. PS; d. WS
a. PS; b. WS; c. PS; d. WS
a. Not possible to predict; b. WS; c. PS; d. PS
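The answers to part 3 can be checked by walking the printed tree by hand. The sketch below hard-codes the two splits and their thresholds from the modFit$finalModel output; predict_by_hand is a hypothetical helper written for illustration. It is not equivalent to calling predict(), since rpart can fall back on surrogate splits when a value is missing.

```r
# Hand-coded version of the printed tree (thresholds copied from the
# modFit$finalModel output). predict_by_hand is a hypothetical helper.
predict_by_hand <- function(TotalIntenCh2 = NA, FiberWidthCh1 = NA) {
  # The root split needs TotalIntenCh2; without it the tree cannot be walked.
  if (is.na(TotalIntenCh2)) return("Not possible to predict")
  if (TotalIntenCh2 < 45323.5) return("PS")          # node 2
  # The second split needs FiberWidthCh1.
  if (is.na(FiberWidthCh1)) return("Not possible to predict")
  if (FiberWidthCh1 < 9.673245) "PS" else "WS"       # nodes 6 and 7
}

predict_by_hand(TotalIntenCh2 = 23000, FiberWidthCh1 = 10)  # a: "PS"
predict_by_hand(TotalIntenCh2 = 50000, FiberWidthCh1 = 10)  # b: "WS"
predict_by_hand(TotalIntenCh2 = 57000, FiberWidthCh1 = 8)   # c: "PS"
predict_by_hand(FiberWidthCh1 = 8)                          # d: "Not possible to predict"
```

PerimStatusCh1 and VarIntenCh4 do not appear in any split of this tree, so they play no part in the result.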

Question 2 - If K is small in a K-fold cross validation, is the bias in the estimate of out-of-sample (test set) accuracy smaller or bigger?

If K is small, is the variance in the estimate of out-of-sample (test set) accuracy smaller or bigger?

Is K large or small in leave one out cross validation?

The bias is smaller and the variance is smaller. Under leave one out cross validation K is equal to one.
The bias is larger and the variance is smaller. Under leave one out cross validation K is equal to the sample size.
The bias is smaller and the variance is smaller. Under leave one out cross validation K is equal to the sample size.
The bias is smaller and the variance is bigger. Under leave one out cross validation K is equal to one.
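One way to see the link between K and the size of each training set is to build the folds directly. In the sketch below, make_folds is a hypothetical helper and n = 100 is an arbitrary sample size. With small K each model is fitted on a smaller share of the rows, (K-1)/K of the data, which is the usual explanation for the larger bias. Leave one out cross validation is the case where K equals the sample size.

```r
# make_folds is a hypothetical helper: it assigns each of n rows to one of k folds.
make_folds <- function(n, k) {
  sample(rep(seq_len(k), length.out = n))
}

n <- 100
set.seed(1)

for (k in c(2, 10, n)) {
  folds <- make_folds(n, k)
  held_out <- max(table(folds))   # size of the largest held-out fold
  train_size <- n - held_out      # rows left for fitting in that round
  cat("K =", k, " training rows per round:", train_size, "\n")
}
# K = 2 leaves 50 rows for training, K = 10 leaves 90, K = n leaves 99.
```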

Question 3 - Load the olive oil data using the commands:

library(pgmm)
data(olive)
olive = olive[,-1]

(NOTE: If you have trouble installing the pgmm package, you can download the olive dataset here: olive_data.zip. After unzipping the archive, you can load the file using the load() function in R.)

These data contain information on 572 different Italian olive oils from multiple regions in Italy.

Fit a classification tree where Area is the outcome variable. Then predict the value of Area for the following data frame using the tree command with all defaults:

newdata = as.data.frame(t(colMeans(olive)))  
modOlive<-rpart(Area ~., data=olive)
modOlive

## n= 572 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 572 3171.32000 4.599650  
##    2) Eicosenoic>=6.5 323  176.82970 2.783282  
##      4) Oleic>=7770.5 19   16.10526 1.315789 *
##      5) Oleic< 7770.5 304  117.25000 2.875000 *
##    3) Eicosenoic< 6.5 249  546.51410 6.955823  
##      6) Linoleic>=1053.5 98   21.88776 5.336735 *
##      7) Linoleic< 1053.5 151  100.99340 8.006623  
##       14) Oleic< 7895 95   23.72632 7.515789 *
##       15) Oleic>=7895 56   15.55357 8.839286 *

predict(modOlive, newdata)

##     1 
## 2.875

What is the resulting prediction? Is the resulting prediction strange? Why or why not?

0.005291005 0 0.994709 0 0 0 0 0 0. There is no reason why the result is strange.
0.005291005 0 0.994709 0 0 0 0 0 0. The result is strange because Area is a numeric variable and we should get the average within each leaf.
4.59965. There is no reason why the result is strange.
2.783. It is strange because Area should be a qualitative variable - but tree is reporting the average value of Area as a numeric variable in the leaf predicted for newdata
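The reason the prediction is a leaf average can be reproduced on any data set whose class labels are stored as numbers. The sketch below uses the built-in iris data rather than olive, and assumes the rpart package is installed. rpart() picks a regression tree ("anova") for a numeric outcome and a classification tree ("class") for a factor. Note that the run above used rpart() rather than tree(), which may be why its leaf average (2.875) differs from the 2.783 quoted in the answer option.

```r
library(rpart)

# Copy of iris with the class stored as a number, mimicking a numeric Area.
dat_num <- iris
dat_num$Label <- as.integer(dat_num$Species)
dat_num$Species <- NULL

fit_num <- rpart(Label ~ ., data = dat_num)   # numeric outcome
fit_num$method                                # "anova": a regression tree

# Same data with the outcome converted to a factor.
dat_fac <- dat_num
dat_fac$Label <- as.factor(dat_fac$Label)

fit_fac <- rpart(Label ~ ., data = dat_fac)   # factor outcome
fit_fac$method                                # "class": a classification tree

# Regression predictions are leaf averages; classification can return labels.
head(predict(fit_num, dat_num))
head(predict(fit_fac, dat_fac, type = "class"))
```

Converting Area with as.factor() before fitting would be the analogous change for the olive data.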

Question 4 - Load the South Africa Heart Disease Data and create training and test sets with the following code:

library(ElemStatLearn)
data(SAheart)
set.seed(8484)
train = sample(1:dim(SAheart)[1],size=dim(SAheart)[1]/2,replace=F)
trainSA = SAheart[train,]
testSA = SAheart[-train,]

Then set the seed to 13234 and fit a logistic regression model (method=“glm”, be sure to specify family=“binomial”) with Coronary Heart Disease (chd) as the outcome and age at onset, current alcohol consumption, obesity levels, cumulative tobacco, type-A behavior, and low density lipoprotein cholesterol as predictors.

Calculate the misclassification rate for your model using this function and a prediction on the “response” scale:

missClass = function(values,prediction){sum(((prediction > 0.5)*1) != values)/length(values)}

set.seed(13234)
fit<-train(chd ~ age+alcohol+obesity+tobacco+typea+ldl, data=trainSA, method="glm", family="binomial")

## Warning in train.default(x, y, weights = w, ...): You are trying to do
## regression and your outcome only has two possible values Are you trying to
## do classification? If so, use a 2 level factor as your outcome column.

predictTrainSA<-predict(fit)
missClass(trainSA$chd, predictTrainSA)

## [1] 0.2727273

predictTestSA<-predict(fit,testSA)
missClass(testSA$chd,predictTestSA)

## [1] 0.3116883

What is the misclassification rate on the training set? What is the misclassification on the test set?

Test Set Misclassification: 0.43
Training Set: 0.31
Test Set Misclassification: 0.27
Training Set: 0.31
Test Set Misclassification: 0.35
Training Set: 0.31
Test Set Misclassification: 0.31
Training Set: 0.27
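To make the misclassification calculation concrete, the sketch below applies missClass first to a few made-up probabilities and then to a base-R logistic regression. It uses glm() on the built-in mtcars data as a stand-in, not caret or SAheart, so the resulting rate is unrelated to the quiz answer.

```r
# missClass as defined in the quiz: the share of thresholded predictions
# that differ from the true 0/1 values.
missClass <- function(values, prediction) {
  sum(((prediction > 0.5) * 1) != values) / length(values)
}

# Toy check with made-up probabilities: only the third case is misclassified.
missClass(c(0, 1, 1, 0), c(0.2, 0.7, 0.4, 0.1))  # 0.25

# Base-R logistic regression on the built-in mtcars data (not SAheart).
fit <- glm(am ~ wt, data = mtcars, family = "binomial")
probs <- predict(fit, type = "response")   # probabilities between 0 and 1
rate <- missClass(mtcars$am, probs)
rate
```

The type = "response" argument matters here: without it predict() on a glm object returns values on the log-odds scale, and the 0.5 threshold would no longer mean a probability of one half.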

Question 5 - Load the vowel.train and vowel.test data sets:

library(ElemStatLearn)
data(vowel.train)
data(vowel.test)

Set the variable y to be a factor variable in both the training and test set.

vowel.train$y<-as.factor(vowel.train$y)
vowel.test$y<-as.factor(vowel.test$y)

Then set the seed to 33833.

set.seed(33833)
library(caret)
library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:rattle':
## 
##     importance

## The following object is masked from 'package:ggplot2':
## 
##     margin

Fit a random forest predictor relating the factor variable y to the remaining variables.

Read about variable importance in random forests here: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr

The caret package uses by default the Gini importance.

Calculate the variable importance using the varImp function in the caret package.

vowelModel<-randomForest(y ~ ., data=vowel.train)
order(varImp(vowelModel)$Overall, decreasing=TRUE)

##  [1]  2  1  5  6  8  4  9  3  7 10

What is the order of variable importance?

[NOTE: Use randomForest() specifically, not caret, as there have been some issues reported with that approach. 11/6/2016]

The order of the variables is:

x.2, x.1, x.5, x.6, x.8, x.4, x.9, x.3, x.7, x.10
x.10, x.7, x.5, x.6, x.8, x.4, x.9, x.3, x.1, x.2
x.10, x.7, x.9, x.5, x.8, x.4, x.6, x.3, x.1, x.2
x.1, x.2, x.3, x.8, x.6, x.4, x.5, x.9, x.7, x.10
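The order() call returns row positions, which then have to be matched to variable names. The sketch below shows that mapping on a made-up importance table. It assumes the shape varImp() gives for a randomForest fit, a data frame with one row per predictor and an Overall column. The scores are invented for illustration.

```r
# Made-up importance scores: one row per predictor, one Overall column.
imp <- data.frame(
  Overall = c(10, 30, 20),
  row.names = c("x.1", "x.2", "x.3")
)

# order() gives row positions; indexing rownames() turns them into names.
ranked <- rownames(imp)[order(imp$Overall, decreasing = TRUE)]
ranked  # "x.2" "x.3" "x.1"
```

Applied to the vowel model, positions 2, 1, 5, ... correspond to x.2, x.1, x.5, and so on, because the predictors are stored in the order x.1 to x.10.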


CWerneck - Cláudia Werneck

January 17, 2018

Question 1 - For this quiz we will be using several R packages. R package versions change over time; the right answers have been checked using the following versions of the packages.

If you aren’t using these versions of the packages, your answers may not exactly match the right answer, but hopefully should be close.

Load the cell segmentation data from the AppliedPredictiveModeling package using the commands:

1. Subset the data to a training set and testing set based on the Case variable in the data set.

2. Set the seed to 125 and fit a CART model with the rpart method using all predictor variables and default caret settings.

3. In the final model what would be the final model prediction for cases with the following variable values:

a. TotalIntenCh2 = 23,000; FiberWidthCh1 = 10; PerimStatusCh1 = 2

b. TotalIntenCh2 = 50,000; FiberWidthCh1 = 10; VarIntenCh4 = 100

c. TotalIntenCh2 = 57,000; FiberWidthCh1 = 8; VarIntenCh4 = 100

d. FiberWidthCh1 = 8; VarIntenCh4 = 100; PerimStatusCh1 = 2

a. PS

b. WS

c. PS

d. Not possible to predict

Question 2 - If K is small in a K-fold cross validation, is the bias in the estimate of out-of-sample (test set) accuracy smaller or bigger?

If K is small, is the variance in the estimate of out-of-sample (test set) accuracy smaller or bigger?

Is K large or small in leave one out cross validation?

The bias is larger and the variance is smaller. Under leave one out cross validation K is equal to the sample size.

Question 3 - Load the olive oil data using the commands:

(NOTE: If you have trouble installing the pgmm package, you can download the olive dataset here: olive_data.zip. After unzipping the archive, you can load the file using the load() function in R.)

These data contain information on 572 different Italian olive oils from multiple regions in Italy.

Fit a classification tree where Area is the outcome variable. Then predict the value of area for the following data frame using the tree command with all defaults

What is the resulting prediction? Is the resulting prediction strange? Why or why not?

2.783. It is strange because Area should be a qualitative variable - but tree is reporting the average value of Area as a numeric variable in the leaf predicted for newdata

Question 4 - Load the South Africa Heart Disease Data and create training and test sets with the following code:

Then set the seed to 13234 and fit a logistic regression model (method=“glm”, be sure to specify family=“binomial”) with Coronary Heart Disease (chd) as the outcome and age at onset, current alcohol consumption, obesity levels, cumulative tobacco, type-A behavior, and low density lipoprotein cholesterol as predictors.

Calculate the misclassification rate for your model using this function and a prediction on the “response” scale:

What is the misclassification rate on the training set? What is the misclassification on the test set?

Test Set Misclassification: 0.31

Training Set: 0.27

Question 5 - Load the vowel.train and vowel.test data sets:

Set the variable y to be a factor variable in both the training and test set.

Then set the seed to 33833.

Fit a random forest predictor relating the factor variable y to the remaining variables.

Read about variable importance in random forests here: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr

The caret package uses by default the Gini importance.

Calculate the variable importance using the varImp function in the caret package.

What is the order of variable importance?

[NOTE: Use randomForest() specifically, not caret, as there have been some issues reported with that approach. 11/6/2016]

The order of the variables is:

x.2, x.1, x.5, x.6, x.8, x.4, x.9, x.3, x.7, x.10

Thanks for reading!