For the random forest model, the “randomForest” R package provides a quick route to model building and initial tuning. We start by leaving the number of trees (ntree) and the number of variables tried at each split (mtry) at their default values. For classification problems, the default ntree is 500 and the default mtry is floor(sqrt(number of predictors)), which is 3 for the 15 predictors here. In modeling this problem, no separate validation set is carved out; instead, the test dataset is used during the tuning of the model.
library(randomForest)
library(pROC)
# Create a Random Forest model with default parameters
set.seed(12)
model1=randomForest(Y~., data = data.train, importance=TRUE, keep.forest = TRUE)
model1
##
## Call:
## randomForest(formula = Y ~ ., data = data.train, importance = TRUE, keep.forest = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 13.53%
## Confusion matrix:
## 0 1 class.error
## 0 259 40 0.1337793
## 1 34 214 0.1370968
In the base model, model1, the default ntree is 500 and mtry is 3. The OOB error is 13.53% (comparable to the training accuracy of the optimized logistic model in the previous part). We run the test dataset here to see how the base model performs without tuning. A quick calculation from the prediction’s contingency table shows a prediction accuracy of 88% on the test set. The variable importance measures give us an idea of the relative weight of each variable in the base model. Since by default only 3 of the 15 variables are tried at each split, we will now tune this parameter to produce better prediction results.
# Checking classification accuracy on the training data
# Note: the test data is used as a validation set here, rather than
# splitting the training set into separate train and validation sets.
table(predict(model1, data.train), data.train$Y)
# Recombine and re-split the data (likely to keep factor levels consistent
# between the training and test sets)
combined = rbind(data.train, data.test)
str(combined)
data.test = combined[548:647, ]
predValid = predict(model1, newdata = data.test, type = "class")
# test accuracy
mean(predValid == data.test$Y)
## [1] 0.88
table(predValid, data.test$Y)
##
## predValid 0 1
## 0 51 11
## 1 1 37
Importance of Variables
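The matrix below is produced by the importance() accessor in the randomForest package; the call itself was not shown in the original output, but would presumably be:
importance(model1)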
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## V1 -0.4139220 0.9273370 0.3846298 2.557651
## V2 1.5442342 2.4278402 2.7765972 17.167851
## V3 3.0140258 4.3543686 5.1901998 18.007191
## V4 2.6690640 2.1810745 3.5833725 2.854904
## V5 0.9013136 0.8321372 1.2797848 2.707166
## V6 12.1427385 6.9484275 13.3408290 29.950912
## V7 5.3033728 0.8678884 4.7415494 9.535665
## V8 12.2274431 8.3862621 14.2399797 25.741334
## V9 48.2786220 46.7812734 56.2520061 71.396813
## V10 8.3928376 10.4016062 12.7389440 13.678115
## V11 11.2186897 10.9092928 15.4717681 24.413438
## V12 -0.5492454 -1.6690523 -1.5798909 2.618068
## V13 2.5809757 2.8792685 3.9252374 1.832399
## V14 7.8258968 3.8508357 8.5046849 16.627588
## V15 15.9347101 10.5868446 18.3994449 26.908636
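The variable importance plots referenced earlier can be reproduced with varImpPlot() from the randomForest package, which draws the two measures side by side:
# dot charts of mean decrease in accuracy and mean decrease in Gini
varImpPlot(model1)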
Two methods of parameter tuning are used for this model, given the number of variables and the sample size: one uses the tuneRF() function from the randomForest package, the other uses a for loop to find the optimal mtry.
The tuneRF() function suggested that mtry=2 produced the lowest OOB error, while the for-loop tuning suggested that mtry=7 produced the highest prediction accuracy. The difference comes down to two reasons:
- The for loop evaluates each candidate on the test set, so its pick of mtry=7 should give better test accuracy than the tuneRF() result, which minimizes OOB error instead.
- The tuneRF() search stopped once the relative improvement in OOB error dropped below 0.01, so it did not explore larger mtry values.
Tuning with tuneRF() function
# If using OOB error as the model-evaluation criterion:
# tuneRF() from the randomForest package
x.train = data.train[, 1:15]
y.train = data.train[, 16]
set.seed(12)
tuneRF(
  x = x.train,
  y = y.train,
  ntreeTry = 500,    # trees grown for each candidate mtry
  mtryStart = 3,     # start the search at the default mtry
  stepFactor = 1.5,  # multiply/divide mtry by this factor at each step
  improve = 0.01,    # stop when relative OOB improvement falls below this
  trace = FALSE
)
## 0.02702703 0.01
## -0.01388889 0.01
## mtry OOBError
## 2.OOB 2 0.1316271
## 3.OOB 3 0.1352834
## 4.OOB 4 0.1334552
Tuning with a for loop using the test data
# Fine-tuning parameters of the random forest model
# If using test accuracy as the model-evaluation criterion:
# loop over candidate mtry values to find the optimum
set.seed(12)
a = numeric()
for (i in 2:15) {
  model2 = randomForest(Y ~ ., data = data.train, ntree = 500, mtry = i, importance = TRUE)
  predValid2 = predict(model2, newdata = data.test, type = "class")
  a[i - 1] = mean(predValid2 == data.test$Y)  # test accuracy for mtry = i
}
a
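Since a holds test accuracies for mtry values 2 through 15, the best value can be read off directly (a small helper not in the original code):
# positions 1..14 of `a` correspond to mtry = 2..15
which.max(a) + 1  # 7, per the tuning result discussed above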
set.seed(12)
# final models with the two suggested mtry values
model_RFtuned = randomForest(Y ~ ., data = data.train, ntree = 500, mtry = 2, importance = TRUE)
model_Mtuned = randomForest(Y ~ ., data = data.train, ntree = 500, mtry = 7, importance = TRUE)
Model Comparison
model1
##
## Call:
## randomForest(formula = Y ~ ., data = data.train, importance = TRUE, keep.forest = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 13.53%
## Confusion matrix:
## 0 1 class.error
## 0 259 40 0.1337793
## 1 34 214 0.1370968
model_RFtuned
##
## Call:
## randomForest(formula = Y ~ ., data = data.train, ntree = 500, mtry = 2, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 13.35%
## Confusion matrix:
## 0 1 class.error
## 0 263 36 0.1204013
## 1 37 211 0.1491935
model_Mtuned
##
## Call:
## randomForest(formula = Y ~ ., data = data.train, ntree = 500, mtry = 7, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 13.71%
## Confusion matrix:
## 0 1 class.error
## 0 256 43 0.1438127
## 1 32 216 0.1290323
ROC Curves
[Figure: ROC curves on the test data for the two tuned models; panel titles "ntree=500, mtry=2" and "ntree=500, mtry=7".]
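The curves can be regenerated with the pROC package loaded at the start; a minimal sketch, assuming Y is a factor with levels "0" and "1" and using the predicted probability of class "1" as the score:
# predicted probabilities of class "1" on the test set
prob_RFtuned = predict(model_RFtuned, newdata = data.test, type = "prob")[, "1"]
prob_Mtuned = predict(model_Mtuned, newdata = data.test, type = "prob")[, "1"]
# build and plot one ROC curve per tuned model
roc_RFtuned = roc(data.test$Y, prob_RFtuned)
roc_Mtuned = roc(data.test$Y, prob_Mtuned)
plot(roc_RFtuned, main = "ntree=500, mtry=2")
plot(roc_Mtuned, main = "ntree=500, mtry=7")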
Prediction Accuracy
OOB error: model_RFtuned result: 13.35%
Prediction accuracy (on test data): model_RFtuned result: 88%
# prediction tables for the two tuned models
predValid_RFtuned = predict(model_RFtuned, newdata = data.test, type = "class")
mean(predValid_RFtuned == data.test$Y)
## [1] 0.88
table(predValid_RFtuned, data.test$Y)
##
## predValid_RFtuned 0 1
## 0 51 11
## 1 1 37
predValid_Mtuned = predict(model_Mtuned, newdata = data.test, type = "class")
mean(predValid_Mtuned == data.test$Y)
## [1] 0.9
table(predValid_Mtuned, data.test$Y)
##
## predValid_Mtuned 0 1
## 0 51 9
## 1 1 39
The randomForest method provides two tuned models for prediction. As shown in the results above, the two models’ prediction accuracies on the test data are very close. model_Mtuned (tuned with the for loop) appears slightly more accurate on new data, but since it was tuned against the small test set itself, that estimate may be optimistic. If test data were not available, model_RFtuned (tuned with tuneRF()) provides the lowest OOB error. When more samples become available, holding out a separate validation set between model training and testing would allow tuning without touching the test data and should improve the reliability of the reported prediction accuracy. A sketch of such a split follows.
Overall, the random forest model produces similar training accuracy and better test accuracy than the logistic regression model, with a much simpler implementation. However, the tuning process can be difficult to control due to the complexity of the individual decision trees.