For the random forest model, the “randomForest” R package provides a quick route to model building and initial tuning. We start by leaving the number of trees (ntree) and the number of variables tried at each split (mtry) at their default values. For classification problems, the default ntree is 500 and the default mtry is floor(sqrt(number of predictors)), which is 3 for the 15 predictors here. In modeling this problem, no separate validation set is carved out; instead, the test dataset is used during the tuning of the model.
library(randomForest)
library(pROC)
# Create a Random Forest model with default parameters
set.seed(12)
model1=randomForest(Y~., data = data.train, importance=TRUE, keep.forest = TRUE)
model1
##
## Call:
## randomForest(formula = Y ~ ., data = data.train, importance = TRUE, keep.forest = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 13.53%
## Confusion matrix:
## 0 1 class.error
## 0 259 40 0.1337793
## 1 34 214 0.1370968
In the base model, model1, the default ntree is 500 and mtry is 3. The OOB error is 13.53% (comparable to the training accuracy of the optimized logistic model in the previous part). We run the test dataset here to see how the base model performs without tuning. A quick calculation from the prediction’s contingency table shows a prediction accuracy of 88% on the test set. The variable importance measures give us an idea of the relative weight of each variable in the base model. Since by default only 3 of the 15 variables are tried at each split, we will now tune this parameter to produce better prediction results.
# Checking classification accuracy on the training data
# Note: the test data is used as a validation set here, rather than
# splitting the training set into separate train and validation sets.
table(predict(model1, data.train), data.train$Y)
# Recombine and re-split the data (likely to keep factor levels consistent
# between the training and test sets)
combined = rbind(data.train, data.test)
str(combined)
data.test = combined[548:647, ]
predValid = predict(model1, newdata = data.test, type = "class")
# test accuracy
mean(predValid == data.test$Y)
## [1] 0.88
table(predValid, data.test$Y)
##
## predValid 0 1
## 0 51 11
## 1 1 37
Importance of Variables
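The matrix below is produced by the importance() accessor in the randomForest package; the call itself was not shown in the original output, but would presumably be:
importance(model1)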
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## V1 -0.4139220 0.9273370 0.3846298 2.557651
## V2 1.5442342 2.4278402 2.7765972 17.167851
## V3 3.0140258 4.3543686 5.1901998 18.007191
## V4 2.6690640 2.1810745 3.5833725 2.854904
## V5 0.9013136 0.8321372 1.2797848 2.707166
## V6 12.1427385 6.9484275 13.3408290 29.950912
## V7 5.3033728 0.8678884 4.7415494 9.535665
## V8 12.2274431 8.3862621 14.2399797 25.741334
## V9 48.2786220 46.7812734 56.2520061 71.396813
## V10 8.3928376 10.4016062 12.7389440 13.678115
## V11 11.2186897 10.9092928 15.4717681 24.413438
## V12 -0.5492454 -1.6690523 -1.5798909 2.618068
## V13 2.5809757 2.8792685 3.9252374 1.832399
## V14 7.8258968 3.8508357 8.5046849 16.627588
## V15 15.9347101 10.5868446 18.3994449 26.908636
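The variable importance plots referenced earlier can be reproduced with varImpPlot() from the randomForest package, which draws the two measures side by side:
# dot charts of mean decrease in accuracy and mean decrease in Gini
varImpPlot(model1)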
Two methods of parameter tuning are used for this model, given the number of variables and the sample size: one uses the tuneRF() function from the randomForest package, the other uses a for loop to find the optimal mtry.
The tuneRF() function suggested that mtry=2 produced the lowest OOB error, while the for-loop tuning suggested that mtry=7 produced the highest prediction accuracy. The difference comes down to two reasons:
- The for loop evaluates each candidate on the test set, so its pick of mtry=7 should give better test accuracy than the tuneRF() result, which minimizes OOB error instead.
- The tuneRF() search stopped once the relative improvement in OOB error dropped below 0.01, so it did not explore larger mtry values.
Tuning with tuneRF() function
# If using OOB error as the model-evaluation criterion:
# tuneRF() from the randomForest package
x.train = data.train[, 1:15]
y.train = data.train[, 16]
set.seed(12)
tuneRF(
  x = x.train,
  y = y.train,
  ntreeTry = 500,    # trees grown for each candidate mtry
  mtryStart = 3,     # start the search at the default mtry
  stepFactor = 1.5,  # multiply/divide mtry by this factor at each step
  improve = 0.01,    # stop when relative OOB improvement falls below this
  trace = FALSE
)
## 0.02702703 0.01
## -0.01388889 0.01
## mtry OOBError
## 2.OOB 2 0.1316271
## 3.OOB 3 0.1352834
## 4.OOB 4 0.1334552
Tuning with a for loop using the test data
# Fine-tuning parameters of the random forest model
# If using test accuracy as the model-evaluation criterion:
# loop over candidate mtry values to find the optimum
set.seed(12)
a = numeric()
for (i in 2:15) {
  model2 = randomForest(Y ~ ., data = data.train, ntree = 500, mtry = i, importance = TRUE)
  predValid2 = predict(model2, newdata = data.test, type = "class")
  a[i - 1] = mean(predValid2 == data.test$Y)  # test accuracy for mtry = i
}
a
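Since a holds test accuracies for mtry values 2 through 15, the best value can be read off directly (a small helper not in the original code):
# positions 1..14 of `a` correspond to mtry = 2..15
which.max(a) + 1  # 7, per the tuning result discussed above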
set.seed(12)
# final models with the two suggested mtry values
model_RFtuned = randomForest(Y ~ ., data = data.train, ntree = 500, mtry = 2, importance = TRUE)
model_Mtuned = randomForest(Y ~ ., data = data.train, ntree = 500, mtry = 7, importance = TRUE)
Model Comparison
model1
##
## Call:
## randomForest(formula = Y ~ ., data = data.train, importance = TRUE, keep.forest = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 13.53%
## Confusion matrix:
## 0 1 class.error
## 0 259 40 0.1337793
## 1 34 214 0.1370968
model_RFtuned
##
## Call:
## randomForest(formula = Y ~ ., data = data.train, ntree = 500, mtry = 2, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 13.35%
## Confusion matrix:
## 0 1 class.error
## 0 263 36 0.1204013
## 1 37 211 0.1491935
model_Mtuned
##
## Call:
## randomForest(formula = Y ~ ., data = data.train, ntree = 500, mtry = 7, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 13.71%
## Confusion matrix:
## 0 1 class.error
## 0 256 43 0.1438127
## 1 32 216 0.1290323
ROC Curves
[Figure: ROC curves on the test data for the two tuned models; panel titles "ntree=500, mtry=2" and "ntree=500, mtry=7".]
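The curves can be regenerated with the pROC package loaded at the start; a minimal sketch, assuming Y is a factor with levels "0" and "1" and using the predicted probability of class "1" as the score:
# predicted probabilities of class "1" on the test set
prob_RFtuned = predict(model_RFtuned, newdata = data.test, type = "prob")[, "1"]
prob_Mtuned = predict(model_Mtuned, newdata = data.test, type = "prob")[, "1"]
# build and plot one ROC curve per tuned model
roc_RFtuned = roc(data.test$Y, prob_RFtuned)
roc_Mtuned = roc(data.test$Y, prob_Mtuned)
plot(roc_RFtuned, main = "ntree=500, mtry=2")
plot(roc_Mtuned, main = "ntree=500, mtry=7")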
Prediction Accuracy
OOB error: model_RFtuned result: 13.35%
Prediction accuracy (on test data): model_RFtuned result: 88%
# prediction tables for the two tuned models
predValid_RFtuned = predict(model_RFtuned, newdata = data.test, type = "class")
mean(predValid_RFtuned == data.test$Y)
## [1] 0.88
table(predValid_RFtuned, data.test$Y)
##
## predValid_RFtuned 0 1
## 0 51 11
## 1 1 37
predValid_Mtuned = predict(model_Mtuned, newdata = data.test, type = "class")
mean(predValid_Mtuned == data.test$Y)
## [1] 0.9
table(predValid_Mtuned, data.test$Y)
##
## predValid_Mtuned 0 1
## 0 51 9
## 1 1 39
The randomForest method provides two tuned models for prediction. As shown in the results above, the two models’ prediction accuracies on the test data are very close. model_Mtuned (tuned with the for loop) appears slightly more accurate on new data, but since it was tuned against the small test set itself, that estimate may be optimistic. If test data were not available, model_RFtuned (tuned with tuneRF()) provides the lowest OOB error. When more samples become available, holding out a separate validation set between model training and testing would allow tuning without touching the test data and should improve the reliability of the reported prediction accuracy. A sketch of such a split follows.
Overall, the random forest model produces similar training accuracy and better test accuracy than the logistic regression model, with a much simpler implementation. However, the tuning process can be difficult to control due to the complexity of the individual decision trees.