Step 1: Collecting Data

The credit.csv data is obtained from the Machine Learning course website of Prof. Eric Suess at http://www.sci.csueastbay.edu/~esuess/stat6620/.

Step 2: Exploring & Preparing Data

The credit.csv dataset contains 1000 observations and 17 variables. The features describe the 1000 bank loan applicants: their credit history, checking/savings balances, age, loan amount, loan purpose, job type, and so on. These serve as the independent variables used to predict the target, default, which records whether the applicant defaulted on the loan.

credit <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml7/credit.csv")
str(credit)
## 'data.frame':    1000 obs. of  17 variables:
##  $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
##  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
##  $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
##  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
##  $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
##  $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
##  $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job                 : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
##  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ phone               : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
##  $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
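Before splitting the data, a quick look at the class balance of the target variable helps put the error rates reported later in context; a minimal sketch:

table(credit$default)                 # counts of non-defaulters vs. defaulters
prop.table(table(credit$default))     # the same counts as proportions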
# hold out a random 10% of the observations for testing
# (no seed is set here, so the exact split varies between runs)
indx = sample(1:nrow(credit), as.integer(0.9*nrow(credit)))

credit_train = credit[indx,]
credit_test = credit[-indx,]

# column 17 is the class label, default
credit_train_labels = credit[indx,17]
credit_test_labels = credit[-indx,17]
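Because the split above is random, it is worth verifying that the default rate looks similar in the two partitions; a quick sketch:

prop.table(table(credit_train$default))   # default rate in the training set
prop.table(table(credit_test$default))    # default rate in the test set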

Step 3: Model Training on the Data

The randomForest() function in the randomForest package is used to construct a random forest model, taking 'default' as the response and the remaining 16 features as predictors, fit on the training dataset. The call grows 500 random trees to form the 'forest', with each split considering the square root of the number of predictors, as is conventional. The model reports an out-of-bag (OOB) error rate of 23.67%, i.e., roughly 76.3% accuracy.

library(randomForest)
## Warning: package 'randomForest' was built under R version 3.3.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
set.seed(300)
# 500 trees; sqrt(16) = 4 candidate features tried at each split
rf.model = randomForest(default ~ ., data = credit_train, ntree = 500,
                        mtry = sqrt(16))
rf.model
## 
## Call:
##  randomForest(formula = default ~ ., data = credit_train, ntree = 500,      mtry = sqrt(16)) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 23.67%
## Confusion matrix:
##      no yes class.error
## no  569  58  0.09250399
## yes 155 118  0.56776557
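The fitted object also records how much each predictor contributed to the forest; inspecting variable importance is a common follow-up, sketched here using functions from the randomForest package:

importance(rf.model)      # mean decrease in Gini impurity per predictor
varImpPlot(rf.model)      # dot chart of the same information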

Step 4: Evaluating Model Performance

The random forest model is applied to the test dataset, producing a vector of predicted labels for the test observations. table() is then used to build a simple confusion matrix, from which the error and accuracy can be read off.

rf.pred = predict(rf.model, credit_test)
summary(rf.pred)
##  no yes 
##  72  28
confusion.matrix = table(rf.pred, credit_test_labels)
confusion.matrix
##        credit_test_labels
## rf.pred no yes
##     no  63   9
##     yes 10  18

As it turns out, the random forest model with 500 trees, each split drawing on 4 random features, achieves an accuracy of 81% on the bank credit test dataset.

sum(diag(confusion.matrix))/ sum(confusion.matrix)
## [1] 0.81
sum(rf.pred==credit_test_labels) / nrow(credit_test)
## [1] 0.81
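Since the classes are imbalanced, overall accuracy alone can be misleading. Assuming the caret package is installed, its confusionMatrix() function also reports sensitivity and specificity for a chosen positive class; a minimal sketch:

library(caret)
# treat "yes" (default) as the positive class of interest
confusionMatrix(rf.pred, credit_test_labels, positive = "yes")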

Step 5: Model Improvement

The random forest can potentially be improved by adjusting its tuning parameters, specifically the values of the ntree and mtry arguments. Here a larger forest is grown by raising ntree from 500 to 1000, and 8 features instead of 4 are randomly selected at each split. Because there are only 16 predictors in total and half of them are considered at each split, every feature has the chance to appear in many trees, so the substantial random variation from tree to tree tends to be averaged out in the ensemble vote.

The adjusted model reports an OOB error rate of 24.44% (about 75.6% accuracy), slightly higher than the original model's 23.67%.

library(randomForest)
set.seed(300)
# 1000 trees; 8 of the 16 candidate features tried at each split
rf.model2 = randomForest(default ~ ., data = credit_train, ntree = 1000,
                         mtry = 8)
rf.model2
## 
## Call:
##  randomForest(formula = default ~ ., data = credit_train, ntree = 1000,      mtry = 8) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 24.44%
## Confusion matrix:
##      no yes class.error
## no  556  71   0.1132376
## yes 149 124   0.5457875

The adjusted model is applied to the test dataset. Its accuracy, 79%, is in fact slightly lower than the original model's 81%, so the larger forest with more features per split did not improve performance here.

rf.pred2 = predict(rf.model2, credit_test)
summary(rf.pred2)
##  no yes 
##  74  26
table(rf.pred2, credit_test_labels)
##         credit_test_labels
## rf.pred2 no yes
##      no  63  11
##      yes 10  16
sum(rf.pred2==credit_test_labels) / nrow(credit_test)
## [1] 0.79
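Rather than picking mtry by hand, the randomForest package's tuneRF() function can search over mtry values using the OOB error as its criterion; a minimal sketch (the stepFactor and improve settings below are illustrative choices, not values from the original analysis):

set.seed(300)
# search for the mtry that minimizes OOB error; column 17 is the label
tuneRF(credit_train[, -17], credit_train[, 17],
       ntreeTry = 500, stepFactor = 2, improve = 0.01)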

Conclusion: Random forest is a popular meta-learning algorithm that combines the idea of bagging and decision trees with random feature selection. The out-of-bag (OOB) error rate in the model object is an unbiased estimate of the test set error because each tree is evaluated on the observations left out of its bootstrap sample; these OOB predictions are tallied across all the trees, and a majority vote determines the final prediction for each observation. The random forest can be tuned by adjusting the number of trees and the number of features used for 'forest building', although, as seen here, a larger forest with more features per split is not guaranteed to perform better, since trees built from larger feature subsets are more correlated with one another.
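To see the OOB bookkeeping directly, the fitted randomForest object stores the running OOB error after each tree is added; a minimal sketch using the first model fitted above:

# err.rate has one row per tree: overall OOB error plus per-class error
head(rf.model$err.rate)
plot(rf.model, main = "OOB error as trees are added")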