## Introduction

Imagine you want to know whether spending a vacation in New York is a good idea. How would you answer this question? You have never been to New York, so you might ask your friends for advice. You do, however, know your own past traveling experience: could you put it to good use by sharing it with your friends so they can predict how much you would enjoy the trip? When making predictions from past information, statistics is a natural starting point. Linear regression is one famous statistical tool, but it relies on parametric assumptions, such as a linear relationship between the variables and a particular distribution of the population, to produce reliable inference. Unfortunately, it would be hard to verify that your preferences for a travel destination satisfy any of those assumptions.

Random forests, on the other hand, are a well-known supervised machine learning algorithm that bypasses those assumptions. The model discovers the patterns among the variables in a dataset and then uses them to predict. Random forests can be a great prediction tool in many fields: for example, they can predict whether a customer will repay a debt on time or whether a customer will close their credit card account. The main goals of this paper are to introduce how the method works and what its advantages are. I will use banking customer data from Kaggle to illustrate both how the algorithm works and its limitations.

## The Credit Card Churn Data Set

#Read the data, then shorten the two long trailing column names
bank <- read.csv("C:/Users/User/Desktop/DataTranslation/BankChurners.csv")
colnames(bank)[c(22,23)] <- c("n_1","n_2")
#Remove irrelevant variables (the ID column and the two renamed columns)
bank <- bank[,-c(1,22,23)]
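
Churn data are usually imbalanced, so before modeling it is worth checking how the response, Attrition_Flag, is distributed; the confusion matrices later in this paper should be read with that balance in mind. A quick base-R check (a minimal sketch):

#Class balance of the response (randomForest treats a factor response as classification)
table(bank$Attrition_Flag)
prop.table(table(bank$Attrition_Flag))
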
#Load the packages used below
library(rsample)
library(randomForest)
library(ranger)
library(caret)
library(h2o)
library(dplyr)
library(magrittr)
library(ggplot2)
library(party)
#Split data into training set (70%) and test set (30%)
set.seed(123)
banksplit <- initial_split(data = bank, prop = 0.7)
bank_train <- training(banksplit)
bank_test <- testing(banksplit)
bank_eva <- bank_test[,-2]
set.seed(123)
#Create the random forests model
model1 <- randomForest::randomForest(formula = Attrition_Flag~., data = bank_train, mtry = 9)
model1
## 
## Call:
##  randomForest(formula = Attrition_Flag ~ ., data = bank_train,      mtry = 9) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 9
## 
##         OOB estimate of  error rate: 3.51%
## Confusion matrix:
##                   Attrited Customer Existing Customer class.error
## Attrited Customer               967               160  0.14196983
## Existing Customer                89              5873  0.01492788
#The OOB error rate by number of trees
plot(model1)

The error curves flatten at around 180 trees, which suggests that roughly 180 trees are enough; adding more trees beyond that point barely changes the out-of-bag error.
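
As a quick check of that reading of the plot (a minimal sketch; model_180 is a name I introduce here), one can refit the forest with ntree = 180 and compare its final out-of-bag error rate with the 500-tree fit:

set.seed(123)
model_180 <- randomForest::randomForest(formula = Attrition_Flag~., data = bank_train, mtry = 9, ntree = 180)
#Final OOB error of the 180-tree forest versus the 500-tree forest
model_180$err.rate[180, "OOB"]
model1$err.rate[500, "OOB"]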

features <- setdiff(x = names(bank_train), y = "Attrition_Flag")

set.seed(123)

#Search for the mtry value that minimizes the OOB error
m2 <- tuneRF(x = bank_train[features], y = bank_train$Attrition_Flag,
             mtryStart = 2, ntreeTry = 500, stepFactor = 1.5,
             improve = 0.01, trace = FALSE)
## 0.1207349 0.01 
## 0.1522388 0.01 
## 0.06690141 0.01 
## 0.04528302 0.01 
## -0.0513834 0.01

#Choose 9 as the optimal mtry value

The plot above shows that the out-of-bag error rate is lowest when nine features are considered at each split, so nine is the optimal mtry value.
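
tuneRF also returns a matrix of the mtry values it tried together with their OOB errors, so the best value can be read off programmatically rather than from the plot (a minimal sketch):

#m2 has one row per mtry value tried, with columns "mtry" and "OOBError"
m2[which.min(m2[, "OOBError"]), "mtry"]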

#Refit the tuned forest 100 times to see how the OOB error varies across runs
OOB_RMSE <- vector(mode = "numeric", length = 100)

for(i in 1:length(OOB_RMSE)){
  optimal_ranger <- ranger(formula = Attrition_Flag ~ ., data = bank_train,
                           num.trees = 500, mtry = 9, min.node.size = 5,
                           sample.fraction = .8, importance = 'impurity')
  #For classification, prediction.error is the OOB misclassification rate
  OOB_RMSE[i] <- sqrt(optimal_ranger$prediction.error)
}

hist(OOB_RMSE, breaks = 20)
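
A numeric summary of the same 100 refits (a small sketch) makes the spread easier to report than the histogram alone:

#Location and spread of the OOB error estimate across the refits
summary(OOB_RMSE)
sd(OOB_RMSE)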

#Predict the test set with the 500-tree model
pred <- predict(model1, bank_test)
#Correctly predicted
sum(pred == bank_test$Attrition_Flag)
## [1] 2928
#Misclassified
sum(pred != bank_test$Attrition_Flag)
## [1] 110
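
Expressed as a rate (a one-line sketch), that is 2,928 of 3,038 test customers, or about 96.4%, classified correctly:

#Test-set accuracy
mean(pred == bank_test$Attrition_Flag)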
#Plot a simplified single tree for illustration
set.seed(10)
x <- ctree(Attrition_Flag ~ ., data = bank_train, controls = ctree_control(mtry = 2))
plot(x, terminal_panel = node_barplot)

The bar plot at each terminal node shows how 'pure' the final group is. Take the rightmost node, for example: about 81% of the customers in it are attrited credit card users (those who closed their credit card accounts).

#Another way to display a tree in the model
x <- ctree(Attrition_Flag ~ ., data = bank_train, controls = ctree_control(mtry = 1))
plot(x, type = "simple", inner_panel = node_barplot)
#Confusion matrix
confusionMatrix(data = pred, reference = bank_test$Attrition_Flag)
## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Attrited Customer Existing Customer
##   Attrited Customer               419                29
##   Existing Customer                81              2509
##                                            
##                Accuracy : 0.9638           
##                  95% CI : (0.9565, 0.9701) 
##     No Information Rate : 0.8354           
##     P-Value [Acc > NIR] : < 2.2e-16        
##                                            
##                   Kappa : 0.8626           
##                                            
##  Mcnemar's Test P-Value : 1.158e-06        
##                                            
##             Sensitivity : 0.8380           
##             Specificity : 0.9886           
##          Pos Pred Value : 0.9353           
##          Neg Pred Value : 0.9687           
##              Prevalence : 0.1646           
##          Detection Rate : 0.1379           
##    Detection Prevalence : 0.1475           
##       Balanced Accuracy : 0.9133           
##                                            
##        'Positive' Class : Attrited Customer
## 
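
To make the headline numbers concrete, sensitivity and specificity can be recomputed by hand from the four cells of the matrix, with "Attrited Customer" as the positive class (a minimal sketch):

#Sensitivity: correctly flagged attrited / all actual attrited = 419/(419 + 81) = 0.8380
419/(419 + 81)
#Specificity: correctly flagged existing / all actual existing = 2509/(2509 + 29) = 0.9886
2509/(2509 + 29)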

The output above compares the predicted outcomes from the model with the customers' actual status. The model correctly predicted 2,928 customers, about 96% of the test set. We could still try to improve the accuracy by adjusting two parameters: the number of trees and the size of each tree. A forest with more or larger trees can be more accurate, but it takes longer to fit and its memory usage is higher. Moreover, larger trees do not necessarily make more accurate predictions, and deep trees also raise the risk of overfitting.
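
A small grid search makes such tuning concrete (a minimal sketch using ranger's OOB error; the grid values and the name tune_grid are mine, and min.node.size is used here to control tree size):

#Try a few combinations of forest size and minimum node size
tune_grid <- expand.grid(num.trees = c(180, 500), min.node.size = c(1, 5, 10))
tune_grid$oob_error <- NA
for(i in 1:nrow(tune_grid)){
  fit <- ranger(formula = Attrition_Flag ~ ., data = bank_train,
                num.trees = tune_grid$num.trees[i], mtry = 9,
                min.node.size = tune_grid$min.node.size[i], seed = 123)
  #prediction.error is the OOB misclassification rate for classification
  tune_grid$oob_error[i] <- fit$prediction.error
}
#Lowest OOB error first
tune_grid[order(tune_grid$oob_error), ]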

#Find the important features by how much they reduce the impurity in each node
options(scipen = -1)
optimal_ranger$variable.importance %>% 
  as.matrix() %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column() %>% 
  `colnames<-`(c("varname","imp")) %>%
  arrange(desc(imp)) %>% 
  top_n(10, wt = imp) %>% 
  ggplot(mapping = aes(x = reorder(varname, imp), y = imp)) +
  geom_col() +
  coord_flip() +
  ggtitle(label = "Top 10 important variables") +
  theme_bw() +
  theme(
    axis.title = element_blank()
  )

As the importance plot shows, based on how much each feature improves the homogeneity within the nodes, the top three variables are the total number of transactions, the total revolving balance, and the total transaction amount over the last twelve months. The figures below show that a large proportion of attrited customers have relatively few transactions, a low revolving balance, and a low total transaction amount. It is not surprising that those who seldom use their credit cards are more likely to close them. One possible reason is credit card churning: some attrited customers may have opened their cards only for the welcome offer.
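
The top of that ranking can also be printed directly (a one-line sketch):

#Three most important variables by impurity reduction
head(sort(optimal_ranger$variable.importance, decreasing = TRUE), 3)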

#Distribution of the total transaction count by customer status
ggplot(bank_train, aes(x = Total_Trans_Ct)) + 
  theme_bw() +
  geom_histogram(data = subset(bank_train, Attrition_Flag == "Attrited Customer"), fill = "#F5DF4D", alpha = 0.9, bins = 30) +
  geom_histogram(data = subset(bank_train, Attrition_Flag == "Existing Customer"), fill = "#939597", alpha = 0.3, bins = 30) +
  theme(axis.line = element_line(colour = "black"),
    panel.grid.minor = element_blank(), axis.title.x = element_blank()) +
  labs(caption = "Total Transaction Count") + 
  theme(plot.caption = element_text(hjust = 0.5, size = rel(1.2)))

#Distribution of the total revolving balance by customer status
ggplot(bank_train, aes(x = Total_Revolving_Bal)) + 
  theme_bw() +
  geom_histogram(data = subset(bank_train, Attrition_Flag == "Attrited Customer"), fill = "#F5DF4D", alpha = 0.9, bins = 30) +
  geom_histogram(data = subset(bank_train, Attrition_Flag == "Existing Customer"), fill = "#939597", alpha = 0.3, bins = 30) +
  theme(axis.line = element_line(colour = "black"),
    panel.grid.minor = element_blank(), axis.title.x = element_blank()) +
  labs(caption = "Total Revolving Balance") + 
  theme(plot.caption = element_text(hjust = 0.5, size = rel(1.2)))

#Distribution of the total transaction amount by customer status
ggplot(bank_train, aes(x = Total_Trans_Amt)) + 
  theme_bw() +
  geom_histogram(data = subset(bank_train, Attrition_Flag == "Attrited Customer"), fill = "#F5DF4D", alpha = 0.9, bins = 30) +
  geom_histogram(data = subset(bank_train, Attrition_Flag == "Existing Customer"), fill = "#939597", alpha = 0.3, bins = 30) +
  theme(axis.line = element_line(colour = "black"),
    panel.grid.minor = element_blank(), axis.title.x = element_blank()) +
  labs(caption = "Total Transaction Amount") + 
  theme(plot.caption = element_text(hjust = 0.5, size = rel(1.2)))
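
The same contrast can be summarized numerically (a small sketch using dplyr, which is already loaded):

#Median usage by customer status
bank_train %>%
  group_by(Attrition_Flag) %>%
  summarise(med_trans_ct = median(Total_Trans_Ct),
            med_revolving = median(Total_Revolving_Bal),
            med_trans_amt = median(Total_Trans_Amt))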

## Conclusion

Most of the customers who closed their accounts seldom used their cards, so understanding the needs of these customers is the key to keeping them as clients. However, the pattern could also be evidence that the welcome offer is generous enough that some clients are credit card churning. For a new credit card company, that can even be good news: brand awareness may rise because the offer entices people to try the product. Still, as with well-established companies, retaining clients is crucial, and doing so requires information about clients' needs and about the advantages of the company's product.