Machine learning to predict whether a customer will buy a term deposit or not

A deposit is one of the bank products that people generally like to have, and one of its types is the term deposit. This type of investment is offered by banks and financial institutions: when someone decides to buy this product, they deposit a specific amount of money with the bank for a period of time. During this period, the deposited amount earns a fixed interest rate, which is usually higher than the rate offered on a regular savings account. For example, €10,000 placed in a one-year term deposit at a fixed 3% rate earns €300 in interest.

In this analysis, we will predict which customers will buy a term deposit based on their characteristics. The data used in this prediction comes from the UC Irvine Machine Learning Repository and describes the telemarketing campaigns of a Portuguese bank. The models we will use are Naive Bayes, Decision Tree, and Random Forest; we will then compare all the models to see which one performs best.

IMPORT LIBRARIES

Importing all necessary libraries

library(dplyr)
library(lubridate)
library(e1071) # package for naivebayes
library(caret) # package for resampling, upSample, confusionMatrix, and k-fold CV
library(partykit) #package for ctree function
library(ROCR) #Package for AUC
library(randomForest)
library(GGally)
library(stats) # MANOVA & chi square
library(coin)
library(rattle)

READ DATA

bank <- read.csv("bank-full.csv", sep = ";", stringsAsFactors = T)
rmarkdown::paged_table(bank)

Column description:

  • Age (numeric)
  • Job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”,“self-employed”,“retired”,“technician”,“services”)
  • Marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)
  • Education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
  • Default: has credit in default? (binary: “yes”,“no”)
  • Balance: average yearly balance, in euros (numeric)
  • Housing: has housing loan? (binary: “yes”,“no”)
  • Loan: has personal loan? (binary: “yes”,“no”)

Related with the last contact of the current campaign:

  • Contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)
  • Day: last contact day of the month (numeric)
  • Month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
  • Duration: last contact duration, in seconds (numeric)

Other attributes:

  • Campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  • Pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
  • Previous: number of contacts performed before this campaign and for this client (numeric)
  • Poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)

Output variable (desired target):

  • y - has the client subscribed a term deposit? (binary: “yes”,“no”)

DATA PREPROCESSING

Data pre-processing is where we prepare our raw data before doing any analysis or machine learning, so we can be sure of the quality, consistency, and compatibility of our data. The steps done here include handling abnormal values, handling missing values, and data coercion.

Data Coercion

glimpse(bank)
#> Rows: 45,211
#> Columns: 17
#> $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
#> $ job       <fct> management, technician, entrepreneur, blue-collar, unknown, …
#> $ marital   <fct> married, single, married, married, single, married, single, …
#> $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
#> $ default   <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
#> $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
#> $ housing   <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
#> $ loan      <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
#> $ contact   <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
#> $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
#> $ month     <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
#> $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
#> $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
#> $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ poutcome  <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
#> $ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …

Our columns are already in their appropriate data types, so we don't need to do any data coercion here.

Handling Missing Values

colSums(is.na(bank))
#>       age       job   marital education   default   balance   housing      loan 
#>         0         0         0         0         0         0         0         0 
#>   contact       day     month  duration  campaign     pdays  previous  poutcome 
#>         0         0         0         0         0         0         0         0 
#>         y 
#>         0

There's no missing value detected in our data. But we still need to look deeper to check whether there is any unusual value that can act as an NA/missing value.

summary(bank)
#>       age                 job           marital          education    
#>  Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
#>  1st Qu.:33.00   management :9458   married :27214   secondary:23202  
#>  Median :39.00   technician :7597   single  :12790   tertiary :13301  
#>  Mean   :40.94   admin.     :5171                    unknown  : 1857  
#>  3rd Qu.:48.00   services   :4154                                     
#>  Max.   :95.00   retired    :2264                                     
#>                  (Other)    :6835                                     
#>  default        balance       housing      loan            contact     
#>  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
#>  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
#>              Median :   448                           unknown  :13020  
#>              Mean   :  1362                                            
#>              3rd Qu.:  1428                                            
#>              Max.   :102127                                            
#>                                                                        
#>       day            month          duration         campaign     
#>  Min.   : 1.00   may    :13766   Min.   :   0.0   Min.   : 1.000  
#>  1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
#>  Median :16.00   aug    : 6247   Median : 180.0   Median : 2.000  
#>  Mean   :15.81   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
#>  3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
#>  Max.   :31.00   apr    : 2932   Max.   :4918.0   Max.   :63.000  
#>                  (Other): 6060                                    
#>      pdays          previous           poutcome       y        
#>  Min.   : -1.0   Min.   :  0.0000   failure: 4901   no :39922  
#>  1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   yes: 5289  
#>  Median : -1.0   Median :  0.0000   success: 1511              
#>  Mean   : 40.2   Mean   :  0.5803   unknown:36959              
#>  3rd Qu.: -1.0   3rd Qu.:  0.0000                              
#>  Max.   :871.0   Max.   :275.0000                              
#> 

As we assumed before, if we look closely at the education, contact, job, and poutcome columns, there is an unusual value called unknown. This could mean that the bank did not know the value at the time, or forgot to input the data. To handle this, we can do a number of things: remove all rows that contain the missing value, delete the entire column, or, if we know the exact value, impute it manually. Here we surely do not know the exact values, so we only have two options: either remove the rows with missing values or remove the entire column.

We can remove all rows with missing values if the number of missing values is no bigger than 5% of the total rows. So we can remove the rows with unknown values from the job and education columns, while for the poutcome and contact columns, since there are too many unknown values, we remove those two columns entirely so that we don't lose too many rows of information.
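To check whether the 5% rule of thumb applies here, we can first compute the share of unknown values per column; a minimal sketch using the columns mentioned above:

# Share of "unknown" values per categorical column (in %)
sapply(bank[, c("job", "education", "contact", "poutcome")],
       function(col) round(mean(col == "unknown") * 100, 2))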

# Remove poutcome and contact columns
bank_clean <- bank %>% 
  select(-c(poutcome, contact)) %>%
  # Remove all rows with unknown value from job and education column
  filter(job != "unknown",
         education != "unknown")

The job and education columns have a factor data type, which means that unknown is still one of their levels. Before we do any further processing or data exploration, we need to drop these unused levels to avoid bad results in our analysis.

# Drop unused levels in education column
bank_clean$education <- droplevels(bank_clean$education)

# Drop unused levels in job column
bank_clean$job <- droplevels(bank_clean$job)
summary(bank_clean)
#>       age                 job           marital          education    
#>  Min.   :18.00   blue-collar:9278   divorced: 5028   primary  : 6800  
#>  1st Qu.:33.00   management :9216   married :25946   secondary:23131  
#>  Median :39.00   technician :7355   single  :12219   tertiary :13262  
#>  Mean   :40.76   admin.     :5000                                     
#>  3rd Qu.:48.00   services   :4004                                     
#>  Max.   :95.00   retired    :2145                                     
#>                  (Other)    :6195                                     
#>  default        balance       housing      loan            day       
#>  no :42411   Min.   : -8019   no :18901   no :36086   Min.   : 1.00  
#>  yes:  782   1st Qu.:    71   yes:24292   yes: 7107   1st Qu.: 8.00  
#>              Median :   442                           Median :16.00  
#>              Mean   :  1354                           Mean   :15.81  
#>              3rd Qu.:  1412                           3rd Qu.:21.00  
#>              Max.   :102127                           Max.   :31.00  
#>                                                                      
#>      month          duration         campaign          pdays      
#>  may    :13192   Min.   :   0.0   Min.   : 1.000   Min.   : -1.0  
#>  jul    : 6601   1st Qu.: 103.0   1st Qu.: 1.000   1st Qu.: -1.0  
#>  aug    : 6037   Median : 180.0   Median : 2.000   Median : -1.0  
#>  jun    : 4980   Mean   : 258.3   Mean   : 2.758   Mean   : 40.4  
#>  nov    : 3842   3rd Qu.: 318.0   3rd Qu.: 3.000   3rd Qu.: -1.0  
#>  apr    : 2820   Max.   :4918.0   Max.   :58.000   Max.   :871.0  
#>  (Other): 5721                                                    
#>     previous          y        
#>  Min.   :  0.0000   no :38172  
#>  1st Qu.:  0.0000   yes: 5021  
#>  Median :  0.0000              
#>  Mean   :  0.5849              
#>  3rd Qu.:  0.0000              
#>  Max.   :275.0000              
#> 

It seems our data no longer has any missing or unknown values.

Handling Abnormal Values

If we look closely, there are some counter-intuitive conditions, where someone has no housing loan, no personal loan, and no credit default, but still has a negative balance. As long as the person has a job, their balance should normally be higher than zero, or at least close to the average balance.

test <- bank_clean %>% 
  filter(balance < 0,
         housing == "no",
         loan == "no",
         default == "no")

rmarkdown::paged_table(test)

As we observed, there are 475 rows with this abnormal value; we will remove all of them.

bank_clean <- bank_clean %>% 
  filter(!(balance < 0 & housing == "no" & loan == "no" & default == "no"))

Another thing to note is that, generally, a bank will be careful about offering its financial products to someone with a bad financial record. We will remove anyone who has a term deposit but at the same time has a credit default, a negative balance, a personal loan, and a housing loan.

bank_clean <- bank_clean %>% 
  filter(!(y == "yes" & balance < 0 & housing == "yes" & loan == "yes" & default == "yes"))

Handling Outliers

Speaking of outliers, it seems there are some outliers in the balance column. If we look at the histogram below, the distribution of the data is extremely unbalanced (the data is right-skewed). To handle this, we will use the interquartile range to remove the outliers in this column.

hist(bank_clean$balance, breaks = 20)

The idea behind the interquartile range is to remove any value that falls outside the lower and upper inner fences. To get the fences, first we need the first quartile (Q1) and third quartile (Q3) of the data, which we can obtain with the quantile() function. Then we subtract Q1 from Q3 to get the interquartile range (IQR), within which 50% of the data lies. Finally, the lower inner fence is Q1 minus 1.5 times the IQR, and the upper inner fence is Q3 plus 1.5 times the IQR. Any value bigger than the upper fence or smaller than the lower fence is treated as an outlier.

# Find Q1 and Q3
Q1 <- quantile(bank_clean$balance, probs = 0.25)
Q3 <- quantile(bank_clean$balance, probs = 0.75)

# Calculate interquartile range
interquartile_range <- Q3 - Q1

# Calculate lower and upper inner fences
lower_inner_fences <- Q1 - interquartile_range*1.5
upper_inner_fences <- Q3 + interquartile_range*1.5
# Filtering data
bank_clean <- bank_clean %>% 
  filter(balance >= lower_inner_fences,
         balance <= upper_inner_fences)

summary(bank_clean)
#>       age                 job           marital          education    
#>  Min.   :18.00   blue-collar:8439   divorced: 4518   primary  : 6061  
#>  1st Qu.:32.00   management :7832   married :22804   secondary:20850  
#>  Median :38.00   technician :6538   single  :10908   tertiary :11319  
#>  Mean   :40.45   admin.     :4510                                     
#>  3rd Qu.:48.00   services   :3658                                     
#>  Max.   :95.00   retired    :1790                                     
#>                  (Other)    :5463                                     
#>  default        balance        housing      loan            day       
#>  no :37464   Min.   :-1944.0   no :16139   no :31516   Min.   : 1.00  
#>  yes:  766   1st Qu.:   52.0   yes:22091   yes: 6714   1st Qu.: 8.00  
#>              Median :  354.0                           Median :16.00  
#>              Mean   :  647.7                           Mean   :15.76  
#>              3rd Qu.:  989.0                           3rd Qu.:21.00  
#>              Max.   : 3462.0                           Max.   :31.00  
#>                                                                       
#>      month          duration         campaign          pdays       
#>  may    :12119   Min.   :   0.0   Min.   : 1.000   Min.   : -1.00  
#>  jul    : 6039   1st Qu.: 103.0   1st Qu.: 1.000   1st Qu.: -1.00  
#>  aug    : 5284   Median : 179.0   Median : 2.000   Median : -1.00  
#>  jun    : 4309   Mean   : 256.5   Mean   : 2.766   Mean   : 40.77  
#>  nov    : 2961   3rd Qu.: 316.0   3rd Qu.: 3.000   3rd Qu.: -1.00  
#>  apr    : 2484   Max.   :3881.0   Max.   :58.000   Max.   :871.00  
#>  (Other): 5034                                                     
#>     previous          y        
#>  Min.   :  0.0000   no :33954  
#>  1st Qu.:  0.0000   yes: 4276  
#>  Median :  0.0000              
#>  Mean   :  0.5764              
#>  3rd Qu.:  0.0000              
#>  Max.   :275.0000              
#> 

Our balance column now has a better distribution. We will retain the values in the other columns, because those values are legitimate and not outliers.

Check for Target Class Proportion

Before we build our models, we need to check the target class proportion. This is important to reduce bias in our model results, because otherwise the model will only be good at predicting the majority class.

table(bank_clean$y)
#> 
#>    no   yes 
#> 33954  4276

As we can see, our target class proportion is not balanced yet. To handle this, we will up-sample the minority class using the upSample() function until both classes have the same proportion.

bank_clean_up <- upSample(x = bank_clean %>% select(-y), y = bank_clean$y, yname = "y")
table(bank_clean_up$y)
#> 
#>    no   yes 
#> 33954 33954

Our target variable now has balanced proportions.

NAIVE BAYES MODEL

Naive Bayes is a supervised machine learning algorithm for classification based on Bayes' theorem. Bayes' theorem works by updating our prior belief with new information. If you want to know more about the Naive Bayes algorithm, you can visit this link.
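As a quick illustration of the theorem on this dataset (a sketch using only the marital column as the evidence), P(yes | married) = P(married | yes) * P(yes) / P(married) can be computed by hand:

# Bayes' theorem by hand for a single predictor (sketch)
p_yes <- mean(bank_clean$y == "yes")
p_married_given_yes <- mean(bank_clean$marital[bank_clean$y == "yes"] == "married")
p_married <- mean(bank_clean$marital == "married")
p_married_given_yes * p_yes / p_married # P(y = "yes" | marital = "married")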

FEATURE ENGINEERING

Naive Bayes works well when the predictors are categorical, because the model uses predictor class probabilities to make the final prediction. From our data, we can create a new categorical column from the age column.

bank_naive <- bank_clean_up
# Build function to bin age into ranges
p <- function(x){
  if (x >= 18 & x <= 25 ){
    y <- "18 - 25"
  }
  else if (x >= 26 & x <= 30){
    y <- "26 - 30"
  }
  else if (x >= 31 & x <= 35){
    y <- "31 - 35"
  }
  else if (x >= 36 & x <= 40){
    y <- "36 - 40"
  }
  else if (x >= 41 & x <= 45){
    y <- "41 - 50"
  }
  else if (x >= 46 & x <= 50){
    y <- "46 - 50"
  }
  else if (x >= 51 & x <= 55){
    y <- "51 - 55"
  }
  else if (x >= 56 & x <= 60){
    y <- "56 - 60"
  }
  else if (x >= 61 & x <= 65){
    y <- "61 - 65"
  }
  else if (x >= 66 & x <= 70){
    y <- "66 - 70"
  }
  else if (x >= 71 & x <= 75){
    y <- "71 - 75"
  }
  else if (x >= 76 & x <= 80){
    y <- "76 - 80"
  }
  else if (x >= 81 & x <= 85){
    y <- "81 - 85"
  }
  else if (x >= 86 & x <= 90){
    y <- "86 - 90"
  }
  else {
    y <- "91 - 95"
  }
  
  return(y)
}

# Apply function
bank_naive$age_range <- as.factor(sapply(bank_naive$age, FUN = p))

# Remove Age column from data set
bank_naive <- bank_naive %>% select(-age)
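As a side note, the same binning can be written more concisely with base R's cut(); a minimal sketch, assuming the same break points as the p() function above:

# Equivalent binning with cut() (sketch; breaks mirror the bins in p())
breaks <- c(17, 25, seq(30, 95, by = 5))
labels <- c("18 - 25", paste(seq(26, 91, by = 5), "-", seq(30, 95, by = 5)))
bank_naive$age_range <- cut(bank_clean_up$age, breaks = breaks, labels = labels)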

Data Exploration

The Naive Bayes model assumes that every predictor is equally important and that the predictors are independent of each other, i.e., no multicollinearity (high correlation between predictors). So before we build the model, we need to check the correlation between predictors.

Numeric-Numeric Correlation

Checking the correlation between numeric columns.

ggcorr(bank_naive, label = T, label_size = 3, hjust = 1)

From the plot above, we can say there is no high correlation between the numeric predictors.
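For categorical-categorical association, the chisq.test() function from the already-loaded stats package can be used in a similar spirit; a sketch on two of our predictors:

# Chi-square test of independence between two categorical predictors (sketch)
chisq.test(table(bank_naive$job, bank_naive$education))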

Cross Validation

We'll split our data into 80% training data and 20% test data.

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- sample(nrow(bank_naive), nrow(bank_naive)*0.8)

naive_train <- bank_naive[index,]
naive_test <- bank_naive[-index,]

There are two ways to build a Naive Bayes model:

  1. Using the formula interface, naiveBayes(formula, data)
  • formula: a formula of the form y ~ x, where y is the target and x the predictors
  • data: the data frame
  2. Using the x/y interface, naiveBayes(x, y)
  • x: the predictor variables
  • y: the target variable
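For instance, on our training data the two calls below would produce equivalent models (a sketch; only the interface differs):

# The same model through both interfaces (sketch)
m_formula <- naiveBayes(y ~ ., data = naive_train)
m_xy <- naiveBayes(x = naive_train %>% select(-y), y = naive_train$y)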

We will build two Naive Bayes models to see which one performs better: our first model will use only the categorical predictors, while the second one will use all columns (except y) as predictors.

First Model

build model

Using only categorical predictors

first_naive <- naiveBayes(y ~ job + marital + education + default + housing + loan + month + age_range, data = naive_train, laplace = 1)

The laplace argument is a method to avoid zero-frequency classes in a predictor. For example, look at the table below:

new_table <- matrix(c(130, 0, 127, 143), nrow = 2, dimnames = list(c("Married", "Divorced"), c("Subscribe", "not-subscribe")))

new_table
#>          Subscribe not-subscribe
#> Married        130           127
#> Divorced         0           143
new_table %>% prop.table(margin = 1)
#>          Subscribe not-subscribe
#> Married  0.5058366     0.4941634
#> Divorced 0.0000000     1.0000000

Say we take the marital and y columns and tabulate them in a frequency table. From the table, we see that no one who is divorced subscribed to the term deposit. If this kind of situation occurs in our data, then no matter what the other columns say, the model will always assume the customer will not buy the product if they are divorced. To avoid this, we can use Laplace smoothing to add 1 observation to every cell, including divorced people who buy the deposit. The frequencies then become:

new_table <- matrix(c(130, 0, 127, 143) + 1, nrow = 2, dimnames = list(c("Married", "Divorced"), c("Subscribe", "not-subscribe")))

new_table
#>          Subscribe not-subscribe
#> Married        131           128
#> Divorced         1           144
new_table %>% prop.table(margin = 1)
#>            Subscribe not-subscribe
#> Married  0.505791506     0.4942085
#> Divorced 0.006896552     0.9931034

As we can see, the proportion of divorced people who buy the term deposit is no longer 0.

model evaluation

naive_pred <- predict(first_naive, naive_test)
(conf_matrix_naive <- table(naive_pred, naive_test$y))
#>           
#> naive_pred   no  yes
#>        no  4886 2668
#>        yes 1927 4101

As we observed, the model correctly predicted 4886 customers as not buying and 4101 customers as buying the term deposit; it also wrongly predicted 2668 customers as not buying (they actually bought) and 1927 customers as buying (they actually did not). Note that this model's threshold is 0.5, meaning that anyone whose predicted probability of buying is 0.5 or lower is classified as not buying.

To get more information about our first Naive Bayes model, let's use the confusionMatrix() function to look further into the result.

confusionMatrix(conf_matrix_naive)
#> Confusion Matrix and Statistics
#> 
#>           
#> naive_pred   no  yes
#>        no  4886 2668
#>        yes 1927 4101
#>                                                
#>                Accuracy : 0.6617               
#>                  95% CI : (0.6537, 0.6696)     
#>     No Information Rate : 0.5016               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.3231               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.7172               
#>             Specificity : 0.6059               
#>          Pos Pred Value : 0.6468               
#>          Neg Pred Value : 0.6803               
#>              Prevalence : 0.5016               
#>          Detection Rate : 0.3597               
#>    Detection Prevalence : 0.5562               
#>       Balanced Accuracy : 0.6615               
#>                                                
#>        'Positive' Class : no                   
#> 

The metrics we must pay attention to are Accuracy, Sensitivity/Recall, and Pos Pred Value/Precision. Accuracy tells us how often the model makes a true prediction ("yes" predicted as "yes", and "no" predicted as "no") over all data; the formula is (TP + TN) / Total. Sensitivity/Recall represents how often the model correctly predicts the positive class out of all actual positives; the formula is TP / (TP + FN). Pos Pred Value/Precision tells us how many of the model's positive predictions are actually positive; the formula is TP / (TP + FP).

To get a better understanding of TP, TN, FP, and FN, see the definitions below:

  • TN means it's actually negative and the model predicted it as negative as well
  • FN means it's actually positive, but the model predicted it as negative
  • FP means it's actually negative, but the model predicted it as positive
  • TP means it's actually positive and the model predicted it as positive as well
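With these definitions, the numbers reported by confusionMatrix() above can be verified by hand (note that the positive class there is "no"):

# Recomputing the metrics from the confusion matrix (positive class = "no")
TP <- 4886; TN <- 4101; FP <- 2668; FN <- 1927
c(accuracy  = (TP + TN) / (TP + TN + FP + FN), # ~0.6617
  recall    = TP / (TP + FN),                  # ~0.7172
  precision = TP / (TP + FP))                  # ~0.6468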

So our first model's summary is:

  • Accuracy of 66%
  • Sensitivity of 71% (how often the model correctly predicts the positive class out of all actual positives)
  • Precision (Pos Pred Value) of 65% (how many of the model's positive predictions are actually positive)

We will follow the exact same steps for the second model.

Second Model

build model

# Use all columns as predictors
second_naive <- naiveBayes(y ~ ., data = naive_train, laplace = 1)

model evaluation

naive_pred <- predict(second_naive, naive_test)
(conf_matrix_naive <- table(naive_pred, naive_test$y))
#>           
#> naive_pred   no  yes
#>        no  5360 1450
#>        yes 1453 5319

It seems the model using all columns as predictors performs better than our first model; we can tell from how many true predictions it makes compared to the first model. To make sure, we will use confusionMatrix() to see the exact performance.

confusionMatrix(conf_matrix_naive)
#> Confusion Matrix and Statistics
#> 
#>           
#> naive_pred   no  yes
#>        no  5360 1450
#>        yes 1453 5319
#>                                              
#>                Accuracy : 0.7863             
#>                  95% CI : (0.7793, 0.7931)   
#>     No Information Rate : 0.5016             
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.5725             
#>                                              
#>  Mcnemar's Test P-Value : 0.9704             
#>                                              
#>             Sensitivity : 0.7867             
#>             Specificity : 0.7858             
#>          Pos Pred Value : 0.7871             
#>          Neg Pred Value : 0.7854             
#>              Prevalence : 0.5016             
#>          Detection Rate : 0.3946             
#>    Detection Prevalence : 0.5014             
#>       Balanced Accuracy : 0.7863             
#>                                              
#>        'Positive' Class : no                 
#> 

Our second model has higher Accuracy, Sensitivity, and Pos Pred Value. Even so, this result should not satisfy us: there are still more than 20% false predictions, which means more than a thousand customers who would actually buy the term deposit are predicted as not buying, and vice versa; this is costly for the bank.

Another way to do model evaluation is to look at the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the Curve). The ROC curve plots the true positive rate (TPR, or Sensitivity) against the false positive rate (FPR, or 1 - Specificity). ROC is a probability curve, and AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes. The closer the curve reaches the upper left of the plot (true positive rate is high while false positive rate is low), the better our model is; the higher the AUC score, the better our model separates the target classes.

Let’s build our model ROC.

# We use raw as the type argument to get class probabilities
naive_prob <- predict(second_naive, naive_test, type = "raw")

# Take only "yes" class
naive_roc <- data.frame(prediction=round(naive_prob[,2],4),
                      trueclass=as.numeric(naive_test$y == "yes"))

head(naive_roc)
#>   prediction trueclass
#> 1     0.0299         0
#> 2     0.1358         0
#> 3     0.1142         0
#> 4     0.0661         0
#> 5     0.0341         0
#> 6     0.2248         0
naive_roc <- ROCR::prediction(naive_roc$prediction, naive_roc$trueclass)

plot(performance(naive_roc, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1)

# AUC
auc_ROCR_n <- performance(naive_roc, measure = "auc")
auc_ROCR_n <- auc_ROCR_n@y.values[[1]]
auc_ROCR_n
#> [1] 0.8600987

DECISION TREE

A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It is a flowchart-like structure where each internal node represents a test on a feature, each branch represents an outcome of the test, and each leaf node represents a class label or a value. To get a better understanding of decision trees, consider the following example:

Let's take the example of the weather in some area, where we want to predict whether it will rain based on the weather conditions. Say we already have a decision tree model and data about today's conditions. If today's condition is overcast, there is a 100% probability of rain. And what if today is sunny? The model will look at another condition: if the humidity is low, then today has a low probability of rain; if it is high, it will rain, and it is better to bring an umbrella before going out.

How does a tree choose its root, split its branches, and end up with leaf nodes?

A tree will choose to split the data in such a way that the resulting nodes contain data points of the same class as much as possible (homogeneous). One measurement of homogeneity/purity within groups is entropy, the measure of disorder. Entropy near 0 means most of the observations fall within the same class (homogeneous); entropy near 1 is the other way around (heterogeneous).

A decision tree is built in a top-down fashion. The root node will be chosen from the variable and conditional rule that give the highest information gain. The root will then be partitioned (split) into nodes, with each partition having a different entropy. For more reference about the entropy calculation, you can read this article. The difference in entropy before and after splitting is called the information gain; the tree will prefer to perform splits using variables and rules that result in a higher information gain (from a high entropy into a lower entropy).
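To make this concrete, entropy and information gain can be computed by hand in a few lines; a sketch, where the split on the housing column is just an arbitrary example:

# Entropy of a class vector (sketch)
entropy <- function(labels) {
  p <- prop.table(table(labels))
  -sum(p[p > 0] * log2(p[p > 0]))
}

# Information gain of splitting y by housing loan status
parent   <- entropy(bank_clean_up$y)
children <- sapply(split(bank_clean_up$y, bank_clean_up$housing), entropy)
weights  <- prop.table(table(bank_clean_up$housing))
parent - sum(weights * children) # information gain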

There are certain characteristics of the decision tree model:

  • performs well on both numerical and categorical variables.

  • all predictors are assumed to interact.

  • quite robust to the problem of multicollinearity. A decision tree will choose a variable that has the highest information gain in one split, whereas a method such as logistic regression would have used both.

  • robust and insensitive to outliers. Splitting will happen at a condition where it maximizes the homogeneity within resulting groups. Outliers will have little influence on the splitting process.

Some drawbacks of the decision tree model are described below:

  • prone to overfitting. A tree should know when to stop growing, or it will partition every single observation possible and result in an overfitting model. To overcome that problem, we can perform pre-pruning (determining tree depth before model building) or post-pruning (letting the tree grow and pruning it later, after important classification patterns have been discovered). We can set some parameters:
    • mincriterion: The value of the test statistic (1 - p-value) that must be exceeded in order to implement a split. For example, when mincriterion is 0.95, the p-value must be smaller than 0.05 in order for a node to split. This can also act as a “regulator” for the depth of the tree. The higher the mincriterion, the harder it is to perform splitting, thus generate a smaller tree.
    • minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted. Default to 20.
    • minbucket: the minimum number of observations in any leaf node. Default to round(minsplit/3).
    • maxdepth: Set the maximum depth of any node of the final tree, with the root node counted as depth 0. Default to 30.
  • is a greedy algorithm which picks the attribute that results in the highest information gain, a strategy that converges to local optima but does not guarantee global optimality. Some efforts to overcome this problem are:
    • k-steps look-ahead trees
    • dual information distance (DID) as the splitting criterion
    • etc.

As with the Naive Bayes model, we will build more than one model and see which one performs better at predicting the data.

Cross Validation

Split the data into 80% training data and 20% test data.

RNGkind(sample.kind = "Rounding")
set.seed(100)

# bank_naive has the same number of rows as bank_clean_up, so sample from it directly
index <- sample(nrow(bank_naive), nrow(bank_naive)*0.8)

tree_train <- bank_naive[index,]
tree_test <- bank_naive[-index,]

First Model

Our first model will use all columns as predictors, and we are not going to do any model tuning. Even though we are not specifying any parameters, by default the decision tree model will use these settings:

  • Mincriterion: 0.95
  • Minsplit: 20
  • Minbucket: 7

So, let's make our first model.

model building

# Build first decision tree model
first_tree <- ctree(formula = y ~ ., data = tree_train)
# Doing prediction
first_tree_predict <- predict(first_tree, tree_test, type = "response")

model evaluation

confusionMatrix(first_tree_predict, tree_test$y, positive = "yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  5630  341
#>        yes 1183 6428
#>                                                
#>                Accuracy : 0.8878               
#>                  95% CI : (0.8824, 0.8931)     
#>     No Information Rate : 0.5016               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.7757               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.9496               
#>             Specificity : 0.8264               
#>          Pos Pred Value : 0.8446               
#>          Neg Pred Value : 0.9429               
#>              Prevalence : 0.4984               
#>          Detection Rate : 0.4733               
#>    Detection Prevalence : 0.5604               
#>       Balanced Accuracy : 0.8880               
#>                                                
#>        'Positive' Class : yes                  
#> 

Looking at the Accuracy, Sensitivity, and Pos Pred Value, we can say our first model performs well. Even so, we will try tuning the model parameters in our second model. Our aim is to get as high an accuracy as possible. Why accuracy? If you remember the definition of accuracy mentioned before: both of our target classes are equally important for the bank when designing its strategy for approaching customers. The higher the accuracy, the more true predictions the model makes, or in other words, the better the bank can craft a proper and effective strategy.

Second Model

model building

Our second model will use this parameter setting:

  • Mincriterion of 0.2
  • Minsplit of 25
  • Minbucket of 5
# Build model
second_tree <- ctree(formula = y ~ ., data = tree_train,
                     control = ctree_control(mincriterion = 0.2,
                                             minsplit = 25,
                                             minbucket = 5))

model evaluation

predict_tuning <- predict(second_tree, tree_test, type = "response")

confusionMatrix(predict_tuning, tree_test$y, positive = "yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  5903  387
#>        yes  910 6382
#>                                                
#>                Accuracy : 0.9045               
#>                  95% CI : (0.8994, 0.9094)     
#>     No Information Rate : 0.5016               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.8091               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.9428               
#>             Specificity : 0.8664               
#>          Pos Pred Value : 0.8752               
#>          Neg Pred Value : 0.9385               
#>              Prevalence : 0.4984               
#>          Detection Rate : 0.4699               
#>    Detection Prevalence : 0.5369               
#>       Balanced Accuracy : 0.9046               
#>                                                
#>        'Positive' Class : yes                  
#> 

As we observed, after tuning the parameters our second model achieves an even higher accuracy than our first model.

Let's make the ROC plot.

# We use prob as the type argument to get class probabilities
tree_prob <- predict(second_tree, tree_test, type = "prob")

# Take only "yes" class
tree_roc <- data.frame(prediction=round(tree_prob[,2],4),
                       trueclass=as.numeric(tree_test$y == "yes"))

head(tree_roc)
#>    prediction trueclass
#> 3           0         0
#> 4           0         0
#> 8           0         0
#> 12          0         0
#> 19          0         0
#> 20          0         0
tree_roc <- ROCR::prediction(tree_roc$prediction, tree_roc$trueclass)

plot(performance(tree_roc, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1)

# AUC
auc_ROCR_d <- performance(tree_roc, measure = "auc")
auc_ROCR_d <- auc_ROCR_d@y.values[[1]]
auc_ROCR_d
#> [1] 0.9480649

RANDOM FOREST

Imagine we have a really complex problem. To solve it, we ask many experts in the field to help us find the answer. After they each give their answer, a vote is held to choose the best answer to the problem.

This analogy fits random forest perfectly. Many decision trees are built, each using a random subset of the data and predictors. Like an expert, each decision tree makes a classification prediction, and then the model chooses the majority answer (if we use random forest for regression, it averages all the answers instead). So Random Forest is an ensemble-based algorithm built on the decision tree method and known for its versatility and performance. An ensemble-based algorithm is a hybrid of several machine learning techniques combined into one predictive model, built to reduce error and bias and to improve predictions. Building a Random Forest model consists of several steps:

  • Perform Bootstrap Sampling. Create subsets of the training data through random sampling with replacement, to train multiple predictive models (in this case many decision trees).
  • Build Decision Trees. Train a bunch of decision tree models, one for each data subset made by bootstrap sampling. The mtry argument randomly chooses the predictors considered at each split (automatic feature selection).
  • Make Predictions. Each decision tree model makes its own prediction.
  • Aggregate. Combine the predictions into one. For classification, the model uses majority voting, while for regression it uses the average of the predictions.

Speaking of random selection of observations and variables, there is a technique called K-fold Cross-Validation. This technique performs cross-validation by splitting the data into k equally sized bins (folds). In each iteration, the model holds out one bin as the test data and trains on the remaining bins; the process is repeated k times (once per fold). This way every observation gets the chance to be used as both training and test data, which may also help overcome the overfitting problem of the decision tree model.
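As an illustration (a sketch; caret's createFolds() is one way to generate the bins):

# Generating 5 folds with caret (sketch)
folds <- createFolds(tree_train$y, k = 5)
sapply(folds, length) # each fold holds roughly 1/5 of the rows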

To get a better understanding of how random forest works, let's build a random forest model.

Model Building

#set.seed(417)

# Determining k-fold cross-validation
#ctrl <- trainControl(method = "repeatedcv",
#                      number = 5, # k-fold
#                      repeats = 3) # repetition


# Train random forest model
#fb_forest <- train(y ~ .,
#                    data = tree_train,
#                    method = "rf", # random forest
#                    trControl = ctrl)

# To avoid re-training every time, the fitted model was saved to a file
#saveRDS(fb_forest, "bank_forest.RDS")
# Read RDS file
bank_forest <- readRDS("bank_forest.RDS")

Model Evaluation

bank_forest$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 25
#> 
#>         OOB estimate of  error rate: 3.32%
#> Confusion matrix:
#>        no   yes  class.error
#> no  25353  1788 0.0658781917
#> yes    13 27172 0.0004782049

Let's compute our model's accuracy from the OOB confusion matrix above.

TP <- 27172
TN <- 25353
FP <- 1788
FN <- 13

total <- TP + TN + FP + FN

accuracy <- (TP + TN) / total
accuracy
#> [1] 0.9668483

Our trained model has incredibly high accuracy compared to the other models we built before. But this number can drop when predicting new data (tree_test), so we will check the confusion matrix to see the model's performance on unseen data.

bank_predict_forest <- predict(bank_forest, newdata = tree_test)

conf_matrix <- confusionMatrix(bank_predict_forest, tree_test$y, positive = "yes")

conf_matrix
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  6392    6
#>        yes  421 6763
#>                                                
#>                Accuracy : 0.9686               
#>                  95% CI : (0.9655, 0.9714)     
#>     No Information Rate : 0.5016               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.9371               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.9991               
#>             Specificity : 0.9382               
#>          Pos Pred Value : 0.9414               
#>          Neg Pred Value : 0.9991               
#>              Prevalence : 0.4984               
#>          Detection Rate : 0.4979               
#>    Detection Prevalence : 0.5289               
#>       Balanced Accuracy : 0.9687               
#>                                                
#>        'Positive' Class : yes                  
#> 

It even has slightly higher accuracy than the OOB estimate from training, and this accuracy is the highest among all the models we have built so far.

Make the ROC plot.

# We use prob as the type argument to get class probabilities
forest_prob <- predict(bank_forest, tree_test, type = "prob")

# Take only "yes" class
forest_roc <- data.frame(prediction=round(forest_prob[,2],4),
                       trueclass=as.numeric(tree_test$y == "yes"))

head(forest_roc)
#>   prediction trueclass
#> 1      0.010         0
#> 2      0.000         0
#> 3      0.000         0
#> 4      0.002         0
#> 5      0.008         0
#> 6      0.004         0
forest_roc <- ROCR::prediction(forest_roc$prediction, forest_roc$trueclass)

plot(performance(forest_roc, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1)

# AUC
auc_ROCR_f <- performance(forest_roc, measure = "auc")
auc_ROCR_f <- auc_ROCR_f@y.values[[1]]
auc_ROCR_f
#> [1] 0.9997921

Our ROC curve is excellent (the area under the curve is almost 100%); this is the best ROC curve compared to the other models.

CONCLUSION

  • Of all the models we built, random forest performs best.
  • Since our target classes are equally important, the confusionMatrix metric we rely on is Accuracy. Accuracy represents how good the model is at making true predictions (True Positives and True Negatives).
  • Our random forest model has an accuracy of 96.86% and an AUC of 0.9998.
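As a recap, the AUC values computed above can be collected side by side; a small sketch reusing the variables from this notebook:

# Side-by-side AUC comparison (values computed earlier in this notebook)
data.frame(model = c("Naive Bayes", "Decision Tree", "Random Forest"),
           AUC = round(c(auc_ROCR_n, auc_ROCR_d, auc_ROCR_f), 4))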