Machine learning to predict whether a customer will buy a term deposit or not
A deposit is one of the bank products people commonly like to hold, and one of its types is the term deposit. This type of investment is offered by banks and financial institutions: when someone buys this product, they deposit a specific amount of money with the bank for a period of time. During this period, the deposited amount earns a fixed interest rate, which is usually higher than the rate offered on a regular savings account.
In this analysis, we will predict which customers will buy a term deposit based on their characteristics. The data used in this prediction comes from the UC Irvine Machine Learning Repository and describes the telemarketing campaigns of a Portuguese bank. The models we will use are Naive Bayes, Decision Tree and Random Forest, and we will then compare all of them to see which one performs best.
IMPORT LIBRARY
Importing all necessary libraries
library(dplyr)
library(lubridate)
library(e1071) # package for Naive Bayes
library(caret) # package for resampling train data and k-fold cross-validation
library(partykit) #package for ctree function
library(ROCR) #Package for AUC
library(randomForest)
library(GGally)
library(stats) # MANOVA & chi square
library(coin)
library(rattle)
READ DATA
bank <- read.csv("bank-full.csv", sep = ";", stringsAsFactors = T)
rmarkdown::paged_table(bank)
Column description:
- Age (numeric)
- Job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”,“self-employed”,“retired”,“technician”,“services”)
- Marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)
- Education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
- Default: has credit in default? (binary: “yes”,“no”)
- Balance: average yearly balance, in euros (numeric)
- Housing: has housing loan? (binary: “yes”,“no”)
- Loan: has personal loan? (binary: “yes”,“no”)
Related to the last contact of the current campaign:
- Contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)
- Day: last contact day of the month (numeric)
- Month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
- Duration: last contact duration, in seconds (numeric)
Other attributes:
- Campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- Pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
- Previous: number of contacts performed before this campaign and for this client (numeric)
- Poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)
Output variable (desired target):
- y - has the client subscribed a term deposit? (binary: “yes”,“no”)
DATA PREPROCESSING
Data pre-processing is where we prepare our raw data before doing analysis or machine learning, so we can be sure of its quality, consistency and compatibility. The steps done here include data coercion, handling missing values and handling abnormal values.
Data Coercion
glimpse(bank)
#> Rows: 45,211
#> Columns: 17
#> $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
#> $ job <fct> management, technician, entrepreneur, blue-collar, unknown, …
#> $ marital <fct> married, single, married, married, single, married, single, …
#> $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
#> $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
#> $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
#> $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
#> $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
#> $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
#> $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
#> $ month <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
#> $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
#> $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
#> $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
#> $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
Our data is already in the appropriate data types, so we don't need to do any data coercion here.
Handling Missing Values
colSums(is.na(bank))
#> age job marital education default balance housing loan
#> 0 0 0 0 0 0 0 0
#> contact day month duration campaign pdays previous poutcome
#> 0 0 0 0 0 0 0 0
#> y
#> 0
No missing values were detected in our data. But we need to look deeper to check whether there are any unusual values that could act as NA/missing values.
summary(bank)
#> age job marital education
#> Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851
#> 1st Qu.:33.00 management :9458 married :27214 secondary:23202
#> Median :39.00 technician :7597 single :12790 tertiary :13301
#> Mean :40.94 admin. :5171 unknown : 1857
#> 3rd Qu.:48.00 services :4154
#> Max. :95.00 retired :2264
#> (Other) :6835
#> default balance housing loan contact
#> no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
#> yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
#> Median : 448 unknown :13020
#> Mean : 1362
#> 3rd Qu.: 1428
#> Max. :102127
#>
#> day month duration campaign
#> Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
#> 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
#> Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
#> Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
#> 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
#> Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
#> (Other): 6060
#> pdays previous poutcome y
#> Min. : -1.0 Min. : 0.0000 failure: 4901 no :39922
#> 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 yes: 5289
#> Median : -1.0 Median : 0.0000 success: 1511
#> Mean : 40.2 Mean : 0.5803 unknown:36959
#> 3rd Qu.: -1.0 3rd Qu.: 0.0000
#> Max. :871.0 Max. :275.0000
#>
As we assumed before, if we look closely at the education, contact, job and poutcome columns, there is an unusual value called unknown. This could mean the bank didn't know the value at the time or forgot to input the data. To handle this we can do a number of things, like removing all rows that contain the missing value, deleting the entire column, or imputing it manually if we know the exact value. Here we certainly don't know the exact value, so we only have two options: remove the rows with the missing value in them, or remove the entire column.
We can remove rows with missing values when they amount to no more than 5% of the total rows. So we can remove the rows with unknown values from the job and education columns, while for the poutcome and contact columns there are too many unknown values, so we remove those two columns entirely rather than lose a large share of the rows.
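To put a number on that 5% rule, here is a small verification sketch (not part of the original analysis, just a quick check) computing the share of unknown values in the relevant columns:
# Share of "unknown" values per column, as a percentage of all rows
sapply(bank[, c("job", "education", "contact", "poutcome")], function(col) {
  round(100 * mean(col == "unknown"), 2)
})
Per the summary above, education's unknown share sits around 4% (under the threshold), while contact and poutcome are far above it.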
# Remove poutcome and contact columns
bank_clean <- bank %>%
select(-c(poutcome, contact)) %>%
# Remove all rows with unknown value from job and education column
filter(job != "unknown",
       education != "unknown")
The job and education columns have the factor data type, which means unknown is one of their levels. Before we do any data processing or exploration, we need to drop these unused levels to avoid bad results in our analysis.
# Drop unused levels in education column
bank_clean$education <- droplevels(bank_clean$education)
# Drop unused levels in job column
bank_clean$job <- droplevels(bank_clean$job)
summary(bank_clean)
#> age job marital education
#> Min. :18.00 blue-collar:9278 divorced: 5028 primary : 6800
#> 1st Qu.:33.00 management :9216 married :25946 secondary:23131
#> Median :39.00 technician :7355 single :12219 tertiary :13262
#> Mean :40.76 admin. :5000
#> 3rd Qu.:48.00 services :4004
#> Max. :95.00 retired :2145
#> (Other) :6195
#> default balance housing loan day
#> no :42411 Min. : -8019 no :18901 no :36086 Min. : 1.00
#> yes: 782 1st Qu.: 71 yes:24292 yes: 7107 1st Qu.: 8.00
#> Median : 442 Median :16.00
#> Mean : 1354 Mean :15.81
#> 3rd Qu.: 1412 3rd Qu.:21.00
#> Max. :102127 Max. :31.00
#>
#> month duration campaign pdays
#> may :13192 Min. : 0.0 Min. : 1.000 Min. : -1.0
#> jul : 6601 1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.: -1.0
#> aug : 6037 Median : 180.0 Median : 2.000 Median : -1.0
#> jun : 4980 Mean : 258.3 Mean : 2.758 Mean : 40.4
#> nov : 3842 3rd Qu.: 318.0 3rd Qu.: 3.000 3rd Qu.: -1.0
#> apr : 2820 Max. :4918.0 Max. :58.000 Max. :871.0
#> (Other): 5721
#> previous y
#> Min. : 0.0000 no :38172
#> 1st Qu.: 0.0000 yes: 5021
#> Median : 0.0000
#> Mean : 0.5849
#> 3rd Qu.: 0.0000
#> Max. :275.0000
#>
Our data now has no missing or unknown values.
Handling Abnormal Values
If we look closely, there are some counter-intuitive cases where someone has no housing loan, no personal loan and no credit default, yet has a negative balance. As long as the person has a job, their balance should normally be above zero, or at least around the average balance.
test <- bank_clean %>%
filter(balance < 0,
housing == "no",
loan == "no",
default == "no")
rmarkdown::paged_table(test)
As we observed, there are 475 rows with abnormal values; we will remove all of them.
bank_clean <- bank_clean %>%
  filter(!(balance < 0 & housing == "no" & loan == "no" & default == "no"))
Another thing to note: banks are generally careful about offering financial products to someone with a bad financial record. We will remove anyone who has a term deposit but at the same time has a credit default, a negative balance, a personal loan and a housing loan.
bank_clean <- bank_clean %>%
  filter(!(y == "yes" & balance < 0 & housing == "yes" & loan == "yes" & default == "yes"))
Handling Outliers
Speaking of outliers, there seem to be some in the balance column. If we look at the histogram plot below, the distribution is extremely unbalanced (the data is right-skewed). To handle this, we will use the interquartile range to remove the outliers in this column.
hist(bank_clean$balance, breaks = 20)
The idea behind the interquartile range is to remove any value that lies beyond the upper and lower inner fences. To get the fences, first we need the first quartile (Q1) and the third quartile (Q3) of the data, which we can obtain with the quantile() function. Then we subtract Q1 from Q3 to get the interquartile range (IQR); here lies the middle 50% of the data. Finally, subtract 1.5 times the IQR from Q1 to get the lower inner fence, and add 1.5 times the IQR to Q3 to get the upper inner fence. Any value bigger than the upper fence or smaller than the lower fence is assumed to be an outlier.
# Find Q1 and Q3
Q1 <- quantile(bank_clean$balance, probs = 0.25)
Q3 <- quantile(bank_clean$balance, probs = 0.75)
# Calculate interquartile range
interquartile_range <- Q3 - Q1
# Calculate lower and upper inner fences
lower_inner_fences <- Q1 - interquartile_range*1.5
upper_inner_fences <- Q3 + interquartile_range*1.5
# Filtering data
bank_clean <- bank_clean %>%
filter(balance >= lower_inner_fences,
         balance <= upper_inner_fences)
summary(bank_clean)
#> age job marital education
#> Min. :18.00 blue-collar:8439 divorced: 4518 primary : 6061
#> 1st Qu.:32.00 management :7832 married :22804 secondary:20850
#> Median :38.00 technician :6538 single :10908 tertiary :11319
#> Mean :40.45 admin. :4510
#> 3rd Qu.:48.00 services :3658
#> Max. :95.00 retired :1790
#> (Other) :5463
#> default balance housing loan day
#> no :37464 Min. :-1944.0 no :16139 no :31516 Min. : 1.00
#> yes: 766 1st Qu.: 52.0 yes:22091 yes: 6714 1st Qu.: 8.00
#> Median : 354.0 Median :16.00
#> Mean : 647.7 Mean :15.76
#> 3rd Qu.: 989.0 3rd Qu.:21.00
#> Max. : 3462.0 Max. :31.00
#>
#> month duration campaign pdays
#> may :12119 Min. : 0.0 Min. : 1.000 Min. : -1.00
#> jul : 6039 1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.: -1.00
#> aug : 5284 Median : 179.0 Median : 2.000 Median : -1.00
#> jun : 4309 Mean : 256.5 Mean : 2.766 Mean : 40.77
#> nov : 2961 3rd Qu.: 316.0 3rd Qu.: 3.000 3rd Qu.: -1.00
#> apr : 2484 Max. :3881.0 Max. :58.000 Max. :871.00
#> (Other): 5034
#> previous y
#> Min. : 0.0000 no :33954
#> 1st Qu.: 0.0000 yes: 4276
#> Median : 0.0000
#> Mean : 0.5764
#> 3rd Qu.: 0.0000
#> Max. :275.0000
#>
The balance column now has a much better distribution. We will retain the values in the other columns, because they carry important information and are not outliers.
Check for Target Class Proportion
Before we build our model, we need to check our target class proportions. This is important to reduce bias in the model's results, because with an imbalanced target the model will only be good at predicting the majority class.
table(bank_clean$y)
#>
#> no yes
#> 33954 4276
As we can see, our target class proportions are not balanced yet. To handle this we will do up-sampling with the upSample() function, resampling the minority class until both classes have the same proportion.
bank_clean_up <- upSample(x = bank_clean %>% select(-y), y = bank_clean$y, yname = "y")
table(bank_clean_up$y)
#>
#> no yes
#> 33954 33954
Our target variable now has balanced proportions.
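As a quick sanity check (upSample() resamples the minority class with replacement, so both classes should now sit at exactly 50%):
# Class proportions after upsampling; both classes should be 0.5
prop.table(table(bank_clean_up$y))
#>
#>  no yes
#> 0.5 0.5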
NAIVE BAYES MODEL
Naive Bayes is a supervised machine learning algorithm for classification based on Bayes' theorem. Bayes' theorem works by updating our prior belief with new information. If you want to know more about the Naive Bayes algorithm, you can visit this link.
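As a quick illustration of Bayes' theorem with hypothetical numbers (these probabilities are made up for the example, not taken from this dataset):
# Bayes' theorem: P(buy | married) = P(married | buy) * P(buy) / P(married)
p_buy         <- 0.10  # hypothetical prior: 10% of customers buy
p_married_buy <- 0.40  # hypothetical likelihood: P(married | buy)
p_married     <- 0.55  # hypothetical evidence: P(married)
p_married_buy * p_buy / p_married
#> [1] 0.07272727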
FEATURE ENGINEERING
Naive Bayes works well when the predictors are categorical, because the model uses the predictors' class probabilities to make the final prediction. From our data, we can create a new categorical column, age_range, from the age column.
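As a side note, base R's cut() can produce similar bins in one line; a minimal sketch, assuming the same bin edges as the manual function that follows (only the labels differ):
# 18-25 first, then 5-year bins up to 95; labels look like "(17,25]", "(25,30]", ...
age_bins <- cut(bank_clean_up$age, breaks = c(17, 25, seq(30, 95, by = 5)))
table(age_bins)
The explicit version below spells out each bin and its label: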
bank_naive <- bank_clean_up
# Build function
p <- function(x){
if (x >= 18 & x <= 25 ){
y <- "18 - 25"
}
else if (x >= 26 & x <= 30){
y <- "26 - 30"
}
else if (x >= 31 & x <= 35){
y <- "31 - 35"
}
else if (x >= 36 & x <= 40){
y <- "36 - 40"
}
else if (x >= 41 & x <= 45){
y <- "41 - 50"
}
else if (x >= 46 & x <= 50){
y <- "46 - 50"
}
else if (x >= 51 & x <= 55){
y <- "51 - 55"
}
else if (x >= 56 & x <= 60){
y <- "56 - 60"
}
else if (x >= 61 & x <= 65){
y <- "61 - 65"
}
else if (x >= 66 & x <= 70){
y <- "65 - 70"
}
else if (x >= 71 & x <= 75){
y <- "71 - 75"
}
else if (x >= 76 & x <= 80){
y <- "76 - 80"
}
else if (x >= 81 & x <= 85){
y <- "81 - 85"
}
else if (x >= 86 & x <= 90){
y <- "86 - 90"
}
else {
y <- "91 - 95"
}
return(y)
}
# Apply function
bank_naive$age_range <- as.factor(sapply(bank_naive$age, FUN = p))
# Remove Age column from data set
bank_naive <- bank_naive %>% select(-age)
Data Exploration
The Naive Bayes model assumes that every predictor is highly significant to the model, and also that there is no multicollinearity (high correlation between predictors). So before we build the model, we need to check the correlation between predictors.
Numeric-Numeric Correlation
Checking correlation between numeric column.
ggcorr(bank_naive, label = T, label_size = 3, hjust = 1)
From the plot above we can say there is no high correlation between the numeric predictors.
Cross Validation
We'll split our data into 80% training data and 20% test data.
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(nrow(bank_naive), nrow(bank_naive)*0.8)
naive_train <- bank_naive[index,]
naive_test <- bank_naive[-index,]
There are two ways to build a Naive Bayes model:
- Using the formula interface: naiveBayes(formula, data), where formula has the form y ~ x (y = target, x = predictors) and data is the data frame.
- Using the x/y interface: naiveBayes(x, y), where x holds the predictor variables and y holds the target variable.
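For example, these two calls fit the same model on a couple of this data's columns (a sketch, using the training split defined above):
# Formula interface
m1 <- naiveBayes(y ~ job + marital, data = naive_train)
# x/y interface: predictors as a data frame, target as a vector
m2 <- naiveBayes(x = naive_train[, c("job", "marital")], y = naive_train$y)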
We will make two Naive Bayes models and see which one performs better: our first model will use only the categorical predictors, while the second one will use all columns (except y) as predictors.
First Model
build model
Using only categorical predictors
first_naive <- naiveBayes(y ~ job + marital + education + default + housing + loan + month + age_range, data = naive_train, laplace = 1)
The laplace argument is a method to avoid zero frequencies for any class of a predictor. For example, look at the table below:
new_table <- matrix(c(130, 0, 127, 143), nrow = 2, dimnames = list(c("Married", "Divorced"), c("Subscribe", "not-subscribe")))
new_table
#> Subscribe not-subscribe
#> Married 130 127
#> Divorced 0 143
new_table %>% prop.table(margin = 1)
#> Subscribe not-subscribe
#> Married 0.5058366 0.4941634
#> Divorced 0.0000000 1.0000000
Say we cross-tabulate the marital and y columns into a frequency table, and from the table we learn that nobody who is divorced subscribed to a term deposit. If this kind of situation occurs in our data, then no matter what the other columns say, the model will always assume that a divorced customer will not buy the product. To avoid this, we can use Laplace smoothing to add 1 to every cell, including divorced people who buy a deposit. The frequencies then become:
new_table <- matrix(c(130, 0, 127, 143) + 1, nrow = 2, dimnames = list(c("Married", "Divorced"), c("Subscribe", "not-subscribe")))
new_table
#> Subscribe not-subscribe
#> Married 131 128
#> Divorced 1 144
new_table %>% prop.table(margin = 1)
#> Subscribe not-subscribe
#> Married 0.505791506 0.4942085
#> Divorced 0.006896552 0.9931034
As we can see, the frequency of divorced people who buy a term deposit is no longer 0.
model evaluation
naive_pred <- predict(first_naive, naive_test)
(conf_matrix_naive <- table(naive_pred, naive_test$y))
#>
#> naive_pred no yes
#> no 4886 2668
#> yes 1927 4101
As we observed, the model correctly predicted 4101 customers who subscribed and 4886 customers who did not, while it wrongly predicted 2668 actual subscribers as non-subscribers and 1927 non-subscribers as subscribers. Note that this model's threshold is 0.5, meaning every customer whose predicted probability of subscribing is 0.5 or lower is classified as not subscribing.
To get more information about our first Naive Bayes model, let's use the confusionMatrix() function to look further into the result.
confusionMatrix(conf_matrix_naive)
#> Confusion Matrix and Statistics
#>
#>
#> naive_pred no yes
#> no 4886 2668
#> yes 1927 4101
#>
#> Accuracy : 0.6617
#> 95% CI : (0.6537, 0.6696)
#> No Information Rate : 0.5016
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.3231
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.7172
#> Specificity : 0.6059
#> Pos Pred Value : 0.6468
#> Neg Pred Value : 0.6803
#> Prevalence : 0.5016
#> Detection Rate : 0.3597
#> Detection Prevalence : 0.5562
#> Balanced Accuracy : 0.6615
#>
#> 'Positive' Class : no
#>
The values we must pay attention to are Accuracy, Sensitivity/Recall and Pos Pred Value/Precision. Accuracy tells us how often the model makes a true prediction (yes predicted as yes, and no predicted as no) out of all the data; the formula is (TP+TN)/TOTAL. Sensitivity/Recall represents how many of the actually positive cases the model predicts as positive; the formula is TP/(TP+FN). And Pos Pred Value/Precision tells us how many of the model's positive predictions are truly positive; the formula is TP/(TP+FP).
To get a better understanding of TP, TN, FP and FN, see the picture below:
- TN means it's actually negative and the model predicted it as negative as well
- FN means it's actually positive, but the model predicted it as negative
- TP means it's actually positive and the model predicted it as positive as well
- FP means it's actually negative, but the model predicted it as positive
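Putting those formulas into code, here is a small helper sketch that computes the three metrics from a 2x2 confusion matrix (rows = predictions, columns = actuals, as in the table above); calling it with positive = "no" reproduces the statistics reported by confusionMatrix(), since "no" was taken as the positive class there:
# Accuracy, sensitivity/recall and precision from a confusion matrix
classification_metrics <- function(conf, positive = "yes") {
  negative <- setdiff(rownames(conf), positive)
  TP <- conf[positive, positive]; TN <- conf[negative, negative]
  FP <- conf[positive, negative]; FN <- conf[negative, positive]
  c(accuracy    = (TP + TN) / sum(conf),
    sensitivity = TP / (TP + FN),
    precision   = TP / (TP + FP))
}
round(classification_metrics(conf_matrix_naive, positive = "no"), 4)
#>    accuracy sensitivity   precision
#>      0.6617      0.7172      0.6468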
So our first model's summary is:
- Accuracy of 66%
- Sensitivity of 71% (how many of the actually positive cases, the "no" class here, the model predicted correctly)
- Precision (Pos Pred Value) of 65% (how many of the model's positive predictions were correct)
We will follow the exact same steps for the second model.
Second Model
build model
# Use all column as predictor
second_naive <- naiveBayes(y ~ ., data = naive_train, laplace = 1)
model evaluation
naive_pred <- predict(second_naive, naive_test)
(conf_matrix_naive <- table(naive_pred, naive_test$y))
#>
#> naive_pred no yes
#> no 5360 1450
#> yes 1453 5319
It seems the model that uses all columns as predictors performs better than our first model, judging by how many true predictions it makes. To make sure, we will use confusionMatrix() to see the exact performance.
confusionMatrix(conf_matrix_naive)
#> Confusion Matrix and Statistics
#>
#>
#> naive_pred no yes
#> no 5360 1450
#> yes 1453 5319
#>
#> Accuracy : 0.7863
#> 95% CI : (0.7793, 0.7931)
#> No Information Rate : 0.5016
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.5725
#>
#> Mcnemar's Test P-Value : 0.9704
#>
#> Sensitivity : 0.7867
#> Specificity : 0.7858
#> Pos Pred Value : 0.7871
#> Neg Pred Value : 0.7854
#> Prevalence : 0.5016
#> Detection Rate : 0.3946
#> Detection Prevalence : 0.5014
#> Balanced Accuracy : 0.7863
#>
#> 'Positive' Class : no
#>
Our second model has higher Accuracy, Sensitivity and Pos Pred Value. Even so, this result should not satisfy us: there are still more than 20% false predictions, meaning more than a thousand customers who were likely to subscribe were predicted as non-subscribers, which is a costly mistake.
Another way to do model evaluation is by looking at the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the Curve). ROC plots the true positive rate (TPR, or Sensitivity) against the false positive rate (FPR, or 1 - Specificity). ROC is a probability curve and AUC represents the degree of separability: it tells how capable the model is of distinguishing between classes. The closer the curve reaches the upper-left of the plot (high true positive rate with low false positive rate), the better our model is, and the higher the AUC score, the better our model separates the target classes. To ease your understanding, you can see the illustration below.
Let's build our model's ROC curve.
# We use raw as the type argument to get the probabilities
naive_prob <- predict(second_naive, naive_test, type = "raw")
# Take only "yes" class
naive_roc <- data.frame(prediction=round(naive_prob[,2],4),
trueclass=as.numeric(naive_test$y == "yes"))
head(naive_roc)
#> prediction trueclass
#> 1 0.0299 0
#> 2 0.1358 0
#> 3 0.1142 0
#> 4 0.0661 0
#> 5 0.0341 0
#> 6 0.2248 0
naive_roc <- ROCR::prediction(naive_roc$prediction, naive_roc$trueclass)
plot(performance(naive_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)
# AUC
auc_ROCR_n <- performance(naive_roc, measure = "auc")
auc_ROCR_n <- auc_ROCR_n@y.values[[1]]
auc_ROCR_n
#> [1] 0.8600987
DECISION TREE
Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. It is a flowchart-like structure where each internal node represents a test on a feature, each branch represents an outcome of the test, and each leaf node represents a class label or a value. To get a better understanding of decision trees, see the picture below:
Let's take the example of weather in some area, where we want to predict whether it will rain based on the weather conditions. Say we already have a decision tree model and data about today's conditions. If today's condition is overcast, there is a 100% probability of rain. And what if today is sunny? The model looks at another condition: if the humidity is low, there is a low probability of rain; if it is high, it will rain, and it is better to bring an umbrella before going out.
How does a tree choose its root, split its branches and end up with leaf nodes?
A tree will choose to split data in such a way that the resulting nodes will contain data points with as many similar class as possible (homogenous). One measurement of homogeneity/purity within groups is entropy or the measure of disorder. Entropy near 0 means most of the observations fall within the same class (homogenous). Entropy near 1 is the other way around (heterogeneous).
A decision tree is built in a top-down fashion. The root node will be chosen from the variable and conditional rule that give the highest information gain. The root will be partitioned (split) into nodes, with each partition having a different entropy. For more reference about the entropy calculation, you can read this article. The difference in entropy before and after splitting is called the information gain. The tree will prefer to perform splitting using variables and rules that result in higher information gain (from a high entropy into a lower entropy).
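To make the entropy and information gain ideas concrete, here is a small sketch (two-class entropy in bits; the example proportions are arbitrary):
# Entropy of a vector of class proportions: 0 = pure node, 1 = 50/50 two-class mix
entropy <- function(p) {
  p <- p[p > 0]        # drop zero proportions to avoid log2(0)
  -sum(p * log2(p))
}
entropy(c(0.5, 0.5))   # heterogeneous node -> 1
entropy(c(0.9, 0.1))   # mostly one class   -> ~0.469
entropy(c(1))          # homogeneous node   -> 0
# Information gain = parent entropy - weighted average of child entropies
info_gain <- function(parent, children, weights) {
  entropy(parent) - sum(weights * sapply(children, entropy))
}
info_gain(c(0.5, 0.5), list(c(0.9, 0.1), c(0.1, 0.9)), weights = c(0.5, 0.5))
#> [1] 0.5310044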
There are certain characteristics of decision tree model:
perform well on both numerical and categorical variables.
all predictors are assumed to interact.
quite robust to the problem of multicollinearity. A decision tree will choose a variable that has the highest information gain in one split, whereas a method such as logistic regression would have used both.
robust and insensitive to outliers. Splitting will happen at a condition where it maximizes the homogeneity within resulting groups. Outliers will have little influence on the splitting process.
Some drawbacks of the decision tree model are described below:
- prone to overfitting. Trees should know when to stop growing or they will partition every single observation possible and result in an overfitting model. To overcome that problem, we can perform pre-pruning (determine tree depth before model building) or post-pruning (letting the tree grow and pruning it later after important classification patterns have been discovered). We can set some parameters:
- mincriterion: the value of the test statistic (1 - p-value) that must be exceeded in order to implement a split. For example, when mincriterion is 0.95, the p-value must be smaller than 0.05 in order for a node to split. This can also act as a "regulator" for the depth of the tree: the higher the mincriterion, the harder it is to perform splitting, thus generating a smaller tree.
- minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted. Defaults to 20.
- minbucket: the minimum number of observations in any leaf node. Defaults to round(minsplit/3).
- maxdepth: the maximum depth of any node of the final tree, with the root node counted as depth 0. Defaults to 30.
- a greedy algorithm, which picks the attribute that will result in the highest information gain, a strategy that converges to local optima but does not guarantee global optimality. Some efforts to overcome this problem are:
- k-steps look-ahead trees
- dual information distance (DID) as the splitting criterion
- etc.
As with the Naive Bayes model, we will make more than one model and see which one performs better at predicting the data.
Cross Validation
Split the data into 80% training data and 20% test data
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(nrow(bank_naive), nrow(bank_naive)*0.8)
tree_train <- bank_naive[index,]
tree_test <- bank_naive[-index,]
First Model
Our first model will use all columns as predictors and we will not do any model tuning. Even though we are not specifying any parameters, by default the decision tree model will use these settings:
- Mincriterion: 0.95
- Minsplit: 20
- Minbucket: 7
So, let's make our first model
model building
# Build first decision tree model
first_tree <- ctree(formula = y ~ ., data = tree_train)
# Doing prediction
first_tree_predict <- predict(first_tree, tree_test, type = "response")
model evaluation
confusionMatrix(first_tree_predict, tree_test$y, positive = "yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 5630 341
#> yes 1183 6428
#>
#> Accuracy : 0.8878
#> 95% CI : (0.8824, 0.8931)
#> No Information Rate : 0.5016
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.7757
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.9496
#> Specificity : 0.8264
#> Pos Pred Value : 0.8446
#> Neg Pred Value : 0.9429
#> Prevalence : 0.4984
#> Detection Rate : 0.4733
#> Detection Prevalence : 0.5604
#> Balanced Accuracy : 0.8880
#>
#> 'Positive' Class : yes
#>
From the Accuracy, Sensitivity and Pos Pred Value, we can say our first model performs well. Even so, we will try tuning the model parameters in our second model. Our aim is to get as high an accuracy as possible. Why accuracy? If you remember the definition of accuracy mentioned before, both of our target classes are equally important for the bank when designing strategies to approach customers. Higher accuracy means the model makes more true predictions, in other words the bank can craft more proper and effective strategies.
Second Model
model building
Our second model will use these parameter settings:
- Mincriterion of 0.2
- Minsplit of 25
- Minbucket of 5
# Build model
second_tree <- ctree(formula = y ~ ., data = tree_train,
control = ctree_control(mincriterion = 0.2,
minsplit = 25,
                                             minbucket = 5))
model evaluation
predict_tuning <- predict(second_tree, tree_test, type = "response")
confusionMatrix(predict_tuning, tree_test$y, positive = "yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 5903 387
#> yes 910 6382
#>
#> Accuracy : 0.9045
#> 95% CI : (0.8994, 0.9094)
#> No Information Rate : 0.5016
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.8091
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.9428
#> Specificity : 0.8664
#> Pos Pred Value : 0.8752
#> Neg Pred Value : 0.9385
#> Prevalence : 0.4984
#> Detection Rate : 0.4699
#> Detection Prevalence : 0.5369
#> Balanced Accuracy : 0.9046
#>
#> 'Positive' Class : yes
#>
As we observed, after tuning the parameters our second model achieves even higher accuracy than our first model.
Let's make the ROC plot
# We use prob as the type argument to get the probabilities
tree_prob <- predict(second_tree, tree_test, type = "prob")
# Take only "yes" class
tree_roc <- data.frame(prediction=round(tree_prob[,2],4),
trueclass=as.numeric(tree_test$y == "yes"))
head(tree_roc)
#> prediction trueclass
#> 3 0 0
#> 4 0 0
#> 8 0 0
#> 12 0 0
#> 19 0 0
#> 20 0 0
tree_roc <- ROCR::prediction(tree_roc$prediction, tree_roc$trueclass)
plot(performance(tree_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)
# AUC
auc_ROCR_d <- performance(tree_roc, measure = "auc")
auc_ROCR_d <- auc_ROCR_d@y.values[[1]]
auc_ROCR_d
#> [1] 0.9480649
RANDOM FOREST
Imagine we have a really complex problem, and to solve it we ask many experts in the field to help us find the answer. After they give their answers, they vote to choose the best one.
This analogy fits random forest perfectly. Many decision trees are built using random subsets of the data and predictors. Like an expert, each decision tree makes its own classification prediction, and the model then chooses the majority answer (if we use random forest for regression, it averages all the answers). So Random Forest is an ensemble-based algorithm built on the decision tree method and known for its versatility and performance. An ensemble-based algorithm is a hybrid of several machine learning techniques combined into one predictive model, built to reduce error and bias and to improve predictions. Building a Random Forest model consists of several steps, sketched in code after this list:
- Perform Bootstrap Sampling. Create subsets of the training data through random sampling with replacement, to train multiple predictive models (in this case many decision trees).
- Build Decision Trees. Train a bunch of decision tree models, one for each bootstrap sample. The mtry argument randomly chooses the predictors considered at each split (automatic feature selection).
- Make Predictions. Each decision tree model makes its own prediction.
- Aggregate. Combine the individual predictions into one. For classification the model uses majority voting, while for regression it uses the average of the predictions.
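As a minimal sketch of these steps, fitting a random forest directly with the randomForest package (the ntree and mtry values here are illustrative assumptions, not the tuned values used later):
# 500 trees, each fit on a bootstrap sample of tree_train;
# at every split, mtry randomly chosen predictors compete as candidates
rf_sketch <- randomForest(y ~ ., data = tree_train,
                          ntree = 500, # number of bootstrapped trees
                          mtry  = 4)   # predictors sampled per split
rf_sketch                              # prints the OOB error and confusion matrix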
Speaking of random selection of observations and variables, there is a technique called K-fold Cross-Validation. This technique performs cross-validation by splitting our data into k equally sized bins. The model holds out one bin as test data and trains on the rest, and the process is repeated k times (the folds). This way every observation gets the chance to be used as both training and test data, which also helps overcome the overfitting problem of the decision tree model. Below is an example of a 5-bin, 5-fold cross-validation.
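As an illustration of the binning, caret's createFolds() can generate such folds (a sketch; fold sizes may differ slightly because of stratified sampling):
# Split the training rows into 5 roughly equal, class-stratified bins
set.seed(100)
folds <- createFolds(tree_train$y, k = 5)
sapply(folds, length) # each bin holds roughly 1/5 of the rows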
To get a better understanding of how random forest works, let's build a random forest model.
Model Building
#set.seed(417)
# Determining k-fold cross-validation
#ctrl <- trainControl(method = "repeatedcv",
# number = 5, # k-fold
# repeats = 3) # repetition
# Train random forest model
#fb_forest <- train(y ~ .,
# data = tree_train,
# method = "rf", # random forest
# trControl = ctrl)
# For convenience in reading the model, we will save it in a file
#saveRDS(fb_forest, "bank_forest.RDS")
# Read RDS file
bank_forest <- readRDS("bank_forest.RDS")
Model Evaluation
bank_forest$finalModel
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 25
#>
#> OOB estimate of error rate: 3.32%
#> Confusion matrix:
#> no yes class.error
#> no 25353 1788 0.0658781917
#> yes 13 27172 0.0004782049
Let's compute our model's accuracy from the OOB confusion matrix above (reading TP, TN, FP and FN off the matrix, with yes as the positive class)
TP <- 27172 # actual yes predicted as yes
TN <- 25353 # actual no predicted as no
FP <- 1788  # actual no predicted as yes
FN <- 13    # actual yes predicted as no
total <- TP + TN + FP + FN
accuracy <- (TP + TN) / total
accuracy
#> [1] 0.9668483
Our trained model has remarkably high accuracy compared to the other models we made before. But this number can drop when predicting new data (tree_test), so we will check the confusion matrix to see the model's performance on the test set.
bank_predict_forest <- predict(bank_forest, newdata = tree_test)
conf_matrix <- confusionMatrix(bank_predict_forest, tree_test$y, positive = "yes")
conf_matrix
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 6392 6
#> yes 421 6763
#>
#> Accuracy : 0.9686
#> 95% CI : (0.9655, 0.9714)
#> No Information Rate : 0.5016
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.9371
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.9991
#> Specificity : 0.9382
#> Pos Pred Value : 0.9414
#> Neg Pred Value : 0.9991
#> Prevalence : 0.4984
#> Detection Rate : 0.4979
#> Detection Prevalence : 0.5289
#> Balanced Accuracy : 0.9687
#>
#> 'Positive' Class : yes
#>
It even has higher accuracy than the trained model's OOB estimate, and this accuracy is the highest among all the models we have made.
Make the ROC plot
# We use prob as the type argument to get the probabilities
forest_prob <- predict(bank_forest, tree_test, type = "prob")
# Take only "yes" class
forest_roc <- data.frame(prediction=round(forest_prob[,2],4),
trueclass=as.numeric(tree_test$y == "yes"))
head(forest_roc)
#> prediction trueclass
#> 1 0.010 0
#> 2 0.000 0
#> 3 0.000 0
#> 4 0.002 0
#> 5 0.008 0
#> 6 0.004 0
forest_roc <- ROCR::prediction(forest_roc$prediction, forest_roc$trueclass)
plot(performance(forest_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)
# AUC
auc_ROCR_f <- performance(forest_roc, measure = "auc")
auc_ROCR_f <- auc_ROCR_f@y.values[[1]]
auc_ROCR_f
#> [1] 0.9997921
Our ROC curve is excellent (the area under the curve is almost 100%); this is the best ROC curve compared to the other models.
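Before concluding, a quick side-by-side of the three AUC values computed above:
# Collect the AUC of each model into one comparison table
data.frame(model = c("Naive Bayes", "Decision Tree", "Random Forest"),
           AUC   = round(c(auc_ROCR_n, auc_ROCR_d, auc_ROCR_f), 4))
#>           model    AUC
#> 1   Naive Bayes 0.8601
#> 2 Decision Tree 0.9481
#> 3 Random Forest 0.9998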
CONCLUSION
- Of all the models we made, random forest has the best performance.
- Since our target classes are equally important, the confusionMatrix metric we rely on is Accuracy, which represents how well the model makes true predictions (True Positives and True Negatives).
- Our random forest model has an accuracy of 96.86% and an AUC of 0.9998.