We are going to build a simple classification model using data from Kaggle: a bank marketing dataset, where the goal is to predict whether an individual will end up making a term deposit or not.
This data is obtained from https://www.kaggle.com/janiobachmann/bank-marketing-dataset
First of all, we are going to load the necessary libraries and dataset for this modeling.
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(caret)
library(vtreat)
library(FSelector)
library(glmnet)
library(e1071)
library(magrittr)
library(wesanderson)
library(ggpubr)
Now, we want to load the corresponding dataset.
df_bank<-read_delim("bank-additional-full.csv",delim = ";")
Now, we want to inspect the columns of the data.
str(df_bank)
## tibble [41,188 x 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:41188] 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr [1:41188] "housemaid" "services" "services" "admin." ...
## $ marital : chr [1:41188] "married" "married" "married" "married" ...
## $ education : chr [1:41188] "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr [1:41188] "no" "unknown" "no" "no" ...
## $ housing : chr [1:41188] "no" "no" "yes" "no" ...
## $ loan : chr [1:41188] "no" "no" "no" "no" ...
## $ contact : chr [1:41188] "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr [1:41188] "may" "may" "may" "may" ...
## $ day_of_week : chr [1:41188] "mon" "mon" "mon" "mon" ...
## $ duration : num [1:41188] 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : num [1:41188] 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : num [1:41188] 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : num [1:41188] 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr [1:41188] "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num [1:41188] 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num [1:41188] 94 94 94 94 94 ...
## $ cons.conf.idx : num [1:41188] -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num [1:41188] 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num [1:41188] 5191 5191 5191 5191 5191 ...
## $ y : chr [1:41188] "no" "no" "no" "no" ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. job = col_character(),
## .. marital = col_character(),
## .. education = col_character(),
## .. default = col_character(),
## .. housing = col_character(),
## .. loan = col_character(),
## .. contact = col_character(),
## .. month = col_character(),
## .. day_of_week = col_character(),
## .. duration = col_double(),
## .. campaign = col_double(),
## .. pdays = col_double(),
## .. previous = col_double(),
## .. poutcome = col_character(),
## .. emp.var.rate = col_double(),
## .. cons.price.idx = col_double(),
## .. cons.conf.idx = col_double(),
## .. euribor3m = col_double(),
## .. nr.employed = col_double(),
## .. y = col_character()
## .. )
This dataset contains 21 columns: 20 predictors and 1 target variable. The target is y, which indicates whether the customer ended up making a term deposit or not. The remaining variables fall into the following groups:
Client data: age, job, marital, education, default, housing, loan
Other attributes: contact, month, day_of_week, duration, campaign, pdays, previous, poutcome
Social and economic context attributes: emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed
Let’s inspect the dataset
head(df_bank)
Notice that the job column has an inconsistency in formatting: in row 4, "admin." contains a trailing full stop that the other values do not have.
We want all the values to be formatted consistently, so we delete the full stop.
df_bank$job<-gsub(".", "", df_bank$job, fixed = TRUE)
Let's inspect our job column again.
table(df_bank$job)
##
## admin blue-collar entrepreneur housemaid management
## 10422 9254 1456 1060 2924
## retired self-employed services student technician
## 1720 1421 3969 875 6743
## unemployed unknown
## 1014 330
All done, now let’s check if there are any empty columns
colSums(is.na(df_bank))## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y
## 0
Perfect, let’s move on to the next step.
Now, let’s explore the distribution and inspect outliers in the numerical variables. First, we are going to create boxplots
df_bank%>%
select_if(is.numeric) %>%
pivot_longer(cols=c(1:9),names_to = "variable",values_to = "value") %>%
ggplot(aes(y=value))+geom_boxplot()+facet_wrap(~variable,scales="free")
It seems like the distributions of our variables are skewed, especially age, campaign, duration, pdays, and previous. To get a clearer picture of these distributions, we are going to use histograms.
Now, let’s create the histogram.
df_bank%>%
select_if(is.numeric) %>%
pivot_longer(cols=c(1:9),names_to = "variable",values_to = "value") %>%
ggplot(aes(y=value))+geom_histogram()+facet_wrap(~variable,scales="free")+coord_flip()+theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As we can see from the charts above, several of the numeric variables are heavily skewed. Next, let's examine the categorical variables against the target variable y.
a<-df_bank %>% group_by(y,marital) %>% count() %>%
ggplot(aes(reorder(marital,n),n))+geom_col(aes(fill=y))+theme_light()+coord_flip()+scale_fill_manual(values=wes_palette(n=2, name="BottleRocket1"))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+ggtitle("Marital Status")
b<-df_bank %>%group_by(y,job) %>% count() %>%
ggplot(aes(reorder(job,n),n))+geom_col(aes(fill=y))+theme_light()+coord_flip()+scale_fill_manual(values=wes_palette(n=2, name="BottleRocket1"))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+ggtitle("Job")
figure<-ggarrange(a, b,
common.legend = TRUE, legend = "bottom")
annotate_figure(figure,top = text_grob("Term Deposit Subscription Based on Marital Status and Job", color = "black", face = "bold", size = 15))
From the plot above, we can see which marital statuses and jobs account for most of the people who ended up making a term deposit subscription.
c<-df_bank %>%group_by(y,education) %>% count() %>%
ggplot(aes(reorder(education,n),n))+geom_col(aes(fill=y))+theme_light()+coord_flip()+scale_fill_manual(values=wes_palette(n=2, name="BottleRocket1"))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+ggtitle("Education")
d<-df_bank %>%group_by(y,default) %>% count() %>%
ggplot(aes(reorder(default,n),n))+geom_col(aes(fill=y))+theme_light()+coord_flip()+scale_fill_manual(values=wes_palette(n=2, name="BottleRocket1"))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+ggtitle("Default")
figure2<-ggarrange(c, d,
common.legend = TRUE, legend = "bottom")
annotate_figure(figure2,top = text_grob("Term Deposit Subscription Based on Education and Credit Default", color = "black", face = "bold", size = 15))
From the plot above, we can see which education levels and credit default statuses account for most of the people who ended up making a term deposit subscription.
e<-df_bank %>%group_by(y,housing) %>% count() %>%
ggplot(aes(reorder(housing,n),n))+geom_col(aes(fill=y))+theme_light()+coord_flip()+scale_fill_manual(values=wes_palette(n=2, name="BottleRocket1"))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+ggtitle("Housing")
f<-df_bank %>%group_by(y,loan) %>% count() %>%
ggplot(aes(reorder(loan,n),n))+geom_col(aes(fill=y))+theme_light()+coord_flip()+scale_fill_manual(values=wes_palette(n=2, name="BottleRocket1"))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+ggtitle("Loan")
figure3<-ggarrange(e, f,
common.legend = TRUE, legend = "bottom")
annotate_figure(figure3,top = text_grob("Term Deposit Subscription Based on Housing and Loan Status", color = "black", face = "bold", size = 15))
From the plot above, we can see which housing and loan statuses account for most of the people who ended up making a term deposit subscription.
Since our first model, the logistic regression, outputs probabilities, we will first convert the target variable into a numeric 0/1 encoding. We will also convert the qualitative variables into factors.
df_bank<-df_bank %>%
mutate(y=as.factor(y)) %>%
mutate(y=ifelse(y=="yes",1,0)) %>%
mutate_if(is.character,as.factor)
To reduce the chance of overfitting, we are going to split the data into training and testing sets so that we can check the model's performance after the modeling step. In this project, we split the data into a 70% train set and a 30% test set.
RNGkind(sample.kind = "Rounding")
set.seed(125)
row_data <- nrow(df_bank)
index <- sample(row_data, row_data*0.7)
data_train <- df_bank[ index, ]
data_test <- df_bank[ -index, ]
data_train<-data_train %>% mutate(y=as.factor(y))
data_test<-data_test %>% mutate(y=as.factor(y))
x_train<-as.data.frame(data_train) %>% select(-y)
y_train<-data_train$y
x_test<-as.data.frame(data_test) %>% select(-y)
y_test<-data_test$y
As we saw earlier during EDA, the classes of the response variable are in fact very imbalanced.
prop.table(table(y_train))
## y_train
## 0 1
## 0.8888349 0.1111651
As shown above, around 89% of the responses correspond to 0 and only around 11% correspond to 1.
Thus, we need to down-sample or up-sample so the model can learn equally from both classes.
Since we have 41,188 rows of data, which is quite hefty, we are going to down-sample instead of up-sample.
set.seed(123)
train_downsample <- downSample(x = x_train,
y = y_train,
yname = "y")
train_downsample<-train_downsample %>% mutate(y=as.factor(y))
Now, let's attempt a stepwise logistic regression on our data.
model_full<-glm(y~.,train_downsample, family = "binomial")
model_null<-glm(y~1,train_downsample, family = "binomial")
model_logistic_step <- step(model_null,
scope = list(lower = model_null, upper = model_full),
direction = "both"
)
## Start: AIC=8888.15
## y ~ 1
##
## Df Deviance AIC
## + duration 1 6990.8 6994.8
## + nr.employed 1 7417.1 7421.1
## + euribor3m 1 7612.9 7616.9
## + emp.var.rate 1 7700.2 7704.2
## + poutcome 2 8177.1 8183.1
## + pdays 1 8189.9 8193.9
## + month 9 8205.0 8225.0
## + previous 1 8438.6 8442.6
## + contact 1 8528.1 8532.1
## + job 11 8560.3 8584.3
## + cons.price.idx 1 8657.3 8661.3
## + default 1 8691.8 8695.8
## + education 7 8771.7 8787.7
## + campaign 1 8797.4 8801.4
## + marital 3 8830.9 8838.9
## + cons.conf.idx 1 8837.6 8841.6
## + day_of_week 4 8870.5 8880.5
## + age 1 8878.4 8882.4
## + loan 2 8879.9 8885.9
## + housing 2 8881.4 8887.4
## <none> 8886.1 8888.1
##
## Step: AIC=6994.85
## y ~ duration
##
## Df Deviance AIC
## + nr.employed 1 4851.8 4857.8
## + euribor3m 1 5065.0 5071.0
## + emp.var.rate 1 5123.1 5129.1
## + month 9 6006.5 6028.5
## + poutcome 2 6136.6 6144.6
## + pdays 1 6171.7 6177.7
## + previous 1 6403.9 6409.9
## + job 11 6506.7 6532.7
## + contact 1 6572.6 6578.6
## + cons.price.idx 1 6609.8 6615.8
## + default 1 6678.8 6684.8
## + education 7 6812.2 6830.2
## + cons.conf.idx 1 6877.6 6883.6
## + campaign 1 6891.5 6897.5
## + marital 3 6924.9 6934.9
## + age 1 6973.1 6979.1
## + day_of_week 4 6978.4 6990.4
## + loan 2 6982.6 6990.6
## + housing 2 6986.0 6994.0
## <none> 6990.8 6994.8
## - duration 1 8886.1 8888.1
##
## Step: AIC=4857.78
## y ~ duration + nr.employed
##
## Df Deviance AIC
## + month 9 4476.8 4500.8
## + poutcome 2 4685.9 4695.9
## + job 11 4727.0 4755.0
## + pdays 1 4751.9 4759.9
## + education 7 4757.7 4777.7
## + default 1 4802.3 4810.3
## + emp.var.rate 1 4807.3 4815.3
## + contact 1 4812.7 4820.7
## + cons.conf.idx 1 4823.4 4831.4
## + cons.price.idx 1 4829.4 4837.4
## + marital 3 4841.6 4853.6
## + campaign 1 4846.5 4854.5
## <none> 4851.8 4857.8
## + loan 2 4847.8 4857.8
## + euribor3m 1 4851.2 4859.2
## + age 1 4851.4 4859.4
## + previous 1 4851.7 4859.7
## + housing 2 4851.3 4861.3
## + day_of_week 4 4847.7 4861.7
## - nr.employed 1 6990.8 6994.8
## - duration 1 7417.1 7421.1
##
## Step: AIC=4500.84
## y ~ duration + nr.employed + month
##
## Df Deviance AIC
## + poutcome 2 4349.7 4377.7
## + pdays 1 4407.5 4433.5
## + emp.var.rate 1 4435.1 4461.1
## + education 7 4425.8 4463.8
## + job 11 4423.3 4469.3
## + default 1 4450.1 4476.1
## + cons.price.idx 1 4454.1 4480.1
## + euribor3m 1 4469.0 4495.0
## + campaign 1 4470.7 4496.7
## + marital 3 4468.6 4498.6
## + contact 1 4474.4 4500.4
## <none> 4476.8 4500.8
## + cons.conf.idx 1 4474.8 4500.8
## + age 1 4476.2 4502.2
## + previous 1 4476.8 4502.8
## + loan 2 4475.6 4503.6
## + day_of_week 4 4472.1 4504.1
## + housing 2 4476.7 4504.7
## - month 9 4851.8 4857.8
## - nr.employed 1 6006.5 6028.5
## - duration 1 7108.1 7130.1
##
## Step: AIC=4377.69
## y ~ duration + nr.employed + month + poutcome
##
## Df Deviance AIC
## + emp.var.rate 1 4286.8 4316.8
## + cons.price.idx 1 4310.3 4340.3
## + education 7 4302.6 4344.6
## + job 11 4301.2 4351.2
## + default 1 4323.8 4353.8
## + euribor3m 1 4330.2 4360.2
## + contact 1 4344.4 4374.4
## + marital 3 4341.3 4375.3
## + campaign 1 4345.4 4375.4
## + previous 1 4347.0 4377.0
## <none> 4349.7 4377.7
## + age 1 4348.7 4378.7
## + cons.conf.idx 1 4349.4 4379.4
## + pdays 1 4349.7 4379.7
## + loan 2 4348.5 4380.5
## + day_of_week 4 4344.9 4380.9
## + housing 2 4349.6 4381.6
## - poutcome 2 4476.8 4500.8
## - month 9 4685.9 4695.9
## - nr.employed 1 5404.5 5430.5
## - duration 1 6965.8 6991.8
##
## Step: AIC=4316.81
## y ~ duration + nr.employed + month + poutcome + emp.var.rate
##
## Df Deviance AIC
## + cons.price.idx 1 4240.6 4272.6
## + education 7 4240.2 4284.2
## + job 11 4238.2 4290.2
## + euribor3m 1 4261.1 4293.1
## + default 1 4265.9 4297.9
## + marital 3 4279.9 4315.9
## + campaign 1 4284.4 4316.4
## <none> 4286.8 4316.8
## + cons.conf.idx 1 4285.8 4317.8
## + age 1 4286.5 4318.5
## + pdays 1 4286.5 4318.5
## + contact 1 4286.7 4318.7
## + previous 1 4286.8 4318.8
## + loan 2 4285.6 4319.6
## + housing 2 4286.7 4320.7
## + day_of_week 4 4283.0 4321.0
## - emp.var.rate 1 4349.7 4377.7
## - nr.employed 1 4392.3 4420.3
## - poutcome 2 4435.1 4461.1
## - month 9 4618.7 4630.7
## - duration 1 6954.4 6982.4
##
## Step: AIC=4272.57
## y ~ duration + nr.employed + month + poutcome + emp.var.rate +
## cons.price.idx
##
## Df Deviance AIC
## + education 7 4193.2 4239.2
## + job 11 4192.8 4246.8
## + default 1 4220.4 4254.4
## + euribor3m 1 4226.9 4260.9
## + cons.conf.idx 1 4236.3 4270.3
## + contact 1 4236.7 4270.7
## + marital 3 4233.2 4271.2
## + campaign 1 4238.5 4272.5
## <none> 4240.6 4272.6
## + age 1 4240.0 4274.0
## + pdays 1 4240.3 4274.3
## + previous 1 4240.5 4274.5
## + loan 2 4239.9 4275.9
## + day_of_week 4 4236.4 4276.4
## + housing 2 4240.4 4276.4
## - nr.employed 1 4250.6 4280.6
## - cons.price.idx 1 4286.8 4316.8
## - emp.var.rate 1 4310.3 4340.3
## - poutcome 2 4382.8 4410.8
## - month 9 4612.0 4626.0
## - duration 1 6930.5 6960.5
##
## Step: AIC=4239.19
## y ~ duration + nr.employed + month + poutcome + emp.var.rate +
## cons.price.idx + education
##
## Df Deviance AIC
## + euribor3m 1 4180.7 4228.7
## + default 1 4181.0 4229.0
## + job 11 4164.1 4232.1
## + cons.conf.idx 1 4189.1 4237.1
## + contact 1 4189.4 4237.4
## + campaign 1 4190.7 4238.7
## <none> 4193.2 4239.2
## + age 1 4192.6 4240.6
## + pdays 1 4192.7 4240.7
## + previous 1 4193.1 4241.1
## + marital 3 4189.7 4241.7
## + loan 2 4192.4 4242.4
## + day_of_week 4 4188.7 4242.7
## + housing 2 4193.1 4243.1
## - nr.employed 1 4203.9 4247.9
## - education 7 4240.6 4272.6
## - cons.price.idx 1 4240.2 4284.2
## - emp.var.rate 1 4263.8 4307.8
## - poutcome 2 4331.7 4373.7
## - month 9 4530.9 4558.9
## - duration 1 6899.3 6943.3
##
## Step: AIC=4228.7
## y ~ duration + nr.employed + month + poutcome + emp.var.rate +
## cons.price.idx + education + euribor3m
##
## Df Deviance AIC
## + contact 1 4167.1 4217.1
## + default 1 4168.5 4218.5
## + job 11 4153.7 4223.7
## - nr.employed 1 4181.2 4227.2
## + campaign 1 4178.6 4228.6
## <none> 4180.7 4228.7
## + cons.conf.idx 1 4179.5 4229.5
## + age 1 4180.4 4230.4
## + pdays 1 4180.4 4230.4
## + previous 1 4180.6 4230.6
## + marital 3 4177.3 4231.3
## + loan 2 4179.9 4231.9
## + housing 2 4180.5 4232.5
## + day_of_week 4 4176.7 4232.7
## - euribor3m 1 4193.2 4239.2
## - education 7 4226.9 4260.9
## - cons.price.idx 1 4215.8 4261.8
## - emp.var.rate 1 4259.6 4305.6
## - poutcome 2 4314.3 4358.3
## - month 9 4493.8 4523.8
## - duration 1 6887.7 6933.7
##
## Step: AIC=4217.1
## y ~ duration + nr.employed + month + poutcome + emp.var.rate +
## cons.price.idx + education + euribor3m + contact
##
## Df Deviance AIC
## + default 1 4155.3 4207.3
## + job 11 4140.0 4212.0
## - nr.employed 1 4167.7 4215.7
## <none> 4167.1 4217.1
## + campaign 1 4165.7 4217.7
## + age 1 4166.8 4218.8
## + previous 1 4166.8 4218.8
## + pdays 1 4166.8 4218.8
## + cons.conf.idx 1 4167.1 4219.1
## + marital 3 4163.9 4219.9
## + loan 2 4166.2 4220.2
## + day_of_week 4 4163.0 4221.0
## + housing 2 4167.0 4221.0
## - contact 1 4180.7 4228.7
## - euribor3m 1 4189.4 4237.4
## - education 7 4212.6 4248.6
## - cons.price.idx 1 4211.7 4259.7
## - emp.var.rate 1 4259.3 4307.3
## - poutcome 2 4299.5 4345.5
## - month 9 4435.9 4467.9
## - duration 1 6860.8 6908.8
##
## Step: AIC=4207.29
## y ~ duration + nr.employed + month + poutcome + emp.var.rate +
## cons.price.idx + education + euribor3m + contact + default
##
## Df Deviance AIC
## + job 11 4129.6 4203.6
## - nr.employed 1 4155.8 4205.8
## <none> 4155.3 4207.3
## + campaign 1 4154.0 4208.0
## + age 1 4154.4 4208.4
## + pdays 1 4155.0 4209.0
## + previous 1 4155.0 4209.0
## + cons.conf.idx 1 4155.2 4209.2
## + loan 2 4154.2 4210.2
## + marital 3 4152.5 4210.5
## + housing 2 4155.2 4211.2
## + day_of_week 4 4151.2 4211.2
## - default 1 4167.1 4217.1
## - contact 1 4168.5 4218.5
## - euribor3m 1 4177.4 4227.4
## - education 7 4193.3 4231.3
## - cons.price.idx 1 4198.9 4248.9
## - emp.var.rate 1 4245.0 4295.0
## - poutcome 2 4286.6 4334.6
## - month 9 4417.0 4451.0
## - duration 1 6857.1 6907.1
##
## Step: AIC=4203.56
## y ~ duration + nr.employed + month + poutcome + emp.var.rate +
## cons.price.idx + education + euribor3m + contact + default +
## job
##
## Df Deviance AIC
## - nr.employed 1 4130.5 4202.5
## <none> 4129.6 4203.6
## + campaign 1 4128.2 4204.2
## + previous 1 4129.3 4205.3
## + pdays 1 4129.3 4205.3
## + cons.conf.idx 1 4129.5 4205.5
## + age 1 4129.5 4205.5
## + loan 2 4128.5 4206.5
## + marital 3 4126.6 4206.6
## - job 11 4155.3 4207.3
## + housing 2 4129.4 4207.4
## + day_of_week 4 4125.7 4207.7
## - default 1 4140.0 4212.0
## - contact 1 4143.0 4215.0
## - education 7 4156.1 4216.1
## - euribor3m 1 4149.1 4221.1
## - cons.price.idx 1 4173.0 4245.0
## - emp.var.rate 1 4217.5 4289.5
## - poutcome 2 4257.7 4327.7
## - month 9 4366.9 4422.9
## - duration 1 6840.5 6912.5
##
## Step: AIC=4202.53
## y ~ duration + month + poutcome + emp.var.rate + cons.price.idx +
## education + euribor3m + contact + default + job
##
## Df Deviance AIC
## <none> 4130.5 4202.5
## + campaign 1 4129.2 4203.2
## + nr.employed 1 4129.6 4203.6
## + previous 1 4130.2 4204.2
## + cons.conf.idx 1 4130.3 4204.3
## + pdays 1 4130.3 4204.3
## + age 1 4130.5 4204.5
## + loan 2 4129.4 4205.4
## + marital 3 4127.6 4205.6
## - job 11 4155.8 4205.8
## + housing 2 4130.4 4206.4
## + day_of_week 4 4126.8 4206.8
## - default 1 4141.0 4211.0
## - contact 1 4143.9 4213.9
## - education 7 4157.0 4215.0
## - euribor3m 1 4163.7 4233.7
## - cons.price.idx 1 4254.4 4324.4
## - poutcome 2 4257.8 4325.8
## - emp.var.rate 1 4265.2 4335.2
## - month 9 4407.6 4461.6
## - duration 1 6840.5 6910.5
Now, let’s examine the model that we have obtained by performing the stepwise regression
summary(model_logistic_step)
##
## Call:
## glm(formula = y ~ duration + month + poutcome + emp.var.rate +
## cons.price.idx + education + euribor3m + contact + default +
## job, family = "binomial", data = train_downsample)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.1589 -0.3753 -0.0492 0.4757 2.6380
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.262e+02 2.132e+01 -10.612 < 2e-16 ***
## duration 6.906e-03 1.999e-04 34.556 < 2e-16 ***
## monthaug 8.954e-01 1.861e-01 4.810 1.51e-06 ***
## monthdec -2.278e-01 4.807e-01 -0.474 0.635494
## monthjul -1.358e-02 1.815e-01 -0.075 0.940389
## monthjun -1.001e+00 2.066e-01 -4.844 1.27e-06 ***
## monthmar 1.696e+00 2.404e-01 7.057 1.70e-12 ***
## monthmay -8.558e-01 1.437e-01 -5.953 2.63e-09 ***
## monthnov -9.105e-01 2.092e-01 -4.353 1.34e-05 ***
## monthoct -7.236e-02 2.671e-01 -0.271 0.786443
## monthsep 2.644e-01 2.805e-01 0.943 0.345807
## poutcomenonexistent 6.980e-01 1.212e-01 5.761 8.34e-09 ***
## poutcomesuccess 2.171e+00 2.165e-01 10.025 < 2e-16 ***
## emp.var.rate -2.356e+00 2.093e-01 -11.259 < 2e-16 ***
## cons.price.idx 2.345e+00 2.235e-01 10.492 < 2e-16 ***
## educationbasic.6y 4.323e-01 2.373e-01 1.821 0.068539 .
## educationbasic.9y 2.309e-01 1.826e-01 1.265 0.205943
## educationhigh.school 4.652e-01 1.789e-01 2.601 0.009296 **
## educationilliterate 1.316e+01 2.282e+02 0.058 0.954011
## educationprofessional.course 4.154e-01 1.979e-01 2.099 0.035777 *
## educationuniversity.degree 7.684e-01 1.823e-01 4.216 2.49e-05 ***
## educationunknown 6.305e-01 2.423e-01 2.603 0.009253 **
## euribor3m 8.823e-01 1.542e-01 5.723 1.05e-08 ***
## contacttelephone -5.218e-01 1.428e-01 -3.653 0.000259 ***
## defaultunknown -3.919e-01 1.218e-01 -3.217 0.001296 **
## jobblue-collar -2.335e-01 1.539e-01 -1.517 0.129272
## jobentrepreneur -8.190e-02 2.501e-01 -0.328 0.743280
## jobhousemaid -2.530e-02 2.856e-01 -0.089 0.929410
## jobmanagement -1.762e-01 1.683e-01 -1.047 0.295034
## jobretired 4.894e-01 1.935e-01 2.529 0.011444 *
## jobself-employed -1.873e-01 2.200e-01 -0.851 0.394540
## jobservices -1.751e-01 1.634e-01 -1.072 0.283753
## jobstudent 4.260e-01 2.275e-01 1.872 0.061143 .
## jobtechnician 1.063e-01 1.429e-01 0.744 0.457067
## jobunemployed 4.190e-01 2.606e-01 1.608 0.107821
## jobunknown -1.143e-01 3.886e-01 -0.294 0.768645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8886.1 on 6409 degrees of freedom
## Residual deviance: 4130.5 on 6374 degrees of freedom
## AIC: 4202.5
##
## Number of Fisher Scoring iterations: 12
Now, we are going to plot the model coefficients to examine which variable of the model increases the odds the most.
plot_coeffs <- function(mlr_model) {
coeffs <- exp(coefficients(mlr_model))
mp <- barplot(coeffs[order(coeffs,decreasing = T)], col="#3F97D0", xaxt='n', main="Regression Coefficients")
lablist <- names(coeffs)
text(mp, par("usr")[3], labels = lablist, srt = 45, adj = c(1.1,1.1), xpd = TRUE, cex=0.6)
}
plot_coeffs(model_logistic_step)
It seems like the intercept of the model impacts the odds the most.
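Since logistic regression coefficients are on the log-odds scale, exponentiating a coefficient gives the multiplicative change in the odds of subscribing for a one-unit increase in that predictor. As a quick numeric check (a small sketch using the fitted stepwise model above), we can list the largest odds ratios directly:
# Odds ratios: exponentiate the log-odds coefficients and sort them
odds_ratios <- exp(coef(model_logistic_step))
head(sort(odds_ratios, decreasing = TRUE))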
Next, we are going to predict and examine our model's performance.
predicted_step<-predict(model_logistic_step,x_test,type="response")
pred_step_test <- ifelse(predicted_step > 0.5, 1, 0) %>% as.factor()
cm_step<-confusionMatrix(pred_step_test,y_test,positive = "1")
cm_step
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9349 172
## 1 1573 1263
##
## Accuracy : 0.8588
## 95% CI : (0.8525, 0.8649)
## No Information Rate : 0.8839
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.5169
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8801
## Specificity : 0.8560
## Pos Pred Value : 0.4453
## Neg Pred Value : 0.9819
## Prevalence : 0.1161
## Detection Rate : 0.1022
## Detection Prevalence : 0.2295
## Balanced Accuracy : 0.8681
##
## 'Positive' Class : 1
##
Next, we are going to use K-nearest neighbors (KNN), which classifies an observation based on the classes of its nearest neighbors, with nearness measured by Euclidean distance.
Since we have quite a few categorical variables, one-hot encoding all of our qualitative data is necessary for the KNN algorithm, because Euclidean distance is only defined for numeric variables.
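To see why both one-hot encoding and scaling matter, here is a toy sketch (with made-up numbers, not taken from the bank data): the variable with the largest raw range dominates the Euclidean distance unless the variables are put on a common scale.
# Toy example: two observations described by two numeric variables
obs1 <- c(age = 30, duration = 100)
obs2 <- c(age = 60, duration = 900)
sqrt(sum((obs1 - obs2)^2))   # the distance is driven almost entirely by duration
# After standardizing each column, both variables contribute comparably
m <- scale(rbind(obs1, obs2))
sqrt(sum((m[1, ] - m[2, ])^2))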
treatplan<-designTreatmentsZ(df_bank,colnames(df_bank),verbose = F)
Now, we examine the score frame.
scoreFrame <- treatplan %>%
use_series(scoreFrame) %>%
select(varName, origName, code)
We only want the rows with codes "clean" or "lev".
newvars <- scoreFrame %>%
filter(code %in% c("clean","lev")) %>%
use_series(varName)
Now, we split the data into training and testing sets using the same ratio as before (70% training and 30% testing). We also need to scale our data prior to KNN.
df_bank.treat <- prepare(treatplan, df_bank, varRestriction = newvars)
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
row_data <- nrow(df_bank.treat)
index <- sample(row_data, row_data*0.7)
df_train.treat <- df_bank.treat[ index, ] %>% mutate(y=as.factor(y))
data_train.treat <- downSample(x = df_train.treat %>% select(-y),
y = df_train.treat $y, list = F,
yname = "y")
table(data_train.treat$y)
##
## 0 1
## 3277 3277
data_test.treat <- df_bank.treat[ -index, ]
data_train.treat<-as.data.frame(data_train.treat) %>% mutate(y=as.factor(y))
data_test.treat<-data_test.treat %>% mutate(y=as.factor(y))
x_train.treat<-as.matrix(as.data.frame(data_train.treat) %>% select(-y)) %>% scale()
y_train.treat<-data_train.treat$y
x_test.treat<-as.matrix(as.data.frame(data_test.treat) %>% select(-y)) %>% scale(center = attr(x_train.treat, "scaled:center"),scale=attr(x_train.treat, "scaled:scale"))
y_test.treat<-data_test.treat$y
First, we want to choose the number of neighbors k. A common rule of thumb is to set k to roughly the square root of the number of rows in the training data.
k_choose <- sqrt(nrow(x_train.treat)) %>% round()
k_choose
## [1] 81
Next, we want to classify the test data using the KNN model built from the training data. Since KNN does not accept character variables, we feed it the one-hot encoded test data. Now, let's predict using our KNN model.
pred_knn <- knn3Train(train = x_train.treat,
cl = y_train.treat,
test = x_test.treat,
k = k_choose
) %>%
as.factor()
Now, we evaluate the confusion matrix.
cm_knn<-confusionMatrix(pred_knn,y_test.treat,positive="1")
cm_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9662 493
## 1 1332 870
##
## Accuracy : 0.8523
## 95% CI : (0.8459, 0.8585)
## No Information Rate : 0.8897
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4073
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.63830
## Specificity : 0.87884
## Pos Pred Value : 0.39510
## Neg Pred Value : 0.95145
## Prevalence : 0.11030
## Detection Rate : 0.07041
## Detection Prevalence : 0.17820
## Balanced Accuracy : 0.75857
##
## 'Positive' Class : 1
##
Since lasso regression does not accept factor variables directly (it requires a numeric matrix), we are going to reuse the one-hot encoded data that we prepared for the KNN model.
Now, we will attempt a lasso regression to improve our model. Lasso regression is a regression method that minimizes a loss function plus a penalty on the size of the coefficients.
The lasso penalty is the sum of the absolute values of the estimated coefficients, multiplied by a tuning parameter lambda that can take any value from zero to infinity and controls how aggressively regularization is applied. As lambda increases, the coefficients shrink and eventually become exactly zero; in this way, lasso regression eliminates insignificant variables from the model.
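To see this shrinkage in action, here is a small sketch (reusing the scaled, one-hot encoded training matrices from the KNN step) that fits glmnet over its default sequence of lambda values and plots the coefficient paths; as log(lambda) increases, the coefficients shrink and drop to exactly zero.
# Coefficient paths over a decreasing sequence of lambda values
lasso_path <- glmnet(x_train.treat, y_train.treat, alpha = 1, family = "binomial")
plot(lasso_path, xvar = "lambda", label = TRUE)   # each curve is one coefficient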
First, we find a suitable lambda for our model using the cv.glmnet function, which cross-validates over a grid of lambda values.
set.seed(123)
cv.lasso <- cv.glmnet(x_train.treat, y_train.treat, alpha = 1, family = "binomial")
plot(cv.lasso)
Now we fit the model using lambda.min, the lambda value that gives the minimum mean cross-validated error.
lasso.model <- glmnet(x_train.treat, y_train.treat, alpha = 1, family = "binomial",
lambda = cv.lasso$lambda.min)
Now, let's predict using our model.
x.test <- model.matrix(y ~., data_test.treat)[,-1]
probabilities <- lasso.model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, 1, 0) %>% as.factor()
Now, we examine the confusion matrix.
cm_lasso<-confusionMatrix(predicted.classes,y_test.treat,positive = "1")
cm_lasso
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1969 3
## 1 9025 1360
##
## Accuracy : 0.2694
## 95% CI : (0.2616, 0.2773)
## No Information Rate : 0.8897
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0454
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9978
## Specificity : 0.1791
## Pos Pred Value : 0.1310
## Neg Pred Value : 0.9985
## Prevalence : 0.1103
## Detection Rate : 0.1101
## Detection Prevalence : 0.8404
## Balanced Accuracy : 0.5884
##
## 'Positive' Class : 1
##
Now, we will see whether doing feature selection prior to modeling would in fact improve our model's performance.
weights <- chi.squared(y~.,data_train)
print(weights)
## attr_importance
## age 0.194145100
## job 0.151235731
## marital 0.051383230
## education 0.069303355
## default 0.091688851
## housing 0.009279667
## loan 0.006945143
## contact 0.145106996
## month 0.272467621
## day_of_week 0.028561231
## duration 0.433594881
## campaign 0.060608541
## pdays 0.321069669
## previous 0.227880116
## poutcome 0.317159682
## emp.var.rate 0.371260832
## cons.price.idx 0.421444841
## cons.conf.idx 0.418885586
## euribor3m 0.423278970
## nr.employed 0.404613761
weights %>% arrange(desc(attr_importance))
subset <- cutoff.k(weights, 6)
subset
## [1] "duration" "euribor3m" "cons.price.idx" "cons.conf.idx"
## [5] "nr.employed" "emp.var.rate"
f <- as.simple.formula(subset, "Class")
print(f)
## Class ~ duration + euribor3m + cons.price.idx + cons.conf.idx +
## nr.employed + emp.var.rate
## <environment: 0x0000000027517a70>
We kept the 6 most important of the 20 candidate explanatory variables according to the chi-squared weights, and we will now see whether this is in fact useful. According to the variable notes on Kaggle, the model would not be realistic if we included the duration variable, since the call duration is only known after the call has taken place. So even though duration is one of the most significant variables, we will omit it from the model, leaving 5 predictors.
train_downsample_selected <-train_downsample %>% select(c(euribor3m,cons.price.idx,cons.conf.idx,emp.var.rate,nr.employed,y))
Now, we model using the selected variables.
model_selected<-glm(y~euribor3m+cons.price.idx+cons.conf.idx+emp.var.rate+nr.employed,train_downsample_selected,family="binomial")
Let's predict using our model.
predicted_selected<-predict(model_selected,data_test,type="response")
pred_step_adjusted <- ifelse(predicted_selected > 0.5, 1, 0) %>% as.factor()
Now, let's examine the confusion matrix.
cm_adjusted<-confusionMatrix(table(pred_step_adjusted,data_test$y),positive = "1")
cm_adjusted
## Confusion Matrix and Statistics
##
##
## pred_step_adjusted 0 1
## 0 7915 400
## 1 3007 1035
##
## Accuracy : 0.7243
## 95% CI : (0.7163, 0.7322)
## No Information Rate : 0.8839
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2493
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.72125
## Specificity : 0.72468
## Pos Pred Value : 0.25606
## Neg Pred Value : 0.95189
## Prevalence : 0.11613
## Detection Rate : 0.08376
## Detection Prevalence : 0.32710
## Balanced Accuracy : 0.72297
##
## 'Positive' Class : 1
##
We have now obtained several models. Since our purpose is to build a model for bank marketing, a false negative is far more harmful to the marketing strategy than a false positive: the goal of marketing is to reach as many potential subscribers as possible. We are therefore going to use sensitivity as the metric for comparing our models' performance.
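As a reminder of what sensitivity measures, it can be computed by hand from a confusion matrix as TP / (TP + FN), the share of actual subscribers the model correctly identifies. Using the stepwise model's counts shown earlier:
# Sensitivity of the stepwise model, computed from its confusion matrix counts
tp <- 1263   # predicted 1, actually 1
fn <- 172    # predicted 0, actually 1
tp / (tp + fn)   # roughly 0.8801, matching cm_step$byClass['Sensitivity']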
data.frame("Stepwise Regression"=cm_step$byClass['Sensitivity'],
"KNN"=cm_knn$byClass['Sensitivity'],
"Lasso Regression"=cm_lasso$byClass['Sensitivity'],
"Feature Selection"=cm_adjusted$byClass['Sensitivity'])Based on our comparison of the sensitivity in all of our models, we have come into terms that the stepwise logistic regression performs the best out of all our models. This means that after all of our efforts to improve the model, the initial model still performs the best. Thus, the stepwise logistic model is the final best model for this analysis.