Introduction

In this post, we are going to build a simple classification model on bank marketing data from Kaggle, predicting whether an individual will end up making a deposit or not.

This data is obtained from https://www.kaggle.com/janiobachmann/bank-marketing-dataset

Data Preparation

First of all, we are going to load the necessary libraries and dataset for this modeling.

Loading required libraries

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(caret)
library(vtreat)
library(FSelector)
library(glmnet)
library(e1071)
library(magrittr)
library(wesanderson)
library(ggpubr)

Now, we want to load the corresponding dataset.

df_bank<-read_delim("bank-additional-full.csv",delim = ";")

Now, we want to inspect the columns of the data

str(df_bank)
## tibble [41,188 x 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age           : num [1:41188] 56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : chr [1:41188] "housemaid" "services" "services" "admin." ...
##  $ marital       : chr [1:41188] "married" "married" "married" "married" ...
##  $ education     : chr [1:41188] "basic.4y" "high.school" "high.school" "basic.6y" ...
##  $ default       : chr [1:41188] "no" "unknown" "no" "no" ...
##  $ housing       : chr [1:41188] "no" "no" "yes" "no" ...
##  $ loan          : chr [1:41188] "no" "no" "no" "no" ...
##  $ contact       : chr [1:41188] "telephone" "telephone" "telephone" "telephone" ...
##  $ month         : chr [1:41188] "may" "may" "may" "may" ...
##  $ day_of_week   : chr [1:41188] "mon" "mon" "mon" "mon" ...
##  $ duration      : num [1:41188] 261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : num [1:41188] 1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : num [1:41188] 999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : num [1:41188] 0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : chr [1:41188] "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num [1:41188] 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num [1:41188] 94 94 94 94 94 ...
##  $ cons.conf.idx : num [1:41188] -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num [1:41188] 4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num [1:41188] 5191 5191 5191 5191 5191 ...
##  $ y             : chr [1:41188] "no" "no" "no" "no" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   job = col_character(),
##   ..   marital = col_character(),
##   ..   education = col_character(),
##   ..   default = col_character(),
##   ..   housing = col_character(),
##   ..   loan = col_character(),
##   ..   contact = col_character(),
##   ..   month = col_character(),
##   ..   day_of_week = col_character(),
##   ..   duration = col_double(),
##   ..   campaign = col_double(),
##   ..   pdays = col_double(),
##   ..   previous = col_double(),
##   ..   poutcome = col_character(),
##   ..   emp.var.rate = col_double(),
##   ..   cons.price.idx = col_double(),
##   ..   cons.conf.idx = col_double(),
##   ..   euribor3m = col_double(),
##   ..   nr.employed = col_double(),
##   ..   y = col_character()
##   .. )

Variable Description

This dataset contains 21 columns: 20 predictors and 1 variable to be predicted. The variable to be predicted is y, which records whether the customer ended up making a deposit in the bank or not. Here is a description of the rest of the variables:

Client Data

  • Age (numeric)
  • Job : type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’)
  • Marital : marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’ ; note: ‘divorced’ means divorced or widowed)
  • Education (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’,
    ‘professional.course’, ‘university.degree’, ‘unknown’)
  • Default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)
  • Housing: has housing loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
  • Loan: has personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)

Related with the last contact of the current campaign:

  • Contact: contact communication type (categorical: ‘cellular’,‘telephone’)
  • Month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
  • Day_of_week: last contact day of the week (categorical: ‘mon’, ‘tue’, ‘wed’, ‘thu’, ‘fri’)
  • Duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other attributes

  • Campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  • Pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  • Previous: number of contacts performed before this campaign and for this client (numeric)
  • Poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)

Social and economic context attributes

  • Emp.var.rate: employment variation rate - quarterly indicator (numeric)
  • Cons.price.idx: consumer price index - monthly indicator (numeric)
  • Cons.conf.idx: consumer confidence index - monthly indicator (numeric)
  • Euribor3m: euribor 3 month rate - daily indicator (numeric)
  • Nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

  • y: has the client subscribed a term deposit? (binary: ‘yes’, ‘no’)

Data Preprocessing

Let’s inspect the dataset

head(df_bank)

Notice that in the job column there is a formatting inconsistency: in row 4, “admin.” contains a trailing full stop that the other job values do not.

Now, we want all the formats to be consistent, so we delete the full stop.

df_bank$job <- gsub(".", "", df_bank$job, fixed = TRUE)  # fixed = TRUE matches "." literally, not as a regex wildcard

Let’s inspect our job column again

table(df_bank$job)
## 
##         admin   blue-collar  entrepreneur     housemaid    management 
##         10422          9254          1456          1060          2924 
##       retired self-employed      services       student    technician 
##          1720          1421          3969           875          6743 
##    unemployed       unknown 
##          1014           330

All done. Now let’s check whether there are any missing values.

colSums(is.na(df_bank))
##            age            job        marital      education        default 
##              0              0              0              0              0 
##        housing           loan        contact          month    day_of_week 
##              0              0              0              0              0 
##       duration       campaign          pdays       previous       poutcome 
##              0              0              0              0              0 
##   emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
##              0              0              0              0              0 
##              y 
##              0

Perfect, let’s move on to the next step.

Exploratory Data Analysis

Boxplot

Now, let’s explore the distributions and inspect outliers in the numerical variables. First, we are going to create boxplots.

df_bank %>% 
  select_if(is.numeric) %>% 
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>% 
  ggplot(aes(y = value)) + geom_boxplot() + facet_wrap(~variable, scales = "free")

It seems like the distributions of our variables are skewed, especially age, campaign, duration, pdays, and previous. To get a clearer picture of the distributions, we are going to use histograms.

Histogram

Now, let’s create the histogram.

df_bank %>% 
  select_if(is.numeric) %>% 
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>% 
  ggplot(aes(y = value)) + geom_histogram() + facet_wrap(~variable, scales = "free") + coord_flip() + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As we can see from the charts above:

  • The variables pdays and euribor3m are skewed to the left (for pdays, this is driven by the 999 sentinel meaning the client was not previously contacted).
  • The variables campaign, duration and previous are skewed to the right.
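
To back up the visual impression with numbers, we can compute the sample skewness of each numeric variable with the skewness() function from e1071 (already loaded above); positive values indicate right skew, negative values left skew. A quick sketch:

df_bank %>% 
  select_if(is.numeric) %>% 
  summarise_all(e1071::skewness)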

Bar Plot

a<-df_bank %>%group_by(y,marital) %>% count() %>% 
ggplot(aes(reorder(marital,n),n))+geom_col(aes(fill=y))+theme_light()+coord_flip()+scale_fill_manual(values=wes_palette(n=2, name="BottleRocket1"))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+ggtitle("Marital Status")

b<-df_bank %>%group_by(y,job) %>% count() %>% 
ggplot(aes(reorder(job,n),n))+geom_col(aes(fill=y))+theme_light()+coord_flip()+scale_fill_manual(values=wes_palette(n=2, name="BottleRocket1"))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+ggtitle("Job")

figure<-ggarrange(a, b,
          common.legend = TRUE, legend = "bottom")
annotate_figure(figure,top = text_grob("Term Deposit Subscription Based on Marital Status and Job", color = "black", face = "bold", size = 15))

From the plot above, we can see that most of the people who ended up making a term deposit subscription:

  • are married
  • work in admin jobs

c<-df_bank %>%group_by(y,education) %>% count() %>% 
ggplot(aes(reorder(education,n),n))+geom_col(aes(fill=y))+theme_light()+coord_flip()+scale_fill_manual(values=wes_palette(n=2, name="BottleRocket1"))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+ggtitle("Education")

d<-df_bank %>%group_by(y,default) %>% count() %>% 
ggplot(aes(reorder(default,n),n))+geom_col(aes(fill=y))+theme_light()+coord_flip()+scale_fill_manual(values=wes_palette(n=2, name="BottleRocket1"))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+ggtitle("Default")

figure2<-ggarrange(c, d,
          common.legend = TRUE, legend = "bottom")
annotate_figure(figure2,top = text_grob("Term Deposit Subscription Based on Education and Credit Default", color = "black", face = "bold", size = 15))

From the plot above, we can see that most of the people who ended up making a term deposit subscription:

  • have a university degree
  • do not have credit in default

e<-df_bank %>%group_by(y,housing) %>% count() %>% 
ggplot(aes(reorder(housing,n),n))+geom_col(aes(fill=y))+theme_light()+coord_flip()+scale_fill_manual(values=wes_palette(n=2, name="BottleRocket1"))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+ggtitle("Housing")

f<-df_bank %>%group_by(y,loan) %>% count() %>% 
ggplot(aes(reorder(loan,n),n))+geom_col(aes(fill=y))+theme_light()+coord_flip()+scale_fill_manual(values=wes_palette(n=2, name="BottleRocket1"))+theme(axis.title.x=element_blank(),axis.title.y=element_blank(),legend.title = element_blank(),plot.title = element_text(size=10))+ggtitle("Loan")

figure3<-ggarrange(e, f,
          common.legend = TRUE, legend = "bottom")
annotate_figure(figure3,top = text_grob("Term Deposit Subscription Based on Housing and Loan Status", color = "black", face = "bold", size = 15))

From the plot above, we can see that most of the people who ended up making a term deposit subscription:

  • have a housing loan
  • do not have a personal loan

Data Preparation for Modeling

Since our first model, the logistic regression, outputs probabilities, we first convert the target variable into a numeric 0/1 encoding. We also convert the qualitative variables into factors.

df_bank <- df_bank %>% 
  mutate(y = ifelse(y == "yes", 1, 0)) %>% 
  mutate_if(is.character, as.factor)

Splitting Data

To reduce the chance of overfitting, we are going to split the data into a training and a testing set so that we can check our model’s performance after the modeling step. In this project, we split the data 70% into the train set and 30% into the test set.

RNGkind(sample.kind = "Rounding")
set.seed(125)
row_data <- nrow(df_bank)

index <- sample(row_data, row_data*0.7)

data_train <- df_bank[ index, ]
data_test <- df_bank[ -index, ]
data_train<-data_train %>% mutate(y=as.factor(y))
data_test<-data_test %>% mutate(y=as.factor(y))
x_train<-as.data.frame(data_train) %>% select(-y)
y_train<-data_train$y
x_test<-as.data.frame(data_test) %>% select(-y)
y_test<-data_test$y

Downsampling

As we’ve seen earlier in the EDA process, the classes of the response variable are in fact very imbalanced.

prop.table(table(y_train))
## y_train
##         0         1 
## 0.8888349 0.1111651

As shown in the result above, around 89% of the responses correspond to 0, and only around 11% correspond to 1.

Thus, we need to downsample or upsample so the model can learn equally from both classes.

Since we have 41,188 rows of data, which is quite hefty, we are going to downsample instead of upsample.
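
For reference, upsampling with caret has the same interface; a sketch we do not run here, since it would duplicate minority-class rows until both classes match the majority count:

# train_upsample <- upSample(x = x_train, y = y_train, yname = "y")
# table(train_upsample$y)  # both classes would equal the majority count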

set.seed(123)
train_downsample <- downSample(x = x_train,
                               y = y_train,
                               yname = "y")
train_downsample<-train_downsample %>% mutate(y=as.factor(y))    

Logistic Regression

Modeling

Now, let’s attempt a stepwise logistic regression on our data. Starting from the null model, step() repeatedly adds or drops a single variable (direction = "both"), keeping whichever change lowers the AIC the most, and stops when no step improves it.

model_full<-glm(y~.,train_downsample, family = "binomial")

model_null<-glm(y~1,train_downsample, family = "binomial")
model_logistic_step <- step(model_null, 
                   scope = list(lower = model_null, upper = model_full),
                   direction = "both"
                   )
## Start:  AIC=8888.15
## y ~ 1
## 
##                  Df Deviance    AIC
## + duration        1   6990.8 6994.8
## + nr.employed     1   7417.1 7421.1
## + euribor3m       1   7612.9 7616.9
## + emp.var.rate    1   7700.2 7704.2
## + poutcome        2   8177.1 8183.1
## + pdays           1   8189.9 8193.9
## + month           9   8205.0 8225.0
## + previous        1   8438.6 8442.6
## + contact         1   8528.1 8532.1
## + job            11   8560.3 8584.3
## + cons.price.idx  1   8657.3 8661.3
## + default         1   8691.8 8695.8
## + education       7   8771.7 8787.7
## + campaign        1   8797.4 8801.4
## + marital         3   8830.9 8838.9
## + cons.conf.idx   1   8837.6 8841.6
## + day_of_week     4   8870.5 8880.5
## + age             1   8878.4 8882.4
## + loan            2   8879.9 8885.9
## + housing         2   8881.4 8887.4
## <none>                8886.1 8888.1
## 
## Step:  AIC=6994.85
## y ~ duration
## 
##                  Df Deviance    AIC
## + nr.employed     1   4851.8 4857.8
## + euribor3m       1   5065.0 5071.0
## + emp.var.rate    1   5123.1 5129.1
## + month           9   6006.5 6028.5
## + poutcome        2   6136.6 6144.6
## + pdays           1   6171.7 6177.7
## + previous        1   6403.9 6409.9
## + job            11   6506.7 6532.7
## + contact         1   6572.6 6578.6
## + cons.price.idx  1   6609.8 6615.8
## + default         1   6678.8 6684.8
## + education       7   6812.2 6830.2
## + cons.conf.idx   1   6877.6 6883.6
## + campaign        1   6891.5 6897.5
## + marital         3   6924.9 6934.9
## + age             1   6973.1 6979.1
## + day_of_week     4   6978.4 6990.4
## + loan            2   6982.6 6990.6
## + housing         2   6986.0 6994.0
## <none>                6990.8 6994.8
## - duration        1   8886.1 8888.1
## 
## Step:  AIC=4857.78
## y ~ duration + nr.employed
## 
##                  Df Deviance    AIC
## + month           9   4476.8 4500.8
## + poutcome        2   4685.9 4695.9
## + job            11   4727.0 4755.0
## + pdays           1   4751.9 4759.9
## + education       7   4757.7 4777.7
## + default         1   4802.3 4810.3
## + emp.var.rate    1   4807.3 4815.3
## + contact         1   4812.7 4820.7
## + cons.conf.idx   1   4823.4 4831.4
## + cons.price.idx  1   4829.4 4837.4
## + marital         3   4841.6 4853.6
## + campaign        1   4846.5 4854.5
## <none>                4851.8 4857.8
## + loan            2   4847.8 4857.8
## + euribor3m       1   4851.2 4859.2
## + age             1   4851.4 4859.4
## + previous        1   4851.7 4859.7
## + housing         2   4851.3 4861.3
## + day_of_week     4   4847.7 4861.7
## - nr.employed     1   6990.8 6994.8
## - duration        1   7417.1 7421.1
## 
## Step:  AIC=4500.84
## y ~ duration + nr.employed + month
## 
##                  Df Deviance    AIC
## + poutcome        2   4349.7 4377.7
## + pdays           1   4407.5 4433.5
## + emp.var.rate    1   4435.1 4461.1
## + education       7   4425.8 4463.8
## + job            11   4423.3 4469.3
## + default         1   4450.1 4476.1
## + cons.price.idx  1   4454.1 4480.1
## + euribor3m       1   4469.0 4495.0
## + campaign        1   4470.7 4496.7
## + marital         3   4468.6 4498.6
## + contact         1   4474.4 4500.4
## <none>                4476.8 4500.8
## + cons.conf.idx   1   4474.8 4500.8
## + age             1   4476.2 4502.2
## + previous        1   4476.8 4502.8
## + loan            2   4475.6 4503.6
## + day_of_week     4   4472.1 4504.1
## + housing         2   4476.7 4504.7
## - month           9   4851.8 4857.8
## - nr.employed     1   6006.5 6028.5
## - duration        1   7108.1 7130.1
## 
## Step:  AIC=4377.69
## y ~ duration + nr.employed + month + poutcome
## 
##                  Df Deviance    AIC
## + emp.var.rate    1   4286.8 4316.8
## + cons.price.idx  1   4310.3 4340.3
## + education       7   4302.6 4344.6
## + job            11   4301.2 4351.2
## + default         1   4323.8 4353.8
## + euribor3m       1   4330.2 4360.2
## + contact         1   4344.4 4374.4
## + marital         3   4341.3 4375.3
## + campaign        1   4345.4 4375.4
## + previous        1   4347.0 4377.0
## <none>                4349.7 4377.7
## + age             1   4348.7 4378.7
## + cons.conf.idx   1   4349.4 4379.4
## + pdays           1   4349.7 4379.7
## + loan            2   4348.5 4380.5
## + day_of_week     4   4344.9 4380.9
## + housing         2   4349.6 4381.6
## - poutcome        2   4476.8 4500.8
## - month           9   4685.9 4695.9
## - nr.employed     1   5404.5 5430.5
## - duration        1   6965.8 6991.8
## 
## Step:  AIC=4316.81
## y ~ duration + nr.employed + month + poutcome + emp.var.rate
## 
##                  Df Deviance    AIC
## + cons.price.idx  1   4240.6 4272.6
## + education       7   4240.2 4284.2
## + job            11   4238.2 4290.2
## + euribor3m       1   4261.1 4293.1
## + default         1   4265.9 4297.9
## + marital         3   4279.9 4315.9
## + campaign        1   4284.4 4316.4
## <none>                4286.8 4316.8
## + cons.conf.idx   1   4285.8 4317.8
## + age             1   4286.5 4318.5
## + pdays           1   4286.5 4318.5
## + contact         1   4286.7 4318.7
## + previous        1   4286.8 4318.8
## + loan            2   4285.6 4319.6
## + housing         2   4286.7 4320.7
## + day_of_week     4   4283.0 4321.0
## - emp.var.rate    1   4349.7 4377.7
## - nr.employed     1   4392.3 4420.3
## - poutcome        2   4435.1 4461.1
## - month           9   4618.7 4630.7
## - duration        1   6954.4 6982.4
## 
## Step:  AIC=4272.57
## y ~ duration + nr.employed + month + poutcome + emp.var.rate + 
##     cons.price.idx
## 
##                  Df Deviance    AIC
## + education       7   4193.2 4239.2
## + job            11   4192.8 4246.8
## + default         1   4220.4 4254.4
## + euribor3m       1   4226.9 4260.9
## + cons.conf.idx   1   4236.3 4270.3
## + contact         1   4236.7 4270.7
## + marital         3   4233.2 4271.2
## + campaign        1   4238.5 4272.5
## <none>                4240.6 4272.6
## + age             1   4240.0 4274.0
## + pdays           1   4240.3 4274.3
## + previous        1   4240.5 4274.5
## + loan            2   4239.9 4275.9
## + day_of_week     4   4236.4 4276.4
## + housing         2   4240.4 4276.4
## - nr.employed     1   4250.6 4280.6
## - cons.price.idx  1   4286.8 4316.8
## - emp.var.rate    1   4310.3 4340.3
## - poutcome        2   4382.8 4410.8
## - month           9   4612.0 4626.0
## - duration        1   6930.5 6960.5
## 
## Step:  AIC=4239.19
## y ~ duration + nr.employed + month + poutcome + emp.var.rate + 
##     cons.price.idx + education
## 
##                  Df Deviance    AIC
## + euribor3m       1   4180.7 4228.7
## + default         1   4181.0 4229.0
## + job            11   4164.1 4232.1
## + cons.conf.idx   1   4189.1 4237.1
## + contact         1   4189.4 4237.4
## + campaign        1   4190.7 4238.7
## <none>                4193.2 4239.2
## + age             1   4192.6 4240.6
## + pdays           1   4192.7 4240.7
## + previous        1   4193.1 4241.1
## + marital         3   4189.7 4241.7
## + loan            2   4192.4 4242.4
## + day_of_week     4   4188.7 4242.7
## + housing         2   4193.1 4243.1
## - nr.employed     1   4203.9 4247.9
## - education       7   4240.6 4272.6
## - cons.price.idx  1   4240.2 4284.2
## - emp.var.rate    1   4263.8 4307.8
## - poutcome        2   4331.7 4373.7
## - month           9   4530.9 4558.9
## - duration        1   6899.3 6943.3
## 
## Step:  AIC=4228.7
## y ~ duration + nr.employed + month + poutcome + emp.var.rate + 
##     cons.price.idx + education + euribor3m
## 
##                  Df Deviance    AIC
## + contact         1   4167.1 4217.1
## + default         1   4168.5 4218.5
## + job            11   4153.7 4223.7
## - nr.employed     1   4181.2 4227.2
## + campaign        1   4178.6 4228.6
## <none>                4180.7 4228.7
## + cons.conf.idx   1   4179.5 4229.5
## + age             1   4180.4 4230.4
## + pdays           1   4180.4 4230.4
## + previous        1   4180.6 4230.6
## + marital         3   4177.3 4231.3
## + loan            2   4179.9 4231.9
## + housing         2   4180.5 4232.5
## + day_of_week     4   4176.7 4232.7
## - euribor3m       1   4193.2 4239.2
## - education       7   4226.9 4260.9
## - cons.price.idx  1   4215.8 4261.8
## - emp.var.rate    1   4259.6 4305.6
## - poutcome        2   4314.3 4358.3
## - month           9   4493.8 4523.8
## - duration        1   6887.7 6933.7
## 
## Step:  AIC=4217.1
## y ~ duration + nr.employed + month + poutcome + emp.var.rate + 
##     cons.price.idx + education + euribor3m + contact
## 
##                  Df Deviance    AIC
## + default         1   4155.3 4207.3
## + job            11   4140.0 4212.0
## - nr.employed     1   4167.7 4215.7
## <none>                4167.1 4217.1
## + campaign        1   4165.7 4217.7
## + age             1   4166.8 4218.8
## + previous        1   4166.8 4218.8
## + pdays           1   4166.8 4218.8
## + cons.conf.idx   1   4167.1 4219.1
## + marital         3   4163.9 4219.9
## + loan            2   4166.2 4220.2
## + day_of_week     4   4163.0 4221.0
## + housing         2   4167.0 4221.0
## - contact         1   4180.7 4228.7
## - euribor3m       1   4189.4 4237.4
## - education       7   4212.6 4248.6
## - cons.price.idx  1   4211.7 4259.7
## - emp.var.rate    1   4259.3 4307.3
## - poutcome        2   4299.5 4345.5
## - month           9   4435.9 4467.9
## - duration        1   6860.8 6908.8
## 
## Step:  AIC=4207.29
## y ~ duration + nr.employed + month + poutcome + emp.var.rate + 
##     cons.price.idx + education + euribor3m + contact + default
## 
##                  Df Deviance    AIC
## + job            11   4129.6 4203.6
## - nr.employed     1   4155.8 4205.8
## <none>                4155.3 4207.3
## + campaign        1   4154.0 4208.0
## + age             1   4154.4 4208.4
## + pdays           1   4155.0 4209.0
## + previous        1   4155.0 4209.0
## + cons.conf.idx   1   4155.2 4209.2
## + loan            2   4154.2 4210.2
## + marital         3   4152.5 4210.5
## + housing         2   4155.2 4211.2
## + day_of_week     4   4151.2 4211.2
## - default         1   4167.1 4217.1
## - contact         1   4168.5 4218.5
## - euribor3m       1   4177.4 4227.4
## - education       7   4193.3 4231.3
## - cons.price.idx  1   4198.9 4248.9
## - emp.var.rate    1   4245.0 4295.0
## - poutcome        2   4286.6 4334.6
## - month           9   4417.0 4451.0
## - duration        1   6857.1 6907.1
## 
## Step:  AIC=4203.56
## y ~ duration + nr.employed + month + poutcome + emp.var.rate + 
##     cons.price.idx + education + euribor3m + contact + default + 
##     job
## 
##                  Df Deviance    AIC
## - nr.employed     1   4130.5 4202.5
## <none>                4129.6 4203.6
## + campaign        1   4128.2 4204.2
## + previous        1   4129.3 4205.3
## + pdays           1   4129.3 4205.3
## + cons.conf.idx   1   4129.5 4205.5
## + age             1   4129.5 4205.5
## + loan            2   4128.5 4206.5
## + marital         3   4126.6 4206.6
## - job            11   4155.3 4207.3
## + housing         2   4129.4 4207.4
## + day_of_week     4   4125.7 4207.7
## - default         1   4140.0 4212.0
## - contact         1   4143.0 4215.0
## - education       7   4156.1 4216.1
## - euribor3m       1   4149.1 4221.1
## - cons.price.idx  1   4173.0 4245.0
## - emp.var.rate    1   4217.5 4289.5
## - poutcome        2   4257.7 4327.7
## - month           9   4366.9 4422.9
## - duration        1   6840.5 6912.5
## 
## Step:  AIC=4202.53
## y ~ duration + month + poutcome + emp.var.rate + cons.price.idx + 
##     education + euribor3m + contact + default + job
## 
##                  Df Deviance    AIC
## <none>                4130.5 4202.5
## + campaign        1   4129.2 4203.2
## + nr.employed     1   4129.6 4203.6
## + previous        1   4130.2 4204.2
## + cons.conf.idx   1   4130.3 4204.3
## + pdays           1   4130.3 4204.3
## + age             1   4130.5 4204.5
## + loan            2   4129.4 4205.4
## + marital         3   4127.6 4205.6
## - job            11   4155.8 4205.8
## + housing         2   4130.4 4206.4
## + day_of_week     4   4126.8 4206.8
## - default         1   4141.0 4211.0
## - contact         1   4143.9 4213.9
## - education       7   4157.0 4215.0
## - euribor3m       1   4163.7 4233.7
## - cons.price.idx  1   4254.4 4324.4
## - poutcome        2   4257.8 4325.8
## - emp.var.rate    1   4265.2 4335.2
## - month           9   4407.6 4461.6
## - duration        1   6840.5 6910.5

Now, let’s examine the model that we obtained from the stepwise regression.

summary(model_logistic_step) 
## 
## Call:
## glm(formula = y ~ duration + month + poutcome + emp.var.rate + 
##     cons.price.idx + education + euribor3m + contact + default + 
##     job, family = "binomial", data = train_downsample)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.1589  -0.3753  -0.0492   0.4757   2.6380  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -2.262e+02  2.132e+01 -10.612  < 2e-16 ***
## duration                      6.906e-03  1.999e-04  34.556  < 2e-16 ***
## monthaug                      8.954e-01  1.861e-01   4.810 1.51e-06 ***
## monthdec                     -2.278e-01  4.807e-01  -0.474 0.635494    
## monthjul                     -1.358e-02  1.815e-01  -0.075 0.940389    
## monthjun                     -1.001e+00  2.066e-01  -4.844 1.27e-06 ***
## monthmar                      1.696e+00  2.404e-01   7.057 1.70e-12 ***
## monthmay                     -8.558e-01  1.437e-01  -5.953 2.63e-09 ***
## monthnov                     -9.105e-01  2.092e-01  -4.353 1.34e-05 ***
## monthoct                     -7.236e-02  2.671e-01  -0.271 0.786443    
## monthsep                      2.644e-01  2.805e-01   0.943 0.345807    
## poutcomenonexistent           6.980e-01  1.212e-01   5.761 8.34e-09 ***
## poutcomesuccess               2.171e+00  2.165e-01  10.025  < 2e-16 ***
## emp.var.rate                 -2.356e+00  2.093e-01 -11.259  < 2e-16 ***
## cons.price.idx                2.345e+00  2.235e-01  10.492  < 2e-16 ***
## educationbasic.6y             4.323e-01  2.373e-01   1.821 0.068539 .  
## educationbasic.9y             2.309e-01  1.826e-01   1.265 0.205943    
## educationhigh.school          4.652e-01  1.789e-01   2.601 0.009296 ** 
## educationilliterate           1.316e+01  2.282e+02   0.058 0.954011    
## educationprofessional.course  4.154e-01  1.979e-01   2.099 0.035777 *  
## educationuniversity.degree    7.684e-01  1.823e-01   4.216 2.49e-05 ***
## educationunknown              6.305e-01  2.423e-01   2.603 0.009253 ** 
## euribor3m                     8.823e-01  1.542e-01   5.723 1.05e-08 ***
## contacttelephone             -5.218e-01  1.428e-01  -3.653 0.000259 ***
## defaultunknown               -3.919e-01  1.218e-01  -3.217 0.001296 ** 
## jobblue-collar               -2.335e-01  1.539e-01  -1.517 0.129272    
## jobentrepreneur              -8.190e-02  2.501e-01  -0.328 0.743280    
## jobhousemaid                 -2.530e-02  2.856e-01  -0.089 0.929410    
## jobmanagement                -1.762e-01  1.683e-01  -1.047 0.295034    
## jobretired                    4.894e-01  1.935e-01   2.529 0.011444 *  
## jobself-employed             -1.873e-01  2.200e-01  -0.851 0.394540    
## jobservices                  -1.751e-01  1.634e-01  -1.072 0.283753    
## jobstudent                    4.260e-01  2.275e-01   1.872 0.061143 .  
## jobtechnician                 1.063e-01  1.429e-01   0.744 0.457067    
## jobunemployed                 4.190e-01  2.606e-01   1.608 0.107821    
## jobunknown                   -1.143e-01  3.886e-01  -0.294 0.768645    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8886.1  on 6409  degrees of freedom
## Residual deviance: 4130.5  on 6374  degrees of freedom
## AIC: 4202.5
## 
## Number of Fisher Scoring iterations: 12

Plotting Coefficients

Now, we are going to plot the exponentiated model coefficients (odds ratios) to examine which variables increase the odds the most.

plot_coeffs <- function(mlr_model) {
      coeffs <- exp(coefficients(mlr_model))
      mp <- barplot(coeffs[order(coeffs,decreasing = T)], col="#3F97D0", xaxt='n', main="Regression Coefficients")
      lablist <- names(coeffs)
      text(mp, par("usr")[3], labels = lablist, srt = 45, adj = c(1.1,1.1), xpd = TRUE, cex=0.6)
}

plot_coeffs(model_logistic_step)

The educationilliterate dummy shows by far the largest exponentiated coefficient, but its enormous standard error (the category is tiny) makes it unreliable; among the stable predictors, cons.price.idx and poutcomesuccess increase the odds the most.
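
To read the odds ratios directly rather than eyeballing the bars, we can sort the exponentiated coefficients; a quick check:

exp(coef(model_logistic_step)) %>% sort(decreasing = TRUE) %>% head()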

Prediction

Next, we are going to predict on the test set and examine our model’s performance.

predicted_step<-predict(model_logistic_step,x_test,type="response")
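# 0.5 is the conventional probability cutoff; moving it would trade sensitivity against precision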
pred_step_test <- ifelse(predicted_step > 0.5, 1, 0) %>% as.factor()
cm_step<-confusionMatrix(pred_step_test,y_test,positive = "1")
cm_step
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 9349  172
##          1 1573 1263
##                                           
##                Accuracy : 0.8588          
##                  95% CI : (0.8525, 0.8649)
##     No Information Rate : 0.8839          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.5169          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8801          
##             Specificity : 0.8560          
##          Pos Pred Value : 0.4453          
##          Neg Pred Value : 0.9819          
##              Prevalence : 0.1161          
##          Detection Rate : 0.1022          
##    Detection Prevalence : 0.2295          
##       Balanced Accuracy : 0.8681          
##                                           
##        'Positive' Class : 1               
## 

KNN

Next, we are going to use K-nearest neighbors (KNN), which classifies an observation based on the classes of its nearest neighbors, measured by Euclidean distance.
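
As a minimal sketch of the distance involved, with two hypothetical scaled feature vectors:

# Euclidean distance: the straight-line distance between two points
euclidean <- function(a, b) sqrt(sum((a - b)^2))
euclidean(c(0.2, -1.1, 0.5), c(-0.3, 0.4, 0.5))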

One Hot Encoding and Feature Scaling

Since KNN measures Euclidean distance, and we have quite a few categorical variables, we need to one-hot encode all of our qualitative data first.
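
For intuition, one-hot encoding simply expands a categorical column into 0/1 indicator columns; a minimal base-R illustration (vtreat, used below, does this more robustly):

head(model.matrix(~ job - 1, data = df_bank))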

treatplan<-designTreatmentsZ(df_bank,colnames(df_bank),verbose = F)

Now, we examine the score frame

scoreFrame <- treatplan %>%
    use_series(scoreFrame) %>%
    select(varName, origName, code)

We only want the rows with codes “clean” or “lev”

newvars <- scoreFrame %>%
    filter(code %in% c("clean","lev")) %>%
    use_series(varName)

Now, we split the treated data into training and testing sets using the same 70/30 ratio as in the previous step. We also need to scale our data prior to KNN, applying the training set’s centering and scaling to the test set.

df_bank.treat <- prepare(treatplan, df_bank, varRestriction = newvars)
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(123)
row_data <- nrow(df_bank.treat)
index <- sample(row_data, row_data*0.7)

df_train.treat <- df_bank.treat[ index, ] %>% mutate(y=as.factor(y))
data_train.treat <- downSample(x = df_train.treat %>% select(-y),
                               y = df_train.treat$y, list = FALSE,
                               yname = "y")
table(data_train.treat$y)
## 
##    0    1 
## 3277 3277
data_test.treat <- df_bank.treat[ -index, ]
data_train.treat<-as.data.frame(data_train.treat) %>% mutate(y=as.factor(y))
data_test.treat<-data_test.treat %>% mutate(y=as.factor(y))
x_train.treat<-as.matrix(as.data.frame(data_train.treat) %>% select(-y)) %>% scale()
y_train.treat<-data_train.treat$y
x_test.treat<-as.matrix(as.data.frame(data_test.treat) %>% select(-y)) %>%  scale(center = attr(x_train.treat, "scaled:center"),scale=attr(x_train.treat, "scaled:scale"))
y_test.treat<-data_test.treat$y

Modeling

Firstly, we want to choose the number of neighbors k. A common rule of thumb is to set k to the square root of the number of rows in the training data.

k_choose <- sqrt(nrow(x_train.treat)) %>% round()
k_choose
## [1] 81
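
This is only a rule of thumb, not a guarantee of optimality. As a quick sanity check, one could compare test accuracy for a few k values around it (a sketch with a hypothetical grid; a fuller search would cross-validate, e.g. with caret::train):

for (k in c(21, 51, 81, 121)) {
  pred <- knn3Train(train = x_train.treat, test = x_test.treat,
                    cl = y_train.treat, k = k)
  cat("k =", k, "accuracy =", round(mean(pred == y_test.treat), 4), "\n")
}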

Prediction

Next, we want to classify the test data with the KNN fit on the training data. Since KNN does not accept character variables, we use the one-hot encoded, scaled test matrix prepared above. Now, let’s predict using our KNN model.

pred_knn <- knn3Train(train = x_train.treat, 
                      cl = y_train.treat, 
                      test = x_test.treat,
                      k = k_choose
                      ) %>% 
  as.factor()

Now, we evaluate the confusion matrix.

cm_knn<-confusionMatrix(pred_knn,y_test.treat,positive="1")
cm_knn
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 9662  493
##          1 1332  870
##                                           
##                Accuracy : 0.8523          
##                  95% CI : (0.8459, 0.8585)
##     No Information Rate : 0.8897          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4073          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.63830         
##             Specificity : 0.87884         
##          Pos Pred Value : 0.39510         
##          Neg Pred Value : 0.95145         
##              Prevalence : 0.11030         
##          Detection Rate : 0.07041         
##    Detection Prevalence : 0.17820         
##       Balanced Accuracy : 0.75857         
##                                           
##        'Positive' Class : 1               
## 

Model Improvement

Lasso Regression

Since glmnet does not accept a data frame with factor columns (it requires a numeric matrix), we are going to reuse the one-hot encoded data from the KNN modeling.

Now, we will attempt a lasso regression to improve our model. Lasso regression is a regression method that minimizes a loss function plus a penalty on the size of the model’s coefficients.

The penalty is lambda times the sum of the absolute values of the estimated coefficients; lambda can take any value from zero to infinity and decides how aggressively regularization is performed. As lambda increases, the coefficients shrink and some eventually become exactly zero, so lasso regression eliminates insignificant variables from our model.
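
A quick sketch of this shrinkage on our own training matrix (assuming x_train.treat and y_train.treat from the KNN section are still in scope):

lasso_path <- glmnet(x_train.treat, y_train.treat, alpha = 1, family = "binomial")
plot(lasso_path, xvar = "lambda", label = TRUE)  # each curve is one coefficient shrinking toward zero
sum(coef(lasso_path, s = 0.01) != 0)             # how many coefficients survive at lambda = 0.01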

First, we find a suitable lambda for our model using the cv.glmnet function.

Modeling

set.seed(123)
cv.lasso <- cv.glmnet(x_train.treat, y_train.treat, alpha = 1, family = "binomial")
plot(cv.lasso)

Now we fit the model using lambda.min, the value of lambda that gives the minimum mean cross-validated error.
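
cv.glmnet also reports lambda.1se, the largest lambda whose error is within one standard error of the minimum; it is a more conservative choice that yields a sparser model and could be tried instead:

cv.lasso$lambda.min
cv.lasso$lambda.1se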

lasso.model <- glmnet(x_train.treat, y_train.treat, alpha = 1, family = "binomial",
                      lambda = cv.lasso$lambda.min)

Prediction

Now, let’s predict using our model

# Use the test matrix that was scaled with the training set's centering/scaling
# (x_test.treat), and request probabilities with type = "response";
# the default would return values on the link (log-odds) scale.
probabilities <- lasso.model %>% predict(newx = x_test.treat, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0) %>% as.factor()

Now, we examine the confusion matrix.

cm_lasso<-confusionMatrix(predicted.classes,y_test.treat,positive = "1")
cm_lasso
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1969    3
##          1 9025 1360
##                                           
##                Accuracy : 0.2694          
##                  95% CI : (0.2616, 0.2773)
##     No Information Rate : 0.8897          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0454          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9978          
##             Specificity : 0.1791          
##          Pos Pred Value : 0.1310          
##          Neg Pred Value : 0.9985          
##              Prevalence : 0.1103          
##          Detection Rate : 0.1101          
##    Detection Prevalence : 0.8404          
##       Balanced Accuracy : 0.5884          
##                                           
##        'Positive' Class : 1               
## 

Feature Selection

Now, we will see whether doing feature selection prior to the modeling process will in fact improve our model’s performance.

weights <- chi.squared(y~.,data_train)
print(weights)
##                attr_importance
## age                0.194145100
## job                0.151235731
## marital            0.051383230
## education          0.069303355
## default            0.091688851
## housing            0.009279667
## loan               0.006945143
## contact            0.145106996
## month              0.272467621
## day_of_week        0.028561231
## duration           0.433594881
## campaign           0.060608541
## pdays              0.321069669
## previous           0.227880116
## poutcome           0.317159682
## emp.var.rate       0.371260832
## cons.price.idx     0.421444841
## cons.conf.idx      0.418885586
## euribor3m          0.423278970
## nr.employed        0.404613761
weights %>% arrange(desc(attr_importance))
subset <- cutoff.k(weights, 6)
subset
## [1] "duration"       "euribor3m"      "cons.price.idx" "cons.conf.idx" 
## [5] "nr.employed"    "emp.var.rate"
f <- as.simple.formula(subset, "Class")
print(f)
## Class ~ duration + euribor3m + cons.price.idx + cons.conf.idx + 
##     nr.employed + emp.var.rate
## <environment: 0x0000000027517a70>

We kept the top 6 of the 20 candidate explanatory variables ranked by chi-squared importance. As the variable description on Kaggle notes, the model would not be realistic if it included the duration variable, so even though duration is the most important variable, we omit it from the model, leaving 5 predictors.

train_downsample_selected <-train_downsample %>% select(c(euribor3m,cons.price.idx,cons.conf.idx,emp.var.rate,nr.employed,y))

Modeling

Model using the selected variables

model_selected<-glm(y~euribor3m+cons.price.idx+cons.conf.idx+emp.var.rate+nr.employed,train_downsample_selected,family="binomial")

Prediction

Let’s predict using our model

predicted_selected<-predict(model_selected,data_test,type="response")
pred_step_adjusted <- ifelse(predicted_selected > 0.5, 1, 0) %>% as.factor()

Now, let’s examine the confusion matrix

cm_adjusted<-confusionMatrix(table(pred_step_adjusted,data_test$y),positive = "1")
cm_adjusted
## Confusion Matrix and Statistics
## 
##                   
## pred_step_adjusted    0    1
##                  0 7915  400
##                  1 3007 1035
##                                           
##                Accuracy : 0.7243          
##                  95% CI : (0.7163, 0.7322)
##     No Information Rate : 0.8839          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2493          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.72125         
##             Specificity : 0.72468         
##          Pos Pred Value : 0.25606         
##          Neg Pred Value : 0.95189         
##              Prevalence : 0.11613         
##          Detection Rate : 0.08376         
##    Detection Prevalence : 0.32710         
##       Balanced Accuracy : 0.72297         
##                                           
##        'Positive' Class : 1               
## 

General Recap and Conclusion

We have obtained several models. Since our purpose is to generate a model for bank marketing, a false negative (a missed subscriber) is far more harmful to the marketing strategy than a false positive. Because the goal of marketing is to capture as many actual positives as possible, we are going to use sensitivity as the metric to compare our models’ performance.
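
As a reminder, sensitivity is TP / (TP + FN). For the stepwise model it can be recovered by hand from the confusion matrix (rows are predictions, columns the reference):

cm_step$table["1", "1"] / sum(cm_step$table[, "1"])  # 1263 / (1263 + 172) = 0.8801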

data.frame("Stepwise Regression"=cm_step$byClass['Sensitivity'],
           "KNN"=cm_knn$byClass['Sensitivity'],
           "Lasso Regression"=cm_lasso$byClass['Sensitivity'],
           "Feature Selection"=cm_adjusted$byClass['Sensitivity'])

Comparing sensitivity across our models, the lasso model attains the highest raw value, but only by labeling almost everyone positive (its specificity is just 0.18), which is useless for targeting. Among the models with reasonable specificity, the stepwise logistic regression performs the best. This means that after all of our efforts to improve the model, the initial model still performs best, so the stepwise logistic model is the final model for this analysis.