A key aspect of increasing a company’s profitability is reaching its target audience effectively. As technology and analytical knowledge continue to expand, they can be used to build targeted marketing campaigns and improve the odds of identifying worthwhile investments.
The purpose of this case study is to enable a Portuguese bank to run a targeted marketing campaign that results in customers subscribing to a term deposit. This will be accomplished by building a model capable of predicting whether a client will subscribe to the term deposit. Data for this case study came from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The data will be used to determine the best predictors of term-deposit subscription, which model best predicts whether a client will subscribe, and how accurate that model can be.
A logistic regression model offered the best overall balance of accuracy, sensitivity, and specificity and was best able to predict whether a client would subscribe to a term deposit. The determining variables according to the model were campaign, consumer confidence index, contact, month, previous outcome, and number of employees.
a. Introduction/Background
This case study attempts to identify the best possible predictive model for finding customers who are likely to subscribe to a term deposit. The data must first be cleaned and the most effective model identified. Model effectiveness will be judged on accuracy, reliability, and interpretability, as shown throughout the case.
b. Purpose of study/importance of study/statement of problem
By identifying customers who are likely to subscribe to a term deposit, the bank can implement a targeted, data-driven marketing campaign that funnels resources to the most productive channels. Once the most likely subscribers are identified, they become the focus of the marketing effort.
c. Questions to be answered/conceptual statement of hypotheses
Which model is most accurate?
What variables should be kept?
Which variables most affect the likelihood of subscription?
Was the null hypothesis rejected?
d. Outline of remainder of report
As previously stated, the goal of this case is to create the best possible predictive model for finding a target audience for the bank’s marketing campaigns. As resources are allocated toward that interested audience, the results should demonstrate to the bank that this methodology and statistical analysis serve its best interest for maximum profitability.
a. Identification, classification, and operationalization of variables.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(corrplot)
## corrplot 0.92 loaded
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(e1071)
library(readr)
df <- read_delim("C:/Users/nhact/OneDrive - University of Texas at San Antonio/Documents/Spring 2023/MS 4203 Business Analytics Application/Week 3/bank-additional.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
## Rows: 4119 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (11): job, marital, education, default, housing, loan, contact, month, d...
## dbl (10): age, duration, campaign, pdays, previous, emp.var.rate, cons.price...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
There are 20 independent variables and 1 dependent variable in this data set.
Independent variables:
age (numeric)
job: type of job (categorical)
marital: marital status (categorical)
education (categorical)
default: has credit in default? (categorical)
housing: has housing loan? (categorical)
loan: has personal loan? (categorical)
contact: contact communication type (categorical)
month: last contact month of year (categorical)
day_of_week: last contact day of the week (categorical)
duration: last contact duration, in seconds (numeric). However, since this variable is highly correlated with our y-value (it is not known before a call is made), the team decided not to include it in the model.
campaign: number of contacts performed during this campaign and for this client (numeric)
pdays: number of days that passed after the client was last contacted during a previous campaign (numeric). The team decided to convert this into a categorical variable.
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign (categorical)
emp.var.rate: employment variation rate - quarterly indicator (numeric)
cons.price.idx: consumer price index - monthly indicator (numeric)
cons.conf.idx: consumer confidence index - monthly indicator (numeric)
euribor3m: euribor 3 month rate - daily indicator (numeric)
nr.employed: number of employees - quarterly indicator (numeric)
Dependent variable:
y: has the client subscribed to a term deposit? (binary: “yes”, “no”)
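Because several later decisions hinge on how imbalanced y is, it is worth tabulating the class distribution up front. A minimal sketch using the df loaded above:

table(df$y) # raw counts of "no" and "yes" responses
prop.table(table(df$y)) # proportions; "yes" is the minority class (about 12% of this sample)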
b. Statements of hypotheses being tested and/or models being developed
The null hypothesis is that all coefficients in the model are equal to zero, meaning none of the predictor variables has a statistically significant relationship with the response variable, y. The alternative hypothesis is that not all coefficients are equal to zero, i.e., at least some predictor variables have a significant relationship with y.
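One way to test this global null hypothesis (a sketch, assuming the full logistic model m1.log and training set dftrain built in the code listing below) is a likelihood-ratio chi-square test against an intercept-only model:

m0.log <- glm(y ~ 1, data = dftrain, family = binomial) # intercept-only (null) model
anova(m0.log, m1.log, test = "Chisq") # a small p-value rejects the null that all slopes are zero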
c. Sampling techniques, if full data is not being used.
We use an 80/20 train-test split, sampled randomly from the full data set.
d. Data collection process, including data sources, data size, etc. Primary/secondary?
The data are taken from S. Moro, P. Cortez, and P. Rita, “A Data-Driven Approach to Predict the Success of Bank Telemarketing,” Decision Support Systems, Elsevier, 62:22-31, June 2014 (http://archive.ics.uci.edu/ml/datasets/Bank+Marketing).
The bank-additional.csv file has 4,119 observations (10% of the full data set). There is no explicit information on whether this data set is primary or secondary.
e. Modeling analysis/techniques used
We develop logistic regression, linear discriminant analysis (LDA), and support vector machine (SVM) models to test which variables affect the subscription of bank term deposits. After comparing the accuracy, sensitivity, and specificity of each model, we can conclude which one fits the data best.
f. Methodological assumptions and limitations
The first assumption of binary logistic regression is that the dependent variable is binary; in this case, y is binary (“yes” or “no”). The second assumption is that the observations are independent of one another. The third assumption is that there is little or no multicollinearity among the independent variables; variables that showed high multicollinearity were removed from the model, as were variables (such as duration) that acted as near-perfect separators of the dependent variable y, which would otherwise prevent the coefficients from being estimated stably. The final assumption is a sufficiently large sample size; we used over 3,000 observations to ensure we met this criterion.
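The first and last of these assumptions can be spot-checked in a couple of lines; a minimal sketch against the cleaned df from the listing below:

levels(df$y) # assumption 1: the outcome is binary ("no", "yes")
nrow(df) # assumption 4: sample size (over 3,000 observations after cleaning)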
a. Data cleaning
Looking at the data structure, we notice that the pdays variable takes the value 999, which indicates the client has not been contacted before. Therefore, we create a new factor column called “contacted” based on pdays: rows with a 999 value are labeled “No” (not previously contacted) and all others “Yes.”
We also convert every character column into a factor with as.factor().
We also remove observations with unknown values from the data set (see the code listing below).
b. Data preprocessing
After running the cor() function, we find multicollinearity among loan, emp.var.rate, and euribor3m, so the team decided to remove these variables. Moreover, because duration effectively reveals the outcome of y, we removed duration from the model as well.
Data distribution: a histogram of age (see the code listing below) shows an approximately normal distribution.
c. Data limitations
a. Results presented in tables or charts when appropriate
For the continuous variables, the GVIF equals the usual VIF; for the categorical variables, we instead get one GVIF value per predictor (e.g., a single value covering all job categories, another covering all months, and so on).
To make GVIFs comparable across predictors with different degrees of freedom, we use GVIF^(1/(2*Df)). No GVIF^(1/(2*Df)) value is greater than 5, so we can say the remaining explanatory variables are not strongly collinear with one another.
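As a programmatic version of that rule (a sketch; m1.log is the full logistic model fitted in the listing below), the adjusted-GVIF column returned by car::vif() can be filtered against the threshold of 5:

gv <- car::vif(m1.log) # matrix with columns GVIF, Df, GVIF^(1/(2*Df))
gv[gv[, "GVIF^(1/(2*Df))"] > 5, , drop = FALSE] # an empty result means no predictor crosses the threshold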
Running a backward stepwise selection on this data set is important because it narrows the predictors down to those that contribute most to the model (by AIC). After the procedure, six variables remained: poutcome, campaign, cons.conf.idx, contact, month, and nr.employed. Consumer confidence index is the exception among these six final variables, as it is still not individually significant at the 0.05 level.
From this, we keep only these selected variables.
b. Results reported with respect to hypotheses/models.
For the logistic regression model: because the “yes” and “no” classes of y are highly imbalanced, we lowered the classification cutoff from 0.5 to 0.08 and were able to increase the model’s sensitivity. However, the model’s accuracy decreased. The new accuracy is 72.98%, with a sensitivity of 68.42% and a specificity of 73.61%. The logistic model offers a much better balance of these metrics than the other two models.
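The 0.08 cutoff was a judgment call; sweeping a few candidate cutoffs makes the trade-off visible. A sketch assuming the reduced model m1.log and test split dftest from the listing below:

predprob <- predict(m1.log, newdata = dftest, type = "response")
for (cut in c(0.5, 0.25, 0.1, 0.08, 0.05)) {
  pred <- ifelse(predprob >= cut, "yes", "no")
  cat(sprintf("cutoff %.2f: accuracy %.3f, sensitivity %.3f, specificity %.3f\n",
      cut,
      mean(pred == dftest$y), # overall accuracy
      mean(pred[dftest$y == "yes"] == "yes"), # true-positive rate
      mean(pred[dftest$y == "no"] == "no"))) # true-negative rate
}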
For the LDA model, the accuracy is 87.38%. Specificity is high at 95.02%, but sensitivity is low at 32.90%.
For the SVM model, the accuracy is 89.32%. Specificity is high at 99.08%, but sensitivity is extremely low at 19.73%.
Both the LDA and SVM models have high accuracy and specificity but low sensitivity, which makes them poor at predicting a “yes” response from a customer.
c. Factual information kept separate from interpretation, inference and evaluation.
There are four variables with p-values lower than 0.05: poutcome, contact, month, and nr.employed; that is, only four of the six final variables (66.7%) are individually significant. Looking at the facts from the full model’s confusion matrix, the accuracy is 73.00%, sensitivity is 72.97%, and specificity is 73.01%.
a. Discuss alternative methodologies
In conclusion, we recommend the logistic regression model discussed above. It offers the best balance of accuracy, sensitivity, and specificity and is the best of the three models at predicting which customers will subscribe to a term deposit. This will help tailor marketing campaigns to the customers most likely to subscribe.
The determining variables according to the model were campaign, consumer confidence index, contact, month, previous outcome, and number of employees. These are the variables used in the final model. In this model the null hypothesis was rejected, as several variables are significant.
Although this is the best model we were able to create, it could be improved with access to more data. There was a low number of “yes” responses for the dependent variable y. Because of the uneven number of “yes” and “no” responses, it was difficult to build a model that balanced overall accuracy, sensitivity, and specificity.
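One standard remedy we did not pursue (a hypothetical sketch, not part of the reported analysis) would be to re-fit the logistic regression with the minority class up-weighted instead of lowering the cutoff; whether it improves the balance of sensitivity and specificity would still need to be checked on the held-out test set:

w <- ifelse(dftrain$y == "yes", 7, 1) # hypothetical weight; ~7:1 is roughly the no:yes ratio here
m.weighted <- glm(y ~ ., data = dftrain, family = binomial, weights = w)
summary(m.weighted)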
# Change pdays into a categorical variable using ifelse()
df$contacted <- ifelse(df$pdays == 999, "No", "Yes")
df$contacted <- as.factor(df$contacted)
#Change character into factor variable
df$job <- as.factor(df$job)
df$marital <- as.factor(df$marital)
df$education <- as.factor(df$education)
df$default <- as.factor(df$default)
df$housing <- as.factor(df$housing)
df$loan <- as.factor(df$loan)
df$contact <- as.factor(df$contact)
df$month <- as.factor(df$month)
df$day_of_week <- as.factor(df$day_of_week)
df$poutcome <- as.factor(df$poutcome)
df$y <- as.factor(df$y)
str(df)
## spc_tbl_ [4,119 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:4119] 30 39 25 38 47 32 32 41 31 35 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 2 3 3 2 1 2 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 3 4 4 3 7 7 7 7 6 3 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 2 1 2 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 3 1 3 2 3 1 3 3 1 1 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 1 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 1 2 2 2 1 1 1 1 1 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 5 8 10 10 8 8 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
## $ duration : num [1:4119] 487 346 227 17 58 128 290 44 68 170 ...
## $ campaign : num [1:4119] 2 4 1 3 1 3 4 2 1 1 ...
## $ pdays : num [1:4119] 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : num [1:4119] 0 0 0 0 0 2 0 0 1 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ emp.var.rate : num [1:4119] -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
## $ cons.price.idx: num [1:4119] 92.9 94 94.5 94.5 93.2 ...
## $ cons.conf.idx : num [1:4119] -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
## $ euribor3m : num [1:4119] 1.31 4.86 4.96 4.96 4.19 ...
## $ nr.employed : num [1:4119] 5099 5191 5228 5228 5196 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ contacted : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. job = col_character(),
## .. marital = col_character(),
## .. education = col_character(),
## .. default = col_character(),
## .. housing = col_character(),
## .. loan = col_character(),
## .. contact = col_character(),
## .. month = col_character(),
## .. day_of_week = col_character(),
## .. duration = col_double(),
## .. campaign = col_double(),
## .. pdays = col_double(),
## .. previous = col_double(),
## .. poutcome = col_character(),
## .. emp.var.rate = col_double(),
## .. cons.price.idx = col_double(),
## .. cons.conf.idx = col_double(),
## .. euribor3m = col_double(),
## .. nr.employed = col_double(),
## .. y = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
### Remove unknown observations from variables
df <- df %>% filter(job != "unknown")
df <- df %>% filter(default != "unknown")
df <- df %>% filter(default != "1")
df <- df %>% filter(housing != "unknown")
df <- df %>% filter(marital != "unknown")
df <- df %>% filter(education != "unknown")
df <- df %>% filter(education != "illiterate")
### Check if the dependent variable is categorical
levels(df$y)
## [1] "no" "yes"
#Create correlation plot (must have numerical values)
df_num = select_if(df, is.numeric)
M = cor(df_num)
corrplot(M, method = "number")
#remove duration (perfect separator), take out high multicollinearity
df = dplyr::select(df, -c(duration,loan,pdays,emp.var.rate, euribor3m))
# Correlation plot after removing the correlated variables
df_num = select_if(df, is.numeric)
M = cor(df_num)
corrplot(M, method = "number")
#Create a training set
set.seed(1)
tr_ind = sample(nrow(df), 0.8*nrow(df), replace = F)
dftrain = df[tr_ind,]
dftest = df[-tr_ind,]
#Check for distribution
ggplot(data = df, aes(x = age)) +
geom_histogram(binwidth = 5, col = "white")
#Fit a logistic regression model
str(dftrain)
## tibble [2,471 × 17] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:2471] 42 25 30 27 30 29 30 33 48 29 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 1 1 8 5 1 2 10 1 5 1 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 3 3 2 2 3 2 3 3 2 2 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 7 7 4 7 7 4 7 4 7 7 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 3 1 3 1 3 3 1 3 3 3 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 1 2 1 1 2 1 1 1 1 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 1 4 4 2 4 8 1 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 5 5 1 2 3 5 2 4 4 2 ...
## $ campaign : num [1:2471] 1 1 2 1 8 1 1 5 2 1 ...
## $ previous : num [1:2471] 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ cons.price.idx: num [1:2471] 94 92.9 94 93.1 93.9 ...
## $ cons.conf.idx : num [1:2471] -36.4 -46.2 -36.4 -47.1 -42.7 -42.7 -36.1 -42.7 -30.1 -47.1 ...
## $ nr.employed : num [1:2471] 5191 5099 5191 5099 5228 ...
## $ y : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 2 1 ...
## $ contacted : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
m1.log = glm(y ~ ., data = dftrain, family = binomial)
summary(m1.log)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = dftrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1021 -0.4034 -0.3204 -0.2390 2.8558
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 35.048711 15.771993 2.222 0.026269 *
## age 0.012253 0.008672 1.413 0.157670
## jobblue-collar -0.187077 0.296801 -0.630 0.528490
## jobentrepreneur -0.599626 0.510054 -1.176 0.239749
## jobhousemaid -0.288891 0.554474 -0.521 0.602354
## jobmanagement -0.558815 0.311115 -1.796 0.072468 .
## jobretired -0.321940 0.389087 -0.827 0.407996
## jobself-employed -0.201862 0.386050 -0.523 0.601051
## jobservices -0.267175 0.322732 -0.828 0.407753
## jobstudent 0.348582 0.446224 0.781 0.434695
## jobtechnician 0.038904 0.239347 0.163 0.870877
## jobunemployed 0.199343 0.395327 0.504 0.614086
## maritalmarried 0.012259 0.248807 0.049 0.960702
## maritalsingle -0.041936 0.281377 -0.149 0.881523
## educationbasic.6y 0.084206 0.459416 0.183 0.854570
## educationbasic.9y 0.102711 0.352699 0.291 0.770887
## educationhigh.school -0.023885 0.345156 -0.069 0.944830
## educationprofessional.course -0.137456 0.371023 -0.370 0.711025
## educationuniversity.degree 0.097551 0.343205 0.284 0.776230
## defaultyes -9.133425 324.744152 -0.028 0.977562
## housingyes 0.148173 0.148013 1.001 0.316786
## contacttelephone -1.059622 0.275905 -3.841 0.000123 ***
## monthaug -0.590349 0.382540 -1.543 0.122774
## monthdec -0.560688 0.724744 -0.774 0.439146
## monthjul -0.362948 0.373758 -0.971 0.331509
## monthjun 0.570678 0.339035 1.683 0.092328 .
## monthmar 1.280600 0.464286 2.758 0.005812 **
## monthmay -0.551132 0.283048 -1.947 0.051518 .
## monthnov -0.652112 0.345354 -1.888 0.058994 .
## monthoct -0.242889 0.446462 -0.544 0.586420
## monthsep -0.453471 0.494707 -0.917 0.359329
## day_of_weekmon -0.029052 0.231774 -0.125 0.900249
## day_of_weekthu 0.211680 0.230968 0.916 0.359410
## day_of_weektue -0.068243 0.240210 -0.284 0.776336
## day_of_weekwed 0.133536 0.240247 0.556 0.578329
## campaign -0.069614 0.042928 -1.622 0.104882
## previous 0.072429 0.190560 0.380 0.703881
## poutcomenonexistent 0.620169 0.322976 1.920 0.054837 .
## poutcomesuccess 1.437520 0.669060 2.149 0.031669 *
## cons.price.idx 0.162356 0.158581 1.024 0.305927
## cons.conf.idx 0.036418 0.019038 1.913 0.055758 .
## nr.employed -0.009982 0.001192 -8.372 < 2e-16 ***
## contactedYes 0.521563 0.667831 0.781 0.434814
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1803.3 on 2470 degrees of freedom
## Residual deviance: 1374.0 on 2428 degrees of freedom
## AIC: 1460
##
## Number of Fisher Scoring iterations: 11
vif(m1.log)
## GVIF Df GVIF^(1/(2*Df))
## age 2.109618 1 1.452452
## job 7.025332 10 1.102385
## marital 1.471561 2 1.101399
## education 3.560508 5 1.135406
## default 1.000003 1 1.000001
## housing 1.045273 1 1.022386
## contact 2.106332 1 1.451321
## month 6.417824 9 1.108804
## day_of_week 1.146943 4 1.017285
## campaign 1.053389 1 1.026347
## previous 4.197651 1 2.048817
## poutcome 19.914791 2 2.112486
## cons.price.idx 2.137303 1 1.461952
## cons.conf.idx 2.517507 1 1.586665
## nr.employed 2.096932 1 1.448079
## contacted 8.738895 1 2.956162
### Model selection/variable selection
m2.log = step(m1.log, direction = "backward")
## Start: AIC=1459.97
## y ~ age + job + marital + education + default + housing + contact +
## month + day_of_week + campaign + previous + poutcome + cons.price.idx +
## cons.conf.idx + nr.employed + contacted
##
## Df Deviance AIC
## - job 10 1381.3 1447.3
## - education 5 1375.1 1451.1
## - day_of_week 4 1376.1 1454.1
## - marital 2 1374.1 1456.1
## - default 1 1374.0 1458.0
## - previous 1 1374.1 1458.1
## - contacted 1 1374.6 1458.6
## - housing 1 1375.0 1459.0
## - cons.price.idx 1 1375.0 1459.0
## - age 1 1376.0 1460.0
## <none> 1374.0 1460.0
## - campaign 1 1377.1 1461.1
## - cons.conf.idx 1 1377.7 1461.7
## - poutcome 2 1381.1 1463.1
## - contact 1 1390.7 1474.7
## - month 9 1413.6 1481.6
## - nr.employed 1 1439.5 1523.5
##
## Step: AIC=1447.28
## y ~ age + marital + education + default + housing + contact +
## month + day_of_week + campaign + previous + poutcome + cons.price.idx +
## cons.conf.idx + nr.employed + contacted
##
## Df Deviance AIC
## - education 5 1381.7 1437.7
## - day_of_week 4 1383.2 1441.2
## - marital 2 1381.3 1443.3
## - previous 1 1381.3 1445.3
## - default 1 1381.3 1445.3
## - contacted 1 1382.1 1446.1
## - housing 1 1382.1 1446.1
## - cons.price.idx 1 1382.4 1446.4
## - age 1 1382.4 1446.4
## <none> 1381.3 1447.3
## - campaign 1 1384.4 1448.4
## - cons.conf.idx 1 1385.0 1449.0
## - poutcome 2 1387.8 1449.8
## - contact 1 1397.1 1461.1
## - month 9 1423.5 1471.5
## - nr.employed 1 1452.8 1516.8
##
## Step: AIC=1437.74
## y ~ age + marital + default + housing + contact + month + day_of_week +
## campaign + previous + poutcome + cons.price.idx + cons.conf.idx +
## nr.employed + contacted
##
## Df Deviance AIC
## - day_of_week 4 1383.8 1431.8
## - marital 2 1381.8 1433.8
## - default 1 1381.8 1435.8
## - previous 1 1381.8 1435.8
## - contacted 1 1382.5 1436.5
## - housing 1 1382.7 1436.7
## - cons.price.idx 1 1382.8 1436.8
## - age 1 1382.8 1436.8
## <none> 1381.7 1437.7
## - campaign 1 1384.9 1438.9
## - cons.conf.idx 1 1385.5 1439.5
## - poutcome 2 1388.5 1440.5
## - contact 1 1397.8 1451.8
## - month 9 1424.4 1462.4
## - nr.employed 1 1453.3 1507.3
##
## Step: AIC=1431.8
## y ~ age + marital + default + housing + contact + month + campaign +
## previous + poutcome + cons.price.idx + cons.conf.idx + nr.employed +
## contacted
##
## Df Deviance AIC
## - marital 2 1383.8 1427.8
## - default 1 1383.9 1429.9
## - previous 1 1383.9 1429.9
## - contacted 1 1384.5 1430.5
## - age 1 1384.8 1430.8
## - housing 1 1384.8 1430.8
## - cons.price.idx 1 1384.9 1430.9
## <none> 1383.8 1431.8
## - campaign 1 1387.0 1433.0
## - cons.conf.idx 1 1387.8 1433.8
## - poutcome 2 1390.7 1434.7
## - contact 1 1400.5 1446.5
## - month 9 1425.6 1455.6
## - nr.employed 1 1454.5 1500.5
##
## Step: AIC=1427.83
## y ~ age + default + housing + contact + month + campaign + previous +
## poutcome + cons.price.idx + cons.conf.idx + nr.employed +
## contacted
##
## Df Deviance AIC
## - default 1 1383.9 1425.9
## - previous 1 1383.9 1425.9
## - contacted 1 1384.5 1426.5
## - housing 1 1384.8 1426.8
## - cons.price.idx 1 1384.9 1426.9
## - age 1 1385.0 1427.0
## <none> 1383.8 1427.8
## - campaign 1 1387.0 1429.0
## - cons.conf.idx 1 1387.8 1429.8
## - poutcome 2 1390.8 1430.8
## - contact 1 1400.5 1442.5
## - month 9 1426.0 1452.0
## - nr.employed 1 1455.7 1497.7
##
## Step: AIC=1425.89
## y ~ age + housing + contact + month + campaign + previous + poutcome +
## cons.price.idx + cons.conf.idx + nr.employed + contacted
##
## Df Deviance AIC
## - previous 1 1384.0 1424.0
## - contacted 1 1384.6 1424.6
## - housing 1 1384.9 1424.9
## - cons.price.idx 1 1385.0 1425.0
## - age 1 1385.0 1425.0
## <none> 1383.9 1425.9
## - campaign 1 1387.1 1427.1
## - cons.conf.idx 1 1387.9 1427.9
## - poutcome 2 1390.8 1428.8
## - contact 1 1400.5 1440.5
## - month 9 1426.1 1450.1
## - nr.employed 1 1455.9 1495.9
##
## Step: AIC=1423.95
## y ~ age + housing + contact + month + campaign + poutcome + cons.price.idx +
## cons.conf.idx + nr.employed + contacted
##
## Df Deviance AIC
## - contacted 1 1384.9 1422.9
## - housing 1 1385.0 1423.0
## - age 1 1385.1 1423.1
## - cons.price.idx 1 1385.2 1423.2
## <none> 1384.0 1424.0
## - campaign 1 1387.2 1425.2
## - cons.conf.idx 1 1388.0 1426.0
## - poutcome 2 1394.6 1430.6
## - contact 1 1400.9 1438.9
## - month 9 1426.1 1448.1
## - nr.employed 1 1458.0 1496.0
##
## Step: AIC=1422.88
## y ~ age + housing + contact + month + campaign + poutcome + cons.price.idx +
## cons.conf.idx + nr.employed
##
## Df Deviance AIC
## - housing 1 1385.9 1421.9
## - age 1 1386.0 1422.0
## - cons.price.idx 1 1386.3 1422.3
## <none> 1384.9 1422.9
## - campaign 1 1388.1 1424.1
## - cons.conf.idx 1 1389.0 1425.0
## - contact 1 1402.1 1438.1
## - month 9 1427.9 1447.9
## - poutcome 2 1430.2 1464.2
## - nr.employed 1 1459.7 1495.7
##
## Step: AIC=1421.89
## y ~ age + contact + month + campaign + poutcome + cons.price.idx +
## cons.conf.idx + nr.employed
##
## Df Deviance AIC
## - age 1 1387.0 1421.0
## - cons.price.idx 1 1387.2 1421.2
## <none> 1385.9 1421.9
## - campaign 1 1389.1 1423.1
## - cons.conf.idx 1 1389.8 1423.8
## - contact 1 1403.0 1437.0
## - month 9 1429.0 1447.0
## - poutcome 2 1432.2 1464.2
## - nr.employed 1 1460.5 1494.5
##
## Step: AIC=1420.99
## y ~ contact + month + campaign + poutcome + cons.price.idx +
## cons.conf.idx + nr.employed
##
## Df Deviance AIC
## - cons.price.idx 1 1388.3 1420.3
## <none> 1387.0 1421.0
## - campaign 1 1390.2 1422.2
## - cons.conf.idx 1 1391.2 1423.2
## - contact 1 1403.9 1435.9
## - month 9 1430.5 1446.5
## - poutcome 2 1433.6 1463.6
## - nr.employed 1 1462.7 1494.7
##
## Step: AIC=1420.28
## y ~ contact + month + campaign + poutcome + cons.conf.idx + nr.employed
##
## Df Deviance AIC
## <none> 1388.3 1420.3
## - campaign 1 1391.2 1421.2
## - cons.conf.idx 1 1391.2 1421.2
## - contact 1 1405.2 1435.2
## - month 9 1433.0 1447.0
## - poutcome 2 1436.4 1464.4
## - nr.employed 1 1466.0 1496.0
summary(m2.log)
##
## Call:
## glm(formula = y ~ contact + month + campaign + poutcome + cons.conf.idx +
## nr.employed, family = binomial, data = dftrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0815 -0.4012 -0.3324 -0.2349 2.8027
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 52.006402 5.671505 9.170 < 2e-16 ***
## contacttelephone -0.883998 0.223193 -3.961 7.47e-05 ***
## monthaug -0.378873 0.353554 -1.072 0.28389
## monthdec -0.443824 0.700902 -0.633 0.52659
## monthjul -0.166063 0.329153 -0.505 0.61390
## monthjun 0.645854 0.320473 2.015 0.04387 *
## monthmar 1.350693 0.445412 3.032 0.00243 **
## monthmay -0.591406 0.272404 -2.171 0.02993 *
## monthnov -0.564282 0.325688 -1.733 0.08317 .
## monthoct -0.165164 0.429591 -0.384 0.70063
## monthsep -0.276345 0.468964 -0.589 0.55568
## campaign -0.067370 0.042687 -1.578 0.11451
## poutcomenonexistent 0.464402 0.209077 2.221 0.02634 *
## poutcomesuccess 1.937029 0.291999 6.634 3.27e-11 ***
## cons.conf.idx 0.027423 0.016023 1.711 0.08699 .
## nr.employed -0.010294 0.001123 -9.165 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1803.3 on 2470 degrees of freedom
## Residual deviance: 1388.3 on 2455 degrees of freedom
## AIC: 1420.3
##
## Number of Fisher Scoring iterations: 6
df = dplyr::select(df, -c(education, default, housing, day_of_week, previous,
cons.price.idx, contacted, job, marital))
### training testing split after backward regression
set.seed(1)
tr_ind = sample(nrow(df), 0.8*nrow(df), replace = F)
dftrain = df[tr_ind,]
dftest = df[-tr_ind,]
### Build logistic regression model after backward regression
m1.log = glm(y ~ ., data = dftrain, family = binomial)
### Full regression model
m1.log_full=glm(y~., data = df, family = binomial)
summary(m1.log_full)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0236 -0.3908 -0.3372 -0.2361 2.7738
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 52.0327183 5.0389352 10.326 < 2e-16 ***
## age 0.0059845 0.0054162 1.105 0.269191
## contacttelephone -0.8912843 0.2004868 -4.446 8.76e-06 ***
## monthaug -0.3012133 0.3174764 -0.949 0.342736
## monthdec 0.1953431 0.5818323 0.336 0.737069
## monthjul -0.0167667 0.2932955 -0.057 0.954412
## monthjun 0.6610772 0.2904340 2.276 0.022836 *
## monthmar 1.3468500 0.3986763 3.378 0.000729 ***
## monthmay -0.5837322 0.2483575 -2.350 0.018755 *
## monthnov -0.4502407 0.2916878 -1.544 0.122692
## monthoct -0.1972481 0.3924872 -0.503 0.615274
## monthsep -0.6258314 0.4053230 -1.544 0.122581
## campaign -0.0675373 0.0384434 -1.757 0.078952 .
## poutcomenonexistent 0.3949287 0.1874960 2.106 0.035176 *
## poutcomesuccess 1.8036286 0.2603692 6.927 4.29e-12 ***
## cons.conf.idx 0.0314526 0.0146378 2.149 0.031656 *
## nr.employed -0.0103070 0.0009932 -10.377 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2264.2 on 3088 degrees of freedom
## Residual deviance: 1752.2 on 3072 degrees of freedom
## AIC: 1786.2
##
## Number of Fisher Scoring iterations: 6
### Make predictions for log reg using training data
predprob_train = predict.glm(m1.log, newdata = dftrain, type = "response")
predclass_log_train = ifelse(predprob_train >= 0.08, "yes", "no")
predclass_log_train <- as.factor(predclass_log_train)
caret::confusionMatrix(as.factor(predclass_log_train), as.factor(dftrain$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1586 76
## yes 591 218
##
## Accuracy : 0.7301
## 95% CI : (0.7121, 0.7475)
## No Information Rate : 0.881
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2674
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.74150
## Specificity : 0.72853
## Pos Pred Value : 0.26947
## Neg Pred Value : 0.95427
## Prevalence : 0.11898
## Detection Rate : 0.08822
## Detection Prevalence : 0.32740
## Balanced Accuracy : 0.73501
##
## 'Positive' Class : yes
##
### Make predictions for log reg using testing data
predprob = predict.glm(m1.log, newdata = dftest, type = "response")
predclass_log = ifelse(predprob >= 0.08, "yes", "no")
predclass_log <- as.factor(predclass_log)
caret::confusionMatrix(as.factor(predclass_log), as.factor(dftest$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 399 24
## yes 143 52
##
## Accuracy : 0.7298
## 95% CI : (0.6929, 0.7644)
## No Information Rate : 0.877
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2513
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.68421
## Specificity : 0.73616
## Pos Pred Value : 0.26667
## Neg Pred Value : 0.94326
## Prevalence : 0.12298
## Detection Rate : 0.08414
## Detection Prevalence : 0.31553
## Balanced Accuracy : 0.71019
##
## 'Positive' Class : yes
##
### Make predictions for log reg using full model
full_model = predict.glm(m1.log, newdata = df, type = "response")
predclass_log_full = ifelse(full_model >= 0.08, "yes", "no")
predclass_log_full <- as.factor(predclass_log_full)
caret::confusionMatrix(as.factor(predclass_log_full), as.factor(df$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1985 100
## yes 734 270
##
## Accuracy : 0.73
## 95% CI : (0.714, 0.7456)
## No Information Rate : 0.8802
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2642
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.72973
## Specificity : 0.73005
## Pos Pred Value : 0.26892
## Neg Pred Value : 0.95204
## Prevalence : 0.11978
## Detection Rate : 0.08741
## Detection Prevalence : 0.32502
## Balanced Accuracy : 0.72989
##
## 'Positive' Class : yes
##
### Build LDA model as comparison
m1.lda = lda(y ~ ., data = dftrain)
### Make predictions for lda
predclass_lda = predict(m1.lda, newdata = dftest)
caret::confusionMatrix(as.factor(predclass_lda$class), as.factor(dftest$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 515 51
## yes 27 25
##
## Accuracy : 0.8738
## 95% CI : (0.845, 0.8989)
## No Information Rate : 0.877
## P-Value [Acc > NIR] : 0.625554
##
## Kappa : 0.323
##
## Mcnemar's Test P-Value : 0.009208
##
## Sensitivity : 0.32895
## Specificity : 0.95018
## Pos Pred Value : 0.48077
## Neg Pred Value : 0.90989
## Prevalence : 0.12298
## Detection Rate : 0.04045
## Detection Prevalence : 0.08414
## Balanced Accuracy : 0.63957
##
## 'Positive' Class : yes
##
#Build the SVM model:
form1 = y ~ .
tuned = tune.svm(form1, data = dftrain, gamma = seq(0.1, .1, by = 0.01), cost = seq(.1, 1, by = .1)) # note: seq(0.1, .1, by = 0.01) is the single value 0.1, so only cost is effectively tuned here
mysvm = svm(formula = form1, data = dftrain, gamma = tuned$best.parameters$gamma, cost = tuned$best.parameters$cost) # this line might take 2-3 minutes to run
summary(mysvm) # display the SVM model fitted on the training data
##
## Call:
## svm(formula = form1, data = dftrain, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.6
##
## Number of Support Vectors: 701
##
## ( 287 414 )
##
##
## Number of Classes: 2
##
## Levels:
## no yes
svmpredict = predict(mysvm, dftest, type = 'response') # classify the test data; the confusion matrix follows
caret::confusionMatrix(as.factor(svmpredict), as.factor(dftest$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 537 61
## yes 5 15
##
## Accuracy : 0.8932
## 95% CI : (0.8661, 0.9164)
## No Information Rate : 0.877
## P-Value [Acc > NIR] : 0.121
##
## Kappa : 0.2754
##
## Mcnemar's Test P-Value : 1.288e-11
##
## Sensitivity : 0.19737
## Specificity : 0.99077
## Pos Pred Value : 0.75000
## Neg Pred Value : 0.89799
## Prevalence : 0.12298
## Detection Rate : 0.02427
## Detection Prevalence : 0.03236
## Balanced Accuracy : 0.59407
##
## 'Positive' Class : yes
##