A key aspect of increasing a company’s profitability is reaching its target audience effectively. As technology and analytical knowledge continue to expand, they can be used to build targeted marketing campaigns and improve the odds of identifying worthwhile investments.
The purpose of this case study is to enable a Portuguese bank to run a targeted marketing campaign that results in customers subscribing to a term deposit. This will be accomplished by building a model capable of predicting whether a client will subscribe to the term deposit. Data for this case study came from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The data will be used to determine the best predictors of term-deposit subscription, which model best predicts whether a client will subscribe, and how accurate that model can be.
A logistic regression model offered the best overall balance of accuracy, sensitivity, and specificity and was best able to predict whether a client would subscribe to a term deposit. The determining variables according to the model were campaign, consumer confidence index, contact, month, previous outcome, and number of employees.
a. Introduction/Background
This case study attempts to identify the best possible predictive model for finding customers who are likely to subscribe to a term deposit. The data must first be cleaned and the most effective model identified. Model effectiveness will be judged on accuracy, reliability, and interpretability, as shown throughout the case.
b. Purpose of study/importance of study/statement of problem
By identifying customers who are likely to subscribe to a term deposit, the bank can implement a targeted, data-driven marketing campaign that funnels resources to the most productive channels. Once the most likely subscribers are identified, they become the focus of the marketing effort.
c. Questions to be answered/conceptual statement of hypotheses
Which model is most accurate?
What variables should be kept?
Which variables most affect the likelihood of subscription?
Was the null hypothesis rejected?
d. Outline of remainder of report
As previously stated, the goal of this case is to create the best possible predictive model for finding a target audience for the bank’s marketing campaigns. As resources are allocated toward that interested audience, the results should demonstrate to the bank that this methodology and statistical analysis serve its best interest for maximum profitability.
a. Identification, classification, and operationalization of variables.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(corrplot)
## corrplot 0.92 loaded
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(e1071)
library(readr)
df <- read_delim("C:/Users/nhact/OneDrive - University of Texas at San Antonio/Documents/Spring 2023/MS 4203 Business Analytics Application/Week 3/bank-additional.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
## Rows: 4119 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (11): job, marital, education, default, housing, loan, contact, month, d...
## dbl (10): age, duration, campaign, pdays, previous, emp.var.rate, cons.price...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
There are 20 independent variables and 1 dependent variable in this data set.
Independent variables:
age (numeric)
job: type of job (categorical)
marital: marital status (categorical)
education (categorical)
default: has credit in default? (categorical)
housing: has housing loan? (categorical)
loan: has personal loan? (categorical)
contact: contact communication type (categorical)
month: last contact month of year (categorical)
day_of_week: last contact day of the week (categorical)
duration: last contact duration, in seconds (numeric). However, since this variable is highly correlated with our y-value (it is not known before a call is made), the team decided not to include it in the model.
campaign: number of contacts performed during this campaign and for this client (numeric)
pdays: number of days that passed after the client was last contacted during a previous campaign (numeric). The team decided to convert this into a categorical variable.
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign (categorical)
emp.var.rate: employment variation rate - quarterly indicator (numeric)
cons.price.idx: consumer price index - monthly indicator (numeric)
cons.conf.idx: consumer confidence index - monthly indicator (numeric)
euribor3m: euribor 3 month rate - daily indicator (numeric)
nr.employed: number of employees - quarterly indicator (numeric)
Dependent variable:
y: has the client subscribed to a term deposit? (binary: “yes”, “no”)
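Because several later decisions hinge on how imbalanced y is, it is worth tabulating the class distribution up front. A minimal sketch using the df loaded above:

table(df$y) # raw counts of "no" and "yes" responses
prop.table(table(df$y)) # proportions; "yes" is the minority class (about 12% of this sample)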
b. Statements of hypotheses being tested and/or models being developed
The null hypothesis is that all coefficients in the model are equal to zero, meaning none of the predictor variables has a statistically significant relationship with the response variable, y. The alternative hypothesis is that not all coefficients are equal to zero, i.e., at least some predictor variables have a significant relationship with y.
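One way to test this global null hypothesis (a sketch, assuming the full logistic model m1.log and training set dftrain built in the code listing below) is a likelihood-ratio chi-square test against an intercept-only model:

m0.log <- glm(y ~ 1, data = dftrain, family = binomial) # intercept-only (null) model
anova(m0.log, m1.log, test = "Chisq") # a small p-value rejects the null that all slopes are zero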
c. Sampling techniques, if full data is not being used.
We use an 80/20 train-test split, sampled randomly from the full data set.
d. Data collection process, including data sources, data size, etc. Primary/secondary?
The data are taken from S. Moro, P. Cortez, and P. Rita, “A Data-Driven Approach to Predict the Success of Bank Telemarketing,” Decision Support Systems, Elsevier, 62:22-31, June 2014 (http://archive.ics.uci.edu/ml/datasets/Bank+Marketing).
The bank-additional.csv file has 4,119 observations (10% of the full data set). There is no explicit information on whether this data set is primary or secondary.
e. Modeling analysis/techniques used
We develop logistic regression, linear discriminant analysis (LDA), and support vector machine (SVM) models to test which variables affect the subscription of bank term deposits. After comparing the accuracy, sensitivity, and specificity of each model, we can conclude which one fits the data best.
f. Methodological assumptions and limitations
The first assumption of binary logistic regression is that the dependent variable is binary; in this case, y is binary (“yes” or “no”). The second assumption is that the observations are independent of one another. The third assumption is that there is little or no multicollinearity among the independent variables; variables that showed high multicollinearity were removed from the model, as were variables (such as duration) that acted as near-perfect separators of the dependent variable y, which would otherwise prevent the coefficients from being estimated stably. The final assumption is a sufficiently large sample size; we used over 3,000 observations to ensure we met this criterion.
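The first and last of these assumptions can be spot-checked in a couple of lines; a minimal sketch against the cleaned df from the listing below:

levels(df$y) # assumption 1: the outcome is binary ("no", "yes")
nrow(df) # assumption 4: sample size (over 3,000 observations after cleaning)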
a. Data cleaning
Looking at the data structure, we notice that the pdays variable takes the value 999, which indicates the client has not been contacted before. Therefore, we create a new factor column called “contacted” based on pdays: rows with a 999 value are labeled “No” (not previously contacted) and all others “Yes.”
We also convert every character column into a factor with as.factor().
We also remove observations with unknown values from the data set (see the code listing below).
b. Data preprocessing
After running the cor() function, we find multicollinearity among loan, emp.var.rate, and euribor3m, so the team decided to remove these variables. Moreover, because duration effectively reveals the outcome of y, we removed duration from the model as well.
Data distribution: a histogram of age (see the code listing below) shows an approximately normal distribution.
c. Data limitations
a. Results presented in tables or charts when appropriate
For the continuous variables, the GVIF equals the usual VIF; for the categorical variables, we instead get one GVIF value per predictor (e.g., a single value covering all job categories, another covering all months, and so on).
To make GVIFs comparable across predictors with different degrees of freedom, we use GVIF^(1/(2*Df)). No GVIF^(1/(2*Df)) value is greater than 5, so we can say the remaining explanatory variables are not strongly collinear with one another.
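As a programmatic version of that rule (a sketch; m1.log is the full logistic model fitted in the listing below), the adjusted-GVIF column returned by car::vif() can be filtered against the threshold of 5:

gv <- car::vif(m1.log) # matrix with columns GVIF, Df, GVIF^(1/(2*Df))
gv[gv[, "GVIF^(1/(2*Df))"] > 5, , drop = FALSE] # an empty result means no predictor crosses the threshold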
Running a backward stepwise selection on this data set is important because it narrows the predictors down to those that contribute most to the model (by AIC). After the procedure, six variables remained: poutcome, campaign, cons.conf.idx, contact, month, and nr.employed. Consumer confidence index is the exception among these six final variables, as it is still not individually significant at the 0.05 level.
From this, we keep only these selected variables.
b. Results reported with respect to hypotheses/models.
For the logistic regression model: because the “yes” and “no” classes of y are highly imbalanced, we lowered the classification cutoff from 0.5 to 0.08 and were able to increase the model’s sensitivity. However, the model’s accuracy decreased. The new accuracy is 72.98%, with a sensitivity of 68.42% and a specificity of 73.61%. The logistic model offers a much better balance of these metrics than the other two models.
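The 0.08 cutoff was a judgment call; sweeping a few candidate cutoffs makes the trade-off visible. A sketch assuming the reduced model m1.log and test split dftest from the listing below:

predprob <- predict(m1.log, newdata = dftest, type = "response")
for (cut in c(0.5, 0.25, 0.1, 0.08, 0.05)) {
  pred <- ifelse(predprob >= cut, "yes", "no")
  cat(sprintf("cutoff %.2f: accuracy %.3f, sensitivity %.3f, specificity %.3f\n",
      cut,
      mean(pred == dftest$y), # overall accuracy
      mean(pred[dftest$y == "yes"] == "yes"), # true-positive rate
      mean(pred[dftest$y == "no"] == "no"))) # true-negative rate
}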
For the LDA model, the accuracy is 87.38%. Specificity is high at 95.02%, but sensitivity is low at 32.90%.
For the SVM model, the accuracy is 89.32%. Specificity is high at 99.08%, but sensitivity is extremely low at 19.73%.
Both the LDA and SVM models have high accuracy and specificity but low sensitivity, which makes them poor at predicting a “yes” response from a customer.
c. Factual information kept separate from interpretation, inference and evaluation.
There are four variables with p-values lower than 0.05: poutcome, contact, month, and nr.employed; that is, only four of the six final variables (66.7%) are individually significant. Looking at the facts from the full model’s confusion matrix, the accuracy is 73.00%, sensitivity is 72.97%, and specificity is 73.01%.
a. Discuss alternative methodologies
In conclusion, we recommend the logistic regression model discussed above. It offers the best balance of accuracy, sensitivity, and specificity and is the best of the three models at predicting which customers will subscribe to a term deposit. This will help tailor marketing campaigns to the customers most likely to subscribe.
The determining variables according to the model were campaign, consumer confidence index, contact, month, previous outcome, and number of employees. These are the variables used in the final model. In this model the null hypothesis was rejected, as several variables are significant.
Although this is the best model we were able to create, it could be improved with access to more data. There was a low number of “yes” responses for the dependent variable y. Because of the uneven number of “yes” and “no” responses, it was difficult to build a model that balanced overall accuracy, sensitivity, and specificity.
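One standard remedy we did not pursue (a hypothetical sketch, not part of the reported analysis) would be to re-fit the logistic regression with the minority class up-weighted instead of lowering the cutoff; whether it improves the balance of sensitivity and specificity would still need to be checked on the held-out test set:

w <- ifelse(dftrain$y == "yes", 7, 1) # hypothetical weight; ~7:1 is roughly the no:yes ratio here
m.weighted <- glm(y ~ ., data = dftrain, family = binomial, weights = w)
summary(m.weighted)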
# Change pdays into a categorical variable using ifelse()
df$contacted <- ifelse(df$pdays == 999, "No", "Yes")
df$contacted <- as.factor(df$contacted)
#Change character into factor variable
df$job <- as.factor(df$job)
df$marital <- as.factor(df$marital)
df$education <- as.factor(df$education)
df$default <- as.factor(df$default)
df$housing <- as.factor(df$housing)
df$loan <- as.factor(df$loan)
df$contact <- as.factor(df$contact)
df$month <- as.factor(df$month)
df$day_of_week <- as.factor(df$day_of_week)
df$poutcome <- as.factor(df$poutcome)
df$y <- as.factor(df$y)
str(df)
## spc_tbl_ [4,119 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:4119] 30 39 25 38 47 32 32 41 31 35 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 2 3 3 2 1 2 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 3 4 4 3 7 7 7 7 6 3 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 2 1 2 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 3 1 3 2 3 1 3 3 1 1 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 1 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 1 2 2 2 1 1 1 1 1 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 5 8 10 10 8 8 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
## $ duration : num [1:4119] 487 346 227 17 58 128 290 44 68 170 ...
## $ campaign : num [1:4119] 2 4 1 3 1 3 4 2 1 1 ...
## $ pdays : num [1:4119] 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : num [1:4119] 0 0 0 0 0 2 0 0 1 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ emp.var.rate : num [1:4119] -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
## $ cons.price.idx: num [1:4119] 92.9 94 94.5 94.5 93.2 ...
## $ cons.conf.idx : num [1:4119] -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
## $ euribor3m : num [1:4119] 1.31 4.86 4.96 4.96 4.19 ...
## $ nr.employed : num [1:4119] 5099 5191 5228 5228 5196 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ contacted : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. job = col_character(),
## .. marital = col_character(),
## .. education = col_character(),
## .. default = col_character(),
## .. housing = col_character(),
## .. loan = col_character(),
## .. contact = col_character(),
## .. month = col_character(),
## .. day_of_week = col_character(),
## .. duration = col_double(),
## .. campaign = col_double(),
## .. pdays = col_double(),
## .. previous = col_double(),
## .. poutcome = col_character(),
## .. emp.var.rate = col_double(),
## .. cons.price.idx = col_double(),
## .. cons.conf.idx = col_double(),
## .. euribor3m = col_double(),
## .. nr.employed = col_double(),
## .. y = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
### Remove unknown observations from variables
df <- df %>% filter(job != "unknown")
df <- df %>% filter(default != "unknown")
df <- df %>% filter(default != "1")
df <- df %>% filter(housing != "unknown")
df <- df %>% filter(marital != "unknown")
df <- df %>% filter(education != "unknown")
df <- df %>% filter(education != "illiterate")
### Check if the dependent variable is categorical
levels(df$y)
## [1] "no" "yes"
#Create correlation plot (must have numerical values)
df_num = select_if(df, is.numeric)
M = cor(df_num)
corrplot(M, method = "number")
#remove duration (perfect separator), take out high multicollinearity
df = dplyr::select(df, -c(duration,loan,pdays,emp.var.rate, euribor3m))
# Correlation plot after removing the correlated variables
df_num = select_if(df, is.numeric)
M = cor(df_num)
corrplot(M, method = "number")
#Create a training set
set.seed(1)
tr_ind = sample(nrow(df), 0.8*nrow(df), replace = F)
dftrain = df[tr_ind,]
dftest = df[-tr_ind,]
#Check for distribution
ggplot(data = df, aes(x = age)) +
geom_histogram(binwidth = 5, col = "white")
#Fit a logistic regression model
str(dftrain)
## tibble [2,471 × 17] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:2471] 42 25 30 27 30 29 30 33 48 29 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 1 1 8 5 1 2 10 1 5 1 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 3 3 2 2 3 2 3 3 2 2 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 7 7 4 7 7 4 7 4 7 7 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 3 1 3 1 3 3 1 3 3 3 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 1 2 1 1 2 1 1 1 1 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 1 4 4 2 4 8 1 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 5 5 1 2 3 5 2 4 4 2 ...
## $ campaign : num [1:2471] 1 1 2 1 8 1 1 5 2 1 ...
## $ previous : num [1:2471] 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ cons.price.idx: num [1:2471] 94 92.9 94 93.1 93.9 ...
## $ cons.conf.idx : num [1:2471] -36.4 -46.2 -36.4 -47.1 -42.7 -42.7 -36.1 -42.7 -30.1 -47.1 ...
## $ nr.employed : num [1:2471] 5191 5099 5191 5099 5228 ...
## $ y : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 2 1 ...
## $ contacted : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
m1.log = glm(y ~ ., data = dftrain, family = binomial)
summary(m1.log)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = dftrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1021 -0.4034 -0.3204 -0.2390 2.8558
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 35.048711 15.771993 2.222 0.026269 *
## age 0.012253 0.008672 1.413 0.157670
## jobblue-collar -0.187077 0.296801 -0.630 0.528490
## jobentrepreneur -0.599626 0.510054 -1.176 0.239749
## jobhousemaid -0.288891 0.554474 -0.521 0.602354
## jobmanagement -0.558815 0.311115 -1.796 0.072468 .
## jobretired -0.321940 0.389087 -0.827 0.407996
## jobself-employed -0.201862 0.386050 -0.523 0.601051
## jobservices -0.267175 0.322732 -0.828 0.407753
## jobstudent 0.348582 0.446224 0.781 0.434695
## jobtechnician 0.038904 0.239347 0.163 0.870877
## jobunemployed 0.199343 0.395327 0.504 0.614086
## maritalmarried 0.012259 0.248807 0.049 0.960702
## maritalsingle -0.041936 0.281377 -0.149 0.881523
## educationbasic.6y 0.084206 0.459416 0.183 0.854570
## educationbasic.9y 0.102711 0.352699 0.291 0.770887
## educationhigh.school -0.023885 0.345156 -0.069 0.944830
## educationprofessional.course -0.137456 0.371023 -0.370 0.711025
## educationuniversity.degree 0.097551 0.343205 0.284 0.776230
## defaultyes -9.133425 324.744152 -0.028 0.977562
## housingyes 0.148173 0.148013 1.001 0.316786
## contacttelephone -1.059622 0.275905 -3.841 0.000123 ***
## monthaug -0.590349 0.382540 -1.543 0.122774
## monthdec -0.560688 0.724744 -0.774 0.439146
## monthjul -0.362948 0.373758 -0.971 0.331509
## monthjun 0.570678 0.339035 1.683 0.092328 .
## monthmar 1.280600 0.464286 2.758 0.005812 **
## monthmay -0.551132 0.283048 -1.947 0.051518 .
## monthnov -0.652112 0.345354 -1.888 0.058994 .
## monthoct -0.242889 0.446462 -0.544 0.586420
## monthsep -0.453471 0.494707 -0.917 0.359329
## day_of_weekmon -0.029052 0.231774 -0.125 0.900249
## day_of_weekthu 0.211680 0.230968 0.916 0.359410
## day_of_weektue -0.068243 0.240210 -0.284 0.776336
## day_of_weekwed 0.133536 0.240247 0.556 0.578329
## campaign -0.069614 0.042928 -1.622 0.104882
## previous 0.072429 0.190560 0.380 0.703881
## poutcomenonexistent 0.620169 0.322976 1.920 0.054837 .
## poutcomesuccess 1.437520 0.669060 2.149 0.031669 *
## cons.price.idx 0.162356 0.158581 1.024 0.305927
## cons.conf.idx 0.036418 0.019038 1.913 0.055758 .
## nr.employed -0.009982 0.001192 -8.372 < 2e-16 ***
## contactedYes 0.521563 0.667831 0.781 0.434814
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1803.3 on 2470 degrees of freedom
## Residual deviance: 1374.0 on 2428 degrees of freedom
## AIC: 1460
##
## Number of Fisher Scoring iterations: 11
vif(m1.log)
## GVIF Df GVIF^(1/(2*Df))
## age 2.109618 1 1.452452
## job 7.025332 10 1.102385
## marital 1.471561 2 1.101399
## education 3.560508 5 1.135406
## default 1.000003 1 1.000001
## housing 1.045273 1 1.022386
## contact 2.106332 1 1.451321
## month 6.417824 9 1.108804
## day_of_week 1.146943 4 1.017285
## campaign 1.053389 1 1.026347
## previous 4.197651 1 2.048817
## poutcome 19.914791 2 2.112486
## cons.price.idx 2.137303 1 1.461952
## cons.conf.idx 2.517507 1 1.586665
## nr.employed 2.096932 1 1.448079
## contacted 8.738895 1 2.956162
### Model selection/variable selection
m2.log = step(m1.log, direction = "backward")
## Start: AIC=1459.97
## y ~ age + job + marital + education + default + housing + contact +
## month + day_of_week + campaign + previous + poutcome + cons.price.idx +
## cons.conf.idx + nr.employed + contacted
##
## Df Deviance AIC
## - job 10 1381.3 1447.3
## - education 5 1375.1 1451.1
## - day_of_week 4 1376.1 1454.1
## - marital 2 1374.1 1456.1
## - default 1 1374.0 1458.0
## - previous 1 1374.1 1458.1
## - contacted 1 1374.6 1458.6
## - housing 1 1375.0 1459.0
## - cons.price.idx 1 1375.0 1459.0
## - age 1 1376.0 1460.0
## <none> 1374.0 1460.0
## - campaign 1 1377.1 1461.1
## - cons.conf.idx 1 1377.7 1461.7
## - poutcome 2 1381.1 1463.1
## - contact 1 1390.7 1474.7
## - month 9 1413.6 1481.6
## - nr.employed 1 1439.5 1523.5
##
## Step: AIC=1447.28
## y ~ age + marital + education + default + housing + contact +
## month + day_of_week + campaign + previous + poutcome + cons.price.idx +
## cons.conf.idx + nr.employed + contacted
##
## Df Deviance AIC
## - education 5 1381.7 1437.7
## - day_of_week 4 1383.2 1441.2
## - marital 2 1381.3 1443.3
## - previous 1 1381.3 1445.3
## - default 1 1381.3 1445.3
## - contacted 1 1382.1 1446.1
## - housing 1 1382.1 1446.1
## - cons.price.idx 1 1382.4 1446.4
## - age 1 1382.4 1446.4
## <none> 1381.3 1447.3
## - campaign 1 1384.4 1448.4
## - cons.conf.idx 1 1385.0 1449.0
## - poutcome 2 1387.8 1449.8
## - contact 1 1397.1 1461.1
## - month 9 1423.5 1471.5
## - nr.employed 1 1452.8 1516.8
##
## Step: AIC=1437.74
## y ~ age + marital + default + housing + contact + month + day_of_week +
## campaign + previous + poutcome + cons.price.idx + cons.conf.idx +
## nr.employed + contacted
##
## Df Deviance AIC
## - day_of_week 4 1383.8 1431.8
## - marital 2 1381.8 1433.8
## - default 1 1381.8 1435.8
## - previous 1 1381.8 1435.8
## - contacted 1 1382.5 1436.5
## - housing 1 1382.7 1436.7
## - cons.price.idx 1 1382.8 1436.8
## - age 1 1382.8 1436.8
## <none> 1381.7 1437.7
## - campaign 1 1384.9 1438.9
## - cons.conf.idx 1 1385.5 1439.5
## - poutcome 2 1388.5 1440.5
## - contact 1 1397.8 1451.8
## - month 9 1424.4 1462.4
## - nr.employed 1 1453.3 1507.3
##
## Step: AIC=1431.8
## y ~ age + marital + default + housing + contact + month + campaign +
## previous + poutcome + cons.price.idx + cons.conf.idx + nr.employed +
## contacted
##
## Df Deviance AIC
## - marital 2 1383.8 1427.8
## - default 1 1383.9 1429.9
## - previous 1 1383.9 1429.9
## - contacted 1 1384.5 1430.5
## - age 1 1384.8 1430.8
## - housing 1 1384.8 1430.8
## - cons.price.idx 1 1384.9 1430.9
## <none> 1383.8 1431.8
## - campaign 1 1387.0 1433.0
## - cons.conf.idx 1 1387.8 1433.8
## - poutcome 2 1390.7 1434.7
## - contact 1 1400.5 1446.5
## - month 9 1425.6 1455.6
## - nr.employed 1 1454.5 1500.5
##
## Step: AIC=1427.83
## y ~ age + default + housing + contact + month + campaign + previous +
## poutcome + cons.price.idx + cons.conf.idx + nr.employed +
## contacted
##
## Df Deviance AIC
## - default 1 1383.9 1425.9
## - previous 1 1383.9 1425.9
## - contacted 1 1384.5 1426.5
## - housing 1 1384.8 1426.8
## - cons.price.idx 1 1384.9 1426.9
## - age 1 1385.0 1427.0
## <none> 1383.8 1427.8
## - campaign 1 1387.0 1429.0
## - cons.conf.idx 1 1387.8 1429.8
## - poutcome 2 1390.8 1430.8
## - contact 1 1400.5 1442.5
## - month 9 1426.0 1452.0
## - nr.employed 1 1455.7 1497.7
##
## Step: AIC=1425.89
## y ~ age + housing + contact + month + campaign + previous + poutcome +
## cons.price.idx + cons.conf.idx + nr.employed + contacted
##
## Df Deviance AIC
## - previous 1 1384.0 1424.0
## - contacted 1 1384.6 1424.6
## - housing 1 1384.9 1424.9
## - cons.price.idx 1 1385.0 1425.0
## - age 1 1385.0 1425.0
## <none> 1383.9 1425.9
## - campaign 1 1387.1 1427.1
## - cons.conf.idx 1 1387.9 1427.9
## - poutcome 2 1390.8 1428.8
## - contact 1 1400.5 1440.5
## - month 9 1426.1 1450.1
## - nr.employed 1 1455.9 1495.9
##
## Step: AIC=1423.95
## y ~ age + housing + contact + month + campaign + poutcome + cons.price.idx +
## cons.conf.idx + nr.employed + contacted
##
## Df Deviance AIC
## - contacted 1 1384.9 1422.9
## - housing 1 1385.0 1423.0
## - age 1 1385.1 1423.1
## - cons.price.idx 1 1385.2 1423.2
## <none> 1384.0 1424.0
## - campaign 1 1387.2 1425.2
## - cons.conf.idx 1 1388.0 1426.0
## - poutcome 2 1394.6 1430.6
## - contact 1 1400.9 1438.9
## - month 9 1426.1 1448.1
## - nr.employed 1 1458.0 1496.0
##
## Step: AIC=1422.88
## y ~ age + housing + contact + month + campaign + poutcome + cons.price.idx +
## cons.conf.idx + nr.employed
##
## Df Deviance AIC
## - housing 1 1385.9 1421.9
## - age 1 1386.0 1422.0
## - cons.price.idx 1 1386.3 1422.3
## <none> 1384.9 1422.9
## - campaign 1 1388.1 1424.1
## - cons.conf.idx 1 1389.0 1425.0
## - contact 1 1402.1 1438.1
## - month 9 1427.9 1447.9
## - poutcome 2 1430.2 1464.2
## - nr.employed 1 1459.7 1495.7
##
## Step: AIC=1421.89
## y ~ age + contact + month + campaign + poutcome + cons.price.idx +
## cons.conf.idx + nr.employed
##
## Df Deviance AIC
## - age 1 1387.0 1421.0
## - cons.price.idx 1 1387.2 1421.2
## <none> 1385.9 1421.9
## - campaign 1 1389.1 1423.1
## - cons.conf.idx 1 1389.8 1423.8
## - contact 1 1403.0 1437.0
## - month 9 1429.0 1447.0
## - poutcome 2 1432.2 1464.2
## - nr.employed 1 1460.5 1494.5
##
## Step: AIC=1420.99
## y ~ contact + month + campaign + poutcome + cons.price.idx +
## cons.conf.idx + nr.employed
##
## Df Deviance AIC
## - cons.price.idx 1 1388.3 1420.3
## <none> 1387.0 1421.0
## - campaign 1 1390.2 1422.2
## - cons.conf.idx 1 1391.2 1423.2
## - contact 1 1403.9 1435.9
## - month 9 1430.5 1446.5
## - poutcome 2 1433.6 1463.6
## - nr.employed 1 1462.7 1494.7
##
## Step: AIC=1420.28
## y ~ contact + month + campaign + poutcome + cons.conf.idx + nr.employed
##
## Df Deviance AIC
## <none> 1388.3 1420.3
## - campaign 1 1391.2 1421.2
## - cons.conf.idx 1 1391.2 1421.2
## - contact 1 1405.2 1435.2
## - month 9 1433.0 1447.0
## - poutcome 2 1436.4 1464.4
## - nr.employed 1 1466.0 1496.0
summary(m2.log)
##
## Call:
## glm(formula = y ~ contact + month + campaign + poutcome + cons.conf.idx +
## nr.employed, family = binomial, data = dftrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0815 -0.4012 -0.3324 -0.2349 2.8027
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 52.006402 5.671505 9.170 < 2e-16 ***
## contacttelephone -0.883998 0.223193 -3.961 7.47e-05 ***
## monthaug -0.378873 0.353554 -1.072 0.28389
## monthdec -0.443824 0.700902 -0.633 0.52659
## monthjul -0.166063 0.329153 -0.505 0.61390
## monthjun 0.645854 0.320473 2.015 0.04387 *
## monthmar 1.350693 0.445412 3.032 0.00243 **
## monthmay -0.591406 0.272404 -2.171 0.02993 *
## monthnov -0.564282 0.325688 -1.733 0.08317 .
## monthoct -0.165164 0.429591 -0.384 0.70063
## monthsep -0.276345 0.468964 -0.589 0.55568
## campaign -0.067370 0.042687 -1.578 0.11451
## poutcomenonexistent 0.464402 0.209077 2.221 0.02634 *
## poutcomesuccess 1.937029 0.291999 6.634 3.27e-11 ***
## cons.conf.idx 0.027423 0.016023 1.711 0.08699 .
## nr.employed -0.010294 0.001123 -9.165 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1803.3 on 2470 degrees of freedom
## Residual deviance: 1388.3 on 2455 degrees of freedom
## AIC: 1420.3
##
## Number of Fisher Scoring iterations: 6
df = dplyr::select(df, -c(education, default, housing, day_of_week, previous,
cons.price.idx, contacted, job, marital))
### training testing split after backward regression
set.seed(1)
tr_ind = sample(nrow(df), 0.8*nrow(df), replace = F)
dftrain = df[tr_ind,]
dftest = df[-tr_ind,]
### Build logistic regression model after backward regression
m1.log = glm(y ~ ., data = dftrain, family = binomial)
### Full regression model
m1.log_full=glm(y~., data = df, family = binomial)
summary(m1.log_full)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0236 -0.3908 -0.3372 -0.2361 2.7738
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 52.0327183 5.0389352 10.326 < 2e-16 ***
## age 0.0059845 0.0054162 1.105 0.269191
## contacttelephone -0.8912843 0.2004868 -4.446 8.76e-06 ***
## monthaug -0.3012133 0.3174764 -0.949 0.342736
## monthdec 0.1953431 0.5818323 0.336 0.737069
## monthjul -0.0167667 0.2932955 -0.057 0.954412
## monthjun 0.6610772 0.2904340 2.276 0.022836 *
## monthmar 1.3468500 0.3986763 3.378 0.000729 ***
## monthmay -0.5837322 0.2483575 -2.350 0.018755 *
## monthnov -0.4502407 0.2916878 -1.544 0.122692
## monthoct -0.1972481 0.3924872 -0.503 0.615274
## monthsep -0.6258314 0.4053230 -1.544 0.122581
## campaign -0.0675373 0.0384434 -1.757 0.078952 .
## poutcomenonexistent 0.3949287 0.1874960 2.106 0.035176 *
## poutcomesuccess 1.8036286 0.2603692 6.927 4.29e-12 ***
## cons.conf.idx 0.0314526 0.0146378 2.149 0.031656 *
## nr.employed -0.0103070 0.0009932 -10.377 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2264.2 on 3088 degrees of freedom
## Residual deviance: 1752.2 on 3072 degrees of freedom
## AIC: 1786.2
##
## Number of Fisher Scoring iterations: 6
### Make predictions for log reg using training data
predprob_train = predict.glm(m1.log, newdata = dftrain, type = "response")
predclass_log_train = ifelse(predprob_train >= 0.08, "yes", "no")
predclass_log_train <- as.factor(predclass_log_train)
caret::confusionMatrix(as.factor(predclass_log_train), as.factor(dftrain$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1586 76
## yes 591 218
##
## Accuracy : 0.7301
## 95% CI : (0.7121, 0.7475)
## No Information Rate : 0.881
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2674
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.74150
## Specificity : 0.72853
## Pos Pred Value : 0.26947
## Neg Pred Value : 0.95427
## Prevalence : 0.11898
## Detection Rate : 0.08822
## Detection Prevalence : 0.32740
## Balanced Accuracy : 0.73501
##
## 'Positive' Class : yes
##
### Make predictions for log reg using testing data
predprob = predict.glm(m1.log, newdata = dftest, type = "response")
predclass_log = ifelse(predprob >= 0.08, "yes", "no")
predclass_log <- as.factor(predclass_log)
caret::confusionMatrix(as.factor(predclass_log), as.factor(dftest$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 399 24
## yes 143 52
##
## Accuracy : 0.7298
## 95% CI : (0.6929, 0.7644)
## No Information Rate : 0.877
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2513
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.68421
## Specificity : 0.73616
## Pos Pred Value : 0.26667
## Neg Pred Value : 0.94326
## Prevalence : 0.12298
## Detection Rate : 0.08414
## Detection Prevalence : 0.31553
## Balanced Accuracy : 0.71019
##
## 'Positive' Class : yes
##
### Make predictions for log reg using full model
full_model = predict.glm(m1.log, newdata = df, type = "response")
predclass_log_full = ifelse(full_model >= 0.08, "yes", "no")
predclass_log_full <- as.factor(predclass_log_full)
caret::confusionMatrix(as.factor(predclass_log_full), as.factor(df$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 1985 100
## yes 734 270
##
## Accuracy : 0.73
## 95% CI : (0.714, 0.7456)
## No Information Rate : 0.8802
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2642
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.72973
## Specificity : 0.73005
## Pos Pred Value : 0.26892
## Neg Pred Value : 0.95204
## Prevalence : 0.11978
## Detection Rate : 0.08741
## Detection Prevalence : 0.32502
## Balanced Accuracy : 0.72989
##
## 'Positive' Class : yes
##
### Build LDA model as comparison
m1.lda = lda(y ~ ., data = dftrain)
### Make predictions for lda
predclass_lda = predict(m1.lda, newdata = dftest)
caret::confusionMatrix(as.factor(predclass_lda$class), as.factor(dftest$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 515 51
## yes 27 25
##
## Accuracy : 0.8738
## 95% CI : (0.845, 0.8989)
## No Information Rate : 0.877
## P-Value [Acc > NIR] : 0.625554
##
## Kappa : 0.323
##
## Mcnemar's Test P-Value : 0.009208
##
## Sensitivity : 0.32895
## Specificity : 0.95018
## Pos Pred Value : 0.48077
## Neg Pred Value : 0.90989
## Prevalence : 0.12298
## Detection Rate : 0.04045
## Detection Prevalence : 0.08414
## Balanced Accuracy : 0.63957
##
## 'Positive' Class : yes
##
#Build the SVM model:
form1 = y ~ .
tuned = tune.svm(form1, data = dftrain, gamma = seq(0.1, .1, by = 0.01), cost = seq(.1, 1, by = .1)) # note: seq(0.1, .1, by = 0.01) is the single value 0.1, so only cost is effectively tuned here
mysvm = svm(formula = form1, data = dftrain, gamma = tuned$best.parameters$gamma, cost = tuned$best.parameters$cost) # this line might take 2-3 minutes to run
summary(mysvm) # display the SVM model fitted on the training data
##
## Call:
## svm(formula = form1, data = dftrain, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.6
##
## Number of Support Vectors: 701
##
## ( 287 414 )
##
##
## Number of Classes: 2
##
## Levels:
## no yes
svmpredict = predict(mysvm, dftest, type = 'response') # classify the test data; the confusion matrix follows
caret::confusionMatrix(as.factor(svmpredict), as.factor(dftest$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 537 61
## yes 5 15
##
## Accuracy : 0.8932
## 95% CI : (0.8661, 0.9164)
## No Information Rate : 0.877
## P-Value [Acc > NIR] : 0.121
##
## Kappa : 0.2754
##
## Mcnemar's Test P-Value : 1.288e-11
##
## Sensitivity : 0.19737
## Specificity : 0.99077
## Pos Pred Value : 0.75000
## Neg Pred Value : 0.89799
## Prevalence : 0.12298
## Detection Rate : 0.02427
## Detection Prevalence : 0.03236
## Balanced Accuracy : 0.59407
##
## 'Positive' Class : yes
##