Bank Marketing Case

I. Executive Summary

This data set comes from direct marketing campaigns conducted by a Portuguese banking institution. The models below show the relationships between the different variables tracked in the campaign. In this case study we look to improve the efficiency of the marketing campaign by identifying the main factors that may affect its success.

II. The Problem

The task we were assigned for this case was to build a regression model to predict whether clients would subscribe to a term deposit, based on the various input variables. Accurately predicting customers’ decisions on whether to make a deposit will allow us to find the most prominent predictors and give the company a better understanding of whom to market to. Because the outcome is a yes-or-no response, this task can be solved with a logistic regression rather than a linear one; we determine whether it is an accurate predictor by checking the F-score.

A preliminary question to answer is: what is a term deposit? A term deposit is a cash investment held by a financial institution. These investments carry a short-term maturity, which the investor must recognize before buying, as they will only have access to the funds once the term has expired.

IV. Methodology

This section describes the variables in the case and their types (numerical or categorical), whether we used the full data set or a sample of it, and the assumptions and limitations of the model that we chose to use.

Variables related to bank client data:

Age: Client’s age (numerical).
Job: Client’s type of job (categorical).
Marital: Client’s marital status; divorced means divorced or widowed (categorical).
Education: Client’s education (categorical).
Default: Client has previously defaulted (categorical).
Housing: Client has a housing loan (categorical).
Loan: Client has a personal loan (categorical).

Variables related to last contact of the current marketing campaign:

Contact: Contact communication type (telephone or cellular).
Month: Last contact month of year.
Day_of_week: Last contact day of week.
Duration: Last contact duration in seconds. If duration is 0s, then we never contacted a client to sign up for a term deposit account.
Pdays: Number of days that passed after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted).
Previous: Number of contacts performed before this campaign and for this client (numeric).
Poutcome: Outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’).

Social and economic context attributes:

Emp.var.rate: Employment variation rate, quarterly indicator (numeric).
Cons.price.idx: Consumer price index, monthly indicator (numeric).
Cons.conf.idx: Consumer confidence index, monthly indicator (numeric).
Euribor3m: Euribor 3 month rate, daily indicator (numeric).
Nr.employed: Number of employees, quarterly indicator (numeric).

Output variable:

y: Has the client subscribed to a term deposit? (binary: ‘yes’, ‘no’)

Once we defined the variables in the data set, we changed all of our character variables to factor variables. We then created plots of the significant variables in our data, which let us see which groups of people could be beneficial to market to. In each plot we used y as the fill of the bar chart or histogram. For the categorical variables we used bar charts. Filling by y shows the relationship between each variable and whether clients chose to make a term deposit. The data had limitations: some variables, such as nr.employed and emp.var.rate, were highly correlated with other numeric variables, so we removed them from our main data set.

Then we ran a logistic regression and found the GVIF of each variable. The rule we used was that a GVIF over 5 indicates multicollinearity. Once we had narrowed the data down to the final variables, we saw that a classification cutoff needed to be chosen. We first tried .5 as the cutoff; the sensitivity was .48, which told us that the model was not good at predicting the “yes” responses. We then changed the cutoff to .1, which raised the sensitivity to .76 and gave us a better model. We used the full data set in the creation of our models.
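The effect of the cutoff on sensitivity and specificity can be sketched on a small example. The probabilities and labels below are hypothetical (they are not taken from the bank data), and `rates_at` is an illustrative helper rather than part of the original analysis.

```r
# Hypothetical predicted probabilities and true labels (not from the bank data)
prob  <- c(0.05, 0.08, 0.12, 0.30, 0.45, 0.60, 0.07, 0.15, 0.55, 0.02)
truth <- c("no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", "no")

# Sensitivity and specificity of the rule "predict yes when prob >= cutoff"
rates_at <- function(cutoff) {
  pred <- ifelse(prob >= cutoff, "yes", "no")
  c(sensitivity = mean(pred[truth == "yes"] == "yes"),
    specificity = mean(pred[truth == "no"] == "no"))
}

rates_at(0.5)  # sensitivity 0.4, specificity 1.0
rates_at(0.1)  # sensitivity 1.0, specificity 0.8
```

Lowering the cutoff catches more of the actual “yes” cases at the cost of more false alarms, which is the same trade-off seen when moving from .5 to .1 in this case.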


V. Data Analysis

Loading in library

library(titanic)
library(caret)
library(lattice)
library(ggplot2)
library(gam)
library(car)
library(ROCR)
library(ggmosaic)
library(gmodels)
library(prettydoc)
library(tinytex)
library(corrplot)
library(MASS)
library(tidyverse)
b1 = read.csv("/Users/thomasfarrell/Downloads/bank-additional.csv", sep = ";")
source('/Users/thomasfarrell/Downloads/optim_threshold.R')

(Peng, Lee, and Ingersoll 2002; Anderson, Jin, and Grunkemeier 2003)

Reading in data

head(b1)

##   age         job marital         education default housing    loan   contact
## 1  30 blue-collar married          basic.9y      no     yes      no  cellular
## 2  39    services  single       high.school      no      no      no telephone
## 3  25    services married       high.school      no     yes      no telephone
## 4  38    services married          basic.9y      no unknown unknown telephone
## 5  47      admin. married university.degree      no     yes      no  cellular
## 6  32    services  single university.degree      no      no      no  cellular
##   month day_of_week duration campaign pdays previous    poutcome emp.var.rate
## 1   may         fri      487        2   999        0 nonexistent         -1.8
## 2   may         fri      346        4   999        0 nonexistent          1.1
## 3   jun         wed      227        1   999        0 nonexistent          1.4
## 4   jun         fri       17        3   999        0 nonexistent          1.4
## 5   nov         mon       58        1   999        0 nonexistent         -0.1
## 6   sep         thu      128        3   999        2     failure         -1.1
##   cons.price.idx cons.conf.idx euribor3m nr.employed  y
## 1         92.893         -46.2     1.313      5099.1 no
## 2         93.994         -36.4     4.855      5191.0 no
## 3         94.465         -41.8     4.962      5228.1 no
## 4         94.465         -41.8     4.959      5228.1 no
## 5         93.200         -42.0     4.191      5195.8 no
## 6         94.199         -37.5     0.884      4963.6 no

Changing character variables to factors

b1$job = as.factor(b1$job)
b1$marital = as.factor(b1$marital)
b1$education = as.factor(b1$education)
b1$default = as.factor(b1$default)
b1$housing = as.factor(b1$housing)
b1$loan = as.factor(b1$loan)
b1$contact = as.factor(b1$contact)
b1$month = as.factor(b1$month)
b1$day_of_week = as.factor(b1$day_of_week)
b1$poutcome = as.factor(b1$poutcome)
b1$y = as.factor(b1$y)

str(b1)

## 'data.frame':    4119 obs. of  21 variables:
##  $ age           : int  30 39 25 38 47 32 32 41 31 35 ...
##  $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
##  $ marital       : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 2 3 3 2 1 2 ...
##  $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 3 4 4 3 7 7 7 7 6 3 ...
##  $ default       : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 2 1 2 ...
##  $ housing       : Factor w/ 3 levels "no","unknown",..: 3 1 3 2 3 1 3 3 1 1 ...
##  $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 1 1 1 1 1 1 ...
##  $ contact       : Factor w/ 2 levels "cellular","telephone": 1 2 2 2 1 1 1 1 1 2 ...
##  $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 5 8 10 10 8 8 7 ...
##  $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
##  $ duration      : int  487 346 227 17 58 128 290 44 68 170 ...
##  $ campaign      : int  2 4 1 3 1 3 4 2 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 2 0 0 1 0 ...
##  $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ emp.var.rate  : num  -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
##  $ cons.price.idx: num  92.9 94 94.5 94.5 93.2 ...
##  $ cons.conf.idx : num  -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
##  $ euribor3m     : num  1.31 4.86 4.96 4.96 4.19 ...
##  $ nr.employed   : num  5099 5191 5228 5228 5196 ...
##  $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

removed nr.employed

We created a corrplot of all the numeric variables to see their correlations. We then removed nr.employed because it had a correlation of .9 and showed multicollinearity. We also removed the variable emp.var.rate because there was a threat of multicollinearity there as well.

b1_num = dplyr::select_if(b1, is.numeric)
M = cor(b1_num)
corrplot(M, method = "number")

b1 = dplyr::select(b1, - nr.employed)
b1 = dplyr::select(b1, - emp.var.rate)
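To list highly correlated pairs rather than read them off the plot, something along the following lines could be used. This is only a sketch: the data frame `toy` and the 0.9 threshold are assumptions for illustration, and in the case study the matrix M from cor(b1_num) would take the place of `toy_cor`.

```r
# Toy numeric data: x2 is almost a linear function of x1, x3 is unrelated
toy <- data.frame(
  x1 = 1:20,
  x2 = 2 * (1:20) + rep(c(0.5, -0.5), 10),
  x3 = rep(c(1, 5, 2, 8, 3), 4)
)
toy_cor <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above the threshold
threshold <- 0.9
idx <- which(abs(toy_cor) > threshold & upper.tri(toy_cor), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(toy_cor)[idx[, "row"]],
  var2 = colnames(toy_cor)[idx[, "col"]],
  r    = toy_cor[idx]
)
high_pairs
```

Only the x1 and x2 pair is flagged here; one variable from each flagged pair is then a candidate for removal.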

Checking summary of target variable

summary(b1$y)

##   no  yes 
## 3668  451
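The two counts above imply a strongly imbalanced target. As a quick check, the sketch below uses only the numbers printed by summary(b1$y); the variable names are illustrative.

```r
# Class counts copied from the summary(b1$y) output above
counts <- c(no = 3668, yes = 451)

# Share of each class in the full data set
shares <- counts / sum(counts)
round(shares, 4)

# Accuracy of a baseline that always predicts "no"
baseline_accuracy <- unname(shares["no"])
baseline_accuracy
```

Roughly 89% of clients said no, so any model's accuracy should be read against that baseline rather than against 50%.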

Removing any rows with N/A values from data

# Drop every row with a missing value in any remaining column.
# nr.employed and emp.var.rate were removed above, so they cannot be
# referenced here: subsetting on a missing column would return zero rows.
b2 = na.omit(b1)

colSums(is.na(b2))

##            age            job        marital      education        default 
##              0              0              0              0              0 
##        housing           loan        contact          month    day_of_week 
##              0              0              0              0              0 
##       duration       campaign          pdays       previous       poutcome 
##              0              0              0              0              0 
## cons.price.idx  cons.conf.idx      euribor3m              y 
##              0              0              0              0

Histogram of yes or no counts for age

ggplot(data = b1, mapping = aes(x = age, fill = y)) +
  geom_histogram(binwidth = 10, position = "dodge2")+
  facet_wrap(~ y , nrow = 1)

Here we compare histograms of the age variable for the ‘yes’ and ‘no’ counts. The histograms show that the younger population mainly responded to the marketing campaign. We can also see that the data is right-skewed.

Job variable yes and no count

ggplot(data = b1, mapping = aes(x = job, fill = y)) +
  geom_bar(position = "dodge2")+
  scale_fill_discrete(name = "Yes or no counts")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The job variable shows that many of the individuals who would make a bank deposit were in jobs with financial stability. These individuals are the most attractive to market to, as they have the most financial freedom.

Contact variable yes and no counts

ggplot(data = b1, mapping = aes(x = contact, fill = y)) +
  geom_bar(position = "dodge2")+
  scale_fill_discrete(name = "Yes or no counts")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The contact variable is highly significant and plays a large role in determining whether customers make a term deposit. From the chart we can see that customers contacted by telephone were more likely to decline the opportunity to make a deposit: 14.1% of cellular respondents responded yes, while only 5.2% of telephone respondents did.

Month variable yes and no count

ggplot(data = b1, mapping = aes(x = month, fill = y)) +
  geom_bar(position = "dodge2")+
  scale_fill_discrete(name = "Yes or no counts")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The month of May has the most observations in the data set, while the month of March has the largest percentage of “yes” observations.

Duration status yes and no count

ggplot(data = b1, mapping = aes(x = duration, fill = y)) +
  geom_histogram()+
  scale_fill_discrete(name = "Yes or no counts")+
  facet_wrap(~ y , nrow = 2)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Campaign Variable yes and no count

ggplot(data = b1, mapping = aes(x = campaign, fill = y)) +
  geom_histogram()+
  scale_fill_discrete(name = "Yes or no counts")+
  facet_wrap(~ y , nrow = 2)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The campaign variable is right-skewed: most clients were contacted only a few times. This showed us that the earlier contacts were more effective at marketing to consumers.

Previous yes and no counts

ggplot(data = b1, mapping = aes(x = previous, fill = y)) +
  geom_bar(position = "dodge2")+
  scale_fill_discrete(name = "Yes or no counts")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This chart shows the number of contacts made with each client before this campaign. The data is right-skewed: most clients had never been contacted before, so the previous campaign had reached very few of these customers.

Poutcome variable yes and no counts

ggplot(data = b1, mapping = aes(x = poutcome, fill = y)) +
  geom_bar(position = "dodge2")+
  scale_fill_discrete(name = "Yes or no counts")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Looking at this chart, you can see that many clients who were not part of the previous marketing campaign (poutcome “nonexistent”) still subscribed to a term deposit.

VI. Findings and results

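Our data analysis compared a logistic regression model with an LDA model. In the output below, the logistic regression at the .1 cutoff reaches 84.95% accuracy with 76.25% sensitivity, while the LDA model reaches 91.26% accuracy with 40.00% sensitivity.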

training testing split

set.seed(1)
tr_ind = sample(nrow(b1),.8*nrow(b1), replace = F)
b1train = b1[tr_ind,]
b1test = b1[-tr_ind,]
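The sizes produced by this split can be checked without the data. The sketch below assumes only the row count reported by str(b1) and makes the training size explicit with floor(); `test_ind` is an illustrative name.

```r
# Number of rows reported by str(b1)
n <- 4119

# Same 80/20 split, with the training size made explicit via floor()
set.seed(1)
n_train  <- floor(0.8 * n)
tr_ind   <- sample(n, n_train, replace = FALSE)
test_ind <- setdiff(seq_len(n), tr_ind)

n_train            # 3295
length(test_ind)   # 824
```

The 824 test rows match the total of the confusion matrices reported later in this section.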

Logistic regression model

a1 = glm(formula = y ~ age + job + marital + education + default + housing + contact + month + day_of_week + duration + campaign + pdays + previous + poutcome , data = b1 , family = binomial)


vif(a1)

##                  GVIF Df GVIF^(1/(2*Df))
## age          2.059400  1        1.435061
## job          5.856877 11        1.083662
## marital      1.562187  3        1.077181
## education    3.382069  7        1.090935
## default      1.146105  2        1.034680
## housing      1.083590  2        1.020273
## contact      1.636177  1        1.279131
## month        2.585692  9        1.054195
## day_of_week  1.120279  4        1.014298
## duration     1.203719  1        1.097141
## campaign     1.067845  1        1.033366
## pdays       10.552027  1        3.248388
## previous     4.142006  1        2.035192
## poutcome    23.674554  2        2.205822
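The third column of this table can be reproduced from the first two. A short sketch, with the values copied from the vif(a1) output for the three terms that exceed the GVIF rule of 5:

```r
# GVIF and Df values copied from the vif(a1) output above
gvif <- c(job = 5.856877, pdays = 10.552027, poutcome = 23.674554)
df   <- c(job = 11,       pdays = 1,         poutcome = 2)

# Adjusted value reported in the third column: GVIF^(1 / (2 * Df))
adjusted <- gvif^(1 / (2 * df))
round(adjusted, 6)

# Terms that exceed the GVIF > 5 rule used in this case
flagged <- names(gvif)[gvif > 5]
flagged
```

The adjustment scales each term by its degrees of freedom, which is why job, with 11 degrees of freedom, has a raw GVIF above 5 but an adjusted value close to 1.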

summary(a1)

## 
## Call:
## glm(formula = y ~ age + job + marital + education + default + 
##     housing + contact + month + day_of_week + duration + campaign + 
##     pdays + previous + poutcome, family = binomial, data = b1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.8779  -0.3383  -0.2235  -0.1297   2.9635  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -4.536e+00  9.658e-01  -4.696 2.65e-06 ***
## age                           1.838e-02  7.873e-03   2.334 0.019577 *  
## jobblue-collar               -2.524e-01  2.599e-01  -0.971 0.331618    
## jobentrepreneur              -8.253e-01  4.824e-01  -1.711 0.087120 .  
## jobhousemaid                  4.417e-01  4.079e-01   1.083 0.278829    
## jobmanagement                -3.533e-01  2.711e-01  -1.303 0.192456    
## jobretired                   -1.601e-03  3.343e-01  -0.005 0.996179    
## jobself-employed             -7.316e-01  4.029e-01  -1.816 0.069420 .  
## jobservices                   1.965e-01  2.697e-01   0.729 0.466216    
## jobstudent                    6.019e-01  3.872e-01   1.554 0.120084    
## jobtechnician                 1.336e-01  2.116e-01   0.631 0.527723    
## jobunemployed                 5.119e-01  3.659e-01   1.399 0.161864    
## jobunknown                   -5.531e-01  7.628e-01  -0.725 0.468451    
## maritalmarried                2.230e-01  2.306e-01   0.967 0.333557    
## maritalsingle                 3.957e-01  2.607e-01   1.518 0.129076    
## maritalunknown               -2.612e-02  1.119e+00  -0.023 0.981379    
## educationbasic.6y             2.570e-01  3.891e-01   0.660 0.508962    
## educationbasic.9y             1.118e-01  3.128e-01   0.358 0.720656    
## educationhigh.school          1.718e-01  2.946e-01   0.583 0.559852    
## educationilliterate          -9.849e+00  5.354e+02  -0.018 0.985323    
## educationprofessional.course  2.337e-01  3.215e-01   0.727 0.467144    
## educationuniversity.degree    4.588e-01  2.956e-01   1.552 0.120662    
## educationunknown              5.733e-01  3.700e-01   1.550 0.121210    
## defaultunknown               -1.931e-01  2.002e-01  -0.965 0.334618    
## defaultyes                   -1.011e+01  5.354e+02  -0.019 0.984937    
## housingunknown               -5.621e-01  5.106e-01  -1.101 0.270937    
## housingyes                   -7.053e-02  1.315e-01  -0.536 0.591781    
## contacttelephone             -1.462e+00  2.040e-01  -7.166 7.70e-13 ***
## monthaug                     -4.488e-01  2.823e-01  -1.590 0.111884    
## monthdec                      1.654e+00  5.987e-01   2.763 0.005723 ** 
## monthjul                     -9.034e-01  2.928e-01  -3.086 0.002032 ** 
## monthjun                      9.185e-01  3.051e-01   3.010 0.002610 ** 
## monthmar                      2.220e+00  4.192e-01   5.295 1.19e-07 ***
## monthmay                     -5.251e-01  2.694e-01  -1.949 0.051263 .  
## monthnov                     -9.076e-01  3.087e-01  -2.940 0.003279 ** 
## monthoct                      1.280e+00  3.900e-01   3.282 0.001029 ** 
## monthsep                      8.870e-01  4.159e-01   2.133 0.032931 *  
## day_of_weekmon                1.083e-01  2.048e-01   0.529 0.596867    
## day_of_weekthu                3.693e-02  2.049e-01   0.180 0.856945    
## day_of_weektue                2.546e-02  2.103e-01   0.121 0.903629    
## day_of_weekwed                2.274e-01  2.120e-01   1.073 0.283456    
## duration                      4.824e-03  2.391e-04  20.180  < 2e-16 ***
## campaign                     -1.359e-01  4.384e-02  -3.100 0.001938 ** 
## pdays                        -1.330e-04  6.747e-04  -0.197 0.843718    
## previous                      3.530e-01  1.743e-01   2.026 0.042806 *  
## poutcomenonexistent           2.420e-01  2.925e-01   0.827 0.407991    
## poutcomesuccess               2.291e+00  6.645e-01   3.447 0.000566 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2845.8  on 4118  degrees of freedom
## Residual deviance: 1749.7  on 4072  degrees of freedom
## AIC: 1843.7
## 
## Number of Fisher Scoring iterations: 12
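The coefficients above are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. A minimal sketch using three estimates copied from the summary(a1) output; the vector name `coefs` is illustrative.

```r
# Coefficients copied from the summary(a1) output above (log-odds scale)
coefs <- c(contacttelephone = -1.462,
           poutcomesuccess  =  2.291,
           duration         =  0.004824)

# Exponentiating gives odds ratios
odds_ratios <- exp(coefs)
round(odds_ratios, 3)

# Odds multiplier for one extra minute (60 seconds) of call duration
per_minute <- unname(exp(60 * coefs["duration"]))
per_minute
```

Holding the other variables fixed, telephone contact multiplies the odds of a “yes” by about 0.23 relative to cellular, and each extra minute of call duration multiplies them by about 1.34.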

predprob = predict.glm(a1, newdata = b1test, type = "response")
predclass_log = ifelse(predprob >= .1, "yes", "no")
caret::confusionMatrix(as.factor(predclass_log), as.factor(b1test$y), positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  639  19
##        yes 105  61
##                                           
##                Accuracy : 0.8495          
##                  95% CI : (0.8232, 0.8732)
##     No Information Rate : 0.9029          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4199          
##                                           
##  Mcnemar's Test P-Value : 2.29e-14        
##                                           
##             Sensitivity : 0.76250         
##             Specificity : 0.85887         
##          Pos Pred Value : 0.36747         
##          Neg Pred Value : 0.97112         
##              Prevalence : 0.09709         
##          Detection Rate : 0.07403         
##    Detection Prevalence : 0.20146         
##       Balanced Accuracy : 0.81069         
##                                           
##        'Positive' Class : yes             
##

This output shows how accurately our logistic regression model predicts yes and no responses. The sensitivity is the share of actual “yes” responses that the model predicts correctly, and the specificity is the share of actual “no” responses that it predicts correctly. This model has an accuracy of 84.95%. We used .1 as our cutoff point, which gave us the best balance of sensitivity (76.25%) and specificity (85.89%) and shows that our model was reasonably accurate at predicting both yes and no responses.
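The statistics in this output can be recomputed directly from the four cell counts, along with the F-score mentioned in the problem statement, which does not appear in the output above. A sketch, with the counts copied from the confusion matrix:

```r
# Counts copied from the logistic regression confusion matrix above
tp <- 61; fn <- 19; fp <- 105; tn <- 639

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of actual "yes" found
specificity <- tn / (tn + fp)   # share of actual "no" found
precision   <- tp / (tp + fp)   # reported as Pos Pred Value

# F1 is the harmonic mean of precision and sensitivity
f1 <- 2 * precision * sensitivity / (precision + sensitivity)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision, f1 = f1), 4)
```

The F1 value comes out just under 0.5, reflecting that the low cutoff buys sensitivity at the cost of precision.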

Running the Optim Threshold Function

optim_threshold(a1,b1, b1$y)

This confirmed that the optimal threshold should be .1 to receive the best results from our logistic regression.

Making our LDA Model

m1.lda = lda(formula = y ~ age + job + marital + education + default + housing + contact + month + day_of_week + duration + campaign + pdays + previous + poutcome, data = b1train)
predclass_lda = predict(m1.lda, newdata = b1test)
caret::confusionMatrix(as.factor(predclass_lda$class),as.factor(b1test$y), positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  720  48
##        yes  24  32
##                                          
##                Accuracy : 0.9126         
##                  95% CI : (0.8912, 0.931)
##     No Information Rate : 0.9029         
##     P-Value [Acc > NIR] : 0.189664       
##                                          
##                   Kappa : 0.4246         
##                                          
##  Mcnemar's Test P-Value : 0.006717       
##                                          
##             Sensitivity : 0.40000        
##             Specificity : 0.96774        
##          Pos Pred Value : 0.57143        
##          Neg Pred Value : 0.93750        
##              Prevalence : 0.09709        
##          Detection Rate : 0.03883        
##    Detection Prevalence : 0.06796        
##       Balanced Accuracy : 0.68387        
##                                          
##        'Positive' Class : yes            
##

This output shows how accurately our LDA model predicts yes and no responses. This model is 91.26% accurate. The LDA model shows a higher accuracy, but we were not able to implement an optimal cutoff for it. This accuracy can be misleading: it is barely above the no-information rate of 90.29%, and the sensitivity (40.00%) is well below that of the logistic regression model (76.25%), even though the specificity (96.77%) is higher.
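One possible way to apply a custom cutoff to an LDA model is to threshold the posterior probabilities that predict() returns alongside the classes. The sketch below is not the analysis from this case: it uses the built-in mtcars data, assumes the MASS package is installed, and the .1 cutoff only mirrors the one used for the logistic regression.

```r
library(MASS)

# Two-class example: transmission type (am) from two numeric predictors
fit  <- lda(am ~ mpg + wt, data = mtcars)
pred <- predict(fit, newdata = mtcars)

# Posterior probability of the second class ("1")
post_yes <- pred$posterior[, "1"]

# Classes under the default 0.5 rule and under a lower 0.1 cutoff
class_05 <- ifelse(post_yes >= 0.5, "1", "0")
class_01 <- ifelse(post_yes >= 0.1, "1", "0")

table(cutoff_0.5 = class_05, cutoff_0.1 = class_01)
```

For the bank model, the analogous column would be predclass_lda$posterior[, "yes"], since "yes" is the second level of y.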

Using the step function

The VIF output showed GVIF values over 5 for job, pdays, and poutcome, which by our rule indicates multicollinearity. We then used the step function to go backward through our model and take out variables that did not improve the AIC. The final model keeps the variables age, job, contact, month, duration, campaign, previous, and poutcome, and every GVIF in it is below 5.

m2.log = step(a1, direction = "backward")

## Start:  AIC=1843.67
## y ~ age + job + marital + education + default + housing + contact + 
##     month + day_of_week + duration + campaign + pdays + previous + 
##     poutcome
## 
##               Df Deviance    AIC
## - education    7   1754.9 1834.9
## - day_of_week  4   1751.2 1837.2
## - marital      3   1752.2 1840.2
## - job         11   1768.7 1840.7
## - default      2   1750.7 1840.7
## - housing      2   1751.1 1841.1
## - pdays        1   1749.7 1841.7
## <none>             1749.7 1843.7
## - previous     1   1754.1 1846.1
## - age          1   1755.1 1847.1
## - poutcome     2   1761.9 1851.9
## - campaign     1   1761.1 1853.1
## - contact      1   1805.9 1897.9
## - month        9   1896.7 1972.7
## - duration     1   2322.9 2414.9
## 
## Step:  AIC=1834.9
## y ~ age + job + marital + default + housing + contact + month + 
##     day_of_week + duration + campaign + pdays + previous + poutcome
## 
##               Df Deviance    AIC
## - day_of_week  4   1756.5 1828.5
## - marital      3   1757.9 1831.9
## - default      2   1756.1 1832.1
## - housing      2   1756.4 1832.4
## - job         11   1774.6 1832.6
## - pdays        1   1754.9 1832.9
## <none>             1754.9 1834.9
## - age          1   1760.0 1838.0
## - previous     1   1760.1 1838.1
## - poutcome     2   1767.9 1843.9
## - campaign     1   1766.1 1844.1
## - contact      1   1811.9 1889.9
## - month        9   1904.8 1966.8
## - duration     1   2326.2 2404.2
## 
## Step:  AIC=1828.45
## y ~ age + job + marital + default + housing + contact + month + 
##     duration + campaign + pdays + previous + poutcome
## 
##            Df Deviance    AIC
## - marital   3   1759.5 1825.5
## - default   2   1757.7 1825.7
## - job      11   1775.9 1825.9
## - housing   2   1758.0 1826.0
## - pdays     1   1756.5 1826.5
## <none>          1756.5 1828.5
## - age       1   1761.3 1831.3
## - previous  1   1761.7 1831.7
## - poutcome  2   1769.1 1837.1
## - campaign  1   1767.9 1837.9
## - contact   1   1813.0 1883.0
## - month     9   1905.8 1959.8
## - duration  1   2327.1 2397.1
## 
## Step:  AIC=1825.48
## y ~ age + job + default + housing + contact + month + duration + 
##     campaign + pdays + previous + poutcome
## 
##            Df Deviance    AIC
## - default   2   1760.7 1822.7
## - housing   2   1760.9 1822.9
## - pdays     1   1759.5 1823.5
## - job      11   1780.8 1824.8
## <none>          1759.5 1825.5
## - age       1   1762.4 1826.4
## - previous  1   1765.2 1829.2
## - poutcome  2   1772.5 1834.5
## - campaign  1   1770.9 1834.9
## - contact   1   1816.3 1880.3
## - month     9   1912.0 1960.0
## - duration  1   2330.6 2394.6
## 
## Step:  AIC=1822.71
## y ~ age + job + housing + contact + month + duration + campaign + 
##     pdays + previous + poutcome
## 
##            Df Deviance    AIC
## - housing   2   1762.2 1820.2
## - pdays     1   1760.7 1820.7
## - job      11   1782.6 1822.6
## <none>          1760.7 1822.7
## - age       1   1763.2 1823.2
## - previous  1   1766.5 1826.5
## - poutcome  2   1774.1 1832.1
## - campaign  1   1772.1 1832.1
## - contact   1   1819.5 1879.5
## - month     9   1916.5 1960.5
## - duration  1   2333.3 2393.3
## 
## Step:  AIC=1820.2
## y ~ age + job + contact + month + duration + campaign + pdays + 
##     previous + poutcome
## 
##            Df Deviance    AIC
## - pdays     1   1762.2 1818.2
## - job      11   1784.1 1820.1
## <none>          1762.2 1820.2
## - age       1   1764.8 1820.8
## - previous  1   1767.7 1823.7
## - campaign  1   1773.6 1829.6
## - poutcome  2   1775.8 1829.8
## - contact   1   1820.9 1876.9
## - month     9   1917.6 1957.6
## - duration  1   2334.8 2390.8
## 
## Step:  AIC=1818.21
## y ~ age + job + contact + month + duration + campaign + previous + 
##     poutcome
## 
##            Df Deviance    AIC
## <none>          1762.2 1818.2
## - job      11   1784.2 1818.2
## - age       1   1764.8 1818.8
## - previous  1   1768.9 1822.9
## - campaign  1   1773.6 1827.6
## - contact   1   1821.0 1875.0
## - poutcome  2   1849.6 1901.6
## - month     9   1918.5 1956.5
## - duration  1   2334.9 2388.9

summary(m2.log)

## 
## Call:
## glm(formula = y ~ age + job + contact + month + duration + campaign + 
##     previous + poutcome, family = binomial, data = b1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.9721  -0.3365  -0.2286  -0.1340   3.0660  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.7935609  0.4514940  -8.402  < 2e-16 ***
## age                  0.0110124  0.0068641   1.604 0.108634    
## jobblue-collar      -0.4931022  0.2094759  -2.354 0.018574 *  
## jobentrepreneur     -0.9113934  0.4772859  -1.910 0.056193 .  
## jobhousemaid         0.2608928  0.3912566   0.667 0.504895    
## jobmanagement       -0.3441070  0.2666340  -1.291 0.196856    
## jobretired          -0.1043960  0.3233497  -0.323 0.746803    
## jobself-employed    -0.7334403  0.3985768  -1.840 0.065747 .  
## jobservices         -0.0155009  0.2500631  -0.062 0.950573    
## jobstudent           0.5948306  0.3616674   1.645 0.100034    
## jobtechnician        0.0479376  0.1924817   0.249 0.803322    
## jobunemployed        0.3575541  0.3589769   0.996 0.319232    
## jobunknown          -0.6145917  0.7537521  -0.815 0.414857    
## contacttelephone    -1.4770638  0.2015376  -7.329 2.32e-13 ***
## monthaug            -0.4214099  0.2776536  -1.518 0.129076    
## monthdec             1.7657440  0.5929586   2.978 0.002903 ** 
## monthjul            -0.9235200  0.2882876  -3.203 0.001358 ** 
## monthjun             0.9328203  0.3016902   3.092 0.001988 ** 
## monthmar             2.2707780  0.4172530   5.442 5.26e-08 ***
## monthmay            -0.5397248  0.2653768  -2.034 0.041971 *  
## monthnov            -0.8894528  0.3055540  -2.911 0.003603 ** 
## monthoct             1.3512528  0.3869363   3.492 0.000479 ***
## monthsep             0.9172236  0.4112379   2.230 0.025721 *  
## duration             0.0047984  0.0002373  20.224  < 2e-16 ***
## campaign            -0.1334898  0.0432051  -3.090 0.002004 ** 
## previous             0.3940194  0.1568724   2.512 0.012014 *  
## poutcomenonexistent  0.2978275  0.2865183   1.039 0.298586    
## poutcomesuccess      2.4434982  0.2718644   8.988  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2845.8  on 4118  degrees of freedom
## Residual deviance: 1762.2  on 4091  degrees of freedom
## AIC: 1818.2
## 
## Number of Fisher Scoring iterations: 6
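The AIC values that step compares can be verified by hand from the two summary outputs. A sketch, assuming the relationship AIC = residual deviance + 2k that holds for a logistic regression on individual yes/no responses, where k is the number of estimated coefficients:

```r
# Residual deviance and residual df from summary(a1) and summary(m2.log)
n_obs   <- 4119
full    <- c(deviance = 1749.7, resid_df = 4072)
reduced <- c(deviance = 1762.2, resid_df = 4091)

# Number of estimated coefficients (including the intercept)
k_full    <- unname(n_obs - full["resid_df"])
k_reduced <- unname(n_obs - reduced["resid_df"])

# AIC = residual deviance + 2 * k
aic_full    <- unname(full["deviance"])    + 2 * k_full
aic_reduced <- unname(reduced["deviance"]) + 2 * k_reduced
c(full = aic_full, reduced = aic_reduced)   # 1843.7 and 1818.2
```

The reduced model fits slightly worse (higher deviance) but uses 19 fewer coefficients, which is why its AIC is lower.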

vif(m2.log)

##              GVIF Df GVIF^(1/(2*Df))
## age      1.559178  1        1.248671
## job      1.991426 11        1.031807
## contact  1.612700  1        1.269921
## month    2.261133  9        1.046369
## duration 1.189741  1        1.090752
## campaign 1.056995  1        1.028102
## previous 3.395907  1        1.842799
## poutcome 3.750104  2        1.391589

References

Anderson, Richard P, Ruyun Jin, and Gary L Grunkemeier. 2003. “Understanding Logistic Regression Analysis in Clinical Reports: An Introduction.” The Annals of Thoracic Surgery 75 (3): 753–57.

Peng, Chao-Ying Joanne, Kuk Lida Lee, and Gary M Ingersoll. 2002. “An Introduction to Logistic Regression Analysis and Reporting.” The Journal of Educational Research 96 (1): 3–14.

Bank Marketing Case

Thomas Farrell

2/7/2022
