I. Executive Summary
This data set comes from marketing campaigns conducted by a Portuguese banking institution. The models below show the relationships between the different variables tracked in the campaign. In this case study we look to improve the efficiency of the marketing campaign by identifying the main factors that may affect its success.
II. The Problem
The task we were assigned for this case was to run a logistic regression model to predict whether clients would subscribe to a bank term deposit, based on the various input variables. Accurately predicting customers' decisions on whether to make a deposit will allow us to find the most prominent predictors and give the company a better understanding of who to market to. This task can be approached with a logistic regression; we will determine whether it is an accurate predictor by checking the F-score.
The main question to be answered is: what is a term deposit? A term deposit is a cash investment held by a financial institution. These investments carry a short-term maturity, which the investor must recognize before buying, as they will only have access to the funds once the term has expired.
IV. Methodology
Variables related to bank client data:
Age: Client’s age. (numerical)
Job: Client’s type of job. (categorical)
Marital: Client’s marital status; divorced means divorced or widowed. (categorical)
Education: Client’s education. (categorical)
Default: Client has previously defaulted. (categorical)
Housing: Client has a housing loan. (categorical)
Loan: Client has a personal loan. (categorical)
Variables related to last contact of the current marketing campaign:
Contact: Contact communication type (telephone or cellular).
Month: Last contact month of year.
day_of_week: Last contact day of week.
duration: Last contact duration in seconds. If duration is 0s, then the client was never contacted to sign up for a term deposit account.
Pdays: Number of days that passed after the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted).
Previous: Number of contacts performed before this campaign for this client (numeric).
Poutcome: Outcome of the previous marketing campaign (categorical: ‘failure,’ ‘nonexistent,’ ‘success’).
Social and economic context attributes:
Emp.var.rate: employment variation rate - quarterly indicator (numeric)
Cons.price.idx: consumer price index - monthly indicator (numeric)
Cons.conf.idx: consumer confidence index - monthly indicator (numeric)
Euribor3m: euribor 3 month rate - daily indicator (numeric)
Nr.employed: number of employees - quarterly indicator (numeric)
Output variable:
y - has the client subscribed a term deposit? (binary: ‘yes,’ ‘no’)
Once we defined the variables in the data set, we changed all of our character variables to factor variables. We then created charts of the significant variables in our data. This allowed us to see visually which people in our data could be beneficial to market to. When making our charts we included y as the fill for the bar chart or histogram. For the categorical variables we decided to use a bar chart to best represent our data. The fill of the “y” variable allowed us to show the relationship between each variable and whether clients would choose to make a bank deposit or not. We had limitations in the data, as some of the numeric variables, such as nr.employed and emp.var.rate, were highly correlated with other variables. So we had to remove these variables from our main data set.
Then we ran a logistic regression and found the GVIF of each variable. The rule that we used to determine multicollinearity was that a variable with a GVIF over 5 showed multicollinearity. Once we had our data down to the correct variables, we were able to see that an optimal cutoff needed to be implemented. We experimented with .5 as the cutoff first, and our sensitivity was .48, which told us that our model was not good at predicting the “yes” responses. We then changed our cutoff to .1, which raised the sensitivity to .76 and gave us a better model. We used the full dataset in the creation of our models.
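The cutoff rule described above can be written compactly. As a sketch, let \hat{p}_i be the fitted probability for client i, x_{ij} the predictors, \beta_j the coefficients and c the cutoff (this notation is introduced here for illustration and is not part of the original analysis):

```latex
\hat{p}_i = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{j} \beta_j x_{ij}\right)\right)},
\qquad
\hat{y}_i =
\begin{cases}
\text{yes} & \text{if } \hat{p}_i \ge c \\
\text{no}  & \text{otherwise}
\end{cases}
```

Lowering c from 0.5 to 0.1 labels more clients as “yes”, which raises sensitivity at the cost of specificity.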
V. Data Analysis
Loading in library
library(titanic)
library(caret)
library(lattice)
library(ggplot2)
library(gam)
library(car)
library(ROCR)
library(ggmosaic)
library(gmodels)
library(prettydoc)
library(tinytex)
library(corrplot)
library(MASS)
library(tidyverse)
b1 = read.csv("/Users/thomasfarrell/Downloads/bank-additional.csv", sep = ";")
source('/Users/thomasfarrell/Downloads/optim_threshold.R')
(Peng, Lee, and Ingersoll 2002)(Anderson, Jin, and Grunkemeier 2003)
Reading in data
head(b1)
## age job marital education default housing loan contact
## 1 30 blue-collar married basic.9y no yes no cellular
## 2 39 services single high.school no no no telephone
## 3 25 services married high.school no yes no telephone
## 4 38 services married basic.9y no unknown unknown telephone
## 5 47 admin. married university.degree no yes no cellular
## 6 32 services single university.degree no no no cellular
## month day_of_week duration campaign pdays previous poutcome emp.var.rate
## 1 may fri 487 2 999 0 nonexistent -1.8
## 2 may fri 346 4 999 0 nonexistent 1.1
## 3 jun wed 227 1 999 0 nonexistent 1.4
## 4 jun fri 17 3 999 0 nonexistent 1.4
## 5 nov mon 58 1 999 0 nonexistent -0.1
## 6 sep thu 128 3 999 2 failure -1.1
## cons.price.idx cons.conf.idx euribor3m nr.employed y
## 1 92.893 -46.2 1.313 5099.1 no
## 2 93.994 -36.4 4.855 5191.0 no
## 3 94.465 -41.8 4.962 5228.1 no
## 4 94.465 -41.8 4.959 5228.1 no
## 5 93.200 -42.0 4.191 5195.8 no
## 6 94.199 -37.5 0.884 4963.6 no
Changing character variables to factors
b1$job = as.factor(b1$job)
b1$marital = as.factor(b1$marital)
b1$education = as.factor(b1$education)
b1$default = as.factor(b1$default)
b1$housing = as.factor(b1$housing)
b1$loan = as.factor(b1$loan)
b1$contact = as.factor(b1$contact)
b1$month = as.factor(b1$month)
b1$day_of_week = as.factor(b1$day_of_week)
b1$poutcome = as.factor(b1$poutcome)
b1$y = as.factor(b1$y)
str(b1)
## 'data.frame': 4119 obs. of 21 variables:
## $ age : int 30 39 25 38 47 32 32 41 31 35 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 2 3 3 2 1 2 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 3 4 4 3 7 7 7 7 6 3 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 2 1 2 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 3 1 3 2 3 1 3 3 1 1 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 1 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 1 2 2 2 1 1 1 1 1 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 5 8 10 10 8 8 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
## $ duration : int 487 346 227 17 58 128 290 44 68 170 ...
## $ campaign : int 2 4 1 3 1 3 4 2 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 2 0 0 1 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ emp.var.rate : num -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
## $ cons.price.idx: num 92.9 94 94.5 94.5 93.2 ...
## $ cons.conf.idx : num -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
## $ euribor3m : num 1.31 4.86 4.96 4.96 4.19 ...
## $ nr.employed : num 5099 5191 5228 5228 5196 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
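The column-by-column conversion above can also be done in one pass. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical stand-ins for `b1`), using only base R:

```r
# Toy data frame standing in for b1 (hypothetical values)
toy <- data.frame(
  age = c(30, 39, 25),
  job = c("blue-collar", "services", "services"),
  y = c("no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], as.factor)

str(toy)
```

Numeric columns such as age are left untouched, so the same two lines should carry over to a data frame with more columns.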
Removing nr.employed and emp.var.rate
We created a corrplot of all the numeric variables to see their correlations. We then removed nr.employed because it had a correlation of .9 with other numeric variables, which indicated multicollinearity. We also removed the variable emp.var.rate because there was a threat of multicollinearity there as well.
b1_num = dplyr::select_if(b1, is.numeric)
M = cor(b1_num)
corrplot(M, method = "number")
b1 = dplyr::select(b1, - nr.employed)
b1 = dplyr::select(b1, - emp.var.rate)
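Reading a correlation plot by eye works for a handful of variables, but the same check can be listed programmatically. The sketch below uses simulated data (`toy_num` is hypothetical, with `x2` deliberately built to be nearly collinear with `x1`) and reports every pair whose absolute correlation exceeds .9:

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
toy_num <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),  # built to be almost collinear with x1
  x3 = rnorm(n)
)

M_toy <- cor(toy_num)

# Keep only the upper triangle so each pair is reported once
M_toy[lower.tri(M_toy, diag = TRUE)] <- NA
high <- which(abs(M_toy) > 0.9, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(M_toy)[high[, "row"]],
  var2 = colnames(M_toy)[high[, "col"]],
  r = M_toy[high]
)
print(high_pairs)
```

The .9 threshold mirrors the value quoted in the text; a different threshold could be substituted if a stricter or looser rule is wanted.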
Checking summary of target variable
summary(b1$y)
## no yes
## 3668 451
Removing any rows with N/A values from the data
b2 = na.omit(b1)
colSums(is.na(b2))
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## cons.price.idx cons.conf.idx euribor3m y
## 0 0 0 0
Histogram of yes or no counts for age
ggplot(data = b1, mapping = aes(x = age, fill = y)) +
geom_histogram(binwidth = 10, position = "dodge2")+
facet_wrap(~ y , nrow = 1)
Here we compare histograms of the age variable for the ‘yes’ and ‘no’ counts. We see from the histograms that the younger population mainly responded to the marketing campaign. We can also see that the data is right-skewed.
Job variable yes and no count
ggplot(data = b1, mapping = aes(x = job, fill = y)) +
geom_bar(position = "dodge2")+
scale_fill_discrete(name = "Yes or no counts")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The job variable shows us that many of the individuals who would make a bank deposit were in jobs with financial stability. These individuals would hold the most purchasing power, as they have the greatest financial freedom.
Contact variable yes and no counts
ggplot(data = b1, mapping = aes(x = contact, fill = y)) +
geom_bar(position = "dodge2")+
scale_fill_discrete(name = "Yes or no counts")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The contact variable is highly significant and plays a large role in determining whether or not customers make a term deposit. From the chart we can see that customers contacted by telephone have a higher chance of declining the opportunity to make a deposit: 14.1% of cellular respondents responded yes, while only 5.2% of telephone respondents responded yes.
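Percentages like these come from dividing each group's “yes” count by the group total. A minimal sketch with made-up data (the `toy` values are hypothetical, not the bank data) shows the row-wise calculation in base R:

```r
# Hypothetical responses for two contact types
toy <- data.frame(
  contact = c("cellular", "cellular", "cellular", "cellular",
              "telephone", "telephone", "telephone", "telephone"),
  y = c("yes", "no", "no", "no", "no", "no", "no", "yes")
)

tab <- table(toy$contact, toy$y)

# margin = 1 divides each row by its row total
rates <- prop.table(tab, margin = 1)
print(round(100 * rates, 1))
```

On the real data the equivalent call would be `prop.table(table(b1$contact, b1$y), margin = 1)`; the same pattern applies to the month variable.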
Month variable yes and no count
ggplot(data = b1, mapping = aes(x = month, fill = y)) +
geom_bar(position = "dodge2")+
scale_fill_discrete(name = "Yes or no counts")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The month of May has the most observations in the data set, while the month of March has the largest percentage of “yes” observations.
Duration status yes and no count
ggplot(data = b1, mapping = aes(x = duration, fill = y)) +
geom_histogram()+
scale_fill_discrete(name = "Yes or no counts")+
facet_wrap(~ y , nrow = 2)+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Campaign Variable yes and no count
ggplot(data = b1, mapping = aes(x = campaign, fill = y)) +
geom_histogram()+
scale_fill_discrete(name = "Yes or no counts")+
facet_wrap(~ y , nrow = 2)+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In this chart we can see that the data is right-skewed: most clients were contacted only a few times during the campaign. This suggests that the earlier contacts were more effective at marketing to consumers.
Previous yes and no counts
ggplot(data = b1, mapping = aes(x = previous, fill = y)) +
geom_bar(position = "dodge2")+
scale_fill_discrete(name = "Yes or no counts")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This chart shows the number of contacts performed with each client before this campaign. The data is right-skewed, showing that the previous campaigns reached very few of these clients.
Poutcome variable yes and no counts
ggplot(data = b1, mapping = aes(x = poutcome, fill = y)) +
geom_bar(position = "dodge2")+
scale_fill_discrete(name = "Yes or no counts")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Looking at this chart, you can see that many clients were not part of the previous marketing campaign (“nonexistent”), and some of them would still respond yes to a bank term deposit.
VI. Findings and results
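Our data analysis showed that the LDA model had the highest raw accuracy (91.26%), while the logistic regression model with a .1 cutoff had a lower accuracy (84.95%) but a much higher sensitivity (0.76 versus 0.40).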
training testing split
set.seed(1)
tr_ind = sample(nrow(b1), .8*nrow(b1), replace = F)
b1train = b1[tr_ind,]
b1test = b1[-tr_ind,]
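As a quick check on the split sizes, the arithmetic can be reproduced without the data. This sketch assumes only the row count of 4119 reported by `str(b1)`; it uses `floor()` to make the rounding of the 80% share explicit:

```r
n <- 4119                    # rows reported by str(b1)
n_train <- floor(0.8 * n)    # rows used for training
n_test <- n - n_train        # rows left for testing

set.seed(1)
idx <- sample(n, n_train, replace = FALSE)

print(c(train = n_train, test = n_test))
```

The test size of 824 agrees with the total number of cases in the confusion matrices (639 + 19 + 105 + 61).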
Logistic regression model
a1 = glm(formula = y ~ age + job + marital + education + default + housing + contact + month + day_of_week + duration + campaign + pdays + previous + poutcome , data = b1 , family = binomial)
vif(a1)
## GVIF Df GVIF^(1/(2*Df))
## age 2.059400 1 1.435061
## job 5.856877 11 1.083662
## marital 1.562187 3 1.077181
## education 3.382069 7 1.090935
## default 1.146105 2 1.034680
## housing 1.083590 2 1.020273
## contact 1.636177 1 1.279131
## month 2.585692 9 1.054195
## day_of_week 1.120279 4 1.014298
## duration 1.203719 1 1.097141
## campaign 1.067845 1 1.033366
## pdays 10.552027 1 3.248388
## previous 4.142006 1 2.035192
## poutcome 23.674554 2 2.205822
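The third column of this output is the GVIF raised to the power 1/(2*Df), which adjusts for the number of coefficients a factor contributes. The sketch below recomputes that column for a few rows copied from the table above and applies the report's GVIF > 5 rule (the vector names are illustrative):

```r
# Values copied from the vif(a1) output
gvif <- c(job = 5.856877, pdays = 10.552027, previous = 4.142006, poutcome = 23.674554)
df   <- c(job = 11,       pdays = 1,         previous = 1,        poutcome = 2)

# Adjusted value, comparable across terms with different Df
adj <- gvif^(1 / (2 * df))

# The rule used in this report: flag any term with GVIF over 5
flag <- gvif > 5

print(data.frame(gvif, df, adj = round(adj, 6), over_5 = flag))
```

Under this rule job, pdays and poutcome are flagged, while previous is not.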
summary(a1)
##
## Call:
## glm(formula = y ~ age + job + marital + education + default +
## housing + contact + month + day_of_week + duration + campaign +
## pdays + previous + poutcome, family = binomial, data = b1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.8779 -0.3383 -0.2235 -0.1297 2.9635
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.536e+00 9.658e-01 -4.696 2.65e-06 ***
## age 1.838e-02 7.873e-03 2.334 0.019577 *
## jobblue-collar -2.524e-01 2.599e-01 -0.971 0.331618
## jobentrepreneur -8.253e-01 4.824e-01 -1.711 0.087120 .
## jobhousemaid 4.417e-01 4.079e-01 1.083 0.278829
## jobmanagement -3.533e-01 2.711e-01 -1.303 0.192456
## jobretired -1.601e-03 3.343e-01 -0.005 0.996179
## jobself-employed -7.316e-01 4.029e-01 -1.816 0.069420 .
## jobservices 1.965e-01 2.697e-01 0.729 0.466216
## jobstudent 6.019e-01 3.872e-01 1.554 0.120084
## jobtechnician 1.336e-01 2.116e-01 0.631 0.527723
## jobunemployed 5.119e-01 3.659e-01 1.399 0.161864
## jobunknown -5.531e-01 7.628e-01 -0.725 0.468451
## maritalmarried 2.230e-01 2.306e-01 0.967 0.333557
## maritalsingle 3.957e-01 2.607e-01 1.518 0.129076
## maritalunknown -2.612e-02 1.119e+00 -0.023 0.981379
## educationbasic.6y 2.570e-01 3.891e-01 0.660 0.508962
## educationbasic.9y 1.118e-01 3.128e-01 0.358 0.720656
## educationhigh.school 1.718e-01 2.946e-01 0.583 0.559852
## educationilliterate -9.849e+00 5.354e+02 -0.018 0.985323
## educationprofessional.course 2.337e-01 3.215e-01 0.727 0.467144
## educationuniversity.degree 4.588e-01 2.956e-01 1.552 0.120662
## educationunknown 5.733e-01 3.700e-01 1.550 0.121210
## defaultunknown -1.931e-01 2.002e-01 -0.965 0.334618
## defaultyes -1.011e+01 5.354e+02 -0.019 0.984937
## housingunknown -5.621e-01 5.106e-01 -1.101 0.270937
## housingyes -7.053e-02 1.315e-01 -0.536 0.591781
## contacttelephone -1.462e+00 2.040e-01 -7.166 7.70e-13 ***
## monthaug -4.488e-01 2.823e-01 -1.590 0.111884
## monthdec 1.654e+00 5.987e-01 2.763 0.005723 **
## monthjul -9.034e-01 2.928e-01 -3.086 0.002032 **
## monthjun 9.185e-01 3.051e-01 3.010 0.002610 **
## monthmar 2.220e+00 4.192e-01 5.295 1.19e-07 ***
## monthmay -5.251e-01 2.694e-01 -1.949 0.051263 .
## monthnov -9.076e-01 3.087e-01 -2.940 0.003279 **
## monthoct 1.280e+00 3.900e-01 3.282 0.001029 **
## monthsep 8.870e-01 4.159e-01 2.133 0.032931 *
## day_of_weekmon 1.083e-01 2.048e-01 0.529 0.596867
## day_of_weekthu 3.693e-02 2.049e-01 0.180 0.856945
## day_of_weektue 2.546e-02 2.103e-01 0.121 0.903629
## day_of_weekwed 2.274e-01 2.120e-01 1.073 0.283456
## duration 4.824e-03 2.391e-04 20.180 < 2e-16 ***
## campaign -1.359e-01 4.384e-02 -3.100 0.001938 **
## pdays -1.330e-04 6.747e-04 -0.197 0.843718
## previous 3.530e-01 1.743e-01 2.026 0.042806 *
## poutcomenonexistent 2.420e-01 2.925e-01 0.827 0.407991
## poutcomesuccess 2.291e+00 6.645e-01 3.447 0.000566 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2845.8 on 4118 degrees of freedom
## Residual deviance: 1749.7 on 4072 degrees of freedom
## AIC: 1843.7
##
## Number of Fisher Scoring iterations: 12
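Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. A minimal sketch using three estimates copied from the summary above (the vector `coefs` is illustrative, not an object from the analysis):

```r
# Estimates copied from summary(a1)
coefs <- c(contacttelephone = -1.462,
           duration         = 4.824e-03,
           poutcomesuccess  = 2.291)

# Odds ratio for a one-unit change in each predictor
odds_ratio <- exp(coefs)

# Duration is measured in seconds, so scale to one extra minute
or_per_minute <- exp(60 * coefs["duration"])

print(round(odds_ratio, 3))
print(round(or_per_minute, 3))
```

An odds ratio below 1 (telephone contact) lowers the odds of a “yes”, and one above 1 (a successful previous outcome, or a longer call) raises them. On the fitted model the same table could be obtained with `exp(coef(a1))`.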
predprob = predict.glm(a1, newdata = b1test, type = "response")
predclass_log = ifelse(predprob >= .1, "yes", "no")
caret::confusionMatrix(as.factor(predclass_log), as.factor(b1test$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 639 19
## yes 105 61
##
## Accuracy : 0.8495
## 95% CI : (0.8232, 0.8732)
## No Information Rate : 0.9029
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4199
##
## Mcnemar's Test P-Value : 2.29e-14
##
## Sensitivity : 0.76250
## Specificity : 0.85887
## Pos Pred Value : 0.36747
## Neg Pred Value : 0.97112
## Prevalence : 0.09709
## Detection Rate : 0.07403
## Detection Prevalence : 0.20146
## Balanced Accuracy : 0.81069
##
## 'Positive' Class : yes
##
This output shows how accurately our logistic regression model predicts yes and no responses. Sensitivity is the proportion of actual “yes” responses that the model predicts correctly, and specificity is the proportion of actual “no” responses that it predicts correctly. This model has an accuracy of 84.95%. We used .1 as our cutoff point, which gave us the best balance of sensitivity (0.76) and specificity (0.86), showing that our model was reasonably accurate at predicting both yes and no responses.
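The headline statistics can be recomputed by hand from the four cells of the confusion matrix, which also makes it possible to report the F-score mentioned in Section II. A short sketch using the counts printed above:

```r
# Cells of the logistic regression confusion matrix ("yes" is the positive class)
tp <- 61    # predicted yes, actually yes
fn <- 19    # predicted no,  actually yes
fp <- 105   # predicted yes, actually no
tn <- 639   # predicted no,  actually no

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
accuracy    <- (tp + tn) / (tp + fn + fp + tn)
precision   <- tp / (tp + fp)

# F1 is the harmonic mean of precision and sensitivity (recall)
f1 <- 2 * precision * sensitivity / (precision + sensitivity)

print(round(c(sensitivity = sensitivity, specificity = specificity,
              accuracy = accuracy, precision = precision, f1 = f1), 4))
```

The first four values reproduce the caret output. caret can also print the F1 value directly when `confusionMatrix()` is called with `mode = "everything"`.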
Running the Optim Threshold Function
optim_threshold(a1,b1, b1$y)
This confirmed that the optimal threshold should be .1 to receive the best results from our logistic regression.
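The source of `optim_threshold()` is not shown in this report, so the following is only a hypothetical stand-in for the general idea: scan a grid of cutoffs and keep the one that maximizes sensitivity + specificity - 1 (Youden's J). The data are simulated, and the criterion is an assumption rather than what the sourced function necessarily does:

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2.5 + 1.5 * x))   # simulated rare "yes" class

fit <- glm(y ~ x, family = binomial)
p <- predict(fit, type = "response")

# Sensitivity and specificity at each candidate cutoff
cutoffs <- seq(0.05, 0.95, by = 0.05)
scan <- t(sapply(cutoffs, function(cut) {
  pred <- p >= cut
  c(cutoff = cut,
    sensitivity = sum(pred & y == 1) / sum(y == 1),
    specificity = sum(!pred & y == 0) / sum(y == 0))
}))
scan <- as.data.frame(scan)

# Youden's J: assumed selection criterion for this sketch
scan$youden <- scan$sensitivity + scan$specificity - 1
best <- scan$cutoff[which.max(scan$youden)]
print(best)
```

With an imbalanced outcome like this one, the selected cutoff tends to sit well below .5, which is consistent with the choice of .1 in this report.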
Making our LDA Model
m1.lda = lda(formula = y ~ age + job + marital + education + default + housing + contact + month + day_of_week + duration + campaign + pdays + previous + poutcome, data = b1train)
predclass_lda = predict(m1.lda, newdata = b1test)
caret::confusionMatrix(as.factor(predclass_lda$class),as.factor(b1test$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 720 48
## yes 24 32
##
## Accuracy : 0.9126
## 95% CI : (0.8912, 0.931)
## No Information Rate : 0.9029
## P-Value [Acc > NIR] : 0.189664
##
## Kappa : 0.4246
##
## Mcnemar's Test P-Value : 0.006717
##
## Sensitivity : 0.40000
## Specificity : 0.96774
## Pos Pred Value : 0.57143
## Neg Pred Value : 0.93750
## Prevalence : 0.09709
## Detection Rate : 0.03883
## Detection Prevalence : 0.06796
## Balanced Accuracy : 0.68387
##
## 'Positive' Class : yes
##
This output shows how accurately our LDA model predicts yes and no responses. This model is 91.26% accurate. The LDA model shows a higher accuracy, but we were not able to implement an optimal cutoff for the LDA model. This accuracy could be misleading, because the sensitivity (0.40) and balanced accuracy (0.68) are lower than those of the logistic regression model (0.76 and 0.81).
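One possible way to apply a custom cutoff to an LDA model is to threshold the posterior probabilities that `predict()` returns, rather than using the default `$class`. The sketch below uses simulated data (`toy`, `fit_lda` and `sens_at` are hypothetical names) and is not the analysis run in this report:

```r
library(MASS)

set.seed(1)
n <- 1000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- factor(ifelse(rbinom(n, 1, plogis(-2.5 + 1.5 * toy$x1)) == 1, "yes", "no"))

fit_lda <- lda(y ~ x1 + x2, data = toy)

# Posterior probability of the "yes" class for each row
post_yes <- predict(fit_lda, newdata = toy)$posterior[, "yes"]

# Sensitivity when "yes" is predicted at a given cutoff
sens_at <- function(cut) {
  pred <- post_yes >= cut
  sum(pred & toy$y == "yes") / sum(toy$y == "yes")
}

print(c(cut_0.5 = sens_at(0.5), cut_0.1 = sens_at(0.1)))
```

Lowering the cutoff can only keep or increase the number of “yes” predictions, so sensitivity at .1 is never below sensitivity at .5; the trade-off is a lower specificity.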
Using the step function
We used the step function to go backward through our model and take out variables that did not improve the AIC. The resulting model kept the variables age, job, contact, month, duration, campaign, previous and poutcome. With the VIF function we were then able to see that none of the remaining variables has a GVIF over 5, which would have indicated multicollinearity in the dataset.
m2.log = step(a1, direction = "backward")
## Start: AIC=1843.67
## y ~ age + job + marital + education + default + housing + contact +
## month + day_of_week + duration + campaign + pdays + previous +
## poutcome
##
## Df Deviance AIC
## - education 7 1754.9 1834.9
## - day_of_week 4 1751.2 1837.2
## - marital 3 1752.2 1840.2
## - job 11 1768.7 1840.7
## - default 2 1750.7 1840.7
## - housing 2 1751.1 1841.1
## - pdays 1 1749.7 1841.7
## <none> 1749.7 1843.7
## - previous 1 1754.1 1846.1
## - age 1 1755.1 1847.1
## - poutcome 2 1761.9 1851.9
## - campaign 1 1761.1 1853.1
## - contact 1 1805.9 1897.9
## - month 9 1896.7 1972.7
## - duration 1 2322.9 2414.9
##
## Step: AIC=1834.9
## y ~ age + job + marital + default + housing + contact + month +
## day_of_week + duration + campaign + pdays + previous + poutcome
##
## Df Deviance AIC
## - day_of_week 4 1756.5 1828.5
## - marital 3 1757.9 1831.9
## - default 2 1756.1 1832.1
## - housing 2 1756.4 1832.4
## - job 11 1774.6 1832.6
## - pdays 1 1754.9 1832.9
## <none> 1754.9 1834.9
## - age 1 1760.0 1838.0
## - previous 1 1760.1 1838.1
## - poutcome 2 1767.9 1843.9
## - campaign 1 1766.1 1844.1
## - contact 1 1811.9 1889.9
## - month 9 1904.8 1966.8
## - duration 1 2326.2 2404.2
##
## Step: AIC=1828.45
## y ~ age + job + marital + default + housing + contact + month +
## duration + campaign + pdays + previous + poutcome
##
## Df Deviance AIC
## - marital 3 1759.5 1825.5
## - default 2 1757.7 1825.7
## - job 11 1775.9 1825.9
## - housing 2 1758.0 1826.0
## - pdays 1 1756.5 1826.5
## <none> 1756.5 1828.5
## - age 1 1761.3 1831.3
## - previous 1 1761.7 1831.7
## - poutcome 2 1769.1 1837.1
## - campaign 1 1767.9 1837.9
## - contact 1 1813.0 1883.0
## - month 9 1905.8 1959.8
## - duration 1 2327.1 2397.1
##
## Step: AIC=1825.48
## y ~ age + job + default + housing + contact + month + duration +
## campaign + pdays + previous + poutcome
##
## Df Deviance AIC
## - default 2 1760.7 1822.7
## - housing 2 1760.9 1822.9
## - pdays 1 1759.5 1823.5
## - job 11 1780.8 1824.8
## <none> 1759.5 1825.5
## - age 1 1762.4 1826.4
## - previous 1 1765.2 1829.2
## - poutcome 2 1772.5 1834.5
## - campaign 1 1770.9 1834.9
## - contact 1 1816.3 1880.3
## - month 9 1912.0 1960.0
## - duration 1 2330.6 2394.6
##
## Step: AIC=1822.71
## y ~ age + job + housing + contact + month + duration + campaign +
## pdays + previous + poutcome
##
## Df Deviance AIC
## - housing 2 1762.2 1820.2
## - pdays 1 1760.7 1820.7
## - job 11 1782.6 1822.6
## <none> 1760.7 1822.7
## - age 1 1763.2 1823.2
## - previous 1 1766.5 1826.5
## - poutcome 2 1774.1 1832.1
## - campaign 1 1772.1 1832.1
## - contact 1 1819.5 1879.5
## - month 9 1916.5 1960.5
## - duration 1 2333.3 2393.3
##
## Step: AIC=1820.2
## y ~ age + job + contact + month + duration + campaign + pdays +
## previous + poutcome
##
## Df Deviance AIC
## - pdays 1 1762.2 1818.2
## - job 11 1784.1 1820.1
## <none> 1762.2 1820.2
## - age 1 1764.8 1820.8
## - previous 1 1767.7 1823.7
## - campaign 1 1773.6 1829.6
## - poutcome 2 1775.8 1829.8
## - contact 1 1820.9 1876.9
## - month 9 1917.6 1957.6
## - duration 1 2334.8 2390.8
##
## Step: AIC=1818.21
## y ~ age + job + contact + month + duration + campaign + previous +
## poutcome
##
## Df Deviance AIC
## <none> 1762.2 1818.2
## - job 11 1784.2 1818.2
## - age 1 1764.8 1818.8
## - previous 1 1768.9 1822.9
## - campaign 1 1773.6 1827.6
## - contact 1 1821.0 1875.0
## - poutcome 2 1849.6 1901.6
## - month 9 1918.5 1956.5
## - duration 1 2334.9 2388.9
summary(m2.log)
##
## Call:
## glm(formula = y ~ age + job + contact + month + duration + campaign +
## previous + poutcome, family = binomial, data = b1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9721 -0.3365 -0.2286 -0.1340 3.0660
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.7935609 0.4514940 -8.402 < 2e-16 ***
## age 0.0110124 0.0068641 1.604 0.108634
## jobblue-collar -0.4931022 0.2094759 -2.354 0.018574 *
## jobentrepreneur -0.9113934 0.4772859 -1.910 0.056193 .
## jobhousemaid 0.2608928 0.3912566 0.667 0.504895
## jobmanagement -0.3441070 0.2666340 -1.291 0.196856
## jobretired -0.1043960 0.3233497 -0.323 0.746803
## jobself-employed -0.7334403 0.3985768 -1.840 0.065747 .
## jobservices -0.0155009 0.2500631 -0.062 0.950573
## jobstudent 0.5948306 0.3616674 1.645 0.100034
## jobtechnician 0.0479376 0.1924817 0.249 0.803322
## jobunemployed 0.3575541 0.3589769 0.996 0.319232
## jobunknown -0.6145917 0.7537521 -0.815 0.414857
## contacttelephone -1.4770638 0.2015376 -7.329 2.32e-13 ***
## monthaug -0.4214099 0.2776536 -1.518 0.129076
## monthdec 1.7657440 0.5929586 2.978 0.002903 **
## monthjul -0.9235200 0.2882876 -3.203 0.001358 **
## monthjun 0.9328203 0.3016902 3.092 0.001988 **
## monthmar 2.2707780 0.4172530 5.442 5.26e-08 ***
## monthmay -0.5397248 0.2653768 -2.034 0.041971 *
## monthnov -0.8894528 0.3055540 -2.911 0.003603 **
## monthoct 1.3512528 0.3869363 3.492 0.000479 ***
## monthsep 0.9172236 0.4112379 2.230 0.025721 *
## duration 0.0047984 0.0002373 20.224 < 2e-16 ***
## campaign -0.1334898 0.0432051 -3.090 0.002004 **
## previous 0.3940194 0.1568724 2.512 0.012014 *
## poutcomenonexistent 0.2978275 0.2865183 1.039 0.298586
## poutcomesuccess 2.4434982 0.2718644 8.988 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2845.8 on 4118 degrees of freedom
## Residual deviance: 1762.2 on 4091 degrees of freedom
## AIC: 1818.2
##
## Number of Fisher Scoring iterations: 6
vif(m2.log)
## GVIF Df GVIF^(1/(2*Df))
## age 1.559178 1 1.248671
## job 1.991426 11 1.031807
## contact 1.612700 1 1.269921
## month 2.261133 9 1.046369
## duration 1.189741 1 1.090752
## campaign 1.056995 1 1.028102
## previous 3.395907 1 1.842799
## poutcome 3.750104 2 1.391589
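The AIC values reported in the step trace and model summaries above can be checked by hand. For a binomial model fitted to individual yes/no responses, the AIC equals the residual deviance plus twice the number of estimated coefficients. A small sketch (the helper `aic_check` is hypothetical) using the figures printed in the two summaries:

```r
# AIC = residual deviance + 2 * (number of estimated coefficients)
aic_check <- function(resid_dev, n_obs, resid_df) {
  k <- n_obs - resid_df   # coefficients, including the intercept
  resid_dev + 2 * k
}

# 4119 observations (null deviance has 4118 degrees of freedom)
full    <- aic_check(1749.7, 4119, 4072)   # model a1
reduced <- aic_check(1762.2, 4119, 4091)   # model m2.log

print(c(full = full, reduced = reduced))
```

The reduced model gives up a little deviance (1762.2 versus 1749.7) but drops 19 coefficients, which is why its AIC is lower.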