Statistical Modeling Assignment 2

Group Members : Utsav Raj, Rohit Menon, Satarupa Saha, Mounish Sunkara

Description:

Customer conversion in a direct marketing campaign is one of the most important metrics of success when evaluating the campaign itself. In a world of limited resources, it’s often difficult to make the best use of a marketer’s time, we can improve on this by predicting whether a prospective customer will respond to a marketing campaign. In order to do this, we will:

• Understand which of the observed variables are most associated with the chance of subscribing to a term deposit.

• How the important variables relate to the predicted probability that a client will subscribe to a term deposit

• Build a model that could be useful in determining which future clients are likely to respond a term deposit, if contacted.

Code
bank_trn_data <- read.csv("bank_trn.csv",stringsAsFactors = TRUE)
bank_new_data <- read.csv("bank_new.csv",stringsAsFactors = TRUE)
bank_trn_data$y <- ifelse(bank_trn_data$subscribed == "no", 0, 1)
bank_new_data$y <- ifelse(bank_new_data$subscribed == "no", 0, 1)

1. Exploratory data analysis (EDA)

Code
summ <- summary(bank_trn_data)
#knitr::kable(summary(bank_trn_data), "simple")

knitr::kable(summ, align = "lccrr", caption = "Bank Data Summary")
Bank Data Summary
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed subscribed y
Min. :17.00 admin. :9334 divorced: 4145 university.degree :10931 no :29292 no :16774 no :30559 cellular :23458 may :12451 fri:7033 Min. : 0.0 Min. : 1.000 Min. : 0.0 Min. :0.0000 failure : 3840 Min. :-3.40000 Min. :92.20 Min. :-50.80 Min. :0.634 Min. :4964 no :32886 Min. :0.0000
1st Qu.:32.00 blue-collar:8365 married :22444 high.school : 8526 unknown: 7766 unknown: 896 unknown: 896 telephone:13603 jul : 6418 mon:7660 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.0000 nonexistent:31979 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.70 1st Qu.:1.344 1st Qu.:5099 yes: 4175 1st Qu.:0.0000
Median :38.00 technician :6047 single :10396 basic.9y : 5456 yes : 3 yes :19391 yes : 5606 NA aug : 5542 thu:7747 Median : 180.0 Median : 2.000 Median :999.0 Median :0.0000 success : 1242 Median : 1.10000 Median :93.75 Median :-41.80 Median :4.857 Median :5191 NA Median :0.0000
Mean :40.04 services :3589 unknown : 76 professional.course: 4702 NA NA NA NA jun : 4766 tue:7296 Mean : 258.5 Mean : 2.565 Mean :962.2 Mean :0.1738 NA Mean : 0.08039 Mean :93.58 Mean :-40.49 Mean :3.620 Mean :5167 NA Mean :0.1127
3rd Qu.:47.00 management :2632 NA basic.4y : 3814 NA NA NA NA nov : 3683 wed:7325 3rd Qu.: 320.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.0000 NA 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.40 3rd Qu.:4.961 3rd Qu.:5228 NA 3rd Qu.:0.0000
Max. :98.00 retired :1541 NA basic.6y : 2073 NA NA NA NA apr : 2380 NA Max. :4918.0 Max. :56.000 Max. :999.0 Max. :6.0000 NA Max. : 1.40000 Max. :94.77 Max. :-26.90 Max. :5.045 Max. :5228 NA Max. :1.0000
NA (Other) :5553 NA (Other) : 1559 NA NA NA NA (Other): 1821 NA NA NA NA NA NA NA NA NA NA NA NA NA

Data Cleaning

Checking for blank data and duplictaes

Code
colSums(is.na(bank_trn_data))
           age            job        marital      education        default 
             0              0              0              0              0 
       housing           loan        contact          month    day_of_week 
             0              0              0              0              0 
      duration       campaign          pdays       previous       poutcome 
             0              0              0              0              0 
  emp_var_rate cons_price_idx  cons_conf_idx      euribor3m    nr_employed 
             0              0              0              0              0 
    subscribed              y 
             0              0 
Code
bank_trn_data_fnl <-unique(bank_trn_data)

We have 9 duplicate records. Deleted all the duplicates.

Plotting the Graphs

Code
par(mfrow = c(3,2))
hist(bank_trn_data_fnl$age, main = "Distribution of Age", xlab = "Age")
hist(bank_trn_data_fnl$duration, main = "Distribution of Duration", xlab = "Duration")
barplot(table(bank_trn_data_fnl$default), main = "Frequency of Default Status")
barplot(table(bank_trn_data_fnl$marital), main = "Marital Status")
barplot(table(bank_trn_data_fnl$loan), main = "Frequency by Loan Status")
barplot(table(bank_trn_data_fnl$contact), main = "Distribution by Contact Status")

  1. Call duration variable seems to be right skewed which suggests most of the calls where less than 10 secs.
  2. Around 79% people have not defaulted on loan payment.
  3. 50% of the people are married and don’t have any loan.
Code
par(mfrow = c(2,2))

barplot(table(bank_trn_data_fnl$pdays), main = "Distribution by PDays")
barplot(table(bank_trn_data_fnl$previous), main = "Distribution by Previous")


#bank_trn_data_fnl %>% group_by(pdays) %>% count()

hist(bank_trn_data_fnl$nr_employed, main = "Distribution of Number Employed", xlab = "Number of Employees")
barplot(table(bank_trn_data_fnl$subscribed), main = "Frequency by Subscribed Status")

  1. Majority of the records have PDays as 999 which means they were contacted a long time back.
  2. Pdays and previous are correlated.
  3. No of employees does not have a clear distribution
  4. 88% of the data set have subscribed

Checking correlation on all numeric variables

Code
# Look at correlations between numeric features
num <- sapply(bank_trn_data_fnl, FUN = is.numeric) 
corx <- cor(bank_trn_data_fnl[, num], use="pairwise", method="spearman")  

# Visualize correlations

corrplot(corx, method = "color", order = "FPC", type = "lower",addCoef.col = "black",number.cex = 0.65, diag = TRUE)

High Correlation between emp_var_rate with nr_emolpyed and euribor3m and b/w euribor3m and nr_emolpyed.

Chi-Sqr test for categorical variables

Code
bank_fct<- bank_trn_data_fnl%>%select_if(is.factor)
bank_num<- bank_trn_data_fnl%>%select_if(is.numeric)

chis <- lapply(1:ncol(bank_fct), function(x) chisq.test(bank_trn_data_fnl$subscribed, bank_fct[,x])$p.value)

pvalues<-unlist(chis)

table<-data.frame(colnames(bank_fct), pvalues)
table<- table%>%arrange(pvalues)
table
   colnames.bank_fct.       pvalues
1               month  0.000000e+00
2            poutcome  0.000000e+00
3          subscribed  0.000000e+00
4                 job 1.936701e-181
5             contact 5.907889e-174
6             default  1.251693e-80
7           education  2.704408e-34
8             marital  1.260445e-23
9         day_of_week  9.154301e-05
10            housing  3.527408e-02
11               loan  2.002998e-01

From Chi square test of categorical values we can see that month, poutcome, job are significant top 3 predictors.

Based on the analysis we have done on numerical and categorical variables, 3 predictors we have chosen are as below :

  1. Previous
  2. Duration
  3. poutcome

2.Initial Modeling

Full Model

Code
model_null<- glm(subscribed ~ 1 , family=binomial, data = bank_trn_data_fnl)

model_var<- glm(subscribed ~ duration+previous+poutcome, family=binomial, data = bank_trn_data_fnl)

AIC(model_null)
[1] 26094.76
Code
AIC(model_var)
[1] 19237.43

Null model has AIC of 26101.04 where as our initial model has AIC of 19243.69 which is comparatively better than the null model.

3.Step Function

stepAIC() and specify direction = “both”

Code
model_full<-glm(subscribed ~ .-y, family=binomial, data = bank_trn_data_fnl)

step<-MASS::stepAIC(model_var, direction ="both", scope= formula(model_full), trace=FALSE)
summary(step)

Call:
glm(formula = subscribed ~ duration + poutcome + month + emp_var_rate + 
    job + cons_price_idx + contact + euribor3m + default + pdays + 
    day_of_week + campaign + cons_conf_idx, family = binomial, 
    data = bank_trn_data_fnl)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.9968  -0.2989  -0.1856  -0.1347   3.2903  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -1.660e+02  1.093e+01 -15.187  < 2e-16 ***
duration             4.690e-03  7.847e-05  59.766  < 2e-16 ***
poutcomenonexistent  4.966e-01  6.735e-02   7.373 1.67e-13 ***
poutcomesuccess      1.030e+00  2.130e-01   4.836 1.32e-06 ***
monthaug             7.528e-01  1.133e-01   6.644 3.05e-11 ***
monthdec             2.206e-01  2.083e-01   1.059  0.28975    
monthjul             1.450e-01  1.010e-01   1.435  0.15119    
monthjun            -3.342e-01  1.121e-01  -2.981  0.00287 ** 
monthmar             1.854e+00  1.231e-01  15.065  < 2e-16 ***
monthmay            -5.118e-01  8.009e-02  -6.391 1.65e-10 ***
monthnov            -4.940e-01  1.128e-01  -4.378 1.20e-05 ***
monthoct             3.480e-02  1.308e-01   0.266  0.79014    
monthsep             1.617e-01  1.379e-01   1.172  0.24107    
emp_var_rate        -1.551e+00  1.200e-01 -12.917  < 2e-16 ***
jobblue-collar      -3.135e-01  6.949e-02  -4.512 6.43e-06 ***
jobentrepreneur     -1.598e-01  1.309e-01  -1.221  0.22223    
jobhousemaid        -2.757e-02  1.439e-01  -0.192  0.84807    
jobmanagement        2.079e-02  8.730e-02   0.238  0.81178    
jobretired           2.820e-01  8.779e-02   3.213  0.00132 ** 
jobself-employed    -8.868e-02  1.226e-01  -0.723  0.46955    
jobservices         -1.775e-01  8.635e-02  -2.055  0.03986 *  
jobstudent           1.897e-01  1.088e-01   1.743  0.08130 .  
jobtechnician        8.411e-03  6.707e-02   0.125  0.90020    
jobunemployed       -4.733e-02  1.345e-01  -0.352  0.72484    
jobunknown           3.126e-02  2.361e-01   0.132  0.89466    
cons_price_idx       1.726e+00  1.133e-01  15.229  < 2e-16 ***
contacttelephone    -6.062e-01  7.717e-02  -7.855 3.99e-15 ***
euribor3m            4.576e-01  9.089e-02   5.035 4.78e-07 ***
defaultunknown      -3.149e-01  6.983e-02  -4.510 6.50e-06 ***
defaultyes          -7.366e+00  1.134e+02  -0.065  0.94820    
pdays               -8.578e-04  2.126e-04  -4.035 5.45e-05 ***
day_of_weekmon      -1.133e-01  6.968e-02  -1.626  0.10396    
day_of_weekthu       5.689e-02  6.758e-02   0.842  0.39989    
day_of_weektue       1.256e-01  6.896e-02   1.822  0.06850 .  
day_of_weekwed       1.531e-01  6.939e-02   2.206  0.02737 *  
campaign            -3.676e-02  1.204e-02  -3.054  0.00226 ** 
cons_conf_idx        1.142e-02  5.773e-03   1.978  0.04788 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26093  on 37060  degrees of freedom
Residual deviance: 15401  on 37024  degrees of freedom
AIC: 15475

Number of Fisher Scoring iterations: 10

Step: AIC=15475 subscribed ~ duration + month + poutcome + emp_var_rate + job + cons_price_idx + contact + euribor3m + default + pdays + day_of_week + campaign + cons_conf_idx

The predicted model is different from the initial model that we chose using eye test. Many new predictors have been added to the initial model.

stepAIC() and specify direction = “forward”

Code
step_for<-MASS::stepAIC(model_null, direction ="forward", scope= formula(model_full), trace=FALSE)

summary(step_for)

Call:
glm(formula = subscribed ~ duration + nr_employed + month + poutcome + 
    emp_var_rate + job + cons_price_idx + contact + euribor3m + 
    default + pdays + day_of_week + campaign + cons_conf_idx, 
    family = binomial, data = bank_trn_data_fnl)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.9914  -0.2992  -0.1856  -0.1345   3.3061  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -1.936e+02  4.030e+01  -4.804 1.56e-06 ***
duration             4.689e-03  7.847e-05  59.757  < 2e-16 ***
nr_employed          2.326e-03  3.271e-03   0.711 0.476999    
monthaug             7.937e-01  1.271e-01   6.247 4.19e-10 ***
monthdec             2.753e-01  2.222e-01   1.239 0.215274    
monthjul             1.466e-01  1.011e-01   1.450 0.147188    
monthjun            -3.846e-01  1.327e-01  -2.899 0.003742 ** 
monthmar             1.917e+00  1.516e-01  12.644  < 2e-16 ***
monthmay            -4.881e-01  8.677e-02  -5.625 1.86e-08 ***
monthnov            -4.521e-01  1.273e-01  -3.550 0.000385 ***
monthoct             1.034e-01  1.624e-01   0.637 0.524390    
monthsep             2.536e-01  1.890e-01   1.342 0.179618    
poutcomenonexistent  4.985e-01  6.741e-02   7.394 1.42e-13 ***
poutcomesuccess      1.031e+00  2.130e-01   4.840 1.30e-06 ***
emp_var_rate        -1.615e+00  1.503e-01 -10.742  < 2e-16 ***
jobblue-collar      -3.151e-01  6.952e-02  -4.533 5.81e-06 ***
jobentrepreneur     -1.620e-01  1.309e-01  -1.238 0.215886    
jobhousemaid        -2.740e-02  1.440e-01  -0.190 0.849041    
jobmanagement        2.061e-02  8.730e-02   0.236 0.813358    
jobretired           2.806e-01  8.784e-02   3.194 0.001403 ** 
jobself-employed    -8.785e-02  1.226e-01  -0.716 0.473686    
jobservices         -1.778e-01  8.634e-02  -2.059 0.039452 *  
jobstudent           1.892e-01  1.088e-01   1.738 0.082151 .  
jobtechnician        9.741e-03  6.710e-02   0.145 0.884568    
jobunemployed       -4.720e-02  1.345e-01  -0.351 0.725592    
jobunknown           3.158e-02  2.360e-01   0.134 0.893525    
cons_price_idx       1.896e+00  2.656e-01   7.140 9.35e-13 ***
contacttelephone    -6.211e-01  8.017e-02  -7.747 9.37e-15 ***
euribor3m            3.851e-01  1.366e-01   2.819 0.004812 ** 
defaultunknown      -3.155e-01  6.983e-02  -4.519 6.23e-06 ***
defaultyes          -7.359e+00  1.134e+02  -0.065 0.948253    
pdays               -8.582e-04  2.126e-04  -4.037 5.41e-05 ***
day_of_weekmon      -1.118e-01  6.971e-02  -1.604 0.108744    
day_of_weekthu       5.634e-02  6.759e-02   0.834 0.404507    
day_of_weektue       1.283e-01  6.906e-02   1.858 0.063221 .  
day_of_weekwed       1.549e-01  6.943e-02   2.231 0.025652 *  
campaign            -3.683e-02  1.204e-02  -3.060 0.002216 ** 
cons_conf_idx        1.548e-02  8.122e-03   1.906 0.056627 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26093  on 37060  degrees of freedom
Residual deviance: 15400  on 37023  degrees of freedom
AIC: 15476

Number of Fisher Scoring iterations: 10

Step: AIC=15476 subscribed ~ duration + nr_employed + month + poutcome + emp_var_rate + job + cons_price_idx + contact + euribor3m + default + pdays + day_of_week + campaign + cons_conf_idx

The forward model differes from STEP AIC both ways model and our initial model, we have one new predictor as “nr_employed”.

stepAIC() and specify direction = “backward”

Code
step_back<-MASS::stepAIC(model_full, direction ="backward", trace=FALSE)
summary(step_back)

Call:
glm(formula = subscribed ~ job + default + contact + month + 
    day_of_week + duration + campaign + pdays + poutcome + emp_var_rate + 
    cons_price_idx + cons_conf_idx + euribor3m, family = binomial, 
    data = bank_trn_data_fnl)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.9968  -0.2989  -0.1856  -0.1347   3.2903  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -1.660e+02  1.093e+01 -15.187  < 2e-16 ***
jobblue-collar      -3.135e-01  6.949e-02  -4.512 6.43e-06 ***
jobentrepreneur     -1.598e-01  1.309e-01  -1.221  0.22223    
jobhousemaid        -2.757e-02  1.439e-01  -0.192  0.84807    
jobmanagement        2.079e-02  8.730e-02   0.238  0.81178    
jobretired           2.820e-01  8.779e-02   3.213  0.00132 ** 
jobself-employed    -8.868e-02  1.226e-01  -0.723  0.46955    
jobservices         -1.775e-01  8.635e-02  -2.055  0.03986 *  
jobstudent           1.897e-01  1.088e-01   1.743  0.08130 .  
jobtechnician        8.411e-03  6.707e-02   0.125  0.90020    
jobunemployed       -4.733e-02  1.345e-01  -0.352  0.72484    
jobunknown           3.126e-02  2.361e-01   0.132  0.89466    
defaultunknown      -3.149e-01  6.983e-02  -4.510 6.50e-06 ***
defaultyes          -7.366e+00  1.134e+02  -0.065  0.94820    
contacttelephone    -6.062e-01  7.717e-02  -7.855 3.99e-15 ***
monthaug             7.528e-01  1.133e-01   6.644 3.05e-11 ***
monthdec             2.206e-01  2.083e-01   1.059  0.28975    
monthjul             1.450e-01  1.010e-01   1.435  0.15119    
monthjun            -3.342e-01  1.121e-01  -2.981  0.00287 ** 
monthmar             1.854e+00  1.231e-01  15.065  < 2e-16 ***
monthmay            -5.118e-01  8.009e-02  -6.391 1.65e-10 ***
monthnov            -4.940e-01  1.128e-01  -4.378 1.20e-05 ***
monthoct             3.480e-02  1.308e-01   0.266  0.79014    
monthsep             1.617e-01  1.379e-01   1.172  0.24107    
day_of_weekmon      -1.133e-01  6.968e-02  -1.626  0.10396    
day_of_weekthu       5.689e-02  6.758e-02   0.842  0.39989    
day_of_weektue       1.256e-01  6.896e-02   1.822  0.06850 .  
day_of_weekwed       1.531e-01  6.939e-02   2.206  0.02737 *  
duration             4.690e-03  7.847e-05  59.766  < 2e-16 ***
campaign            -3.676e-02  1.204e-02  -3.054  0.00226 ** 
pdays               -8.578e-04  2.126e-04  -4.035 5.45e-05 ***
poutcomenonexistent  4.966e-01  6.735e-02   7.373 1.67e-13 ***
poutcomesuccess      1.030e+00  2.130e-01   4.836 1.32e-06 ***
emp_var_rate        -1.551e+00  1.200e-01 -12.917  < 2e-16 ***
cons_price_idx       1.726e+00  1.133e-01  15.229  < 2e-16 ***
cons_conf_idx        1.142e-02  5.773e-03   1.978  0.04788 *  
euribor3m            4.576e-01  9.089e-02   5.035 4.78e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26093  on 37060  degrees of freedom
Residual deviance: 15401  on 37024  degrees of freedom
AIC: 15475

Number of Fisher Scoring iterations: 10

Step: AIC=15475 subscribed ~ job + default + contact + month + day_of_week + duration + campaign + pdays + poutcome + emp_var_rate + cons_price_idx + cons_conf_idx + euribor3m

Backward model is same as the model we got with both ways process and it differs from our initial model.

2 Way Interaction

As discussed we are not going ahead with the 2 way interaction.

Model Accuracy

Code
bank_copy<- bank_trn_data_fnl

bank_copy$pred_null<-as.factor(ifelse(predict(model_null, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))

bank_copy$pred_init<-as.factor(ifelse(predict(model_var, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))

bank_copy$pred_step_AIC_both <- as.factor(ifelse(predict(step, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))

bank_copy$pred_step_AIC_fwd <- as.factor(ifelse(predict(step_for, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))

bank_copy$pred_step_AIC_bck <- as.factor(ifelse(predict(step_back, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))

df <- data.frame(matrix(ncol = 4, nrow = 1))
colnames(df) <- c('pred_init', 'pred_step_AIC_both', 'pred_step_AIC_fwd','pred_step_AIC_bck')

#Accuracy of initial model 
df$pred_init<- sum(bank_copy$subscribed==bank_copy$pred_init)/nrow(bank_copy)

#Accuracy of AIC both way model 
df$pred_step_AIC_both<- sum(bank_copy$subscribed==bank_copy$pred_step_AIC_both)/nrow(bank_copy)

#Accuracy of AIC forward model 
df$pred_step_AIC_fwd<- sum(bank_copy$subscribed==bank_copy$pred_step_AIC_fwd)/nrow(bank_copy)

#Accuracy of AIC Backward model 
df$pred_step_AIC_bck<- sum(bank_copy$subscribed==bank_copy$pred_step_AIC_bck)/nrow(bank_copy)

knitr::kable(df, align = "lccrr", caption = "Accuracy of all the models")
Accuracy of all the models
pred_init pred_step_AIC_both pred_step_AIC_fwd pred_step_AIC_bck
0.9042929 0.9112544 0.9114973 0.9112544

Based on accuracy percentage all AIC models are performing better than the initial model.

4.Model Selction

Based on the accuracy and AIC value the backward and both ways model are best but we need to do some more analysis to lessen the number of predictors. We will use LOCO method in the next step.

5.Model Predictor Selection

LOCO of Both and backward direction model

Code
xnames <- c("duration", "month","poutcome", "emp_var_rate", "cons_price_idx", "contact", "euribor3m", "job", "default", "pdays","day_of_week","campaign", "cons_conf_idx") # predictors only
vi.scores <- numeric(length(xnames))
names(vi.scores) <- xnames
(baseline <- deviance(step)) # smaller is better; could also use AIC, Brier score, etc.
[1] 15400.73
Code
for (xname in xnames) {
# col_list <- c(attr(step$terms, "term.labels"), "su
data.copy <- bank_trn_data_fnl %>% dplyr::select(c(attr(step$terms, "term.labels"), "subscribed"))
data.copy[[xname]] <- NULL
fit.new <- glm(subscribed ~ ., data = data.copy, family = binomial(link = "logit"))
vi.scores[xname] <- deviance(fit.new) - baseline # measure drop in performance
}
sort(vi.scores, decreasing = TRUE)
      duration          month cons_price_idx   emp_var_rate       poutcome 
   5039.053927     495.370818     231.100827     164.120290      79.483019 
       contact            job      euribor3m        default    day_of_week 
     65.196905      49.462358      25.150698      21.196811      19.483327 
         pdays       campaign  cons_conf_idx 
     16.061368       9.919232       3.919442 

Plotting all the predictors performance score.

Code
barplot(sort(vi.scores, decreasing = TRUE), cex.names = 0.8)

LOCO of forward direction model

Code
xnames <- c("duration", "month","poutcome", "emp_var_rate", "cons_price_idx", "contact", "euribor3m", "job", "default", "pdays","day_of_week","campaign", "cons_conf_idx", "nr_employed") # predictors only
vi.scores <- numeric(length(xnames))
names(vi.scores) <- xnames
(baseline <- deviance(step_for)) # smaller is better; could also use AIC, Brier score, etc.
[1] 15400.22
Code
for (xname in xnames) {
data.copy <- bank_trn_data_fnl %>% dplyr::select(c(attr(step_for$terms, "term.labels"), "subscribed"))
data.copy[[xname]] <- NULL
fit.new <- glm(subscribed ~ ., data = data.copy, family = binomial(link = "logit"))
vi.scores[xname] <- deviance(fit.new) - baseline # measure drop in performance
}
sort(vi.scores, decreasing = TRUE)
      duration          month   emp_var_rate       poutcome        contact 
  5037.3753768    495.4813743    114.6749096     79.8428769     63.9066294 
cons_price_idx            job        default    day_of_week          pdays 
    50.6922459     49.6478272     21.2814235     19.6211677     16.0776258 
      campaign      euribor3m  cons_conf_idx    nr_employed 
     9.9599812      7.9615473      3.6336372      0.5051312 

Plotting all the predictors performance score.

Code
barplot(sort(vi.scores, decreasing = TRUE), cex.names = 0.8)

Based on the LOCO test we can take below 3 predictors.

  1. Duration
  2. Month
  3. Cons_price_idx

6.Leakage.

Duration may be leakage variable, since if we know the duration of the last call, it is highly likely that we know the outcome whether the customer subscribed or not.

7.Deployment

Taking the output of both direction step AIC but removing duration (As it is a potential leakage variable) from predictor as the final model.

Code
final_model<-glm(formula = subscribed ~ job + default + contact + month + 
    day_of_week + campaign + pdays + poutcome + emp_var_rate + 
    cons_price_idx + cons_conf_idx + euribor3m, family = binomial, 
    data = bank_trn_data_fnl)

summary(final_model)

Call:
glm(formula = subscribed ~ job + default + contact + month + 
    day_of_week + campaign + pdays + poutcome + emp_var_rate + 
    cons_price_idx + cons_conf_idx + euribor3m, family = binomial, 
    data = bank_trn_data_fnl)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0732  -0.3909  -0.3183  -0.2628   2.9652  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -1.436e+02  9.685e+00 -14.825  < 2e-16 ***
jobblue-collar      -1.883e-01  5.963e-02  -3.158 0.001589 ** 
jobentrepreneur     -2.904e-02  1.111e-01  -0.261 0.793848    
jobhousemaid        -9.183e-02  1.263e-01  -0.727 0.467037    
jobmanagement       -1.031e-03  7.721e-02  -0.013 0.989349    
jobretired           2.579e-01  7.912e-02   3.260 0.001114 ** 
jobself-employed    -2.591e-02  1.063e-01  -0.244 0.807374    
jobservices         -1.508e-01  7.560e-02  -1.995 0.046008 *  
jobstudent           1.967e-01  9.802e-02   2.006 0.044803 *  
jobtechnician       -9.000e-03  5.845e-02  -0.154 0.877622    
jobunemployed       -7.908e-02  1.171e-01  -0.675 0.499411    
jobunknown          -1.340e-01  2.146e-01  -0.624 0.532507    
defaultunknown      -2.495e-01  5.966e-02  -4.182 2.90e-05 ***
defaultyes          -8.677e+00  1.133e+02  -0.077 0.938965    
contacttelephone    -7.004e-01  6.737e-02 -10.397  < 2e-16 ***
monthaug             3.336e-01  1.019e-01   3.273 0.001064 ** 
monthdec             3.312e-01  1.890e-01   1.752 0.079746 .  
monthjul             7.405e-02  8.747e-02   0.847 0.397193    
monthjun            -4.100e-01  9.770e-02  -4.197 2.70e-05 ***
monthmar             1.296e+00  1.134e-01  11.429  < 2e-16 ***
monthmay            -4.588e-01  7.013e-02  -6.543 6.04e-11 ***
monthnov            -5.488e-01  9.836e-02  -5.580 2.41e-08 ***
monthoct            -1.614e-01  1.171e-01  -1.378 0.168117    
monthsep            -7.764e-04  1.244e-01  -0.006 0.995021    
day_of_weekmon      -2.114e-01  6.108e-02  -3.462 0.000537 ***
day_of_weekthu       8.425e-02  5.884e-02   1.432 0.152176    
day_of_weektue       7.854e-02  6.026e-02   1.303 0.192487    
day_of_weekwed       1.388e-01  6.028e-02   2.302 0.021326 *  
campaign            -4.205e-02  9.638e-03  -4.363 1.28e-05 ***
pdays               -1.019e-03  1.952e-04  -5.219 1.80e-07 ***
poutcomenonexistent  5.139e-01  5.989e-02   8.581  < 2e-16 ***
poutcomesuccess      8.517e-01  1.964e-01   4.336 1.45e-05 ***
emp_var_rate        -1.226e+00  1.062e-01 -11.541  < 2e-16 ***
cons_price_idx       1.515e+00  1.004e-01  15.093  < 2e-16 ***
cons_conf_idx        1.801e-02  5.242e-03   3.435 0.000592 ***
euribor3m            3.606e-01  8.034e-02   4.488 7.18e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26093  on 37060  degrees of freedom
Residual deviance: 20440  on 37025  degrees of freedom
AIC: 20512

Number of Fisher Scoring iterations: 10

Calibration Graph

Code
bank_new_copy<-bank_new_data
prob <- predict(final_model,type = "response", newdata = bank_new_data)

bank_new_copy$prob<- prob

y <- na.omit(bank_new_data)$subscribed  

# Function to compute lift and cumulative gain charts
lift <- function(prob, y, pos.class = NULL, cumulative = TRUE) {
  if (!all(sort(unique(y)) == c(0, 1))) {
    if (is.null(pos.class)) {
      stop("A value for `pos.class` is required whenever `y` is not a 0/1 ",
           "outcome.", call. = FALSE)
    }
  y <- ifelse(y == pos.class, 1, 0)
  }
  ord <- order(prob, decreasing = TRUE)
  prob <- prob[ord]
  y <- y[ord]
  prop <- seq_along(y) / length(y)
  lift <- if (isTRUE(cumulative)) {
    cumsum(y)
  } else {
    (cumsum(y) / seq_along(y)) / mean(y)
  }
  structure(list("lift" = lift, "prop" = prop, "cumulative" = cumulative,
                 "y" = y), class = "lift")
}

# Cumulative gain chart; what does this plot tell us?
l <- lift(prob, y = y, pos.class = "yes")
plot(l[["prop"]], l[["lift"]], type = "l", 
     xlab = "Proportion of sample", ylab = "Cumulative lift", 
     las = 1, lwd = 2, col = 2)
abline(0, sum(y == "yes"), lty = 2)

There are 464 clients who subscribed in the bank_new data. The calibration chart looks well calibrated, as it shows at first top 20% of the sample, its almost captured the subscribed clients.

8. Target Audience

Code
bank_new_copy<-bank_new_copy%>%
  arrange(desc(prob))%>%
  mutate(prob_rnk = row_number(), final_class = if_else(prob_rnk<= 500, "yes", "no"))

temp<-bank_new_copy[bank_new_copy$final_class=="yes",]

#temp #Top 500 Customers to target

xtabs(~subscribed + final_class,temp)
          final_class
subscribed yes
       no  267
       yes 233

Out of 500 targeted 233 people actually subscribed which is around 47% of target audience.

Data Dictionry

Client attributes

• age (numeric) • job : type of job • marital : marital status • education • default: has credit in default? • loan: has personal loan?

Social and economic context attributes

• emp.var.rate: employment variation rate - quarterly indicator (numeric) • cons.price.idx: consumer price index - monthly indicator (numeric) • cons.conf.idx: consumer confidence index - monthly indicator (numeric) • euribor3m: euribor 3 month rate - daily indicator (numeric) • nr.employed: number of employees - quarterly indicator (numeric)

Other attributes

• campaign: number of contacts performed during this campaign and for this client • pdays: number of days that passed by after the client was last contacted from a previous campaign • previous: number of contacts performed before this campaign and for this client (numeric) • poutcome: outcome of the previous marketing campaign