Statistical Modeling Assignment 2

Group Members : Utsav Raj, Rohit Menon, Satarupa Saha, Mounish Sunkara

Description:

Customer conversion in a direct marketing campaign is one of the most important metrics of success when evaluating the campaign itself. In a world of limited resources, it’s often difficult to make the best use of a marketer’s time, we can improve on this by predicting whether a prospective customer will respond to a marketing campaign. In order to do this, we will:

• Understand which of the observed variables are most associated with the chance of subscribing to a term deposit.

• How the important variables relate to the predicted probability that a client will subscribe to a term deposit

• Build a model that could be useful in determining which future clients are likely to respond a term deposit, if contacted.

Code

bank_trn_data <- read.csv("bank_trn.csv",stringsAsFactors = TRUE)
bank_new_data <- read.csv("bank_new.csv",stringsAsFactors = TRUE)
bank_trn_data$y <- ifelse(bank_trn_data$subscribed == "no", 0, 1)
bank_new_data$y <- ifelse(bank_new_data$subscribed == "no", 0, 1)

1. Exploratory data analysis (EDA)

Code

summ <- summary(bank_trn_data)
#knitr::kable(summary(bank_trn_data), "simple")

knitr::kable(summ, align = "lccrr", caption = "Bank Data Summary")

Bank Data Summary
age	job	marital	education	default	housing	loan	contact	month	day_of_week	duration	campaign	pdays	previous	poutcome	emp_var_rate	cons_price_idx	cons_conf_idx	euribor3m	nr_employed	subscribed	y
Min. :17.00	admin. :9334	divorced: 4145	university.degree :10931	no :29292	no :16774	no :30559	cellular :23458	may :12451	fri:7033	Min. : 0.0	Min. : 1.000	Min. : 0.0	Min. :0.0000	failure : 3840	Min. :-3.40000	Min. :92.20	Min. :-50.80	Min. :0.634	Min. :4964	no :32886	Min. :0.0000
1st Qu.:32.00	blue-collar:8365	married :22444	high.school : 8526	unknown: 7766	unknown: 896	unknown: 896	telephone:13603	jul : 6418	mon:7660	1st Qu.: 102.0	1st Qu.: 1.000	1st Qu.:999.0	1st Qu.:0.0000	nonexistent:31979	1st Qu.:-1.80000	1st Qu.:93.08	1st Qu.:-42.70	1st Qu.:1.344	1st Qu.:5099	yes: 4175	1st Qu.:0.0000
Median :38.00	technician :6047	single :10396	basic.9y : 5456	yes : 3	yes :19391	yes : 5606	NA	aug : 5542	thu:7747	Median : 180.0	Median : 2.000	Median :999.0	Median :0.0000	success : 1242	Median : 1.10000	Median :93.75	Median :-41.80	Median :4.857	Median :5191	NA	Median :0.0000
Mean :40.04	services :3589	unknown : 76	professional.course: 4702	NA	NA	NA	NA	jun : 4766	tue:7296	Mean : 258.5	Mean : 2.565	Mean :962.2	Mean :0.1738	NA	Mean : 0.08039	Mean :93.58	Mean :-40.49	Mean :3.620	Mean :5167	NA	Mean :0.1127
3rd Qu.:47.00	management :2632	NA	basic.4y : 3814	NA	NA	NA	NA	nov : 3683	wed:7325	3rd Qu.: 320.0	3rd Qu.: 3.000	3rd Qu.:999.0	3rd Qu.:0.0000	NA	3rd Qu.: 1.40000	3rd Qu.:93.99	3rd Qu.:-36.40	3rd Qu.:4.961	3rd Qu.:5228	NA	3rd Qu.:0.0000
Max. :98.00	retired :1541	NA	basic.6y : 2073	NA	NA	NA	NA	apr : 2380	NA	Max. :4918.0	Max. :56.000	Max. :999.0	Max. :6.0000	NA	Max. : 1.40000	Max. :94.77	Max. :-26.90	Max. :5.045	Max. :5228	NA	Max. :1.0000
NA	(Other) :5553	NA	(Other) : 1559	NA	NA	NA	NA	(Other): 1821	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

Data Cleaning

Checking for blank data and duplictaes

Code

colSums(is.na(bank_trn_data))

           age            job        marital      education        default 
             0              0              0              0              0 
       housing           loan        contact          month    day_of_week 
             0              0              0              0              0 
      duration       campaign          pdays       previous       poutcome 
             0              0              0              0              0 
  emp_var_rate cons_price_idx  cons_conf_idx      euribor3m    nr_employed 
             0              0              0              0              0 
    subscribed              y 
             0              0

Code

bank_trn_data_fnl <-unique(bank_trn_data)

We have 9 duplicate records. Deleted all the duplicates.

Plotting the Graphs

Code

par(mfrow = c(3,2))
hist(bank_trn_data_fnl$age, main = "Distribution of Age", xlab = "Age")
hist(bank_trn_data_fnl$duration, main = "Distribution of Duration", xlab = "Duration")
barplot(table(bank_trn_data_fnl$default), main = "Frequency of Default Status")
barplot(table(bank_trn_data_fnl$marital), main = "Marital Status")
barplot(table(bank_trn_data_fnl$loan), main = "Frequency by Loan Status")
barplot(table(bank_trn_data_fnl$contact), main = "Distribution by Contact Status")

Call duration variable seems to be right skewed which suggests most of the calls where less than 10 secs.
Around 79% people have not defaulted on loan payment.
50% of the people are married and don’t have any loan.

Code

par(mfrow = c(2,2))

barplot(table(bank_trn_data_fnl$pdays), main = "Distribution by PDays")
barplot(table(bank_trn_data_fnl$previous), main = "Distribution by Previous")


#bank_trn_data_fnl %>% group_by(pdays) %>% count()

hist(bank_trn_data_fnl$nr_employed, main = "Distribution of Number Employed", xlab = "Number of Employees")
barplot(table(bank_trn_data_fnl$subscribed), main = "Frequency by Subscribed Status")

Majority of the records have PDays as 999 which means they were contacted a long time back.
Pdays and previous are correlated.
No of employees does not have a clear distribution
88% of the data set have subscribed

Checking correlation on all numeric variables

Code

# Look at correlations between numeric features
num <- sapply(bank_trn_data_fnl, FUN = is.numeric) 
corx <- cor(bank_trn_data_fnl[, num], use="pairwise", method="spearman")  

# Visualize correlations

corrplot(corx, method = "color", order = "FPC", type = "lower",addCoef.col = "black",number.cex = 0.65, diag = TRUE)

High Correlation between emp_var_rate with nr_emolpyed and euribor3m and b/w euribor3m and nr_emolpyed.

Chi-Sqr test for categorical variables

Code

bank_fct<- bank_trn_data_fnl%>%select_if(is.factor)
bank_num<- bank_trn_data_fnl%>%select_if(is.numeric)

chis <- lapply(1:ncol(bank_fct), function(x) chisq.test(bank_trn_data_fnl$subscribed, bank_fct[,x])$p.value)

pvalues<-unlist(chis)

table<-data.frame(colnames(bank_fct), pvalues)
table<- table%>%arrange(pvalues)
table

   colnames.bank_fct.       pvalues
1               month  0.000000e+00
2            poutcome  0.000000e+00
3          subscribed  0.000000e+00
4                 job 1.936701e-181
5             contact 5.907889e-174
6             default  1.251693e-80
7           education  2.704408e-34
8             marital  1.260445e-23
9         day_of_week  9.154301e-05
10            housing  3.527408e-02
11               loan  2.002998e-01

From Chi square test of categorical values we can see that month, poutcome, job are significant top 3 predictors.

Based on the analysis we have done on numerical and categorical variables, 3 predictors we have chosen are as below :

Previous
Duration
poutcome

2.Initial Modeling

Full Model

Code

model_null<- glm(subscribed ~ 1 , family=binomial, data = bank_trn_data_fnl)

model_var<- glm(subscribed ~ duration+previous+poutcome, family=binomial, data = bank_trn_data_fnl)

AIC(model_null)

[1] 26094.76

Code

AIC(model_var)

[1] 19237.43

Null model has AIC of 26101.04 where as our initial model has AIC of 19243.69 which is comparatively better than the null model.

3.Step Function

stepAIC() and specify direction = “both”

Code

model_full<-glm(subscribed ~ .-y, family=binomial, data = bank_trn_data_fnl)

step<-MASS::stepAIC(model_var, direction ="both", scope= formula(model_full), trace=FALSE)
summary(step)


Call:
glm(formula = subscribed ~ duration + poutcome + month + emp_var_rate + 
    job + cons_price_idx + contact + euribor3m + default + pdays + 
    day_of_week + campaign + cons_conf_idx, family = binomial, 
    data = bank_trn_data_fnl)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.9968  -0.2989  -0.1856  -0.1347   3.2903  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -1.660e+02  1.093e+01 -15.187  < 2e-16 ***
duration             4.690e-03  7.847e-05  59.766  < 2e-16 ***
poutcomenonexistent  4.966e-01  6.735e-02   7.373 1.67e-13 ***
poutcomesuccess      1.030e+00  2.130e-01   4.836 1.32e-06 ***
monthaug             7.528e-01  1.133e-01   6.644 3.05e-11 ***
monthdec             2.206e-01  2.083e-01   1.059  0.28975    
monthjul             1.450e-01  1.010e-01   1.435  0.15119    
monthjun            -3.342e-01  1.121e-01  -2.981  0.00287 ** 
monthmar             1.854e+00  1.231e-01  15.065  < 2e-16 ***
monthmay            -5.118e-01  8.009e-02  -6.391 1.65e-10 ***
monthnov            -4.940e-01  1.128e-01  -4.378 1.20e-05 ***
monthoct             3.480e-02  1.308e-01   0.266  0.79014    
monthsep             1.617e-01  1.379e-01   1.172  0.24107    
emp_var_rate        -1.551e+00  1.200e-01 -12.917  < 2e-16 ***
jobblue-collar      -3.135e-01  6.949e-02  -4.512 6.43e-06 ***
jobentrepreneur     -1.598e-01  1.309e-01  -1.221  0.22223    
jobhousemaid        -2.757e-02  1.439e-01  -0.192  0.84807    
jobmanagement        2.079e-02  8.730e-02   0.238  0.81178    
jobretired           2.820e-01  8.779e-02   3.213  0.00132 ** 
jobself-employed    -8.868e-02  1.226e-01  -0.723  0.46955    
jobservices         -1.775e-01  8.635e-02  -2.055  0.03986 *  
jobstudent           1.897e-01  1.088e-01   1.743  0.08130 .  
jobtechnician        8.411e-03  6.707e-02   0.125  0.90020    
jobunemployed       -4.733e-02  1.345e-01  -0.352  0.72484    
jobunknown           3.126e-02  2.361e-01   0.132  0.89466    
cons_price_idx       1.726e+00  1.133e-01  15.229  < 2e-16 ***
contacttelephone    -6.062e-01  7.717e-02  -7.855 3.99e-15 ***
euribor3m            4.576e-01  9.089e-02   5.035 4.78e-07 ***
defaultunknown      -3.149e-01  6.983e-02  -4.510 6.50e-06 ***
defaultyes          -7.366e+00  1.134e+02  -0.065  0.94820    
pdays               -8.578e-04  2.126e-04  -4.035 5.45e-05 ***
day_of_weekmon      -1.133e-01  6.968e-02  -1.626  0.10396    
day_of_weekthu       5.689e-02  6.758e-02   0.842  0.39989    
day_of_weektue       1.256e-01  6.896e-02   1.822  0.06850 .  
day_of_weekwed       1.531e-01  6.939e-02   2.206  0.02737 *  
campaign            -3.676e-02  1.204e-02  -3.054  0.00226 ** 
cons_conf_idx        1.142e-02  5.773e-03   1.978  0.04788 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26093  on 37060  degrees of freedom
Residual deviance: 15401  on 37024  degrees of freedom
AIC: 15475

Number of Fisher Scoring iterations: 10

Step: AIC=15475 subscribed ~ duration + month + poutcome + emp_var_rate + job + cons_price_idx + contact + euribor3m + default + pdays + day_of_week + campaign + cons_conf_idx

The predicted model is different from the initial model that we chose using eye test. Many new predictors have been added to the initial model.

stepAIC() and specify direction = “forward”

Code

step_for<-MASS::stepAIC(model_null, direction ="forward", scope= formula(model_full), trace=FALSE)

summary(step_for)


Call:
glm(formula = subscribed ~ duration + nr_employed + month + poutcome + 
    emp_var_rate + job + cons_price_idx + contact + euribor3m + 
    default + pdays + day_of_week + campaign + cons_conf_idx, 
    family = binomial, data = bank_trn_data_fnl)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.9914  -0.2992  -0.1856  -0.1345   3.3061  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -1.936e+02  4.030e+01  -4.804 1.56e-06 ***
duration             4.689e-03  7.847e-05  59.757  < 2e-16 ***
nr_employed          2.326e-03  3.271e-03   0.711 0.476999    
monthaug             7.937e-01  1.271e-01   6.247 4.19e-10 ***
monthdec             2.753e-01  2.222e-01   1.239 0.215274    
monthjul             1.466e-01  1.011e-01   1.450 0.147188    
monthjun            -3.846e-01  1.327e-01  -2.899 0.003742 ** 
monthmar             1.917e+00  1.516e-01  12.644  < 2e-16 ***
monthmay            -4.881e-01  8.677e-02  -5.625 1.86e-08 ***
monthnov            -4.521e-01  1.273e-01  -3.550 0.000385 ***
monthoct             1.034e-01  1.624e-01   0.637 0.524390    
monthsep             2.536e-01  1.890e-01   1.342 0.179618    
poutcomenonexistent  4.985e-01  6.741e-02   7.394 1.42e-13 ***
poutcomesuccess      1.031e+00  2.130e-01   4.840 1.30e-06 ***
emp_var_rate        -1.615e+00  1.503e-01 -10.742  < 2e-16 ***
jobblue-collar      -3.151e-01  6.952e-02  -4.533 5.81e-06 ***
jobentrepreneur     -1.620e-01  1.309e-01  -1.238 0.215886    
jobhousemaid        -2.740e-02  1.440e-01  -0.190 0.849041    
jobmanagement        2.061e-02  8.730e-02   0.236 0.813358    
jobretired           2.806e-01  8.784e-02   3.194 0.001403 ** 
jobself-employed    -8.785e-02  1.226e-01  -0.716 0.473686    
jobservices         -1.778e-01  8.634e-02  -2.059 0.039452 *  
jobstudent           1.892e-01  1.088e-01   1.738 0.082151 .  
jobtechnician        9.741e-03  6.710e-02   0.145 0.884568    
jobunemployed       -4.720e-02  1.345e-01  -0.351 0.725592    
jobunknown           3.158e-02  2.360e-01   0.134 0.893525    
cons_price_idx       1.896e+00  2.656e-01   7.140 9.35e-13 ***
contacttelephone    -6.211e-01  8.017e-02  -7.747 9.37e-15 ***
euribor3m            3.851e-01  1.366e-01   2.819 0.004812 ** 
defaultunknown      -3.155e-01  6.983e-02  -4.519 6.23e-06 ***
defaultyes          -7.359e+00  1.134e+02  -0.065 0.948253    
pdays               -8.582e-04  2.126e-04  -4.037 5.41e-05 ***
day_of_weekmon      -1.118e-01  6.971e-02  -1.604 0.108744    
day_of_weekthu       5.634e-02  6.759e-02   0.834 0.404507    
day_of_weektue       1.283e-01  6.906e-02   1.858 0.063221 .  
day_of_weekwed       1.549e-01  6.943e-02   2.231 0.025652 *  
campaign            -3.683e-02  1.204e-02  -3.060 0.002216 ** 
cons_conf_idx        1.548e-02  8.122e-03   1.906 0.056627 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26093  on 37060  degrees of freedom
Residual deviance: 15400  on 37023  degrees of freedom
AIC: 15476

Number of Fisher Scoring iterations: 10

Step: AIC=15476 subscribed ~ duration + nr_employed + month + poutcome + emp_var_rate + job + cons_price_idx + contact + euribor3m + default + pdays + day_of_week + campaign + cons_conf_idx

The forward model differes from STEP AIC both ways model and our initial model, we have one new predictor as “nr_employed”.

stepAIC() and specify direction = “backward”

Code

step_back<-MASS::stepAIC(model_full, direction ="backward", trace=FALSE)
summary(step_back)


Call:
glm(formula = subscribed ~ job + default + contact + month + 
    day_of_week + duration + campaign + pdays + poutcome + emp_var_rate + 
    cons_price_idx + cons_conf_idx + euribor3m, family = binomial, 
    data = bank_trn_data_fnl)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.9968  -0.2989  -0.1856  -0.1347   3.2903  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -1.660e+02  1.093e+01 -15.187  < 2e-16 ***
jobblue-collar      -3.135e-01  6.949e-02  -4.512 6.43e-06 ***
jobentrepreneur     -1.598e-01  1.309e-01  -1.221  0.22223    
jobhousemaid        -2.757e-02  1.439e-01  -0.192  0.84807    
jobmanagement        2.079e-02  8.730e-02   0.238  0.81178    
jobretired           2.820e-01  8.779e-02   3.213  0.00132 ** 
jobself-employed    -8.868e-02  1.226e-01  -0.723  0.46955    
jobservices         -1.775e-01  8.635e-02  -2.055  0.03986 *  
jobstudent           1.897e-01  1.088e-01   1.743  0.08130 .  
jobtechnician        8.411e-03  6.707e-02   0.125  0.90020    
jobunemployed       -4.733e-02  1.345e-01  -0.352  0.72484    
jobunknown           3.126e-02  2.361e-01   0.132  0.89466    
defaultunknown      -3.149e-01  6.983e-02  -4.510 6.50e-06 ***
defaultyes          -7.366e+00  1.134e+02  -0.065  0.94820    
contacttelephone    -6.062e-01  7.717e-02  -7.855 3.99e-15 ***
monthaug             7.528e-01  1.133e-01   6.644 3.05e-11 ***
monthdec             2.206e-01  2.083e-01   1.059  0.28975    
monthjul             1.450e-01  1.010e-01   1.435  0.15119    
monthjun            -3.342e-01  1.121e-01  -2.981  0.00287 ** 
monthmar             1.854e+00  1.231e-01  15.065  < 2e-16 ***
monthmay            -5.118e-01  8.009e-02  -6.391 1.65e-10 ***
monthnov            -4.940e-01  1.128e-01  -4.378 1.20e-05 ***
monthoct             3.480e-02  1.308e-01   0.266  0.79014    
monthsep             1.617e-01  1.379e-01   1.172  0.24107    
day_of_weekmon      -1.133e-01  6.968e-02  -1.626  0.10396    
day_of_weekthu       5.689e-02  6.758e-02   0.842  0.39989    
day_of_weektue       1.256e-01  6.896e-02   1.822  0.06850 .  
day_of_weekwed       1.531e-01  6.939e-02   2.206  0.02737 *  
duration             4.690e-03  7.847e-05  59.766  < 2e-16 ***
campaign            -3.676e-02  1.204e-02  -3.054  0.00226 ** 
pdays               -8.578e-04  2.126e-04  -4.035 5.45e-05 ***
poutcomenonexistent  4.966e-01  6.735e-02   7.373 1.67e-13 ***
poutcomesuccess      1.030e+00  2.130e-01   4.836 1.32e-06 ***
emp_var_rate        -1.551e+00  1.200e-01 -12.917  < 2e-16 ***
cons_price_idx       1.726e+00  1.133e-01  15.229  < 2e-16 ***
cons_conf_idx        1.142e-02  5.773e-03   1.978  0.04788 *  
euribor3m            4.576e-01  9.089e-02   5.035 4.78e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26093  on 37060  degrees of freedom
Residual deviance: 15401  on 37024  degrees of freedom
AIC: 15475

Number of Fisher Scoring iterations: 10

Step: AIC=15475 subscribed ~ job + default + contact + month + day_of_week + duration + campaign + pdays + poutcome + emp_var_rate + cons_price_idx + cons_conf_idx + euribor3m

Backward model is same as the model we got with both ways process and it differs from our initial model.

2 Way Interaction

As discussed we are not going ahead with the 2 way interaction.

Model Accuracy

Code

bank_copy<- bank_trn_data_fnl

bank_copy$pred_null<-as.factor(ifelse(predict(model_null, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))

bank_copy$pred_init<-as.factor(ifelse(predict(model_var, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))

bank_copy$pred_step_AIC_both <- as.factor(ifelse(predict(step, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))

bank_copy$pred_step_AIC_fwd <- as.factor(ifelse(predict(step_for, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))

bank_copy$pred_step_AIC_bck <- as.factor(ifelse(predict(step_back, bank_trn_data_fnl[1:20], type="response")>0.5, "yes", "no"))

df <- data.frame(matrix(ncol = 4, nrow = 1))
colnames(df) <- c('pred_init', 'pred_step_AIC_both', 'pred_step_AIC_fwd','pred_step_AIC_bck')

#Accuracy of initial model 
df$pred_init<- sum(bank_copy$subscribed==bank_copy$pred_init)/nrow(bank_copy)

#Accuracy of AIC both way model 
df$pred_step_AIC_both<- sum(bank_copy$subscribed==bank_copy$pred_step_AIC_both)/nrow(bank_copy)

#Accuracy of AIC forward model 
df$pred_step_AIC_fwd<- sum(bank_copy$subscribed==bank_copy$pred_step_AIC_fwd)/nrow(bank_copy)

#Accuracy of AIC Backward model 
df$pred_step_AIC_bck<- sum(bank_copy$subscribed==bank_copy$pred_step_AIC_bck)/nrow(bank_copy)

knitr::kable(df, align = "lccrr", caption = "Accuracy of all the models")

Accuracy of all the models
pred_init	pred_step_AIC_both	pred_step_AIC_fwd	pred_step_AIC_bck
0.9042929	0.9112544	0.9114973	0.9112544

Based on accuracy percentage all AIC models are performing better than the initial model.

4.Model Selction

Based on the accuracy and AIC value the backward and both ways model are best but we need to do some more analysis to lessen the number of predictors. We will use LOCO method in the next step.

5.Model Predictor Selection

LOCO of Both and backward direction model

Code

xnames <- c("duration", "month","poutcome", "emp_var_rate", "cons_price_idx", "contact", "euribor3m", "job", "default", "pdays","day_of_week","campaign", "cons_conf_idx") # predictors only
vi.scores <- numeric(length(xnames))
names(vi.scores) <- xnames
(baseline <- deviance(step)) # smaller is better; could also use AIC, Brier score, etc.

[1] 15400.73

Code

for (xname in xnames) {
# col_list <- c(attr(step$terms, "term.labels"), "su
data.copy <- bank_trn_data_fnl %>% dplyr::select(c(attr(step$terms, "term.labels"), "subscribed"))
data.copy[[xname]] <- NULL
fit.new <- glm(subscribed ~ ., data = data.copy, family = binomial(link = "logit"))
vi.scores[xname] <- deviance(fit.new) - baseline # measure drop in performance
}
sort(vi.scores, decreasing = TRUE)

      duration          month cons_price_idx   emp_var_rate       poutcome 
   5039.053927     495.370818     231.100827     164.120290      79.483019 
       contact            job      euribor3m        default    day_of_week 
     65.196905      49.462358      25.150698      21.196811      19.483327 
         pdays       campaign  cons_conf_idx 
     16.061368       9.919232       3.919442

Plotting all the predictors performance score.

Code

barplot(sort(vi.scores, decreasing = TRUE), cex.names = 0.8)

LOCO of forward direction model

Code

xnames <- c("duration", "month","poutcome", "emp_var_rate", "cons_price_idx", "contact", "euribor3m", "job", "default", "pdays","day_of_week","campaign", "cons_conf_idx", "nr_employed") # predictors only
vi.scores <- numeric(length(xnames))
names(vi.scores) <- xnames
(baseline <- deviance(step_for)) # smaller is better; could also use AIC, Brier score, etc.

[1] 15400.22

Code

for (xname in xnames) {
data.copy <- bank_trn_data_fnl %>% dplyr::select(c(attr(step_for$terms, "term.labels"), "subscribed"))
data.copy[[xname]] <- NULL
fit.new <- glm(subscribed ~ ., data = data.copy, family = binomial(link = "logit"))
vi.scores[xname] <- deviance(fit.new) - baseline # measure drop in performance
}
sort(vi.scores, decreasing = TRUE)

      duration          month   emp_var_rate       poutcome        contact 
  5037.3753768    495.4813743    114.6749096     79.8428769     63.9066294 
cons_price_idx            job        default    day_of_week          pdays 
    50.6922459     49.6478272     21.2814235     19.6211677     16.0776258 
      campaign      euribor3m  cons_conf_idx    nr_employed 
     9.9599812      7.9615473      3.6336372      0.5051312

Plotting all the predictors performance score.

Code

barplot(sort(vi.scores, decreasing = TRUE), cex.names = 0.8)

Based on the LOCO test we can take below 3 predictors.

Duration
Month
Cons_price_idx

6.Leakage.

Duration may be leakage variable, since if we know the duration of the last call, it is highly likely that we know the outcome whether the customer subscribed or not.

7.Deployment

Taking the output of both direction step AIC but removing duration (As it is a potential leakage variable) from predictor as the final model.

Code

final_model<-glm(formula = subscribed ~ job + default + contact + month + 
    day_of_week + campaign + pdays + poutcome + emp_var_rate + 
    cons_price_idx + cons_conf_idx + euribor3m, family = binomial, 
    data = bank_trn_data_fnl)

summary(final_model)


Call:
glm(formula = subscribed ~ job + default + contact + month + 
    day_of_week + campaign + pdays + poutcome + emp_var_rate + 
    cons_price_idx + cons_conf_idx + euribor3m, family = binomial, 
    data = bank_trn_data_fnl)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0732  -0.3909  -0.3183  -0.2628   2.9652  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -1.436e+02  9.685e+00 -14.825  < 2e-16 ***
jobblue-collar      -1.883e-01  5.963e-02  -3.158 0.001589 ** 
jobentrepreneur     -2.904e-02  1.111e-01  -0.261 0.793848    
jobhousemaid        -9.183e-02  1.263e-01  -0.727 0.467037    
jobmanagement       -1.031e-03  7.721e-02  -0.013 0.989349    
jobretired           2.579e-01  7.912e-02   3.260 0.001114 ** 
jobself-employed    -2.591e-02  1.063e-01  -0.244 0.807374    
jobservices         -1.508e-01  7.560e-02  -1.995 0.046008 *  
jobstudent           1.967e-01  9.802e-02   2.006 0.044803 *  
jobtechnician       -9.000e-03  5.845e-02  -0.154 0.877622    
jobunemployed       -7.908e-02  1.171e-01  -0.675 0.499411    
jobunknown          -1.340e-01  2.146e-01  -0.624 0.532507    
defaultunknown      -2.495e-01  5.966e-02  -4.182 2.90e-05 ***
defaultyes          -8.677e+00  1.133e+02  -0.077 0.938965    
contacttelephone    -7.004e-01  6.737e-02 -10.397  < 2e-16 ***
monthaug             3.336e-01  1.019e-01   3.273 0.001064 ** 
monthdec             3.312e-01  1.890e-01   1.752 0.079746 .  
monthjul             7.405e-02  8.747e-02   0.847 0.397193    
monthjun            -4.100e-01  9.770e-02  -4.197 2.70e-05 ***
monthmar             1.296e+00  1.134e-01  11.429  < 2e-16 ***
monthmay            -4.588e-01  7.013e-02  -6.543 6.04e-11 ***
monthnov            -5.488e-01  9.836e-02  -5.580 2.41e-08 ***
monthoct            -1.614e-01  1.171e-01  -1.378 0.168117    
monthsep            -7.764e-04  1.244e-01  -0.006 0.995021    
day_of_weekmon      -2.114e-01  6.108e-02  -3.462 0.000537 ***
day_of_weekthu       8.425e-02  5.884e-02   1.432 0.152176    
day_of_weektue       7.854e-02  6.026e-02   1.303 0.192487    
day_of_weekwed       1.388e-01  6.028e-02   2.302 0.021326 *  
campaign            -4.205e-02  9.638e-03  -4.363 1.28e-05 ***
pdays               -1.019e-03  1.952e-04  -5.219 1.80e-07 ***
poutcomenonexistent  5.139e-01  5.989e-02   8.581  < 2e-16 ***
poutcomesuccess      8.517e-01  1.964e-01   4.336 1.45e-05 ***
emp_var_rate        -1.226e+00  1.062e-01 -11.541  < 2e-16 ***
cons_price_idx       1.515e+00  1.004e-01  15.093  < 2e-16 ***
cons_conf_idx        1.801e-02  5.242e-03   3.435 0.000592 ***
euribor3m            3.606e-01  8.034e-02   4.488 7.18e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26093  on 37060  degrees of freedom
Residual deviance: 20440  on 37025  degrees of freedom
AIC: 20512

Number of Fisher Scoring iterations: 10

Calibration Graph

Code

bank_new_copy<-bank_new_data
prob <- predict(final_model,type = "response", newdata = bank_new_data)

bank_new_copy$prob<- prob

y <- na.omit(bank_new_data)$subscribed  

# Function to compute lift and cumulative gain charts
lift <- function(prob, y, pos.class = NULL, cumulative = TRUE) {
  if (!all(sort(unique(y)) == c(0, 1))) {
    if (is.null(pos.class)) {
      stop("A value for `pos.class` is required whenever `y` is not a 0/1 ",
           "outcome.", call. = FALSE)
    }
  y <- ifelse(y == pos.class, 1, 0)
  }
  ord <- order(prob, decreasing = TRUE)
  prob <- prob[ord]
  y <- y[ord]
  prop <- seq_along(y) / length(y)
  lift <- if (isTRUE(cumulative)) {
    cumsum(y)
  } else {
    (cumsum(y) / seq_along(y)) / mean(y)
  }
  structure(list("lift" = lift, "prop" = prop, "cumulative" = cumulative,
                 "y" = y), class = "lift")
}

# Cumulative gain chart; what does this plot tell us?
l <- lift(prob, y = y, pos.class = "yes")
plot(l[["prop"]], l[["lift"]], type = "l", 
     xlab = "Proportion of sample", ylab = "Cumulative lift", 
     las = 1, lwd = 2, col = 2)
abline(0, sum(y == "yes"), lty = 2)

There are 464 clients who subscribed in the bank_new data. The calibration chart looks well calibrated, as it shows at first top 20% of the sample, its almost captured the subscribed clients.

8. Target Audience

Code

bank_new_copy<-bank_new_copy%>%
  arrange(desc(prob))%>%
  mutate(prob_rnk = row_number(), final_class = if_else(prob_rnk<= 500, "yes", "no"))

temp<-bank_new_copy[bank_new_copy$final_class=="yes",]

#temp #Top 500 Customers to target

xtabs(~subscribed + final_class,temp)

          final_class
subscribed yes
       no  267
       yes 233

Out of 500 targeted 233 people actually subscribed which is around 47% of target audience.

Data Dictionry

Client attributes

• age (numeric) • job : type of job • marital : marital status • education • default: has credit in default? • loan: has personal loan?

Other attributes

• campaign: number of contacts performed during this campaign and for this client • pdays: number of days that passed by after the client was last contacted from a previous campaign • previous: number of contacts performed before this campaign and for this client (numeric) • poutcome: outcome of the previous marketing campaign

Description:

1. Exploratory data analysis (EDA)

Data Cleaning

Plotting the Graphs

Checking correlation on all numeric variables

Chi-Sqr test for categorical variables

2.Initial Modeling

Full Model

3.Step Function

stepAIC() and specify direction = “both”

stepAIC() and specify direction = “forward”

stepAIC() and specify direction = “backward”

2 Way Interaction

Model Accuracy

4.Model Selction

5.Model Predictor Selection

LOCO of Both and backward direction model

Plotting all the predictors performance score.

LOCO of forward direction model

Plotting all the predictors performance score.

6.Leakage.

7.Deployment

Calibration Graph

8. Target Audience

Data Dictionry

Client attributes

Attributes related with the last contact of the current campaign

Social and economic context attributes

Other attributes