Load data
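
To reproduce the analysis, here is a minimal loading sketch, assuming the standard UCI file bank-additional-full.csv (semicolon-separated) sits in the working directory:

library(tidyverse)

data <- read.csv("bank-additional-full.csv", sep = ";")  # UCI bank marketing data
head(data)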

age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
56 housemaid married basic.4y no no no telephone may mon 261 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
57 services married high.school unknown no no telephone may mon 149 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
37 services married high.school no yes no telephone may mon 226 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
40 admin. married basic.6y no no no telephone may mon 151 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
56 services married high.school no no yes telephone may mon 307 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
45 services married basic.9y unknown no no telephone may mon 198 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no

Diving into the risk factors that may affect term-deposit subscription in this bank marketing data, a few variables stand out as especially important for the risk modeling that follows:

1. emp.var.rate: the employment variation rate, a key economic indicator.

2. cons.price.idx (CPI): the consumer price index, which measures inflation.

3. cons.conf.idx (CCI): the consumer confidence index, which reflects consumer sentiment.

4. euribor3m: the 3-month Euro Interbank Offered Rate, a benchmark interest rate.
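
For a quick numeric glance at these four indicators, a sketch using dplyr's across() on the columns above:

data %>% 
  summarise(across(c(emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m),
                   list(mean = mean, sd = sd)))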

Data manipulation

data <- data %>% 
  mutate(across(c(job, education, marital, default, housing, loan, contact, poutcome, y),
                as.factor))
data %>% ggplot(aes(x = loan, fill = default)) +
  geom_bar(position = position_dodge(width = 0.5),
           color = "black", linewidth = 0.25) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Grouped Bar Chart", x = "Loan",
       y = "Count", fill = "Default") +
  theme_minimal()

Observing the chart, default cases exist even among clients with no personal loan, so those defaults may come from credit or housing loans instead.
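
A simple cross-tabulation backs this observation up numerically (sketch):

table(loan = data$loan, default = data$default)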

data %>% ggplot(aes(x=age,y=loan))+
  geom_jitter(alpha=0.3,color="blue")

The jitter plot shows how loans are distributed across age levels: most clients who take a loan are under 60 years old, where the point density is greatest.
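
We can check that reading directly; a sketch of the share of loan-takers under 60:

data %>% 
  filter(loan == "yes") %>% 
  summarise(share_under_60 = mean(age < 60))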

Studying the distribution of housing loans across marital status:

data %>% ggplot(aes(x=housing))+
  geom_bar()+
  facet_wrap(~marital)

As we can see, the highest housing-loan counts are among married clients.
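
Since married clients also dominate the sample, within-group proportions are the fairer comparison (sketch):

prop.table(table(data$marital, data$housing), margin = 1)   # row-wise shares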

Correlations

poutcome/y

library(lsr)   # for cramersV()
tab <- table(data$poutcome, data$y)
chisq.test(tab, correct = TRUE)
## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 4230.5, df = 2, p-value < 2.2e-16
cramersV(tab)
## [1] 0.320488

Cramér's V ≈ 0.32, a moderate rather than very strong association between the previous campaign outcome and the present one.
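
For reference, Cramér's V can be computed by hand from the chi-squared statistic as V = sqrt(X² / (n * (k - 1))), where k is the smaller table dimension (sketch):

chi2 <- unname(chisq.test(tab)$statistic)
n <- sum(tab)
k <- min(nrow(tab), ncol(tab))
sqrt(chi2 / (n * (k - 1)))   # ~0.32, matching cramersV(tab)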

Hypothesis comparison: job, education & age effects across the groups of y

attach(data)
# note: cbind() coerces the factors job and education to their integer codes,
# so the MANOVA treats them as numeric scores
Y <- cbind(job, education, age)
Manova <- manova(Y ~ y)
summary(Manova)
##              Df  Pillai approx F num Df den Df    Pr(>F)    
## y             1 0.00503   69.401      3  41184 < 2.2e-16 ***
## Residuals 41186                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A highly significant p-value: there is a statistically significant difference in the three combined variables (age, education, job) between the groups defined by the outcome y of the present term-deposit campaign.
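
To see which of the three responses drives the multivariate effect, the per-variable ANOVA tables can be extracted from the same fit (sketch):

summary.aov(Manova)   # one univariate ANOVA table per response column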

Correlation heatmap (studying correlations among the numerical variables)

library(corrplot)
numerical_cols <- sapply(data, is.numeric)
data_numeric <- data[, numerical_cols]
correlation_matrix <- cor(data_numeric, use = "complete.obs")
corrplot(correlation_matrix, method = "circle")

As the heatmap shows, euribor3m and nr.employed are strongly correlated with emp.var.rate. euribor3m also has a weaker correlation with cons.price.idx, which matches how these indicators move in the real economy.
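
The exact values behind that reading can be pulled straight from the matrix (sketch):

correlation_matrix["euribor3m", c("emp.var.rate", "nr.employed", "cons.price.idx")]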

T-test hypotheses: essential economic indicators versus y

variables <- c("cons.price.idx", "emp.var.rate", "cons.conf.idx", "euribor3m", "nr.employed")
results <- lapply(variables, function(var) t.test(data[[var]] ~ data$y))
names(results) <- variables
for (var in variables) {
  print(paste("Results for", var))
  print(results[[var]])
}
## [1] "Results for cons.price.idx"
## 
##  Welch Two Sample t-test
## 
## data:  data[[var]] by data$y
## t = 24.082, df = 5472.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  0.2290714 0.2696708
## sample estimates:
##  mean in group no mean in group yes 
##          93.60376          93.35439 
## 
## [1] "Results for emp.var.rate"
## 
##  Welch Two Sample t-test
## 
## data:  data[[var]] by data$y
## t = 59.137, df = 5665.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  1.433185 1.531463
## sample estimates:
##  mean in group no mean in group yes 
##         0.2488755        -1.2334483 
## 
## [1] "Results for cons.conf.idx"
## 
##  Welch Two Sample t-test
## 
## data:  data[[var]] by data$y
## t = -8.6365, df = 5258.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -0.9856585 -0.6209660
## sample estimates:
##  mean in group no mean in group yes 
##         -40.59310         -39.78978 
## 
## [1] "Results for euribor3m"
## 
##  Welch Two Sample t-test
## 
## data:  data[[var]] by data$y
## t = 62.58, df = 5729.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  1.635467 1.741246
## sample estimates:
##  mean in group no mean in group yes 
##          3.811491          2.123135 
## 
## [1] "Results for nr.employed"
## 
##  Welch Two Sample t-test
## 
## data:  data[[var]] by data$y
## t = 60.975, df = 5298.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  78.44475 83.65647
## sample estimates:
##  mean in group no mean in group yes 
##          5176.167          5095.116

We ran Welch two-sample t-tests (an A/B comparison) of each essential numerical risk factor against the binary response y, and all five indicators show highly significant differences in group means.
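
For a compact overview, the five p-values can be collected into one vector (sketch):

sapply(results, function(r) r$p.value)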

Studying the most important risk factors (risk tree) affecting y

library(rpart)
library(rpart.plot)
model <- rpart(y ~ ., data = data)
model
## n= 41188 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 41188 4640 no (0.8873458 0.1126542)  
##    2) nr.employed>=5087.65 36224 2431 no (0.9328898 0.0671102)  
##      4) duration< 606.5 33232 1044 no (0.9685845 0.0314155) *
##      5) duration>=606.5 2992 1387 no (0.5364305 0.4635695)  
##       10) duration< 835.5 1601  585 no (0.6346034 0.3653966) *
##       11) duration>=835.5 1391  589 yes (0.4234364 0.5765636) *
##    3) nr.employed< 5087.65 4964 2209 no (0.5549960 0.4450040)  
##      6) duration< 172.5 1891  328 no (0.8265468 0.1734532) *
##      7) duration>=172.5 3073 1192 yes (0.3878946 0.6121054)  
##       14) pdays>=16.5 2173 1027 yes (0.4726185 0.5273815)  
##         28) duration< 250.5 700  276 no (0.6057143 0.3942857) *
##         29) duration>=250.5 1473  603 yes (0.4093686 0.5906314) *
##       15) pdays< 16.5 900  165 yes (0.1833333 0.8166667) *
rpart.plot(model)

Based on these results, nr.employed, the duration of the last contact with the client, and pdays are the strongest risk factors for the outcome y.
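
rpart also records a variable-importance score that supports this reading (sketch, using the model object fitted above):

round(model$variable.importance, 1)   # higher = more influential in the splits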

So we will now build our logistic model around these factors.

Building models to predict y

library(rsample)   # for initial_split()
set.seed(2)
split <- initial_split(data,
                       prop = 0.80,
                       strata = y)
p_train <- training(split)
p_test <- testing(split)
Model <- glm(y ~ nr.employed + duration + pdays +
               age + job + education,
             data = p_train,
             family = "binomial")
summary(Model)
## 
## Call:
## glm(formula = y ~ nr.employed + duration + pdays + age + job + 
##     education, family = "binomial", data = p_train)
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   6.743e+01  1.542e+00  43.738  < 2e-16 ***
## nr.employed                  -1.356e-02  3.049e-04 -44.464  < 2e-16 ***
## duration                      4.485e-03  7.937e-05  56.504  < 2e-16 ***
## pdays                        -1.526e-03  7.694e-05 -19.826  < 2e-16 ***
## age                          -1.301e-03  2.414e-03  -0.539 0.590028    
## jobblue-collar               -4.329e-01  8.594e-02  -5.037 4.73e-07 ***
## jobentrepreneur              -3.050e-01  1.361e-01  -2.241 0.025002 *  
## jobhousemaid                 -5.117e-02  1.589e-01  -0.322 0.747424    
## jobmanagement                -1.835e-01  9.326e-02  -1.967 0.049165 *  
## jobretired                    4.476e-01  1.146e-01   3.906 9.37e-05 ***
## jobself-employed             -2.723e-01  1.291e-01  -2.109 0.034904 *  
## jobservices                  -2.994e-01  9.301e-02  -3.219 0.001287 ** 
## jobstudent                    3.476e-01  1.207e-01   2.879 0.003994 ** 
## jobtechnician                -5.483e-02  7.696e-02  -0.712 0.476167    
## jobunemployed                 1.194e-01  1.357e-01   0.880 0.379105    
## jobunknown                   -6.822e-02  2.646e-01  -0.258 0.796549    
## educationbasic.6y             8.510e-02  1.309e-01   0.650 0.515484    
## educationbasic.9y            -3.769e-02  1.031e-01  -0.366 0.714659    
## educationhigh.school          3.943e-02  9.996e-02   0.394 0.693248    
## educationilliterate           1.741e+00  7.211e-01   2.415 0.015738 *  
## educationprofessional.course  1.885e-01  1.102e-01   1.710 0.087196 .  
## educationuniversity.degree    3.425e-01  9.963e-02   3.437 0.000588 ***
## educationunknown              2.183e-01  1.309e-01   1.668 0.095402 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23199  on 32949  degrees of freedom
## Residual deviance: 14506  on 32927  degrees of freedom
## AIC: 14552
## 
## Number of Fisher Scoring iterations: 6

From this analysis age is not significant in our model, so we check with a one-way ANOVA whether age on its own differs across the groups of y.

anova <- aov(age ~ y, data = data)
summary(anova)
##                Df  Sum Sq Mean Sq F value  Pr(>F)    
## y               1    4133    4133   38.09 6.8e-10 ***
## Residuals   41186 4468876     109                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

On its own, age is a highly significant predictor (risk factor) of the outcome y. To resolve the issue, we now merge all the non-significant job categories into one level, and likewise the non-significant education categories.

# fct_collapse() (forcats) merges the listed levels; as.integer() keeps the column numeric
data <- data %>% 
  mutate(job = as.integer(fct_collapse(job,
                                       Other = c("housemaid", "technician", "unemployed", "unknown"))))

data <- data %>% 
  mutate(education = as.integer(fct_collapse(education,
                                             Other2 = c("basic.6y", "basic.9y", "high.school",
                                                        "professional.course", "unknown"))))
set.seed(2)
split <- initial_split(data,
                       prop=.80,
                       strata=y)
p_train <- training(split) 
p_test <- testing(split)
Model1 <- glm(y ~ nr.employed + duration + pdays +
                age +job + education,
                data=p_train,
                family="binomial")
summary(Model1)
## 
## Call:
## glm(formula = y ~ nr.employed + duration + pdays + age + job + 
##     education, family = "binomial", data = p_train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.960e+01  1.516e+00  45.905  < 2e-16 ***
## nr.employed -1.410e-02  3.000e-04 -47.006  < 2e-16 ***
## duration     4.453e-03  7.879e-05  56.514  < 2e-16 ***
## pdays       -1.555e-03  7.643e-05 -20.347  < 2e-16 ***
## age          4.554e-03  1.830e-03   2.489   0.0128 *  
## job          1.033e-02  6.106e-03   1.693   0.0905 .  
## education    8.624e-02  1.060e-02   8.138 4.01e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23199  on 32949  degrees of freedom
## Residual deviance: 14621  on 32943  degrees of freedom
## AIC: 14635
## 
## Number of Fisher Scoring iterations: 6

In the modified model, job is overall a non-significant predictor of the present campaign's outcome.

tab2 <- table(data$job,data$y)
chisq.test(tab2,correct=T)
## 
##  Pearson's Chi-squared test
## 
## data:  tab2
## X-squared = 961.24, df = 11, p-value < 2.2e-16
cramersV(tab2)
## [1] 0.1527676

Cramér's V ≈ 0.15: only a weak association between the client's job and the outcome y, but given its practical importance in this research we will keep it in the model.

In the next model we gradually add the economic risk indicators and check how well they fit.

Model2 <- glm(
  y ~ emp.var.rate+cons.conf.idx+euribor3m+cons.price.idx+
      nr.employed+duration + pdays+education+age+job+
      cons.conf.idx:cons.price.idx+
        emp.var.rate:cons.price.idx,
             data=p_train,
             family="binomial")
summary(Model2)
## 
## Call:
## glm(formula = y ~ emp.var.rate + cons.conf.idx + euribor3m + 
##     cons.price.idx + nr.employed + duration + pdays + education + 
##     age + job + cons.conf.idx:cons.price.idx + emp.var.rate:cons.price.idx, 
##     family = "binomial", data = p_train)
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   1.017e+02  2.954e+01   3.445 0.000572 ***
## emp.var.rate                  3.709e+01  4.228e+00   8.774  < 2e-16 ***
## cons.conf.idx                 5.259e+00  7.409e-01   7.098 1.26e-12 ***
## euribor3m                    -2.949e-01  1.030e-01  -2.864 0.004187 ** 
## cons.price.idx               -1.405e+00  3.222e-01  -4.361 1.29e-05 ***
## nr.employed                   5.874e-03  2.088e-03   2.813 0.004910 ** 
## duration                      4.556e-03  8.056e-05  56.553  < 2e-16 ***
## pdays                        -1.487e-03  7.696e-05 -19.325  < 2e-16 ***
## education                     6.809e-02  1.080e-02   6.303 2.92e-10 ***
## age                           1.987e-03  1.854e-03   1.072 0.283835    
## job                           9.699e-03  6.148e-03   1.578 0.114640    
## cons.conf.idx:cons.price.idx -5.558e-02  7.914e-03  -7.022 2.18e-12 ***
## emp.var.rate:cons.price.idx  -4.066e-01  4.556e-02  -8.923  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23199  on 32949  degrees of freedom
## Residual deviance: 14362  on 32937  degrees of freedom
## AIC: 14388
## 
## Number of Fisher Scoring iterations: 6

Lower AIC: after additive (forward-style) selection from the most to the least important predictors, comparing AIC and p-values and adding interaction terms to handle their multicollinearity, we conclude that Model2 is the best-fitting model.
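
That comparison can be made explicit (sketch; both models were fit on the same training data):

AIC(Model1, Model2)   # lower is better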

However, this model still has problems with the age and job predictors.

Checking VIF

library(car)
vif1 <- vif(Model1)
vif2 <- vif(Model2)
print(vif1)
## nr.employed    duration       pdays         age         job   education 
##    1.279584    1.160995    1.129549    1.029727    1.009281    1.039995
print(vif2)
##                 emp.var.rate                cons.conf.idx 
##                 98925.316606                 39721.017484 
##                    euribor3m               cons.price.idx 
##                    69.153350                    90.731074 
##                  nr.employed                     duration 
##                    62.171288                     1.195561 
##                        pdays                    education 
##                     1.163667                     1.067460 
##                          age                          job 
##                     1.064445                     1.012766 
## cons.conf.idx:cons.price.idx  emp.var.rate:cons.price.idx 
##                 40060.767756                 99609.347103

Checking age's VIF in both the first and second model, it clearly has no multicollinearity with any other predictor. (The enormous VIFs on emp.var.rate, cons.conf.idx, cons.price.idx, and the interaction terms are expected, since interaction terms are by construction highly collinear with their main effects.)
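
If those inflated VIFs were a concern, one standard remedy is centering the variables before forming the interactions; a sketch with hypothetical centered columns:

# center the main effects, then refit with e.g. evr_c * cpi_c; this typically
# shrinks the interaction-driven VIFs without changing the fitted probabilities
p_train_centered <- p_train %>% 
  mutate(evr_c = emp.var.rate - mean(emp.var.rate),
         cpi_c = cons.price.idx - mean(cons.price.idx),
         cci_c = cons.conf.idx - mean(cons.conf.idx))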

Tidied regression coefficients for the binary-y model

library(broom)
tidy(Model2)
## # A tibble: 13 x 5
##    term                          estimate  std.error statistic  p.value
##    <chr>                            <dbl>      <dbl>     <dbl>    <dbl>
##  1 (Intercept)                  102.      29.5            3.44 5.72e- 4
##  2 emp.var.rate                  37.1      4.23           8.77 1.72e-18
##  3 cons.conf.idx                  5.26     0.741          7.10 1.26e-12
##  4 euribor3m                     -0.295    0.103         -2.86 4.19e- 3
##  5 cons.price.idx                -1.41     0.322         -4.36 1.29e- 5
##  6 nr.employed                    0.00587  0.00209        2.81 4.91e- 3
##  7 duration                       0.00456  0.0000806     56.6  0       
##  8 pdays                         -0.00149  0.0000770    -19.3  3.29e-83
##  9 education                      0.0681   0.0108         6.30 2.92e-10
## 10 age                            0.00199  0.00185        1.07 2.84e- 1
## 11 job                            0.00970  0.00615        1.58 1.15e- 1
## 12 cons.conf.idx:cons.price.idx  -0.0556   0.00791       -7.02 2.18e-12
## 13 emp.var.rate:cons.price.idx   -0.407    0.0456        -8.92 4.54e-19

Printing Final Logistic Regression Equation

coefficients <- coef(Model2)
# assemble the linear predictor (log-odds scale) as a readable string
equation <- paste("y=",
                  round(coefficients[1], 3))

for (i in 2:length(coefficients)) {
  if (coefficients[i] >= 0) {
    sign <- "+"
  } else {
    sign <- "-"
  }
  
  equation <- paste(equation, sign, round(abs(coefficients[i]), 3), "*",
                    names(coefficients)[i])
}

Classification model of the binary y {yes, no}

print(equation)
## [1] "y= 101.749 + 37.094 * emp.var.rate + 5.259 * cons.conf.idx - 0.295 * euribor3m - 1.405 * cons.price.idx + 0.006 * nr.employed + 0.005 * duration - 0.001 * pdays + 0.068 * education + 0.002 * age + 0.01 * job - 0.056 * cons.conf.idx:cons.price.idx - 0.407 * emp.var.rate:cons.price.idx"

Note: for the ridge model in the next section the variables must be standardized before predicting, z = (x - mean)/sd, where x is the original value of the variable and the mean and sd come from the training data.
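
A sketch of that formula on a single hypothetical value, using the training statistics:

x <- 300                                                  # hypothetical new duration value
z <- (x - mean(p_train$duration)) / sd(p_train$duration)
z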

Ridge Regularization Regression Model

Trying to address the non-significance of age and job.

library(glmnet)
risk_vars <- c("emp.var.rate", "cons.conf.idx", "euribor3m",
               "cons.price.idx", "nr.employed", "age", "job",
               "duration", "pdays", "education")
X <- as.matrix(p_train[, risk_vars])

test <- p_test[, risk_vars]
z <- p_train$y
X_scaled <- scale(X)   # standardize predictors before ridge
cv_ridge <- cv.glmnet(X_scaled, z, alpha = 0, family = "binomial")
best_lambda <- cv_ridge$lambda.min
ridge_model <- glmnet(X_scaled, z, alpha = 0, lambda = best_lambda,
                      family = "binomial")
print(coef(ridge_model))
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                         s0
## (Intercept)    -2.82329543
## emp.var.rate   -0.32815152
## cons.conf.idx   0.15176639
## euribor3m      -0.27703707
## cons.price.idx  0.11311307
## nr.employed    -0.44015118
## age             0.02987493
## job             0.03333369
## duration        0.98678253
## pdays          -0.28004171
## education       0.13908973

The ridge coefficients for age and job are very close to zero, so even though the model is allowed to use them, it again suggests these variables contribute little explanatory power.

Ridge equation on the standardized variables, usable for prediction at the cost of some accuracy:

"ridge equation: y = -2.82 - 0.33*emp.var.rate + 0.15*cons.conf.idx - 0.28*euribor3m + 0.11*cons.price.idx - 0.44*nr.employed + 0.99*duration - 0.28*pdays + 0.14*education + 0.03*age + 0.03*job"
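
To predict with the ridge model, the test matrix must be scaled with the training means and SDs stored on X_scaled (sketch):

test_scaled <- scale(as.matrix(test),
                     center = attr(X_scaled, "scaled:center"),
                     scale = attr(X_scaled, "scaled:scale"))
ridge_prob <- predict(ridge_model, newx = test_scaled, type = "response")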

Accuracy of the Logistic Model2

p_test <- p_test %>% 
  mutate(y_prob = predict(Model2,
                               p_test,
                               type= "response"),
         y_pred = ifelse(y_prob > .5, 1, 0))
t <- table(p_test$y,
           p_test$y_pred)
AccuracyModel2<-sum(diag(t))/sum(t) 
print(AccuracyModel2)
## [1] 0.9089585

About 91% accuracy and overall correctness, so Model2 is performing very well. Keep in mind, though, that with this class imbalance a model that always predicts 'no' would already score about 88.7% (the root-node proportion in the tree above).
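
Because of that imbalance it is worth reading the class-wise rates off the confusion matrix t (sketch; rows are the truth, columns the predictions):

sens <- t["yes", "1"] / sum(t["yes", ])   # recall on subscribers
spec <- t["no", "0"] / sum(t["no", ])     # recall on non-subscribers
c(sensitivity = sens, specificity = spec)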

ROC Curve of Model2

library(ROCR)
pred <- predict(Model2, data, type='response')
pred <- prediction(pred, data$y)
roc <- performance(pred, "tpr","fpr")
plot(roc,
     colorize=T,
     main="ROC Curve",
     ylab="Sensitivity",
     xlab="1-Specificity")
abline(a=0, b=1)

auc <- performance(pred, "auc")
auc <- unlist(slot(auc, "y.values"))
auc <- round(auc,4)
print(auc)
## [1] 0.9246

An AUC of about 0.92 is excellent, reflecting the model's strong ability to discriminate between the two classes.
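
If a cutoff other than 0.5 is wanted, ROCR's alpha values give the threshold that maximizes Youden's J (a sketch):

perf <- performance(pred, "sens", "spec")
j <- perf@y.values[[1]] + perf@x.values[[1]] - 1   # Youden's J at each cutoff
perf@alpha.values[[1]][which.max(j)]               # probability cutoff maximizing J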