| age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | 261 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 57 | services | married | high.school | unknown | no | no | telephone | may | mon | 149 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 37 | services | married | high.school | no | yes | no | telephone | may | mon | 226 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | 151 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 56 | services | married | high.school | no | no | yes | telephone | may | mon | 307 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
| 45 | services | married | basic.9y | unknown | no | no | telephone | may | mon | 198 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191 | no |
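These are the first six rows of the dataset (41,188 clients, 21 variables). For reproducibility, a minimal loading sketch, assuming the semicolon-separated UCI file bank-additional-full.csv sits in the working directory:

library(dplyr)
library(ggplot2)
# The UCI bank marketing file is semicolon-delimited
data <- read.csv("bank-additional-full.csv", sep = ";")
dim(data)   # 41188 rows, 21 columns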
Diving into the risk factors that may affect the modeling of term-deposit subscription in this bank marketing data, and building on our earlier analysis, several variables stand out as very important for our risk modeling:
1. emp.var.rate: the employment variation rate, a crucial economic indicator.
2. cons.price.idx (CPI): the consumer price index, which measures inflation.
3. cons.conf.idx (CCI): the consumer confidence index, which reflects consumer sentiment.
4. euribor3m: the 3-month Euro Interbank Offered Rate, a benchmark interest rate.
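To get a quick feel for these indicators before modeling, a short numeric summary (a minimal sketch, run after the data is loaded):

macro_vars <- c("emp.var.rate", "cons.price.idx", "cons.conf.idx",
                "euribor3m", "nr.employed")
summary(data[, macro_vars])   # ranges and quartiles of the macro indicators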
data <- data %>%
  mutate(across(c(job, education, marital, default, housing, loan,
                  contact, poutcome, y), as.factor))
# Grouped counts of default status within each loan group
data %>% ggplot(aes(x = loan, fill = default)) +
  geom_bar(position = position_dodge(width = 0.5),
           color = "black", size = 0.25) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Grouped Bar Chart", x = "Loan", y = "Count", fill = "Default") +
  theme_minimal()
From this chart we can see that default cases exist even among clients without a personal loan, so those defaults may stem from credit-card or housing debt instead.
data %>% ggplot(aes(x=age,y=loan))+
geom_jitter(alpha=0.3,color="blue")
Studying the distribution of loans across age levels shows that most clients who take a loan are under 60 years old, with a very dense concentration in that range.
data %>% ggplot(aes(x=housing))+
geom_bar()+
facet_wrap(~marital)
As the faceted chart shows, housing-loan uptake is highest among married clients.
library(lsr)   # provides cramersV()
tab <- table(data$poutcome, data$y)
chisq.test(tab, correct = T)
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 4230.5, df = 2, p-value < 2.2e-16
cramersV(tab)
## [1] 0.320488
Cramér's V of about 0.32 indicates a moderate, though not especially strong, association between the previous campaign outcome and the present one.
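For reference, Cramér's V is a chi-squared-based effect size and can be recomputed by hand from the test statistic (a sketch using the `tab` table above):

# V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
chi2 <- unname(chisq.test(tab)$statistic)
n <- sum(tab)
k <- min(dim(tab))
sqrt(chi2 / (n * (k - 1)))   # about 0.32, matching cramersV(tab)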
attach(data)
# Note: cbind() coerces the factors job and education to their underlying
# integer codes, so they enter the MANOVA as numeric scores.
Y <- cbind(job, education, age)
Manova <- manova(Y ~ y)
summary(Manova)
## Df Pillai approx F num Df den Df Pr(>F)
## y 1 0.00503 69.401 3 41184 < 2.2e-16 ***
## Residuals 41186
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is highly significant, meaning there is a statistically significant difference in the three combined variables (age, education, job) between the groups defined by the outcome y of the present term-deposit campaign.
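A natural follow-up is the univariate ANOVA for each response, which shows which of the three variables drives the multivariate difference (a sketch using the fitted `Manova` object):

# One univariate ANOVA per column of Y (job, education, age)
summary.aov(Manova)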
library(corrplot)
numerical_cols <- sapply(data, is.numeric)
data_numeric <- data[, numerical_cols]
correlation_matrix <- cor(data_numeric, use = "complete.obs")
corrplot(correlation_matrix, method = "circle")
As visualized in the heatmap, euribor3m and nr.employed correlate strongly with emp.var.rate. euribor3m also has a weaker correlation with cons.price.idx, which matches real economic behavior.
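The specific values behind that reading can be pulled straight from the matrix (a quick check, using `correlation_matrix` from above):

# Correlations of the other macro indicators with emp.var.rate
round(correlation_matrix[c("euribor3m", "nr.employed", "cons.price.idx"),
                         "emp.var.rate"], 2)
round(correlation_matrix["euribor3m", "cons.price.idx"], 2)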
variables <- c("cons.price.idx", "emp.var.rate", "cons.conf.idx",
               "euribor3m", "nr.employed")
# Welch two-sample t-test of each indicator against the binary outcome y
results <- lapply(variables, function(var) t.test(data[[var]] ~ data$y))
names(results) <- variables
for (var in variables) {
  print(paste("Results for", var))
  print(results[[var]])
}
## [1] "Results for cons.price.idx"
##
## Welch Two Sample t-test
##
## data: data[[var]] by data$y
## t = 24.082, df = 5472.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## 0.2290714 0.2696708
## sample estimates:
## mean in group no mean in group yes
## 93.60376 93.35439
##
## [1] "Results for emp.var.rate"
##
## Welch Two Sample t-test
##
## data: data[[var]] by data$y
## t = 59.137, df = 5665.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## 1.433185 1.531463
## sample estimates:
## mean in group no mean in group yes
## 0.2488755 -1.2334483
##
## [1] "Results for cons.conf.idx"
##
## Welch Two Sample t-test
##
## data: data[[var]] by data$y
## t = -8.6365, df = 5258.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -0.9856585 -0.6209660
## sample estimates:
## mean in group no mean in group yes
## -40.59310 -39.78978
##
## [1] "Results for euribor3m"
##
## Welch Two Sample t-test
##
## data: data[[var]] by data$y
## t = 62.58, df = 5729.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## 1.635467 1.741246
## sample estimates:
## mean in group no mean in group yes
## 3.811491 2.123135
##
## [1] "Results for nr.employed"
##
## Welch Two Sample t-test
##
## data: data[[var]] by data$y
## t = 60.975, df = 5298.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## 78.44475 83.65647
## sample estimates:
## mean in group no mean in group yes
## 5176.167 5095.116
We applied Welch two-sample t-tests (an A/B comparison) for each of the essential numerical risk factors against the binary response y, and every one of them differs significantly between the "yes" and "no" groups.
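The five tests can also be condensed into one small table of statistics and p-values (a sketch using the `results` list built above):

# t statistic and p-value for each indicator
sapply(results, function(r) c(t = unname(r$statistic), p = r$p.value))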
library(rpart)
rpart(y ~ ., data = data)
## n= 41188
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 41188 4640 no (0.8873458 0.1126542)
## 2) nr.employed>=5087.65 36224 2431 no (0.9328898 0.0671102)
## 4) duration< 606.5 33232 1044 no (0.9685845 0.0314155) *
## 5) duration>=606.5 2992 1387 no (0.5364305 0.4635695)
## 10) duration< 835.5 1601 585 no (0.6346034 0.3653966) *
## 11) duration>=835.5 1391 589 yes (0.4234364 0.5765636) *
## 3) nr.employed< 5087.65 4964 2209 no (0.5549960 0.4450040)
## 6) duration< 172.5 1891 328 no (0.8265468 0.1734532) *
## 7) duration>=172.5 3073 1192 yes (0.3878946 0.6121054)
## 14) pdays>=16.5 2173 1027 yes (0.4726185 0.5273815)
## 28) duration< 250.5 700 276 no (0.6057143 0.3942857) *
## 29) duration>=250.5 1473 603 yes (0.4093686 0.5906314) *
## 15) pdays< 16.5 900 165 yes (0.1833333 0.8166667) *
library(rpart.plot)
model <- rpart(y ~ ., data = data)
rpart.plot(model)
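Beyond the plot, rpart also reports numeric variable importance, which backs up the reading below (a quick sketch using the fitted `model`):

# Importance scores accumulated over primary and surrogate splits
model$variable.importance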
Based on these results, nr.employed, the duration of the last contact with the client, and pdays are the strongest risk factors for the outcome y.
We will now build our logistic model around these factors.
# Building Models to Predict
library(rsample)   # initial_split(), training(), testing()
set.seed(2)
split <- initial_split(data, prop = 0.80, strata = y)
p_train <- training(split)
p_test <- testing(split)
Model <- glm(y ~ nr.employed + duration + pdays + age + job + education,
             data = p_train, family = "binomial")
summary(Model)
##
## Call:
## glm(formula = y ~ nr.employed + duration + pdays + age + job +
## education, family = "binomial", data = p_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.743e+01 1.542e+00 43.738 < 2e-16 ***
## nr.employed -1.356e-02 3.049e-04 -44.464 < 2e-16 ***
## duration 4.485e-03 7.937e-05 56.504 < 2e-16 ***
## pdays -1.526e-03 7.694e-05 -19.826 < 2e-16 ***
## age -1.301e-03 2.414e-03 -0.539 0.590028
## jobblue-collar -4.329e-01 8.594e-02 -5.037 4.73e-07 ***
## jobentrepreneur -3.050e-01 1.361e-01 -2.241 0.025002 *
## jobhousemaid -5.117e-02 1.589e-01 -0.322 0.747424
## jobmanagement -1.835e-01 9.326e-02 -1.967 0.049165 *
## jobretired 4.476e-01 1.146e-01 3.906 9.37e-05 ***
## jobself-employed -2.723e-01 1.291e-01 -2.109 0.034904 *
## jobservices -2.994e-01 9.301e-02 -3.219 0.001287 **
## jobstudent 3.476e-01 1.207e-01 2.879 0.003994 **
## jobtechnician -5.483e-02 7.696e-02 -0.712 0.476167
## jobunemployed 1.194e-01 1.357e-01 0.880 0.379105
## jobunknown -6.822e-02 2.646e-01 -0.258 0.796549
## educationbasic.6y 8.510e-02 1.309e-01 0.650 0.515484
## educationbasic.9y -3.769e-02 1.031e-01 -0.366 0.714659
## educationhigh.school 3.943e-02 9.996e-02 0.394 0.693248
## educationilliterate 1.741e+00 7.211e-01 2.415 0.015738 *
## educationprofessional.course 1.885e-01 1.102e-01 1.710 0.087196 .
## educationuniversity.degree 3.425e-01 9.963e-02 3.437 0.000588 ***
## educationunknown 2.183e-01 1.309e-01 1.668 0.095402 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23199 on 32949 degrees of freedom
## Residual deviance: 14506 on 32927 degrees of freedom
## AIC: 14552
##
## Number of Fisher Scoring iterations: 6
This analysis shows that age is not significant in our model, so we run a one-way ANOVA to check whether age differs across the outcome groups on its own, and whether keeping it risks overfitting the model.
anova_age <- aov(age ~ y, data = data)   # explicit data argument; avoids masking stats::anova()
summary(anova_age)
## Df Sum Sq Mean Sq F value Pr(>F)
## y 1 4133 4133 38.09 6.8e-10 ***
## Residuals 41186 4468876 109
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
On its own, age is a highly significant predictor (risk factor) for the outcome y, so we will combine the non-significant job categories, and likewise the non-significant education categories, to address the issue.
# Caution: c("housemaid,technician,unemployed,unknown") is one comma-joined
# string, so the %in% test never matches and ifelse() returns the factors'
# integer codes; job and education therefore enter the later models as
# numeric codes (a proper merge would pass separate strings to c()).
data <- data %>% mutate(job = ifelse(job %in% c("housemaid,technician,unemployed,unknown"),
                                     "Other", as.factor(job)))
data <- data %>% mutate(education = ifelse(education %in% c("basic.6y,basic.9y,high.school,professional.course,unknown"),
                                           "Other2", as.factor(education)))
set.seed(2)
split <- initial_split(data, prop = 0.80, strata = y)
p_train <- training(split)
p_test <- testing(split)
Model1 <- glm(y ~ nr.employed + duration + pdays + age + job + education,
              data = p_train, family = "binomial")
summary(Model1)
##
## Call:
## glm(formula = y ~ nr.employed + duration + pdays + age + job +
## education, family = "binomial", data = p_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.960e+01 1.516e+00 45.905 < 2e-16 ***
## nr.employed -1.410e-02 3.000e-04 -47.006 < 2e-16 ***
## duration 4.453e-03 7.879e-05 56.514 < 2e-16 ***
## pdays -1.555e-03 7.643e-05 -20.347 < 2e-16 ***
## age 4.554e-03 1.830e-03 2.489 0.0128 *
## job 1.033e-02 6.106e-03 1.693 0.0905 .
## education 8.624e-02 1.060e-02 8.138 4.01e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23199 on 32949 degrees of freedom
## Residual deviance: 14621 on 32943 degrees of freedom
## AIC: 14635
##
## Number of Fisher Scoring iterations: 6
In this modified model, job is overall a non-significant predictor of the present campaign's outcome.
tab2 <- table(data$job,data$y)
chisq.test(tab2,correct=T)
##
## Pearson's Chi-squared test
##
## data: tab2
## X-squared = 961.24, df = 11, p-value < 2.2e-16
cramersV(tab2)
## [1] 0.1527676
Cramér's V of about 0.15 indicates a weak association between the client's job and the outcome y, but given its practical importance for real marketing data we will try to keep it in the model.
Model2 <- glm(
y ~ emp.var.rate+cons.conf.idx+euribor3m+cons.price.idx+
nr.employed+duration + pdays+education+age+job+
cons.conf.idx:cons.price.idx+
emp.var.rate:cons.price.idx,
data=p_train,
family="binomial")
summary(Model2)
##
## Call:
## glm(formula = y ~ emp.var.rate + cons.conf.idx + euribor3m +
## cons.price.idx + nr.employed + duration + pdays + education +
## age + job + cons.conf.idx:cons.price.idx + emp.var.rate:cons.price.idx,
## family = "binomial", data = p_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.017e+02 2.954e+01 3.445 0.000572 ***
## emp.var.rate 3.709e+01 4.228e+00 8.774 < 2e-16 ***
## cons.conf.idx 5.259e+00 7.409e-01 7.098 1.26e-12 ***
## euribor3m -2.949e-01 1.030e-01 -2.864 0.004187 **
## cons.price.idx -1.405e+00 3.222e-01 -4.361 1.29e-05 ***
## nr.employed 5.874e-03 2.088e-03 2.813 0.004910 **
## duration 4.556e-03 8.056e-05 56.553 < 2e-16 ***
## pdays -1.487e-03 7.696e-05 -19.325 < 2e-16 ***
## education 6.809e-02 1.080e-02 6.303 2.92e-10 ***
## age 1.987e-03 1.854e-03 1.072 0.283835
## job 9.699e-03 6.148e-03 1.578 0.114640
## cons.conf.idx:cons.price.idx -5.558e-02 7.914e-03 -7.022 2.18e-12 ***
## emp.var.rate:cons.price.idx -4.066e-01 4.556e-02 -8.923 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23199 on 32949 degrees of freedom
## Residual deviance: 14362 on 32937 degrees of freedom
## AIC: 14388
##
## Number of Fisher Scoring iterations: 6
Lower AIC: after an additive (forward) selection analysis, adding predictors from the most to the least important, comparing AIC and p-values, and introducing interaction terms to account for multicollinearity, we conclude that Model2 is the best-fitting model.
However, the age and job predictors remain problematic in this model.
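To make the AIC comparison concrete (a quick side-by-side of the two fitted models):

AIC(Model1, Model2)   # Model2: 14388 vs Model1: 14635, per the summaries above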
library(car)
vif1 <- vif(Model1)
vif2 <- vif(Model2)
print(vif1)
## nr.employed duration pdays age job education
## 1.279584 1.160995 1.129549 1.029727 1.009281 1.039995
print(vif2)
## emp.var.rate cons.conf.idx
## 98925.316606 39721.017484
## euribor3m cons.price.idx
## 69.153350 90.731074
## nr.employed duration
## 62.171288 1.195561
## pdays education
## 1.163667 1.067460
## age job
## 1.064445 1.012766
## cons.conf.idx:cons.price.idx emp.var.rate:cons.price.idx
## 40060.767756 99609.347103
As for age, its VIF in both the first and second models clearly shows no multicollinearity with any other predictor. The macroeconomic terms, by contrast, have extremely large VIFs, as expected when interaction terms are built from strongly correlated indicators.
Tidied regression coefficients for the binary outcome y:
library(broom)
tidy(Model2)
## # A tibble: 13 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 102. 29.5 3.44 5.72e- 4
## 2 emp.var.rate 37.1 4.23 8.77 1.72e-18
## 3 cons.conf.idx 5.26 0.741 7.10 1.26e-12
## 4 euribor3m -0.295 0.103 -2.86 4.19e- 3
## 5 cons.price.idx -1.41 0.322 -4.36 1.29e- 5
## 6 nr.employed 0.00587 0.00209 2.81 4.91e- 3
## 7 duration 0.00456 0.0000806 56.6 0
## 8 pdays -0.00149 0.0000770 -19.3 3.29e-83
## 9 education 0.0681 0.0108 6.30 2.92e-10
## 10 age 0.00199 0.00185 1.07 2.84e- 1
## 11 job 0.00970 0.00615 1.58 1.15e- 1
## 12 cons.conf.idx:cons.price.idx -0.0556 0.00791 -7.02 2.18e-12
## 13 emp.var.rate:cons.price.idx -0.407 0.0456 -8.92 4.54e-19
# Assemble a human-readable equation (on the log-odds scale) from the fitted coefficients
coefficients <- coef(Model2)
equation <- paste("y=", round(coefficients[1], 3))
for (i in 2:length(coefficients)) {
  sign <- ifelse(coefficients[i] >= 0, "+", "-")
  equation <- paste(equation, sign, round(abs(coefficients[i]), 3), "*",
                    names(coefficients)[i])
}
print(equation)
## [1] "y= 101.749 + 37.094 * emp.var.rate + 5.259 * cons.conf.idx - 0.295 * euribor3m - 1.405 * cons.price.idx + 0.006 * nr.employed + 0.005 * duration - 0.001 * pdays + 0.068 * education + 0.002 * age + 0.01 * job - 0.056 * cons.conf.idx:cons.price.idx - 0.407 * emp.var.rate:cons.price.idx"
Note: the variables in the ridge model below must be standardized before fitting or predicting: z = (x - mean) / sd, where x is the original value of the variable.
library(glmnet)
predictors <- c("emp.var.rate", "cons.conf.idx", "euribor3m", "cons.price.idx",
                "nr.employed", "age", "job", "duration", "pdays", "education")
X <- as.matrix(p_train[, predictors])
test <- p_test[, predictors]
z <- p_train$y
X_scaled <- scale(X)                                  # standardize: z = (x - mean) / sd
cv_ridge <- cv.glmnet(X_scaled, z, alpha = 0, family = "binomial")   # alpha = 0 -> ridge
best_lambda <- cv_ridge$lambda.min
ridge_model <- glmnet(X_scaled, z, alpha = 0, lambda = best_lambda,
                      family = "binomial")
print(coef(ridge_model))
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -2.82329543
## emp.var.rate -0.32815152
## cons.conf.idx 0.15176639
## euribor3m -0.27703707
## cons.price.idx 0.11311307
## nr.employed -0.44015118
## age 0.02987493
## job 0.03333369
## duration 0.98678253
## pdays -0.28004171
## education 0.13908973
According to these ridge model coefficients, the age and job coefficients are very near zero, so the model again suggests that these variables contribute little to explaining the outcome, even though it is allowed to use them.
The ridge equation over the standardized variables can be used for prediction, at some cost in accuracy:
ridge equation: y = -2.82 - 0.33 * emp.var.rate + 0.15 * cons.conf.idx - 0.28 * euribor3m + 0.11 * cons.price.idx - 0.44 * nr.employed + 0.99 * duration - 0.28 * pdays + 0.14 * education + 0.03 * age + 0.03 * job
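Following the standardization note above, new data must be scaled with the training means and standard deviations before prediction (a minimal sketch using the scaling attributes stored on `X_scaled` and the `test` frame defined earlier):

# Reuse the training centers/sds, then predict subscription probabilities
centers <- attr(X_scaled, "scaled:center")
sds     <- attr(X_scaled, "scaled:scale")
test_scaled <- scale(as.matrix(test), center = centers, scale = sds)
ridge_prob <- predict(ridge_model, newx = test_scaled, type = "response")
head(ridge_prob)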
p_test <- p_test %>%
  mutate(y_prob = predict(Model2, p_test, type = "response"),
         y_pred = ifelse(y_prob > .5, 1, 0))
t <- table(p_test$y, p_test$y_pred)   # confusion matrix: rows = actual, cols = predicted
AccuracyModel2 <- sum(diag(t)) / sum(t)
print(AccuracyModel2)
## [1] 0.9089585
About 91% accuracy (overall correctness) on the held-out test set, so Model2 is performing very well. Keep in mind, though, that only about 11% of clients subscribe, so a trivial always-"no" classifier would already score roughly 89%; accuracy should be read against that base rate.
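Because of that imbalance, sensitivity and specificity are worth reporting alongside accuracy (a sketch using the confusion matrix `t` above; rows are the actual labels, columns the 0/1 predictions):

sensitivity <- t["yes", "1"] / sum(t["yes", ])   # true-positive rate
specificity <- t["no", "0"] / sum(t["no", ])     # true-negative rate
c(sensitivity = sensitivity, specificity = specificity)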
library(ROCR)
pred <- predict(Model2, data, type = 'response')   # scored on the full dataset, not just the test split
pred <- prediction(pred, data$y)
roc <- performance(pred, "tpr","fpr")
plot(roc,
colorize=T,
main="ROC Curve",
ylab="Sensitivity",
xlab="1-Specificity")
abline(a=0, b=1)
auc <- performance(pred, "auc")
auc <- unlist(slot(auc, "y.values"))
auc <- round(auc,4)
print(auc)
## [1] 0.9246
An AUC of about 0.92 is excellent, reflecting the model's strong ability to discriminate between the two classes.
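A common next step is choosing a probability cutoff from the ROC curve, for example by maximizing Youden's J = TPR - FPR (a sketch using the ROCR `roc` performance object built above):

tpr <- roc@y.values[[1]]
fpr <- roc@x.values[[1]]
cutoffs <- roc@alpha.values[[1]]
best <- which.max(tpr - fpr)
cutoffs[best]   # cutoff that maximizes Youden's J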