Logistic Regression Application on Bank Telemarketing Data

A Team-Based Project
Author

Harrison Claudius, Louis Aubert, Sefira Achmadi, Keisha Aimee P., Alya Fitri S., Yehezkiel Wisnu A.K.

Published

March 6, 2023

1 INTRODUCTION

1.1 Case Background

A bank is a financial institution that deals in money and its substitutes and provides other money-related services such as savings deposits, checking accounts, and loans. A bank maintains a customer database that provides insights and information used to maintain the relationship between the customers and the bank itself.

Since a bank, as a financial institution, depends entirely on its customers, marketing plays an important part in the bank’s business process. One of the techniques carried out by banks is telemarketing.

Telemarketing uses the telephone, as its name suggests, to convey information, especially about products, to customers. Well-organized telemarketing is not only successful in offering products but can also be used for market research or simply to collect accurate information about customers. Telemarketing is more effective than traditional marketing techniques, including e-mail, because it guarantees active two-way communication between the company’s marketers and its customers.

The Bank Telemarketing dataset consists of 41,184 observations with a number of variables. In this case, we use only 6 predictor variables and a binary response variable; 2 of the predictors are categorical variables that will be encoded as dummy variables. The dataset is retrieved from the UCI Machine Learning Repository with acknowledgement to Moro and Cortez.

The variables used are:

  • Customer’s Age

  • Customer’s Marital Status (a categorical variable with 3 categories: married, divorced, single; encoded as dummy variables).

  • Customer’s Loan Status (a categorical variable with 3 categories: yes, no, unknown; encoded as dummy variables).

  • Contact Duration

  • Consumer Price Index

  • Euribor 3 Month Rate

  • Customer’s Decision on Product Subscription

1.2 Binary Logistic Regression

1.2.1 Logistic Model

Logistic regression is a statistical analysis method that describes the relationship between a categorical response variable and one or more predictor variables. When the response variable consists of only 2 categories, binary logistic regression is used. The binary response follows the Bernoulli distribution:

f(y_i)=\pi_i^{y_i} (1-\pi_i)^{1-y_i}

where:

\pi_i: probability of the event for the i-th observation

y_i: the i-th response variable, taking the values 0 and 1

The model for logistic regression is written as:

\pi(x)=\frac{\exp(\beta_0+\beta_1x_1+\cdots+\beta_px_p)}{1+\exp(\beta_0+\beta_1x_1+\cdots+\beta_px_p)}

This form of the model is hard to interpret directly. Hence, the model is transformed and rewritten in the logit form of logistic regression as follows:

g(x)=\ln\bigg[\frac{\pi(x)}{1-\pi(x)}\bigg] = \beta_0+\beta_1x_1+\cdots+\beta_px_p
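
As a quick numerical illustration, base R’s plogis() and qlogis() implement exactly these two forms (the linear predictor value below is illustrative):

eta <- 0.5               # an illustrative linear predictor value beta0 + beta1*x1 + ...
plogis(eta)              # pi(x) = exp(eta)/(1 + exp(eta)) = 0.6224593
qlogis(plogis(eta))      # g(x) = ln(pi/(1 - pi)) recovers eta = 0.5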

1.2.2 Parameter Significance Test

Hypothesis for simultaneous testing:

H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 vs.

H_1: at least one \beta_j \neq 0.

Simultaneous testing can be done with the likelihood ratio test, whose test statistic G is:

G = -2 \ln \bigg[\frac{\big(\frac{n_1}{n}\big)^{n_1}\big(\frac{n_0}{n}\big)^{n_0}}{\prod^n_{i=1}\hat{\pi}_i^{y_i}(1-\hat{\pi}_i)^{1-y_i}}\bigg] = -2\ln \bigg[\frac{L_0}{L_1}\bigg]

where:

n_1: number of observations in category 1

n_0: number of observations in category 0

L_0: likelihood of the model without any predictor variables

L_1: likelihood of the model with predictor variables
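
In R, G can be computed directly from two nested glm fits. A minimal sketch on hypothetical data (a data frame dat with response y and predictors x1, x2, all illustrative and not part of this analysis):

fit0 <- glm(y ~ 1, data = dat, family = binomial)        # null model, gives L_0
fit1 <- glm(y ~ x1 + x2, data = dat, family = binomial)  # full model, gives L_1
G <- as.numeric(-2 * (logLik(fit0) - logLik(fit1)))      # equals deviance(fit0) - deviance(fit1)
pchisq(G, df = 2, lower.tail = FALSE)                    # chi-squared p-value, df = number of predictors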

Partial testing can be done with the Wald test, whose test statistic W for the j-th coefficient is:

W = \frac{\hat{\beta}_j}{\sqrt{\widehat{Var}(\hat{\beta}_j)}} = \frac{\hat{\beta}_j}{\widehat{SE}(\hat{\beta}_j)}

with hypothesis:

H_0: \beta_j = 0 vs.

H_1: \beta_j \neq 0.
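
The Wald statistic is exactly the “z value” column that summary.glm reports. A quick check, reusing the hypothetical fit1 from the sketch above:

coefs <- summary(fit1)$coefficients
coefs[, "Estimate"] / coefs[, "Std. Error"]   # W for each beta_j, computed by hand
coefs[, "z value"]                            # identical, as reported by R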

2 MODEL BUILDING PROCESS

2.1 Building Logistic Model Part 1

2.1.1 Variables Description

Now we will build a logistic regression model for our data, to see whether we can predict if a customer subscribes to the offered service or not.

As stated in the previous part, we have 2 categorical predictors. The first (marital status) has 3 categories, so it needs 2 dummy variables; the same holds for the second (loan status), which gets 2 dummy variables of its own. In summary, the variables present in the model are:

  • X_1: Customer’s Age

  • D_1: Married Customer

  • D_2: Divorced Customer

  • D_3: Customer without Loans

  • D_4: Customer with Loans

  • X_2: Contact Duration

  • X_3: Consumer Price Index

  • X_4: Euribor 3 Month Rate

When D_1 and D_2 are 0, the customer’s marital status is single.

When D_3 and D_4 are 0, the customer’s loan status is unknown.

2.1.2 Data Preprocessing & Exploration

The first 10 observations are shown below:

library(readr)
bank <- read_delim("D:/Tugas/SEM 5/Stat Sos/Presentasi 3/bank.csv", 
                   delim = ";", escape_double = FALSE, trim_ws = TRUE)
bank <- as.data.frame(bank)
rmarkdown::paged_table(head(bank,10))

Note: The preprocessing for the dummy variables was done in Excel. One might also do the preprocessing directly in R with the help of the caret package, as sketched below.
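
A hedged sketch of that alternative, assuming the raw categorical columns are named marital and loan (these names are illustrative):

library(caret)
dv <- dummyVars(~ marital + loan, data = bank_raw)   # one 0/1 indicator column per level
bank_dummies <- data.frame(predict(dv, newdata = bank_raw))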

Then, we transform the response variable into a binary value:

bank$y <- ifelse(bank$y =="yes", 1, 0)
rmarkdown::paged_table(head(bank,10))

The statistical summaries for each of the variables are shown below:

summary(bank)
      age         mar.married        mar.div         loan.no      
 Min.   :17.00   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
 1st Qu.:32.00   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.0000  
 Median :38.00   Median :1.0000   Median :0.000   Median :1.0000  
 Mean   :40.02   Mean   :0.6052   Mean   :0.112   Mean   :0.8243  
 3rd Qu.:47.00   3rd Qu.:1.0000   3rd Qu.:0.000   3rd Qu.:1.0000  
 Max.   :98.00   Max.   :1.0000   Max.   :1.000   Max.   :1.0000  
    loan.yes         duration           cpi          euribor3m    
 Min.   :0.0000   Min.   :   1.0   Min.   :92.20   Min.   :0.634  
 1st Qu.:0.0000   1st Qu.: 102.0   1st Qu.:93.08   1st Qu.:1.344  
 Median :0.0000   Median : 180.0   Median :93.75   Median :4.857  
 Mean   :0.1517   Mean   : 258.3   Mean   :93.58   Mean   :3.621  
 3rd Qu.:0.0000   3rd Qu.: 319.0   3rd Qu.:93.99   3rd Qu.:4.961  
 Max.   :1.0000   Max.   :4918.0   Max.   :94.77   Max.   :5.045  
       y         
 Min.   :0.0000  
 1st Qu.:0.0000  
 Median :0.0000  
 Mean   :0.1127  
 3rd Qu.:0.0000  
 Max.   :1.0000  

The correlations between the variables are visualized as follows:

corrplot::corrplot(cor(bank), method = "number")

From the correlation matrix, we can see a strong (negative) correlation between the two dummy variables for the loan status and a fairly strong one between CPI and the Euribor 3-month rate. Multicollinearity will be confirmed after the model building.

2.1.3 Model

We can build the logistic model in R with both the base stats package and the rms package; they yield the same result but present the output differently:

logit <- glm(y ~ ., data = bank, family = binomial(link = "logit"))
summary(logit)

Call:
glm(formula = y ~ ., family = binomial(link = "logit"), data = bank)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.8278  -0.4242  -0.1922  -0.1444   3.1817  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.918e+01  3.290e+00 -14.947  < 2e-16 ***
age          1.473e-02  1.742e-03   8.458  < 2e-16 ***
mar.married -3.261e-01  4.555e-02  -7.159 8.15e-13 ***
mar.div     -3.456e-01  7.133e-02  -4.845 1.27e-06 ***
loan.no      7.649e-02  1.284e-01   0.596    0.551    
loan.yes     1.124e-02  1.359e-01   0.083    0.934    
duration     4.421e-03  6.962e-05  63.500  < 2e-16 ***
cpi          5.072e-01  3.539e-02  14.333  < 2e-16 ***
euribor3m   -7.747e-01  1.399e-02 -55.371  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 28998  on 41183  degrees of freedom
Residual deviance: 19473  on 41175  degrees of freedom
AIC: 19491

Number of Fisher Scoring iterations: 6
library(rms)
logit.res <- lrm(y ~ ., data=bank, y = TRUE, x = TRUE)
print(logit.res)
Logistic Regression Model
 
 lrm(formula = y ~ ., data = bank, x = TRUE, y = TRUE)
 
                        Model Likelihood        Discrimination    Rank Discrim.    
                              Ratio Test               Indexes          Indexes    
 Obs         41184    LR chi2    9525.22        R2       0.409    C       0.904    
  0          36544    d.f.             8      R2(8,41184)0.206    Dxy     0.809    
  1           4640    Pr(> chi2) <0.0001    R2(8,12351.7)0.537    gamma   0.809    
 max |deriv| 3e-08                              Brier    0.071    tau-a   0.162    
 
             Coef     S.E.   Wald Z Pr(>|Z|)
 Intercept   -49.1799 3.2903 -14.95 <0.0001 
 age           0.0147 0.0017   8.46 <0.0001 
 mar.married  -0.3261 0.0456  -7.16 <0.0001 
 mar.div      -0.3456 0.0713  -4.85 <0.0001 
 loan.no       0.0765 0.1284   0.60 0.5514  
 loan.yes      0.0112 0.1359   0.08 0.9341  
 duration      0.0044 0.0001  63.50 <0.0001 
 cpi           0.5072 0.0354  14.33 <0.0001 
 euribor3m    -0.7747 0.0140 -55.37 <0.0001 
 

The logistic model could be written as:

\hat{\pi}(x)=\frac{\exp(-49.18+.015X_1-.326D_1-.346D_2+.076D_3+.011D_4+.004X_2+.507X_3-.775X_4)}{1+ \exp(-49.18+.015X_1-.326D_1-.346D_2+.076D_3+.011D_4+.004X_2+.507X_3-.775X_4)}

The above model could be rewritten in logit form as:

\ln\bigg(\frac{\hat{\pi}}{1-\hat{\pi}}\bigg) = -49.18+.015X_1-.326D_1-.346D_2+.076D_3+.011D_4+.004X_2+.507X_3-.775X_4

From the model output above, we can also see that the customer’s loan status does not significantly influence the customer’s decision on subscription. We could simply drop the variable from our model (a sketch follows below).
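A quick sketch of that drop using update(), with a likelihood-ratio test to confirm the loan dummies can be removed safely (not part of the original analysis):

logit_noloan <- update(logit, . ~ . - loan.no - loan.yes)
anova(logit_noloan, logit, test = "Chisq")   # a non-significant result means the drop is harmless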

2.1.3.1 Model Assumption Checking

Apart from the absence of multicollinearity between predictor variables, logistic regression has another simple assumption that needs to be noted: in the residuals vs. fitted plot, the smoothed residual line should be straight with a zero intercept. This assumption may be violated because of the nature of the categorical predictor variable(s) or extreme outliers present in the data.

Now, let’s check our first model’s assumptions.

car::vif(logit)
        age mar.married     mar.div     loan.no    loan.yes    duration 
   1.294038    1.459160    1.357024    6.575063    6.575264    1.180864 
        cpi   euribor3m 
   1.436980    1.638136 

For the multicollinearity check, all VIF values are below 5 except those of the two loan-status dummies (about 6.6), which is expected since they encode the same categorical variable; multicollinearity between the variables therefore seems to be absent.

plot(logit, 1)

In the residuals vs. fitted plot, it can be seen clearly that the intercept of the red line isn’t zero. There are a few points that might be outliers.

Because the model violates one of the assumptions, we can’t interpret it yet. We first need to determine whether those influential points are really outliers or not.

2.1.4 Outlier Checking

One of the tools we can use to check whether an observation is an outlier is a plot of Cook’s distance, which is defined as:

D_i = \frac{(y_i-\hat{y_i})^2}{(k+1)\times MSE} \bigg[\frac{h_{ii}}{(1-h_{ii})^2}\bigg]

Ideally, the Cook’s distance of each observation should be more or less 0, deviating by only around 0.005.
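
In R, Cook’s distances for a fitted glm can also be extracted directly; listing the largest values identifies the candidates flagged in the plot below:

cd <- cooks.distance(logit)
head(sort(cd, decreasing = TRUE), 5)   # the five most influential observations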

plot(logit, which = 4, id.n=5)

We can see clearly that there are around 5 outliers, one of them quite extreme. Given that the pool holds more than 40,000 observations in total, it is reasonable to remove these 5 outlying observations.

Another method, proposed by Gregory (2018), for handling outliers in logistic regression is to take the logarithm of each continuous variable. In the second model below, we combine both methods to find the best possible result.

2.2 Building Logistic Model Part 2

2.2.1 Model
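
The construction of the log-transformed data set bank2 is not shown in the original script. A minimal sketch, assuming base-10 logarithms of the continuous predictors (consistent with the str() output below):

bank2 <- with(bank, data.frame(logage = log10(age), mar.married, mar.div,
                               loan.no, loan.yes, logdur = log10(duration),
                               logcpi = log10(cpi), logeuribor = log10(euribor3m), y))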

bank2 <- bank2[-c(2314,19632,24090,36040,40534), ]
str(bank2)
tibble [41,179 × 9] (S3: tbl_df/tbl/data.frame)
 $ logage     : num [1:41179] 1.75 1.76 1.57 1.6 1.75 ...
 $ mar.married: num [1:41179] 1 1 1 1 1 1 1 1 0 0 ...
 $ mar.div    : num [1:41179] 0 0 0 0 0 0 0 0 0 0 ...
 $ loan.no    : num [1:41179] 1 1 1 1 0 1 1 1 1 1 ...
 $ loan.yes   : num [1:41179] 0 0 0 0 1 0 0 0 0 0 ...
 $ logdur     : num [1:41179] 2.42 2.17 2.35 2.18 2.49 ...
 $ logcpi     : num [1:41179] 1.97 1.97 1.97 1.97 1.97 ...
 $ logeuribor : num [1:41179] 0.686 0.686 0.686 0.686 0.686 ...
 $ y          : num [1:41179] 0 0 0 0 0 0 0 0 0 0 ...

Having deleted the 5 outliers, let’s move on to building the new model for our data.

logit2 <- glm(y ~ logage + mar.married + mar.div + logdur
              + logcpi + logeuribor, data = bank2, family = binomial(link = "logit"))
summary(logit2)

Call:
glm(formula = y ~ logage + mar.married + mar.div + logdur + logcpi + 
    logeuribor, family = binomial(link = "logit"), data = bank2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.8100  -0.3620  -0.1718  -0.0761   3.8934  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -190.92758   15.08435 -12.657  < 2e-16 ***
logage         0.56310    0.18089   3.113  0.00185 ** 
mar.married   -0.24019    0.04787  -5.017 5.24e-07 ***
mar.div       -0.22447    0.07363  -3.048  0.00230 ** 
logdur         4.62843    0.07228  64.031  < 2e-16 ***
logcpi        90.49778    7.66093  11.813  < 2e-16 ***
logeuribor    -4.31441    0.07564 -57.035  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 28997  on 41178  degrees of freedom
Residual deviance: 18103  on 41172  degrees of freedom
AIC: 18117

Number of Fisher Scoring iterations: 7
library(rms)
logit.res2 <- lrm(y ~ logage+mar.married+mar.div+logdur
              +logcpi+logeuribor, data=bank2, y = TRUE, x = TRUE)
print(logit.res2)
Logistic Regression Model
 
 lrm(formula = y ~ logage + mar.married + mar.div + logdur + logcpi + 
     logeuribor, data = bank2, x = TRUE, y = TRUE)
 
                         Model Likelihood        Discrimination    Rank Discrim.    
                               Ratio Test               Indexes          Indexes    
 Obs         41179    LR chi2    10893.91        R2       0.460    C       0.913    
  0          36539    d.f.              6      R2(6,41179)0.232    Dxy     0.826    
  1           4640    Pr(> chi2)  <0.0001    R2(6,12351.5)0.586    gamma   0.826    
 max |deriv| 1e-10                               Brier    0.068    tau-a   0.165    
 
             Coef      S.E.    Wald Z Pr(>|Z|)
 Intercept   -190.9276 15.0843 -12.66 <0.0001 
 logage         0.5631  0.1809   3.11 0.0019  
 mar.married   -0.2402  0.0479  -5.02 <0.0001 
 mar.div       -0.2245  0.0736  -3.05 0.0023  
 logdur         4.6284  0.0723  64.03 <0.0001 
 logcpi        90.4978  7.6609  11.81 <0.0001 
 logeuribor    -4.3144  0.0756 -57.04 <0.0001 
 

The logistic model after the transformation can be written as:

\hat{\pi}(x)=\frac{\exp(-190.93+.563\log X_1-.240D_1-.224D_2+4.628\log X_2+90.498\log X_3-4.314\log X_4)}{1+ \exp(-190.93+.563\log X_1-.240D_1-.224D_2+4.628\log X_2+90.498\log X_3-4.314\log X_4)}

The above model could be rewritten in logit form as:

\ln\bigg(\frac{\hat{\pi}}{1-\hat{\pi}}\bigg) = -190.93+.563\log X_1-.240D_1-.224D_2+4.628\log X_2+90.498\log X_3-4.314\log X_4

2.2.2 Assumption & Outlier Checking

plot(logit2, 1)

plot(logit2, which = 4, id.n=5)

We can see that even though there is a potential outlier, the red line in the residuals vs. fitted plot now has an intercept of nearly zero. The residual for observation 24,013 lies somewhat apart from the others, but not far enough away to be considered an outlier.

2.3 Model Interpretation

Interpretation of a logistic model is usually assisted by the odds ratios (OR), which can be retrieved from the model by exponentiating the coefficients:

beta <- coef(logit2)
OR_beta <- exp(beta)
conf <- exp(confint(logit2))
knitr::kable(cbind(beta,OR_beta,conf))
                     beta        OR_beta         2.5 %          97.5 %
(Intercept)  -190.9275762   0.000000e+00   0.000000e+00    0.000000e+00
logage          0.5630979   1.756104e+00   1.231879e+00    2.503443e+00
mar.married    -0.2401897   7.864787e-01   7.160927e-01    8.639204e-01
mar.div        -0.2244652   7.989434e-01   6.911455e-01    9.224412e-01
logdur          4.6284294   1.023532e+02   8.891799e+01    1.180467e+02
logcpi         90.4977845   2.007652e+39   6.073295e+32    6.704942e+45
logeuribor     -4.3144130   1.337440e-02   1.152390e-02    1.550200e-02

The logistic model can be interpreted with the help of the odds ratios. Because every continuous predictor enters the model as a base-10 logarithm, its OR applies per unit increase in \log_{10} X, i.e., per tenfold increase in the original variable:

  • A tenfold increase in the customer’s age multiplies the odds of subscribing by about 1.76 (OR = 1.756).

  • A married customer has about 21.3% lower odds of subscribing to the marketed product than a single customer (OR = 0.786).

  • A divorced customer has about 20.1% lower odds of subscribing than a single customer (OR = 0.799).

  • A tenfold increase in contact duration multiplies the odds of subscribing by about 102 (OR = 102.35).

  • The OR for CPI (about 2.0 × 10^{39} per tenfold increase) is astronomically large because CPI only ranges from about 92.2 to 94.8 in the data; even small relative increases in CPI are associated with sharply higher odds of subscribing.

  • A tenfold increase in the Euribor 3-month rate multiplies the odds of subscribing by about 0.013 (OR = 0.0134), a decrease of nearly 99%.
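
To make these odds-ratio statements concrete, a hedged sketch: predicted subscription probabilities for two hypothetical customers who differ only in contact duration (all covariate values below are illustrative):

new <- data.frame(logage = log10(35), mar.married = 1, mar.div = 0,
                  logdur = log10(c(100, 1000)),   # a 100 s call vs. a 1000 s call
                  logcpi = log10(93.5), logeuribor = log10(1.3))
predict(logit2, newdata = new, type = "response")  # the odds differ by a factor of about 102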

3 MODEL EVALUATION

3.1 Pseudo R2

pscl::pR2(logit2)
fitting null model for pseudo-r2
          llh       llhNull            G2      McFadden          r2ML 
-9.051330e+03 -1.449829e+04  1.089391e+04  3.756966e-01  2.324489e-01 
         r2CU 
 4.598616e-01 

The McFadden R² is 0.376. Although pseudo-R² values look small compared with an OLS R², McFadden values between roughly 0.2 and 0.4 are generally regarded as indicating a good fit, so the model retains reasonable explanatory power.

3.2 ROC Curve

library(pROC)
yp_hat <- fitted.values(logit2)
bank2$yp_hat <- yp_hat
roc(bank2$y~yp_hat,plot=T,print.auc=T)


Call:
roc.formula(formula = bank2$y ~ yp_hat, plot = T, print.auc = T)

Data: yp_hat in 36539 controls (bank2$y 0) < 4640 cases (bank2$y 1).
Area under the curve: 0.9129

With an AUC of 0.913, the model clearly fits the data well, and the ROC curve closely approaches that of a perfect model. Since the AUC is very close to the perfect condition, i.e., equal to 1, the model does a good job of discriminating between the two categories of our response variable.
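
As a hedged supplement to the ROC curve, the classification table at an illustrative 0.5 cutoff shows how this discrimination translates into actual predictions:

pred_class <- ifelse(bank2$yp_hat > 0.5, 1, 0)    # 0.5 is an illustrative threshold
table(actual = bank2$y, predicted = pred_class)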

3.3 Variable Importance

caret::varImp(logit2)
              Overall
logage       3.112887
mar.married  5.017211
mar.div      3.048433
logdur      64.030640
logcpi      11.812896
logeuribor  57.035158

We can see that the most important variable affecting whether a customer subscribes to the marketed product is the duration of the contact, followed by the Euribor 3-month rate. The variables that come from personal data do not carry much importance in our model.

4 CONCLUSION

The logistic regression model built for the telemarketing data with a statistical learning approach has proved quite good at discriminating between the two categories of the response and shows reasonable predictive power, as indicated by the AUC of the ROC curve.

However, keep in mind that in a machine learning approach there is a chance the model overfits the data, which would make it generalize poorly to new observations. A test for overfitting or underfitting, e.g., evaluating the model on a held-out test set, is needed to complete the analysis (a sketch follows below).
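
A minimal sketch of such a hold-out check (the split ratio and seed are illustrative, not part of the original analysis):

set.seed(123)
idx  <- sample(nrow(bank2), 0.7 * nrow(bank2))   # 70/30 train/test split
fit  <- glm(y ~ logage + mar.married + mar.div + logdur + logcpi + logeuribor,
            data = bank2[idx, ], family = binomial(link = "logit"))
p_te <- predict(fit, newdata = bank2[-idx, ], type = "response")
pROC::roc(bank2$y[-idx], p_te)$auc               # compare with the training AUC of 0.913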

It might also be a good idea to explore other logistic-related models that could prove a better fit for the data.