library(readr)
bank <- read_delim("D:/Tugas/SEM 5/Stat Sos/Presentasi 3/bank.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
bank <- as.data.frame(bank)
rmarkdown::paged_table(head(bank,10))
1 INTRODUCTION
1.1 Case Background
A bank is a financial institution that deals in money and money substitutes and provides other money-related services such as savings deposits, checking accounts, loans, etc. A bank maintains a customer database that provides insights and information used to maintain the relationship between the customers and the bank itself.
As a financial institution, a bank depends entirely on its customers, so marketing plays an important part in the bank's business process. One of the techniques carried out by banks is telemarketing.
Telemarketing uses the telephone, as its name suggests, to convey information, especially about products, to customers. Well-organized telemarketing is not only successful in offering products, but can also be used for market research or simply to collect accurate information about customers. Telemarketing is quite effective compared to traditional marketing techniques, even e-mail campaigns, because it guarantees active communication between the company's marketers and its customers.
The Bank Telemarketing dataset consists of 41,184 observations with a large number of variables. In this case, we only use 6 predictor variables and a binary response variable; 2 of the predictors are categorical and enter the model as dummy variables. The dataset is retrieved from the UCI Machine Learning Repository with acknowledgement to Moro and Cortez.
The variables used are:
Customer’s Age
Customer’s Marital Status (a categorical variable with 3 categories: married, divorced, single).
Customer’s Loan Status (a categorical variable with 3 categories: yes, no, unknown).
Contact Duration
Consumer Price Index
Euribor 3 Month Rate
Customer’s Decision on Product Subscription
1.2 Binary Logistic Regression
1.2.1 Logistic Model
Logistic regression is a statistical analysis method that describes the relationship between a categorical response variable and one or more predictor variable(s). In outline, if the response variable consists of only 2 categories, binary logistic regression is used. The binary response follows a Bernoulli distribution:
f(y_i)=\pi_i^{y_i} (1-\pi_i)^{1-y_i}
where:
\pi_i: the event probability for the i-th observation
y_i: the i-th random variable, taking the values 0 and 1
The model for logistic regression is written as:
\pi(x)=\frac{\exp(\beta_0+\beta_1x_1+\cdots+\beta_px_p)}{1+ \exp(\beta_0+\beta_1x_1+\cdots+\beta_px_p) }
The previous model is hard to interpret directly. Hence, the model is transformed and rewritten in the logit form of logistic regression as follows:
g(x)=\ln\bigg[\frac{\pi(x)}{1-\pi(x)}\bigg] = \beta_0+\beta_1x_1+\cdots+\beta_px_p
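To make the link between the two forms concrete, here is a minimal sketch in R with made-up coefficients (the beta and x values below are purely illustrative):
beta <- c(-1.5, 0.8, -0.3)         # hypothetical beta_0, beta_1, beta_2
x    <- c(1, 2.0, 4.5)             # 1 for the intercept, then two predictor values
eta  <- sum(beta * x)              # the logit g(x)
pi_x <- exp(eta) / (1 + exp(eta))  # the event probability pi(x)
all.equal(pi_x, plogis(eta))       # plogis() is R's built-in inverse logit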
1.2.2 Parameter Significance Test
Hypothesis for simultaneous testing:
H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 vs.
H_1: At least one \beta_j \neq 0.
Simultaneous testing could be done by Likelihood Ratio Test with G Statistics Ratio:
G = -2 \ln \bigg[\frac{\big(\frac{n_1}{n}\big)^{n_1}\big(\frac{n_0}{n}\big)^{n_0}}{\prod^n_{i=1}\hat{\pi}_i^{y_i}(1-\hat{\pi_i})^{1-y_i}}\bigg] = -2\ln \bigg[\frac{L_0}{L_1}\bigg]
where:
n_1: The number of observation on category 1
n_0: The number of observation on category 0
L_0: Likelihood without any predictor variables model
L_1: Likelihood with predictor variables model
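In R, G can be computed from the null and residual deviances of a fitted binomial glm; a minimal sketch, with fit as a placeholder name for such a model:
G  <- fit$null.deviance - fit$deviance  # -2 ln(L0/L1)
df <- fit$df.null - fit$df.residual     # number of tested predictors
pchisq(G, df = df, lower.tail = FALSE)  # p-value against chi-square(df)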
Partial (per-coefficient) testing can be done by the Wald test with the W test statistic:
W = \frac{\hat{\beta}_j}{\sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_j)}} = \frac{\hat{\beta}_j}{\widehat{\mathrm{SE}}(\hat{\beta}_j)}
with hypothesis:
H_0: \beta_j = 0 vs.
H_1: \beta_j \neq 0.
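A minimal sketch of the Wald statistics in R (again with fit as a placeholder for a fitted glm); these are the z values that summary() reports:
est <- coef(fit)                       # beta-hat
se  <- sqrt(diag(vcov(fit)))           # estimated standard errors
W   <- est / se                        # Wald statistics, one per coefficient
2 * pnorm(abs(W), lower.tail = FALSE)  # two-sided p-values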
2 MODEL BUILDING PROCESS
2.1 Building Logistic Model Part 1
2.1.1 Variables Description
Now we will build a logistic regression model for our data, to see whether we can predict if a customer subscribes to the offered service or not.
As stated in the previous part, we have 2 categorical predictors. Because each of them has 3 categories, each one is encoded as 2 dummy variables. In summary, the variables present in the model are:
X_1: Customer’s Age
D_1: Married Customer
D_2: Divorced Customer
D_3: Customer without Loans
D_4: Customer with Loans
X_2: Contact Duration
X_3: Consumer Price Index
X_4: Euribor 3 Month Rate
When D_1 and D_2 are 0, the customer’s marital status is single.
When D_3 and D_4 are 0, the customer’s loan status is unknown.
2.1.2 Data Preprocessing & Exploration
The first 10 observations are shown below:
Note: the preprocessing for the dummy variables was done in Excel. One might also do the preprocessing directly in R, for instance with the help of the caret package.
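For reference, a sketch of the same dummy coding in base R, assuming a raw version of the file that still carries the original marital and loan columns (column names as in the UCI documentation):
bank$mar.married <- as.numeric(bank$marital == "married")
bank$mar.div     <- as.numeric(bank$marital == "divorced")
bank$loan.no     <- as.numeric(bank$loan == "no")
bank$loan.yes    <- as.numeric(bank$loan == "yes")
# caret::dummyVars() or stats::model.matrix() can build the full coding in one call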
Then, we transform the response variable into a binary value:
bank$y <- ifelse(bank$y =="yes", 1, 0)
rmarkdown::paged_table(head(bank,10))
The statistical summaries for each of the variables are shown below:
summary(bank)
      age        mar.married      mar.div        loan.no
Min. :17.00 Min. :0.0000 Min. :0.000 Min. :0.0000
1st Qu.:32.00 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.0000
Median :38.00 Median :1.0000 Median :0.000 Median :1.0000
Mean :40.02 Mean :0.6052 Mean :0.112 Mean :0.8243
3rd Qu.:47.00 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.:1.0000
Max. :98.00 Max. :1.0000 Max. :1.000 Max. :1.0000
loan.yes duration cpi euribor3m
Min. :0.0000 Min. : 1.0 Min. :92.20 Min. :0.634
1st Qu.:0.0000 1st Qu.: 102.0 1st Qu.:93.08 1st Qu.:1.344
Median :0.0000 Median : 180.0 Median :93.75 Median :4.857
Mean :0.1517 Mean : 258.3 Mean :93.58 Mean :3.621
3rd Qu.:0.0000 3rd Qu.: 319.0 3rd Qu.:93.99 3rd Qu.:4.961
Max. :1.0000 Max. :4918.0 Max. :94.77 Max. :5.045
y
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.1127
3rd Qu.:0.0000
Max. :1.0000
The correlations between the variables are visualized as follows:
corrplot::corrplot(cor(bank), method = "number")
From the correlation matrix, we can see a strong correlation between the 2 dummy variables for loan status and a somewhat strong one between CPI and the Euribor 3-month rate. Multicollinearity will be formally checked after the model building.
2.1.3 Model
We can build the logistic model in R with both the base stats package and the rms package, which yield the same result in different layouts:
logit <- glm(y ~ ., data = bank, family = binomial(link = "logit"))
summary(logit)
Call:
glm(formula = y ~ ., family = binomial(link = "logit"), data = bank)
Deviance Residuals:
Min 1Q Median 3Q Max
-5.8278 -0.4242 -0.1922 -0.1444 3.1817
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.918e+01 3.290e+00 -14.947 < 2e-16 ***
age 1.473e-02 1.742e-03 8.458 < 2e-16 ***
mar.married -3.261e-01 4.555e-02 -7.159 8.15e-13 ***
mar.div -3.456e-01 7.133e-02 -4.845 1.27e-06 ***
loan.no 7.649e-02 1.284e-01 0.596 0.551
loan.yes 1.124e-02 1.359e-01 0.083 0.934
duration 4.421e-03 6.962e-05 63.500 < 2e-16 ***
cpi 5.072e-01 3.539e-02 14.333 < 2e-16 ***
euribor3m -7.747e-01 1.399e-02 -55.371 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 28998 on 41183 degrees of freedom
Residual deviance: 19473 on 41175 degrees of freedom
AIC: 19491
Number of Fisher Scoring iterations: 6
library(rms)
logit.res <- lrm(y ~ ., data=bank, y = TRUE, x = TRUE)
print(logit.res)
Logistic Regression Model
lrm(formula = y ~ ., data = bank, x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 41184 LR chi2 9525.22 R2 0.409 C 0.904
0 36544 d.f. 8 R2(8,41184)0.206 Dxy 0.809
1 4640 Pr(> chi2) <0.0001 R2(8,12351.7)0.537 gamma 0.809
max |deriv| 3e-08 Brier 0.071 tau-a 0.162
Coef S.E. Wald Z Pr(>|Z|)
Intercept -49.1799 3.2903 -14.95 <0.0001
age 0.0147 0.0017 8.46 <0.0001
mar.married -0.3261 0.0456 -7.16 <0.0001
mar.div -0.3456 0.0713 -4.85 <0.0001
loan.no 0.0765 0.1284 0.60 0.5514
loan.yes 0.0112 0.1359 0.08 0.9341
duration 0.0044 0.0001 63.50 <0.0001
cpi 0.5072 0.0354 14.33 <0.0001
euribor3m -0.7747 0.0140 -55.37 <0.0001
The logistic model can be written as:
\hat{\pi}(x)=\frac{\exp(-49.18+.015X_1-.326D_1-.346D_2+.076D_3+.011D_4+.004X_2+.507X_3-.775X_4)}{1+ \exp(-49.18+.015X_1-.326D_1-.346D_2+.076D_3+.011D_4+.004X_2+.507X_3-.775X_4)}
The above model formula can be rewritten in logit form as:
\ln\bigg(\frac{\hat{\pi}}{1-\hat{\pi}}\bigg) = -49.18+.015X_1-.326D_1-.346D_2+.076D_3+.011D_4+.004X_2+.507X_3-.775X_4
From the model output above, we can also see that the customer's loan status does not significantly influence the customer's decision on subscription. We could simply drop the variable from our model, a choice that can be checked with a likelihood ratio test as sketched below.
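A minimal sketch of that check, comparing the full model against one without the two loan dummies:
logit.red <- update(logit, . ~ . - loan.no - loan.yes)
anova(logit.red, logit, test = "LRT")  # a large p-value supports dropping loan status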
2.1.3.1 Model Assumption Checking
Apart from the absence of multicollinearity between the predictor variables, logistic regression has another simple assumption that needs to be noted: the residuals plotted against the fitted probabilities should follow a straight line with zero intercept. This assumption may be violated because of the nature of the categorical predictor variable(s) or extreme outliers present in the data.
Now, let's check the assumptions of our first model.
car::vif(logit)
        age mar.married     mar.div     loan.no    loan.yes    duration
1.294038 1.459160 1.357024 6.575063 6.575264 1.180864
cpi euribor3m
1.436980 1.638136
For multicollinearity checking, only loan.no and loan.yes have VIF values above 5, which is expected since they are complementary dummies of the same categorical variable; all the other VIF values are below 5, so problematic multicollinearity between the predictors appears to be absent.
plot(logit, 1)
In the residuals vs. fitted plot, it can be seen clearly that the intercept of the red line isn't zero, and a few points might be outliers.
Because the model violates one of the assumptions, we can't interpret it yet. We first need to determine whether those influential points really are outliers.
2.1.4 Outlier Checking
One tool we can use to check whether an observation is an outlier is the Cook's distance plot, where Cook's distance is defined as:
D_i = \frac{(y_i-\hat{y_i})^2}{(k+1)\times MSE} \bigg[\frac{h_{ii}}{(1-h_{ii})^2}\bigg]
Ideally, the Cook's distance of each observation should be more or less 0, deviating by no more than about 0.005.
plot(logit, which = 4, id.n=5)
We can see clearly that there are around 5 outliers, one of which is quite extreme. Given the roughly 41,000 observations available, it is reasonable to remove these 5 outlying observations from the data pool, as identified below.
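A minimal sketch to list the most influential observations by Cook's distance (the indices removed in the next section were presumably obtained this way):
cooksd <- cooks.distance(logit)
head(sort(cooksd, decreasing = TRUE), 5)  # the 5 largest Cook's distances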
Another method for handling outliers in logistic regression, proposed by Gregory (2018), is to take the logarithm of each continuous variable. In the second model, we combine both methods to find the best possible result.
2.2 Building Logistic Model Part 2
2.2.1 Model
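The chunk below starts from a bank2 object whose construction is not shown; judging from the str() output that follows (e.g. logage ≈ 1.75 for an age of 56), base-10 logarithms were used. A minimal sketch of how bank2 could be rebuilt under that assumption:
library(tibble)
bank2 <- tibble(
  logage      = log10(bank$age),         # base-10 log assumed throughout
  mar.married = bank$mar.married,
  mar.div     = bank$mar.div,
  loan.no     = bank$loan.no,
  loan.yes    = bank$loan.yes,
  logdur      = log10(bank$duration),
  logcpi      = log10(bank$cpi),
  logeuribor  = log10(bank$euribor3m),
  y           = bank$y
)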
bank2 <- bank2[-c(2314,19632,24090,36040,40534), ]
str(bank2)
tibble [41,179 × 9] (S3: tbl_df/tbl/data.frame)
$ logage : num [1:41179] 1.75 1.76 1.57 1.6 1.75 ...
$ mar.married: num [1:41179] 1 1 1 1 1 1 1 1 0 0 ...
$ mar.div : num [1:41179] 0 0 0 0 0 0 0 0 0 0 ...
$ loan.no : num [1:41179] 1 1 1 1 0 1 1 1 1 1 ...
$ loan.yes : num [1:41179] 0 0 0 0 1 0 0 0 0 0 ...
$ logdur : num [1:41179] 2.42 2.17 2.35 2.18 2.49 ...
$ logcpi : num [1:41179] 1.97 1.97 1.97 1.97 1.97 ...
$ logeuribor : num [1:41179] 0.686 0.686 0.686 0.686 0.686 ...
$ y : num [1:41179] 0 0 0 0 0 0 0 0 0 0 ...
We have now deleted the 5 outliers; let's move on to building the new model for our data.
logit2 <- glm(y ~ logage + mar.married + mar.div + logdur
              + logcpi + logeuribor, data = bank2, family = binomial(link = "logit"))
summary(logit2)
Call:
glm(formula = y ~ logage + mar.married + mar.div + logdur + logcpi +
logeuribor, family = binomial(link = "logit"), data = bank2)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8100 -0.3620 -0.1718 -0.0761 3.8934
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -190.92758 15.08435 -12.657 < 2e-16 ***
logage 0.56310 0.18089 3.113 0.00185 **
mar.married -0.24019 0.04787 -5.017 5.24e-07 ***
mar.div -0.22447 0.07363 -3.048 0.00230 **
logdur 4.62843 0.07228 64.031 < 2e-16 ***
logcpi 90.49778 7.66093 11.813 < 2e-16 ***
logeuribor -4.31441 0.07564 -57.035 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 28997 on 41178 degrees of freedom
Residual deviance: 18103 on 41172 degrees of freedom
AIC: 18117
Number of Fisher Scoring iterations: 7
library(rms)
logit.res2 <- lrm(y ~ logage+mar.married+mar.div+logdur
+logcpi+logeuribor, data=bank2, y = TRUE, x = TRUE)
print(logit.res2)
Logistic Regression Model
lrm(formula = y ~ logage + mar.married + mar.div + logdur + logcpi +
logeuribor, data = bank2, x = TRUE, y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 41179 LR chi2 10893.91 R2 0.460 C 0.913
0 36539 d.f. 6 R2(6,41179)0.232 Dxy 0.826
1 4640 Pr(> chi2) <0.0001 R2(6,12351.5)0.586 gamma 0.826
max |deriv| 1e-10 Brier 0.068 tau-a 0.165
Coef S.E. Wald Z Pr(>|Z|)
Intercept -190.9276 15.0843 -12.66 <0.0001
logage 0.5631 0.1809 3.11 0.0019
mar.married -0.2402 0.0479 -5.02 <0.0001
mar.div -0.2245 0.0736 -3.05 0.0023
logdur 4.6284 0.0723 64.03 <0.0001
logcpi 90.4978 7.6609 11.81 <0.0001
logeuribor -4.3144 0.0756 -57.04 <0.0001
The logistic model after the log transformation can be written as:
\hat{\pi}(x)=\frac{\exp(-190.93+.563\log X_1-.240D_1-.224D_2+4.628\log X_2+90.498\log X_3-4.314\log X_4)}{1+ \exp(-190.93+.563\log X_1-.240D_1-.224D_2+4.628\log X_2+90.498\log X_3-4.314\log X_4)}
The above model formula can be rewritten in logit form as:
\ln\bigg(\frac{\hat{\pi}}{1-\hat{\pi}}\bigg) = -190.93+.563\log X_1-.240D_1-.224D_2+4.628\log X_2+90.498\log X_3-4.314\log X_4
2.2.2 Assumption & Outlier Checking
plot(logit2, 1)
plot(logit2, which = 4, id.n=5)
We can see that even though there is one potential outlier, the intercept of the residuals vs. fitted line stays nearly at zero. The residual for observation number 24,013 is somewhat far from the others, but it is not far enough to be considered an outlier.
2.3 Model Interpretation
Interpretation of a logistic model is usually assisted by the odds ratios (OR) from the model. The ORs can be retrieved from the model with:
beta <- coef(logit2)
OR_beta <- exp(beta)
conf <- exp(confint(logit2))
knitr::kable(cbind(beta,OR_beta,conf))
| | beta | OR_beta | 2.5 % | 97.5 % |
|---|---|---|---|---|
| (Intercept) | -190.9275762 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| logage | 0.5630979 | 1.756104e+00 | 1.231879e+00 | 2.503443e+00 |
| mar.married | -0.2401897 | 7.864787e-01 | 7.160927e-01 | 8.639204e-01 |
| mar.div | -0.2244652 | 7.989434e-01 | 6.911455e-01 | 9.224412e-01 |
| logdur | 4.6284294 | 1.023532e+02 | 8.891799e+01 | 1.180467e+02 |
| logcpi | 90.4977845 | 2.007652e+39 | 6.073295e+32 | 6.704942e+45 |
| logeuribor | -4.3144130 | 1.337440e-02 | 1.152390e-02 | 1.550200e-02 |
The logistic model can be interpreted with the help of the odds ratios. Since every continuous predictor enters the model as a base-10 logarithm, each of their ORs corresponds to a tenfold increase in the original variable. The interpretations of the OR for each variable are as follows:
A tenfold increase in the customer's age multiplies the odds of subscribing to the product by about 1.756, i.e. the odds increase by roughly 75.6%.
A married customer has about 21.4% lower odds of subscribing to the marketed product (OR 0.786) than a single customer.
A divorced customer has about 20.1% lower odds of subscribing (OR 0.799) than a single customer.
A tenfold increase in contact duration multiplies the odds of subscribing by about 102.
A tenfold increase in the CPI multiplies the odds by about 2.0 \times 10^{39}; this astronomical OR is an artifact of the very narrow range of CPI in the data (about 92.2 to 94.8) rather than a meaningful effect size.
A tenfold increase in the Euribor 3-month rate multiplies the odds of subscribing by about 0.013, i.e. decreases them by roughly 98.7%. A worked prediction for a hypothetical customer is sketched below.
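As a sanity check on these ORs, we can compute the predicted subscription probability for one hypothetical customer; the profile below (age 35, single, a 300-second call, CPI 93.5, Euribor at 4.0) is made up for illustration:
newcust <- data.frame(
  logage = log10(35), mar.married = 0, mar.div = 0,
  logdur = log10(300), logcpi = log10(93.5), logeuribor = log10(4.0)
)
predict(logit2, newdata = newcust, type = "response")  # estimated P(subscribe)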
3 MODEL EVALUATION
3.1 Pseudo R2
pscl::pR2(logit2)
fitting null model for pseudo-r2
llh llhNull G2 McFadden r2ML
-9.051330e+03 -1.449829e+04 1.089391e+04 3.756966e-01 2.324489e-01
r2CU
4.598616e-01
The McFadden R² of about 0.38 looks small compared to an OLS R², but pseudo-R² values are scaled differently; values in the 0.2-0.4 range are generally taken to indicate a good fit.
3.2 ROC Curve
library(pROC)
yp_hat <- fitted.values(logit2)
bank2$yp_hat <- yp_hat
roc(bank2$y~yp_hat,plot=T,print.auc=T)
Call:
roc.formula(formula = bank2$y ~ yp_hat, plot = T, print.auc = T)
Data: yp_hat in 36539 controls (bank2$y 0) < 4640 cases (bank2$y 1).
Area under the curve: 0.9129
Clearly the model fits the data quite well, with an AUC of 0.913. The ROC curve also closely follows that of a perfect classifier, whose AUC equals 1. This means the model does a good job of discriminating between the two categories of our response variable. If a hard classification is needed, a cutoff can be chosen from the curve as sketched below.
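A minimal sketch using pROC's coords() with Youden's index (roc_obj is our own name; the original chunk did not store the roc object):
roc_obj <- roc(bank2$y ~ yp_hat)
coords(roc_obj, "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))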
3.3 Variable Importance
caret::varImp(logit2)
              Overall
logage 3.112887
mar.married 5.017211
mar.div 3.048433
logdur 64.030640
logcpi 11.812896
logeuribor 57.035158
We can see that the most important variable affecting whether a customer subscribes to the newly marketed product is the contact duration, followed by the Euribor 3-month rate. The variables drawn from personal data carry much less importance in our model.
4 CONCLUSION
The logistic regression model built for the telemarketing data with a statistical learning approach has proved quite good at discriminating between the two response categories and has reasonably good predictive power, as shown by the AUC of the ROC curve.
However, keep in mind that with a machine learning approach there is a chance that the model overfits the data, which would make it inflexible with new observations. A test for overfitting or underfitting is needed to complete the model; one possible check is sketched below.
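One possible check, a sketch rather than a definitive procedure, is bootstrap validation of the rms fit (which works here because x = TRUE and y = TRUE were set when fitting logit.res2):
set.seed(123)  # arbitrary seed for reproducibility
validate(logit.res2, method = "boot", B = 100)  # optimism-corrected indexes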
It might also be a good idea to explore other logistic-related models that could prove a better fit for the data.