Marketing campaigns can be expensive, so knowing which customers to target is essential to running an efficient campaign. Given information about a client, can we predict whether that client will respond positively to the product being marketed? In the case presented, the target variable (the client’s response) is either yes or no. When predicting binary outcomes instead of continuous values, we must turn to an alternative regression technique, namely logistic regression. Logistic regression estimates the probability of an observation belonging to the target class, which in this case is whether a bank customer will subscribe to a term deposit. The data is available from the UCI Machine Learning Repository.
The dataset is composed of 4521 observations, with 7 numeric variables and 10 character variables. The latter will have to be converted to factors instead of characters for use in regression analysis. Some of the independent variables are bank client data such as age, job, education or balance while others are related to the last contact with the current marketing campaign. There are no missing values and we proceed to take a look at the data distributions.
| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
| 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
| 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
| 35 | management | single | tertiary | no | 747 | no | no | cellular | 23 | feb | 141 | 2 | 176 | 3 | failure | no |
| 36 | self-employed | married | tertiary | no | 307 | yes | no | cellular | 14 | may | 341 | 1 | 330 | 2 | other | no |
| 39 | technician | married | secondary | no | 147 | yes | no | cellular | 6 | may | 151 | 2 | -1 | 0 | unknown | no |
| 41 | entrepreneur | married | tertiary | no | 221 | yes | no | unknown | 14 | may | 57 | 2 | -1 | 0 | unknown | no |
| 43 | services | married | primary | no | -88 | yes | yes | cellular | 17 | apr | 313 | 1 | 147 | 2 | failure | no |
| 39 | services | married | secondary | no | 9374 | yes | no | unknown | 20 | may | 273 | 1 | -1 | 0 | unknown | no |
| 43 | admin. | married | secondary | no | 264 | yes | no | cellular | 17 | apr | 113 | 2 | -1 | 0 | unknown | no |
| 36 | technician | married | tertiary | no | 1109 | no | no | cellular | 13 | aug | 328 | 2 | -1 | 0 | unknown | no |
| 20 | student | single | secondary | no | 502 | no | no | cellular | 30 | apr | 261 | 1 | -1 | 0 | unknown | yes |
| 31 | blue-collar | married | secondary | no | 360 | yes | yes | cellular | 29 | jan | 89 | 1 | 241 | 1 | failure | no |
| 40 | management | married | tertiary | no | 194 | no | yes | cellular | 29 | aug | 189 | 2 | -1 | 0 | unknown | no |
| 56 | technician | married | secondary | no | 4073 | no | no | cellular | 27 | aug | 239 | 5 | -1 | 0 | unknown | no |
| 37 | admin. | single | tertiary | no | 2317 | yes | no | cellular | 20 | apr | 114 | 1 | 152 | 2 | failure | no |
| 25 | blue-collar | single | primary | no | -221 | yes | no | unknown | 23 | may | 250 | 1 | -1 | 0 | unknown | no |
| 31 | services | married | secondary | no | 132 | no | no | cellular | 7 | jul | 148 | 1 | 152 | 1 | other | no |
| Name | data |
|---|---|
| Number of rows | 4521 |
| Number of columns | 17 |
| Column type frequency: | |
| character | 10 |
| numeric | 7 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| job | 0 | 1 | 6 | 13 | 0 | 12 | 0 |
| marital | 0 | 1 | 6 | 8 | 0 | 3 | 0 |
| education | 0 | 1 | 7 | 9 | 0 | 4 | 0 |
| default | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| housing | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| loan | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| contact | 0 | 1 | 7 | 9 | 0 | 3 | 0 |
| month | 0 | 1 | 3 | 3 | 0 | 12 | 0 |
| poutcome | 0 | 1 | 5 | 7 | 0 | 4 | 0 |
| y | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 41.17 | 10.58 | 19 | 33 | 39 | 49 | 87 | ▅▇▅▁▁ |
| balance | 0 | 1 | 1422.66 | 3009.64 | -3313 | 69 | 444 | 1480 | 71188 | ▇▁▁▁▁ |
| day | 0 | 1 | 15.92 | 8.25 | 1 | 9 | 16 | 21 | 31 | ▆▆▇▅▆ |
| duration | 0 | 1 | 263.96 | 259.86 | 4 | 104 | 185 | 329 | 3025 | ▇▁▁▁▁ |
| campaign | 0 | 1 | 2.79 | 3.11 | 1 | 1 | 2 | 3 | 50 | ▇▁▁▁▁ |
| pdays | 0 | 1 | 39.77 | 100.12 | -1 | -1 | -1 | -1 | 871 | ▇▁▁▁▁ |
| previous | 0 | 1 | 0.54 | 1.69 | 0 | 0 | 0 | 0 | 25 | ▇▁▁▁▁ |
As mentioned earlier, the categorical data needs to be converted into factors in order to be valid input for regression. We also mutate the response variable y from ‘yes’ and ‘no’ to 1 and 0. The data is fed into the glm function for generalized linear models, which is appropriate for distributions belonging to the exponential family. We set the family to binomial and specify the logit link function, whose inverse (the sigmoid) maps the linear combination of the inputs to a continuous value bounded between 0 and 1.
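The preparation and fitting steps described above might look roughly like the following. This is a minimal sketch on a small simulated data frame, not the bank data itself: the object names `toy` and `fit`, the simulated columns, the coefficients used to generate the response and the seed are all assumptions made for illustration.

```r
# Minimal sketch on simulated data (NOT the bank dataset); all names are illustrative.
set.seed(1)
n <- 500
toy <- data.frame(
  age      = round(runif(n, 19, 87)),
  housing  = sample(c("yes", "no"), n, replace = TRUE),
  duration = round(rexp(n, rate = 1 / 260)),
  stringsAsFactors = FALSE
)

# Simulate a yes/no response whose log-odds rise with call duration
p <- plogis(-3 + 0.004 * toy$duration - 0.3 * (toy$housing == "yes"))
toy$y <- ifelse(rbinom(n, 1, p) == 1, "yes", "no")

# Convert every character predictor (everything except y) to a factor
is_chr <- vapply(toy, is.character, logical(1)) & names(toy) != "y"
toy[is_chr] <- lapply(toy[is_chr], factor)

# Recode the response from "yes"/"no" to 1/0
toy$y <- ifelse(toy$y == "yes", 1, 0)

# Fit a logistic regression: binomial family with the logit link
fit <- glm(y ~ ., data = toy, family = binomial(link = "logit"))

# The inverse logit (sigmoid) maps the linear predictor to a probability
eta  <- predict(fit, type = "link")      # log-odds
prob <- predict(fit, type = "response")  # fitted probabilities
all.equal(unname(prob), unname(plogis(eta)))
```

The last line checks that the fitted probabilities are exactly the sigmoid of the linear predictor, which is what bounds them between 0 and 1.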
From the summary of the trained model, we observe a number of significant predictors with positive and negative coefficients:
##
## Call:
## glm(formula = y ~ ., family = binomial(link = "logit"), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.0169 -0.3814 -0.2567 -0.1579 3.0346
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.462e+00 6.038e-01 -4.077 4.55e-05 ***
## age -4.232e-03 7.125e-03 -0.594 0.552537
## jobblue-collar -3.924e-01 2.420e-01 -1.621 0.104937
## jobentrepreneur -2.498e-01 3.811e-01 -0.655 0.512199
## jobhousemaid -3.530e-01 4.176e-01 -0.845 0.398000
## jobmanagement -7.302e-02 2.407e-01 -0.303 0.761602
## jobretired 6.315e-01 3.112e-01 2.029 0.042454 *
## jobself-employed -1.812e-01 3.533e-01 -0.513 0.608167
## jobservices -1.457e-01 2.729e-01 -0.534 0.593542
## jobstudent 3.784e-01 3.750e-01 1.009 0.312958
## jobtechnician -1.926e-01 2.301e-01 -0.837 0.402496
## jobunemployed -6.395e-01 4.214e-01 -1.518 0.129138
## jobunknown 5.207e-01 5.853e-01 0.890 0.373669
## maritalmarried -4.696e-01 1.743e-01 -2.694 0.007058 **
## maritalsingle -3.051e-01 2.038e-01 -1.497 0.134354
## educationsecondary 8.011e-02 2.022e-01 0.396 0.691924
## educationtertiary 3.208e-01 2.337e-01 1.373 0.169897
## educationunknown -4.210e-01 3.572e-01 -1.179 0.238561
## defaultyes 5.446e-01 4.315e-01 1.262 0.206824
## balance -3.911e-06 1.749e-05 -0.224 0.823014
## housingyes -2.600e-01 1.381e-01 -1.883 0.059676 .
## loanyes -6.296e-01 2.000e-01 -3.149 0.001640 **
## contacttelephone -7.020e-02 2.327e-01 -0.302 0.762900
## contactunknown -1.416e+00 2.277e-01 -6.219 4.99e-10 ***
## day 1.641e-02 8.161e-03 2.011 0.044362 *
## monthaug -3.081e-01 2.494e-01 -1.235 0.216655
## monthdec 1.144e-01 6.573e-01 0.174 0.861784
## monthfeb 2.022e-01 2.937e-01 0.688 0.491290
## monthjan -1.123e+00 3.816e-01 -2.944 0.003245 **
## monthjul -7.515e-01 2.498e-01 -3.008 0.002630 **
## monthjun 5.542e-01 3.003e-01 1.845 0.065009 .
## monthmar 1.498e+00 3.901e-01 3.842 0.000122 ***
## monthmay -4.900e-01 2.340e-01 -2.094 0.036246 *
## monthnov -8.430e-01 2.737e-01 -3.080 0.002072 **
## monthoct 1.361e+00 3.300e-01 4.124 3.72e-05 ***
## monthsep 6.572e-01 4.115e-01 1.597 0.110265
## duration 4.225e-03 2.020e-04 20.912 < 2e-16 ***
## campaign -7.042e-02 2.821e-02 -2.496 0.012549 *
## pdays -9.791e-05 9.959e-04 -0.098 0.921684
## previous -5.511e-03 3.818e-02 -0.144 0.885249
## poutcomeother 4.912e-01 2.692e-01 1.825 0.068019 .
## poutcomesuccess 2.445e+00 2.773e-01 8.818 < 2e-16 ***
## poutcomeunknown -1.216e-01 3.199e-01 -0.380 0.703822
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3231.0 on 4520 degrees of freedom
## Residual deviance: 2173.7 on 4478 degrees of freedom
## AIC: 2259.7
##
## Number of Fisher Scoring iterations: 6
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3914 343
## 1 86 178
##
## Accuracy : 0.9051
## 95% CI : (0.8962, 0.9135)
## No Information Rate : 0.8848
## P-Value [Acc > NIR] : 6.12e-06
##
## Kappa : 0.4076
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.34165
## Specificity : 0.97850
## Pos Pred Value : 0.67424
## Neg Pred Value : 0.91943
## Prevalence : 0.11524
## Detection Rate : 0.03937
## Detection Prevalence : 0.05839
## Balanced Accuracy : 0.66008
##
## 'Positive' Class : 1
##
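The statistics in the confusion matrix output above can be reproduced by hand from its four cell counts. The sketch below uses only base R; the variable names are illustrative and the counts are copied from the output for the full model.

```r
# Cell counts copied from the full-model confusion matrix above
tn <- 3914  # predicted 0, actual 0
fn <- 343   # predicted 0, actual 1
fp <- 86    # predicted 1, actual 0
tp <- 178   # predicted 1, actual 1

total <- tn + fn + fp + tp

accuracy    <- (tp + tn) / total           # share of all predictions that are correct
sensitivity <- tp / (tp + fn)              # true positive rate
specificity <- tn / (tn + fp)              # true negative rate
ppv         <- tp / (tp + fp)              # positive predictive value
npv         <- tn / (tn + fn)              # negative predictive value
nir         <- max(tp + fn, tn + fp) / total  # accuracy of always guessing the majority class
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        nir = nir, balanced = balanced), 5)
```

Note that the no information rate is already about 0.885, so an accuracy of 0.905 is a smaller improvement than it first appears.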
The model yields a high accuracy on the training data. However, it is worth comparing these results with a smaller model that does not use the variables that we outlined above as suspect. We find that the reduced model is actually worse, with AIC increasing from 2260 to 2879.
glm.mod2 <- glm(y ~ . - day - duration - contact - campaign, data=data, family = binomial(link="logit"))
summary(glm.mod2)
##
## Call:
## glm(formula = y ~ . - day - duration - contact - campaign, family = binomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2680 -0.4806 -0.3849 -0.3029 2.7504
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.294e+00 5.104e-01 -2.535 0.011233 *
## age 1.165e-03 6.140e-03 0.190 0.849566
## jobblue-collar -1.262e-01 2.123e-01 -0.594 0.552219
## jobentrepreneur 1.147e-01 3.266e-01 0.351 0.725398
## jobhousemaid 9.059e-02 3.602e-01 0.251 0.801449
## jobmanagement 1.405e-01 2.121e-01 0.663 0.507579
## jobretired 7.743e-01 2.704e-01 2.863 0.004197 **
## jobself-employed 9.603e-02 3.047e-01 0.315 0.752643
## jobservices -3.075e-02 2.377e-01 -0.129 0.897074
## jobstudent 4.528e-01 3.427e-01 1.321 0.186502
## jobtechnician -9.065e-02 2.022e-01 -0.448 0.653881
## jobunemployed -1.808e-01 3.531e-01 -0.512 0.608535
## jobunknown 4.362e-01 5.017e-01 0.870 0.384543
## maritalmarried -5.548e-01 1.505e-01 -3.686 0.000227 ***
## maritalsingle -1.904e-01 1.746e-01 -1.090 0.275539
## educationsecondary 1.678e-01 1.758e-01 0.955 0.339735
## educationtertiary 2.806e-01 2.042e-01 1.374 0.169422
## educationunknown -3.350e-01 3.153e-01 -1.062 0.288092
## defaultyes 3.933e-01 3.678e-01 1.069 0.284907
## balance -1.468e-05 1.613e-05 -0.910 0.362722
## housingyes -2.303e-01 1.190e-01 -1.936 0.052908 .
## loanyes -5.680e-01 1.743e-01 -3.260 0.001116 **
## monthaug -4.857e-01 2.175e-01 -2.233 0.025559 *
## monthdec 2.984e-01 5.513e-01 0.541 0.588281
## monthfeb -1.546e-01 2.502e-01 -0.618 0.536734
## monthjan -9.351e-01 3.304e-01 -2.830 0.004654 **
## monthjul -6.601e-01 2.202e-01 -2.998 0.002715 **
## monthjun -4.977e-01 2.241e-01 -2.221 0.026363 *
## monthmar 7.657e-01 3.610e-01 2.121 0.033926 *
## monthmay -9.781e-01 1.969e-01 -4.967 6.8e-07 ***
## monthnov -6.839e-01 2.420e-01 -2.825 0.004721 **
## monthoct 1.044e+00 2.994e-01 3.487 0.000488 ***
## monthsep 4.464e-02 3.802e-01 0.117 0.906518
## pdays 3.637e-04 8.629e-04 0.422 0.673370
## previous 8.537e-03 3.371e-02 0.253 0.800066
## poutcomeother 5.413e-01 2.378e-01 2.276 0.022848 *
## poutcomesuccess 2.390e+00 2.528e-01 9.453 < 2e-16 ***
## poutcomeunknown -1.572e-01 2.815e-01 -0.558 0.576627
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3231.0 on 4520 degrees of freedom
## Residual deviance: 2802.9 on 4483 degrees of freedom
## AIC: 2878.9
##
## Number of Fisher Scoring iterations: 5
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 3956 443
## 1 44 78
##
## Accuracy : 0.8923
## 95% CI : (0.8829, 0.9012)
## No Information Rate : 0.8848
## P-Value [Acc > NIR] : 0.0583
##
## Kappa : 0.208
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.14971
## Specificity : 0.98900
## Pos Pred Value : 0.63934
## Neg Pred Value : 0.89930
## Prevalence : 0.11524
## Detection Rate : 0.01725
## Detection Prevalence : 0.02699
## Balanced Accuracy : 0.56936
##
## 'Positive' Class : 1
##
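The AIC values reported in the two model summaries can be checked against the residual deviances. For a 0/1 response, the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients. The helper `aic_from_deviance` below is a hypothetical name introduced for this check, not part of any package.

```r
# Hypothetical helper: AIC for a binary (0/1) logistic model from its summary figures
aic_from_deviance <- function(resid_dev, null_df, resid_df) {
  # Number of estimated coefficients, including the intercept
  k <- null_df - resid_df + 1
  resid_dev + 2 * k
}

# Figures copied from the two summaries above
aic_full    <- aic_from_deviance(2173.7, null_df = 4520, resid_df = 4478)
aic_reduced <- aic_from_deviance(2802.9, null_df = 4520, resid_df = 4483)

c(full = aic_full, reduced = aic_reduced)
```

The full model estimates 43 coefficients against 38 for the reduced model, so its lower AIC comes from a much lower deviance that more than offsets the penalty for the five extra coefficients.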
From the classification metrics, we see only a minor difference in accuracy between the full model (0.9051) and the reduced model (0.8923). The large difference between the models is in sensitivity (0.34165 vs 0.14971), also called the True Positive Rate, which represents the proportion of clients who actually subscribed that the model correctly predicted as subscribers. This is ultimately an important scoring measure for this exercise. Finally, as illustrated on the ROC plot above, the full model, with an Area Under Curve (AUC) of 0.903, is an overall better classifier than the reduced model, with an AUC of 0.733.
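As a reminder of what the AUC measures, it can be read as the probability that a randomly chosen subscriber receives a higher predicted score than a randomly chosen non-subscriber. The base R sketch below computes it with the rank-sum formula on six made-up scores; the function name `auc_rank` and the toy vectors are assumptions for illustration, and this is not necessarily how the AUC values quoted above were obtained.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) formula
auc_rank <- function(score, label) {
  r  <- rank(score)        # average ranks, so tied scores count as half
  n1 <- sum(label == 1)    # number of positives
  n0 <- sum(label == 0)    # number of negatives
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up predicted probabilities and true labels (not from the bank data)
score <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
label <- c(1,   1,   1,    0,   0,   0)

auc_rank(score, label)  # 8 of the 9 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ordering and 1 to perfect separation, which puts the reduced model's 0.733 and the full model's 0.903 in context.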
This was a simple example of binary logistic regression used both for explanation and as a classifier. It should be noted, as we pointed out earlier, that some variables might not be available at the time of a new marketing campaign and therefore should not be included in a robust model. A more robust model would also identify outliers and deal with any class imbalance likely present in the data.