Binary Logistic Regression

Marketing campaigns can be expensive, so knowing which customers to target is essential to running an efficient one. Given information about a client, can we predict whether that client will respond positively to the product being marketed? In the case presented, the target variable (the client’s response) is either yes or no. To predict a binary outcome rather than a continuous value, we turn to an alternative regression technique, namely logistic regression. Logistic regression estimates the probability that an observation belongs to the target class, which in this case means that a bank customer subscribes to a term deposit. The data is available from the UCI Machine Learning Repository.

Data

The dataset has 4521 observations of 17 variables: 7 numeric and 10 character. The character variables will have to be converted to factors for use in regression analysis. Some of the independent variables are bank client data such as age, job, education or balance, while others relate to the last contact of the current marketing campaign. There are no missing values, so we proceed to look at the first rows and the data distributions.
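The conversion to factors could look something like the sketch below. It uses base R and a tiny hand-typed data frame (`bank`, a hypothetical name) holding three rows copied from the listing, rather than the real file, so the loading step is deliberately left out.

```r
# Toy stand-in for the bank data: three rows with a subset of the columns.
# The real data would be read from the downloaded file first; that step is
# omitted here because the file name and separator are not given in the text.
bank <- data.frame(
  age     = c(30, 33, 35),
  job     = c("unemployed", "services", "management"),
  marital = c("married", "married", "single"),
  balance = c(1787, 4789, 1350),
  y       = c("no", "no", "no"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each of them to a factor.
# Numeric columns are left untouched.
is_chr <- vapply(bank, is.character, logical(1))
bank[is_chr] <- lapply(bank[is_chr], factor)

str(bank)
```

The same result can be reached with dplyr-style verbs; base R is used here only to keep the sketch free of package dependencies.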

age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
59 blue-collar married secondary no 0 yes no unknown 5 may 226 1 -1 0 unknown no
35 management single tertiary no 747 no no cellular 23 feb 141 2 176 3 failure no
36 self-employed married tertiary no 307 yes no cellular 14 may 341 1 330 2 other no
39 technician married secondary no 147 yes no cellular 6 may 151 2 -1 0 unknown no
41 entrepreneur married tertiary no 221 yes no unknown 14 may 57 2 -1 0 unknown no
43 services married primary no -88 yes yes cellular 17 apr 313 1 147 2 failure no
39 services married secondary no 9374 yes no unknown 20 may 273 1 -1 0 unknown no
43 admin. married secondary no 264 yes no cellular 17 apr 113 2 -1 0 unknown no
36 technician married tertiary no 1109 no no cellular 13 aug 328 2 -1 0 unknown no
20 student single secondary no 502 no no cellular 30 apr 261 1 -1 0 unknown yes
31 blue-collar married secondary no 360 yes yes cellular 29 jan 89 1 241 1 failure no
40 management married tertiary no 194 no yes cellular 29 aug 189 2 -1 0 unknown no
56 technician married secondary no 4073 no no cellular 27 aug 239 5 -1 0 unknown no
37 admin. single tertiary no 2317 yes no cellular 20 apr 114 1 152 2 failure no
25 blue-collar single primary no -221 yes no unknown 23 may 250 1 -1 0 unknown no
31 services married secondary no 132 no no cellular 7 jul 148 1 152 1 other no

Data summary

Name: data
Number of rows: 4521
Number of columns: 17
Column type frequency: character 10, numeric 7
Group variables: None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
job 0 1 6 13 0 12 0
marital 0 1 6 8 0 3 0
education 0 1 7 9 0 4 0
default 0 1 2 3 0 2 0
housing 0 1 2 3 0 2 0
loan 0 1 2 3 0 2 0
contact 0 1 7 9 0 3 0
month 0 1 3 3 0 12 0
poutcome 0 1 5 7 0 4 0
y 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 41.17 10.58 19 33 39 49 87 ▅▇▅▁▁
balance 0 1 1422.66 3009.64 -3313 69 444 1480 71188 ▇▁▁▁▁
day 0 1 15.92 8.25 1 9 16 21 31 ▆▆▇▅▆
duration 0 1 263.96 259.86 4 104 185 329 3025 ▇▁▁▁▁
campaign 0 1 2.79 3.11 1 1 2 3 50 ▇▁▁▁▁
pdays 0 1 39.77 100.12 -1 -1 -1 -1 871 ▇▁▁▁▁
previous 0 1 0.54 1.69 0 0 0 0 25 ▇▁▁▁▁

Modeling

As mentioned earlier, the categorical variables need to be converted to factors in order to be valid inputs for the regression. We also recode the response variable y from ‘yes’ and ‘no’ to 1 and 0. The data is fed into the glm function for generalized linear models, which is appropriate for response distributions belonging to the exponential family. We set the family to binomial and specify the logit link function. The logit maps a probability to the log-odds scale on which the linear predictor lives; its inverse, the sigmoid, is continuous and maps the linear predictor back to a probability bounded between 0 and 1.
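As a rough illustration of these steps, the sketch below fits the same kind of model on simulated data. The data frame `toy`, its columns and the coefficients used to generate it are all invented for the example and are not the bank data; only the `glm` call mirrors the one used in this post.

```r
set.seed(1)

# Simulated stand-in data (hypothetical, not the bank dataset).
n <- 500
toy <- data.frame(
  balance = rnorm(n),
  loan    = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Linear predictor on the log-odds scale; plogis() is the sigmoid
# 1 / (1 + exp(-eta)), which turns it into a probability.
eta   <- -1 + 0.8 * toy$balance - 0.6 * (toy$loan == "yes")
toy$y <- ifelse(runif(n) < plogis(eta), "yes", "no")

# Recode the response from "yes"/"no" to 1/0.
toy$y <- ifelse(toy$y == "yes", 1, 0)

# Binomial family with the logit link, as in the models below.
fit <- glm(y ~ ., family = binomial(link = "logit"), data = toy)

# type = "link" returns log-odds, type = "response" returns probabilities;
# the two are related by the sigmoid.
eta_hat <- predict(fit, type = "link")
p_hat   <- predict(fit, type = "response")
range(p_hat)
```

Because the fitted probabilities are the sigmoid of the linear predictor, they always fall strictly between 0 and 1, whatever the values of the inputs.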

Full model

From the summary of the trained model, we observe a number of significant predictors with positive and negative coefficients:

  • Positive:
    • poutcomesuccess: a successful outcome in a previous campaign is a sensible predictor of future success
    • duration: this is not a valid predictor, because the time spent on the phone is not known before a new campaign is launched
    • october, march (monthoct, monthmar): these are last-contact months, which are difficult to interpret
    • day: this variable is the day of the month of the last contact (it ranges from 1 to 31 in the data summary); it is also difficult to interpret, and it is mistakenly treated as a continuous variable
    • jobretired: this is a sensible predictor, since retired people are older and may be more responsive to ad campaigns
  • Negative:
    • campaign: this is the number of contacts performed during the campaign, and the negative sign could imply that some customers get annoyed by repeated contact
    • november, may, july, january (monthnov, monthmay, monthjul, monthjan): more last-contact months, again difficult to interpret
    • contactunknown: this is the contact communication type, and it is surprising that unknown contact types are significant
    • loanyes: it is understandable that people with outstanding loans would be less likely to sign up for additional products
    • maritalmarried: married clients tend to sign up less than the baseline marital group (divorced)
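
The coefficients in the summary are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for a handful of estimates copied by hand from the full-model summary; the vector name `log_odds` is only illustrative.

```r
# Log-odds estimates copied from the full-model summary.
log_odds <- c(
  poutcomesuccess = 2.445,
  monthmar        = 1.498,
  jobretired      = 0.6315,
  maritalmarried  = -0.4696,
  loanyes         = -0.6296,
  contactunknown  = -1.416
)

# exp() converts a log-odds coefficient into an odds ratio relative to
# the reference level of that factor, holding the other variables fixed.
odds_ratio <- exp(log_odds)
round(odds_ratio, 2)
```

An odds ratio above 1 raises the odds of subscribing and one below 1 lowers them. For instance, the estimate for poutcomesuccess corresponds to odds roughly 11.5 times those of the reference outcome, while loanyes roughly halves the odds.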
## 
## Call:
## glm(formula = y ~ ., family = binomial(link = "logit"), data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.0169  -0.3814  -0.2567  -0.1579   3.0346  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.462e+00  6.038e-01  -4.077 4.55e-05 ***
## age                -4.232e-03  7.125e-03  -0.594 0.552537    
## jobblue-collar     -3.924e-01  2.420e-01  -1.621 0.104937    
## jobentrepreneur    -2.498e-01  3.811e-01  -0.655 0.512199    
## jobhousemaid       -3.530e-01  4.176e-01  -0.845 0.398000    
## jobmanagement      -7.302e-02  2.407e-01  -0.303 0.761602    
## jobretired          6.315e-01  3.112e-01   2.029 0.042454 *  
## jobself-employed   -1.812e-01  3.533e-01  -0.513 0.608167    
## jobservices        -1.457e-01  2.729e-01  -0.534 0.593542    
## jobstudent          3.784e-01  3.750e-01   1.009 0.312958    
## jobtechnician      -1.926e-01  2.301e-01  -0.837 0.402496    
## jobunemployed      -6.395e-01  4.214e-01  -1.518 0.129138    
## jobunknown          5.207e-01  5.853e-01   0.890 0.373669    
## maritalmarried     -4.696e-01  1.743e-01  -2.694 0.007058 ** 
## maritalsingle      -3.051e-01  2.038e-01  -1.497 0.134354    
## educationsecondary  8.011e-02  2.022e-01   0.396 0.691924    
## educationtertiary   3.208e-01  2.337e-01   1.373 0.169897    
## educationunknown   -4.210e-01  3.572e-01  -1.179 0.238561    
## defaultyes          5.446e-01  4.315e-01   1.262 0.206824    
## balance            -3.911e-06  1.749e-05  -0.224 0.823014    
## housingyes         -2.600e-01  1.381e-01  -1.883 0.059676 .  
## loanyes            -6.296e-01  2.000e-01  -3.149 0.001640 ** 
## contacttelephone   -7.020e-02  2.327e-01  -0.302 0.762900    
## contactunknown     -1.416e+00  2.277e-01  -6.219 4.99e-10 ***
## day                 1.641e-02  8.161e-03   2.011 0.044362 *  
## monthaug           -3.081e-01  2.494e-01  -1.235 0.216655    
## monthdec            1.144e-01  6.573e-01   0.174 0.861784    
## monthfeb            2.022e-01  2.937e-01   0.688 0.491290    
## monthjan           -1.123e+00  3.816e-01  -2.944 0.003245 ** 
## monthjul           -7.515e-01  2.498e-01  -3.008 0.002630 ** 
## monthjun            5.542e-01  3.003e-01   1.845 0.065009 .  
## monthmar            1.498e+00  3.901e-01   3.842 0.000122 ***
## monthmay           -4.900e-01  2.340e-01  -2.094 0.036246 *  
## monthnov           -8.430e-01  2.737e-01  -3.080 0.002072 ** 
## monthoct            1.361e+00  3.300e-01   4.124 3.72e-05 ***
## monthsep            6.572e-01  4.115e-01   1.597 0.110265    
## duration            4.225e-03  2.020e-04  20.912  < 2e-16 ***
## campaign           -7.042e-02  2.821e-02  -2.496 0.012549 *  
## pdays              -9.791e-05  9.959e-04  -0.098 0.921684    
## previous           -5.511e-03  3.818e-02  -0.144 0.885249    
## poutcomeother       4.912e-01  2.692e-01   1.825 0.068019 .  
## poutcomesuccess     2.445e+00  2.773e-01   8.818  < 2e-16 ***
## poutcomeunknown    -1.216e-01  3.199e-01  -0.380 0.703822    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3231.0  on 4520  degrees of freedom
## Residual deviance: 2173.7  on 4478  degrees of freedom
## AIC: 2259.7
## 
## Number of Fisher Scoring iterations: 6
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3914  343
##          1   86  178
##                                           
##                Accuracy : 0.9051          
##                  95% CI : (0.8962, 0.9135)
##     No Information Rate : 0.8848          
##     P-Value [Acc > NIR] : 6.12e-06        
##                                           
##                   Kappa : 0.4076          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.34165         
##             Specificity : 0.97850         
##          Pos Pred Value : 0.67424         
##          Neg Pred Value : 0.91943         
##              Prevalence : 0.11524         
##          Detection Rate : 0.03937         
##    Detection Prevalence : 0.05839         
##       Balanced Accuracy : 0.66008         
##                                           
##        'Positive' Class : 1               
## 

The model yields a high accuracy (0.9051) on the training data, although this is only modestly above the no-information rate of 0.8848. It is worth comparing these results with a smaller model that drops the variables we flagged above as suspect (day, duration, contact and campaign). What we find is that the reduced model is actually worse, with an increase in AIC from 2260 to 2879.
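
The AIC values can be checked by hand from the two summaries. For a binomial model with a 0/1 response, the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients. The helper `aic_from_summary` below is a hypothetical name used only for this check.

```r
# Values read from the two model summaries.
n_obs   <- 4521
full    <- list(deviance = 2173.7, df_resid = 4478)
reduced <- list(deviance = 2802.9, df_resid = 4483)

aic_from_summary <- function(m) {
  # Number of estimated coefficients, including the intercept.
  k <- n_obs - m$df_resid
  # AIC = residual deviance + 2k (valid here because the response is 0/1).
  m$deviance + 2 * k
}

aic_full    <- aic_from_summary(full)
aic_reduced <- aic_from_summary(reduced)
c(full = aic_full, reduced = aic_reduced)
```

The full model estimates 43 coefficients and the reduced model 38, so the penalty term differs by only 10, while the deviance differs by more than 600. The gap in AIC is therefore driven almost entirely by fit, not by model size.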

Reduced model

## 
## Call:
## glm(formula = y ~ . - day - duration - contact - campaign, family = binomial(link = "logit"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2680  -0.4806  -0.3849  -0.3029   2.7504  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.294e+00  5.104e-01  -2.535 0.011233 *  
## age                 1.165e-03  6.140e-03   0.190 0.849566    
## jobblue-collar     -1.262e-01  2.123e-01  -0.594 0.552219    
## jobentrepreneur     1.147e-01  3.266e-01   0.351 0.725398    
## jobhousemaid        9.059e-02  3.602e-01   0.251 0.801449    
## jobmanagement       1.405e-01  2.121e-01   0.663 0.507579    
## jobretired          7.743e-01  2.704e-01   2.863 0.004197 ** 
## jobself-employed    9.603e-02  3.047e-01   0.315 0.752643    
## jobservices        -3.075e-02  2.377e-01  -0.129 0.897074    
## jobstudent          4.528e-01  3.427e-01   1.321 0.186502    
## jobtechnician      -9.065e-02  2.022e-01  -0.448 0.653881    
## jobunemployed      -1.808e-01  3.531e-01  -0.512 0.608535    
## jobunknown          4.362e-01  5.017e-01   0.870 0.384543    
## maritalmarried     -5.548e-01  1.505e-01  -3.686 0.000227 ***
## maritalsingle      -1.904e-01  1.746e-01  -1.090 0.275539    
## educationsecondary  1.678e-01  1.758e-01   0.955 0.339735    
## educationtertiary   2.806e-01  2.042e-01   1.374 0.169422    
## educationunknown   -3.350e-01  3.153e-01  -1.062 0.288092    
## defaultyes          3.933e-01  3.678e-01   1.069 0.284907    
## balance            -1.468e-05  1.613e-05  -0.910 0.362722    
## housingyes         -2.303e-01  1.190e-01  -1.936 0.052908 .  
## loanyes            -5.680e-01  1.743e-01  -3.260 0.001116 ** 
## monthaug           -4.857e-01  2.175e-01  -2.233 0.025559 *  
## monthdec            2.984e-01  5.513e-01   0.541 0.588281    
## monthfeb           -1.546e-01  2.502e-01  -0.618 0.536734    
## monthjan           -9.351e-01  3.304e-01  -2.830 0.004654 ** 
## monthjul           -6.601e-01  2.202e-01  -2.998 0.002715 ** 
## monthjun           -4.977e-01  2.241e-01  -2.221 0.026363 *  
## monthmar            7.657e-01  3.610e-01   2.121 0.033926 *  
## monthmay           -9.781e-01  1.969e-01  -4.967  6.8e-07 ***
## monthnov           -6.839e-01  2.420e-01  -2.825 0.004721 ** 
## monthoct            1.044e+00  2.994e-01   3.487 0.000488 ***
## monthsep            4.464e-02  3.802e-01   0.117 0.906518    
## pdays               3.637e-04  8.629e-04   0.422 0.673370    
## previous            8.537e-03  3.371e-02   0.253 0.800066    
## poutcomeother       5.413e-01  2.378e-01   2.276 0.022848 *  
## poutcomesuccess     2.390e+00  2.528e-01   9.453  < 2e-16 ***
## poutcomeunknown    -1.572e-01  2.815e-01  -0.558 0.576627    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3231.0  on 4520  degrees of freedom
## Residual deviance: 2802.9  on 4483  degrees of freedom
## AIC: 2878.9
## 
## Number of Fisher Scoring iterations: 5
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3956  443
##          1   44   78
##                                           
##                Accuracy : 0.8923          
##                  95% CI : (0.8829, 0.9012)
##     No Information Rate : 0.8848          
##     P-Value [Acc > NIR] : 0.0583          
##                                           
##                   Kappa : 0.208           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.14971         
##             Specificity : 0.98900         
##          Pos Pred Value : 0.63934         
##          Neg Pred Value : 0.89930         
##              Prevalence : 0.11524         
##          Detection Rate : 0.01725         
##    Detection Prevalence : 0.02699         
##       Balanced Accuracy : 0.56936         
##                                           
##        'Positive' Class : 1               
## 

ROC

From the classification metrics, we see only a minor difference in accuracy between the full model (0.9051) and the reduced model (0.8923). The large difference between the models is in sensitivity (0.34165 vs 0.14971), also called the true positive rate, which is the proportion of clients who actually subscribed that the model correctly predicted as subscribers. This is ultimately an important scoring measure for this exercise. Finally, as illustrated in the ROC plot, the full model with an area under the curve (AUC) of 0.903 is an overall better classifier than the reduced model with an AUC of 0.733.
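
The metrics quoted above follow directly from the confusion-matrix counts, as the sketch below shows for the full model. The second half illustrates one common way to compute an AUC, the rank-sum formulation, on four made-up scores. This post does not show how its own AUC values were obtained, so `auc_rank` is a stand-in and not necessarily the method used here.

```r
# Counts from the full model's confusion matrix
# (rows = prediction, columns = reference).
tn <- 3914; fn <- 343
fp <- 86;   tp <- 178

sensitivity <- tp / (tp + fn)                    # true positive rate
specificity <- tn / (tn + fp)                    # true negative rate
accuracy    <- (tp + tn) / (tp + tn + fp + fn)
round(c(sensitivity = sensitivity,
        specificity = specificity,
        accuracy    = accuracy), 4)

# AUC as the probability that a randomly chosen positive case scores
# higher than a randomly chosen negative case (rank-sum formulation).
auc_rank <- function(score, label) {
  r  <- rank(score)          # average ranks are used for ties
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy scores and labels, invented for illustration only.
auc_toy <- auc_rank(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))
auc_toy  # 0.75 for this toy example
```

Sensitivity depends on the 0.5 cut-off behind the confusion matrix, whereas AUC summarises performance over all cut-offs, which is why the two can tell somewhat different stories.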

This was a simple example of binary logistic regression used both for explanation and as a classifier. As pointed out earlier, some variables might not exist at the time of a new marketing campaign and therefore should not be included in a robust model. A more robust model would also identify outliers and deal with any class imbalance likely present in the data.
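
The extent of the imbalance can be read off the reference totals of the confusion matrices shown earlier. The short calculation below is only a sanity check; the variable names are illustrative.

```r
# Reference totals from the full model's confusion matrix.
n_no  <- 3914 + 86    # clients who did not subscribe
n_yes <- 343 + 178    # clients who subscribed
n     <- n_no + n_yes

prevalence        <- n_yes / n   # share of the positive class
baseline_accuracy <- n_no / n    # accuracy of always predicting "no"
round(c(prevalence = prevalence, baseline_accuracy = baseline_accuracy), 4)
```

These match the Prevalence and No Information Rate lines in the output above: only about 11.5% of clients subscribed, so a model that always predicts "no" already reaches about 88.5% accuracy. This is why accuracy alone is a weak yardstick for this problem.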