Binary Logistic Regression

Marketing campaigns can be expensive, so knowing which customers to target is essential to running an efficient campaign. Given information about a client, can we predict whether that client will respond positively to the product being marketed? In the case presented, the target variable (the client’s response) is either yes or no. When predicting a binary outcome instead of a continuous value, we must turn to an alternative regression technique, namely logistic regression. Logistic regression estimates the probability that an observation belongs to the target class, which in this case is whether a bank customer will subscribe to a term deposit. The data is available from the UCI Machine Learning Repository.

Data

The dataset is composed of 4521 observations, with 7 numeric variables and 10 character variables. The latter will have to be converted from characters to factors for use in regression analysis. Some of the independent variables are bank client data such as age, job, education or balance, while others relate to the last contact of the current marketing campaign. There are no missing values (the complete rate is 1 for every variable), so we proceed to take a look at the data distributions.
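The character-to-factor conversion mentioned above can be sketched in base R as follows. This is a minimal illustration rather than the code used for this post: the data frame name `bank_sample` is hypothetical, and it holds only three rows and five columns copied from the table of observations.

```r
# Hypothetical toy data frame with three rows copied from the observations table
bank_sample <- data.frame(
  age       = c(30, 20, 35),
  job       = c("unemployed", "student", "management"),
  marital   = c("married", "single", "single"),
  education = c("primary", "secondary", "tertiary"),
  y         = c("no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each of them to a factor
is_chr <- vapply(bank_sample, is.character, logical(1))
bank_sample[is_chr] <- lapply(bank_sample[is_chr], factor)

# Numeric columns are untouched; character columns are now factors
str(bank_sample)
```

With factors in place, glm builds one indicator column per non-reference level, which is why the model summary lists names such as jobretired or maritalmarried.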


age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous	poutcome	y
30	unemployed	married	primary	no	1787	no	no	cellular	19	oct	79	1	-1	0	unknown	no
33	services	married	secondary	no	4789	yes	yes	cellular	11	may	220	1	339	4	failure	no
35	management	single	tertiary	no	1350	yes	no	cellular	16	apr	185	1	330	1	failure	no
30	management	married	tertiary	no	1476	yes	yes	unknown	3	jun	199	4	-1	0	unknown	no
59	blue-collar	married	secondary	no	0	yes	no	unknown	5	may	226	1	-1	0	unknown	no
35	management	single	tertiary	no	747	no	no	cellular	23	feb	141	2	176	3	failure	no
36	self-employed	married	tertiary	no	307	yes	no	cellular	14	may	341	1	330	2	other	no
39	technician	married	secondary	no	147	yes	no	cellular	6	may	151	2	-1	0	unknown	no
41	entrepreneur	married	tertiary	no	221	yes	no	unknown	14	may	57	2	-1	0	unknown	no
43	services	married	primary	no	-88	yes	yes	cellular	17	apr	313	1	147	2	failure	no
39	services	married	secondary	no	9374	yes	no	unknown	20	may	273	1	-1	0	unknown	no
43	admin.	married	secondary	no	264	yes	no	cellular	17	apr	113	2	-1	0	unknown	no
36	technician	married	tertiary	no	1109	no	no	cellular	13	aug	328	2	-1	0	unknown	no
20	student	single	secondary	no	502	no	no	cellular	30	apr	261	1	-1	0	unknown	yes
31	blue-collar	married	secondary	no	360	yes	yes	cellular	29	jan	89	1	241	1	failure	no
40	management	married	tertiary	no	194	no	yes	cellular	29	aug	189	2	-1	0	unknown	no
56	technician	married	secondary	no	4073	no	no	cellular	27	aug	239	5	-1	0	unknown	no
37	admin.	single	tertiary	no	2317	yes	no	cellular	20	apr	114	1	152	2	failure	no
25	blue-collar	single	primary	no	-221	yes	no	unknown	23	may	250	1	-1	0	unknown	no
31	services	married	secondary	no	132	no	no	cellular	7	jul	148	1	152	1	other	no

Data summary

Name	data
Number of rows	4521
Number of columns	17
_______________________
Column type frequency:
character	10
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
job	1	6	13	12
marital	1	6	8	3
education	1	7	9	4
default	1	2	3	2
housing	1	2	3	2
loan	1	2	3	2
contact	1	7	9	3
month	1	3	3	12
poutcome	1	5	7	4
y	1	2	3	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	1	41.17	10.58	19	33	39	49	87	▅▇▅▁▁
balance	1	1422.66	3009.64	-3313	69	444	1480	71188	▇▁▁▁▁
day	1	15.92	8.25	1	9	16	21	31	▆▆▇▅▆
duration	1	263.96	259.86	4	104	185	329	3025	▇▁▁▁▁
campaign	1	2.79	3.11	1	1	2	3	50	▇▁▁▁▁
pdays	1	39.77	100.12	-1	-1	-1	-1	871	▇▁▁▁▁
previous	1	0.54	1.69	0	0	0	0	25	▇▁▁▁▁

Modeling

As mentioned earlier, the categorical data needs to be converted to factors in order to be valid inputs for regression. We also recode the response variable y from ‘yes’ and ‘no’ to 1 and 0. The data is fed into the glm function for generalized linear models, which is appropriate for response distributions belonging to the exponential family. We set the family to binomial and specify the logit link function, whose inverse (the sigmoid) maps the linear predictor to a continuous value bounded between 0 and 1.
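To make the role of the link function concrete, here is a small sketch of the sigmoid and its inverse, the logit, together with the 0/1 recoding of the response. The function names `sigmoid` and `logit` and the example vectors are illustrative assumptions; base R already provides the same mappings as `plogis()` and `qlogis()`.

```r
# Sigmoid: maps any real-valued linear predictor z to a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Logit: the inverse mapping, from a probability to log-odds
logit <- function(p) log(p / (1 - p))

z <- c(-5, -1, 0, 1, 5)   # example linear predictor values
p <- sigmoid(z)
print(round(p, 3))

# The hand-written versions agree with base R's plogis() and qlogis()
stopifnot(isTRUE(all.equal(p, plogis(z))))
stopifnot(isTRUE(all.equal(logit(p), qlogis(p))))

# Recoding a yes/no response to 1/0 (illustrative vector, not the real data)
response <- c("no", "yes", "no")
y01 <- as.integer(response == "yes")
print(y01)
```

A linear predictor of 0 corresponds to a probability of 0.5, and large positive or negative values approach 1 and 0 without ever reaching them.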

Full model

From the summary of the trained model, we observe a number of significant predictors with positive and negative coefficients:

Positive:
- poutcomesuccess: a successful outcome in a previous campaign is a sensible predictor of future success
- duration: this is actually not a valid predictor because, when launching a new campaign, the time spent on the phone is not yet available
- october, march (monthoct, monthmar): these are last contact months, which are difficult to interpret
- day: this variable is the day of the month of the last contact (it ranges from 1 to 31); it is also difficult to interpret and is mistakenly treated as a continuous variable
- jobretired: this is a sensible predictor, since retired people are older and may be more responsive to ad campaigns

Negative:
- campaign: this is the number of contacts performed during the campaign; the negative sign could imply that some customers get annoyed by repeated contacts
- november, may, july, january (monthnov, monthmay, monthjul, monthjan): like the positive month effects, these last contact months are difficult to interpret
- contactunknown: this is the contact communication type, and it is surprising that unknown contact types are significant
- loanyes: it is understandable that people with outstanding loans would be less likely to sign up for additional products
- maritalmarried: married clients tend to sign up less
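Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below copies a few estimates from the full-model summary by hand; the vector name `log_odds` is an assumption for illustration, and each ratio is relative to the reference level of its factor (for example, poutcomesuccess is compared with poutcome = failure).

```r
# Estimates copied by hand from the full-model summary (log-odds scale)
log_odds <- c(
  poutcomesuccess = 2.445,
  jobretired      = 0.6315,
  maritalmarried  = -0.4696,
  loanyes         = -0.6296,
  duration        = 0.004225
)

# Odds ratio = exp(coefficient): the multiplicative change in the odds of y = 1
odds_ratio <- exp(log_odds)
print(round(odds_ratio, 3))

# For a numeric predictor the ratio is per unit, so scale it to a meaningful
# step, e.g. a call lasting 60 seconds longer
or_duration_60s <- exp(60 * log_odds[["duration"]])
print(round(or_duration_60s, 3))
```

Read this way, a previous campaign success multiplies the odds of subscribing by roughly 11.5, holding the other variables fixed, while an outstanding loan roughly halves them.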

# Fit the full model: y against all other variables, binomial family with logit link
glm.mod <- glm(y ~ ., data = data, family = binomial(link = "logit"))
summary(glm.mod)

## 
## Call:
## glm(formula = y ~ ., family = binomial(link = "logit"), data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.0169  -0.3814  -0.2567  -0.1579   3.0346  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.462e+00  6.038e-01  -4.077 4.55e-05 ***
## age                -4.232e-03  7.125e-03  -0.594 0.552537    
## jobblue-collar     -3.924e-01  2.420e-01  -1.621 0.104937    
## jobentrepreneur    -2.498e-01  3.811e-01  -0.655 0.512199    
## jobhousemaid       -3.530e-01  4.176e-01  -0.845 0.398000    
## jobmanagement      -7.302e-02  2.407e-01  -0.303 0.761602    
## jobretired          6.315e-01  3.112e-01   2.029 0.042454 *  
## jobself-employed   -1.812e-01  3.533e-01  -0.513 0.608167    
## jobservices        -1.457e-01  2.729e-01  -0.534 0.593542    
## jobstudent          3.784e-01  3.750e-01   1.009 0.312958    
## jobtechnician      -1.926e-01  2.301e-01  -0.837 0.402496    
## jobunemployed      -6.395e-01  4.214e-01  -1.518 0.129138    
## jobunknown          5.207e-01  5.853e-01   0.890 0.373669    
## maritalmarried     -4.696e-01  1.743e-01  -2.694 0.007058 ** 
## maritalsingle      -3.051e-01  2.038e-01  -1.497 0.134354    
## educationsecondary  8.011e-02  2.022e-01   0.396 0.691924    
## educationtertiary   3.208e-01  2.337e-01   1.373 0.169897    
## educationunknown   -4.210e-01  3.572e-01  -1.179 0.238561    
## defaultyes          5.446e-01  4.315e-01   1.262 0.206824    
## balance            -3.911e-06  1.749e-05  -0.224 0.823014    
## housingyes         -2.600e-01  1.381e-01  -1.883 0.059676 .  
## loanyes            -6.296e-01  2.000e-01  -3.149 0.001640 ** 
## contacttelephone   -7.020e-02  2.327e-01  -0.302 0.762900    
## contactunknown     -1.416e+00  2.277e-01  -6.219 4.99e-10 ***
## day                 1.641e-02  8.161e-03   2.011 0.044362 *  
## monthaug           -3.081e-01  2.494e-01  -1.235 0.216655    
## monthdec            1.144e-01  6.573e-01   0.174 0.861784    
## monthfeb            2.022e-01  2.937e-01   0.688 0.491290    
## monthjan           -1.123e+00  3.816e-01  -2.944 0.003245 ** 
## monthjul           -7.515e-01  2.498e-01  -3.008 0.002630 ** 
## monthjun            5.542e-01  3.003e-01   1.845 0.065009 .  
## monthmar            1.498e+00  3.901e-01   3.842 0.000122 ***
## monthmay           -4.900e-01  2.340e-01  -2.094 0.036246 *  
## monthnov           -8.430e-01  2.737e-01  -3.080 0.002072 ** 
## monthoct            1.361e+00  3.300e-01   4.124 3.72e-05 ***
## monthsep            6.572e-01  4.115e-01   1.597 0.110265    
## duration            4.225e-03  2.020e-04  20.912  < 2e-16 ***
## campaign           -7.042e-02  2.821e-02  -2.496 0.012549 *  
## pdays              -9.791e-05  9.959e-04  -0.098 0.921684    
## previous           -5.511e-03  3.818e-02  -0.144 0.885249    
## poutcomeother       4.912e-01  2.692e-01   1.825 0.068019 .  
## poutcomesuccess     2.445e+00  2.773e-01   8.818  < 2e-16 ***
## poutcomeunknown    -1.216e-01  3.199e-01  -0.380 0.703822    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3231.0  on 4520  degrees of freedom
## Residual deviance: 2173.7  on 4478  degrees of freedom
## AIC: 2259.7
## 
## Number of Fisher Scoring iterations: 6

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3914  343
##          1   86  178
##                                           
##                Accuracy : 0.9051          
##                  95% CI : (0.8962, 0.9135)
##     No Information Rate : 0.8848          
##     P-Value [Acc > NIR] : 6.12e-06        
##                                           
##                   Kappa : 0.4076          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.34165         
##             Specificity : 0.97850         
##          Pos Pred Value : 0.67424         
##          Neg Pred Value : 0.91943         
##              Prevalence : 0.11524         
##          Detection Rate : 0.03937         
##    Detection Prevalence : 0.05839         
##       Balanced Accuracy : 0.66008         
##                                           
##        'Positive' Class : 1               
##
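The statistics reported with the confusion matrix follow directly from its four counts. As a check, this sketch rebuilds the full-model matrix by hand and recomputes the main metrics; the object names are illustrative, and class 1 (subscribed) is the positive class, as in the output.

```r
# Full-model confusion matrix: rows are predictions, columns are the reference
cm <- matrix(
  c(3914, 86,    # reference = 0: predicted 0, predicted 1
    343, 178),   # reference = 1: predicted 0, predicted 1
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

tn <- cm["0", "0"]; fp <- cm["1", "0"]
fn <- cm["0", "1"]; tp <- cm["1", "1"]

accuracy    <- (tp + tn) / sum(cm)   # share of all clients classified correctly
sensitivity <- tp / (tp + fn)        # share of actual subscribers detected
specificity <- tn / (tn + fp)        # share of non-subscribers detected
ppv         <- tp / (tp + fp)        # precision of the "yes" predictions
npv         <- tn / (tn + fn)
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              balanced = balanced), 5))
```

The recomputed values match the reported ones: of the 521 clients who subscribed, only 178 are detected, which is where the sensitivity of 0.34165 comes from.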

The model yields a high accuracy (0.9051) on the training data, although this is only modestly above the no-information rate of 0.8848. It is therefore worth comparing these results with a smaller model that does not use the variables that we outlined above as suspect. What we find is that the reduced model is actually worse, with an increase in AIC from 2260 to 2879.
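The two AIC values can be reproduced from the deviances printed in the model summaries. For a 0/1 response the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients. The helper name `aic_from_deviance` below is hypothetical, and the inputs are copied from the summaries.

```r
# AIC for an ungrouped binary-response GLM: residual deviance + 2 * (number of coefficients)
aic_from_deviance <- function(residual_deviance, n_coef) {
  residual_deviance + 2 * n_coef
}

# Number of coefficients = null df - residual df + 1 (the intercept)
k_full    <- 4520 - 4478 + 1
k_reduced <- 4520 - 4483 + 1

aic_full    <- aic_from_deviance(2173.7, k_full)
aic_reduced <- aic_from_deviance(2802.9, k_reduced)

print(c(k_full = k_full, k_reduced = k_reduced))
print(c(aic_full = aic_full, aic_reduced = aic_reduced))
```

Dropping day, duration, contact and campaign removes five coefficients but raises the deviance by about 629, so the small saving in the penalty term is far outweighed by the loss of fit.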

Reduced model

# Fit the reduced model: drop the suspect variables day, duration, contact and campaign
glm.mod2 <- glm(y ~ . - day - duration - contact - campaign, data = data, family = binomial(link = "logit"))
summary(glm.mod2)

## 
## Call:
## glm(formula = y ~ . - day - duration - contact - campaign, family = binomial(link = "logit"), 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2680  -0.4806  -0.3849  -0.3029   2.7504  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.294e+00  5.104e-01  -2.535 0.011233 *  
## age                 1.165e-03  6.140e-03   0.190 0.849566    
## jobblue-collar     -1.262e-01  2.123e-01  -0.594 0.552219    
## jobentrepreneur     1.147e-01  3.266e-01   0.351 0.725398    
## jobhousemaid        9.059e-02  3.602e-01   0.251 0.801449    
## jobmanagement       1.405e-01  2.121e-01   0.663 0.507579    
## jobretired          7.743e-01  2.704e-01   2.863 0.004197 ** 
## jobself-employed    9.603e-02  3.047e-01   0.315 0.752643    
## jobservices        -3.075e-02  2.377e-01  -0.129 0.897074    
## jobstudent          4.528e-01  3.427e-01   1.321 0.186502    
## jobtechnician      -9.065e-02  2.022e-01  -0.448 0.653881    
## jobunemployed      -1.808e-01  3.531e-01  -0.512 0.608535    
## jobunknown          4.362e-01  5.017e-01   0.870 0.384543    
## maritalmarried     -5.548e-01  1.505e-01  -3.686 0.000227 ***
## maritalsingle      -1.904e-01  1.746e-01  -1.090 0.275539    
## educationsecondary  1.678e-01  1.758e-01   0.955 0.339735    
## educationtertiary   2.806e-01  2.042e-01   1.374 0.169422    
## educationunknown   -3.350e-01  3.153e-01  -1.062 0.288092    
## defaultyes          3.933e-01  3.678e-01   1.069 0.284907    
## balance            -1.468e-05  1.613e-05  -0.910 0.362722    
## housingyes         -2.303e-01  1.190e-01  -1.936 0.052908 .  
## loanyes            -5.680e-01  1.743e-01  -3.260 0.001116 ** 
## monthaug           -4.857e-01  2.175e-01  -2.233 0.025559 *  
## monthdec            2.984e-01  5.513e-01   0.541 0.588281    
## monthfeb           -1.546e-01  2.502e-01  -0.618 0.536734    
## monthjan           -9.351e-01  3.304e-01  -2.830 0.004654 ** 
## monthjul           -6.601e-01  2.202e-01  -2.998 0.002715 ** 
## monthjun           -4.977e-01  2.241e-01  -2.221 0.026363 *  
## monthmar            7.657e-01  3.610e-01   2.121 0.033926 *  
## monthmay           -9.781e-01  1.969e-01  -4.967  6.8e-07 ***
## monthnov           -6.839e-01  2.420e-01  -2.825 0.004721 ** 
## monthoct            1.044e+00  2.994e-01   3.487 0.000488 ***
## monthsep            4.464e-02  3.802e-01   0.117 0.906518    
## pdays               3.637e-04  8.629e-04   0.422 0.673370    
## previous            8.537e-03  3.371e-02   0.253 0.800066    
## poutcomeother       5.413e-01  2.378e-01   2.276 0.022848 *  
## poutcomesuccess     2.390e+00  2.528e-01   9.453  < 2e-16 ***
## poutcomeunknown    -1.572e-01  2.815e-01  -0.558 0.576627    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3231.0  on 4520  degrees of freedom
## Residual deviance: 2802.9  on 4483  degrees of freedom
## AIC: 2878.9
## 
## Number of Fisher Scoring iterations: 5

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3956  443
##          1   44   78
##                                           
##                Accuracy : 0.8923          
##                  95% CI : (0.8829, 0.9012)
##     No Information Rate : 0.8848          
##     P-Value [Acc > NIR] : 0.0583          
##                                           
##                   Kappa : 0.208           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.14971         
##             Specificity : 0.98900         
##          Pos Pred Value : 0.63934         
##          Neg Pred Value : 0.89930         
##              Prevalence : 0.11524         
##          Detection Rate : 0.01725         
##    Detection Prevalence : 0.02699         
##       Balanced Accuracy : 0.56936         
##                                           
##        'Positive' Class : 1               
##

ROC

From the classification metrics, we see a minor difference in accuracy between the full model (0.9051) and the reduced model (0.8923). The large difference between the models is in sensitivity (0.34165 vs 0.14971), also called the True Positive Rate, which represents the proportion of clients who actually subscribed that were correctly predicted to subscribe. This is ultimately an important scoring measure for this exercise. Finally, as illustrated on the ROC plot, the full model with an Area Under the Curve (AUC) of 0.903 is an overall better classifier than the reduced model with an AUC of 0.733.
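For readers who want to see what the AUC measures, here is a minimal base R sketch that computes it from ranks (the Mann-Whitney formulation). It is not the code used to produce the ROC plot for this post: the function name `auc_rank` is hypothetical and the labels and scores are a toy example, not the bank data.

```r
# AUC as the probability that a randomly chosen positive case receives a
# higher score than a randomly chosen negative case (rank-based formula)
auc_rank <- function(labels, scores) {
  stopifnot(length(labels) == length(scores), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  ranks <- rank(scores)   # tied scores receive average ranks
  (sum(ranks[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: 1 = subscribed, scores = predicted probabilities
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)
print(auc_rank(labels, scores))   # 0.75
```

In practice the scores would be the fitted probabilities of each model, for instance from predict() with type = "response". An AUC of 0.5 corresponds to random ranking and 1 to perfect separation, and unlike accuracy or sensitivity it does not depend on a single classification threshold.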

This was a simple example of binary logistic regression used both for explanation and as a classifier. As pointed out earlier, some variables might not exist at the time of a new marketing campaign and therefore should not be included in a robust model. A more robust model would also identify outliers and deal with the class imbalance present in the data (only about 11.5% of clients subscribed).

DATA621 Blog 4

Mael Illien

12/15/2020
