Description:

Customer conversion in a direct marketing campaign is one of the most important metrics of success when evaluating the campaign itself. In a world of limited resources, it’s often difficult to make the best use of a marketer’s time, we can improve on this by predicting whether a prospective customer will respond to a marketing campaign. In order to do this, I will:

• understand which of the observed variables are most associated with the chance of subscribing to a term deposit (feature importance);

• how the important variables relate to the predicted probability that a client will subscribe to a term deposit (feature effects);

• build a model that could be useful in determining which future clients are likely to respond a term deposit, if contacted (prediction/deployment).

1) Exploratory data analysis (EDA)

Explore the bank_trn.csv data using any means necessary.

Based on your exploration and basic intuition, which variables (if any) seem to be associated with the response? If you had to pick the top three variables in terms of their degree of association to the response, which would you choose (be sure to include your reasoning for choosing these variables)? Please include any useful visualizations and/or statistics (i.e., frequency tables) as evidence to your hypotheses.

Summary Table for Sampled Bank Training Data
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed subscribed
Min. :17.00 Length:37070 Length:37070 Length:37070 Length:37070 Length:37070 Length:37070 Length:37070 Length:37070 Length:37070 Min. : 0.0 Min. : 1.000 Min. : 0.0 Min. :0.0000 Length:37070 Min. :-3.40000 Min. :92.20 Min. :-50.80 Min. :0.634 Min. :4964 0:32894
1st Qu.:32.00 Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.0000 Class :character 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.70 1st Qu.:1.344 1st Qu.:5099 1: 4176
Median :38.00 Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Median : 180.0 Median : 2.000 Median :999.0 Median :0.0000 Mode :character Median : 1.10000 Median :93.75 Median :-41.80 Median :4.857 Median :5191 NA
Mean :40.04 NA NA NA NA NA NA NA NA NA Mean : 258.5 Mean : 2.565 Mean :962.2 Mean :0.1737 NA Mean : 0.08046 Mean :93.58 Mean :-40.49 Mean :3.620 Mean :5167 NA
3rd Qu.:47.00 NA NA NA NA NA NA NA NA NA 3rd Qu.: 320.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.0000 NA 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.40 3rd Qu.:4.961 3rd Qu.:5228 NA
Max. :98.00 NA NA NA NA NA NA NA NA NA Max. :4918.0 Max. :56.000 Max. :999.0 Max. :6.0000 NA Max. : 1.40000 Max. :94.77 Max. :-26.90 Max. :5.045 Max. :5228 NA

The dataset that we are working with includes 21 variables, a full data dictionary is available at the bottom of this report. We can see that some of these variables will likely contain much of the same information, housing & loan will likely contain some of the same information as people with personal loans will likely have overlap with those that have housing loans. In order to identify any redundant variables we’ll implement a redundancy analysis utilziing the Hmisc::redun() function.

## 
## Redundancy Analysis
## 
## Hmisc::redun(formula = ~., data = df_samp, nk = 0)
## 
## n: 3000  p: 21   nk: 0 
## 
## Number of NAs:    0 
## 
## Transformation of target variables forced to be linear
## 
## R-squared cutoff: 0.9    Type: ordinary 
## 
## R^2 with which each variable can be predicted from all other variables:
## 
##            age            job        marital      education        default 
##          0.434          0.448          0.238          0.460          0.139 
##        housing           loan        contact          month    day_of_week 
##          1.000          1.000          0.772          0.933          0.034 
##       duration       campaign          pdays       previous       poutcome 
##          0.252          0.067          0.891          0.851          0.864 
##   emp_var_rate cons_price_idx  cons_conf_idx      euribor3m    nr_employed 
##          0.996          0.988          0.870          0.995          0.995 
##     subscribed 
##          0.395 
## 
## Rendundant variables:
## 
## housing emp_var_rate euribor3m
## 
## Predicted from variables:
## 
## age job marital education default loan contact month day_of_week duration campaign pdays previous poutcome cons_price_idx cons_conf_idx nr_employed subscribed 
## 
##   Variable Deleted   R^2 R^2 after later deletions
## 1          housing 1.000                       1 1
## 2     emp_var_rate 0.996                     0.996
## 3        euribor3m 0.995
## [1] "housing"      "emp_var_rate" "euribor3m"

The redundancy analysis yielded some interesting results, indicating that the housing, emp_var_rate and the euribor3m variables include redundant information that are predicted by the remaining variables of the dataset. In order to make our analysis more effective, we’ll remove these variables from the analysis before proceeding.

## 
## Redundancy Analysis
## 
## Hmisc::redun(formula = ~., data = df_samp, nk = 0)
## 
## n: 3000  p: 18   nk: 0 
## 
## Number of NAs:    0 
## 
## Transformation of target variables forced to be linear
## 
## R-squared cutoff: 0.9    Type: ordinary 
## 
## R^2 with which each variable can be predicted from all other variables:
## 
##            age            job        marital      education        default 
##          0.433          0.447          0.238          0.460          0.138 
##           loan        contact          month    day_of_week       duration 
##          0.025          0.752          0.712          0.034          0.250 
##       campaign          pdays       previous       poutcome cons_price_idx 
##          0.065          0.891          0.850          0.864          0.698 
##  cons_conf_idx    nr_employed     subscribed 
##          0.608          0.712          0.389 
## 
## No redundant variables

After dropping the named variables int he redundancy analysis, we reran the function again and can see that there are no longer any redundant outputs.

Summary Table for Sampled Bank Training Data
age job marital education default loan contact month day_of_week duration campaign pdays previous poutcome cons_price_idx cons_conf_idx nr_employed subscribed
Min. :17.00 Length:3000 Length:3000 Length:3000 Length:3000 Length:3000 Length:3000 Length:3000 Length:3000 Min. : 1.0 Min. : 1.000 Min. : 0.0 Min. :0.0000 Length:3000 Min. :92.20 Min. :-50.80 Min. :4964 0:2659
1st Qu.:32.00 Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.0000 Class :character 1st Qu.:93.08 1st Qu.:-42.70 1st Qu.:5099 1: 341
Median :38.50 Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Median : 187.0 Median : 2.000 Median :999.0 Median :0.0000 Mode :character Median :93.75 Median :-41.80 Median :5191 NA
Mean :40.21 NA NA NA NA NA NA NA NA Mean : 264.5 Mean : 2.703 Mean :960.9 Mean :0.1763 NA Mean :93.58 Mean :-40.52 Mean :5168 NA
3rd Qu.:47.00 NA NA NA NA NA NA NA NA 3rd Qu.: 321.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.0000 NA 3rd Qu.:93.99 3rd Qu.:-36.40 3rd Qu.:5228 NA
Max. :89.00 NA NA NA NA NA NA NA NA Max. :3183.0 Max. :40.000 Max. :999.0 Max. :5.0000 NA Max. :94.77 Max. :-26.90 Max. :5228 NA

Above is a summary table of the new dataset after having dropped the redundant variables. This table is based on a randomly sampled subset of the original dataset including 3000 rows, which we will use to make the anlaysis less computationally demanding before fitting the final model.

The variables we have decided to include in the analysis thus far are “age”, “job”, “marital”, “education”, “default”, “housing”, “loan”, “contact”, “month”, “day of week”, “duration”, “campaign”, “pdays”, “previous”, “poutcome”, “emp_var_rate”, “cons_price_idx”, “cons_conf_idx”, “euribor3m”, “nr_employed”, and “subscribed”. We’ll continue by creating visualizations for the remaining variables.

Visualizations

The Duration variable appears to be heavily right-skewed, which is sensible as many sales-based variables have similar qualities to them.

> The “default” Variable includes many records that have “no” as a response, with only “3” records including a “yes” response, there are 7767 records with an “unknown” status of default. A majority of records include married persons. Most respondents have a loan status of “no”.

The lack of frequency of default “yes” variables may create a false sense of certainty within a model due to the low variance of the data, we may have to reconsider including this variable in the final model.

The “PDays” varaible in the dataset includes many records with the value “999”, which means that the customer has not recently been contacted. The majority response for the variable “Previous” is 0, which aligns with the notion of the data from the “PDays” variable, as a “Previous” value of 0 indicates that no contacts were made with the customer prior to the campaign.

There don’t appear to be any clear patterns to the Consumer Price & Confidence Index variables, what effect they ultimately have on the outcome variable is yet to be seen.

There are a wide variety of “nr_employed” values, with what appears to approach a bimodal distribution. The vast majority of the records in the dataset have a subscribed status of 0, indicating most consumers are not subscribed.

Tree EDA

CART (Classification and Regression Trees) Models are highly flexible, non-parametric models that are great low-cost methods for investigating the dataset. I’ll use this method to identify what are likely to be the most important variables to build a sparse model.

We don’t have any variables in our dataset that give us perfect information, however, we can see from the above clasification tree that the “duration” variable plays a big role in classifying whether or not somebody subscribes.

The duration does appear to be the most important variable for classifying a prospective customer as subscribed vs. not. We can also see that the nr_employed variable is also important to the tree based classification model.

Correlation Plot

The above correlation plot indicates that we have some decently strong correllations between the variables “previous” and “pdays”, as well as the “cons_price_idx” and “nr_employed”. While it doesn’t appear that we have to worry about multicollinearity, let’s fit a full linear regression model to the data and the study the variance inflation factors to be more sure.

2) Initial modeling. Based on your results from 1) fit an initial logistic regression model to the data.

Does the model fit appear reasonable? Please explain using relevant graphics and/or statistics (e.g., how does it compare to the null model or a model that always predicts “no”?).

We’ll start by fitting a full logistic regression model and assessing the strength of that model vs. a model built on the identified most important variables.

Full Model

## 
## Call:
## glm(formula = subscribed ~ ., family = binomial(link = "logit"), 
##     data = df_samp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6617  -0.2911  -0.1633  -0.1013   3.1627  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   1.071e+02  1.730e+01   6.192 5.96e-10 ***
## age                           2.667e-02  9.461e-03   2.819 0.004821 ** 
## jobblue-collar                4.558e-01  3.091e-01   1.474 0.140382    
## jobentrepreneur               2.622e-01  4.516e-01   0.581 0.561439    
## jobhousemaid                  6.287e-01  5.168e-01   1.217 0.223737    
## jobmanagement                -5.224e-01  3.685e-01  -1.418 0.156312    
## jobretired                   -4.372e-01  4.484e-01  -0.975 0.329523    
## jobself-employed              9.449e-01  3.788e-01   2.494 0.012619 *  
## jobservices                   3.839e-01  3.460e-01   1.109 0.267219    
## jobstudent                    8.972e-01  4.215e-01   2.129 0.033272 *  
## jobtechnician                 1.179e-01  2.800e-01   0.421 0.673673    
## jobunemployed                 2.341e-01  5.029e-01   0.465 0.641646    
## jobunknown                    1.371e-01  8.702e-01   0.158 0.874850    
## maritalmarried                7.060e-01  2.763e-01   2.555 0.010604 *  
## maritalsingle                 7.282e-01  3.112e-01   2.340 0.019292 *  
## maritalunknown                1.120e+00  1.306e+00   0.858 0.390925    
## educationbasic.6y            -4.753e-01  4.501e-01  -1.056 0.290939    
## educationbasic.9y            -1.026e+00  3.667e-01  -2.798 0.005139 ** 
## educationhigh.school          1.329e-01  3.301e-01   0.403 0.687227    
## educationilliterate          -9.705e+00  3.714e+02  -0.026 0.979152    
## educationprofessional.course  4.586e-01  3.533e-01   1.298 0.194285    
## educationuniversity.degree    3.073e-01  3.317e-01   0.926 0.354270    
## educationunknown             -3.411e-01  4.791e-01  -0.712 0.476508    
## defaultunknown               -6.401e-01  2.562e-01  -2.499 0.012457 *  
## loanunknown                  -4.212e-02  5.034e-01  -0.084 0.933321    
## loanyes                       3.538e-03  2.201e-01   0.016 0.987174    
## contacttelephone             -4.338e-01  2.970e-01  -1.460 0.144157    
## monthaug                      6.537e-01  4.342e-01   1.506 0.132155    
## monthdec                      1.032e+00  7.840e-01   1.316 0.188277    
## monthjul                      9.202e-01  3.984e-01   2.310 0.020900 *  
## monthjun                      1.294e+00  3.822e-01   3.387 0.000708 ***
## monthmar                      1.532e+00  5.099e-01   3.004 0.002666 ** 
## monthmay                     -2.503e-01  3.054e-01  -0.820 0.412486    
## monthnov                      1.447e-01  3.944e-01   0.367 0.713627    
## monthoct                      9.423e-01  5.693e-01   1.655 0.097864 .  
## monthsep                     -6.716e-01  6.080e-01  -1.105 0.269330    
## day_of_weekmon               -7.353e-03  2.560e-01  -0.029 0.977087    
## day_of_weekthu                2.267e-01  2.564e-01   0.884 0.376681    
## day_of_weektue                1.501e-01  2.656e-01   0.565 0.572100    
## day_of_weekwed                1.591e-01  2.647e-01   0.601 0.547840    
## duration                      5.511e-03  3.117e-04  17.678  < 2e-16 ***
## campaign                     -1.935e-02  3.942e-02  -0.491 0.623558    
## pdays                        -7.039e-04  7.438e-04  -0.946 0.343974    
## previous                     -1.272e-01  2.508e-01  -0.507 0.611947    
## poutcomenonexistent           6.978e-01  3.823e-01   1.825 0.067941 .  
## poutcomesuccess               2.108e+00  7.184e-01   2.935 0.003336 ** 
## cons_price_idx               -4.091e-01  1.795e-01  -2.279 0.022654 *  
## cons_conf_idx                 1.020e-02  2.247e-02   0.454 0.650016    
## nr_employed                  -1.456e-02  1.452e-03 -10.023  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2124.7  on 2999  degrees of freedom
## Residual deviance: 1150.5  on 2951  degrees of freedom
## AIC: 1248.5
## 
## Number of Fisher Scoring iterations: 12

Our model considers the month of June to be highly significant to the data, as well as duration, nr_employed, and the model intercept. We can investigate whether our model is dealing with any multicollinearity problems by calculating the VIF (Variance Inflation Factors) for the model, any values greater than 10 indicate high multicollinearity.

##                     GVIF Df GVIF^(1/(2*Df))
## age             2.417949  1        1.554975
## job             8.218908 11        1.100481
## marital         1.743055  3        1.097030
## education       3.714700  7        1.098269
## default         1.158304  1        1.076245
## loan            1.113545  2        1.027252
## contact         2.225823  1        1.491919
## month           8.704645  9        1.127738
## day_of_week     1.248831  4        1.028165
## duration        1.431379  1        1.196402
## campaign        1.111302  1        1.054183
## pdays           8.615066  1        2.935143
## previous        5.185798  1        2.277235
## poutcome       18.996515  2        2.087702
## cons_price_idx  2.143149  1        1.463950
## cons_conf_idx   2.702760  1        1.644007
## nr_employed     2.442967  1        1.562999

pdays and poutcomesuccess have relatively high VIF values, however, they aren’t above 10 so we don’t necessarily need to worry about multicollinearity next. We’ll remove the “poutcome” variable from the model before fitting it again to reassess.

All but Poutcome

## 
## Call:
## glm(formula = subscribed ~ . - poutcome, family = binomial(link = "logit"), 
##     data = df_samp)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6303  -0.2958  -0.1677  -0.1050   3.0732  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   1.042e+02  1.707e+01   6.103 1.04e-09 ***
## age                           2.529e-02  9.407e-03   2.688 0.007181 ** 
## jobblue-collar                4.404e-01  3.075e-01   1.432 0.152108    
## jobentrepreneur               2.812e-01  4.473e-01   0.629 0.529593    
## jobhousemaid                  6.777e-01  5.078e-01   1.335 0.182030    
## jobmanagement                -4.694e-01  3.627e-01  -1.294 0.195598    
## jobretired                   -4.189e-01  4.483e-01  -0.934 0.350087    
## jobself-employed              9.230e-01  3.783e-01   2.440 0.014683 *  
## jobservices                   3.810e-01  3.439e-01   1.108 0.267886    
## jobstudent                    8.163e-01  4.278e-01   1.908 0.056380 .  
## jobtechnician                 1.012e-01  2.779e-01   0.364 0.715643    
## jobunemployed                 2.652e-01  4.957e-01   0.535 0.592578    
## jobunknown                    1.786e-01  8.669e-01   0.206 0.836790    
## maritalmarried                6.354e-01  2.715e-01   2.340 0.019280 *  
## maritalsingle                 6.736e-01  3.063e-01   2.199 0.027899 *  
## maritalunknown                1.094e+00  1.293e+00   0.846 0.397563    
## educationbasic.6y            -4.239e-01  4.458e-01  -0.951 0.341646    
## educationbasic.9y            -1.021e+00  3.650e-01  -2.797 0.005152 ** 
## educationhigh.school          1.230e-01  3.298e-01   0.373 0.709104    
## educationilliterate          -9.693e+00  3.717e+02  -0.026 0.979195    
## educationprofessional.course  4.642e-01  3.520e-01   1.319 0.187214    
## educationuniversity.degree    3.052e-01  3.308e-01   0.923 0.356218    
## educationunknown             -3.083e-01  4.753e-01  -0.649 0.516597    
## defaultunknown               -6.488e-01  2.556e-01  -2.538 0.011142 *  
## loanunknown                  -1.211e-01  5.210e-01  -0.233 0.816127    
## loanyes                      -1.685e-02  2.186e-01  -0.077 0.938551    
## contacttelephone             -4.620e-01  2.981e-01  -1.550 0.121183    
## monthaug                      5.860e-01  4.308e-01   1.360 0.173703    
## monthdec                      1.025e+00  7.792e-01   1.316 0.188175    
## monthjul                      8.615e-01  3.959e-01   2.176 0.029548 *  
## monthjun                      1.257e+00  3.797e-01   3.310 0.000932 ***
## monthmar                      1.532e+00  5.113e-01   2.995 0.002742 ** 
## monthmay                     -2.501e-01  3.036e-01  -0.824 0.410005    
## monthnov                      1.186e-01  3.918e-01   0.303 0.762128    
## monthoct                      8.440e-01  5.728e-01   1.474 0.140600    
## monthsep                     -6.490e-01  6.120e-01  -1.060 0.288959    
## day_of_weekmon               -3.126e-02  2.558e-01  -0.122 0.902728    
## day_of_weekthu                2.377e-01  2.551e-01   0.932 0.351419    
## day_of_weektue                1.534e-01  2.649e-01   0.579 0.562695    
## day_of_weekwed                1.494e-01  2.633e-01   0.568 0.570290    
## duration                      5.438e-03  3.083e-04  17.635  < 2e-16 ***
## campaign                     -1.741e-02  3.908e-02  -0.446 0.655853    
## pdays                        -2.605e-03  3.726e-04  -6.992 2.71e-12 ***
## previous                     -5.644e-01  1.718e-01  -3.284 0.001023 ** 
## cons_price_idx               -3.601e-01  1.759e-01  -2.047 0.040661 *  
## cons_conf_idx                 1.522e-02  2.237e-02   0.680 0.496460    
## nr_employed                  -1.429e-02  1.444e-03  -9.902  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2124.7  on 2999  degrees of freedom
## Residual deviance: 1160.4  on 2953  degrees of freedom
## AIC: 1254.4
## 
## Number of Fisher Scoring iterations: 12
##                    GVIF Df GVIF^(1/(2*Df))
## age            2.389666  1        1.545854
## job            7.891819 11        1.098451
## marital        1.718696  3        1.094460
## education      3.661976  7        1.097148
## default        1.156534  1        1.075423
## loan           1.104804  2        1.025230
## contact        2.251491  1        1.500497
## month          8.279219  9        1.124603
## day_of_week    1.236578  4        1.026899
## duration       1.415599  1        1.189789
## campaign       1.109412  1        1.053286
## pdays          2.157115  1        1.468712
## previous       2.328030  1        1.525788
## cons_price_idx 2.073407  1        1.439933
## cons_conf_idx  2.693734  1        1.641260
## nr_employed    2.424991  1        1.557238

This seems to have fixed the VIF’s, whether or not we have to remove one of these variables to fit a final model is yet to be seen.

Confusion Matrix

##       predicted
## actual   no  yes
##      0 2590   69
##      1  180  161
## [1] 0.9740504
## [1] 0.4721408

The most important variables in the tree EDA were “duration”, “nr_employed”, “cons_conf_idx”, “cons_price_idx”, "“month”, “poutcome” “pdays” and “job”. I will build a model with these variables, and then fit a both direction stepwise model based on that initial variable selection.

Important Variables Model

## 
## Call:  glm(formula = subscribed ~ duration + nr_employed + cons_price_idx + 
##     month + poutcome, family = binomial(link = "logit"), data = df_samp)
## 
## Coefficients:
##         (Intercept)             duration          nr_employed  
##          122.820352             0.005255            -0.015244  
##      cons_price_idx             monthaug             monthdec  
##           -0.531128             0.991967             1.031605  
##            monthjul             monthjun             monthmar  
##            1.065344             1.260295             1.451497  
##            monthmay             monthnov             monthoct  
##           -0.343813             0.330123             1.052313  
##            monthsep  poutcomenonexistent      poutcomesuccess  
##           -0.565476             0.674584             2.566166  
## 
## Degrees of Freedom: 2999 Total (i.e. Null);  2985 Residual
## Null Deviance:       2125 
## Residual Deviance: 1206  AIC: 1236

The model performing both directions stepwise variable selection has an AIC of 1236 vs. the AIC of 1254 of the full model, which indicates slightly better but overall comparable performance between the two models. The model ultimately selected all the months except for January, February, and April. The model also deselected the cons_conf_idx.

Confusion Matrix

##       predicted
## actual   no  yes
##      0 2591   68
##      1  186  155
## [1] 0.9744265
## [1] 0.4545455

The stepwise selection model above has a sensitivity of 0.97, and a specificity of 0.454, which means that it is better at predictcing positives than negatives.

Brier Scores

## [1] 0.01292011
## [1] 0.05857815
## [1] 0.0567255
## [1] 0.05619917

The Brier score indicates that our best model is the full model, however, the initial model based on the important variables from the classification tree as well as the stepwise selection both have comparable performance.

Null Model

## 
## Call:  glm(formula = subscribed ~ 1, family = binomial(link = "logit"), 
##     data = na.omit(df_samp))
## 
## Coefficients:
## (Intercept)  
##      -2.054  
## 
## Degrees of Freedom: 2999 Total (i.e. Null);  2999 Residual
## Null Deviance:       2125 
## Residual Deviance: 2125  AIC: 2127

The forward selection model for this sample retained only the intercept and didn’t select any additional variables.

Backward Selection Model

## 
## Call:  glm(formula = subscribed ~ education + default + month + duration + 
##     pdays + previous + cons_price_idx + nr_employed, family = binomial(link = "logit"), 
##     data = df_samp)
## 
## Coefficients:
##                  (Intercept)             educationbasic.6y  
##                   121.971191                     -0.523206  
##            educationbasic.9y          educationhigh.school  
##                    -1.099349                     -0.177028  
##          educationilliterate  educationprofessional.course  
##                    -9.414733                      0.152205  
##   educationuniversity.degree              educationunknown  
##                    -0.132823                     -0.357035  
##               defaultunknown                      monthaug  
##                    -0.545339                      0.838824  
##                     monthdec                      monthjul  
##                     1.058423                      1.013807  
##                     monthjun                      monthmar  
##                     1.204347                      1.404514  
##                     monthmay                      monthnov  
##                    -0.280470                      0.227730  
##                     monthoct                      monthsep  
##                     0.898014                     -0.446311  
##                     duration                         pdays  
##                     0.005385                     -0.002525  
##                     previous                cons_price_idx  
##                    -0.513262                     -0.525413  
##                  nr_employed  
##                    -0.014507  
## 
## Degrees of Freedom: 2999 Total (i.e. Null);  2977 Residual
## Null Deviance:       2125 
## Residual Deviance: 1189  AIC: 1235
##                  (Intercept)            educationbasic.6y 
##                121.971190745                 -0.523206007 
##            educationbasic.9y         educationhigh.school 
##                 -1.099348945                 -0.177027712 
##          educationilliterate educationprofessional.course 
##                 -9.414733268                  0.152204675 
##   educationuniversity.degree             educationunknown 
##                 -0.132823298                 -0.357035344 
##               defaultunknown                     monthaug 
##                 -0.545338777                  0.838823569 
##                     monthdec                     monthjul 
##                  1.058423157                  1.013806938 
##                     monthjun                     monthmar 
##                  1.204346934                  1.404513778 
##                     monthmay                     monthnov 
##                 -0.280469676                  0.227729907 
##                     monthoct                     monthsep 
##                  0.898014003                 -0.446311360 
##                     duration                        pdays 
##                  0.005384822                 -0.002524750 
##                     previous               cons_price_idx 
##                 -0.513261646                 -0.525412663 
##                  nr_employed 
##                 -0.014506517

The backward selection model starting with the full model (ignoring problematic variables) included education, month, cons_price_idx, pdays, previous, and nr_employed.

Removed Default

Due to the high bias towards default records with a value of “no”, we’ll remove the “default” variable and create the model again.

## 
## Call:  glm(formula = subscribed ~ education + month + duration + poutcome + 
##     cons_price_idx + nr_employed, family = binomial(link = "logit"), 
##     data = df_samp)
## 
## Coefficients:
##                  (Intercept)             educationbasic.6y  
##                   126.705782                     -0.541767  
##            educationbasic.9y          educationhigh.school  
##                    -1.067692                     -0.115977  
##          educationilliterate  educationprofessional.course  
##                    -9.512626                      0.202520  
##   educationuniversity.degree              educationunknown  
##                    -0.077990                     -0.407718  
##                     monthaug                      monthdec  
##                     0.931334                      1.001819  
##                     monthjul                      monthjun  
##                     1.091719                      1.247449  
##                     monthmar                      monthmay  
##                     1.380907                     -0.290425  
##                     monthnov                      monthoct  
##                     0.248568                      0.980236  
##                     monthsep                      duration  
##                    -0.531962                      0.005399  
##          poutcomenonexistent               poutcomesuccess  
##                     0.699466                      2.590449  
##               cons_price_idx                   nr_employed  
##                    -0.571809                     -0.015238  
## 
## Degrees of Freedom: 2999 Total (i.e. Null);  2978 Residual
## Null Deviance:       2125 
## Residual Deviance: 1187  AIC: 1231

The model has selected most of the same variables as have been selected by prior models, and has a comparable AIC to the other models we have created.

##                     GVIF Df GVIF^(1/(2*Df))
## age             2.372570  1        1.540315
## job             7.950828 11        1.098823
## marital         1.736930  3        1.096387
## education       3.651245  7        1.096918
## loan            1.109466  2        1.026310
## contact         2.211694  1        1.487177
## month           8.675513  9        1.127528
## day_of_week     1.244106  4        1.027678
## duration        1.405547  1        1.185558
## campaign        1.108716  1        1.052956
## pdays           8.578793  1        2.928958
## previous        5.193618  1        2.278951
## poutcome       18.909607  2        2.085310
## cons_price_idx  2.141863  1        1.463511
## cons_conf_idx   2.706542  1        1.645157
## nr_employed     2.395187  1        1.547639

Identify Problem Variables

When attempting to create interaction based models from the variables above, we’ve encountered errors indicating that our model is perfectly predicting the probability of a customer being a subscriber or not, indicating some issues in our variable selection. We’ll run an alternative Logistic Regression package using the same variables to see if an issue is flagged.

LRM Modeling

## singular information matrix in lrm.fit (rank= 46 ).  Offending variable(s):
## cons_price_idx
## singular information matrix in lrm.fit (rank= 45 ).  Offending variable(s):
## nr_employed

This may be the source of error. Let’s eliminate these con_price_idx and the nr_employed variables!

## Logistic Regression Model
##  
##  lrm(formula = subscribed ~ ., data = df_model, x = TRUE, y = TRUE)
##  
##                        Model Likelihood    Discrimination    Rank Discrim.    
##                              Ratio Test           Indexes          Indexes    
##  Obs         3000    LR chi2     867.40    R2       0.495    C       0.926    
##   0          2659    d.f.            46    g        2.038    Dxy     0.852    
##   1           341    Pr(> chi2) <0.0001    gr       7.677    gamma   0.852    
##  max |deriv|  0.2                          gp       0.161    tau-a   0.172    
##                                            Brier    0.063                     
##  
##                                Coef    S.E.    Wald Z Pr(>|Z|)
##  Intercept                     -0.7317  1.5660 -0.47  0.6403  
##  age                            0.0308  0.0092  3.34  0.0008  
##  job=blue-collar                0.4443  0.2958  1.50  0.1330  
##  job=entrepreneur               0.1997  0.4214  0.47  0.6355  
##  job=housemaid                  0.9065  0.4723  1.92  0.0549  
##  job=management                -0.5458  0.3635 -1.50  0.1333  
##  job=retired                   -0.2040  0.4319 -0.47  0.6367  
##  job=self-employed              0.7052  0.3674  1.92  0.0550  
##  job=services                   0.2756  0.3382  0.81  0.4151  
##  job=student                    1.3052  0.4110  3.18  0.0015  
##  job=technician                -0.0314  0.2679 -0.12  0.9067  
##  job=unemployed                 0.4635  0.4753  0.98  0.3295  
##  job=unknown                    0.0580  0.8750  0.07  0.9471  
##  marital=married                0.5690  0.2624  2.17  0.0301  
##  marital=single                 0.7159  0.2965  2.41  0.0158  
##  marital=unknown                0.9166  1.2592  0.73  0.4667  
##  education=basic.6y            -0.4472  0.4279 -1.05  0.2960  
##  education=basic.9y            -0.9102  0.3497 -2.60  0.0093  
##  education=high.school          0.1370  0.3194  0.43  0.6679  
##  education=illiterate          -5.9228 69.3918 -0.09  0.9320  
##  education=professional.course  0.5054  0.3385  1.49  0.1354  
##  education=university.degree    0.3089  0.3189  0.97  0.3327  
##  education=unknown             -0.1100  0.4509 -0.24  0.8073  
##  default=unknown               -0.9571  0.2445 -3.91  <0.0001 
##  loan=unknown                  -0.1269  0.4927 -0.26  0.7968  
##  loan=yes                      -0.0593  0.2132 -0.28  0.7809  
##  contact=telephone             -1.9149  0.2685 -7.13  <0.0001 
##  month=aug                     -1.1636  0.4073 -2.86  0.0043  
##  month=dec                      1.2440  0.8212  1.51  0.1298  
##  month=jul                     -0.9443  0.3463 -2.73  0.0064  
##  month=jun                      0.7912  0.3654  2.17  0.0303  
##  month=mar                      1.7236  0.5148  3.35  0.0008  
##  month=may                     -0.4452  0.3119 -1.43  0.1534  
##  month=nov                     -1.0444  0.3770 -2.77  0.0056  
##  month=oct                      1.0012  0.5762  1.74  0.0823  
##  month=sep                     -0.3826  0.6228 -0.61  0.5390  
##  day_of_week=mon                0.0960  0.2468  0.39  0.6975  
##  day_of_week=thu                0.1370  0.2471  0.55  0.5794  
##  day_of_week=tue                0.1361  0.2535  0.54  0.5915  
##  day_of_week=wed                0.0589  0.2552  0.23  0.8174  
##  duration                       0.0051  0.0003 17.54  <0.0001 
##  campaign                      -0.0553  0.0396 -1.40  0.1628  
##  pdays                         -0.0015  0.0007 -2.02  0.0432  
##  previous                       0.1032  0.2381  0.43  0.6648  
##  poutcome=nonexistent           0.4289  0.3767  1.14  0.2548  
##  poutcome=success               1.6221  0.7239  2.24  0.0250  
##  cons_conf_idx                  0.0799  0.0223  3.58  0.0003  
## 
##                           age               job=blue-collar 
##                      2.242453                      2.337542 
##              job=entrepreneur                 job=housemaid 
##                      1.189413                      1.257750 
##                job=management                   job=retired 
##                      1.231424                      1.861339 
##             job=self-employed                  job=services 
##                      1.187605                      1.420354 
##                   job=student                job=technician 
##                      1.415295                      1.742777 
##                job=unemployed                   job=unknown 
##                      1.154653                      1.097757 
##               marital=married                marital=single 
##                      3.031142                      3.485868 
##               marital=unknown            education=basic.6y 
##                      1.081056                      1.422209 
##            education=basic.9y         education=high.school 
##                      1.814134                      3.143285 
##          education=illiterate education=professional.course 
##                      1.000023                      2.610486 
##   education=university.degree             education=unknown 
##                      4.018167                      1.519228 
##               default=unknown                  loan=unknown 
##                      1.125831                      1.041478 
##                      loan=yes             contact=telephone 
##                      1.049369                      2.050293 
##                     month=aug                     month=dec 
##                      4.236623                      1.337778 
##                     month=jul                     month=jun 
##                      2.776786                      2.693257 
##                     month=mar                     month=may 
##                      1.464515                      3.244309 
##                     month=nov                     month=oct 
##                      2.266156                      1.998685 
##                     month=sep               day_of_week=mon 
##                      1.550030                      1.832020 
##               day_of_week=thu               day_of_week=tue 
##                      1.858854                      1.856185 
##               day_of_week=wed                      duration 
##                      1.809375                      1.282285 
##                      campaign                         pdays 
##                      1.088236                      8.868305 
##                      previous          poutcome=nonexistent 
##                      4.764565                      4.251087 
##              poutcome=success                 cons_conf_idx 
##                      7.242318                      2.478904

After having removed the problem variables, LRM() allowed us to create logistic regression models, which has likely found the source of the error.

Refit with Clean Data

## 
## Call:
## glm(formula = subscribed ~ ., family = binomial(link = "logit"), 
##     data = df_model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4765  -0.3276  -0.2094  -0.1255   3.3265  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -7.317e-01  1.566e+00  -0.467 0.640327    
## age                           3.080e-02  9.208e-03   3.345 0.000823 ***
## jobblue-collar                4.443e-01  2.958e-01   1.502 0.133023    
## jobentrepreneur               1.997e-01  4.214e-01   0.474 0.635549    
## jobhousemaid                  9.065e-01  4.723e-01   1.919 0.054930 .  
## jobmanagement                -5.458e-01  3.635e-01  -1.501 0.133280    
## jobretired                   -2.040e-01  4.319e-01  -0.472 0.636725    
## jobself-employed              7.052e-01  3.674e-01   1.919 0.054964 .  
## jobservices                   2.756e-01  3.382e-01   0.815 0.415107    
## jobstudent                    1.305e+00  4.110e-01   3.176 0.001494 ** 
## jobtechnician                -3.139e-02  2.679e-01  -0.117 0.906718    
## jobunemployed                 4.635e-01  4.753e-01   0.975 0.329461    
## jobunknown                    5.802e-02  8.750e-01   0.066 0.947133    
## maritalmarried                5.690e-01  2.624e-01   2.168 0.030127 *  
## maritalsingle                 7.159e-01  2.965e-01   2.414 0.015773 *  
## maritalunknown                9.166e-01  1.259e+00   0.728 0.466679    
## educationbasic.6y            -4.472e-01  4.279e-01  -1.045 0.295995    
## educationbasic.9y            -9.102e-01  3.497e-01  -2.603 0.009251 ** 
## educationhigh.school          1.370e-01  3.194e-01   0.429 0.667882    
## educationilliterate          -1.026e+01  3.673e+02  -0.028 0.977722    
## educationprofessional.course  5.054e-01  3.385e-01   1.493 0.135446    
## educationuniversity.degree    3.089e-01  3.189e-01   0.969 0.332710    
## educationunknown             -1.100e-01  4.509e-01  -0.244 0.807324    
## defaultunknown               -9.571e-01  2.445e-01  -3.915 9.06e-05 ***
## loanunknown                  -1.269e-01  4.927e-01  -0.257 0.796814    
## loanyes                      -5.931e-02  2.132e-01  -0.278 0.780856    
## contacttelephone             -1.915e+00  2.685e-01  -7.132 9.92e-13 ***
## monthaug                     -1.164e+00  4.073e-01  -2.857 0.004280 ** 
## monthdec                      1.244e+00  8.212e-01   1.515 0.129826    
## monthjul                     -9.443e-01  3.463e-01  -2.727 0.006396 ** 
## monthjun                      7.912e-01  3.654e-01   2.166 0.030346 *  
## monthmar                      1.724e+00  5.148e-01   3.348 0.000814 ***
## monthmay                     -4.452e-01  3.119e-01  -1.428 0.153431    
## monthnov                     -1.044e+00  3.770e-01  -2.770 0.005599 ** 
## monthoct                      1.001e+00  5.762e-01   1.738 0.082272 .  
## monthsep                     -3.826e-01  6.228e-01  -0.614 0.539018    
## day_of_weekmon                9.596e-02  2.468e-01   0.389 0.697471    
## day_of_weekthu                1.370e-01  2.471e-01   0.554 0.579433    
## day_of_weektue                1.361e-01  2.535e-01   0.537 0.591513    
## day_of_weekwed                5.892e-02  2.552e-01   0.231 0.817394    
## duration                      5.064e-03  2.886e-04  17.545  < 2e-16 ***
## campaign                     -5.528e-02  3.960e-02  -1.396 0.162751    
## pdays                        -1.516e-03  7.499e-04  -2.022 0.043198 *  
## previous                      1.032e-01  2.381e-01   0.433 0.664826    
## poutcomenonexistent           4.289e-01  3.767e-01   1.139 0.254841    
## poutcomesuccess               1.622e+00  7.239e-01   2.241 0.025043 *  
## cons_conf_idx                 7.994e-02  2.234e-02   3.579 0.000345 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2124.7  on 2999  degrees of freedom
## Residual deviance: 1257.3  on 2953  degrees of freedom
## AIC: 1351.3
## 
## Number of Fisher Scoring iterations: 12

After removing the problematic variables identified with the LRM() function, we fit a LRM() model based on the cleaned dataset and identify several important variables:

  1. Cons_conf_idx
  2. pdays
  3. duration
  4. monthmar
  5. contacttelephone
  6. defaultunknown

We’ll now refit a backward stepwise regression model on the data using stepAIC to identify what variables remain.

3) Using the stepAIC() function from package MASS, fit several additional models for comparison (see

?MASS::stepAIC for additional details):

## 
## Call:  glm(formula = subscribed ~ age + marital + education + default + 
##     contact + month + duration + campaign + pdays + poutcome + 
##     cons_conf_idx, family = binomial(link = "logit"), data = df_model)
## 
## Coefficients:
##                  (Intercept)                           age  
##                     0.542432                      0.021631  
##               maritalmarried                 maritalsingle  
##                     0.527857                      0.743832  
##               maritalunknown             educationbasic.6y  
##                     0.941990                     -0.464775  
##            educationbasic.9y          educationhigh.school  
##                    -0.933969                     -0.065225  
##          educationilliterate  educationprofessional.course  
##                   -10.138866                      0.209167  
##   educationuniversity.degree              educationunknown  
##                    -0.049793                     -0.173080  
##               defaultunknown              contacttelephone  
##                    -0.905514                     -1.882243  
##                     monthaug                      monthdec  
##                    -1.119470                      1.254650  
##                     monthjul                      monthjun  
##                    -0.867890                      0.846824  
##                     monthmar                      monthmay  
##                     1.621747                     -0.418215  
##                     monthnov                      monthoct  
##                    -1.036661                      0.872815  
##                     monthsep                      duration  
##                    -0.185974                      0.004985  
##                     campaign                         pdays  
##                    -0.063613                     -0.001729  
##          poutcomenonexistent               poutcomesuccess  
##                     0.272594                      1.354891  
##                cons_conf_idx  
##                     0.080867  
## 
## Degrees of Freedom: 2999 Total (i.e. Null);  2971 Residual
## Null Deviance:       2125 
## Residual Deviance: 1280  AIC: 1338
## 
## Call:  glm(formula = subscribed ~ education + month + duration + poutcome + 
##     cons_price_idx + nr_employed, family = binomial(link = "logit"), 
##     data = df_samp)
## 
## Coefficients:
##                  (Intercept)             educationbasic.6y  
##                   126.705782                     -0.541767  
##            educationbasic.9y          educationhigh.school  
##                    -1.067692                     -0.115977  
##          educationilliterate  educationprofessional.course  
##                    -9.512626                      0.202520  
##   educationuniversity.degree              educationunknown  
##                    -0.077990                     -0.407718  
##                     monthaug                      monthdec  
##                     0.931334                      1.001819  
##                     monthjul                      monthjun  
##                     1.091719                      1.247449  
##                     monthmar                      monthmay  
##                     1.380907                     -0.290425  
##                     monthnov                      monthoct  
##                     0.248568                      0.980236  
##                     monthsep                      duration  
##                    -0.531962                      0.005399  
##          poutcomenonexistent               poutcomesuccess  
##                     0.699466                      2.590449  
##               cons_price_idx                   nr_employed  
##                    -0.571809                     -0.015238  
## 
## Degrees of Freedom: 2999 Total (i.e. Null);  2978 Residual
## Null Deviance:       2125 
## Residual Deviance: 1187  AIC: 1231
## 
## Call:  glm(formula = subscribed ~ 1, family = binomial(link = "logit"), 
##     data = na.omit(df_samp))
## 
## Coefficients:
## (Intercept)  
##      -2.054  
## 
## Degrees of Freedom: 2999 Total (i.e. Null);  2999 Residual
## Null Deviance:       2125 
## Residual Deviance: 2125  AIC: 2127

The models created by the stepwise functions relative to our investigative models have emphasized month variables, education, the consumer price and conference indices, nr_employed, poutcome, and duration, much in line with our initial EDA.

Tree Check

We’ll create another classification tree on the data that we’ve selected based on the above modeling process in order to verify that we don’t have any perfectly separable variables included in the data.

We still have essentially the same tree as before, so we don’t have to worry about perfect information in our dataset when moving forward.

Explore models that include 2-way interaction effects

There are several ways to do this. For example, in forward selection, you can allow for 2-way interaction effects in stepAIC() by using the scope argument (for details, read the documentation in ?MASS::stepAIC or post a question to the Teams channel—I’d be happy to provide example code). Does it seem necessary or beneficial to include any 2-way interaction effects in this example? If so, which interaction effects seem relevant and predictive?

We have decided to employe a 2nd degree MARS model for regression to identify 2nd degree interactions.

## Call: earth(formula=subscribed~., data=na.omit(df_model),
##             glm=list(family=binomial), degree=2)
## 
## GLM coefficients
##                                                      1
## (Intercept)                                  7.9840207
## poutcomesuccess                              1.5799340
## h(1490-duration)                            -0.0038909
## h(-36.4-cons_conf_idx)                      -0.5448745
## jobservices * poutcomesuccess               -5.4751453
## educationbasic.9y * pdays                   -0.0007706
## educationbasic.9y * poutcomesuccess        -15.5059097
## educationunknown * poutcomesuccess          -2.0515171
## monthaug * pdays                             0.0023956
## monthjun * pdays                             0.0090869
## monthjun * poutcomesuccess                   5.3792121
## monthoct * pdays                             0.0030279
## previous * poutcomesuccess                   0.8196176
## h(40-age) * pdays                            0.0000766
## h(age-40) * pdays                            0.0000530
## jobtechnician * h(duration-1490)            -0.0480059
## defaultunknown * h(-36.4-cons_conf_idx)     -0.0942493
## contacttelephone * h(cons_conf_idx- -36.4)  -0.5270565
## monthaug * h(1490-duration)                 -0.0022602
## monthaug * h(-36.4-cons_conf_idx)            1.3467500
## monthjun * h(-36.4-cons_conf_idx)           -1.6041637
## monthmar * h(-36.4-cons_conf_idx)            0.1876522
## day_of_weekmon * h(cons_conf_idx- -36.4)    -0.2362459
## h(305-duration) * pdays                     -0.0000045
## pdays * h(1-previous)                        0.0006118
## pdays * h(cons_conf_idx- -47.1)             -0.0007432
## pdays * h(-47.1-cons_conf_idx)               0.0007122
## pdays * h(cons_conf_idx- -36.4)              0.0029888
## pdays * h(cons_conf_idx- -33.6)             -0.0021683
## poutcomesuccess * h(cons_conf_idx- -46.2)   -1.0258409
## poutcomesuccess * h(-46.2-cons_conf_idx)     1.5758089
## poutcomesuccess * h(cons_conf_idx- -42)      1.5395490
## h(32-age) * h(-36.4-cons_conf_idx)          -0.0200820
## h(age-60) * h(cons_conf_idx- -36.4)         -0.0172104
## h(duration-104) * h(cons_conf_idx- -36.4)   -0.0082063
## h(duration-181) * h(cons_conf_idx- -36.4)    0.0894204
## h(duration-193) * h(cons_conf_idx- -36.4)   -0.1039647
## h(214-duration) * h(cons_conf_idx- -36.4)   -0.0044517
## h(duration-225) * h(cons_conf_idx- -36.4)    0.0225000
## h(567-duration) * h(-36.4-cons_conf_idx)    -0.0002915
## h(2-campaign) * h(-36.4-cons_conf_idx)       0.0574912
## 
## GLM (family binomial, link logit):
##  nulldev   df       dev   df   devratio     AIC iters converged
##  2124.68 2999   1073.38 2959      0.495    1155    14         1
## 
## Earth selected 41 of 56 terms, and 18 of 46 predictors
## Termination condition: RSq changed by less than 0.001 at 56 terms
## Importance: duration, pdays, cons_conf_idx, poutcomesuccess, monthoct, ...
## Number of terms at each degree of interaction: 1 3 37
## Earth GCV 0.05864098    RSS 164.2769    GRSq 0.4183236    RSq 0.4564683

It looks like the most important interactions in the model exist between age, duration of call, cons_conf_idx, job, month, and day_of_week, which covers most of the dataset that remains after the investigative process we underwent. The interactions of the cons_conf_idx with the variables of campaign and age seem intriguing, as the timing and economic conditions associated with the time that a campaign was orchestrated can be influential to how we schedule marketing campaigns going forward.

##       predicted
## actual   no  yes
##      0 2600   59
##      1  161  180
## [1] 0.9778112
## [1] 0.5278592

## [1] 0.9778112
## [1] 0.9740504

The MARS model has a higher sensitivity than the stepwise fitted model, therefore I’ll use that model to make predictions for the entire dataset. The most important variables to the MARS Models were “Pdays”, “Duration”, “cons_conf_idx”, “monthaug”, “contacttelephone”, “age”, “campaign”, “monthoct”, “jobtechnician, and”previous".

6) Leakage

One form of data leakage occurs when your model uses (predictor) information that would not be available at prediction time (e.g., when you apply the model to future cases). Carefully look over the description of each predictor in this data set again? Do any of the models you created above contain leakage? If so, how could you remedy the situation?

Our datasets inclusion of the “Duration” variable is problematic as we won’t know the duration of a call before contacting somebody, therefore we’ll fit the model again without the “Duration” variable and then make predictions on the new dataset.

# Fit a degree-2 MARS model (i.e., allow up to 2-way interaction effects)
lr.mars.leakage <- earth(subscribed ~ . -duration, data = na.omit(df_model), 
                 degree = 2, glm = list(family = binomial))
## Call: earth(formula=subscribed~.-duration, data=na.omit(df_full),
##             glm=list(family=binomial), degree=2)
## 
## GLM coefficients
##                                                    1
## (Intercept)                              -0.41619053
## contacttelephone                         -0.60634122
## pdays                                    -0.00575490
## h(47-age)                                 0.01335400
## h(age-47)                                 0.02115947
## h(-46.2-cons_conf_idx)                    0.28208192
## h(cons_conf_idx- -46.2)                   0.15439871
## h(cons_conf_idx- -34.6)                  -0.28362340
## defaultunknown * h(-46.2-cons_conf_idx)  -0.66971320
## monthaug * h(cons_conf_idx- -46.2)       -0.04127816
## day_of_weekmon * h(-46.2-cons_conf_idx)  -0.27817771
## day_of_weekmon * h(cons_conf_idx- -34.6) -0.12900504
## pdays * h(previous-2)                    -0.00036609
## pdays * h(2-previous)                     0.00040599
## pdays * h(cons_conf_idx- -36.4)           0.00356994
## pdays * h(-36.4-cons_conf_idx)            0.00031420
## pdays * h(cons_conf_idx- -37.5)          -0.00208744
## pdays * h(cons_conf_idx- -42)             0.00201707
## pdays * h(cons_conf_idx- -34.6)          -0.00139170
## pdays * h(cons_conf_idx- -40.4)          -0.00204563
## 
## GLM (family binomial, link logit):
##  nulldev    df       dev    df   devratio     AIC iters converged
##    26099 37069   20434.5 37050      0.217   20470     6         1
## 
## Earth selected 20 of 20 terms, and 8 of 46 predictors
## Termination condition: RSq changed by less than 0.001 at 20 terms
## Importance: pdays, cons_conf_idx, contacttelephone, monthaug, previous, ...
## Number of terms at each degree of interaction: 1 7 12
## Earth GCV 0.07841244    RSS 2899.148    GRSq 0.2156145    RSq 0.2176234
##       predicted
## actual    no   yes
##      0 32264   630
##      1  3102  1074
## [1] 0.9808476
## [1] 0.2571839

The model we have fitted it better at predicting positive results than negative, as it has a Sensitivity of 0.98 and a Specificity of 0.25.

7) Deployment

Based on all of the results you’ve uncovered so far, fit a final model to the bank_trn.csv data that does not contain any leakage variables (it may include interaction effects if you found them to be important earlier). Use this model to score the bank_new.csv data (i.e., for each row in bank_new.csv, provide the estimated conditional probability of subscribing to a term deposit). How accurate is your model on bank_new.csv? Do the probabilities appear to be well-calibrated? Provide a calibration curve for the predictions (in practice you would not be able to do this since the value of subscribed would be unknown at this time).

We’ll bring in our new dataset and drop the variables dropped during the prior analysis.

##       predicted
## actual   no  yes
##      0 3588   66
##      1  357  107
## [1] 0.9819376
## [1] 0.2306034

##           Dxy       C (ROC)            R2             D      D:Chi-sq 
##  5.770477e-01  7.885239e-01  2.669560e-01  1.447116e-01  5.969225e+02 
##           D:p             U      U:Chi-sq           U:p             Q 
##            NA -4.776162e-04  3.317640e-02  9.835486e-01  1.451892e-01 
##         Brier     Intercept         Slope          Emax           E90 
##  7.955021e-02  2.177396e-04  9.953881e-01  5.945939e-02  1.845727e-02 
##          Eavg           S:z           S:p 
##  8.505070e-03  3.092930e-01  7.570986e-01

The calibration curve for our model indicates a generally good fit for our data.

##       predicted
## actual   no  yes
##      0 3588   66
##      1  357  107
## [1] 0.9819376
## [1] 0.2306034

Data Dictionary

Variable Class description
age numeric Customer age
job categorical type of job
marital categorical marital status
education categorical Level of Education: “basic.4y”, “basic.6y”, “basic.9y”, “high.school”, “illiterate”, “professional.course”, “university.degree”, “unknown”
default categorical has credit in default
housing categorical has housing loan
loan categorical has personal loan
contact categorical contact communication type
month categorical last contact month of year
day_of_week categorical last contact day of the week
emp.var.rate numeric employment variation rate
cons.price.idx numeric consumer price index - monthly indicator
cons.conf.idx numeric consumer confidence index - monthly indicator
euribor3m numeric euribor 3 month rate - daily indicator
nr.employed numeric number of employees - quarterly indicator
campaign numeric number of contacts performed during this campaign and for this client
pdays numeric number of days that passed by after the client was last contacted from a previous campaign
previous numeric number of contacts performed before this campaign and for this client
poutcome categorical outcome of the previous marketing campaign
subscribed binary has the client subscribed a term deposit?

Full Code

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(echo = TRUE)
pkgs <- c(
  "kableExtra",  # for creating nice tables
  "magrittr",
  "rpart",
  "rpart.plot",
  "vip",
  "car"
)

# Install required (CRAN) packages
for (pkg in pkgs) {
  if (!(pkg %in% installed.packages()[, "Package"])) {
    install.packages(pkg)
  }
}

# Helper packages
library(kableExtra)
library(magrittr)
library(rpart)
library(rpart.plot)
library(vip)
library(car)
library(earth)


setwd("C:/Users/John Trygier/Documents/Schoolwork/Graduate/Spring 2022/Statistical Modeling/Project/HW2")

df_trn <- read.csv("bank_trn.csv")

df_trn$subscribed <- ifelse(df_trn$subscribed == "yes", 1,0)
df_trn$subscribed <- as.factor(df_trn$subscribed)

set.seed(1951)  # for reproducibility
samp <- sample.int(nrow(df_trn), size = 3000, replace = FALSE)
df_samp <- df_trn[samp, ]

(redun <- Hmisc::redun(~., data=df_samp, nk = 0))
# housing, emp_var_rate, and euribor3m are problematic


redun$Out

df_samp <- df_samp[, -which(names(df_samp) %in% redun$Out)]
(redun <- Hmisc::redun(~., data=df_samp, nk = 0))

summary(df_samp) %>% 
  kbl(caption = "Summary Table for Sampled Bank Training Data") %>%
  kable_styling(full_width = F) %>% 
  kable_classic(full_width = F, html_font = "Cambria") %>% 
  scroll_box(width = "1000px", height = "400px")

par(mfrow = c(1,2))
hist(df_samp$age, main = "Distribution of Age", xlab = "Age")
hist(df_samp$duration, main = "Distribution of Duration", xlab = "Duration")

par(mfrow = c(2,2))
barplot(table(df_samp$default), main = "Frequency of Default Status")
barplot(table(df_samp$marital), main = "Marital Status")
barplot(table(df_samp$loan), main = "Frequency by Loan Status")
barplot(table(df_samp$contact), main = "Distribution by Contact Status")

par(mfrow = c(1,2))

barplot(table(df_samp$pdays), main = "Distribution by PDays")
barplot(table(df_samp$previous), main = "Distribution by Previous")

par(mfrow = c(1,2))
hist(df_samp$nr_employed, main = "Distribution of Number Employed")
barplot(table(df_samp$subscribed), main = "Frequency by Subscribed Status")

subsc_dt <- rpart(
  formula = subscribed ~ .,
  data    = df_samp,
  method  = "class"
)
rpart.plot(subsc_dt)
plotcp(subsc_dt)


subsc_dt2 <- rpart(
  formula = subscribed ~ .,
  data    = df_samp,
  method  = "class", 
  control = list(cp = 0, xval = 10)
)

plotcp(subsc_dt2)
abline(v = 11, lty = "dashed")

vip(subsc_dt, num_features = 10, bar = FALSE)

# Look at correlations between numeric features
num <- sapply(df_samp, FUN = is.numeric)  # identify numeric columns
corx <- cor(df_samp[, num], use = "pairwise.complete.obs")  # simple correlation matrix

# Visualize correlations; can be useful if you have a lot of features
corrplot::corrplot(corx, method = "square", order = "FPC", type = "lower", diag = TRUE)

lr.fit.full <- glm(subscribed ~ .,family = binomial(link = "logit"), data = df_samp)

# Print a summary of the fitted model
summary(lr.fit.full)
vif(lr.fit.full)

lr.fit.2 <- glm(subscribed ~ . -poutcome,family = binomial(link = "logit"), data = df_samp)

# Print a summary of the fitted model
summary(lr.fit.2)
vif(lr.fit.2)

df_samp <- df_samp[, -which(names(df_samp) %in% c("poutcome"))]

library(caret)
y <- na.omit(df_samp)$subscribed  # observed classes
prob <- predict(lr.fit.2, type = "response")  # predicted probabilities
classes <- ifelse(prob > 0.5, "yes", "no")  # classification based on 0.5 threshold
(cm <- table("actual" = y, "predicted" = classes))  # confusion matrix
classes <- ifelse(prob > 0.5, 1, 0)
classes <- as.factor(classes)
(sens_lr_fit2 <- sensitivity(classes, y))
(spec_lr_fit2 <- specificity(classes, y))

lr.fit.init <- glm(subscribed ~ duration + nr_employed + cons_conf_idx + cons_price_idx + month + pdays + job,family = binomial(link = "logit"), data = df_samp)
(lr.fit.both <- MASS::stepAIC(lr.fit.init, direction = "both", trace = 0))


y <- na.omit(df_samp)$subscribed  # observed classes
prob <- predict(lr.fit.both, type = "response")  # predicted probabilities
classes <- ifelse(prob > 0.5, "yes", "no")  # classification based on 0.5 threshold
(cm <- table("actual" = y, "predicted" = classes))  # confusion matrix
classes <- ifelse(prob > 0.5, 1, 0)
classes <- as.factor(classes)
(sens_mars <- sensitivity(classes, y))
(spec_mars <- specificity(classes, y))

bs <- function(y, prob) {  # this is a relative measure, like log loss/binomial deviance, SSE/RMSE, AIC/BIC, etc.
  mean((prob - y) ^ 2)  # y should be in {0, 1}
}
# Compute Brier score for previous models
y01 <- ifelse(y == "yes", 1, 0)
fit0 <- glm(subscribed ~ 1, data = na.omit(df_samp), family = binomial(link = "logit"))
bs(y01, prob = predict(fit0, type = "response"))
bs(y01, prob = predict(lr.fit.full, type = "response"))
bs(y01, prob = predict(lr.fit.init, type = "response"))
bs(y01, prob = predict(lr.fit.both, type = "response"))

lr.null <- glm(subscribed ~ 1, data = na.omit(df_samp), family = binomial(link = "logit"))
(lr.fit.both <- MASS::stepAIC(lr.null, direction = "forward", trace = 0))

(lr.fit.back <- MASS::stepAIC(lr.fit.2, direction = "backward", trace = 0))
lr.fit.back$coefficients

lr.fit.2 <- glm(subscribed ~ . -default,family = binomial(link = "logit"), data = df_samp)

(lr.fit.back <- MASS::stepAIC(lr.fit.2, direction = "backward", trace = 0))

vif(lr.fit.2)

library(rms)
fit <- lrm(subscribed ~ age + job + marital + education + default + loan + contact + month + day_of_week + duration + campaign + pdays + previous + cons_price_idx +cons_conf_idx + nr_employed,
           data = df_samp,
           x=TRUE, y=TRUE)

fit <- lrm(subscribed ~ age + job + marital + education + default + loan + contact + month + day_of_week + duration + campaign + pdays + previous + cons_conf_idx + nr_employed,
           data = df_samp,
           x=TRUE, y=TRUE)

df_model <- df_samp[, -which(names(df_samp) %in% c("cons_price_idx", "nr_employed"))]
fit <- lrm(subscribed ~ .,
           data = df_model,
           x=TRUE, y=TRUE)

(fit)

vif(fit)

lr.fit.clean <- glm(subscribed ~ .,family = binomial(link = "logit"), data = df_model)

# Print a summary of the fitted model
summary(lr.fit.clean)

(lr.fit.back.clean <- MASS::stepAIC(lr.fit.clean, direction = "backward", trace = 0))
subsc_dt <- rpart(
  formula = subscribed ~ .,
  data    = df_model,
  method  = "class"
)
rpart.plot(subsc_dt)
plotcp(subsc_dt)


subsc_dt2 <- rpart(
  formula = subscribed ~ .,
  data    = df_model,
  method  = "class", 
  control = list(cp = 0, xval = 10)
)

plotcp(subsc_dt2)
abline(v = 11, lty = "dashed")
# Fit a degree-2 MARS model (i.e., allow up to 2-way interaction effects)
lr.mars <- earth(subscribed ~ ., data = na.omit(df_model), 
                 degree = 2, glm = list(family = binomial))

# Print summary of model fit
summary(lr.mars)

y <- na.omit(df_samp)$subscribed  # observed classes
prob <- predict(lr.mars, type = "response")  # predicted probabilities
classes <- ifelse(prob > 0.5, "yes", "no")  # classification based on 0.5 threshold
(cm <- table("actual" = y, "predicted" = classes))  # confusion matrix
library(caret)
classes <- ifelse(prob > 0.5, 1, 0)
classes <- as.factor(classes)
(sens_mars <- sensitivity(classes, y))
(spec_mars <- specificity(classes, y))
vip(lr.mars)

sens_mars
sens_lr_fit2
df_full <- df_trn[, c(names(df_trn) %in% names(df_model))]
# Fit a degree-2 MARS model (i.e., allow up to 2-way interaction effects)
lr.mars.full <- earth(subscribed ~ ., data = na.omit(df_full), 
                      degree = 2, glm = list(family = binomial))

# Print summary of model fit
summary(lr.mars.full)

y <- na.omit(df_full)$subscribed  # observed classes
prob <- predict(lr.mars.full, type = "response")  # predicted probabilities
classes <- ifelse(prob > 0.5, "yes", "no")  # classification based on 0.5 threshold
(cm <- table("actual" = y, "predicted" = classes))  # confusion matrix
library(caret)
classes <- ifelse(prob > 0.5, 1, 0)
classes <- as.factor(classes)
(sens_mars <- sensitivity(classes, y))
(spec_mars <- specificity(classes, y))

prob <- predict(lr.mars.leakage, newdata = df_new,  type = "response")  # predicted probabilities
y01 <- ifelse(df_new$subscribed == "1", 1, 0)

rms::val.prob(prob,y01)

df_new <- read.csv("bank_new.csv")
names <- names(df_model)
df_new$subscribed <- ifelse(df_new$subscribed == "yes", 1,0)
df_new$subscribed <- as.factor(df_new$subscribed)
(names.drop <- names(df_new)[!(names(df_new) %in% names)])
df_new <- df_new[ , -which(names(df_new) %in% names.drop)]

y <- na.omit(df_new)$subscribed  # observed classes
prob <- predict(lr.mars.full, newdata = df_new,  type = "response")  # predicted probabilities
classes <- ifelse(prob > 0.5, "yes", "no")  # classification based on 0.5 threshold

library(dplyr)
df_new$prob <- prob
df_new <- df_new[order(-prob),]
ideal_households <- head(df_new, 500)