2 Introduction

With this data set, the goal is to predict which customers will subscribe to a term deposit based on characteristics of the customer and the methods used by the campaign. The data set consists of 45,211 observations and 17 variables. No missing or null values were found in any of the variables. The variables and their descriptions are listed below:

1 - age: age of customer (numeric)

2 - job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”)

3 - marital : marital status (categorical: “married”, “divorced”, “single”; note: “divorced” means divorced or widowed)

4 - education (categorical: “unknown”, “secondary”, “primary”, “tertiary”)

5 - default: has credit in default? (binary: “yes”, “no”)

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: “yes”, “no”)

8 - loan: has personal loan? (binary: “yes”, “no”)

Related to the last contact of the current campaign:

9 - contact: contact communication type (categorical: “unknown”, “telephone”, “cellular”)

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)

12 - duration: last contact duration, in seconds (numeric)

Other attributes:

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”, “other”, “failure”, “success”)

Output variable:

17 - y: has the client subscribed to a term deposit? (binary: “yes”, “no”)


3 Exploratory Data Analysis and Feature Engineering for Numeric Variables

After finding no missing or null values, each variable's distribution was examined individually to find outliers. Boxplots of the numerical variables showed outliers in age, balance, duration, campaign, pdays, and previous.
Boxplots for EDA
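
The outlier check behind the boxplots can be sketched as follows. This is a minimal illustration that assumes the conventional 1.5 × IQR whisker rule; the report does not state which rule was used, and the function name `iqr_outliers` and the sample values are hypothetical. The quartiles come from Python's `statistics.quantiles`, which may differ slightly from the hinges a plotting library uses.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical values for illustration; not taken from the data set.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(iqr_outliers(sample))
```

Raising `k` makes the rule more tolerant, which can be useful for heavily right-skewed variables such as balance or duration.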

For each variable, normality was checked with a histogram and the p-values of normality tests. For age, the histogram was right-skewed because of the outliers, so a log transformation was applied to make the distribution more normal and reduce the influence of the outliers. The balance variable was also right-skewed, but because some of its values were negative or zero, the log transformation could not be used. Instead, the cube root of each value was taken. After the cube root transformation, the distribution was still right-skewed, but less so than before. The duration variable was right-skewed as well; after a log transformation, its distribution looked more normal and its outliers were reduced.
Histograms for EDA

Histograms for Transformed Features
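
The two transformations described above can be sketched as below. This is an illustration under assumptions: the natural log is assumed for age and duration (the base is not stated in the report), and the helper names `log_transform` and `signed_cube_root` are hypothetical. The cube root keeps the sign of its input, which is why it works for negative and zero balances where the log does not.

```python
import math


def log_transform(x):
    """Natural log; only defined for strictly positive values."""
    if x <= 0:
        raise ValueError("log transform needs x > 0")
    return math.log(x)


def signed_cube_root(x):
    """Cube root that preserves the sign, so x may be negative or zero."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


# Hypothetical values for illustration; not taken from the data set.
print(log_transform(35))       # an age in years
print(signed_cube_root(-500))  # a negative balance in euros
print(signed_cube_root(0))     # a zero balance
```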

The campaign variable was more difficult because none of the log, square root, or cube root transformations normalized its distribution. Instead, the variable was grouped into categories (1, 2, 3, and 4-63 contacts) to eliminate sparse values. For both pdays and previous, about 82% of customers (36,954 of 45,211) had not been previously contacted, so each variable was collapsed into two groups, not contacted and contacted. This also eliminated sparse groups.

## 
##     0    1+ 
## 36954  8257
## 
##    -1     0 
## 36954  8257
## 
##     1     2     3  4-63 
## 17544 12505  5521  9641
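
The grouping that produces these counts can be sketched as follows. The level labels match the output above, but the function names and the sample values are hypothetical, and this is not necessarily how the original grouping was coded.

```python
from collections import Counter


def group_campaign(n_contacts):
    """Collapse contact counts into the levels '1', '2', '3', and '4-63'."""
    return str(n_contacts) if n_contacts <= 3 else "4-63"


def group_previous(n_previous):
    """'0' means never contacted before; '1+' means contacted at least once."""
    return "0" if n_previous == 0 else "1+"


# Hypothetical values for illustration; not taken from the data set.
campaign = [1, 1, 2, 3, 5, 12, 63]
print(Counter(group_campaign(c) for c in campaign))
```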

Next, the numeric variables were compared pairwise in a scatterplot matrix, excluding the newly grouped variables campaign, previous, and pdays. The scatterplots all showed a similar pattern, with the red curve differing from the blue curve, indicating only a small correlation between the compared numeric variables. Because the pairwise correlations are low, the variables carry little redundant information, so all of them should be retained in subsequent models and algorithms.

Pairwise Plots
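
The correlation shown in each panel of a pairwise plot can be computed as in the sketch below. It assumes the Pearson coefficient, which is the usual default for such plots; the report does not name the coefficient, and the function name `pearson_r` and the sample values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical transformed values for illustration; not taken from the data set.
trans_age = [3.2, 3.5, 3.7, 3.9, 4.1]
new_bal = [5.0, -2.0, 9.5, 1.0, 7.0]
print(pearson_r(trans_age, new_bal))
```

A value near 0 for every pair supports keeping all of the numeric variables, since none of them is a near-linear copy of another.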


4 Exploratory Data Analysis and Feature Engineering for Categorical Variables

None of the categorical variables included any missing or null values, so the next step was to group variables with many categories to reduce the number of levels. Both job and month had over 10 different categories, so month was grouped by season rather than by month, and job was grouped into broader categories. The first new category, Unknown/Unemployed, included student, retired, unemployed, and unknown. The second, Blue-collar/Services, included blue-collar, services, housemaid, and technician. The last, Business, included admin., entrepreneur, management, and self-employed.

## 
##  Fall  Spri  Summ  Wint 
##  5287 17175 18483  4266
## 
## B.-Col.-Serv.          Bus.   Unk./Unemp. 
##         22723         17695          4793
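
The regrouping can be expressed as two lookup tables, sketched below. The level labels match the output above. The month-to-season assignment (March to May as spring, and so on) and the placement of blue-collar in the Blue-collar/Services group are assumptions, since the report does not spell them out; the dictionary names are hypothetical.

```python
# Assumed mapping: meteorological seasons.
SEASON = {
    "dec": "Wint", "jan": "Wint", "feb": "Wint",
    "mar": "Spri", "apr": "Spri", "may": "Spri",
    "jun": "Summ", "jul": "Summ", "aug": "Summ",
    "sep": "Fall", "oct": "Fall", "nov": "Fall",
}

# Job groups as described in the text; "blue-collar" is an assumed member
# of the Blue-collar/Services group.
JOB_GROUP = {
    "student": "Unk./Unemp.", "retired": "Unk./Unemp.",
    "unemployed": "Unk./Unemp.", "unknown": "Unk./Unemp.",
    "blue-collar": "B.-Col.-Serv.", "services": "B.-Col.-Serv.",
    "housemaid": "B.-Col.-Serv.", "technician": "B.-Col.-Serv.",
    "admin.": "Bus.", "entrepreneur": "Bus.",
    "management": "Bus.", "self-employed": "Bus.",
}

# Hypothetical record for illustration.
print(SEASON["may"], JOB_GROUP["technician"])
```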

Next, a pairwise comparison was performed: each categorical variable was compared to the output variable y in a mosaic plot. The grouped variables previous and pdays were included in this comparison because they are now categorical. Most of the plots showed that whether the customer subscribed to the deposit depended on the variable, because the proportion of subscribers differed across the plot's categories. The only variable that appeared independent of the outcome was default, so this variable should not be included in subsequent models and algorithms.
Pairwise Comparison
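
What a mosaic plot shows visually can also be checked numerically by comparing the share of subscribers within each category, as sketched below. The function name and the sample records are hypothetical. Roughly equal shares across categories suggest independence from y, while clearly different shares suggest dependence.

```python
from collections import defaultdict


def subscription_rate_by_level(levels, outcomes):
    """Share of 'yes' outcomes within each level of a categorical variable."""
    counts = defaultdict(lambda: [0, 0])  # level -> [n_yes, n_total]
    for level, y in zip(levels, outcomes):
        counts[level][0] += (y == "yes")
        counts[level][1] += 1
    return {level: n_yes / n_total for level, (n_yes, n_total) in counts.items()}


# Hypothetical records for illustration; not taken from the data set.
housing = ["yes", "yes", "no", "no", "no", "yes"]
y = ["no", "no", "yes", "no", "yes", "yes"]
print(subscription_rate_by_level(housing, y))
```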


5 Logistic Predictive Model

To model the newly engineered data set, we used logistic regression because the response variable is binary. The full model includes all of the candidate variables, regardless of their significance. The reduced model contains only the significant variables, defined as those with a p-value less than 0.05; any variable with a larger p-value was excluded. These two models were then used together as bounds for the final model.

Significance tests of logistic regression full model
Term                     Estimate   Std. Error      z value     Pr(>|z|)
(Intercept)            -6.0250432    1.0850496   -5.5527816    0.0000000
mar married            -0.1985343    0.0598173   -3.3190110    0.0009034
mar single              0.0643846    0.0687361    0.9366924    0.3489168
grp_jobBus.             0.1519907    0.0449950    3.3779429    0.0007303
grp_jobUnk./Unemp.      0.5323866    0.0570103    9.3384219    0.0000000
edu secondary           0.2250788    0.0615388    3.6575087    0.0002547
edu tertiary            0.4154247    0.0687663    6.0411049    0.0000000
edu unknown             0.2909910    0.1006219    2.8919258    0.0038289
def yes                -0.2418033    0.2354876   -1.0268193    0.3045055
hous. yes              -0.9128193    0.0430473  -21.2050323    0.0000000
loan yes               -0.5008711    0.0618657   -8.0961013    0.0000000
cont telephone         -0.0136247    0.0743607   -0.1832243    0.8546220
cont unknown           -1.1597910    0.0592143  -19.5863366    0.0000000
grp_monSpri             0.0800941    0.0581875    1.3764838    0.1686719
grp_monSumm            -0.2744039    0.0569068   -4.8219879    0.0000014
grp_monWint            -0.2552634    0.0691594   -3.6909409    0.0002234
pout other              0.2874105    0.0903786    3.1800730    0.0014724
pout success            2.3486974    0.0810564   28.9760971    0.0000000
pout unknown            1.2522823    1.0254055    1.2212557    0.2219892
grp_pre1+               1.4891726    1.0244034    1.4536973    0.1460302
trans_age              -0.1465316    0.0838891   -1.7467291    0.0806843
day                    -0.0057074    0.0022398   -2.5481238    0.0108304
dur_min                 2.1338470    0.0323923   65.8752105    0.0000000
new_bal                 0.0225097    0.0033625    6.6942925    0.0000000
grp_cmpn2              -0.3786628    0.0443444   -8.5391391    0.0000000
grp_cmpn3              -0.3080132    0.0598627   -5.1453307    0.0000003
grp_cmpn4-63           -0.5046849    0.0567881   -8.8871523    0.0000000

Significance tests of logistic regression reduced model
Term                     Estimate   Std. Error      z value     Pr(>|z|)
(Intercept)            -5.4278622    0.1130642   -48.006885    0.0000000
mar married            -0.1824294    0.0585650    -3.114989    0.0018395
mar single              0.1764423    0.0616811     2.860555    0.0042290
grp_jobBus.             0.1805590    0.0442184     4.083343    0.0000444
grp_jobUnk./Unemp.      0.5943619    0.0547011    10.865630    0.0000000
edu secondary           0.2800343    0.0598823     4.676414    0.0000029
edu tertiary            0.4876643    0.0668516     7.294724    0.0000000
edu unknown             0.2744635    0.0989194     2.774617    0.0055267
hous. yes              -0.9228536    0.0384500   -24.001414    0.0000000
loan yes               -0.5037262    0.0608733    -8.274994    0.0000000
pout other              0.2260754    0.0898139     2.517154    0.0118307
pout success            2.2696256    0.0802112    28.295626    0.0000000
pout unknown           -0.6519499    0.0549714   -11.859808    0.0000000
dur_min                 2.0871488    0.0314338    66.398144    0.0000000
new_bal                 0.0242147    0.0032708     7.403219    0.0000000
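
The z values and p-values in the tables above are related by the standard Wald test: z is the estimate divided by its standard error, and the p-value is the two-sided tail area of the standard normal distribution. The sketch below reproduces this for two rows of the full model. The function name `wald_test` is hypothetical, and the 0.05 threshold is the one stated in the text.

```python
import math


def wald_test(estimate, std_error):
    """Return (z, two-sided p-value) for a coefficient under a normal reference."""
    z = estimate / std_error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Rows copied from the full-model table.
for term, estimate, std_error in [
    ("mar married", -0.1985343, 0.0598173),  # table: z = -3.3190110, p = 0.0009034
    ("def yes", -0.2418033, 0.2354876),      # table: z = -1.0268193, p = 0.3045055
]:
    z, p = wald_test(estimate, std_error)
    print(term, round(z, 4), round(p, 7), "keep" if p < 0.05 else "drop")
```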

Using the full and reduced models as bounds, the final model ended up using 13 of the 16 predictor variables available. It excluded default (whether the customer has credit in default), previous (how many times the customer was contacted before this campaign), and pdays (how many days had passed since the customer was last contacted in a previous campaign). These variables did not significantly improve the model's prediction of whether the customer subscribed to the term deposit.

Summary table of significance tests for the final model
Term                     Estimate   Std. Error      z value     Pr(>|z|)
(Intercept)            -4.5392208    0.3516724  -12.9075249    0.0000000
mar married            -0.1962908    0.0597952   -3.2827176    0.0010281
mar single              0.0660461    0.0687212    0.9610730    0.3365155
grp_jobBus.             0.1520453    0.0449911    3.3794556    0.0007263
grp_jobUnk./Unemp.      0.5341695    0.0569901    9.3730185    0.0000000
edu secondary           0.2256504    0.0615350    3.6670244    0.0002454
edu tertiary            0.4165539    0.0687592    6.0581558    0.0000000
edu unknown             0.2916336    0.1006156    2.8984939    0.0037496
hous. yes              -0.9117506    0.0430296  -21.1889126    0.0000000
loan yes               -0.5027728    0.0617813   -8.1379507    0.0000000
cont telephone         -0.0134104    0.0743609   -0.1803422    0.8568839
cont unknown           -1.1602671    0.0592103  -19.5956967    0.0000000
grp_monSpri             0.0797174    0.0581543    1.3707895    0.1704406
grp_monSumm            -0.2754371    0.0568831   -4.8421587    0.0000013
grp_monWint            -0.2559840    0.0691383   -3.7024937    0.0002135
pout other              0.2868420    0.0903877    3.1734613    0.0015063
pout success            2.3493013    0.0810550   28.9840362    0.0000000
pout unknown           -0.2369468    0.0576687   -4.1087592    0.0000398
trans_age              -0.1471512    0.0838811   -1.7542844    0.0793818
day                    -0.0057369    0.0022398   -2.5613819    0.0104257
dur_min                 2.1337476    0.0323876   65.8815797    0.0000000
new_bal                 0.0227796    0.0033506    6.7985578    0.0000000
grp_cmpn2              -0.3786729    0.0443409   -8.5400410    0.0000000
grp_cmpn3              -0.3072668    0.0598581   -5.1332530    0.0000003
grp_cmpn4-63           -0.5041658    0.0567725   -8.8804611    0.0000000

Variables with a negative coefficient lower the customer's chance of subscribing to the term deposit when the corresponding category applies, relative to that variable's reference level. For example, the coefficient for married customers is negative, indicating that a married customer is less likely to subscribe than a divorced customer (the reference category). The same holds for customers with housing loans, those with personal loans, those contacted by telephone or by unknown means, those last contacted in the winter or summer, older customers, those contacted multiple times during this campaign, those contacted later in the month, and those with an unknown outcome for the previous campaign. Note that the coefficients for telephone contact (p = 0.857) and age (p = 0.079) are not significant at the 0.05 level, so the evidence for these two effects is weak.

Variables with a positive coefficient raise the customer's chance of subscribing to the term deposit when the corresponding category applies. This includes customers who are single, those in the Business or Unknown/Unemployed job groups, those with secondary, tertiary, or unknown education, those contacted in the spring, those whose previous campaign outcome was success or other, those with a higher average yearly balance, and those who spent longer in conversation when contacted. For example, customers contacted in the spring during this campaign were more likely to subscribe to the term deposit than those contacted in the reference season, fall. The coefficients for single customers (p = 0.337) and spring contact (p = 0.170) are not significant at the 0.05 level, so these two effects should be read with caution.
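
A common way to read the size of these effects is to exponentiate a coefficient, which gives the multiplicative change in the odds of subscribing relative to the reference level, holding the other variables fixed. The sketch below applies this to two rows of the final model; the function name `odds_ratio` is hypothetical.

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit or one-level change."""
    return math.exp(coefficient)


# Coefficients copied from the final-model table.
for term, beta in [("hous. yes", -0.9117506), ("pout success", 2.3493013)]:
    print(term, round(odds_ratio(beta), 3))
```

An odds ratio below 1 corresponds to a negative coefficient and lower odds of subscribing; an odds ratio above 1 corresponds to a positive coefficient and higher odds.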

Predicted Value of Response Variable with the Given Cut-Off Probability
trans_age  mar      grp_job      edu       dur_min  new_bal  loan  Pred.Response
3          married  Bus.         primary   2        100      no    0
4          single   Unk./Unemp.  tertiary  4        40       yes   1

We can now use this model to predict the outcome for customers with specific characteristics. In the first case, the model predicted that a customer who was married, worked in business, was in their twenties, had a primary education, had a new_bal value of 100, had no personal loan, and had a conversation lasting between 5 and 10 minutes would not subscribe to the term deposit in this campaign. In the second case, the model predicted that a customer who was single, had an unknown job or was unemployed, was in their fifties, had a tertiary education, had a new_bal value of 40, currently had a personal loan, and had a conversation lasting about 50 minutes would subscribe to the term deposit in this campaign. The ages and durations here are read back from the log scale of trans_age and dur_min. The new_bal values are likewise on a transformed scale: if new_bal is the cube root of balance described in Section 3, then 100 and 40 correspond to average yearly balances of 1,000,000 and 64,000 euros, respectively.
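
The readings of the two predicted cases can be sketched as follows. This assumes trans_age and dur_min are natural logs and new_bal is a cube root, as the transformations in Section 3 suggest. The 0.5 cut-off is hypothetical, since the cut-off probability used for the table is not stated, and the function names are illustrative only.

```python
import math


def sigmoid(linear_predictor):
    """Logistic function: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def classify(probability, cutoff=0.5):
    """Return 1 (subscribes) when the probability reaches the cut-off, else 0."""
    return 1 if probability >= cutoff else 0


# Back-transform the two rows of the prediction table (assumed scales).
for trans_age, dur_min, new_bal in [(3, 2, 100), (4, 4, 40)]:
    age_years = math.exp(trans_age)  # inverse of the log transform
    minutes = math.exp(dur_min)      # inverse of the log transform
    balance = new_bal ** 3           # inverse of the cube root
    print(round(age_years, 1), round(minutes, 1), balance)

# Hypothetical linear predictor, only to show how the cut-off is applied.
print(classify(sigmoid(-2.0)))
```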