1 Bank Direct Marketing Data Set Description

The data used for this study comes from direct marketing campaigns of a Portuguese banking institution. These marketing campaigns were based on phone calls, and often more than one contact with the same client was required to assess whether the client would subscribe to the term deposit. The data is ordered by date, from May 2008 to November 2010. The data was found at the UC Irvine Machine Learning Repository.

The overall goal of this study is to predict if a client will subscribe to a term deposit after direct marketing campaigns of a Portuguese banking institution.

There are 45,211 client records in this data set. The data set consists of 17 variables, including the response variable named ‘y’. A detailed description of the predictor and outcome variables is given below:

1 - age (numeric)

2 - job : Job type (categorical): “admin.”, “unknown”, “unemployed”, “management”, “housemaid”, “entrepreneur”, “student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”

3 - marital : Marital status (categorical): “married”, “divorced”, “single” note: “divorced” means divorced or widowed

4 - education (categorical): “unknown”,“secondary”,“primary”,“tertiary”

5 - default: Does the client have credit in default? (binary: “yes”,“no”)

6 - balance: Average yearly balance (numeric, in euros)

7 - housing: Does the client have a housing loan? (binary: “yes”,“no”)

8 - loan: Does the client have a personal loan? (binary: “yes”,“no”)

9 - contact: Contact communication type (categorical): “unknown”,“telephone”,“cellular”

10 - day: Last contact day of the month (numeric, discrete)

11 - month: Last contact month of year (categorical): “jan”, “feb”, “mar”, “apr”, “may”, “jun”, “jul”, “aug”, “sep”, “oct”, “nov”, “dec”

12 - duration: Last contact duration (numeric, in seconds)

13 - campaign: The number of contacts performed during this campaign and for this client (numeric, discrete)

14 - pdays: The number of days after the client was last contacted from a previous campaign (numeric, discrete) note: -1 means client was not previously contacted

15 - previous: The number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: The outcome of the previous marketing campaign (categorical): “unknown”, “other”, “failure”, “success”

17 - y (outcome response variable): Has the client subscribed a term deposit? (binary: “yes”,“no”)

A copy of this publicly available data is stored at: https://archive.ics.uci.edu/dataset/222/bank+marketing

# Loading in the data set
BankMarketing = read.csv("https://pengdsci.github.io/datasets/BankMarketing/BankMarketingCSV.csv")[, -1]

2 Exploratory Data Analysis for Feature Engineering

Exploratory data analysis (EDA) for feature engineering will be done to look at the distribution of variables and observe patterns. Changes will be made to the variables based on the results, and these revised variables will be used for future modeling.

First, the entire data set will be scanned to determine the EDA tools to use for feature engineering. Then, if there are missing values, they will be imputed. Afterwards, if numeric or categorical variables are skewed, they will be discretized, meaning their values are split into new groups or categories. These new variables will be used in future modeling instead of the original variables. A final data set will then be created using these transformed variables.

Finally, with this fixed data set, linear association and correlation between numeric variables, as well as dependency on the response variable for categorical variables, will be investigated.

Let’s begin by looking at a few descriptive statistics for every variable in the data set.

#Summarized descriptive statistics for all variables in the data set
summary(BankMarketing)
##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##       y            
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

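The summary table above shows that several numeric variables are right-skewed and contain extreme values. For balance, the mean (1362) is about three times the median (448) and the maximum is 102127. For duration, the median is 180 seconds but the maximum is 4918. The maxima of campaign, pdays, and previous (63, 871, and 275) lie far above their third quartiles (3, -1, and 0).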

2.1 Missing Values of the Data Set

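There appear to be no missing values in this data set: the summary above reports no NA counts for the numeric variables. Therefore, there is no need to impute or delete records. Note, however, that several categorical variables (job, education, contact, poutcome) use an “unknown” category, which acts as a placeholder rather than as an NA.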
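The check described above can be sketched in base R. This is a minimal illustration on a made-up data frame (`toy` and its values are assumptions, not the bank data); it counts true NA values per column and, separately, the "unknown" placeholders that `is.na()` would not detect.

```r
# Toy stand-in for the real data: "unknown" is a placeholder category, not NA
toy <- data.frame(
  age       = c(30, 45, NA, 52),
  education = c("primary", "unknown", "tertiary", "unknown"),
  contact   = c("cellular", "unknown", "telephone", "cellular"),
  stringsAsFactors = FALSE
)

# Count true missing values (NA) in every column
na.counts <- colSums(is.na(toy))

# Count "unknown" placeholders in the character columns,
# ignoring any leading or trailing whitespace in the raw values
is.char <- vapply(toy, is.character, logical(1))
unknown.counts <- vapply(toy[is.char],
                         function(x) sum(trimws(x) == "unknown"),
                         integer(1))

print(na.counts)
print(unknown.counts)
```

On the real data the same two calls would be applied to BankMarketing in place of `toy`.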

2.2 Assessing Distributions of the Variables

Now, we will look at possibly discretizing the numeric variables, both continuous and discrete, and existing categorical variables of the study.

2.2.1 Discretizing Continuous Variables

To deal with the outliers and skewness of certain numeric variables, such as the duration of the last contact shown in the histogram below, discretization will be used to divide the values into groups. The duration variable is discretized because of its many high outliers, which produce strong right skew. Three groups are created: 0-180, 181-319, and 320+ seconds. The cut points match the median (180) and the third quartile (319) in the summary table, so the groups hold roughly 50%, 25%, and 25% of the client records. The grouped variable will be used in future models.

# histogram showing the distribution of the duration variable
hist(BankMarketing$duration, xlab = "Duration", ylab = "count", main = "Durations of Last Contact")

# New grouping variable for duration
BankMarketing$grp.duration <- ifelse(BankMarketing$duration <= 180, '0-180',
               ifelse(BankMarketing$duration >= 320, '320+', '[181, 319]'))
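As an alternative to nested ifelse() calls, the same three groups can be built with base R's cut(). The sketch below uses simulated durations (the `duration` vector and its exponential distribution are assumptions for illustration only) and keeps the cut points and labels used above.

```r
# Simulated right-skewed call durations in seconds (illustrative only)
set.seed(2024)
duration <- round(rexp(1000, rate = 1 / 250))

# Same cut points as the nested ifelse(): <= 180, 181-319, >= 320.
# cut() uses right-closed intervals by default: (-Inf,180], (180,319], (319,Inf)
grp <- cut(duration,
           breaks = c(-Inf, 180, 319, Inf),
           labels = c("0-180", "[181, 319]", "320+"))

# Group sizes and shares
counts <- table(grp)
shares <- round(prop.table(counts), 3)
print(counts)
print(shares)
```

One design difference: cut() returns a factor whose levels are already in increasing order, whereas ifelse() returns a character vector.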

2.2.2 Grouping Categories for Discrete Numeric Variables

Now, let’s look at bar plots of three discrete numeric variables (campaign, pdays, and previous) and then discretize them.

# barplot showing the distribution of the campaign variable
marketcampaigns = table(BankMarketing$campaign)
barplot(marketcampaigns, main = "Distribution of Contacts Performed During Campaign", xlab = "Number of Contacts")

# barplot showing the distribution of the pdays variable
dayspassed = table(BankMarketing$pdays)
barplot(dayspassed, main = "Distribution of Days Passed After Client Last Contacted From Previous Campaign", xlab = "Number of Days")

# barplot showing the distribution of the previous variable
prev = table(BankMarketing$previous)
barplot(prev, main = "Distribution of Number of Contacts Performed Before This Campaign and for This Client", xlab = "Number of Contacts")

Overall, the bar plots are greatly skewed and/or weighted for certain values, so category groups should be made for each variable.

For campaign, the value of 1 contact should be its own group since it has the highest frequency of observations. The values of 2 and 3 contacts combined have a similar frequency, so they form a second group. The remaining observations, with 4 or more contacts, form a third group since together they add up to a roughly similar frequency.

For pdays, the value -1 indicates that a client was not previously contacted. Because of this, and because it makes up most of the observations, it is its own group. The rest of the observations are split into groups of 1-199 days and 200 days or more. The value of 200 seemed a reasonable splitting point given the distribution in the bar plot.

The previous variable is also split into three groups. The value of 0 contacts is one group since it has the most observations. The values of 1 to 3 contacts form another group since together they make up a fair share of the observations. The same goes for observations with 4 or more contacts.

These grouped variables will be used in subsequent modeling. The categories for each variable are as follows:

campaign: 1, 2-3, 4+

pdays: -1, 1-199, 200+

previous: 0, 1-3, 4+

# New grouping variable for campaign
BankMarketing$grp.campaign <- ifelse(BankMarketing$campaign <= 1, '1',
               ifelse(BankMarketing$campaign >= 4, '4+', '[2, 3]'))

# New grouping variable for pdays
BankMarketing$grp.pdays <- ifelse(BankMarketing$pdays <= -1, 'Client Not Previously Contacted', ifelse(BankMarketing$pdays >= 200, '200+', '[1, 199]'))

# New grouping variable for previous
# (>= 4 so that exactly 4 contacts fall in '4+', matching the groups 0, 1-3, 4+)
BankMarketing$grp.previous <- ifelse(BankMarketing$previous <= 0, '0',
               ifelse(BankMarketing$previous >= 4, '4+', '[1,3]'))
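Boundary values are easy to misplace in nested ifelse() calls (for example, `> 4` versus `>= 4`). A quick way to confirm that each raw value lands in the intended group is to cross-tabulate the raw variable against the grouped one. The sketch below does this on a small made-up vector (`previous` here is illustrative, not the bank data), using the intended groups 0, 1-3, and 4+.

```r
# Small illustrative vector that includes the boundary values 0, 1, 3, and 4
previous <- c(0, 0, 1, 2, 3, 4, 5, 10)

# Intended grouping: 0, 1-3, 4+
grp.previous <- ifelse(previous <= 0, "0",
                ifelse(previous >= 4, "4+", "[1,3]"))

# Cross-tabulation: each raw value should appear in exactly one group column
check <- table(raw = previous, group = grp.previous)
print(check)

# Smallest and largest raw value that fell into each group
ranges <- tapply(previous, grp.previous, range)
print(ranges)
```

The same two checks can be run on the campaign and pdays groupings.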

2.2.3 Regrouping Categorical Variables

The bar plot for the month variable also shows that the distribution of this variable is skewed in favor of warmer seasons like spring (specifically may) and summer (jun, jul, aug). As such, this categorical variable should be re-categorized by season (winter, spring, summer, fall).

The job variable also has sparse categories that may affect the results of subsequent modeling. Therefore, it may be beneficial to group them in a more meaningful way to make a more powerful feature variable. They are now split between four new categories, depending on the type of job:

not working (or does not currently have a job) = unemployed, unknown, retired, student

workers (standard jobs/blue-collar workers) = blue-collar, housemaid

bosses (running own company) = entrepreneur, management, self-employed

white-collar (white-collar type jobs) = services, admin., technician

Both of these regrouped variables will be used for modeling.

# barplot showing the distribution of the month variable
durationmonth = table(BankMarketing$month)
barplot(durationmonth, main = "Distribution of Month of Last Contact", xlab = "Number of Contacts by Month")

# barplot showing the distribution of the job variable
jobcategory = table(BankMarketing$job)
barplot(jobcategory, main = "Distribution of Job Type", xlab = "Number of Clients in Each Job")

# New grouping variable for month
BankMarketing$grp.month = ifelse(BankMarketing$month == " jan", "winter", ifelse(BankMarketing$month == " feb", "winter", ifelse(BankMarketing$month == " mar", "spring", ifelse(BankMarketing$month == " apr", "spring", ifelse(BankMarketing$month == " may", "spring", ifelse(BankMarketing$month == " jun", "summer", ifelse(BankMarketing$month == " jul", "summer", ifelse(BankMarketing$month == " aug", "summer", ifelse(BankMarketing$month == " sep", "fall", ifelse(BankMarketing$month == " oct", "fall", ifelse(BankMarketing$month == " nov", "fall", "winter")))))))))))

# New grouping variable for job
BankMarketing$grp.job = ifelse(BankMarketing$job == " unknown", "not working", ifelse(BankMarketing$job == " unemployed", "not working", ifelse(BankMarketing$job == " retired", "not working", ifelse(BankMarketing$job == " blue-collar", "workers", ifelse(BankMarketing$job == " entrepreneur", "bosses", ifelse(BankMarketing$job == " housemaid", "workers", ifelse(BankMarketing$job == " management", "bosses", ifelse(BankMarketing$job == " self-employed", "bosses", ifelse(BankMarketing$job == " services", "white-collar", ifelse(BankMarketing$job == " technician", "white-collar", ifelse(BankMarketing$job == " student", "not working", "white-collar")))))))))))
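Long chains of nested ifelse() calls can also be written as a named lookup vector. The sketch below shows the idea for the month-to-season mapping on a made-up vector (`month` and `season.of` are illustrative names). It assumes, as the comparisons above suggest, that raw values may carry leading spaces, and removes them with trimws() before the lookup.

```r
# Illustrative raw values, some with a leading space
month <- c(" may", " jul", " nov", " dec", " feb", "apr")

# Named lookup vector: month abbreviation -> season
season.of <- c(jan = "winter", feb = "winter", dec = "winter",
               mar = "spring", apr = "spring", may = "spring",
               jun = "summer", jul = "summer", aug = "summer",
               sep = "fall",   oct = "fall",   nov = "fall")

# Trim whitespace, then index the lookup vector by name
grp.month <- unname(season.of[trimws(month)])
print(grp.month)
```

Unlike the nested ifelse(), whose last branch sends every unmatched value to "winter", the lookup returns NA for a value it does not recognize, which makes misspelled or unexpected categories visible.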

2.3 Assembling the New Data Set

Now that the variables have been discretized, those newly discretized variables will be kept for use in subsequent modeling instead of the original versions.

# Assembling the discretized variables and other variables to make the modeling data set
var.names = c("age", "balance", "day", "grp.job", "marital", "education", "default", "housing", "loan", "contact", "grp.month", "grp.duration", "grp.campaign", "grp.pdays", "grp.previous", "poutcome", "y") 
BankMarketingCampaign = BankMarketing[, var.names]

2.4 Pairwise Associations

It is time to look at the association between the numeric variables and at the dependency of the response on the categorical variables.

2.4.1 Correlation of Numerical Variables

A pair-wise scatter plot is used for assessing pairwise linear association between two numeric variables at a time.

# Pair-wise scatter plot for numeric variables
library(ggplot2)  # provides aes()
library(GGally)   # provides ggpairs()
ggpairs(BankMarketingCampaign,  # Data frame
        columns = 1:3,  # Columns: age, balance, day
        aes(color = y,  # Color by response category
            alpha = 0.5))

The off-diagonal plots and correlation coefficients indicate that the pairwise linear associations are weak, contrary to expectation. None of the numeric variables appears to be strongly correlated with another.

The stacked density curves for balance show that its distributions in the yes and no response categories are essentially identical. This suggests that balance may not be associated with the response variable, so it is a candidate for removal from the modeling data set. For the other variables, the curves overlap mostly but not completely, which suggests some association, though a weak one, between each of these variables and the response (y).

There is almost no correlation between day and the other variables. The correlation between age and balance is slightly higher, but it is still very weak.
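The correlations read off the pairwise plot can also be computed directly with base R's cor(). The sketch below runs on simulated columns (`toy` and its distributions are assumptions, not the bank data). It computes both Pearson and Spearman correlations; the rank-based Spearman version is less sensitive to extreme values such as those seen in balance.

```r
# Simulated stand-ins for age, balance, and day (illustrative only)
set.seed(1)
toy <- data.frame(
  age     = round(runif(500, 18, 95)),
  balance = round(rnorm(500, mean = 1362, sd = 3000)),
  day     = sample(1:31, 500, replace = TRUE)
)

# Pearson measures linear association; Spearman measures monotone association
cor.pearson  <- cor(toy, method = "pearson")
cor.spearman <- cor(toy, method = "spearman")

print(round(cor.pearson, 3))
print(round(cor.spearman, 3))
```

On the real data, the call would be applied to the age, balance, and day columns of BankMarketingCampaign.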

2.4.2 Dependency of Categorical Variables

These mosaic plots help show whether a client subscribing to a term deposit is independent of each categorical variable. Variables that appear independent of the response are candidates for exclusion from future models.

# Mosaic plots to show categorical variable dependency to the response.
par(mfrow = c(2,2))
mosaicplot(grp.job ~ y, data=BankMarketingCampaign,col=c("Blue","Red"), main="job vs term deposit subscription")
mosaicplot(marital ~ y, data=BankMarketingCampaign,col=c("Blue","Red"), main="marital vs term deposit subscription")
mosaicplot(education ~ y, data=BankMarketingCampaign,col=c("Blue","Red"), main="education vs term deposit subscription")
mosaicplot(poutcome ~ y, data=BankMarketingCampaign,col=c("Blue","Red"), main="poutcome vs term deposit subscription")

# Mosaic plots to show categorical variable dependency to the response.
par(mfrow = c(2,2))
mosaicplot(housing ~ y, data=BankMarketingCampaign,col=c("Blue","Red"), main="housing vs term deposit subscription")
mosaicplot(loan ~ y, data=BankMarketingCampaign,col=c("Blue","Red"), main="loan vs term deposit subscription")
mosaicplot(contact ~ y, data=BankMarketingCampaign,col=c("Blue","Red"), main="contact vs term deposit subscription")
mosaicplot(grp.month ~ y, data=BankMarketingCampaign,col=c("Blue","Red"), main="month vs term deposit subscription")

# Mosaic plots to show categorical variable dependency to the response.
par(mfrow = c(2,2))
mosaicplot(grp.duration ~ y, data=BankMarketingCampaign,col=c("Blue","Red"), main="duration vs term deposit subscription")
mosaicplot(grp.campaign ~ y, data=BankMarketingCampaign,col=c("Blue","Red"), main="campaign vs term deposit subscription")
mosaicplot(grp.pdays ~ y, data=BankMarketingCampaign,col=c("Blue","Red"), main="pdays vs term deposit subscription")
mosaicplot(grp.previous ~ y, data=BankMarketingCampaign,col=c("Blue","Red"), main="previous vs term deposit subscription")

The mosaic plots for contact, pdays, and education show that the subscription rate differs across contact communication types, client education levels, and the number of days since the client was last contacted in a previous campaign. Most of the mosaic plots show that whether the client subscribed to a term deposit is not independent of these variables, because the proportion of subscription cases is not identical across the categories of each variable.
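A visual reading of a mosaic plot can be complemented by row proportions and a chi-squared test of independence, both available in base R. The sketch below uses simulated data (the variable names mirror the report, but the values and the subscription rates 0.08 and 0.17 are made up for illustration).

```r
# Simulated housing-loan indicator and response (illustrative rates only)
set.seed(42)
n <- 2000
housing <- sample(c("no", "yes"), n, replace = TRUE)
p.yes   <- ifelse(housing == "yes", 0.08, 0.17)   # hypothetical subscription rates
y       <- ifelse(runif(n) < p.yes, "yes", "no")

# Contingency table and the share of yes/no within each housing category
tab        <- table(housing, y)
row.shares <- prop.table(tab, margin = 1)
print(round(row.shares, 3))

# Pearson chi-squared test of independence between housing and y
test <- chisq.test(tab)
print(test)
```

If the row shares are close to identical across categories and the test does not reject independence, the variable is a candidate for exclusion. With a sample as large as 45,211 records, even small differences can be statistically significant, so the size of the difference matters as well as the p-value.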

# Mosaic plot to show default dependency to the response.
mosaicplot(default ~ y, data=BankMarketingCampaign,col=c("Blue","Red"), main="default vs term deposit subscription")

table(BankMarketing$default)
## 
##    no   yes 
## 44396   815

It should be noted that the mosaic plot for default shows that the yes category is extremely small compared to the no category. Only 815 (1.8%) of the 45,211 clients had credit in default. Because so few clients fall in this category, and fewer still are subscribers, the parameter estimates for this variable could be unstable. Therefore, it may be better to exclude it from subsequent modeling.

3 Predictive Modeling with Logistic Regression

In this section, the revised data set created through the EDA above will be used to fit several logistic regression models. A final model will be chosen from among them and used to calculate the probability that a client subscribes to a term deposit after the direct marketing campaigns. The variable y, which records whether a client has subscribed to a term deposit, is the binary response variable in all of the models. The remaining variables of the revised data set, including the new discretized variables, act as the predictor variables.
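For reference, the standard form of a logistic regression model can be written as follows. The symbols are generic notation rather than names taken from the report: p is the probability that y equals 1 (yes), the x's are the predictors (dummy variables for categorical levels), and the betas are the coefficients to be estimated.

```latex
\[
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad
p = P(y = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
\]
```

A positive coefficient raises the log-odds of subscription for that predictor or level, and a negative coefficient lowers them, relative to the reference level.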

3.1 Methodology for Modeling

In order to perform proper modeling, some categorical and binary variables had to be changed, including the response (y) and the ones changed in the EDA, to have numerical labels, thereby making them easier to use for modeling.

The first logistic regression model is an initial full model that contains all predictor variables of the data set. The p-values of this model identify the variables that are insignificant at the 0.05 level. Dropping them, while keeping the variables that are statistically significant or otherwise important, gives a reduced model. Automatic variable selection is then used to find a third and final model that lies between the full and reduced models. The predictors' association with the response and their predictive power will then be examined.

Finally, the final model is used to compute the predicted probability of the response. Given values of the predictor variables, it returns a prediction (Yes or No) of whether the client subscribes to a term deposit.

3.2 Turning Text Categorical and Binary Variables into Discrete Numerical Variables for Modeling

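The values of the response variable y (yes/no), along with certain binary and categorical variables, are relabeled with numeric codes here. R's glm() also accepts factors with text labels, so this step is not strictly required. The numeric labels keep the coefficient names short, and the level labeled 0 serves as the reference category for each variable. The labels are as follows: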

y (response): 0=no, 1=yes

grp.job: 0=not working, 1=workers, 2=bosses, 3=white-collar

marital: 0=divorced, 1=single, 2=married

education: 0=unknown, 1=primary, 2=secondary, 3=tertiary

housing: 0=no, 1=yes

loan: 0=no, 1=yes

contact: 0=unknown, 1=telephone, 2=cellular

grp.month: 0=winter, 1=spring, 2=summer, 3=fall

grp.duration: 0=(0-180), 1=(181-319), 2=320+

grp.campaign: 0=1, 1=(2-3), 2=4+

grp.pdays: 0=Client Not Previously Contacted, 1=(1-199), 2=200+

grp.previous: 0=0, 1=(1-3), 2=4+

poutcome: 0=unknown, 1=success, 2=failure, 3=other

# Create numerical value labels for categorical variables
BankMarketingCampaign$y <- factor(BankMarketingCampaign$y, levels = c(" no", " yes"), labels = c("0", "1"))

BankMarketingCampaign$grp.job <- factor(BankMarketingCampaign$grp.job, levels = c("not working", "workers", "bosses", "white-collar"), labels = c("0", "1", "2", "3"))

BankMarketingCampaign$marital <- factor(BankMarketingCampaign$marital, levels = c(" divorced", " single", " married"), labels = c("0", "1", "2"))

BankMarketingCampaign$education <- factor(BankMarketingCampaign$education, levels = c(" unknown", " primary", " secondary", " tertiary"), labels = c("0", "1", "2", "3"))
  
BankMarketingCampaign$housing <- factor(BankMarketingCampaign$housing, levels = c(" no", " yes"), labels = c("0", "1"))
  
BankMarketingCampaign$loan <- factor(BankMarketingCampaign$loan, levels = c(" no", " yes"), labels = c("0", "1"))

BankMarketingCampaign$contact <- factor(BankMarketingCampaign$contact, levels = c(" unknown", " telephone", " cellular"), labels = c("0", "1", "2"))

BankMarketingCampaign$grp.month <- factor(BankMarketingCampaign$grp.month, levels = c("winter", "spring", "summer", "fall"), labels = c("0", "1", "2", "3"))

BankMarketingCampaign$grp.duration <- factor(BankMarketingCampaign$grp.duration, levels = c("0-180", "[181, 319]", "320+"), labels = c("0", "1", "2"))
  
BankMarketingCampaign$grp.campaign <- factor(BankMarketingCampaign$grp.campaign, levels = c("1", "[2, 3]", "4+"), labels = c("0", "1", "2"))
  
BankMarketingCampaign$grp.pdays <- factor(BankMarketingCampaign$grp.pdays, levels = c("Client Not Previously Contacted", "[1, 199]", "200+"), labels = c("0", "1", "2"))
  
BankMarketingCampaign$grp.previous <- factor(BankMarketingCampaign$grp.previous, levels = c("0", "[1,3]", "4+"), labels = c("0", "1", "2"))
  
BankMarketingCampaign$poutcome <- factor(BankMarketingCampaign$poutcome, levels = c(" unknown", " success", " failure", "  other"), labels = c("0", "1", "2", "3"))
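One pitfall of factor(levels = ..., labels = ...) is worth checking: any value that does not match a listed level exactly, including its whitespace, silently becomes NA. The sketch below illustrates this on a made-up vector. The exact whitespace in the raw file is not known from this report, so the `poutcome` values here are an assumption.

```r
# Illustrative raw values with a single leading space
poutcome <- c(" unknown", " success", " failure", " other", " other")

# The last level is written with two leading spaces, so " other" matches nothing
strict <- factor(poutcome,
                 levels = c(" unknown", " success", " failure", "  other"),
                 labels = c("0", "1", "2", "3"))
n.unmatched <- sum(is.na(strict))   # values silently turned into NA
print(n.unmatched)

# Trimming whitespace first makes the match independent of stray spaces
clean <- factor(trimws(poutcome),
                levels = c("unknown", "success", "failure", "other"),
                labels = c("0", "1", "2", "3"))
print(table(clean, useNA = "ifany"))
```

Running sum(is.na(...)) on each relabeled column of BankMarketingCampaign is a cheap way to confirm that no records were lost in the conversion, since glm() drops rows with NA values by default.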

3.3 Model Building

3.3.1 Building the Initial Full Model

The full model containing all predictor variables of the data set is built first, with y (whether or not a client has subscribed to a term deposit) as the response. The variables balance and default are left out: the EDA suggested that balance is not associated with the response, and that the sparse yes category of default could destabilize the parameter estimates.

# Create the initial full model
library(knitr)  # provides kable()
initial.model = glm(y ~ age + day + grp.job + marital + education + housing + loan + contact + grp.month + grp.duration + grp.campaign + grp.pdays + grp.previous + poutcome, family = binomial, data = BankMarketingCampaign)
coefficient.table = summary(initial.model)$coef
kable(coefficient.table, caption = "Significance tests of logistic regression model")
Significance tests of logistic regression model

Term            Estimate    Std. Error  z value      Pr(>|z|)
(Intercept)     -3.6932401  0.1685643   -21.9099805  0.0000000
age             0.0030780   0.0018848   1.6330474    0.1024590
day             -0.0083344  0.0021761   -3.8299334   0.0001282
grp.job1        -0.5019278  0.0660231   -7.6023072   0.0000000
grp.job2        -0.4214571  0.0639024   -6.5953217   0.0000000
grp.job3        -0.3396858  0.0583846   -5.8180669   0.0000000
marital1        0.1430139   0.0645777   2.2146005    0.0267875
marital2        -0.1955606  0.0568222   -3.4416237   0.0005782
education1      -0.2127588  0.0992433   -2.1438106   0.0320481
education2      -0.0846006  0.0879245   -0.9621955   0.3359514
education3      0.1931150   0.0918617   2.1022355    0.0355327
housing1        -0.7395607  0.0415171   -17.8134037  0.0000000
loan1           -0.4417726  0.0559484   -7.8960718   0.0000000
contact1        1.0848603   0.0865852   12.5293956   0.0000000
contact2        1.1034597   0.0544136   20.2791034   0.0000000
grp.month1      0.3174489   0.0637672   4.9782486    0.0000006
grp.month2      -0.0643480  0.0604185   -1.0650376   0.2868589
grp.month3      0.2619248   0.0690091   3.7955101    0.0001473
grp.duration1   1.2500054   0.0542134   23.0571388   0.0000000
grp.duration2   2.7423695   0.0488318   56.1594983   0.0000000
grp.campaign1   -0.2914368  0.0388654   -7.4986213   0.0000000
grp.campaign2   -0.4295598  0.0541171   -7.9375980   0.0000000
grp.pdays1      1.7918868   1.0029024   1.7867011    0.0739858
grp.pdays2      1.3994078   1.0044816   1.3931642    0.1635701
grp.previous1   -0.2227561  0.0986341   -2.2584090   0.0239202
poutcome1       1.0496216   1.0032678   1.0462028    0.2954674
poutcome2       -1.2746816  1.0026094   -1.2713641   0.2035991

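Several p-values in the significance test table above exceed the 0.05 level: age, education2, grp.month2, both grp.pdays levels, and both poutcome levels shown. The remaining terms are significant at the 0.05 level.

The table also has no rows for grp.previous2 or poutcome3. R leaves out coefficients that it could not estimate, which is consistent with the rank-deficiency warning raised at the prediction step in Section 3.4. One plausible contributor is that "not previously contacted" is encoded by grp.pdays, grp.previous, and poutcome at the same time. The standard errors of the grp.pdays and poutcome terms (about 1.0) are also far larger than those of the other terms.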
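Coefficients that glm() cannot estimate are returned as NA by coef() and are left out of the summary table. The sketch below reproduces the situation on simulated data (all names and values here are illustrative assumptions): two factors both use level "0" for "not previously contacted", so their dummy columns are linearly dependent and one coefficient is aliased.

```r
set.seed(7)
n <- 300
contacted <- rbinom(n, 1, 0.3)   # 1 = previously contacted (simulated)

# Two factors that both encode "not previously contacted" as level "0"
grp.pdays    <- factor(ifelse(contacted == 0, "0",
                              sample(c("1", "2"), n, replace = TRUE)))
grp.previous <- factor(ifelse(contacted == 0, "0",
                              sample(c("1", "2"), n, replace = TRUE)))
y <- rbinom(n, 1, plogis(-1 + 0.8 * contacted))

fit <- glm(y ~ grp.pdays + grp.previous, family = binomial)

# Aliased (non-estimable) coefficients show up as NA in coef()
aliased <- names(which(is.na(coef(fit))))
print(aliased)

# summary() silently omits them from its coefficient table
print(rownames(summary(fit)$coefficients))
```

Applying `names(which(is.na(coef(initial.model))))` to the fitted model in this report would list the terms that were dropped for this reason.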

3.3.2 Building the Reduced and Final Models with Automatic Variable Selection

Some of the insignificant predictor variables will now be dropped, using automatic variable selection, in finding the reduced and final models. The final best model will be a model that is between the full and reduced models.

# Creating the reduced and final models
full.model = initial.model  # the *biggest model* that includes all predictor variables
reduced.model = glm(y ~ day + grp.job + marital + housing + loan + contact + grp.duration + grp.campaign + grp.previous, family = binomial, data = BankMarketingCampaign)
final.model =  step(full.model, 
                    scope=list(lower=formula(reduced.model),upper=formula(full.model)),
                    data = BankMarketingCampaign, 
                    direction = "backward",
                    trace = 0)   # trace = 0: suppress the detailed selection process
final.model.coef = summary(final.model)$coef
kable(final.model.coef, caption = "Summary table of significance tests")
Summary table of significance tests

Term            Estimate    Std. Error  z value      Pr(>|z|)
(Intercept)     -3.6932401  0.1685643   -21.9099805  0.0000000
age             0.0030780   0.0018848   1.6330474    0.1024590
day             -0.0083344  0.0021761   -3.8299334   0.0001282
grp.job1        -0.5019278  0.0660231   -7.6023072   0.0000000
grp.job2        -0.4214571  0.0639024   -6.5953217   0.0000000
grp.job3        -0.3396858  0.0583846   -5.8180669   0.0000000
marital1        0.1430139   0.0645777   2.2146005    0.0267875
marital2        -0.1955606  0.0568222   -3.4416237   0.0005782
education1      -0.2127588  0.0992433   -2.1438106   0.0320481
education2      -0.0846006  0.0879245   -0.9621955   0.3359514
education3      0.1931150   0.0918617   2.1022355    0.0355327
housing1        -0.7395607  0.0415171   -17.8134037  0.0000000
loan1           -0.4417726  0.0559484   -7.8960718   0.0000000
contact1        1.0848603   0.0865852   12.5293956   0.0000000
contact2        1.1034597   0.0544136   20.2791034   0.0000000
grp.month1      0.3174489   0.0637672   4.9782486    0.0000006
grp.month2      -0.0643480  0.0604185   -1.0650376   0.2868589
grp.month3      0.2619248   0.0690091   3.7955101    0.0001473
grp.duration1   1.2500054   0.0542134   23.0571388   0.0000000
grp.duration2   2.7423695   0.0488318   56.1594983   0.0000000
grp.campaign1   -0.2914368  0.0388654   -7.4986213   0.0000000
grp.campaign2   -0.4295598  0.0541171   -7.9375980   0.0000000
grp.pdays1      1.7918868   1.0029024   1.7867011    0.0739858
grp.pdays2      1.3994078   1.0044816   1.3931642    0.1635701
grp.previous1   -0.2227561  0.0986341   -2.2584090   0.0239202
poutcome1       1.0496216   1.0032678   1.0462028    0.2954674
poutcome2       -1.2746816  1.0026094   -1.2713641   0.2035991
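The behavior of step() with a lower and an upper scope can be seen on a small simulated example. In this sketch (the data frame `toy` and the predictors x1 to x3 are made up for illustration), only x1 truly affects the response. Backward selection starts from the full model and removes a term only when doing so lowers the AIC, and it never removes a term that appears in the lower scope.

```r
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x1))   # only x1 matters

full    <- glm(y ~ x1 + x2 + x3, family = binomial, data = toy)
reduced <- glm(y ~ x1,           family = binomial, data = toy)

# Backward search between the reduced (lower) and full (upper) models
chosen <- step(full,
               scope = list(lower = formula(reduced), upper = formula(full)),
               direction = "backward",
               trace = 0)

kept <- attr(terms(chosen), "term.labels")   # terms that survived selection
print(kept)
print(c(full = AIC(full), chosen = AIC(chosen)))
```

If no removal lowers the AIC, step() returns the starting model unchanged, which is why a selected model can coincide with the full model.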

3.4 Predictive Probability Analysis for Clients Subscribing Term Deposits

The backward search kept every term, so the coefficient table of the final model is identical to that of the initial full model. This final model will now be used to predict whether or not a client subscribes to a term deposit, based on given values of the predictor variables for two clients. A threshold probability of 0.5 is used to predict the response value.

# Predicting Response Value for Banking Client Given Variable Values for the Final Model
mynewdata = data.frame(age=c(58,44),
                       day = c(5,5),
                       grp.job = c("1","1"),
                       marital = c("2","1"),
                       education = c("3","1"),
                       housing = c("1","0"),
                       loan = c("0","0"),
                       contact = c("1","0"),
                       grp.month = c("3","2"),
                       grp.duration = c("2","1"),
                       grp.campaign = c("0","0"),
                       grp.pdays = c("1","2"),
                       grp.previous = c("1","0"),
                       poutcome = c("0","2"))
pred.success.prob = predict(final.model, newdata = mynewdata, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
## threshold probability
cut.off.prob = 0.5
pred.response = ifelse(pred.success.prob > cut.off.prob, 1, 0)  # This predicts the response

# Add the new predicted response to mynewdata
mynewdata$Pred.Response = pred.response
kable(mynewdata, caption = "Predicted Value of response variable with the given cut-off probability")
Predicted Value of response variable with the given cut-off probability
age day grp.job marital education housing loan contact grp.month grp.duration grp.campaign grp.pdays grp.previous poutcome Pred.Response
58 5 1 2 3 1 0 1 3 2 0 1 1 0 1
44 5 1 1 1 0 0 0 2 1 0 2 0 2 0

The predicted responses for these two clients are attached to the two new data records in the table above. The first client is predicted to subscribe to a term deposit, while the second is predicted not to.
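The prediction for the first client can be reproduced by hand from the coefficient table in Section 3.3. The sketch below copies the relevant estimates from that table (the vector names `beta` and `x` are illustrative), sums the terms that apply to the client, and converts the result to a probability with plogis(), base R's inverse logit. Reference levels (loan 0, grp.campaign 0, poutcome 0) contribute nothing to the sum.

```r
# Estimates copied from the final-model table; only terms that apply to client 1
beta <- c(intercept     = -3.6932401,
          age           =  0.0030780,
          day           = -0.0083344,
          grp.job1      = -0.5019278,
          marital2      = -0.1955606,
          education3    =  0.1931150,
          housing1      = -0.7395607,
          contact1      =  1.0848603,
          grp.month3    =  0.2619248,
          grp.duration2 =  2.7423695,
          grp.pdays1    =  1.7918868,
          grp.previous1 = -0.2227561)

# Client 1: age 58, day 5, and a dummy value of 1 for each level listed above
x <- c(intercept = 1, age = 58, day = 5, grp.job1 = 1, marital2 = 1,
       education3 = 1, housing1 = 1, contact1 = 1, grp.month3 = 1,
       grp.duration2 = 1, grp.pdays1 = 1, grp.previous1 = 1)

eta  <- sum(beta * x[names(beta)])   # linear predictor (log-odds)
prob <- plogis(eta)                  # inverse logit: 1 / (1 + exp(-eta))
pred <- as.integer(prob > 0.5)       # apply the 0.5 cut-off

print(c(eta = eta, prob = prob, pred = pred))
```

With these coefficients the linear predictor comes to about 0.86, which corresponds to a probability of about 0.70 and therefore a predicted response of 1, in agreement with the table. Since the fit is rank-deficient, predictions for clients whose levels involve the omitted coefficients should be read with more caution.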