1 Introduction

A Portuguese banking institution has provided data from its direct marketing campaigns. The campaigns were based on phone calls to customers offering term deposit subscriptions. In this project, we build a model that predicts which clients are more likely to subscribe to a term deposit, and we identify the main factors that affect the clients’ decision.

2 Objectives

3 Data Source

The data is available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/). The data set contains 4119 clients and 20 attributes. The outcome is whether the client subscribed to a term deposit. In the data set, 89% of the clients did not subscribe.
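
Because the outcome is imbalanced, it helps to know the accuracy of a trivial classifier that always predicts the most frequent class. The sketch below is illustrative only: the label list is a hypothetical stand-in that mirrors the 89% / 11% split reported above, not the actual data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical labels mirroring the reported 89% "no" / 11% "yes" split.
labels = ["no"] * 89 + ["yes"] * 11
baseline = majority_baseline_accuracy(labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Any test accuracy reported for a model is best read against this baseline rather than against 50%.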

4 Exploratory Data Analysis

4.1 Data Cleaning

Because the data was collected through phone interviews, many clients declined to provide some information. These values are not missing at random, so the missingness should be modeled rather than ignored. Missing values are marked as “unknown” in the data set. Figure 1 shows that 75% of the clients gave a full response to the questions. The default attribute has 19.5% missing values, education has 4%, and so on. The histogram also shows how the missing values are distributed across the attributes.
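
The missing-value shares above can be obtained by counting the “unknown” marker per attribute. The following is a minimal sketch on a hypothetical four-row sample; the rows and the function name `missing_summary` are assumptions made for illustration.

```python
def missing_summary(rows, marker="unknown"):
    """Return (share of missing values per attribute, share of complete rows)."""
    n = len(rows)
    missing_counts = {}
    for row in rows:
        for key, value in row.items():
            missing_counts.setdefault(key, 0)
            if value == marker:
                missing_counts[key] += 1
    shares = {key: count / n for key, count in missing_counts.items()}
    # A row is complete when none of its values equals the marker.
    complete = sum(all(v != marker for v in row.values()) for row in rows) / n
    return shares, complete


# Hypothetical sample rows, not taken from the actual data set.
rows = [
    {"default": "no", "education": "basic.4y"},
    {"default": "unknown", "education": "university.degree"},
    {"default": "no", "education": "unknown"},
    {"default": "no", "education": "high.school"},
]
shares, complete = missing_summary(rows)
print(shares, complete)
```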

4.1.1 Missing Values Imputation

There are a number of different methods to handle missing data. Since the attributes with missing values are categorical with more than two levels, we use Multivariate Imputation by Chained Equations (MICE) with polytomous regression. In Figure 2 we compare the distribution of the original data with the imputed outcome. The bar plots are quite similar, which suggests that the imputation preserves the distribution of each attribute.
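
The visual comparison of Figure 2 can also be expressed as a number: the largest gap between category proportions before and after imputation. The sketch below does not implement MICE; it only illustrates the check, and both columns are hypothetical.

```python
from collections import Counter


def proportions(values):
    """Share of each category in a list of values."""
    counts = Counter(values)
    total = len(values)
    return {level: count / total for level, count in counts.items()}


def max_proportion_gap(observed, imputed):
    """Largest absolute difference in category shares between two columns."""
    p_obs = proportions(observed)
    p_imp = proportions(imputed)
    levels = set(p_obs) | set(p_imp)
    return max(abs(p_obs.get(k, 0.0) - p_imp.get(k, 0.0)) for k in levels)


# Hypothetical column with "unknown" entries and a hypothetical imputed version.
original = ["no", "no", "unknown", "yes", "no", "unknown", "no", "no"]
observed = [v for v in original if v != "unknown"]
imputed = ["no", "no", "no", "yes", "no", "no", "no", "no"]

gap = max_proportion_gap(observed, imputed)
print(f"Largest gap in category shares: {gap:.3f}")
```

A small gap only indicates that the marginal distribution is preserved; it does not show that individual imputed values are correct.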

4.2 Graphical Results

The figure below displays histograms and bar plots of numeric attributes. The side-by-side boxplots display the attributes by the outcome (Yes/No subscription to a term deposit).

Summary key points

  • 75% of the clients are younger than 47 years old.
  • 2% of the clients are older than 60 years old.
  • The average age for subscribed and non-subscribed clients is quite similar (41 and 40 years old, respectively).
  • There is no significant difference in the age of clients by groups, indicating little association between age and subscription.
  • 75% of clients were contacted less than three times during this campaign.
  • Of the clients contacted once in this campaign, 82% had not been contacted previously.
  • The employment variation rate median of subscribed clients is almost 3 points lower than the median employment variation rate of non-subscribed clients.
  • The average consumer price index is similar for both groups: 93.41 for subscribed clients and 93.59 for non-subscribed clients.
  • The consumer confidence index does not show an important difference between the groups: -40.58 for non-subscribed clients and -39.78 for subscribed clients.
  • The subscribed group in Euribor 3 month rate shows a lower median and is more variable than non-subscribed clients. The bimodal shape suggests that there are two distinct groups: the low interest rate clients and the high interest rate clients.
  • There is an important difference in the number of employees of the bank by groups of clients. The median of non-subscribed clients (5196 employees) is higher than the median of subscribed clients (5076 employees).
  • There is a significant difference in call duration between subscribed and non-subscribed clients. Subscribed clients have a median call of 7.63 minutes, whereas 50% of non-subscribed clients were on the phone for less than 2.75 minutes.
  • 64% of the clients were contacted via telephone. 5% of these resulted in a subscription.
  • 36% of the clients were contacted via cellular. 14% of these resulted in a subscription.
  • Cellular is the type of contact with the highest subscription rate.
  • Telephone calls that finished with a subscription were longer than cellular calls that ended with a subscription.
  • Clients for whom the previous campaign was successful are more likely to subscribe to a term deposit.

5 Testing Attributes

To understand which attributes are related to the outcome, we conducted chi-square tests of independence for the categorical attributes and Mann-Whitney tests for the numeric attributes. The results are shown in the following table.
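
To make the two tests concrete, the sketch below computes the Pearson chi-square statistic for a contingency table and the Mann-Whitney U statistic for two samples. All counts and durations are hypothetical. The p-value shortcut is valid only for one degree of freedom (a 2×2 table); in practice a statistics library would supply p-values for both tests.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def chi_square_p_value_1df(stat):
    """Upper-tail p-value of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


def mann_whitney_u(x, y):
    """U statistic for x: pairs (a, b) with a > b count 1, ties count 0.5."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical 2x2 table: rows = contact type, columns = (subscribed, not subscribed).
table = [[10, 20], [20, 10]]
stat = chi_square_statistic(table)
p_value = chi_square_p_value_1df(stat)

# Hypothetical call durations in minutes for two groups of clients.
subscribed = [7.6, 5.1, 9.0]
not_subscribed = [2.7, 3.0, 6.0]
u = mann_whitney_u(subscribed, not_subscribed)
print(stat, p_value, u)
```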

6 Predictors

Based on the exploratory data analysis and the inferential analysis, we identified the attributes most likely to affect the client’s subscription. We call these attributes “predictors”: job, marital, education, contact, month, campaign, pdays, previous, poutcome, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed, duration_min.

7 Methodology

Our problem involves predicting whether the client will subscribe to a term deposit or not. This is known as a classification problem with binary response (Yes/No).

We built six different algorithms to predict the term deposit subscription:

The graphical results (Figure 3) show that some attributes have a skewed distribution, others an exponential one, and some are multimodal. We therefore applied a power transform to each of these attributes to bring its distribution closer to symmetric.
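
The report does not say which power transform was used. As an illustration only, the sketch below assumes the Box-Cox family and shows how the transform can remove skewness from a hypothetical, strongly right-skewed sample. Box-Cox requires strictly positive values, so attributes that take negative values would need a variant such as Yeo-Johnson.

```python
import math


def box_cox(values, lam):
    """Box-Cox power transform; lam = 0 corresponds to the natural logarithm."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def skewness(values):
    """Moment-based sample skewness (third central moment over variance^1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed sample (each value doubles the previous one).
raw = [1, 2, 4, 8, 16, 32, 64]
transformed = box_cox(raw, 0)

print(f"Skewness before: {skewness(raw):.3f}")
print(f"Skewness after:  {skewness(transformed):.3f}")
```

In practice the parameter is usually estimated from the data rather than fixed at zero as it is here.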

We also used resampling methods to split the data set into training and testing sets. To assess the predictive capability of each algorithm on unseen data, we computed the test accuracy and the area under the Receiver Operating Characteristic (ROC) curve.
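
Both measures can be computed directly from test-set labels and predicted scores. The sketch below uses hypothetical labels and scores and a 0.5 decision threshold, which is an assumption. The AUC is computed through its rank interpretation: the probability that a randomly chosen subscriber receives a higher score than a randomly chosen non-subscriber.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def auc(y_true, scores, positive="yes"):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels and predicted probabilities of "yes".
y_true = ["no", "no", "yes", "no", "yes", "no"]
scores = [0.10, 0.40, 0.80, 0.20, 0.35, 0.05]
y_pred = ["yes" if s >= 0.5 else "no" for s in scores]

test_accuracy = accuracy(y_true, y_pred)
test_auc = auc(y_true, scores)
print(test_accuracy, test_auc)
```

Unlike accuracy, the AUC does not depend on the decision threshold, which makes it more informative when one class dominates.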

8 Algorithm Performance Comparison

Table 2 below lists the performance measurements for the six algorithms developed.

Random forest ranks first in Area Under the Curve (AUC) and has a test accuracy similar to that of GLM, so we consider it the best model. Figure 4 shows the ROC curve and the trade-off between specificity and sensitivity for each algorithm. Random forest has the largest area under the curve.

In Figure 5 the green curve is the error rate for “yes” subscriptions, the red curve is the error rate for “no” subscriptions, while the black curve is the Out-of-Bag error rate (OOB error). The OOB error falls below 9%.

The model predicts “no” subscriptions better than “yes” subscriptions. This is because the data set contains many more clients who did not subscribe to a term deposit. The model is highly sensitive (96.59%), which means it has a high true positive rate.
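
Sensitivity and specificity depend on which class is treated as positive, and the report does not state this explicitly. The sketch below, on hypothetical predictions, shows that the two figures swap when the positive class changes.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """Return (true positive rate, true negative rate) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical test set: 8 "no" clients and 2 "yes" clients.
y_true = ["no"] * 8 + ["yes"] * 2
y_pred = ["no"] * 7 + ["yes"] + ["yes", "no"]

sens_no, spec_no = sensitivity_specificity(y_true, y_pred, positive="no")
sens_yes, spec_yes = sensitivity_specificity(y_true, y_pred, positive="yes")
print(sens_no, spec_no, sens_yes, spec_yes)
```

With an imbalanced outcome, a high sensitivity for the majority class can coexist with a much lower detection rate for the minority class.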

Now that we have built the model with the strongest predictive ability, we identify which attributes most affect the clients’ decision and, in other words, the success of the marketing campaign.

9 Variables of Importance

Figure 6 below shows the relative variable importance in the term deposit subscription by plotting the mean decrease in Gini Index, calculated across all trees.
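
For a single split, the decrease in Gini index is the impurity of the parent node minus the size-weighted impurity of its two children. The importance measure accumulates these decreases over all splits that use a given attribute and averages them across trees. The sketch below shows the calculation for one hypothetical split; the node contents are invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Hypothetical node split on some attribute.
parent = ["yes"] * 4 + ["no"] * 4
left = ["yes"] * 3 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 3

decrease = gini_decrease(parent, left, right)
print(f"Gini decrease for this split: {decrease:.3f}")
```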

The duration of the last contact has the highest relative importance, followed by the Euribor 3 month rate, job and number of employees. To analyse the direction of the effect of these predictors on the outcome, we examine their coefficients in the logistic regression model that we fitted earlier.

9.1 Coefficient of the Important Predictors

Table 3 was created according to the signs of predictor coefficients in logistic regression.

Duration is statistically significant, suggesting a strong association of the last contact duration and the chance of subscribing to a term deposit.

The positive coefficient of Euribor3m suggests that, all other variables being equal, the Euribor 3 month rate has a positive effect on clients’ subscription.

The negative coefficients of job categories such as blue-collar, entrepreneur and housemaid, among others, suggest that clients in these jobs are less likely to subscribe to a term deposit.

The number of employees has a positive effect on clients’ subscription.
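
The sign of a logistic regression coefficient can be read through the odds ratio: exponentiating the coefficient gives the factor by which the odds of subscribing change for a one-unit increase in the predictor, all other predictors being held fixed. The coefficient and intercept values below are hypothetical and are not taken from Table 3.

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in the predictor."""
    return math.exp(coefficient)


def subscription_probability(intercept, coefficient, x):
    """Predicted probability from a one-predictor logistic model."""
    z = intercept + coefficient * x
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients: one positive, one negative.
positive_effect = odds_ratio(0.5)
negative_effect = odds_ratio(-0.5)

# Hypothetical one-predictor model evaluated at two predictor values.
p_low = subscription_probability(-2.0, 0.5, 1.0)
p_high = subscription_probability(-2.0, 0.5, 4.0)
print(positive_effect, negative_effect, p_low, p_high)
```

An odds ratio above 1 corresponds to a positive coefficient and higher odds of subscription; a value below 1 corresponds to a negative coefficient.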

10 Conclusion

In this project we aimed to build a model to predict whether the client will subscribe to a term deposit and identify the main attributes that affect the subscription.

To identify the predictors, we conducted several hypothesis tests and ended up with 15 predictors.

We developed a set of linear, nonlinear, bagging and boosting algorithms and assessed them using test accuracy and AUC. Random forest was the most powerful classifier, with 91.58% accuracy and 93.29% AUC.

We ranked the relative variable importance according to the mean decrease in Gini Index. This ranking identified the duration of the last contact as the most important factor, the Euribor 3 month rate as the second, job as the third, and number of employees as the fourth.

In summary, firstly, clients are more likely to subscribe to a term deposit when the phone conversation is long: the median call of subscribed clients lasted 7.63 minutes, against 2.75 minutes for non-subscribed clients.

Secondly, the higher the Euribor 3 month rate, the higher the chance of subscription. In general, when the Euribor interest rates rise or fall there is a high likelihood that the interest rates on banking products such as mortgages, savings accounts and loans will also be adjusted.

Thirdly, technicians, retired clients and clients working in services are more likely to subscribe to a term deposit.

Finally, the number of employees in the bank positively affects the subscription. Usually, the size of a bank is associated with the financial strength of the institution.