Banks use a mix of direct and indirect marketing strategies to attract customer deposits. Traditional strategies include friend recommendations, direct mail, advertisements, and public sponsorships. As technology improves, we want to find out whether newer channels such as telephone and internet marketing can stimulate deposit growth while reducing marketing costs. This project investigates whether a bank’s telemarketing phone calls are a determinant of customers’ long-term deposits. We carry out a quantitative analysis using both linear models and more advanced machine learning algorithms to evaluate the effectiveness of this marketing strategy. We test the following hypotheses:
H0: None of the features contribute to y
H1: At least one feature contributes to y
The data come from a Portuguese banking institution and cover the period from May 2008 to November 2010. The raw dataset has over 40 thousand rows. After removing duplicate and bad data, the working dataset has about 30 thousand rows. The dataset includes 20 features, both continuous and categorical, such as customers’ demographics and bank balances. The response variable is binary, with 1 meaning that the customer bought the product (a bank term deposit) and 0 meaning that the customer did not.
For data cleaning, we first checked for missing values. There are no NA values in the sheet, but the value ‘unknown’ appears in the binary columns ‘default’, ‘housing’, ‘loan’ and ‘y’. We removed every row containing ‘unknown’. The original dataset contains 41,188 rows and 21 columns (20 features plus the response y). After removing those rows, 31,817 clean rows remain.
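The filtering step described above can be sketched as follows. This is a minimal illustration in Python, not the code used for the report (whose output appears to come from R). The sample rows are hypothetical; only the column names ‘default’, ‘housing’, ‘loan’ and ‘y’ are taken from the text.

```python
# Hypothetical rows standing in for the bank dataset; only the column
# names 'default', 'housing', 'loan' and 'y' come from the report.
rows = [
    {"age": 41, "default": "no", "housing": "yes", "loan": "no", "y": "no"},
    {"age": 35, "default": "unknown", "housing": "no", "loan": "no", "y": "no"},
    {"age": 52, "default": "no", "housing": "unknown", "loan": "unknown", "y": "yes"},
    {"age": 29, "default": "no", "housing": "no", "loan": "yes", "y": "yes"},
]

BINARY_COLUMNS = ("default", "housing", "loan", "y")


def drop_unknown(records, columns=BINARY_COLUMNS):
    """Keep only the records with no 'unknown' in any of the given columns."""
    return [r for r in records if all(r[c] != "unknown" for c in columns)]


clean_rows = drop_unknown(rows)
print(len(rows), "->", len(clean_rows))
```

Applied to the real data, the same filter would take the 41,188 raw rows down to the 31,817 rows reported above.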
After cleaning the data, we plotted a pie chart of the target variable y and found that the dataset is imbalanced, with only around 13% of clients classified as subscribed to a term deposit. This is not severe, but we keep it in mind. Since most machine learning models work better with balanced data, we will consider rebalancing the dataset if our models perform poorly.
To make the dataset easier to feed into the models, we converted column values of “yes” and “no” to 1 and 0 and expanded the categorical variables into dummy variables.
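A minimal sketch of this encoding step, written in Python for illustration. The helper names `encode_binary` and `one_hot` and the sample values are hypothetical. Unlike some modelling functions, this sketch keeps one dummy column per level and does not drop a reference level.

```python
def encode_binary(value):
    """Map 'yes' to 1 and 'no' to 0."""
    return {"yes": 1, "no": 0}[value]


def one_hot(values, prefix):
    """Expand one categorical column into 0/1 dummy columns, one per level."""
    levels = sorted(set(values))
    return [{f"{prefix}_{lvl}": int(v == lvl) for lvl in levels} for v in values]


# Hypothetical sample values for two columns of the dataset.
housing = ["yes", "no", "no", "yes"]
job = ["admin.", "technician", "admin.", "services"]

housing_encoded = [encode_binary(v) for v in housing]
job_dummies = one_hot(job, "job")

print(housing_encoded)
print(job_dummies[0])
```

Expanding every categorical column this way is what turns the 20 original features into the larger set of model variables used later.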
Our goal is to find whether these features contribute to the target variable y. First, we plotted the correlations among all the features. Two features clearly stand out with relatively large correlations with y: duration and poutcome success.
We used boxplots and bar plots to depict the relationships between y and potential predictors. The graphs below show some variables that seem to have an effect on y.
y with numeric predictors: age, last contact duration, employment variation rate and consumer price index
y with categorical predictors: month, day of week, job and education.
We use four different models to investigate the impact of the independent variables on whether customers buy the deposit. Two are generalized linear models: logistic regression and lasso. The other two are more advanced models: random forest and support vector machine. The details and results are as follows:
The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y). Before fitting the logistic model, we split the data into a training set and a test set at a ratio of 8:2. First, we fit the logistic model “m_logistic” on the training set with all predictor variables. To reduce overfitting, we then apply stepwise selection by AIC, which gives the model “m_logistic_stepwise” with 36 features instead of 50 (the df column below also counts the intercept).
##                     df      AIC
## m_logistic_stepwise 37 12004.02
## m_logistic          51 12014.95
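To see why the smaller model wins on AIC, the table can be unpacked with the standard definition AIC = 2·df − 2·logLik. The sketch below, in Python for illustration, uses only the four numbers printed above. Treating df as the parameter count in that formula is an assumption about how the table was produced.

```python
def neg2_loglik(aic, df):
    """Invert AIC = 2*df - 2*logLik to recover -2*logLik (the lack of fit)."""
    return aic - 2 * df


# (df, AIC) pairs copied from the table above.
models = {
    "m_logistic_stepwise": (37, 12004.02),
    "m_logistic": (51, 12014.95),
}

fit = {name: neg2_loglik(aic, df) for name, (df, aic) in models.items()}

# How much better the full model fits, and what its extra parameters cost.
fit_gain = fit["m_logistic_stepwise"] - fit["m_logistic"]
extra_penalty = 2 * (models["m_logistic"][0] - models["m_logistic_stepwise"][0])
best = min(models, key=lambda name: models[name][1])

print(round(fit_gain, 2), extra_penalty, best)
```

The 14 extra parameters of the full model improve the fit by about 17 units of −2·logLik but cost 28 units of penalty, so the stepwise model has the lower AIC.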
We use RMSE and R-squared to evaluate model performance. The RMSE and R-squared values are similar on the training and test data, so the model does not appear to overfit the training data. The logistic model performs well.
##          train data test data
## RMSE      0.2673233 0.2557771
## R-square  0.3688210 0.3963178
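The report does not state the exact formulas behind these two rows. A minimal sketch, assuming the usual definitions applied to the 0/1 outcome and the predicted probabilities, is given below in Python for illustration. The toy vectors are hypothetical. The helper `r_squared_from_rmse` uses the fact that, for a 0/1 outcome with positive share p, the total variance is p(1 − p); the class counts in the last line come from the confusion matrices later in the report.

```python
import math


def rmse(y_true, y_prob):
    """Root mean square error between 0/1 outcomes and predicted probabilities."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / n)


def r_squared(y_true, y_prob):
    """1 - SSE/SST, with SST measured around the mean outcome."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_prob))
    sst = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - sse / sst


def r_squared_from_rmse(rmse_value, positive_rate):
    """Same quantity for a 0/1 outcome, where SST / n = p * (1 - p)."""
    return 1 - rmse_value ** 2 / (positive_rate * (1 - positive_rate))


# Hypothetical toy data.
y_true = [0, 0, 1, 0, 1]
y_prob = [0.1, 0.2, 0.7, 0.3, 0.6]
print(round(rmse(y_true, y_prob), 4), round(r_squared(y_true, y_prob), 4))

# Consistency check against the training column of the table above.
train_r2 = r_squared_from_rmse(0.2673233, 3313 / 25453)
print(round(train_r2, 3))
```

Under these assumed definitions, the training RMSE and the training class share reproduce the reported training R-square to about three decimals, which suggests the table was computed in this way.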
Lasso stands for least absolute shrinkage and selection operator. It adds the sum of the absolute values of the coefficients as a penalty term, which counters overfitting by balancing error minimization against model complexity. To fit the lasso, we transform all the categorical variables into numerical variables, which expands our original 20 features into 50 variables.
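The way an absolute-value penalty removes variables can be illustrated with the soft-thresholding operator, which is the per-coefficient update used by coordinate-descent lasso solvers for standardized predictors. The sketch below is in Python for illustration; the coefficient values and the penalty of 0.1 are hypothetical and are not taken from the fitted model.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, and set it to exactly zero if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients for three variables.
coefs = {"duration": 0.9, "age": 0.05, "job_student": -0.3}
lam = 0.1  # hypothetical penalty strength

shrunk = {name: soft_threshold(value, lam) for name, value in coefs.items()}
selected = [name for name, value in shrunk.items() if value != 0.0]
print(selected)
```

Weak coefficients such as the hypothetical `age` value are set to exactly zero, which is why the lasso ends up selecting a subset of the 50 variables.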
We use the optimal lambda, the value that gives the best trade-off between variance and bias, and at that value the model selects around 17 important features.
From the plot above, we can clearly see that the mean squared error is smaller when the model contains more variables. However, the marginal reduction in prediction error shrinks as the model becomes more complex. With 17 variables in the model, the marginal benefit is almost zero, so we stop adding variables to avoid overfitting.
We use two criteria, root mean square error (RMSE) and R-squared, to evaluate model performance.
##          train data test data
## RMSE      0.2701560 0.2614370
## R-square  0.3553737 0.3693054
From the table above, the RMSE values on the training and test sets are similar (0.27 and 0.26). The R-squared is about 36% on the training set and about 37% on the test set. Since the test set performs no worse than the training set, the lasso does not appear to overfit.
Random forest is a classification algorithm that combines many decision trees, which makes it well suited to classification problems such as this one.
Based on the plot above, the error rate stays stable once the number of trees exceeds 20. To be safe, we set ntree to 200.
## [1] 1.0000000 0.3159642
## [1] 2.0000000 0.2550342
## [1] 3.0000000 0.2246284
## [1] 4.0000000 0.2054859
## [1] 5.0000000 0.2026013
## [1] 6.0000000 0.1972531
## [1] 7.0000000 0.1982531
## [1] 8.0000000 0.1937223
## [1] 9.0000000 0.1982188
## [1] 10.0000000 0.1953858
## [1] 11.0000000 0.1941341
## [1] 12.0000000 0.1974754
## [1] 13.0000000 0.1952271
## [1] 14.0000000 0.1932419
## [1] 15.0000000 0.1950893
## [1] 16.0000000 0.1977368
## [1] 17.0000000 0.1957967
## [1] 18.000000 0.196586
## [1] 19.0000000 0.1969163
## [1] 20.000000 0.198835
From the plot above, the mean error rate is lowest when mtry equals 13, so we build the random forest model with mtry = 13.
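The printed pairs above can be read as (mtry, error rate). The sketch below, in Python for illustration, picks the lowest value from that printed run. In this single run the minimum falls at mtry = 14, about 0.002 below the value at mtry = 13, and the error is nearly flat from mtry = 6 onward. The choice of 13 in the text refers to the mean error rate in the plot, which is not reproduced here and may average over several runs (an assumption).

```python
# (mtry, error rate) pairs copied from the printed output above.
error_by_mtry = {
    1: 0.3159642, 2: 0.2550342, 3: 0.2246284, 4: 0.2054859,
    5: 0.2026013, 6: 0.1972531, 7: 0.1982531, 8: 0.1937223,
    9: 0.1982188, 10: 0.1953858, 11: 0.1941341, 12: 0.1974754,
    13: 0.1952271, 14: 0.1932419, 15: 0.1950893, 16: 0.1977368,
    17: 0.1957967, 18: 0.196586, 19: 0.1969163, 20: 0.198835,
}

# mtry with the lowest error in this run, and how far mtry = 13 is from it.
best_mtry = min(error_by_mtry, key=error_by_mtry.get)
gap_13 = error_by_mtry[13] - error_by_mtry[best_mtry]

print(best_mtry, round(gap_13, 4))
```

Because random forests are fitted with random sampling, differences of this size between neighbouring mtry values are within run-to-run noise.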
The importance plot is:
From the plot above, duration is the most important variable in this model. The number of employees, the Euribor 3-month rate and age also have a relatively large impact on y.
The confusion table for the test data is:
##      bank_rf_final_test_pred
##         no  yes
##   no  5309  268
##   yes  325  462
The confusion table for the training data is:
##        no  yes class.error
## no  21007 1133  0.05117435
## yes  1386 1927  0.41835195
Comparison of accuracy, precision and recall on the training and test data:
##             Accuracy          Precision         Recall
## ratio_train 0.901016656150177 0.629738562091503 0.5815746142855
## ratio_test  0.906819610307982 0.632876712328767 0.587039390088945
From the table above, the model reaches just over 90% overall accuracy on both the training and test data. Accuracy, precision and recall are also similar across the two sets, which suggests that the model fits this dataset well without overfitting.
We also noticed that, although overall accuracy is high, performance on the positive class is weaker. Precision measures how many of the clients marked as yes by the model actually subscribe to a term deposit. In this case, about 63% of the clients marked as yes subscribe.
Recall measures how many of the clients who actually subscribe are marked as yes by the model. In this case it is about 58 to 59%, so the model identifies only about 60% of the clients who are willing to subscribe to term deposits.
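The three ratios can be recomputed directly from the test confusion table, with rows taken as the actual class and columns as the predicted class. The sketch below is in Python for illustration; the variable names are hypothetical and the four counts are copied from the table.

```python
# Counts from the test confusion table (rows = actual, columns = predicted).
true_negative = 5309   # actual no,  predicted no
false_positive = 268   # actual no,  predicted yes
false_negative = 325   # actual yes, predicted no
true_positive = 462    # actual yes, predicted yes

total = true_negative + false_positive + false_negative + true_positive

# Share of all clients classified correctly.
accuracy = (true_positive + true_negative) / total
# Of the clients predicted yes, the share who actually subscribed.
precision = true_positive / (true_positive + false_positive)
# Of the clients who actually subscribed, the share predicted yes.
recall = true_positive / (true_positive + false_negative)

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

These values agree with the ratio_test row of the comparison table: about 0.907 accuracy, 0.633 precision and 0.587 recall.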
The support vector machine implements the following idea: the input vectors are mapped from the original input space into a high-dimensional feature space through a non-linear mapping function chosen a priori, and a linear decision surface is then constructed in that feature space. It is well suited to real-life datasets in which the relationships among the input variables are complex and nonlinear.
We transformed all the categorical variables into numerical variables to fit the SVM, which expands our original 20 features into 50 variables. The training set contains 25,453 rows and the test set contains 6,364 rows.
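These set sizes follow from an 8:2 split of the 31,817 clean rows. A minimal sketch of such a split is shown below in Python for illustration. The function name and the seed are hypothetical, and the actual split in the report may have been drawn differently.

```python
import random


def train_test_split_indices(n_rows, train_fraction=0.8, seed=42):
    """Shuffle the row indices and cut them into a training and a test part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split_indices(31817)
print(len(train_idx), len(test_idx))
```

Rounding 80% of 31,817 down gives 25,453 training rows and leaves 6,364 test rows, matching the sizes stated above.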
Before fitting the SVM model, we tuned it on 100 data points with 10-fold cross-validation to find the parameters that give the best performance. The best model achieved an error rate of 0.1260043 with the following parameters:
SVM-Kernel: linear
cost: 0.001
gamma: 0.02040816
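The 10-fold cross-validation used for tuning can be sketched as below, in Python for illustration. The function `k_fold_indices` is hypothetical and only shows how the 100 tuning points are partitioned; fitting an SVM on each training part and averaging the validation errors over a grid of cost values is left out.

```python
def k_fold_indices(n_rows, k=10):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(100, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

With 100 tuning points, each candidate parameter setting is trained on 90 points and validated on the remaining 10, ten times over.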
Fitting the model with the best parameters on the training set and predicting on that same set gives the following performance:
## Confusion Matrix and Statistics
##
##           Reference
## Prediction     0     1
##          0 22140     0
##          1     0  3313
##
## Accuracy : 1
## 95% CI : (0.9999, 1)
## No Information Rate : 0.8698
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.8698
## Detection Rate : 0.8698
## Detection Prevalence : 0.8698
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
Using the model fitted on the training set to predict the test set gives the following performance:
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 5577    0
##          1    0  787
##
## Accuracy : 1
## 95% CI : (0.9994, 1)
## No Information Rate : 0.8763
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.8763
## Detection Rate : 0.8763
## Detection Prevalence : 0.8763
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
We also calculated RMSE and R-squared for both sets, as shown below. The results show that the SVM gives a perfect prediction on both the training set and the test set. Because parameter tuning for SVM is time consuming, we used only 100 data points to tune the parameters. Tuning with these 100 data points is efficient and returns good performance.
Dataset | Train Set | Test Set |
---|---|---|
RMSE | 0 | 0 |
R-square | 1 | 1 |
A perfect score on both sets is unusual for data of this kind and differs sharply from the 12.6% cross-validated error seen during tuning, so this result should be checked for target leakage (for example, y remaining among the predictors) before it is relied on.
The SVM model predicts the target variable y well. Moreover, the small p-value indicates that the model with these features performs significantly better than the “no information rate”, which is the share of the largest class in the data. Hence, we reject H0 and conclude H1: at least one feature contributes to y.
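The no information rate and the size of the reported p-value can be checked from the class totals in the two confusion matrices. The sketch below is in Python for illustration. It assumes that the p-value is a one-sided exact binomial test of accuracy against the no information rate; under that assumption, with every prediction correct, the p-value reduces to the rate raised to the power of the number of rows.

```python
import math

# Reference-class totals read off the two confusion matrices above.
train_counts = {"0": 22140, "1": 3313}
test_counts = {"0": 5577, "1": 787}


def no_information_rate(class_counts):
    """Share of the largest class: the accuracy of always guessing it."""
    return max(class_counts.values()) / sum(class_counts.values())


def log10_p_value_perfect_accuracy(class_counts):
    """log10 of P(all n predictions correct) if each were right with prob. NIR.

    Assumes a one-sided binomial test; the tail then reduces to NIR ** n.
    """
    n = sum(class_counts.values())
    return n * math.log10(no_information_rate(class_counts))


nir_train = no_information_rate(train_counts)
nir_test = no_information_rate(test_counts)
log10_p_test = log10_p_value_perfect_accuracy(test_counts)

print(round(nir_train, 4), round(nir_test, 4))
```

The two rates round to 0.8698 and 0.8763, matching the printed output, and the implied p-value is far below the 2.2e-16 floor shown there.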
Based on our analysis, all four models support H1: at least one feature contributes to y.
1. Logistic regression with stepwise AIC: Out of 50 features (including dummy features), 16 were identified as significant based on their small p-values.
2. Lasso: Out of 50 features (including dummy features), 17 important features were selected by the model.
3. Random Forest: The importance plot ranks the features. Duration is clearly the most important variable in this model, and the number of employees, the Euribor 3-month rate and age also have a relatively large impact on the target variable.
4. Support Vector Machine: The model’s p-value indicates that the model with features performs significantly better than the no information rate.
In conclusion, our analysis indicates that the telemarketing strategy is effective in encouraging customers to buy long-term deposits. Banks could continue to use this strategy to attract more customer deposits in the future.