REPORT OUTLINE:

1. Business Understanding

Telemarketing is a form of direct marketing in which a salesperson engages a client face to face or over the phone to convince them to purchase a product. Nowadays, the telephone (fixed-line or mobile) is broadly used for this purpose.

The marketing campaign here is nothing but phone calls to clients to persuade them to open a term deposit with their bank. A term deposit is a deposit that a bank or financial institution offers at a fixed interest rate (often with a better rate than simply opening a deposit account) and that is returned at a specific maturity date.

The business goal is to improve marketing effectiveness by targeting the right customers. In addition to being more efficient, the potential reduction in marketing costs is likely to increase the overall profit margin of the product. Therefore, machine learning is used in this project to predict how customers will respond to the bank's telemarketing campaign (subscribe or not subscribe).

2. Data Understanding

The data used in this project comes from the UCI Machine Learning Repository. You can find the data and its description here.

Table 1. Result of data understanding

This code will display the number of rows, columns, type of features, and general information about the data set

reactable(
  tableBank1,
  columnGroups = list(
    colGroup(name = "MissingValues",
             columns = c("Rows", "Columns", "DiscreteFeature", "ContinuousFeature",
                         "Subcribe", "NotSubcribe", "DuplicatedRows", "Jobs", "Marital",
                         "Education", "Default", "Loan", "OutcomeLast"))),
  bordered = TRUE, striped = TRUE, highlight = TRUE
)


Table 2. Data set Telemarketing

This code displays the full Bank Telemarketing data set

reactable(
  BankMarketing,
  defaultPageSize = 5,
  bordered = TRUE, striped = TRUE, highlight = TRUE
)

3. Data Exploration

A. Univariate Analysis


From figure 1, we can see that call duration during the previous campaign was between 0 and 35 minutes. Most of the clients were called for less than 5 minutes, and only a few calls lasted longer than that.
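The chunk that produced figure 1 is not echoed in this document; the sketch below shows one way such a histogram could be drawn, assuming the raw duration column is recorded in seconds (as in the original UCI data) and converted to minutes for plotting.

# Sketch (assumption): histogram of call duration, converting seconds to minutes
d <- BankMarketing %>%
  mutate(durationMin = duration / 60) %>%
  ggplot(aes(durationMin)) +
  geom_histogram(binwidth = 1, fill = "cornflowerblue") +
  labs(y = "Total Customer", x = "Call duration (minutes)")
ggplotly(d)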

This code is to display visualization for education level


# RENAME VALUES OF VARIABLE EDUCATION
BankMarketing$education[BankMarketing$education == 'basic.4y'] <- '4Y'
BankMarketing$education[BankMarketing$education == 'basic.6y'] <- '6Y'
BankMarketing$education[BankMarketing$education == 'basic.9y'] <- '9Y'
BankMarketing$education[BankMarketing$education == 'university.degree'] <- 'Univ.'
BankMarketing$education[BankMarketing$education == 'professional.course'] <- 'PCourse'
BankMarketing$education[BankMarketing$education == 'high.school'] <- 'HSchool'

x <- BankMarketing %>%
  mutate(education = education %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(education)) +
  geom_bar(fill = "cornflowerblue") +
  labs(y = "Total Customer", x = "Education Level")
ggplotly(x)

Figure 2. Total Clients over Education Level

From figure 2, it is clear that the majority of clients were university graduates, at about 12K clients, followed by high school graduates with almost 10K clients. In short, most customers came from the secondary and tertiary education levels.

B. Bivariate Analysis

This code is to visualize two variables at once (month and day)

p <- ggplot(data = BankMarketing, aes(x = month, fill = day)) +
  geom_bar(position = "dodge") +
  labs(y = "Total Customer", x = "Last contact month of year")
ggplotly(p)

Figure 3. The relationship of Last Contact Month and Day of Previous Campaign

First of all, we can notice from figure 3 that no contact was made during January and February, and that calls were not made on weekend days. Clients received many calls in May, especially on Wednesday and Friday. December aside, there are enough observations to conclude that this pattern is not pure luck, so this feature will probably be very important in the models.

This code is to display visualization of Age Over Status of Subscription

a <- ggplot(BankMarketing, aes(x = Status, y = age, color = Status)) +
  geom_boxplot() +
  labs(y = "Age of Customer", x = "Status of subscription")
ggplotly(a)

Figure 4. The relationship of customer age and status of subscription

Figure 4 shows that, in its telemarketing campaigns, the clients called by the bank span an extensive age range, from 18 to 95 years old. However, the majority of customers called are in their 30s and 40s (33 to 48 years old fall within the 25th to 75th percentiles). The distribution of customer age is fairly normal with a small standard deviation.


C. Multivariate Analysis

This code is to display correlation plot

num <- map_lgl(BankMarketing, is.numeric)
num_var <- BankMarketing[, num]
corMatrix <- cor(num_var)
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(corMatrix, method = "color", col = col(200), type = "lower", order = "hclust",
         addgrid.col = "darkgray", addCoef.col = "black",  # add the correlation coefficients
         number.cex = 0.6, tl.col = "black", tl.srt = 30, tl.cex = 0.8, diag = FALSE)

Figure 5: Correlation plot among numerical features

We set a cutoff of 0.5, meaning that features with correlation coefficients above 0.5 would be removed from the data. The correlation plot shows that the correlations of numEmp-EmpVaRate and numEmp-threeMonthRate are 0.97 and 0.95, respectively, followed by EmpVaRate-threeMonthRate at 0.91 and EmpVaRate-priceIdx at 0.78. The correlations of numEmp-priceIdx and threeMonthRate-priceIdx are 0.69 and 0.52, respectively. In total, four features are eliminated from the data in the data preparation stage.

4. Data Preparation

Overall, we performed 11 tasks during the data preparation stage, as follows:

1. Renaming Variables

We start this stage by renaming some columns so that the variables are easier to type and reference.

Table 3. Renaming some variables

2. Getting rid of duplicated rows

This code is to display duplicated rows

BankMarketing[duplicated(BankMarketing), ]

Table 4. List of duplicated rows
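The chunk above only lists the duplicates; a minimal sketch of the removal step itself (keeping the first occurrence of each record) could be:

# Drop duplicated rows, keeping the first occurrence of each record
BankMarketing <- BankMarketing[!duplicated(BankMarketing), ]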

3. Missing Values Imputation

This code is to impute missing values with mode

# Replace missing values in each categorical feature with its mode (most frequent value)
BankMarketing$job[is.na(BankMarketing$job)] <- mfv(BankMarketing$job, na_rm = TRUE)
BankMarketing$marital[is.na(BankMarketing$marital)] <- mfv(BankMarketing$marital, na_rm = TRUE)
BankMarketing$education[is.na(BankMarketing$education)] <- mfv(BankMarketing$education, na_rm = TRUE)
BankMarketing$default[is.na(BankMarketing$default)] <- mfv(BankMarketing$default, na_rm = TRUE)
BankMarketing$housing[is.na(BankMarketing$housing)] <- mfv(BankMarketing$housing, na_rm = TRUE)
BankMarketing$loan[is.na(BankMarketing$loan)] <- mfv(BankMarketing$loan, na_rm = TRUE)
BankMarketing$OutcomeLastCont[is.na(BankMarketing$OutcomeLastCont)] <- mfv(BankMarketing$OutcomeLastCont, na_rm = TRUE)

As shown in table 1, several categorical variables have missing values, so we applied mode imputation (mode substitution): the missing values of each categorical variable are replaced by the mode of the non-missing cases of that variable.

Table 5. Mode of Features having missing values


4. Dropping an unimportant feature

This code is to remove feature ‘default’

# Remove the 6th column ('default')
BankMarketing <- BankMarketing[-c(6)]

We discovered that 'default' is essentially unusable: only 3 individuals replied "yes" to the question "Do you have credit in default?". The others either answered "no" or did not reply at all, which gives zero information in our case. This variable is therefore removed from the data set.
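A quick frequency check such as the sketch below (assuming the column is still present and named 'default' at this point) makes that imbalance visible before the column is dropped:

# Frequency of each answer to "Do you have credit in default?"
table(BankMarketing$default, useNA = "ifany")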

5. Converting some variables

This code is to convert some numeric features into categorical features

BankMarketing = BankMarketing %>% 
  mutate(previous = if_else(previous == 0, 0, 1))

BankMarketing = BankMarketing %>% 
  mutate(lastContact = if_else(lastContact == 999, 0, 1))

BankMarketing = BankMarketing %>% 
 mutate(age = if_else(age >=  60, "Level3", if_else(age >= 30, "Level2", "Level1")))

There are 3 numeric variables that need to be converted into categorical features in order to be meaningful. These variables take a large number of values, with very unbalanced numbers of observations per value. If we did not convert them, the values with high numbers of observations would bias the predictive model.

Table 6. Converting feature ‘previous’


Table 6 shows that the values range from 0 to 7, with large variation in the number of observations. The most frequent value is '0', followed by '1' with less than half as many observations, while the rarest value, 7, has only one record. Based on this analysis, we transformed these values into 0 and 1, as shown in table 6.

Table 7. Converting feature ‘Last Contact’

Data exploration on the feature last contact showed that the number of days that passed since the client was last contacted in a previous campaign varies widely, where the value 999 means the client was not previously contacted. We transformed these values into 0 and 1, resulting in almost equal numbers of observations for the two groups.

Table 8. Converting feature ‘age’

By analysing the result in table 8, we decided to convert the age range into 3 categories, namely level 1 (customers aged less than 30 years old), level 2 (customers aged between 30 and 60 years old), and level 3 (the remaining ages).

6. Removing outliers

This code is to remove outliers from features (duration and campaign)

# Obtain the values of outliers in duration
out_duration <- boxplot.stats(BankMarketing$duration)$out
# Get the index of outliers in duration
out.duration.idx <- which(BankMarketing$duration %in% c(out_duration))
# Remove the outliers in duration
BankMarketing <- BankMarketing[-out.duration.idx,]


# Obtain the values of outliers in campaign
out_camp <- boxplot.stats(BankMarketing$campaign)$out
# Get the index of outliers in campaign
out.camp.idx <- which(BankMarketing$campaign %in% c(out_camp))
# Remove the outliers in campaign
BankMarketing <- BankMarketing[-out.camp.idx,]

Figure 6. Before-after features with outliers

7. Eliminating High Correlation Features

This code is to remove high correlation features

# Drop the four highly correlated numeric features by column position
num_var <- num_var[, -c(5, 6, 8, 9)]

As the result of the multivariate analysis, we removed the features numEmp, EmpVaRate, threeMonthRate, and priceIdx.
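Dropping columns by position depends on the column order never changing; as a sketch of an alternative, caret's findCorrelation() can select the columns to remove using the 0.5 cutoff described in the exploration section (it may not pick exactly the same four features):

# Indices of numeric columns whose pairwise correlation exceeds the 0.5 cutoff
highCor <- findCorrelation(corMatrix, cutoff = 0.5)
num_var <- num_var[, -highCor]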

8. Features Encoding
Feature encoding was performed because machine learning algorithms generally work better with numeric features. We applied the one-hot encoding technique to this data set; as a result, we obtained 43 features from the 15 original features.

This code is to implement One-HOT encoding technique

# Select the categorical columns and convert them to factors
char_var <- BankMarketing[, !num]
char_var <- as.data.frame(lapply(char_var, factor))
# One-hot encode all categorical features except the target (column 11)
dmy <- dummyVars(" ~ .", data = char_var[-c(11)], fullRank = T)
dataEncoded <- data.frame(predict(dmy, newdata = char_var[-c(11)]))
# Recombine the encoded features, the numeric features, and the target
BankMarketing <- cbind(dataEncoded, num_var, char_var[11])

9. Standardization
The feature confIdx (consumer confidence index) had a much wider range of values than the other numeric features. To reduce bias in the modeling, we standardized it by centering and scaling, so that the values have mean = 0 and standard deviation = 1; the feature goes from the range between -50.8 and -26.9 to roughly between -2.2 and 2.9.

This code is to perform feature scaling (standardization)

# Center and scale confIdx so it has mean 0 and standard deviation 1
confIdx <- as.data.frame(scale(BankMarketing$confIdx, center = TRUE, scale = TRUE))
BankMarketing$confIdx <- NULL
BankMarketing <- cbind(confIdx, BankMarketing)
names(BankMarketing)[1] <- "confIdx"

10. Dimensionality Reduction

This code is to implement PCA

# Project all features except the target (column 43) onto the first two principal components
pca <- preProcess(BankMarketing[, -43], method = 'pca', pcaComp = 2)

BankMarketingPCA <- predict(pca, BankMarketing)
# Reorder the columns so that the target comes last
BankMarketingPCA <- BankMarketingPCA[c(2, 3, 1)]

PCA was applied to the data set to reduce the number of variables, both to lower the training computation and to work only with the components that capture most of the variance. As a consequence, only the first two principal components are used in the modeling stage.

11. Handling Imbalanced Data
From table 1, we found that the data set is imbalanced. To obtain a fair evaluation of the modeling results, we dealt with this condition using the ROSE library, which also lets us generate data synthetically; we set the parameters method = 'both' and seed = 1. When method = "both" is selected, the minority class is oversampled with replacement and the majority class is undersampled without replacement.

This code uses the ROSE library to deal with the imbalanced data set

set.seed(123)
BankMarketing.rose <- ROSE(Status ~ ., data = BankMarketingPCA, seed = 1)$data
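Note that method = "both" is an argument of ROSE's ovun.sample() function rather than of ROSE() itself; a minimal sketch of that sampling variant, under the same assumptions, would be:

# Combined sampling: minority class oversampled with replacement,
# majority class undersampled without replacement (sketch)
BankMarketing.both <- ovun.sample(Status ~ ., data = BankMarketingPCA,
                                  method = "both", p = 0.5, seed = 1)$data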

Table 9. The result of ROSE Library

5. Modeling

In the modeling phase, 3 schemes were proposed using 2 machine learning algorithms: SVM with user-defined cost and gamma, SVM with hyperparameters tuned via a grid search, and XGBoost (Extreme Gradient Boosting) with default parameters. We used 80% of the cleaned data set for model building and the remaining 20% for model testing. In R, these algorithms can be implemented using the packages 'e1071' for SVM and 'xgboost' for XGBoost.

This code separates the train and test sets

# Set a random seed for the results to be reproducible
set.seed(123)

# 80% of the balanced data as the training set
trainX <- createDataPartition(BankMarketing.rose$Status, p = 0.8, list = FALSE)
train <- BankMarketing.rose[trainX, ]
# The remaining 20% as the test set
test <- BankMarketing.rose[-trainX, ]

This code is to perform SVM

# SVM with user-defined cost and gamma
BankMarketing.svm <- svm(Status ~ ., data = train, kernel = "radial", cost = 1, gamma = 1)
svm.pred <- predict(BankMarketing.svm, test[-3])
error <- table(svm.pred, test$Status)
accuracy <- sum(diag(error)) / sum(error)

confusionMatrix(test$Status, svm.pred, positive = 'yes')

# SVM with hyperparameters tuned by grid search
obj <- tune.svm(Status ~ ., data = train, gamma = 2^(-1:1), cost = 2^(2:4))
Summary_obj <- summary(obj)
svm.pred1 <- predict(obj$best.model, test[-3])
error1 <- table(svm.pred1, test$Status)
accuracy1 <- sum(diag(error1)) / sum(error1)

This code is to perform XGBoost algorithm
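The XGBoost chunk is not echoed in this document; the sketch below shows one way a default-parameter model could be fitted on the same train/test split. The target column position, the "yes"/"no" coding of Status, and the number of boosting rounds (nrounds, which has no default) are assumptions, not taken from the source.

# Sketch: XGBoost on the PCA-reduced data (assumes Status is column 3, coded "yes"/"no")
train.x <- as.matrix(train[, -3])
train.y <- as.numeric(train$Status == "yes")
test.x  <- as.matrix(test[, -3])

BankMarketing.xgb <- xgboost(data = train.x, label = train.y,
                             objective = "binary:logistic", nrounds = 100, verbose = 0)

xgb.prob <- predict(BankMarketing.xgb, test.x)
xgb.pred <- factor(ifelse(xgb.prob > 0.5, "yes", "no"), levels = levels(test$Status))
confusionMatrix(test$Status, xgb.pred, positive = "yes")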

6. Model Evaluation

The predictive models are evaluated primarily on recall (sensitivity), but other evaluation metrics are also provided below.
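As a sketch of how these metrics can be read off caret's confusionMatrix() output (reusing svm.pred and test from the modeling section):

# Extract common evaluation metrics from the confusion matrix
cm <- confusionMatrix(test$Status, svm.pred, positive = "yes")
cm$overall["Accuracy"]
cm$byClass[c("Sensitivity", "Specificity", "Precision", "Recall", "F1")]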

Table 10. Model evaluation using SVM algorithm

Table 11. Model evaluation using XGBoost algorithm

7. Recommendation

Here are some recommendations for this case study:

1. The marketing team should target relatively older customers.
2. We suggest that the bank provide more information about the outcome of the current campaign.
3. Work with a bigger data set, specifically with more personal financial records such as income, credit card bills, and payments.
4. Having more ordinal/numerical columns may improve prediction.