## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
Marketing is one of the most important functions in any business, banking included, because it is what draws customers toward a product. In the banking industry, marketing and sales are core activities for promoting bank products so that banks can gather the funds they need for their operations. But as we all know, most marketing campaigns fail to convert customers; successes are the exception rather than the rule. Banks have therefore always struggled to direct their resources toward the right customers. What if a bank could know which customers are most likely to buy its products even before running a marketing campaign?
With machine learning, we actually can: by analyzing various characteristics (variables) of the customers, we can learn which of them are most likely to buy a bank product when contacted during a marketing campaign, so that banks can focus their resources more efficiently. In this article, we will demonstrate how to predict which customers are most likely to say yes to a financial product after a marketing contact, by building a model that is trained and tested on a Portuguese bank marketing dataset. As the glimpse output above shows, the dataset contains client attributes (age, job, marital, education, default, balance, housing, loan), attributes of the current campaign contact (contact, day, month, duration, campaign), information about the previous campaign (pdays, previous, poutcome), and the target variable y (whether the client subscribed).
Each of these variables captures a client characteristic or a campaign attribute that may be related to whether the marketing contact succeeds.
In the first step of our data preprocessing, we check for duplicated entries and missing values below; as it turns out, there are none.
colSums(is.na(bank))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
duplicates <- duplicated(bank)
sum(duplicates)
## [1] 0
However, there may be other values we should treat as missing: the "unknown" entries, values we cannot explain. We will try to keep as many rows as possible and, where it makes sense, convert "unknown" into an "other" category, a catch-all for values that do not fit the existing categories.
sapply(bank, function(x) sum(x == "unknown"))
## age job marital education default balance housing loan
## 0 288 0 1857 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 13020 0 0 0 0 0 0 36959
## y
## 0
bank$contact <- ifelse(bank$contact == "unknown", 'other', bank$contact)
As shown in the code and output above, we convert the "unknown" values in the contact column into "other". The contact column is the only one treated this way: its number of unknowns is far larger than the meager counts in job and education, and predicting the unknown values of job and education is impractical given the large number of possible options, so those rows will simply be removed later. The poutcome column is a little trickier and needs to be cross-referenced with other columns.
head(table(bank$previous))
##
## 0 1 2 3 4 5
## 36954 2772 2106 1142 714 459
head(table(bank$pdays))
##
## -1 1 2 3 4 5
## 36954 15 37 1 2 11
head(table(bank$poutcome))
##
## failure other success unknown
## 4901 1840 1511 36959
# pcontact flags whether the client was contacted in a previous campaign (pdays == -1 means never contacted)
bank$pcontact <- ifelse(bank$pdays == -1, 0, 1)
# prev_camp folds poutcome into four values: nonexistent, failure, success and unknown
bank$prev_camp <- ifelse(bank$pcontact == 0, 'nonexistent', bank$poutcome)
# treat the 'other' outcome of the previous campaign as unknown
bank$prev_camp[bank$poutcome == 'other'] <- 'unknown'
table(bank$prev_camp, bank$y)
##
## no yes
## failure 4283 618
## nonexistent 33570 3384
## success 533 978
## unknown 1536 309
table(bank$y)
##
## no yes
## 39922 5289
As the code above shows, several variables describing the client's previous marketing contacts, namely pdays, previous, and poutcome, are strongly related to one another and are best combined into a single variable. The count of -1 values in pdays equals the count of 0 values in previous (36,954), which makes pdays largely redundant, so we recode it into a binary pcontact flag that carries the same information. We then combine that flag with poutcome into a new prev_camp column with four values: nonexistent (no previous campaign result), failure (the customer declined the previous offer), success (the customer accepted the previous offer), and unknown. We will filter out the remaining unknowns and merge prev_camp with pcontact later. We also notice that the target variable is imbalanced, which we may mitigate by relying on the PR AUC metric.
# Initialize a year variable with the starting year
year <- 2008
# Create a vector to store the year for each entry
year_vec <- rep(0, nrow(bank))
# Loop over the dataset and increment the year each time you hit January 1st
for (i in 1:nrow(bank)) {
  if (i > 1 && bank$month[i] == "jan" && bank$month[i-1] == "dec") {
    year <- year + 1
  }
  year_vec[i] <- year
}
# Add the year vector to your data frame
bank$year <- year_vec
# Create a manual mapping from month abbreviations to numbers
month_mapping <- c(jan = "01", feb = "02", mar = "03", apr = "04", may = "05", jun = "06",
jul = "07", aug = "08", sep = "09", oct = "10", nov = "11", dec = "12")
bank$month_num <- month_mapping[bank$month]
bank$month_num <- as.numeric(bank$month_num)
# Create a combined date string, padding the month and day with zeros if necessary
bank$date_str <- with(bank, paste(year, month_num, sprintf("%02d", day), sep = "-"))
# Convert the combined string to a Date object
bank$date <- as.Date(bank$date_str, format="%Y-%m-%d")
bank$year_month <- with(bank, factor(paste(year, month_num, sep = "-")))
head(bank$date)
## [1] "2008-05-05" "2008-05-05" "2008-05-05" "2008-05-05" "2008-05-05"
## [6] "2008-05-05"
head(bank$year_month)
## [1] 2008-5 2008-5 2008-5 2008-5 2008-5 2008-5
## 30 Levels: 2008-10 2008-11 2008-12 2008-5 2008-6 2008-7 2008-8 ... 2010-9
Before we move on to the dummy and ordinal variables, we create a year variable to add another layer of time information beyond day and month. We also build date and year_month columns so that we can split the data by date or by year_month should we need to.
bank1 <- bank %>% filter(prev_camp != "unknown" , job != "unknown" , education != "unknown") %>% select(-c("duration", "pdays", "previous", "poutcome", "date_str" ))
# Combine 'contacted_before' and 'poutcome' into a single ordinal variable
bank1$pcontact_prevcamp <- ifelse(bank1$pcontact == 0, 0, # 0 for not contacted
ifelse(bank1$prev_camp == "failure", 1, # 1 failure
ifelse(bank1$prev_camp == "success", 2, NA)))
table(bank1$pcontact_prevcamp)
##
## 0 1 2
## 35281 4709 1424
bank1$default <- ifelse(bank1$default == "no", 0, 1)
bank1$housing <- ifelse(bank1$housing == "no", 0, 1)
bank1$loan <- ifelse(bank1$loan== "no", 0, 1)
We continue by removing the entries with unknown values and combining pcontact and prev_camp into a pcontact_prevcamp column, to which we assign ordinal values derived from prev_camp: 0 for not contacted (nonexistent), 1 for contacted but failed, and 2 for contacted and succeeded (since the unknown rows were already filtered out, no contacted customer is left without a failure or success outcome, so the nested ifelse produces no NAs). We also convert the binary columns default, housing, and loan into 0/1.
bank2 <- bank1 %>% select(-c("pcontact", "prev_camp"))
bank2$quarter <- with(bank2, factor(
ifelse(month %in% c("jan", "feb", "mar"), "Q1",
ifelse(month %in% c("apr", "may", "jun"), "Q2",
ifelse(month %in% c("jul", "aug", "sep"), "Q3",
"Q4")))
))
bank2$day_cat <- cut(bank2$day,
breaks = c(0, 10, 20, 31),
labels = c("beginning", "middle", "end"),
include.lowest = TRUE)
bank2$job <- ifelse(bank2$job == "blue-collar", "blue_collar", bank2$job)
bank2$job <- ifelse(bank2$job== "self-employed", "self_employed", bank2$job)
Next, we reduce the number of categories in the day and month variables: turning them into dummy variables as-is would be too granular and could make the model overfit easily. Hence, we group them into broader categories, quarters for month and beginning/middle/end of month for day. We also fix the job values that contain a hyphen so that they do not cause trouble when we create dummy variables from them.
bank3 <- dummy_cols(bank2 %>% select(-c("month", "day")),
                    select_columns = c('job', 'marital', 'education', 'contact', 'quarter', 'year', 'day_cat'),
                    remove_first_dummy = TRUE, remove_selected_columns = TRUE)
glimpse(bank3)
## Rows: 41,414
## Columns: 34
## $ age <int> 58, 44, 33, 35, 28, 42, 58, 43, 41, 29, 53, 57, 51…
## $ default <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ balance <int> 2143, 29, 2, 231, 447, 2, 121, 593, 270, 390, 6, 1…
## $ housing <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ loan <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "n…
## $ month_num <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ date <date> 2008-05-05, 2008-05-05, 2008-05-05, 2008-05-05, 2…
## $ year_month <fct> 2008-5, 2008-5, 2008-5, 2008-5, 2008-5, 2008-5, 20…
## $ pcontact_prevcamp <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ job_blue_collar <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,…
## $ job_entrepreneur <int> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ job_housemaid <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ job_management <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ job_retired <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,…
## $ job_self_employed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ job_services <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,…
## $ job_student <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ job_technician <int> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ job_unemployed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ marital_married <int> 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,…
## $ marital_single <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ education_secondary <int> 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1,…
## $ education_tertiary <int> 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ contact_other <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ contact_telephone <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ quarter_Q2 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ quarter_Q3 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ quarter_Q4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ year_2009 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ year_2010 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ day_cat_middle <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ day_cat_end <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
With that, the data preprocessing before the train/test split is done: the remaining categorical variables with more than two values have received dummy-variable treatment.
Next, we split the dataset into train and test sets so that we can evaluate the model on unseen data and detect overfitting, starting with a simple random split. We also apply a transformation to the numeric variables age, balance, and campaign.
set.seed(123)
train_index <- sample(seq_len(nrow(bank3)), size = 0.8*nrow(bank3))
train <- bank3[train_index, ]
test <- bank3[-train_index, ]
train1 <- train %>% select(-c("month_num", "year_month", "date"))
trainn <- train1 %>% select(c("age", "balance", "campaign"))
df_long <- gather(trainn, key = "Variable", value = "Value")
ggplot(df_long, aes(Value)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "black", alpha = 0.7) +
  facet_wrap(~Variable, scales = "free") +
  theme_minimal() +
  labs(title = "Histograms of Numerical Columns", x = "", y = "Count")
df_long <- pivot_longer(trainn, cols = everything(), names_to = "variable", values_to = "value")
p <- ggplot(df_long, aes(x = factor(1), y = value, fill = variable)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y", ncol = 4) +
  labs(title = "Boxplots of Variables", y = "Value")
print(p)
After looking at the boxplots and histograms of the numeric variables, we can see outliers and skewness in the distributions. We will keep the outliers, since they might carry important information, and instead apply a Yeo-Johnson transformation to make the distributions closer to normal and to rescale the numeric variables.
lambdas <- list()
yj_fits <- list()
for (col in c("age", "balance", "campaign")) {
  # Fit the Yeo-Johnson transformation on the training data only
  yj_transform <- yeojohnson(train1[[col]])
  train1[[col]] <- yj_transform$x.t
  lambdas[[col]] <- yj_transform$lambda
  yj_fits[[col]] <- yj_transform   # keep the fitted object so it can be reused on the test set
}
# Create a faceted boxplot of the transformed columns
# (re-select from train1, since trainn still holds the untransformed copy)
df_long <- pivot_longer(train1 %>% select(age, balance, campaign),
                        cols = everything(), names_to = "variable", values_to = "value")
p <- ggplot(df_long, aes(x = factor(1), y = value, fill = variable)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y", ncol = 4) +
  labs(title = "Boxplots of Variables (after Yeo-Johnson)", y = "Value")
# Print the plot
print(p)
After transforming the numeric variables and storing the fitted transformations (and their lambdas) for later use, we can see that the distributions are more centered and the outliers and scaling issues are tamed to some extent. Next, we apply the Yeo-Johnson transformation to the test dataset using the parameters obtained from the training data.
test1 <- test %>% select(-c("month_num", "year_month", "date"))
for (col in c("age", "balance", "campaign")) {
  # Apply the transformation fitted on the training data (with its stored lambda) to the test column
  test1[[col]] <- predict(yj_fits[[col]], newdata = test1[[col]])
}
Since our target variable is imbalanced at roughly a 1:10 ratio, we need classification metrics beyond the usual accuracy; we will cover the confusion matrix, ROC AUC, and PR AUC.
In the case of our bank marketing example, the positive class is a customer who says "yes" to the offer and the negative class is a customer who says "no".

Short introduction of the terminologies:

- True Positive (TP): the model predicts "yes" and the customer does subscribe.
- True Negative (TN): the model predicts "no" and the customer does not subscribe.
- False Positive (FP): the model predicts "yes" but the customer does not subscribe.
- False Negative (FN): the model predicts "no" but the customer would have subscribed.

Next, we introduce the metrics that can be derived from these terminologies:

- Accuracy: the share of all predictions that are correct, (TP + TN) / (TP + TN + FP + FN).
- Precision: of the customers predicted as "yes", how many actually subscribe, TP / (TP + FP).
- Recall (Sensitivity): of the customers who actually subscribe, how many the model catches, TP / (TP + FN).
- F1 Score: the harmonic mean of precision and recall.
The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate (1 - Specificity) for different threshold settings. The ROC AUC is the area under this curve and represents the likelihood that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.
The Precision-Recall AUC provides a comprehensive measure of a model’s performance across different thresholds, focusing on the positive (minority) class. It’s particularly useful in imbalanced datasets. The PR curve plots Precision (y-axis) against Recall (x-axis) for various threshold settings. The area under this curve (AUC) represents the model’s ability to correctly classify the positive class.
For more about ROC AUC and PR AUC, you can follow the link.
TL;DR: if we mainly care about the minority positive class, we use PR AUC; if we want balanced coverage of both classes (true positives and true negatives), we use ROC AUC.
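To make this concrete, below is a minimal sketch of how these metrics could be computed in R from a vector of predicted probabilities, assuming the caret, pROC, and PRROC packages are installed; the helper name get_metrics and the 0.5 cutoff are purely illustrative, not the exact code behind the results in this article.

# Illustrative helper: confusion-matrix metrics plus ROC AUC and PR AUC,
# given predicted probabilities of the positive ("yes") class and the true labels
get_metrics <- function(probs, truth, cutoff = 0.5) {
  truth <- factor(truth, levels = c("no", "yes"))
  pred <- factor(ifelse(probs >= cutoff, "yes", "no"), levels = c("no", "yes"))
  cm <- caret::confusionMatrix(pred, truth, positive = "yes")
  roc_obj <- pROC::roc(truth, probs, levels = c("no", "yes"), direction = "<", quiet = TRUE)
  pr_obj <- PRROC::pr.curve(scores.class0 = probs[truth == "yes"],
                            scores.class1 = probs[truth == "no"])
  list(Accuracy  = unname(cm$overall["Accuracy"]),
       Precision = unname(cm$byClass["Pos Pred Value"]),
       Recall    = unname(cm$byClass["Sensitivity"]),
       F1        = unname(cm$byClass["F1"]),
       ROC_AUC   = as.numeric(pROC::auc(roc_obj)),
       PR_AUC    = pr_obj$auc.integral)
}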
After this short explanation of our metrics, we train and evaluate four different algorithms (logistic regression, naive Bayes, decision tree, and random forest) on the train and test sets:
## Model Accuracy Precision Recall F1Score ROCAUC PRAUC
## 1 Logistic 0.3087000 0.9448600 0.2371100 0.3790900 0.5886 0.1964400
## 2 Naive 0.5284300 0.9255900 0.5112600 0.6586900 0.5635 0.3436500
## 3 Decision Tree 0.8972594 0.9069013 0.9857569 0.9446864 0.5834 0.8669704
## 4 Random Forest 0.8952071 0.9017791 0.9900977 0.9438769 0.7495 0.8165913
From these results, we could choose either the random forest or the decision tree, since both perform well on PR AUC; however, the decision tree's ROC AUC is almost as low as the logistic regression model's, which is why we select the random forest as our baseline even with a slightly lower PR AUC (random forest is also quite robust against outliers). Next, we will try to improve its performance.
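For reference, here is a sketch of how the four models might be fit on the random split and scored with the get_metrics helper sketched earlier; it assumes the e1071, rpart, and randomForest packages and is only an approximation of the workflow behind the comparison table above, not the author's original code.

# Make the target a factor with "yes" as the positive (second) level
train1$y <- factor(train1$y, levels = c("no", "yes"))
test1$y <- factor(test1$y, levels = c("no", "yes"))

# Logistic regression
logit_fit <- glm(y ~ ., data = train1, family = binomial)
logit_probs <- predict(logit_fit, newdata = test1, type = "response")

# Naive Bayes
nb_fit <- e1071::naiveBayes(y ~ ., data = train1)
nb_probs <- predict(nb_fit, newdata = test1, type = "raw")[, "yes"]

# Decision tree
dt_fit <- rpart::rpart(y ~ ., data = train1, method = "class")
dt_probs <- predict(dt_fit, newdata = test1, type = "prob")[, "yes"]

# Random forest
set.seed(123)
rf_fit <- randomForest::randomForest(y ~ ., data = train1)
rf_probs <- predict(rf_fit, newdata = test1, type = "prob")[, "yes"]

# Collect the metrics of each model into one comparison table
rbind(Logistic     = unlist(get_metrics(logit_probs, test1$y)),
      Naive        = unlist(get_metrics(nb_probs, test1$y)),
      DecisionTree = unlist(get_metrics(dt_probs, test1$y)),
      RandomForest = unlist(get_metrics(rf_probs, test1$y)))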
After picking our model, we will experiment with two ways to increase its performance: a stratified split and feature importance/selection.
# Create a list of indices for stratified sampling based on 'year_month'
set.seed(123) # for reproducibility
trainIndex <- createDataPartition(bank3$year_month, p = 0.8, list = TRUE, times = 1)
# Create stratified training and test sets
train2 <- bank3[trainIndex[[1]], ]
test2 <- bank3[-trainIndex[[1]], ]
table(train2$y)
##
## no yes
## 29356 3789
table(test2$y)
##
## no yes
## 7332 937
We created a stratified split based on the year_month column we built earlier, so that the train and test sets keep similar time-period (and hence target) proportions, and we will see whether this has an effect or not.
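A sketch of how the stratified split could be run through the same pipeline follows; the names train2p and test2p and the reuse of the get_metrics helper are illustrative, not the author's exact code.

# Re-apply the same preparation to the stratified split
train2p <- train2 %>% select(-c("month_num", "year_month", "date"))
test2p <- test2 %>% select(-c("month_num", "year_month", "date"))
for (col in c("age", "balance", "campaign")) {
  yj <- yeojohnson(train2p[[col]])          # fit on the stratified training data
  train2p[[col]] <- yj$x.t
  test2p[[col]] <- predict(yj, newdata = test2p[[col]])
}
train2p$y <- factor(train2p$y, levels = c("no", "yes"))
test2p$y <- factor(test2p$y, levels = c("no", "yes"))

# Refit the random forest on the stratified split and score it
set.seed(123)
rf_strat <- randomForest::randomForest(y ~ ., data = train2p)
rf_strat_probs <- predict(rf_strat, newdata = test2p, type = "prob")[, "yes"]
get_metrics(rf_strat_probs, test2p$y)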
## [1] "Random Forest Metrics:"
## $Accuracy
## Accuracy
## 0.8968436
##
## $Precision
## Pos Pred Value
## 0.9066148
##
## $Recall
## Sensitivity
## 0.9851337
##
## $F1
## F1
## 0.9442447
##
## $ROC_AUC
## Area under the curve: 0.7773
##
## $PR_AUC
## [1] 0.8142559
Since the PR AUC of the stratified split is lower than that of the random split, we will keep the random split (even though the ROC AUC improved thanks to the better proportions of the target variable). Next, we will use feature importance to see how important each feature is to the model.
## MeanDecreaseGini
## age 668.55269
## default 15.84238
## balance 799.56472
## housing 122.85407
## loan 66.69304
## campaign 264.09397
## pcontact_prevcamp 502.93568
## job_blue_collar 55.48251
## job_entrepreneur 29.65525
## job_housemaid 25.29230
## job_management 65.20129
## job_retired 37.09474
## job_self_employed 38.18451
## job_services 45.23935
## job_student 30.27883
## job_technician 73.39669
## job_unemployed 37.63598
## marital_married 70.62602
## marital_single 62.14352
## education_secondary 77.72464
## education_tertiary 69.21734
## contact_other 96.98371
## contact_telephone 54.75534
## quarter_Q2 91.27078
## quarter_Q3 87.83489
## quarter_Q4 90.07173
## year_2009 147.08821
## year_2010 394.26397
## day_cat_middle 115.80890
## day_cat_end 97.01929
The Mean Decrease in Gini is a measure of feature importance used in random forests and other tree-based models. It is calculated for each feature and is based on how much that feature decreases the impurity of the splits it is involved in. Gini impurity measures the disorder of a set, with lower values indicating purer subsets. When a feature is used in a split, it helps partition the data into purer subsets, and the decrease in impurity from that split is accumulated for the feature over all the trees in the forest. The higher the Mean Decrease in Gini, the more important the feature is considered to be in predicting the target variable. However, a high Gini importance does not always mean the feature is beneficial for the model's predictive power, since it can also be a symptom of overfitting or multicollinearity.
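For context, importance scores like the table above come straight out of a fitted forest; with the randomForest package this is a one-liner (rf_fit here refers to the forest from the earlier sketch, so this is illustrative rather than the author's exact code).

# MeanDecreaseGini for every predictor in the fitted random forest
randomForest::importance(rf_fit)
# The same information as a ranked dot chart
randomForest::varImpPlot(rf_fit, main = "Random forest variable importance")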
## [1] "Random Forest Metrics:"
## $Accuracy
## Accuracy
## 0.8944827
##
## $Precision
## Pos Pred Value
## 0.8999261
##
## $Recall
## Sensitivity
## 0.9917254
##
## $F1
## F1
## 0.9435983
##
## $ROC_AUC
## Area under the curve: 0.6674
##
## $PR_AUC
## [1] 0.8427744
## Metric Random.RF Stratified.RF Feat.selection...Tuned.Random.RF
## 1 Accuracy 0.8952071 0.8968436 0.8944827
## 2 Precision 0.9017791 0.9066148 0.8999261
## 3 Recall 0.9900977 0.9851337 0.9917254
## 4 F1 Score 0.9438769 0.9442447 0.9435983
## 5 ROC AUC 0.7495000 0.7773000 0.6674000
## 6 PR AUC 0.8165913 0.8142559 0.8427744
An example of this trade-off is removing some features and comparing the result against the base model: by removing all time-related variables, we give up roughly 8 points of ROC AUC in exchange for a gain of a couple of points of PR AUC. By trading one metric for another in this way, ROC AUC for PR AUC, or a little precision for recall, we can choose which metric to improve.
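As a sketch of that experiment (again assuming the objects from the earlier sketches, and leaving aside any extra tuning the final model may have had), the engineered time features can be dropped before refitting the forest:

# Drop the engineered time-related dummies from the random split
train_notime <- train1 %>% select(-starts_with("quarter_"), -starts_with("year_"), -starts_with("day_cat_"))
test_notime <- test1 %>% select(-starts_with("quarter_"), -starts_with("year_"), -starts_with("day_cat_"))

# Refit the random forest without the time features and score it as before
set.seed(123)
rf_notime <- randomForest::randomForest(y ~ ., data = train_notime)
rf_notime_probs <- predict(rf_notime, newdata = test_notime, type = "prob")[, "yes"]
get_metrics(rf_notime_probs, test_notime$y)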
## MeanDecreaseGini
## age 541.75707
## default 13.32093
## balance 625.66141
## housing 100.83242
## loan 50.61403
## campaign 204.18161
## pcontact_prevcamp 616.29517
## job_blue_collar 37.56741
## job_entrepreneur 25.13782
## job_housemaid 20.37506
## job_management 40.29184
## job_retired 29.19151
## job_self_employed 28.24834
## job_services 29.78643
## job_student 25.83862
## job_technician 45.12576
## job_unemployed 25.78485
## marital_married 44.97470
## marital_single 42.94501
## education_secondary 47.66345
## education_tertiary 43.17801
## contact_other 105.85719
## contact_telephone 38.20522
We can also conclude that age, balance, and the result of the previous campaign (pcontact_prevcamp) strongly affect the outcome of this marketing campaign, with the number of contacts in the current campaign, housing loans, and the contact_other channel (clients whose contact type was originally unknown) trailing not far behind.
We can conclude that predicting the right customer to target with the right marketing strategy is crucial for minimizing the resources spent and seizing every opportunity. Banks can now decide which metric they want to focus on, asking questions such as "should I focus on both classes instead of only the positive class?" or "which is costlier, a false negative or a false positive?". We can also improve the model by: