1. Introduction

## Rows: 45,211
## Columns: 17
## $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job       <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital   <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing   <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan      <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

Marketing is one of the most important functions in any business, including banking, because it is what draws customers toward buying products. In the banking industry, marketing and sales are core forces for promoting bank products so that banks can gather the funds they need for their operations. But, as we all know, not every marketing campaign succeeds in attracting customers; in fact, most campaigns fail far more often than they succeed, and banks have always struggled to direct their resources toward the right customers. But what if banks could know which customers are most likely to buy their products even before running a marketing campaign?

With the power of machine learning we can actually do that: by analyzing various characteristics (variables) of the customers, we can learn which of them are most likely to buy a bank's products when contacted during a marketing campaign, so that banks can focus their resources more efficiently. In this article, we demonstrate how to predict which customers are most likely to say yes to a financial product after being contacted by marketing, by building a model that is trained and tested on a Portuguese bank marketing dataset. The dataset contains the following client variables:

  1. age : client’s age
  2. job : type of job
  3. marital : marital status
  4. education : education level
  5. default: has credit in default? (binary: “yes”,“no”)
  6. balance: average yearly balance, in euros (numeric)
  7. housing: has housing loan? (binary: “yes”,“no”)
  8. loan: has personal loan? (binary: “yes”,“no”)
  9. contact: contact communication type
  10. day: last contact day of the month (numeric)
  11. month: last contact month of year (categorical)
  12. duration: last contact duration, in seconds (numeric), Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  13. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  14. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
  15. previous: number of contacts performed before this campaign and for this client (numeric)
  16. poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)
  17. y: has the client subscribed a term deposit? (binary: “yes”,“no”)

Broadly, the variables capture these aspects of the client in relation to campaign success:

  • the client's personal profile (from age through education)
  • the client's financial history
  • the client's last contact with a bank representative
  • the client's history with previous bank marketing offers

2. Data Preprocessing

In the first step of our data preprocessing, we check for duplicated entries and missing values below; neither exists in this dataset.

colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0
duplicates <- duplicated(bank)
sum(duplicates)
## [1] 0

However, there may be other values that should be treated as missing: values recorded as "unknown", i.e., values we cannot explain. We will try to keep as many observations as possible, so where it is reasonable we convert "unknown" into "other", a catch-all category for values that do not fit the existing categories.

sapply(bank, function(x) sum(x == "unknown"))
##       age       job   marital education   default   balance   housing      loan 
##         0       288         0      1857         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##     13020         0         0         0         0         0         0     36959 
##         y 
##         0
bank$contact <- ifelse(bank$contact == "unknown", 'other', bank$contact)

As shown in the code and output above, we convert the "unknown" values in the contact column into "other". contact is the only column treated this way: its count of "unknown" values is far larger than the comparatively small counts in job and education, and the "unknown" values in job and education are practically impossible to impute given the large number of possible categories, so those rows will simply be dropped later. The poutcome column is a little trickier and needs to be cross-referenced with other columns.

head(table(bank$previous))
## 
##     0     1     2     3     4     5 
## 36954  2772  2106  1142   714   459
head(table(bank$pdays))
## 
##    -1     1     2     3     4     5 
## 36954    15    37     1     2    11
head(table(bank$poutcome))
## 
## failure   other success unknown 
##    4901    1840    1511   36959
# Making a new pcontact column in order to convert the unknown value into non-existent
bank$pcontact <- ifelse(bank$pdays == -1, 0, 1)
bank$prev_camp <- ifelse(bank$pcontact == 0, 'nonexistent', bank$poutcome)
bank$prev_camp[bank$poutcome == 'other'] <- 'unknown'
table(bank$prev_camp, bank$y)
##              
##                  no   yes
##   failure      4283   618
##   nonexistent 33570  3384
##   success       533   978
##   unknown      1536   309
table(bank$y)
## 
##    no   yes 
## 39922  5289

As the code above shows, some of the variables describing the client's previous marketing history, namely pdays, previous and poutcome, are closely related and are best combined into a single variable. The count of 0s in previous equals the count of -1s in pdays (36,954), which makes pdays largely redundant, so we recode pdays into a binary pcontact flag that carries the same information. We then combine pcontact and poutcome into a new column called prev_camp with four values: nonexistent (no previous campaign), failure (the customer did not accept the previous offer), success (the customer accepted the previous offer), and unknown. We will filter out the remaining unknowns and merge prev_camp with pcontact later. We also notice that our target variable is imbalanced, which we can mitigate by choosing the PR AUC metric.

# Initialize a year variable with the starting year
year <- 2008

# Create a vector to store the year for each entry
year_vec <- rep(0, nrow(bank))

# Loop over the dataset and increment the year each time you hit January 1st
for (i in 1:nrow(bank)) {
  if (i > 1 && bank$month[i] == "jan" && bank$month[i-1] == "dec") {
    year <- year + 1
  }
  year_vec[i] <- year
}

# Add the year vector to your data frame
bank$year <- year_vec

# Create a manual mapping from month abbreviations to numbers
month_mapping <- c(jan = "01", feb = "02", mar = "03", apr = "04", may = "05", jun = "06", 
                   jul = "07", aug = "08", sep = "09", oct = "10", nov = "11", dec = "12") 

bank$month_num <- month_mapping[bank$month]
bank$month_num <- as.numeric(bank$month_num)

# Create a combined date string, padding the month and day with zeros if necessary
bank$date_str <- with(bank, paste(year, month_num, sprintf("%02d", day), sep = "-"))

# Convert the combined string to a Date object
bank$date <- as.Date(bank$date_str, format="%Y-%m-%d")
bank$year_month <- with(bank, factor(paste(year, month_num, sep = "-")))

head(bank$date)
## [1] "2008-05-05" "2008-05-05" "2008-05-05" "2008-05-05" "2008-05-05"
## [6] "2008-05-05"
head(bank$year_month)
## [1] 2008-5 2008-5 2008-5 2008-5 2008-5 2008-5
## 30 Levels: 2008-10 2008-11 2008-12 2008-5 2008-6 2008-7 2008-8 ... 2010-9

Before we move on to the dummy and ordinal variables, we create a year variable to add another layer of time information beyond day and month. We also create date and year_month columns so that we can split the data by date or by year_month should we need to.

bank1 <- bank %>% filter(prev_camp != "unknown" , job != "unknown" , education != "unknown") %>% select(-c("duration", "pdays", "previous", "poutcome", "date_str" )) 

# Combine 'contacted_before' and 'poutcome' into a single ordinal variable
bank1$pcontact_prevcamp <- ifelse(bank1$pcontact == 0, 0, # 0 for not contacted
                                  ifelse(bank1$prev_camp == "failure", 1, # 1 failure
                                         ifelse(bank1$prev_camp == "success", 2, NA))) 
table(bank1$pcontact_prevcamp)
## 
##     0     1     2 
## 35281  4709  1424
bank1$default <- ifelse(bank1$default == "no", 0, 1)
bank1$housing <- ifelse(bank1$housing == "no", 0, 1)
bank1$loan <- ifelse(bank1$loan== "no", 0, 1)

Continuing, we remove the entries with unknown values and combine pcontact and prev_camp into a new pcontact_prevcamp column, assigning ordinal values derived from prev_camp: 0 for not contacted (nonexistent), 1 for contacted but failed, and 2 for contacted and succeeded (conveniently, no customer who was never contacted carries a failure or success value). We also convert the columns with binary values into 0s and 1s.

bank2 <- bank1 %>% select(-c("pcontact", "prev_camp"))
bank2$quarter <- with(bank2, factor(
  ifelse(month %in% c("jan", "feb", "mar"), "Q1",
    ifelse(month %in% c("apr", "may", "jun"), "Q2",
      ifelse(month %in% c("jul", "aug", "sep"), "Q3",
             "Q4")))
))

bank2$day_cat <- cut(bank2$day,
                       breaks = c(0, 10, 20, 31),
                       labels = c("beginning", "middle", "end"),
                       include.lowest = TRUE)

bank2$job <- ifelse(bank2$job == "blue-collar", "blue_collar", bank2$job)
bank2$job <- ifelse(bank2$job== "self-employed", "self_employed", bank2$job)

Next, we reduce the number of categories in the day and month variables, since dummy-encoding them as they are would be too granular and could make the model overfit easily. We therefore group them into broader categories such as quarters and parts of the month. We also fix the job values that contain a hyphen so they do not cause trouble when we create dummy variables for them.

bank3 <- dummy_cols( bank2 %>% select(-c("month", "day")), select_columns = c('job', 'marital', 'education', 'contact', 'quarter', 'year', 'day_cat'), 
                         remove_first_dummy = TRUE, remove_selected_columns = TRUE)

glimpse(bank3)
## Rows: 41,414
## Columns: 34
## $ age                 <int> 58, 44, 33, 35, 28, 42, 58, 43, 41, 29, 53, 57, 51…
## $ default             <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ balance             <int> 2143, 29, 2, 231, 447, 2, 121, 593, 270, 390, 6, 1…
## $ housing             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ loan                <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ campaign            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ y                   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "n…
## $ month_num           <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ date                <date> 2008-05-05, 2008-05-05, 2008-05-05, 2008-05-05, 2…
## $ year_month          <fct> 2008-5, 2008-5, 2008-5, 2008-5, 2008-5, 2008-5, 20…
## $ pcontact_prevcamp   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ job_blue_collar     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,…
## $ job_entrepreneur    <int> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ job_housemaid       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ job_management      <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ job_retired         <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,…
## $ job_self_employed   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ job_services        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,…
## $ job_student         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ job_technician      <int> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ job_unemployed      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ marital_married     <int> 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,…
## $ marital_single      <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ education_secondary <int> 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1,…
## $ education_tertiary  <int> 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ contact_other       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ contact_telephone   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ quarter_Q2          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ quarter_Q3          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ quarter_Q4          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ year_2009           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ year_2010           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ day_cat_middle      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ day_cat_end         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

With that, the preprocessing before the train-test split is done: the remaining categorical variables with more than two values have received dummy-variable treatment.

3. Train Test split & Transformations

We now split the dataset into train and test sets so that we can evaluate the model on unseen data and detect overfitting, starting with a simple random split. We also apply a transformation to the numeric variables age, balance and campaign.

set.seed(123)
train_index <- sample(seq_len(nrow(bank3)), size = 0.8*nrow(bank3))
train <- bank3[train_index, ]
test <- bank3[-train_index, ]
train1 <- train %>% select(-c("month_num", "year_month", "date"))
trainn <- train1 %>% select(c("age", "balance", "campaign"))

df_long <- gather(trainn, key="Variable", value="Value")

ggplot(df_long, aes(Value)) +
  geom_histogram(bins=50, fill="steelblue", color="black", alpha=0.7) +
  facet_wrap(~Variable, scales="free") +
  theme_minimal() +
  labs(title="Histograms of Numerical Columns", x="", y="Count")

df_long <- pivot_longer(trainn, cols = everything(), names_to = "variable", values_to = "value")

p <- ggplot(df_long, aes(x = factor(1), y = value, fill = variable)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y", ncol = 4) +  
  labs(title = "Boxplots of Variables", y = "Value")

print(p)

After looking at the boxplots and histograms of the numeric variables, we can see outliers and skewed distributions. We will keep the outliers, since they may carry important information, and instead apply the Yeo-Johnson transformation to reduce the skew and rescale the numeric variables.
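
For reference, the Yeo-Johnson transformation applied below is the standard one (with lambda estimated from the training data):

$$
\psi(y;\lambda)=
\begin{cases}
\dfrac{(y+1)^{\lambda}-1}{\lambda} & \lambda \neq 0,\ y \ge 0\\
\log(y+1) & \lambda = 0,\ y \ge 0\\
-\dfrac{(1-y)^{2-\lambda}-1}{2-\lambda} & \lambda \neq 2,\ y < 0\\
-\log(1-y) & \lambda = 2,\ y < 0
\end{cases}
$$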

lambdas <- list()

for (col in c("age", "balance", "campaign")) {
    yj_transform <- yeojohnson(train1[[col]])
    train1[[col]] <- yj_transform$x.t
    lambdas[[col]] <- yj_transform$lambda
}

# Create a faceted boxplot of the transformed columns (trainn was copied before the
# transformation, so we re-select the now-transformed columns from train1)
df_long <- pivot_longer(train1 %>% select(age, balance, campaign), cols = everything(), names_to = "variable", values_to = "value")

p <- ggplot(df_long, aes(x = factor(1), y = value, fill = variable)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y", ncol = 4) +  # Adjust 'ncol' as per your preference
  labs(title = "Boxplots of Variables", y = "Value")

# Print the plot
print(p)

After transforming the numeric variables and storing the lambdas for later use on the test set, we can see that the distributions are more centered and the effect of the outliers has been reduced, with the variables rescaled to some extent. Next, we apply the Yeo-Johnson transformation to the test set using the lambdas obtained from the training transformation.

test1 <- test %>% select(-c("month_num", "year_month", "date"))
for (col in c("age", "balance", "campaign")) {
    # Use the lambda stored from the training transformation for this column
    lambda_value <- lambdas[[col]]
    
    # Apply the Yeo-Johnson transformation to the test column with that lambda
    yj_transform_test <- yeojohnson(test1[[col]], lambda = lambda_value)
    
    test1[[col]] <- yj_transform_test$x.t
}
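
An alternative worth noting, as a sketch only (assuming the yeojohnson() used here comes from the bestNormalize package): rather than storing just the lambdas, we could keep the fitted transformer objects from the training loop and apply them to the test columns with predict(), which reuses all of the training parameters instead of re-estimating anything on the test data.

yj_transformers <- list()

for (col in c("age", "balance", "campaign")) {
    # Fit the transformation on the training data only and keep the whole object
    yj_transformers[[col]] <- yeojohnson(train1[[col]])
    train1[[col]] <- yj_transformers[[col]]$x.t
}

for (col in c("age", "balance", "campaign")) {
    # predict() applies the stored Yeo-Johnson parameters to unseen data
    test1[[col]] <- predict(yj_transformers[[col]], newdata = test1[[col]])
}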

4. Metrics and Models

Since our target variable is imbalanced (roughly one "yes" for every eight "no" responses), we need classification metrics beyond the usual accuracy; we will cover the confusion matrix, ROC AUC and PR AUC.

In the case of our bank marketing example:

  • True Positives: we correctly predict a customer who will accept our marketing offer
  • False Positives: we misidentify a customer who will reject our offer as one who will accept it. This may cause banks to waste resources on the wrong customers (misdirected effort)
  • False Negatives: we misidentify a customer who will accept our offer as one who will reject it. This may cause banks to miss potential revenue or opportunities (missed potential)
  • True Negatives: we correctly predict a customer who will reject our marketing offer

Short introduction of terminologies:

  • Accuracy: The proportion of total correct predictions (both true positives and true negatives) to the total predictions.
  • Precision (Positive Predictive Value): The proportion of true positive predictions to the total predicted positives. It answers, “Of all the instances predicted as positive, how many are actually positive?”
  • Recall (Sensitivity or True Positive Rate): The proportion of true positives to the actual positives in the data. It answers, “Of all the actual positives, how many did we correctly predict as positive?”
  • F1 Score: The harmonic mean of precision and recall. It is a balance between precision and recall.
  • Specificity (True Negative Rate): The proportion of true negatives to the actual negatives in the data. It answers, “Of all the actual negatives, how many did we correctly predict as negative?”
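
As a small, hedged illustration (the pred and truth vectors below are placeholders, not predictions from the models in this article), all of these quantities can be read off a confusion matrix, for example with caret:

library(caret)

# Hypothetical predicted classes and true labels (factors with the same levels)
pred  <- factor(c("yes", "no", "no", "yes", "no"), levels = c("no", "yes"))
truth <- factor(c("yes", "no", "yes", "no", "no"), levels = c("no", "yes"))

# "yes" is treated as the positive class; mode = "prec_recall" reports
# Precision, Recall and F1 alongside Accuracy
cm <- confusionMatrix(data = pred, reference = truth,
                      positive = "yes", mode = "prec_recall")
cm$overall["Accuracy"]
cm$byClass[c("Precision", "Recall", "F1", "Specificity")]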

Next, we introduce the metrics that can be derived from these quantities:

  1. ROC AUC (Receiver Operating Characteristic Area Under Curve):

The ROC AUC is a plot of the True Positive Rate (Recall) against the False Positive Rate (1 - Specificity) for different threshold settings. The AUC represents the likelihood that a randomly chosen positive instance is correctly ranked higher than a randomly chosen negative instance.

  2. PR AUC (Precision-Recall Area Under Curve):

The Precision-Recall AUC provides a comprehensive measure of a model’s performance across different thresholds, focusing on the positive (minority) class. It’s particularly useful in imbalanced datasets. The PR curve plots Precision (y-axis) against Recall (x-axis) for various threshold settings. The area under this curve (AUC) represents the model’s ability to correctly classify the positive class.

For more about ROC AUC and PR AUC, you can follow the link.

TL;DR: if we want to focus only on the positive (minority) class, we use PR AUC; if we want balanced coverage of both classes (true positives and true negatives), we use ROC AUC.
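
A minimal sketch of computing both AUCs from predicted positive-class probabilities, assuming the pROC and PRROC packages are available (the probs and labels vectors are placeholders, not the article's actual predictions):

library(pROC)
library(PRROC)

# Placeholder predicted probabilities for the positive class and true labels
probs  <- c(0.9, 0.2, 0.65, 0.1, 0.8, 0.3)
labels <- c(1,   0,   1,    0,   0,   1)   # 1 = "yes", 0 = "no"

# ROC AUC: how well positives are ranked above negatives across thresholds
roc_obj <- roc(response = labels, predictor = probs)
auc(roc_obj)

# PR AUC: focuses on precision/recall for the positive (minority) class
pr_obj <- pr.curve(scores.class0 = probs[labels == 1],   # positive-class scores
                   scores.class1 = probs[labels == 0])   # negative-class scores
pr_obj$auc.integral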

After this short explanation of our metrics, we train and evaluate four different algorithms on our train and test sets (logistic regression, naive Bayes, decision tree and random forest):

##           Model  Accuracy Precision    Recall   F1Score ROCAUC     PRAUC
## 1      Logistic 0.3087000 0.9448600 0.2371100 0.3790900 0.5886 0.1964400
## 2         Naive 0.5284300 0.9255900 0.5112600 0.6586900 0.5635 0.3436500
## 3 Decision Tree 0.8972594 0.9069013 0.9857569 0.9446864 0.5834 0.8669704
## 4 Random Forest 0.8952071 0.9017791 0.9900977 0.9438769 0.7495 0.8165913
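
The model-fitting code is not shown above; the following is a hedged sketch of how such a comparison could be set up (using glm, e1071::naiveBayes, rpart and randomForest; the object names are assumptions, not the author's exact code):

library(e1071)         # naive Bayes
library(rpart)         # decision tree
library(randomForest)  # random forest

# Make the target a factor so the classifiers treat this as classification
train1$y <- factor(train1$y, levels = c("no", "yes"))
test1$y  <- factor(test1$y,  levels = c("no", "yes"))

# Fit the four candidate models on the training set
fit_log <- glm(y ~ ., data = train1, family = binomial)
fit_nb  <- naiveBayes(y ~ ., data = train1)
fit_dt  <- rpart(y ~ ., data = train1, method = "class")
fit_rf  <- randomForest(y ~ ., data = train1)

# Predicted probability of "yes" on the test set for each model
p_log <- predict(fit_log, newdata = test1, type = "response")
p_nb  <- predict(fit_nb,  newdata = test1, type = "raw")[, "yes"]
p_dt  <- predict(fit_dt,  newdata = test1, type = "prob")[, "yes"]
p_rf  <- predict(fit_rf,  newdata = test1, type = "prob")[, "yes"]

# These probabilities can then be thresholded (e.g. at 0.5) for the confusion
# matrix metrics and fed to the ROC AUC / PR AUC computations shown earlier.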

From these results, we can choose either the random forest or the decision tree, since both perform well on PR AUC; however, the decision tree's ROC AUC is almost as low as the logistic regression model's, which is why we select the random forest as our baseline even though its PR AUC is slightly lower (random forest is also fairly robust to outliers). Next, we will try to improve the model's performance.

5. Stratified Split & Feature Selection

After picking our model, we experiment with two ways to improve its performance: a stratified split and feature selection.

# Create a list of indices for stratified sampling based on 'year_month'
set.seed(123)  # for reproducibility
trainIndex <- createDataPartition(bank3$year_month, p = 0.8, list = TRUE, times = 1)

# Create stratified training and test sets
train2 <- bank3[trainIndex[[1]], ]
test2 <- bank3[-trainIndex[[1]], ]

table(train2$y)
## 
##    no   yes 
## 29356  3789
table(test2$y)
## 
##   no  yes 
## 7332  937

We created a stratified split based on the year_month column we built earlier, so that the train and test sets cover the time periods in the same proportions (and, as the tables show, end up with similar target proportions); we will see whether this has any effect.

## [1] "Random Forest Metrics:"
## $Accuracy
##  Accuracy 
## 0.8968436 
## 
## $Precision
## Pos Pred Value 
##      0.9066148 
## 
## $Recall
## Sensitivity 
##   0.9851337 
## 
## $F1
##        F1 
## 0.9442447 
## 
## $ROC_AUC
## Area under the curve: 0.7773
## 
## $PR_AUC
## [1] 0.8142559

Since the PR AUC of the stratified split is lower than that of the random split, we will keep the random split (even though the ROC AUC improved thanks to the better-matched target proportions). Next, we will use feature importance to see how important each feature is in the model.

##                     MeanDecreaseGini
## age                        668.55269
## default                     15.84238
## balance                    799.56472
## housing                    122.85407
## loan                        66.69304
## campaign                   264.09397
## pcontact_prevcamp          502.93568
## job_blue_collar             55.48251
## job_entrepreneur            29.65525
## job_housemaid               25.29230
## job_management              65.20129
## job_retired                 37.09474
## job_self_employed           38.18451
## job_services                45.23935
## job_student                 30.27883
## job_technician              73.39669
## job_unemployed              37.63598
## marital_married             70.62602
## marital_single              62.14352
## education_secondary         77.72464
## education_tertiary          69.21734
## contact_other               96.98371
## contact_telephone           54.75534
## quarter_Q2                  91.27078
## quarter_Q3                  87.83489
## quarter_Q4                  90.07173
## year_2009                  147.08821
## year_2010                  394.26397
## day_cat_middle             115.80890
## day_cat_end                 97.01929

Mean Decrease in Gini is a measure of feature importance used in Random Forest and other tree-based models. It is calculated for each feature and is based on how much the feature decreases the impurity of the splits it is involved in. Gini impurity measures the disorder of a set, with lower values indicating less impurity. When a feature is used in a split, it helps partition the data into purer subsets, and the decrease in impurity from that split is accumulated for the feature over all the trees in the forest. The higher the Mean Decrease Gini, the more important the feature is considered to be in predicting the target variable. However, a high Gini importance does not always mean the feature is beneficial for the model's predictive power, since it can also be a symptom of overfitting or multicollinearity.
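
For reference, a minimal sketch of how such a table can be pulled from a fitted randomForest model (fit_rf is a placeholder name for the trained forest, not necessarily the author's object):

library(randomForest)

# MeanDecreaseGini is reported by default for a classification forest
imp <- importance(fit_rf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]

# Quick visual ranking of the same information
varImpPlot(fit_rf, main = "Random forest feature importance")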

## [1] "Random Forest Metrics:"
## $Accuracy
##  Accuracy 
## 0.8944827 
## 
## $Precision
## Pos Pred Value 
##      0.8999261 
## 
## $Recall
## Sensitivity 
##   0.9917254 
## 
## $F1
##        F1 
## 0.9435983 
## 
## $ROC_AUC
## Area under the curve: 0.6674
## 
## $PR_AUC
## [1] 0.8427744
##      Metric Random.RF Stratified.RF Feat.selection...Tuned.Random.RF
## 1  Accuracy 0.8952071     0.8968436                        0.8944827
## 2 Precision 0.9017791     0.9066148                        0.8999261
## 3    Recall 0.9900977     0.9851337                        0.9917254
## 4  F1 Score 0.9438769     0.9442447                        0.9435983
## 5   ROC AUC 0.7495000     0.7773000                        0.6674000
## 6    PR AUC 0.8165913     0.8142559                        0.8427744

One example of this trade-off: by removing all of the time-related variables and comparing the performance with the base model, we give up roughly 8 points of ROC AUC in exchange for a gain of under 3 points of PR AUC. By trading one metric for another in this way, ROC AUC for PR AUC, or a little precision for some recall, we can choose which metrics to prioritize.

##                     MeanDecreaseGini
## age                        541.75707
## default                     13.32093
## balance                    625.66141
## housing                    100.83242
## loan                        50.61403
## campaign                   204.18161
## pcontact_prevcamp          616.29517
## job_blue_collar             37.56741
## job_entrepreneur            25.13782
## job_housemaid               20.37506
## job_management              40.29184
## job_retired                 29.19151
## job_self_employed           28.24834
## job_services                29.78643
## job_student                 25.83862
## job_technician              45.12576
## job_unemployed              25.78485
## marital_married             44.97470
## marital_single              42.94501
## education_secondary         47.66345
## education_tertiary          43.17801
## contact_other              105.85719
## contact_telephone           38.20522

We can also conclude that age, balance and the result of the previous campaign have the strongest influence on the outcome of this marketing campaign, with the number of contacts in the current campaign (campaign), housing loans, and the contact type (contact_other) not far behind.

6. Conclusion

We can conclude that targeting the right customers with the right marketing strategy is crucial for minimizing the resources spent and capturing every opportunity. Banks can now decide which metric they want to focus on, asking questions such as "should I focus on both classes instead of only the positive class?" or "which is costlier, a false negative or a false positive?". We could further improve the model by:

  • more detailed feature selection
  • advanced ensemble techniques for the model training
  • binning numeric features
  • adjusting the model threshold to trade off precision and recall for PR AUC and ROC AUC (see the sketch after this list)
  • hyperparameter tuning and more advanced algorithms such as XGBoost
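
As a hedged illustration of the threshold idea (placeholder vectors, not outputs from the models above): instead of the default 0.5 cut-off, we can pick the probability threshold that maximizes a chosen metric, here the F1 score, on a validation set.

# Placeholder predicted probabilities and true labels from a validation set
probs  <- c(0.92, 0.15, 0.60, 0.08, 0.75, 0.30, 0.55, 0.20)
labels <- c(1,    0,    1,    0,    0,    1,    0,    0)

# Evaluate F1 over a grid of thresholds and keep the best one
thresholds <- seq(0.05, 0.95, by = 0.05)
f1_scores <- sapply(thresholds, function(t) {
    pred <- as.integer(probs >= t)
    tp <- sum(pred == 1 & labels == 1)
    fp <- sum(pred == 1 & labels == 0)
    fn <- sum(pred == 0 & labels == 1)
    if (tp == 0) return(0)
    precision <- tp / (tp + fp)
    recall    <- tp / (tp + fn)
    2 * precision * recall / (precision + recall)
})
best_threshold <- thresholds[which.max(f1_scores)]
best_threshold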