A Portuguese bank conducted a marketing campaign (phone calls) to predict if a client will subscribe to a term deposit The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and figure out most effective tactics that will help the bank in next campaign to persuade more customers to subscribe to the bank’s term deposit. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
bank <- read.csv("D:/Documents/DATA 622/bank/bank-full.csv")
df <- bank |> separate(col = age.job.marital.education.default.balance.housing.loan.contact.day.month.duration.campaign.pdays.previous.poutcome.y,
into = c('age','job','marital','education','default',
'balance','housing','loan','contact','day','month',
'duration','campaign','pdays','previous','poutcome','y'), sep=';')
# Convert variables to categorical (factor)
df <- df %>%
mutate(
job = as.factor(job),
marital = as.factor(marital),
education = as.factor(education),
contact = as.factor(contact),
month = as.factor(month),
poutcome = as.factor(poutcome),
y = as.factor(y)
)
# Convert variables to numeric
df <- df %>%
mutate(
age = as.numeric(age),
balance = as.numeric(balance),
day = as.numeric(day),
duration = as.numeric(duration),
campaign = as.numeric(campaign),
pdays = as.numeric(pdays),
previous = as.numeric(previous)
)
df <- df %>%
mutate(pdays = ifelse(pdays == -1, NA, pdays))
# Convert binary variables to logical
df <- df %>%
mutate(
default = ifelse(default == "yes", TRUE, FALSE),
housing = ifelse(housing == "yes", TRUE, FALSE),
loan = ifelse(loan == "yes", TRUE, FALSE)
)
Review the structure and content of the data and answer questions such as:
Are the features (columns) of your data correlated?
Correlation matrix indicates almost all of the numeric variables are not correlated in any significant way (-0.11 to 0.16).
What is the overall distribution of each variable?
Histograms of numeric values indicate only “age” is somewhat normally distributed, while all the others do not demonstrate normality.
Are there any outliers present?
Boxplots identify extreme outliers in almost all variables with the exception of ‘day’, which is the last contact day of the month. This is constrained by 1 and 31 days of the month.
How are categorical variables distributed?
Bar charts identify potential differences in the variables from those that indicate ‘yes’ or ‘no’ for subscribing to a term deposit. In general, the categoricals are similar between those that say “yes” and “no”. However, there are some notable exceptions such those with higher education, single individuals, those that do not have a housing loan, individuals contacted by cell phone are more likely to say ‘yes’ to a term deposit.
Do any patterns or trends emerge in the data?
There is ~90% ‘no’ rate from the campaigns. March, September, and December have the highest success. Previous success correlates with possible current success. There may be possible age-job interactions that might benefit from further analysis. Longer duration calls are associated with possible greater success.
What is the central tendency and spread of each variable?
There are asymmetric distributions for balance, call duration, and campaign contacts. The job categories are not equally represented. There are significant spreads seen for balance, duration, and campaign contacts.
Are there any missing values and how significant are they?
Some variables have ‘other’ or ‘unknown’ effectively indicating a missing value. While these variables can be analyzed with these categories, any potential relationships between the other variables might be obscured. ‘pdays’ does have a significant number of NAs, accounting for over 80% of the observations.
str(df)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : num 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ balance : num 2143 29 2 1506 1 ...
## $ housing : logi TRUE TRUE TRUE TRUE FALSE TRUE ...
## $ loan : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : num 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : num 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : num 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : num NA NA NA NA NA NA NA NA NA NA ...
## $ previous : num 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
summary(df)
## age job marital education
## Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851
## 1st Qu.:33.00 management :9458 married :27214 secondary:23202
## Median :39.00 technician :7597 single :12790 tertiary :13301
## Mean :40.94 admin. :5171 unknown : 1857
## 3rd Qu.:48.00 services :4154
## Max. :95.00 retired :2264
## (Other) :6835
## default balance housing loan
## Mode :logical Min. : -8019 Mode :logical Mode :logical
## FALSE:44396 1st Qu.: 72 FALSE:20081 FALSE:37967
## TRUE :815 Median : 448 TRUE :25130 TRUE :7244
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## contact day month duration
## cellular :29285 Min. : 1.00 may :13766 Min. : 0.0
## telephone: 2906 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0
## unknown :13020 Median :16.00 aug : 6247 Median : 180.0
## Mean :15.81 jun : 5341 Mean : 258.2
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0
## Max. :31.00 apr : 2932 Max. :4918.0
## (Other): 6060
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 1.0 Min. : 0.0000 failure: 4901
## 1st Qu.: 1.000 1st Qu.:133.0 1st Qu.: 0.0000 other : 1840
## Median : 2.000 Median :194.0 Median : 0.0000 success: 1511
## Mean : 2.764 Mean :224.6 Mean : 0.5803 unknown:36959
## 3rd Qu.: 3.000 3rd Qu.:327.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## NA's :36954
## y
## no :39922
## yes: 5289
##
##
##
##
##
numeric_vars <- c("age", "balance", "day", "duration", "campaign", "pdays", "previous")
cor_matrix <- cor(df[, numeric_vars], use = "pairwise.complete.obs")
print(cor_matrix)
## age balance day duration campaign
## age 1.000000000 0.097782739 -0.009120046 -0.004648428 0.004760312
## balance 0.097782739 1.000000000 0.004502585 0.021560380 -0.014578279
## day -0.009120046 0.004502585 1.000000000 -0.030206341 0.162490216
## duration -0.004648428 0.021560380 -0.030206341 1.000000000 -0.084569503
## campaign 0.004760312 -0.014578279 0.162490216 -0.084569503 1.000000000
## pdays -0.107862882 -0.108122124 -0.090094644 -0.024406586 0.050533690
## previous 0.001288319 0.016673637 -0.051710497 0.001203057 -0.032855290
## pdays previous
## age -0.10786288 0.001288319
## balance -0.10812212 0.016673637
## day -0.09009464 -0.051710497
## duration -0.02440659 0.001203057
## campaign 0.05053369 -0.032855290
## pdays 1.00000000 -0.021884865
## previous -0.02188487 1.000000000
corrplot(cor_matrix,
method = "color",
type = "lower",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.7,
col = colorRampPalette(c("#FC4E07", "#FFEDA0", "#00AFBB"))(200))
categorical_vars <- c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome")
for (var in categorical_vars) {
plot <- ggplot(df, aes(x = y, fill = !!sym(var))) +
geom_bar(position = "fill") +
theme_minimal() +
labs(title = paste("Distribution of", var, "by y"),
x = "y",
y = "Proportion",
fill = var)
print(plot)
}
# Create a function to generate box plots for each numeric variable
create_boxplot <- function(df, var) {
ggplot(df, aes(x = "", y = !!sym(var))) +
geom_boxplot(fill = "steelblue", color = "black") +
labs(title = paste("Box Plot of", var),
x = "",
y = var) +
theme_minimal()
}
# Create box plots for each numeric variable
box_plots <- lapply(numeric_vars, function(var) create_boxplot(df, var))
# Print the box plots
for (plot in box_plots) {
print(plot)
}
## Warning: Removed 36954 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Specify the numeric variables
numeric_vars <- c("age", "balance", "day", "duration", "campaign", "pdays", "previous")
# Create a function to generate histograms for each numeric variable
create_histogram <- function(df, var) {
ggplot(df, aes(x = !!sym(var))) +
geom_histogram(fill = "steelblue", color = "black", bins = 30) +
labs(title = paste("Histogram of", var),
x = var,
y = "Frequency") +
theme_minimal()
}
# Create histograms for each numeric variable
histograms <- lapply(numeric_vars, function(var) create_histogram(df, var))
# Print the histograms
for (plot in histograms) {
print(plot)
}
## Warning: Removed 36954 rows containing non-finite outside the scale range
## (`stat_bin()`).
Now you have completed the EDA, what Algorithms would suit the business purpose for the dataset.
The bank marketing dataset presents several complex characteristics that make traditional machine learning algorithms unsuitable. Since this assignment is due before we have fully explored alternative options of decision trees and random forest applications, we will focus on only the options currently covered. Here is an analysis of why those conventional approaches fall short:
The dataset’s fundamental structure poses significant challenges for traditional algorithms. With seventeen variables of mixed types (numeric, categorical, and binary), considerable missing values (81.7% in ‘pdays’), and severe class imbalance (11.7% yes vs 88.3% no), basic algorithms lack the sophistication to manage these complexities simultaneously.
Linear and logistic regression models, including their regularized variants (Lasso and Ridge), are particularly inappropriate due to their rigid assumptions. The binary outcome variable violates linear regression’s basic assumptions. There are complex conditional dependencies between variables, important interaction effects between categorical and numeric variables, the presence of hierarchical relationships in the data structure, and the need to capture multiple feature interactions simultaneously.
KNN’s implementation becomes problematic due to the “curse of dimensionality” across seventeen variables. The algorithm’s distance calculations become less meaningful in high-dimensional space, particularly with mixed variable types and even with the large dataset of 45,000+ observations. The large dataset size makes KNN computationally inefficient, while its sensitivity to feature scaling and poor handling of categorical variables further limit its applicability.
Linear Discriminant Analysis (LDA) fails to meet its core assumptions. The data does not follow normal distribution patterns, and the variance-covariance matrices are not equal across classes. The presence of multiple categorical predictors with varying levels and significant outliers violates LDA’s fundamental requirements.
While Naive Bayes could potentially work with the numeric variables due to their low correlations, it may still be sub-optimal due to its limitations in handling mixed data types, skewed distributions, and missing values, rather than due to violations of the independence assumption. Its simplified probability estimates cannot adequately capture the complex relationships between variables, particularly with the mixture of numeric and categorical features and the presence of informative missing values.
The significant proportion of missing values in ‘pdays’ (81.7%) poses a particular challenge. Traditional algorithms either require complete cases or simple imputation methods, which would either significantly reduce the dataset or potentially introduce bias. The missing values are informative (indicating no previous contact) and require more sophisticated handling than these algorithms can provide.
The presence of extreme outliers, particularly in financial variables, severely impacts traditional algorithms’ performance. While alternative versions of these algorithms exist, they often sacrifice other important aspects of model performance to manage outliers.
With over 45,000 observations, algorithms like KNN become computationally expensive, particularly during prediction phases. The need for extensive pre-processing (scaling, encoding, imputation) with traditional algorithms further increases computational workload.
These limitations collectively demonstrate why more sophisticated algorithms like Random Forest and AdaBoost, which can naturally manage these data characteristics, are more appropriate choices. Their ability to manage complex interactions, manage missing values, accommodate outliers, and deal with class imbalance makes them better suited for this prediction task.
For Random Forest with this banking dataset, the pre-processing needs are minimal compared to other algorithms, which is one of its advantages. The main pre-processing requirements would focus on managing the class imbalance (11.7% vs 88.3% for the target variable) through either minority over-sampling or class weights and addressing the missing values in ‘pdays’ (81.7% missing) by either creating a binary indicator variable for previous contact. Random Forest can manage both categorical and numerical variables without scaling or normalization, manages outliers well through its bootstrap sampling, and is not affected by non-linear relationships or interactions between variables. While feature engineering could still be beneficial, it is not strictly necessary for the algorithm to perform well. The categorical variables (job, education, marital status, etc.) can be used as-is without one-hot encoding, and the numerical variables (age, balance, duration) do not require standardization or normalization. The focus should be on ensuring the training data provides a balanced representation of both positive and negative cases to prevent the model from being biased toward the majority class.