Introduction
Exploratory Data Analysis (EDA) is an essential first step in any data science project because it lays the groundwork for sound and efficient machine learning models. EDA examines data structure, patterns, and quality before model development begins. It helps identify missing values, outliers, and class imbalance while highlighting important relationships between variables. As the saying goes, "better data beats better algorithms," emphasizing the value of data quality over model sophistication.
In this assignment, I will perform exploratory analysis on the selected datasets, attending to both data preparation and insight extraction. Since this is an exploratory assignment, I will display any errors and warnings encountered during analysis to better understand data issues. Running the code on different datasets will also allow us to compare results, highlight differences in distributions, and see how the properties of the data affect subsequent analysis.
By the end of this assignment, my aim is not only to summarize significant findings but also to establish the data pre-processing steps that will be instrumental in predictive modeling: data cleansing, feature engineering, transformation, and correction of class imbalance. Ultimately, this exercise shows how diligent examination of data lays the foundation for better, more interpretable machine learning models.
Data Set
A Portuguese bank conducted a marketing campaign (phone calls) to predict whether a client would subscribe to a term deposit. The records of these efforts are available as a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics to help the bank persuade more customers to subscribe to a term deposit in its next campaign. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
Assignment
Review the structure and content of the data and answer questions such as:
Are the features (columns) of your data correlated?
What is the overall distribution of each variable?
Are there any outliers present?
What are the relationships between different variables?
How are categorical variables distributed?
Do any patterns or trends emerge in the data?
What is the central tendency and spread of each variable?
Are there any missing values and how significant are they?
Load Data
# Load required libraries
library(tidyverse)    # dplyr, tidyr, ggplot2, stringr, forcats, purrr
library(knitr)        # kable()
library(psych)        # describe()
library(viridis)      # viridis color scales
library(ggcorrplot)   # correlation heatmaps
library(rcompanion)   # cramerV()

# Read the semicolon-separated CSV of the Portuguese bank marketing campaign
bank <- read.csv("bank.csv", sep = ";")
# Preview the first few rows of the dataset
kable(head(bank, 10), caption = "Preview of the Bank Dataset")
age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
35 | management | single | tertiary | no | 747 | no | no | cellular | 23 | feb | 141 | 2 | 176 | 3 | failure | no |
36 | self-employed | married | tertiary | no | 307 | yes | no | cellular | 14 | may | 341 | 1 | 330 | 2 | other | no |
39 | technician | married | secondary | no | 147 | yes | no | cellular | 6 | may | 151 | 2 | -1 | 0 | unknown | no |
41 | entrepreneur | married | tertiary | no | 221 | yes | no | unknown | 14 | may | 57 | 2 | -1 | 0 | unknown | no |
43 | services | married | primary | no | -88 | yes | yes | cellular | 17 | apr | 313 | 1 | 147 | 2 | failure | no |
Data Overview
# Rename columns (make them lowercase and replace spaces with underscores)
colnames(bank) <- str_to_lower(gsub(" ", "_", colnames(bank)))
# Copy to bank_data, the name used throughout the rest of the analysis
bank_data <- bank
# Check for missing values
colSums(is.na(bank_data))
age job marital education default balance housing loan
0 0 0 0 0 0 0 0
contact day month duration campaign pdays previous poutcome
0 0 0 0 0 0 0 0
y
0
# Inspect the structure of the dataset
str(bank_data)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 33 35 30 59 35 36 39 41 43 ...
$ job : chr "unemployed" "services" "management" "management" ...
$ marital : chr "married" "married" "single" "married" ...
$ education: chr "primary" "secondary" "tertiary" "tertiary" ...
$ default : chr "no" "no" "no" "no" ...
$ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
$ housing : chr "no" "yes" "yes" "yes" ...
$ loan : chr "no" "yes" "no" "yes" ...
$ contact : chr "cellular" "cellular" "cellular" "unknown" ...
$ day : int 19 11 16 3 5 23 14 6 14 17 ...
$ month : chr "oct" "may" "apr" "jun" ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
$ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
$ previous : int 0 4 1 0 0 3 2 0 0 2 ...
$ poutcome : chr "unknown" "failure" "failure" "unknown" ...
$ y : chr "no" "no" "no" "no" ...
Exploratory Data Analysis
Review the structure and content of the data and answer questions such as:
1.1) Are the features (columns) of your data correlated?
cat("There are", ncol(bank_data), "columns and", nrow(bank_data), "rows.",
"Y is the target variable which denotes whether a client subscribes to a term deposit.",
"There are", ncol(bank_data) - 1, "features as listed below (not including the target variable):")
There are 17 columns and 4521 rows. Y is the target variable which denotes whether a client subscribes to a term deposit. There are 16 features as listed below (not including the target variable):
[1] "age" "job" "marital" "education" "default" "balance"
[7] "housing" "loan" "contact" "day" "month" "duration"
[13] "campaign" "pdays" "previous" "poutcome" "y"
1.2) What is the overall distribution of each variable?
Numerical variables: distributions show whether the data is normally distributed, skewed, or contains outliers. Histograms, density plots, and boxplots provide clear visual summaries.
Categorical variables: frequency counts and bar plots reveal how data is grouped across categories, helping to identify imbalances or dominant classes.
Pdays – Highly right-skewed, with median = -1 and mean ≈ 40. Since 75% of values are -1, most clients were not previously contacted. The large gap between mean and quartiles indicates outliers.
Previous – Right-skewed with median = 0 and mean ≈ 0.6. Most clients were not contacted before, and the higher mean suggests some outliers.
Campaign – Right-skewed. Median = 2, mean ≈ 2.8, with 50% of data between 1 and 3. Outliers present due to a higher mean than median.
Day – Approximately normal, with mean and median ≈ 16. Middle 50% of values fall between 8 and 21.
Age – Slight right skew. Half of clients are between 33 and 48 years old.
Duration – Right-skewed. Most calls were short, though some lasted much longer, indicating outliers.
Balance – Strong right skew. Median = 444, mean ≈ 1423. Most clients have low balances, but a few very high values pull the mean upward.
# Separating numerical features
bank_data_numerical <- bank_data %>%
  select(where(is.numeric))
# Using describe() for summary stats
describe(bank_data_numerical)
vars n mean sd median trimmed mad min max range skew
age 1 4521 41.17 10.58 39 40.48 10.38 19 87 68 0.70
balance 2 4521 1422.66 3009.64 444 802.41 658.27 -3313 71188 74501 6.59
day 3 4521 15.92 8.25 16 15.80 10.38 1 31 30 0.09
duration 4 4521 263.96 259.86 185 216.44 143.81 4 3025 3021 2.77
campaign 5 4521 2.79 3.11 2 2.14 1.48 1 50 49 4.74
pdays 6 4521 39.77 100.12 -1 11.56 0.00 -1 871 872 2.72
previous 7 4521 0.54 1.69 0 0.12 0.00 0 25 25 5.87
kurtosis se
age 0.35 0.16
balance 88.25 44.76
day -1.04 0.12
duration 12.51 3.86
campaign 37.11 0.05
pdays 7.94 1.49
previous 51.91 0.03
# Five-number summary of the numeric features
summary(bank_data_numerical)
age balance day duration
Min. :19.00 Min. :-3313 Min. : 1.00 Min. : 4
1st Qu.:33.00 1st Qu.: 69 1st Qu.: 9.00 1st Qu.: 104
Median :39.00 Median : 444 Median :16.00 Median : 185
Mean :41.17 Mean : 1423 Mean :15.92 Mean : 264
3rd Qu.:49.00 3rd Qu.: 1480 3rd Qu.:21.00 3rd Qu.: 329
Max. :87.00 Max. :71188 Max. :31.00 Max. :3025
campaign pdays previous
Min. : 1.000 Min. : -1.00 Min. : 0.0000
1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000
Median : 2.000 Median : -1.00 Median : 0.0000
Mean : 2.794 Mean : 39.77 Mean : 0.5426
3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000
Max. :50.000 Max. :871.00 Max. :25.0000
# Convert to long format for plotting
bank_data_long <- bank_data_numerical %>%
select(where(is.numeric)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
mutate(value = round(as.numeric(value), 0))
# Plot histograms
p <- bank_data_long %>%
mutate(variable = fct_reorder(variable, value)) %>%
ggplot(aes(x = value, color = variable, fill = variable)) +
geom_histogram(alpha = 0.6, binwidth = 5) +
scale_fill_viridis(discrete = TRUE) +
scale_color_viridis(discrete = TRUE) +
theme_minimal() +
theme(
legend.position = "none",
panel.spacing = unit(0.1, "lines"),
strip.text.x = element_text(size = 8)
) +
xlab("Value") +
ylab("Frequency") +
ggtitle("Distribution of Numerical Features") +
facet_wrap(~variable, scales = "free")
# Print
print(p)
Job – Fairly spread across categories, though blue-collar, management, and technician appear most often.
Marital – Majority are married, nearly double the single group and about 5× the divorced group, showing a strong skew toward married clients.
Education – Secondary education dominates, roughly twice tertiary and three times primary, making the feature skewed toward secondary.
Default – Overwhelming majority have no credit default, with very little variation.
Housing – Close split: ~55% have a housing loan, ~45% do not.
Loan – Most do not have a personal loan (~80% no vs. ~20% yes).
Contact – Majority of contacts were made via cell phone.
Month – Calls are unevenly distributed, with May showing the highest concentration.
Poutcome – Unknown dominates, while among known outcomes, failure is more common than success.
Y (Term Deposit Subscription) – Strong imbalance: most clients did not subscribe (about 4,000 no vs. 521 yes).
# Separating categorical features
bank_data_categorical <- bank_data %>%
  select(where(is.character))
# Using describe() for summary stats
describe(bank_data_categorical)
vars n mean sd median trimmed mad min max range skew kurtosis
job* 1 4521 5.41 3.26 5 5.33 4.45 1 12 11 0.24 -1.26
marital* 2 4521 2.15 0.60 2 2.18 0.00 1 3 2 -0.07 -0.35
education* 3 4521 2.23 0.75 2 2.24 0.00 1 4 3 0.19 -0.28
default* 4 4521 1.02 0.13 1 1.00 0.00 1 2 1 7.51 54.48
housing* 5 4521 1.57 0.50 2 1.58 0.00 1 2 1 -0.27 -1.93
loan* 6 4521 1.15 0.36 1 1.07 0.00 1 2 1 1.93 1.72
contact* 7 4521 1.65 0.90 1 1.57 0.00 1 3 2 0.74 -1.36
month* 8 4521 6.54 3.00 7 6.71 2.97 1 12 11 -0.50 -0.98
poutcome* 9 4521 3.56 0.99 4 3.82 0.00 1 4 3 -1.96 2.10
y* 10 4521 1.12 0.32 1 1.02 0.00 1 2 1 2.41 3.80
se
job* 0.05
marital* 0.01
education* 0.01
default* 0.00
housing* 0.01
loan* 0.01
contact* 0.01
month* 0.04
poutcome* 0.01
y* 0.00
# Transform to long
bank_data_long <- bank_data_categorical %>%
pivot_longer(everything(), names_to = "variable", values_to = "category")
# Plot bar plots
p <- bank_data_long %>%
  ggplot(aes(x = category, color = variable, fill = variable)) +
  # geom_bar() counts rows and takes no binwidth; fct_reorder() on a
  # character value column would error, so both are dropped
  geom_bar(alpha = 0.6) +
scale_fill_viridis(discrete = TRUE) +
scale_color_viridis(discrete = TRUE) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
theme_minimal() +
theme(
legend.position = "none",
panel.spacing = unit(0.5, "lines"),
strip.text.x = element_text(size = 8),
axis.text.x = element_text(angle = 45, hjust = 1.0, size = 7.5),
axis.text.y = element_text(size = 10),
plot.margin = ggplot2::margin(10, 10, 10, 10, unit = "pt"),
plot.title = element_text(size = 14, face = "bold", hjust = 0.5)
) +
xlab("Categories") +
ylab("Frequency") +
ggtitle("Categorical Feature Distributions") +
facet_wrap(~variable, scales = "free_x", ncol = 5)
# Print
print(p)
Conclusion
The categorical distributions show that most clients are married, work in blue-collar, management, or technician jobs, and have secondary education. Most do not have personal loans or credit defaults, and cell phones were the main contact method. Past campaign outcomes are mostly unknown, and the target variable Y is heavily imbalanced, with most clients not subscribing.
1.3) Are there any outliers present?
Outliers were investigated via three methods: summary statistics, visual inspection, and the 1.5 × IQR rule.
The distributions of the features and target variable in the previous step point to outliers in the following numerical variables:
Pdays: median and quartiles = -1, while the mean (~40) suggests outliers.
Previous: quartiles and median = 0, but the mean (0.54) indicates that some values must be very high.
Campaign: mean (2.8) > median (2), which points to potential outliers.
Duration: mostly short calls, but a few significantly longer calls are outliers.
Balance: extreme difference between the mean (1423) and median (444) with a wide IQR, reflecting very large balances for a few customers.
Visual inspection confirmed the presence of extreme values across most numerical features, with the exception of day, which showed a more stable distribution.
Using the standard 1.5 × IQR rule, outliers were identified for nearly all numerical features, with the exception of day. The counts of detected outliers (see Table 1 below) are:
Age: 38
Balance: 506
Campaign: 318
Duration: 330
Pdays: 816
Previous: 816
# scatterplots for outliers
numeric_data <- bank_data %>%
select(where(is.numeric))
# Pivot the data to long format for plotting
numeric_data_long <- numeric_data %>%
mutate(id = row_number()) %>%
pivot_longer(cols = -id, names_to = "variable", values_to = "value")
# Create scatterplots for each numeric variable
ggplot(numeric_data_long, aes(x = id, y = value)) +
geom_point(alpha = 0.6, color = "darkblue") +
facet_wrap(~variable, scales = "free", ncol = 5) +
theme_minimal() +
theme(
strip.text.x = element_text(size = 10),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
labs(
x = "ID (Row Number)",
y = "Value",
title = "Scatterplots of Numerical Variables"
)
Conclusion: Outliers are present in most numerical features, especially balance and duration. While these extreme values may reflect real-world customer behavior rather than errors, they should be carefully considered in further modeling steps (e.g., using robust scaling or transformation).
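As a hedged illustration of the transformation option just mentioned, the sketch below compresses the long right tails of duration and balance with log transforms; the *_log column names and the bank_data_transformed object are illustrative, not part of the assignment code.
# Illustrative sketch: compress right-skew with log transforms (dplyr assumed loaded)
bank_data_transformed <- bank_data %>%
  mutate(
    duration_log = log1p(duration),                     # duration is non-negative
    balance_log  = sign(balance) * log1p(abs(balance))  # signed log: balance can be negative
  )
summary(bank_data_transformed$balance_log)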
# IQR Calculation
calculate_iqr_outliers <- function(df, col_name) {
Q1 <- quantile(df[[col_name]], 0.25, na.rm = TRUE) # 25th
Q3 <- quantile(df[[col_name]], 0.75, na.rm = TRUE) # 75th
IQR <- Q3 - Q1 # Interquartile range
# Calculate outlier threshold
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Identify outliers
df %>%
filter(df[[col_name]] < lower_bound | df[[col_name]] > upper_bound) %>% # Keep only outliers
mutate(variable = col_name) %>%
select(variable)
}
# Apply the function to all numeric variables and count outliers
outlier_counts <- bank_data %>%
select(where(is.numeric)) %>%
names() %>%
map_df(~calculate_iqr_outliers(bank_data, .x)) %>%
group_by(variable) %>%
summarise(outlier_count = n())
# Print the outlier counts for each feature
cat("\n**Table 1: Count of Outliers in Each Numeric Variable**\n")
**Table 1: Count of Outliers in Each Numeric Variable**
variable | outlier_count |
---|---|
age | 38 |
balance | 506 |
campaign | 318 |
duration | 330 |
pdays | 816 |
previous | 816 |
1.4) What are the relationships between different variables?
The correlation matrix suggests that most numerical features do not have strong linear relationships. Only a few show mild or weak associations:
Pdays and Previous (r = 0.45): Indicates that clients previously contacted are more likely to be contacted again soon.
Day and Duration (r = 0.16): A weak positive trend, suggesting calls tend to last slightly longer on later days.
Age and Balance (r = 0.10): A weak relationship where older clients may have marginally higher balances.
All other correlations among numerical features are close to zero, showing little to no linear association.
Correlations among categorical variables are generally weak, though a few mild associations are observed:
Job and Education – 0.46
Housing and Month – 0.50
Contact and Month – 0.51
All other correlations are weak.
# --- Correlation Matrix for Numeric Variables ---
numeric_data <- bank_data %>% select(where(is.numeric))
# Correlation matrix
correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
# corr matrix
ggcorrplot(correlation_matrix,
method = "circle",
type = "lower",
lab = TRUE,
title = "Correlation Matrix of Numeric Variables")
# Example 1: Combine housing and personal loan
bank_data$has_any_loan <- ifelse(bank_data$housing == "yes" | bank_data$loan == "yes", 1, 0)
table(bank_data$has_any_loan)
0 1
1677 2844
# Example 2: Create seasonal feature from month
bank_data$season <- ifelse(bank_data$month %in% c("dec","jan","feb"), "Winter",
ifelse(bank_data$month %in% c("mar","apr","may"), "Spring",
ifelse(bank_data$month %in% c("jun","jul","aug"), "Summer","Fall")))
table(bank_data$season)
Fall Spring Summer Winter
521 1740 1870 390
Conclusion:
In general, the numerical variables show weak or no relationships, except for the notable correlation between pdays and previous, which reflects repeated follow-ups with previously contacted clients. Similarly, the categorical variables show mostly weak relationships, with mild associations between job and education, housing and month, and contact and month.
From a business-domain perspective, these insights are the foundation upon which new composite features are built. Combining pdays and previous into a repeated_contact feature would more accurately represent follow-up activity, while has_any_loan and season capture client loan obligations and campaign timing. Weakly correlated variables can still provide complementary predictive power when engineered with care, making them valuable model inputs (see the sketch below).
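A minimal sketch of the repeated_contact idea mentioned above (the name is illustrative); it is computed as a standalone vector so the summaries later in this report are unchanged.
# pdays == -1 encodes "never previously contacted" in this dataset,
# so any other value combined with previous > 0 marks a repeated contact
repeated_contact <- ifelse(bank_data$pdays != -1 & bank_data$previous > 0, 1, 0)
table(repeated_contact)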
# Select categorical variables
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]
# Function for Cramers
cramers_v_matrix <- function(df, cat_vars) {
n <- length(cat_vars)
result <- matrix(0, n, n, dimnames = list(cat_vars, cat_vars))
for (i in 1:n) {
for (j in i:n) {
if (i == j) {
result[i, j] <- 1
} else {
result[i, j] <- cramerV(df[[cat_vars[i]]], df[[cat_vars[j]]])
result[j, i] <- result[i, j]
}
}
}
return(result)
}
cramers_matrix <- cramers_v_matrix(bank_data, categorical_cols)
cramers_df <- as.data.frame(cramers_matrix)
ggcorrplot(cramers_matrix,
method = "circle",
type = "lower",
lab = TRUE,
title = "Cramers V Correlation Between Categorical Variables")
Conclusion:
The Cramér’s V correlation matrix shows that most categorical variables have weak associations with each other. A few mild relationships are observed, such as between job and education and between contact method and month, but overall the categorical features are largely independent.
1.5) How are categorical variables distributed?
Boxplots are used to examine the relationships between numeric and categorical variables. They allow us to visually assess differences in distributions, detect outliers, and identify patterns across categories.
# Convert categorical variables to factors
bank_data <- bank_data %>%
mutate(across(where(is.character), as.factor))
# Combine all numeric-categorical pairs into one long dataframe
numeric_vars <- names(select(bank_data, where(is.numeric)))
categorical_vars <- names(select(bank_data, where(is.factor)))
# Long format
bank_data_long <- bank_data %>%
pivot_longer(cols = all_of(numeric_vars), names_to = "Numeric_Variable", values_to = "Value")
# Boxplots
ggplot(bank_data_long, aes(x = .data[[categorical_vars[1]]], y = Value, fill = .data[[categorical_vars[1]]])) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~Numeric_Variable, scales = "free") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels
legend.position = "none" # Remove legend if not needed
) +
labs(
x = "Categorical Variable",
y = "Value",
title = "Boxplots of Numeric Variables by Categorical Groups"
)
Conclusion
The boxplots reveal that numeric variables vary across categorical groups, with some features showing clear differences in median values or spread. Outliers are also apparent in several numeric features, highlighting the influence of extreme values within certain categories.
1.6) Do any patterns or trends emerge in the data?
The most obvious patterns are: the strong imbalance in the target variable y (most clients did not subscribe); the concentration of calls in May; the dominance of pdays = -1, meaning most clients were never previously contacted; and the pronounced right skew of balance, duration, campaign, and previous.
1.7) What is the central tendency and spread of each variable?
The central tendency of each numerical variable is reported via the mean, median, and mode. For categorical variables, the mode is reported.
The spread of the numerical variables is described by the standard deviation, variance, IQR, and range (max - min).
The spread of the categorical variables is captured via the frequency and percentage of each category.
# Separate into numeric and categorical variables
numeric_vars <- bank_data %>%
select(where(is.numeric))
categorical_vars <- bank_data %>%
select(where(is.character) | where(is.factor))
# Summary statistics for numeric variables
numeric_summary <- numeric_vars %>%
summarise_all(list(
Mean = mean,
Median = median,
SD = sd,
Variance = var,
IQR = IQR,
Range = ~max(., na.rm = TRUE) - min(., na.rm = TRUE)
), na.rm = TRUE) %>%
  # Split on the LAST underscore so names containing "_" (e.g. has_any_loan) parse correctly
  pivot_longer(cols = everything(), names_to = c("Variable", "Metric"), names_sep = "_(?=[^_]+$)") %>%
  pivot_wider(names_from = Metric, values_from = value)
# Function to compute mode separately
compute_mode <- function(x) {
tab <- table(x)
names(tab)[which.max(tab)]
}
mode_summary <- numeric_vars %>%
  summarise_all(list(Mode = compute_mode)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Mode") %>%
  # Strip the "_Mode" suffix so the join on "Variable" below matches
  mutate(Variable = sub("_Mode$", "", Variable))
# Merge numeric summary with mode
numeric_summary <- left_join(numeric_summary, mode_summary, by = "Variable")
# Mode for categorical variables
categorical_summary <- categorical_vars %>%
summarise_all(list(
Mode = compute_mode
)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Mode")
# Display numeric summary
print(numeric_summary)
# A tibble: 8 × 9
Variable Mean any Median SD Variance IQR Range Mode
<chr> <list> <list> <list> <list> <list> <list> <list> <chr>
1 age <dbl [1]> <NULL> <dbl [1]> <dbl [1]> <dbl [1]> <dbl> <dbl> <NA>
2 balance <dbl [1]> <NULL> <dbl [1]> <dbl [1]> <dbl [1]> <dbl> <dbl> <NA>
3 day <dbl [1]> <NULL> <dbl [1]> <dbl [1]> <dbl [1]> <dbl> <dbl> <NA>
4 duration <dbl [1]> <NULL> <dbl [1]> <dbl [1]> <dbl [1]> <dbl> <dbl> <NA>
5 campaign <dbl [1]> <NULL> <dbl [1]> <dbl [1]> <dbl [1]> <dbl> <dbl> <NA>
6 pdays <dbl [1]> <NULL> <dbl [1]> <dbl [1]> <dbl [1]> <dbl> <dbl> <NA>
7 previous <dbl [1]> <NULL> <dbl [1]> <dbl [1]> <dbl [1]> <dbl> <dbl> <NA>
8 has <NULL> <dbl [6]> <NULL> <NULL> <NULL> <NULL> <NULL> <NA>
# A tibble: 11 × 2
Variable Mode
<chr> <chr>
1 job_Mode management
2 marital_Mode married
3 education_Mode secondary
4 default_Mode no
5 housing_Mode yes
6 loan_Mode no
7 contact_Mode cellular
8 month_Mode may
9 poutcome_Mode unknown
10 y_Mode no
11 season_Mode Summer
# Function to compute category counts and correct percentages
compute_category_counts <- function(df) {
df %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category") %>%
group_by(Variable, Category) %>%
summarise(Count = n(), .groups = "drop") %>%
group_by(Variable) %>%
mutate(Percentage = round((Count / sum(Count)) * 100, 2)) %>%
arrange(Variable, desc(Count))
}
# Apply function to categorical variables
categorical_vars <- bank_data %>%
select(where(is.character) | where(is.factor))
category_counts <- compute_category_counts(categorical_vars)
# Print
print(category_counts)
# A tibble: 50 × 4
# Groups: Variable [11]
Variable Category Count Percentage
<chr> <fct> <int> <dbl>
1 contact cellular 2896 64.1
2 contact unknown 1324 29.3
3 contact telephone 301 6.66
4 default no 4445 98.3
5 default yes 76 1.68
6 education secondary 2306 51.0
7 education tertiary 1350 29.9
8 education primary 678 15
9 education unknown 187 4.14
10 housing yes 2559 56.6
# ℹ 40 more rows
1.8) Are there any missing values and how significant are they?
There are no missing (NA) values. However, several categorical features use the label "unknown" (e.g., job, education, contact, poutcome), which effectively encodes missing information; a quick check of these labels follows the NA summary below.
# Missing values
missing_summary <- bank_data %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
arrange(desc(Missing_Count))
# Print
print(missing_summary)
# A tibble: 19 × 2
Variable Missing_Count
<chr> <int>
1 age 0
2 job 0
3 marital 0
4 education 0
5 default 0
6 balance 0
7 housing 0
8 loan 0
9 contact 0
10 day 0
11 month 0
12 duration 0
13 campaign 0
14 pdays 0
15 previous 0
16 poutcome 0
17 y 0
18 has_any_loan 0
19 season 0
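As noted above, the label "unknown" acts as implicit missingness even though no NA values exist. The hedged check below (not part of the original analysis) counts these labels per categorical feature.
# Count "unknown" labels, which encode missing information without being NA
bank_data %>%
  summarise(across(where(~ is.character(.x) || is.factor(.x)), ~ sum(.x == "unknown"))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Unknown_Count") %>%
  filter(Unknown_Count > 0)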
Duplicate or Inconsistent Values
To ensure data quality, I checked for duplicate rows and inconsistent values.
# Count duplicate rows
sum(duplicated(bank_data))
[1] 0
# Check for inconsistent categorical values
bank_data_categorical <- bank_data %>% select(where(is.factor))
sapply(bank_data_categorical, unique)
$job
[1] unemployed services management blue-collar self-employed
[6] technician entrepreneur admin. student housemaid
[11] retired unknown
12 Levels: admin. blue-collar entrepreneur housemaid management ... unknown
$marital
[1] married single divorced
Levels: divorced married single
$education
[1] primary secondary tertiary unknown
Levels: primary secondary tertiary unknown
$default
[1] no yes
Levels: no yes
$housing
[1] no yes
Levels: no yes
$loan
[1] no yes
Levels: no yes
$contact
[1] cellular unknown telephone
Levels: cellular telephone unknown
$month
[1] oct may apr jun feb aug jan jul nov sep mar dec
Levels: apr aug dec feb jan jul jun mar may nov oct sep
$poutcome
[1] unknown failure other success
Levels: failure other success unknown
$y
[1] no yes
Levels: no yes
$season
[1] Fall Spring Summer Winter
Levels: Fall Spring Summer Winter
Conclusion
No duplicates or inconsistent values were detected, so no further cleaning is required.
The majority of the clients are married with secondary education, a typical retail-banking demographic. The right-skewed distributions of duration and balance are also expected, since marketing campaigns tend to target particular client segments. The large imbalance in the target variable y is likewise unsurprising, because only a small share of called clients subscribe to a term deposit.
Algorithm/Model Selection
Which algorithms would suit the business purpose of the dataset? Answer questions such as:
The dataset represents a binary classification supervised learning problem. It contains 4521 records with 17 variables including both numerical and categorical features. The data is imbalanced, contains some outliers and shows weak correlations between variables. These characteristics influence the choice of algorithms.
I would recommend Random Forest and Logistic Regression since the dataset is:
imbalanced, as evident in the dependent variable itself,
moderately sized, i.e. 4521 rows with 17 variables, and
composed of both numerical and categorical variables.
Logistic Regression
Pros:
Interpretable and simple; the coefficients represent the direction and strength of each feature's effect.
Appropriate for binary classification (target y: subscribed or not subscribed).
Quick to train, even on large datasets.
Cons:
Assumes a linear relationship between features and the log-odds.
Sensitive to multicollinearity.
Has difficulty with complex non-linear patterns.
Random Forest
Pros:
Handles non-linear relationships and interactions between features well.
Robust to noisy features and outliers.
Can handle both numeric and categorical variables without manual intervention.
Cons:
Not as interpretable as logistic regression.
Slower to train on very large datasets.
Prone to overfitting unless properly tuned.
I would recommend Random Forest as the primary model because:
It handles the imbalanced dataset better, especially when combined with techniques such as oversampling or class weights.
It is robust to outliers and skewed variables, both of which are present in the dataset.
It manages both numerical and categorical features without heavy preprocessing.
The dataset is not high-dimensional (17 features), so the difficulty random forest can have with very high-dimensional data is not a concern here.
I might also use logistic regression, but the dataset has many outliers that would need to be handled in preprocessing. I also briefly considered Naive Bayes, but many of the numerical features are not normally distributed, and the features are most likely not independent: job, age, education, marital status, and housing loan are likely to be correlated. Random Forest also supports business insight and campaign strategy design through its variable-importance measures.
Yes. The dependent variable is binary and labeled, so I mainly considered supervised learning algorithms.
The dataset has a binary target and a mix of numeric and categorical features, some with outliers and non-linear patterns. Random Forest handles these characteristics well for accurate predictions, while Logistic Regression offers interpretability for understanding feature impact.
Yes: random forest may overfit when trained on small datasets. In such a case, simpler models such as a single decision tree or k-nearest neighbors (kNN) could be considered.
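To make the recommendation concrete, here is a minimal, hedged sketch of fitting both candidate models; the randomForest package, the seed, and ntree = 500 are my assumptions rather than part of the assignment, and no train/test split is shown.
# Fit both candidate models (sketch only)
library(randomForest)

set.seed(42)
rf_model <- randomForest(y ~ ., data = bank_data, ntree = 500, importance = TRUE)
print(rf_model)       # OOB error estimate and confusion matrix
varImpPlot(rf_model)  # variable importance supports campaign-strategy insight

glm_model <- glm(y ~ ., data = bank_data, family = binomial)
summary(glm_model)    # coefficient signs show the direction of feature effects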
Preprocessing System
Having completed the EDA and selected an algorithm, what pre-processing (if any) is required? Each aspect is addressed in turn below.
Outlier Detection:
Many outliers are present, as shown by scatterplots and the 1.5×IQR method. Since Random Forest is robust to outliers, they do not pose a significant issue for model performance.
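Random Forest therefore needs no outlier treatment, but if a sensitive model such as logistic regression were used instead, extreme values could be capped (winsorized) at the same 1.5 × IQR fences used for detection. The cap_iqr helper below is an illustrative sketch, not part of the assignment code.
# Cap values at the 1.5 * IQR fences used earlier for outlier detection
cap_iqr <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}
summary(cap_iqr(bank_data$balance))  # extreme balances are pulled to the fences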
Dimensionality Reduction:
Generally, random forest is not sensitive to high-dimensional data: it can handle large numbers of features without overfitting, selects relevant features implicitly through bagging and random splits, and does not require feature scaling.
If another model were used, such as logistic regression, dimensionality reduction could be achieved via PCA and/or feature selection, as sketched below.
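For completeness, a hedged PCA sketch on the standardized numeric features using base R's prcomp; this would only be relevant if a linear model were chosen.
# PCA on scaled numeric features; summary() reports variance explained per component
pca <- prcomp(select(bank_data, where(is.numeric)), center = TRUE, scale. = TRUE)
summary(pca)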
The numerical and categorical correlation matrices below display mild correlations between pdays and previous, job and education, housing and month, and contact and month, but not enough correlation to justify dimensionality reduction.
ggcorrplot(correlation_matrix,
method = "circle",
type = "lower",
lab = TRUE,
title = "Correlation Matrix of Numeric Variables")
Most numeric variables show weak correlations; the only moderate association is between pdays and previous, indicating past contacts influence future contact patterns.
ggcorrplot(cramers_matrix,
method = "circle",
type = "lower",
lab = TRUE,
title = "Cramers V Correlation Between Categorical Variables")
Most categorical variables exhibit weak associations. A few mild correlations, such as job & education and contact & month, suggest some dependency, but overall the features are largely independent.
Feature Engineering:
Although random forest can handle this sort of dataset as-is, a few variables could be improved upon. Time-based features such as month can be engineered at a seasonal level, and the housing and personal loan features can be combined into an all-encompassing variable denoting the presence of any loan, as demonstrated earlier with the season and has_any_loan features.
Class Imbalance:
There is significant imbalance in the target feature y. Since the dataset is a moderate size of ~4,500 rows, undersampling could be used to reduce the majority class (no) to match the minority class (yes), as sketched below. If the dataset were small, i.e. fewer than 1,000 rows, oversampling or SMOTE might be utilized to address the imbalance.
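A minimal undersampling sketch, assuming y is a factor with levels "no"/"yes"; the seed and the dplyr approach are illustrative.
# Downsample the majority class ("no") to the size of the minority class ("yes")
set.seed(123)
yes_rows <- bank_data %>% filter(y == "yes")
no_rows  <- bank_data %>% filter(y == "no") %>% slice_sample(n = nrow(yes_rows))
bank_balanced <- bind_rows(yes_rows, no_rows)
table(bank_balanced$y)  # classes now equal in size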
Encoding and Transformation:
Depending on the implementation, random forest may not handle categorical variables directly, so one-hot encoding (for features with few categories) and label encoding (for features with many categories) can be used to transform them into numerical values. Additionally, highly skewed features could be transformed via methods such as a log transformation. Regularization is not needed, since random forest selects features automatically and is not sensitive to multicollinearity.
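A base-R sketch of the encoding and transformation steps, assuming the bank_balanced data frame from the undersampling sketch above; model.matrix performs the one-hot expansion.
# Build the one-hot design matrix for all predictors ("- 1" drops the intercept)
X <- model.matrix(y ~ . - 1, data = bank_balanced)
dim(X)
# Log-transform a highly skewed, non-negative feature such as duration
bank_balanced$duration_log <- log1p(bank_balanced$duration)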
Conclusion
The Exploratory Data Analysis of the Bank Marketing Dataset reveals significant patterns and insights for predictive modeling. The numerical variables show skewness and outliers, and the categorical variables show imbalances, particularly in marital status, education, and job. The target variable is significantly imbalanced, with the majority of clients not subscribing to term deposits. Random Forest and Logistic Regression are valid model choices, with Random Forest dealing effectively with the mix of variable types, outliers, and non-linear relationships. Preprocessing would include categorical encoding, handling class imbalance, and optional feature engineering. Overall, this analysis provides a strong foundation for building models that optimize campaign strategy and drive term deposit subscriptions.