DATA 622 : Exploratory Data Analysis

Author: Rupendra Shrestha | 25 Sep 2025

Introduction

Exploratory Data Analysis (EDA) is an essential preliminary step in any data science endeavour, since it lays the groundwork for developing sound and efficient machine learning models. EDA examines data structure, patterns, and quality before model development begins. It enables us to identify missing data points, outliers, and class imbalance, while at the same time highlighting significant relationships between variables. As the saying goes, “better data beats better algorithms,” emphasizing the value of quality data over model sophistication.

In this assignment, I perform exploratory analysis on the selected dataset, attending to both data preparation and insight extraction. Since this is an exploratory assignment, I display the errors and warnings encountered during analysis to better understand data issues. Executing the code on different datasets also allows us to compare results, highlight differences in distributions, and check how the properties of the data affect subsequent analysis.

By the end of this assignment, my aim is not only to summarize significant findings but also to establish the data pre-processing steps that will be instrumental in predictive modeling: data cleansing, feature engineering, transformation, and correction of class imbalance. Ultimately, this exercise shows how diligent examination of data lays the foundation for better, more interpretable machine learning models.

Data Set

A Portuguese bank conducted a marketing campaign (phone calls) to predict whether a client would subscribe to a term deposit. The records of these efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in its next campaign. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing

Assignment

Review the structure and content of the data and answer questions such as:

  1. Are the features (columns) of your data correlated?

  2. What is the overall distribution of each variable?

  3. Are there any outliers present?

  4. What are the relationships between different variables?

  5. How are categorical variables distributed?

  6. Do any patterns or trends emerge in the data?

  7. What is the central tendency and spread of each variable?

  8. Are there any missing values and how significant are they?

Load Data
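The code in this report assumes the following packages are loaded, typically in a setup chunk; the list is inferred from the functions used throughout (kable, str_to_lower, describe, the viridis scales, ggcorrplot, cramerV, and the tidyverse verbs):

library(knitr)       # kable()
library(tidyverse)   # dplyr, tidyr, stringr, forcats, purrr, ggplot2
library(psych)       # describe()
library(viridis)     # viridis color scales
library(ggcorrplot)  # correlation heatmaps
library(rcompanion)  # cramerV()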

# Read the Portuguese bank marketing data (semicolon-separated CSV)
bank <- read.csv("bank.csv", sep = ";")

# Preview the first few rows of the dataset
kable(head(bank, 10), caption = "Preview of the Bank Dataset")
Preview of the Bank Dataset
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
59 blue-collar married secondary no 0 yes no unknown 5 may 226 1 -1 0 unknown no
35 management single tertiary no 747 no no cellular 23 feb 141 2 176 3 failure no
36 self-employed married tertiary no 307 yes no cellular 14 may 341 1 330 2 other no
39 technician married secondary no 147 yes no cellular 6 may 151 2 -1 0 unknown no
41 entrepreneur married tertiary no 221 yes no unknown 14 may 57 2 -1 0 unknown no
43 services married primary no -88 yes yes cellular 17 apr 313 1 147 2 failure no

This open-source dataset comes from a Portuguese bank’s marketing campaign, which used phone calls to determine whether customers would subscribe to a term deposit. My objective is to apply machine learning techniques to analyze this data and identify the most effective strategies that can help the bank increase the subscription rate in future campaigns. The data can be downloaded at https://archive.ics.uci.edu/dataset/222/bank+marketing

Data Overview

Check the structure of the dataset (rename columns and check for missing values):
# Rename columns (make it lowercase and remove spaces)
colnames(bank) <- str_to_lower(gsub(" ", "_", colnames(bank)))

# Check for missing values
colSums(is.na(bank))  
      age       job   marital education   default   balance   housing      loan 
        0         0         0         0         0         0         0         0 
  contact       day     month  duration  campaign     pdays  previous  poutcome 
        0         0         0         0         0         0         0         0 
        y 
        0 
# Handle missing values
# Remove rows with any NA
bank_data <- bank %>% drop_na()

str(bank_data)
'data.frame':   4521 obs. of  17 variables:
 $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
 $ job      : chr  "unemployed" "services" "management" "management" ...
 $ marital  : chr  "married" "married" "single" "married" ...
 $ education: chr  "primary" "secondary" "tertiary" "tertiary" ...
 $ default  : chr  "no" "no" "no" "no" ...
 $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
 $ housing  : chr  "no" "yes" "yes" "yes" ...
 $ loan     : chr  "no" "yes" "no" "yes" ...
 $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
 $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
 $ month    : chr  "oct" "may" "apr" "jun" ...
 $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
 $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
 $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
 $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
 $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
 $ y        : chr  "no" "no" "no" "no" ...

Exploratory Data Analysis

Review the structure and content of the data and answer questions such as:

1.1) Are the features (columns) of your data correlated?

cat("There are", ncol(bank_data), "columns and", nrow(bank_data), "rows.",
      "Y is the target variable which denotes whether a client subscribes to a term deposit.",
      "There are", ncol(bank_data) - 1, "features as listed below (not including the target variable):")
There are 17 columns and 4521 rows. Y is the target variable which denotes whether a client subscribes to a term deposit. There are 16 features as listed below (not including the target variable):
colnames(bank_data)
 [1] "age"       "job"       "marital"   "education" "default"   "balance"  
 [7] "housing"   "loan"      "contact"   "day"       "month"     "duration" 
[13] "campaign"  "pdays"     "previous"  "poutcome"  "y"        

1.2) What is the overall distribution of each variable?

Exploratory Data Analysis (EDA) is the first step in understanding a dataset. It helps uncover the main characteristics, detect patterns, and identify potential anomalies before applying advanced modeling. A key part of EDA is examining the overall distribution of each variable.

a] NUMERICAL FEATURE DISTRIBUTION: Distributions show whether the data is normally distributed, skewed, or contains outliers. Tools like histograms, density plots, and boxplots provide clear visual summaries.

b] CATEGORICAL FEATURE DISTRIBUTION: Frequency counts and bar plots reveal how data is grouped across different categories, helping to identify imbalances or dominant classes.

Numerical Feature Distribution

Pdays – Highly right-skewed, with median = -1 and mean ≈ 40. Since 75% of values are -1, most clients were not previously contacted. The large gap between mean and quartiles indicates outliers.

Previous – Right-skewed with median = 0 and mean ≈ 0.6. Most clients were not contacted before, and the higher mean suggests some outliers.

Campaign – Right-skewed. Median = 2, mean ≈ 2.8, with 50% of data between 1 and 3. Outliers present due to a higher mean than median.

Day – Approximately normal, with mean and median ≈ 16. The middle 50% of values fall between 9 and 21.

Age – Slight right skew. Half of clients are between 33 and 49 years old.

Duration – Right-skewed. Most calls were short, though some lasted much longer, indicating outliers.

Balance – Strong right skew. Median = 444, mean ≈ 1423. Most clients have low balances, but a few very high values pull the mean upward.

# Separating numerical features
bank_data_numerical<-bank_data %>%
  select(where(is.numeric))   

# Using describe() from the psych package for summary stats
describe(bank_data_numerical)
         vars    n    mean      sd median trimmed    mad   min   max range skew
age         1 4521   41.17   10.58     39   40.48  10.38    19    87    68 0.70
balance     2 4521 1422.66 3009.64    444  802.41 658.27 -3313 71188 74501 6.59
day         3 4521   15.92    8.25     16   15.80  10.38     1    31    30 0.09
duration    4 4521  263.96  259.86    185  216.44 143.81     4  3025  3021 2.77
campaign    5 4521    2.79    3.11      2    2.14   1.48     1    50    49 4.74
pdays       6 4521   39.77  100.12     -1   11.56   0.00    -1   871   872 2.72
previous    7 4521    0.54    1.69      0    0.12   0.00     0    25    25 5.87
         kurtosis    se
age          0.35  0.16
balance     88.25 44.76
day         -1.04  0.12
duration    12.51  3.86
campaign    37.11  0.05
pdays        7.94  1.49
previous    51.91  0.03
# Using summary() for interquartile calculations
summary(bank_data_numerical)
      age           balance           day           duration   
 Min.   :19.00   Min.   :-3313   Min.   : 1.00   Min.   :   4  
 1st Qu.:33.00   1st Qu.:   69   1st Qu.: 9.00   1st Qu.: 104  
 Median :39.00   Median :  444   Median :16.00   Median : 185  
 Mean   :41.17   Mean   : 1423   Mean   :15.92   Mean   : 264  
 3rd Qu.:49.00   3rd Qu.: 1480   3rd Qu.:21.00   3rd Qu.: 329  
 Max.   :87.00   Max.   :71188   Max.   :31.00   Max.   :3025  
    campaign          pdays           previous      
 Min.   : 1.000   Min.   : -1.00   Min.   : 0.0000  
 1st Qu.: 1.000   1st Qu.: -1.00   1st Qu.: 0.0000  
 Median : 2.000   Median : -1.00   Median : 0.0000  
 Mean   : 2.794   Mean   : 39.77   Mean   : 0.5426  
 3rd Qu.: 3.000   3rd Qu.: -1.00   3rd Qu.: 0.0000  
 Max.   :50.000   Max.   :871.00   Max.   :25.0000  
# Convert to long format for plotting
bank_data_long <- bank_data_numerical %>%
  select(where(is.numeric)) %>%   
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  mutate(value = round(as.numeric(value), 0))  

# Plot histograms
p <- bank_data_long %>%
  mutate(variable = fct_reorder(variable, value)) %>%  
  ggplot(aes(x = value, color = variable, fill = variable)) +
  geom_histogram(alpha = 0.6, bins = 30) +  # a fixed bin count suits the free per-facet scales better than binwidth = 5
  scale_fill_viridis(discrete = TRUE) +   
  scale_color_viridis(discrete = TRUE) + 
  theme_minimal() +   
  theme(
    legend.position = "none", 
    panel.spacing = unit(0.1, "lines"),  
    strip.text.x = element_text(size = 8)
  ) +
  xlab("Value") +
  ylab("Frequency") +  
  ggtitle("Distribution of Numerical Features ")+
  facet_wrap(~variable, scales = "free")  

# Print
print(p)

Categorical Feature Distribution

Job – Fairly spread across categories, though blue-collar, management, and technician appear most often.

Marital – Majority are married, nearly double the single group and about 5× the divorced group, showing a strong skew toward married clients.

Education – Secondary education dominates, roughly twice tertiary and three times primary, making the feature skewed toward secondary.

Default – Overwhelming majority have no credit default, with very little variation.

Housing – Close split: ~55% have a housing loan, ~45% do not.

Loan – Most do not have a personal loan (~80% no vs. ~20% yes).

Contact – Majority of contacts were made via cell phone.

Month – Calls are unevenly distributed, with May showing the highest concentration.

Poutcome – Unknown dominates, while among known outcomes, failure is more common than success.

Y (Term Deposit Subscription) – Strong imbalance: most clients did not subscribe (~4,000 no vs. ~520 yes).

# Separating categorical features
bank_data_categorical<-bank_data %>%
  select(where(is.character))   

# Using describe() for summary stats
describe(bank_data_categorical)
           vars    n mean   sd median trimmed  mad min max range  skew kurtosis
job*          1 4521 5.41 3.26      5    5.33 4.45   1  12    11  0.24    -1.26
marital*      2 4521 2.15 0.60      2    2.18 0.00   1   3     2 -0.07    -0.35
education*    3 4521 2.23 0.75      2    2.24 0.00   1   4     3  0.19    -0.28
default*      4 4521 1.02 0.13      1    1.00 0.00   1   2     1  7.51    54.48
housing*      5 4521 1.57 0.50      2    1.58 0.00   1   2     1 -0.27    -1.93
loan*         6 4521 1.15 0.36      1    1.07 0.00   1   2     1  1.93     1.72
contact*      7 4521 1.65 0.90      1    1.57 0.00   1   3     2  0.74    -1.36
month*        8 4521 6.54 3.00      7    6.71 2.97   1  12    11 -0.50    -0.98
poutcome*     9 4521 3.56 0.99      4    3.82 0.00   1   4     3 -1.96     2.10
y*           10 4521 1.12 0.32      1    1.02 0.00   1   2     1  2.41     3.80
             se
job*       0.05
marital*   0.01
education* 0.01
default*   0.00
housing*   0.01
loan*      0.01
contact*   0.01
month*     0.04
poutcome*  0.01
y*         0.00
# Transform to long
bank_data_long <- bank_data_categorical %>%
  pivot_longer(everything(), names_to = "variable", values_to = "category")  

# Plot bar plots
p <- bank_data_long %>%
  mutate(variable = as_factor(variable)) %>%  # fct_reorder would error here: category is not numeric
  ggplot(aes(x = category, color = variable, fill = variable)) +
  geom_bar(alpha = 0.6) +  # geom_bar counts rows; binwidth applies only to histograms
  scale_fill_viridis(discrete = TRUE) +   
  scale_color_viridis(discrete = TRUE) + 
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +  
  theme_minimal() +   
  theme(
    legend.position = "none",  
    panel.spacing = unit(0.5, "lines"),  
    strip.text.x = element_text(size = 8),  
    axis.text.x = element_text(angle = 45, hjust = 1.0, size = 7.5),  
    axis.text.y = element_text(size = 10),  
    plot.margin = ggplot2::margin(10, 10, 10, 10, unit = "pt"), 
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5) 
  ) +
  xlab("Categories") +
  ylab("Frequency") +  
  ggtitle("Categorical Feature Distributions") +  
  facet_wrap(~variable, scales = "free_x", ncol = 5)



# Print 
print(p)

Conclusion

The categorical distributions show that most clients are married, work in blue-collar, management, or technician jobs, and have secondary education. Most do not have personal loans or credit defaults, and cell phones were the main contact method. Past campaign outcomes are mostly unknown, and the target variable Y is heavily imbalanced, with most clients not subscribing.

1.3) Are there any outliers present?

Outliers were investigated via three methods:

  • Summary Statistics & Histograms

The distributions of the features and target variable in the previous step point to outliers in the following numerical features/variables:

Pdays: Median and quartiles = –1, while the mean (~40) suggests outliers.

Previous: Quartiles and median = 0, but the mean (0.54) indicates that some values must be very high.

Campaign: Mean (2.8) > median (2), which points to potential outliers.

Duration: Mostly short calls, but a few significantly longer calls are outliers.

Balance: Extreme difference between mean (1423) and median (444) with a wide IQR, reflecting immensely large balances for some customers.

  • Scatterplots

Visual inspection confirmed the presence of extreme values across most numerical features, with the exception of day, which showed a more stable distribution.

  • 1.5 x IQR Rule

Using the standard 1.5 x IQR rule, outliers were identified for nearly all numerical features, with the exception of day. The counts of detected outliers (see Table 1 below) are:

Age: 38

Balance: 506

Campaign: 318

Duration: 330

Pdays: 816

Previous: 816

# scatterplots for outliers
numeric_data <- bank_data %>%
  select(where(is.numeric))

# Pivot the data to long format for plotting
numeric_data_long <- numeric_data %>%
  mutate(id = row_number()) %>%  
  pivot_longer(cols = -id, names_to = "variable", values_to = "value")

# Create scatterplots for each numeric variable
ggplot(numeric_data_long, aes(x = id, y = value)) +
  geom_point(alpha = 0.6, color = "darkblue") +  
  facet_wrap(~variable, scales = "free", ncol = 5) + 
  theme_minimal() +
  theme(
    strip.text.x = element_text(size = 10), 
    axis.text.x = element_text(angle = 45, hjust = 1)  
  ) +
  labs(
    x = "ID (Row Number)", 
    y = "Value", 
    title = "Scatterplots of Numerical Variables"
  )

Conclusion: Outliers are present in most numerical features, especially balance and duration. While these extreme values may reflect real-world customer behavior rather than errors, they should be carefully considered in further modeling steps (e.g., using robust scaling or transformation).

# IQR Calculation
calculate_iqr_outliers <- function(df, col_name) {
  Q1 <- quantile(df[[col_name]], 0.25, na.rm = TRUE)  # 25th 
  Q3 <- quantile(df[[col_name]], 0.75, na.rm = TRUE)  # 75th 
  IQR <- Q3 - Q1  # Interquartile range
  
  # Calculate outlier threshold
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  
  # Identify outliers 
  df %>%
    filter(df[[col_name]] < lower_bound | df[[col_name]] > upper_bound) %>%  # Keep only outliers
    mutate(variable = col_name) %>%  
    select(variable) 
}

# Apply the function to all numeric variables and count outliers
outlier_counts <- bank_data %>%
  select(where(is.numeric)) %>%  
  names() %>%  
  map_df(~calculate_iqr_outliers(bank_data, .x)) %>%  
  group_by(variable) %>%  
  summarise(outlier_count = n())  

# Print the outlier counts for each feature
cat("\n**Table 1: Count of Outliers in Each Numeric Variable**\n")

**Table 1: Count of Outliers in Each Numeric Variable**
kable(outlier_counts, caption = "Count of Outliers in Each Numeric Variable")
Count of Outliers in Each Numeric Variable
variable outlier_count
age 38
balance 506
campaign 318
duration 330
pdays 816
previous 816

1.4) What are the relationships between different variables?

Numerical Variables

The correlation matrix suggests that most numerical features do not have strong linear relationships. Only a few show mild or weak associations:

  • Mild Relationship

Pdays and Previous (r = 0.45): Indicates that clients previously contacted are more likely to be contacted again soon.

  • Weak Relationships

Day and Duration (r = 0.16): A weak positive trend, suggesting calls tend to last slightly longer on later days.

Age and Balance (r = 0.10): A weak relationship where older clients may have marginally higher balances.

All other correlations among numerical features are close to zero, showing little to no linear association.
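As a quick check, a small sketch (my own addition) can list the variable pairs behind these figures directly from the correlation matrix:

# Keep each pair once and report correlations above a small threshold
cm <- cor(select(bank_data, where(is.numeric)))
cm[upper.tri(cm, diag = TRUE)] <- NA
cor_pairs <- as.data.frame(as.table(cm))
subset(cor_pairs, !is.na(Freq) & abs(Freq) > 0.1)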

Categorical Variables

Correlations among categorical variables are generally weak, though a few mild associations are observed:

  • Mild Relationships

Job and Education – Correlation of 0.46

Housing and Month – Correlation of 0.50

Contact and Month – Correlation of 0.51

  • Weak Relationships

All other correlations are weak.

# --- Correlation Matrix for Numeric Variables ---
numeric_data <- bank_data %>% select(where(is.numeric))

# Correlation matrix
correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")

# corr matrix
ggcorrplot(correlation_matrix, 
           method = "circle", 
           type = "lower", 
           lab = TRUE, 
           title = "Correlation Matrix of Numeric Variables")

# Example 1: Combine housing and personal loan
bank_data$has_any_loan <- ifelse(bank_data$housing == "yes" | bank_data$loan == "yes", 1, 0)
table(bank_data$has_any_loan)

   0    1 
1677 2844 
# Example 2: Create seasonal feature from month
bank_data$season <- ifelse(bank_data$month %in% c("dec","jan","feb"), "Winter",
                     ifelse(bank_data$month %in% c("mar","apr","may"), "Spring",
                     ifelse(bank_data$month %in% c("jun","jul","aug"), "Summer","Fall")))
table(bank_data$season)

  Fall Spring Summer Winter 
   521   1740   1870    390 

Conclusion:

In general, numerical variables tend to have weak or no relationships, except for the notable one between pdays and previous, which reflects repeated follow-ups with clients. Similarly, categorical variables have primarily weak relationships, though there are mild associations between job and education, housing and month, and contact and month.

From a business-domain perspective, these insights are the foundation upon which new composite features are built. Combining pdays and previous into a repeated_contact feature more accurately represents follow-up activity, whereas has_any_loan and season signal client financial stress and campaign timing. Poorly correlated variables may still provide complementary predictive power when designed with careful foresight, making them valuable model inputs; a sketch of repeated_contact follows below.
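A minimal sketch of the repeated_contact feature named above (the name and encoding are my assumptions; pdays = -1 marks clients never previously contacted). It is kept outside bank_data so later outputs are unaffected:

# Flag clients with any prior contact (pdays > -1 or previous > 0)
repeated_contact <- as.integer(bank_data$pdays > -1 | bank_data$previous > 0)
table(repeated_contact)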

# Select categorical variables
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]

# Function for Cramers
cramers_v_matrix <- function(df, cat_vars) {
  n <- length(cat_vars)
  result <- matrix(0, n, n, dimnames = list(cat_vars, cat_vars))
  
  for (i in 1:n) {
    for (j in i:n) {
      if (i == j) {
        result[i, j] <- 1  
      } else {
        result[i, j] <- cramerV(df[[cat_vars[i]]], df[[cat_vars[j]]])
        result[j, i] <- result[i, j]  
      }
    }
  }
  return(result)
}

cramers_matrix <- cramers_v_matrix(bank_data, categorical_cols)
cramers_df <- as.data.frame(cramers_matrix)
ggcorrplot(cramers_matrix, 
           method = "circle", 
           type = "lower",     
           lab = TRUE, 
           title = "Cramers V Correlation Between Categorical Variables")

Conclusion:

The Cramér’s V correlation matrix shows that most categorical variables have weak associations with each other. A few mild relationships are observed, such as between job and education and between contact method and month, but overall the categorical features are largely independent.

1.5) How are categorical variables distributed?

Boxplots are used to examine the relationships between numeric and categorical variables. They allow us to visually assess differences in distributions, detect outliers, and identify patterns across categories.

# Convert categorical variables to factors 
bank_data <- bank_data %>%
  mutate(across(where(is.character), as.factor))

# Combine all numeric-categorical pairs into one long dataframe
numeric_vars <- names(select(bank_data, where(is.numeric)))
categorical_vars <- names(select(bank_data, where(is.factor)))

# Long format
bank_data_long <- bank_data %>%
  pivot_longer(cols = all_of(numeric_vars), names_to = "Numeric_Variable", values_to = "Value")

# Boxplots 
ggplot(bank_data_long, aes(x = .data[[categorical_vars[1]]], y = Value, fill = .data[[categorical_vars[1]]])) +
  geom_boxplot(alpha = 0.7) +  
  facet_wrap(~Numeric_Variable, scales = "free") +  
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels
    legend.position = "none"  # Remove legend if not needed
  ) +
  labs(
    x = "Categorical Variable",
    y = "Value",
    title = "Boxplots of Numeric Variables by Categorical Groups"
  )

Conclusion

The boxplots reveal that numeric variables vary across categorical groups, with some features showing clear differences in median values or spread. Outliers are also apparent in several numeric features, highlighting the influence of extreme values within certain categories.

1.6) Do any patterns or trends emerge in the data?

The most obvious patterns are:

  • The dataset is highly imbalanced, as demonstrated by the bias in the dependent variable: a majority of clients have not subscribed (~520 yes vs. ~4,000 no). A quick check follows this list.
  • Many of the features are not normally distributed.
  • Many of the features do not have a linear relationship with the dependent variable.
  • Feature independence is most likely violated. The features job, age, education, marital status, and housing loan are likely to be correlated.
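A one-line check of that imbalance (my addition):

# Class counts and proportions for the target variable
table(bank_data$y)
round(prop.table(table(bank_data$y)), 3)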

1.7) What is the central tendency and spread of each variable?

The central tendency of each numerical variable is denoted via mean, median, and mode. For categorical variables, the mode is reported.

The spread of numerical variables is denoted by standard deviation, variance, IQR, and range (max/min).

The spread of categorical variables is captured via the frequency and percentage of each category.


# Separate into numeric and categorical variables
numeric_vars <- bank_data %>%
  select(where(is.numeric))

categorical_vars <- bank_data %>%
  select(where(is.character) | where(is.factor))

# Function to calculate summary statistics for numeric variables 
numeric_summary <- numeric_vars %>%
  summarise_all(list(
    Mean = mean,
    Median = median,
    SD = sd,
    Variance = var,
    IQR = IQR,
    Range = ~max(., na.rm = TRUE) - min(., na.rm = TRUE)
  ), na.rm = TRUE) %>%
  pivot_longer(cols = everything(), names_to = c("Variable", "Metric"), names_sep = "_") %>%
  pivot_wider(names_from = Metric, values_from = value)

# Function to compute mode separately 
compute_mode <- function(x) {
  tab <- table(x)
  names(tab)[which.max(tab)]  
}

mode_summary <- numeric_vars %>%
  summarise_all(list(Mode = compute_mode)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Mode")

# Merge numeric summary with mode 
numeric_summary <- left_join(numeric_summary, mode_summary, by = "Variable")

# Function to find mode for categorical variables
categorical_summary <- categorical_vars %>%
  summarise_all(list(
    Mode = compute_mode
  )) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Mode")

# Display numeric summary
print(numeric_summary)  
# A tibble: 8 × 9
  Variable Mean      any       Median    SD        Variance  IQR    Range  Mode 
  <chr>    <list>    <list>    <list>    <list>    <list>    <list> <list> <chr>
1 age      <dbl [1]> <NULL>    <dbl [1]> <dbl [1]> <dbl [1]> <dbl>  <dbl>  <NA> 
2 balance  <dbl [1]> <NULL>    <dbl [1]> <dbl [1]> <dbl [1]> <dbl>  <dbl>  <NA> 
3 day      <dbl [1]> <NULL>    <dbl [1]> <dbl [1]> <dbl [1]> <dbl>  <dbl>  <NA> 
4 duration <dbl [1]> <NULL>    <dbl [1]> <dbl [1]> <dbl [1]> <dbl>  <dbl>  <NA> 
5 campaign <dbl [1]> <NULL>    <dbl [1]> <dbl [1]> <dbl [1]> <dbl>  <dbl>  <NA> 
6 pdays    <dbl [1]> <NULL>    <dbl [1]> <dbl [1]> <dbl [1]> <dbl>  <dbl>  <NA> 
7 previous <dbl [1]> <NULL>    <dbl [1]> <dbl [1]> <dbl [1]> <dbl>  <dbl>  <NA> 
8 has      <NULL>    <dbl [6]> <NULL>    <NULL>    <NULL>    <NULL> <NULL> <NA> 
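The list-columns and <NA> modes above are an artifact of the reshaping step: names_sep = "_" also splits the underscores inside has_any_loan, and the mode table's variable names (e.g. age_Mode) never match numeric_summary$Variable during the join. A sketch of one possible fix, using names_pattern so only the final metric suffix is split off:

# Split variable names only at the final "_<Metric>" suffix (sketch)
numeric_summary_fixed <- numeric_vars %>%
  summarise(across(everything(), list(
    Mean = ~mean(., na.rm = TRUE), Median = ~median(., na.rm = TRUE),
    SD = ~sd(., na.rm = TRUE), IQR = ~IQR(., na.rm = TRUE)
  ))) %>%
  pivot_longer(everything(),
               names_to = c("Variable", "Metric"),
               names_pattern = "(.*)_(Mean|Median|SD|IQR)") %>%
  pivot_wider(names_from = Metric, values_from = value)

# Strip the "_Mode" suffix so the modes join correctly
mode_summary_fixed <- mode_summary %>%
  mutate(Variable = sub("_Mode$", "", Variable))
left_join(numeric_summary_fixed, mode_summary_fixed, by = "Variable")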
# Display categorical summary
print(categorical_summary)  
# A tibble: 11 × 2
   Variable       Mode      
   <chr>          <chr>     
 1 job_Mode       management
 2 marital_Mode   married   
 3 education_Mode secondary 
 4 default_Mode   no        
 5 housing_Mode   yes       
 6 loan_Mode      no        
 7 contact_Mode   cellular  
 8 month_Mode     may       
 9 poutcome_Mode  unknown   
10 y_Mode         no        
11 season_Mode    Summer    
# Function to compute category counts and correct percentages
compute_category_counts <- function(df) {
  df %>%
    pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category") %>%
    group_by(Variable, Category) %>%
    summarise(Count = n(), .groups = "drop") %>%
    group_by(Variable) %>%  
    mutate(Percentage = round((Count / sum(Count)) * 100, 2)) %>%  
    arrange(Variable, desc(Count)) 
}

# Apply function to categorical variables
categorical_vars <- bank_data %>%
  select(where(is.character) | where(is.factor))  

category_counts <- compute_category_counts(categorical_vars)

# Print 
print(category_counts)
# A tibble: 50 × 4
# Groups:   Variable [11]
   Variable  Category  Count Percentage
   <chr>     <fct>     <int>      <dbl>
 1 contact   cellular   2896      64.1 
 2 contact   unknown    1324      29.3 
 3 contact   telephone   301       6.66
 4 default   no         4445      98.3 
 5 default   yes          76       1.68
 6 education secondary  2306      51.0 
 7 education tertiary   1350      29.9 
 8 education primary     678      15   
 9 education unknown     187       4.14
10 housing   yes        2559      56.6 
# ℹ 40 more rows

1.8) Are there any missing values and how significant are they?

There are no missing values.

# Missing values 
missing_summary <- bank_data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
  arrange(desc(Missing_Count))  

# Print 
print(missing_summary)
# A tibble: 19 × 2
   Variable     Missing_Count
   <chr>                <int>
 1 age                      0
 2 job                      0
 3 marital                  0
 4 education                0
 5 default                  0
 6 balance                  0
 7 housing                  0
 8 loan                     0
 9 contact                  0
10 day                      0
11 month                    0
12 duration                 0
13 campaign                 0
14 pdays                    0
15 previous                 0
16 poutcome                 0
17 y                        0
18 has_any_loan             0
19 season                   0

Duplicate or Inconsistent Values

To ensure data quality, I have checked for duplicate rows and inconsistent values.

# Check for duplicate rows
sum(duplicated(bank_data))
[1] 0
# Check for inconsistent categorical values
bank_data_categorical <- bank_data %>% select(where(is.factor))
sapply(bank_data_categorical, unique)
$job
 [1] unemployed    services      management    blue-collar   self-employed
 [6] technician    entrepreneur  admin.        student       housemaid    
[11] retired       unknown      
12 Levels: admin. blue-collar entrepreneur housemaid management ... unknown

$marital
[1] married  single   divorced
Levels: divorced married single

$education
[1] primary   secondary tertiary  unknown  
Levels: primary secondary tertiary unknown

$default
[1] no  yes
Levels: no yes

$housing
[1] no  yes
Levels: no yes

$loan
[1] no  yes
Levels: no yes

$contact
[1] cellular  unknown   telephone
Levels: cellular telephone unknown

$month
 [1] oct may apr jun feb aug jan jul nov sep mar dec
Levels: apr aug dec feb jan jul jun mar may nov oct sep

$poutcome
[1] unknown failure other   success
Levels: failure other success unknown

$y
[1] no  yes
Levels: no yes

$season
[1] Fall   Spring Summer Winter
Levels: Fall Spring Summer Winter

Conclusion

No duplicates or inconsistent values were detected, so no further cleaning is required.

The majority of clients are married with secondary education, which matches typical customer demographics. The right-skewed distributions of duration and balance are also expected, because marketing campaigns tend to target client segments of interest. The large imbalance in the target variable y is likewise consistent, because only a small share of the called clients subscribe to term deposits.

Algorithm/Model Selection

What algorithms would suit the business purpose for this dataset? Answer questions such as:

The dataset represents a binary classification supervised learning problem. It contains 4521 records with 17 variables including both numerical and categorical features. The data is imbalanced, contains some outliers and shows weak correlations between variables. These characteristics influence the choice of algorithms.

  1. Select two or more machine learning algorithms presented so far that could be used to train a model.

I would recommend using Random Forest and Logistic Regression since the dataset:

  • is imbalanced, as evident in the dependent variable itself,

  • is reasonably large, i.e. 4,521 rows with 17 variables, and

  • consists of both numerical and categorical variables.

  2. What are the pros and cons of each algorithm you selected?

a] Logistic Regression

Pros:

  • Interpretable and simple; the coefficients represent direction and strength of feature effect.

  • Appropriate for binary classification (target y: subscribed or not subscribed).

  • Quick to train even for massive datasets.

Cons:

  • Assumes linear relationship between features and log-odds.

  • Sensitive to multicollinearity.

  • Has difficulty with complex non-linear patterns.

b] Random Forest

Pros:

  • Handles non-linear relationships and interactions between features well.

  • Robust to noisy features and outliers.

  • Can handle both numeric and categorical variables without manual intervention.

Cons:

  • Not as interpretable as logistic regression.

  • Slower to train on very large datasets.

  • Prone to overfitting unless properly tuned.

  3. Which algorithm would you recommend, and why?

I would recommend Random Forest as the primary model because:

  • It handles the imbalanced dataset better, especially when combined with techniques such as oversampling or class weights.

  • It is robust to outliers and skewed variables, both of which are present in the dataset.

  • It manages both numerical and categorical features without heavy preprocessing.

  • The dataset is not highly dimensional (17 features). Random forest can struggle with highly dimensional data.

I might also use logistic regression, but the dataset has a lot of outliers that would need to be accounted for in preprocessing. I also briefly considered Naive Bayes, but the dataset has a large number of numerical features that are not normally distributed, and the features are most likely not independent: job, age, education, marital status, and housing loan are likely to be correlated. Random Forest also supports business insight and campaign strategy design. A baseline fit sketch follows below.
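A minimal sketch of baseline fits for both recommended models (assuming the randomForest package; tuning, resampling, and evaluation are omitted):

# Baseline fits: Random Forest and Logistic Regression (sketch only)
library(randomForest)
set.seed(622)
rf_fit  <- randomForest(y ~ ., data = bank_data, ntree = 500)
glm_fit <- glm(y ~ ., data = bank_data, family = binomial)
rf_fit$confusion   # out-of-bag confusion matrix with class errors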

  4. Are there labels in your data? Did that impact your choice of algorithm?

Yes. The dependent variable is binary and labeled. As such, I mainly considered supervised learning algorithms.

  5. How does your choice of algorithm relate to the dataset?

The dataset has a binary target and a mix of numeric and categorical features, some with outliers and non-linear patterns. Random Forest handles these characteristics well for accurate predictions, while Logistic Regression offers interpretability for understanding feature impact.

  6. Would your choice of algorithm change if there were fewer than 1,000 data records, and why?

Yes, because Random Forest may overfit when trained on small datasets. In such a case, simpler models such as a single decision tree or k-nearest neighbors (kNN) could be considered.

Preprocessing System

Having completed the EDA and selected an algorithm, what pre-processing (if any) would be required for:

  1. Data Cleaning - improve data quality, address missing data, etc.

There are no missing values in the dataset, so no imputation is required.

Class Imbalance:

The target variable is highly imbalanced. Models may struggle to predict the minority class “yes,” potentially biasing results toward “no.” This can be addressed using techniques like oversampling the minority class or undersampling the majority class (see the sketch under Sampling Data below).

Outlier Detection:

Many outliers are present, as shown by scatterplots and the 1.5×IQR method. Since Random Forest is robust to outliers, they do not pose a significant issue for model performance.

  2. Dimensionality Reduction - remove correlated/redundant data that will slow down training

Generally, Random Forest is not sensitive to high-dimensional data: it can handle large numbers of features without overfitting, performs implicit feature selection as part of bagging over random feature subsets, and does not require scaling.

If another model were used, such as logistic regression, dimensionality reduction could be achieved via PCA and/or feature selection, as sketched below.
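A minimal PCA sketch on the standardized numeric features (illustrative only; not part of the final pipeline):

# PCA on scaled numeric features
pca <- prcomp(select(bank_data, where(is.numeric)), center = TRUE, scale. = TRUE)
summary(pca)$importance[, 1:5]  # variance explained by the first components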

Below, the numerical and categorical correlation matrices display mild correlations between pdays and previous, job and education, housing and month, and contact and month, but not enough correlation to support dimensionality reduction.

ggcorrplot(correlation_matrix, 
           method = "circle", 
           type = "lower", 
           lab = TRUE, 
           title = "Correlation Matrix of Numeric Variables")

Most numeric variables show weak correlations, with the only moderate association being between pdays and previous, indicating that past contacts influence future contact patterns.

The Cramér’s V correlation matrix is used to assess associations between categorical variables, helping to identify potential relationships or redundancy.
ggcorrplot(cramers_matrix, 
           method = "circle",  
           type = "lower",      
           lab = TRUE, 
           title = "Cramers V Correlation Between Categorical Variables")

Most categorical variables exhibit weak associations. A few mild correlations, such as job & education and contact & month, suggest some dependency, but overall the features are largely independent.

  3. Feature Engineering - use of business knowledge to create new features

Although Random Forest is able to handle this sort of dataset, a few variables could be improved upon. Time-based features such as month can be engineered at a seasonal level, and the housing and personal loan features can be combined into an all-encompassing loan variable denoting the presence of any sort of loan; both were created earlier as season and has_any_loan.

  4. Sampling Data - using sampling to resize datasets

There is significant imbalance in the target feature ‘y’. Since the dataset is a moderate size of ~4,500 rows, undersampling can be used to reduce the majority class (no) to match the minority class (yes). If the dataset were small, i.e. fewer than 1,000 rows, oversampling or SMOTE might be utilized to address the imbalance. A small undersampling sketch follows.
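A minimal undersampling sketch (my addition; it assumes the y levels "yes"/"no" shown earlier):

# Undersample the majority class to match the minority class
set.seed(622)
minority <- bank_data %>% filter(y == "yes")
majority <- bank_data %>% filter(y == "no") %>% slice_sample(n = nrow(minority))
bank_balanced <- bind_rows(minority, majority)
table(bank_balanced$y)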

  5. Data Transformation - regularization, normalization, handling categorical variables

Random Forest implementations in R handle factor variables directly, but for implementations that require numeric input, one-hot encoding (for features with few categories) or label encoding (for features with many categories) can be used to transform the data into numerical values. Additionally, highly skewed features could be transformed via methods such as a log transformation. Regularization is not needed, since Random Forest automatically selects features and is not sensitive to multicollinearity. A brief sketch follows below.
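A brief sketch of both transformations (one-hot encoding via model.matrix, plus log transforms; the signed log for balance is my assumption for handling its negative values):

# One-hot encode all factor predictors (model.matrix drops the response)
X <- model.matrix(y ~ . - 1, data = bank_data)

# log1p handles the right-skewed duration values
log_duration <- log1p(bank_data$duration)
# balance can be negative, so a signed log is one option
log_balance  <- sign(bank_data$balance) * log1p(abs(bank_data$balance))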

  6. Imbalanced Data - reducing the imbalance between classes

As discussed under Data Cleaning and Sampling Data above, the imbalance in y can be reduced by undersampling the majority class (or by oversampling/SMOTE for smaller datasets), so that models do not default to predicting “no.”

Conclusion

The Exploratory Data Analysis of the Bank Marketing Dataset reveals significant patterns and insights for predictive modeling. The numerical variables show skewness and outliers, and the categorical variables show imbalances, particularly in marital status, education, and job. The target variable is significantly imbalanced, with the majority of clients not subscribing to term deposits. Random Forest and Logistic Regression are valid model choices, with Random Forest dealing effectively with the mix of variable types, outliers, and non-linear relationships. Preprocessing would include categorical encoding, handling class imbalance, and optional feature engineering. Overall, this analysis provides a strong foundation for model building to optimize approaches and drive term deposit subscriptions.