Introduction
This assignment focuses on one of the most important aspects of data science: Exploratory Data Analysis (EDA). Many surveys show that data scientists spend 60-80% of their time on data preparation. EDA allows you to identify data gaps and class imbalances, improve data quality, create better features, and gain a deep understanding of your data before training a model - all of which ultimately helps train better models. In machine learning there is a saying, “better data beats better algorithms,” meaning that it is often more productive to spend time improving data quality than refining the code that trains the model.
This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both datasets selected and compare the results.
Dataset
A Portuguese bank conducted a marketing campaign (phone calls) to determine whether clients would subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics to help the bank persuade more customers to subscribe to a term deposit in its next campaign. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
Assignment
PART I: Exploratory Data Analysis
Review the structure and content of the data and answer questions such as:
PART II: Algorithm Selection
Now that you have completed the EDA, which algorithms would suit the business purpose for this dataset? Answer questions such as:
PART III: Pre-processing
Now that you have done the EDA and selected an algorithm, what pre-processing (if any) would you require for:
Write a short essay summarizing your findings. Explain your selection of algorithms and how they relate to the data and what you are trying to do.
This analysis focused on Exploratory Data Analysis (EDA) and algorithm selection for bank marketing data. The dataset contains ~45,000 records and 17 variables, including the target variable (y), which denotes whether a client subscribed to a term deposit.
EDA
There was considerable class imbalance in the target variable (y): ~11% of clients subscribed (yes), while ~88% did not (no). There are seven numerical and ten categorical variables in the dataset. Most numerical features were right-skewed, and many had outliers, detected via the IQR and scatterplots. The correlation matrix showed no strong linear relationships between features, with most variables showing very weak correlations or none at all.
There was no missing data.
Algorithm Selection
Since the target variable is imbalanced, binary, and labeled, supervised learning algorithms were deemed appropriate. Random forest and logistic regression were investigated as potential models. Ultimately, random forest was chosen over logistic regression because:
1. logistic regression would have required extensive data pre-processing and transformations,
2. the relationships between the independent and dependent variables are non-linear (which random forest does not assume),
3. there are many outliers (which random forest can handle),
4. the dataset is not highly dimensional (17 variables; random forest can struggle with highly dimensional data), and
5. random forest implicitly performs feature selection, so feature reduction is not needed.
Pre-Processing
There was no missing data, so no imputation is needed. Random forest can handle outliers, so no modifications to outliers are required. Undersampling was advocated to reduce the dominance of the majority class of the target variable ‘y’. Additionally, random forest was advocated because it implicitly selects features and does not require dimensionality reduction. However, for random forest, categorical variables would need to be transformed, i.e. one-hot encoded. Finally, no feature engineering was strictly required.
Final Model Recommendation: Random Forest
Overall, class imbalance, skewed non-linear distributions, the presence of outliers, the mix of categorical and numerical variables, and feature dependence made random forest the most appropriate initial model with which to analyze the dataset.
# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(ggplot2)
library(ggcorrplot)
library(rcompanion)
##
## Attaching package: 'rcompanion'
##
## The following object is masked from 'package:psych':
##
## phi
library(dplyr)
library(viridis)
## Loading required package: viridisLite
library(hrbrthemes)
library(knitr)
library(skimr)
# Import the CSV file from github
bank_data <- read_csv2("https://raw.githubusercontent.com/greggmaloy/Data622/main/bank-full.csv", show_col_types = FALSE)
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
# View the first few rows
bank_data
## # A tibble: 45,211 × 17
## age job marital education default balance housing loan contact day
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 58 manageme… married tertiary no 2143 yes no unknown 5
## 2 44 technici… single secondary no 29 yes no unknown 5
## 3 33 entrepre… married secondary no 2 yes yes unknown 5
## 4 47 blue-col… married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 manageme… married tertiary no 231 yes no unknown 5
## 7 28 manageme… single tertiary no 447 yes yes unknown 5
## 8 42 entrepre… divorc… tertiary yes 2 yes no unknown 5
## 9 58 retired married primary no 121 yes no unknown 5
## 10 43 technici… single secondary no 593 yes no unknown 5
## # ℹ 45,201 more rows
## # ℹ 7 more variables: month <chr>, duration <dbl>, campaign <dbl>, pdays <dbl>,
## # previous <dbl>, poutcome <chr>, y <chr>
Exploratory Data Analysis: Are the features (columns) of your data correlated?
There are 17 columns and 45,211 rows. “y” is the target variable, which denotes whether a client subscribed to a term deposit. The 16 features (not including the target variable) are listed below:
colnames(bank_data)
## [1] "age" "job" "marital" "education" "default" "balance"
## [7] "housing" "loan" "contact" "day" "month" "duration"
## [13] "campaign" "pdays" "previous" "poutcome" "y"
Exploratory Data Analysis: What is the overall distribution of each variable?
NUMERICAL FEATURE DISTRIBUTION
Pdays – right-skewed, with a median of -1 and a mean of ~40. The 1st and 3rd quartiles are also -1, meaning, according to the data dictionary, that a majority of clients were not previously contacted. This was difficult to ascertain from the small histogram, but the interquartile range and median provided information regarding the skew. The mean of ~40 alludes to the presence of outliers, as the 1st, 2nd and 3rd quartiles (covering at least 75% of rows) were -1 (not previously contacted).
Previous – right-skewed, with most values at zero. Median = 0; 1st and 3rd quartiles = 0, meaning a majority of clients were not previously contacted. The mean of 0.58 alludes to the possible presence of outliers, since it differs from the median of 0.
Campaign – right-skewed, denoting few contacts performed during this campaign. The minimum value of 1 denotes that each client was contacted at least once as part of the present campaign. The median of 2, mean of 2.8, 1st quartile of 1 and 3rd quartile of 3 suggest that 50% of the data falls between 1 and 3. Possible outliers are alluded to, as the mean is higher than the median.
Day – appears to approximate a normal distribution, as denoted by the approximately equal mean and median (~16) and the quartiles (1st = 8, 3rd = 21).
Age – slightly right-skewed, as denoted by the histogram. The interquartile range denotes that 50% of clients are between 33 and 48 years old.
Duration – right-skewed, denoting that most calls were short, though some lasted significantly longer, as denoted by the histogram and IQR.
Balance – right-skewed, as denoted by the large difference between the mean (1362) and median (448). This difference denotes that most clients have lower balances, while a few have extremely high balances. Additionally, the IQR (1st quartile = 72, 3rd quartile = 1428), coupled with the large mean (1362) and relatively small median (448), denotes that some clients have significantly higher balances, which skews the balance feature and alludes to the presence of outliers.
# Separating numerical features
bank_data_numerical<-bank_data %>%
select(where(is.numeric))
# Using describe() for summary stats
describe(bank_data_numerical)
## vars n mean sd median trimmed mad min max range
## age 1 45211 40.94 10.62 39 40.25 10.38 18 95 77
## balance 2 45211 1362.27 3044.77 448 767.21 664.20 -8019 102127 110146
## day 3 45211 15.81 8.32 16 15.69 10.38 1 31 30
## duration 4 45211 258.16 257.53 180 210.87 137.88 0 4918 4918
## campaign 5 45211 2.76 3.10 2 2.12 1.48 1 63 62
## pdays 6 45211 40.20 100.13 -1 11.92 0.00 -1 871 872
## previous 7 45211 0.58 2.30 0 0.13 0.00 0 275 275
## skew kurtosis se
## age 0.68 0.32 0.05
## balance 8.36 140.73 14.32
## day 0.09 -1.06 0.04
## duration 3.14 18.15 1.21
## campaign 4.90 39.24 0.01
## pdays 2.62 6.93 0.47
## previous 41.84 4506.16 0.01
# Using summary() for interquartile calculations
summary(bank_data_numerical)
## age balance day duration
## Min. :18.00 Min. : -8019 Min. : 1.00 Min. : 0.0
## 1st Qu.:33.00 1st Qu.: 72 1st Qu.: 8.00 1st Qu.: 103.0
## Median :39.00 Median : 448 Median :16.00 Median : 180.0
## Mean :40.94 Mean : 1362 Mean :15.81 Mean : 258.2
## 3rd Qu.:48.00 3rd Qu.: 1428 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :95.00 Max. :102127 Max. :31.00 Max. :4918.0
## campaign pdays previous
## Min. : 1.000 Min. : -1.0 Min. : 0.0000
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000
## Median : 2.000 Median : -1.0 Median : 0.0000
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
# Convert to long format for plotting
bank_data_long <- bank_data_numerical %>%
select(where(is.numeric)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
mutate(value = round(as.numeric(value), 0))
# Plot histograms
p <- bank_data_long %>%
mutate(variable = fct_reorder(variable, value)) %>%
ggplot(aes(x = value, color = variable, fill = variable)) +
geom_histogram(alpha = 0.6, binwidth = 5) +
scale_fill_viridis(discrete = TRUE) +
scale_color_viridis(discrete = TRUE) +
theme_minimal() +
theme(
legend.position = "none",
panel.spacing = unit(0.1, "lines"),
strip.text.x = element_text(size = 8)
) +
xlab("") +
ylab("Frequency") +
facet_wrap(~variable, scales = "free")
# Print
print(p)
CATEGORICAL FEATURE DISTRIBUTION
Job – The distribution is fairly spread out, though some jobs (e.g. blue-collar, management, technician) appear more frequently than others, as denoted by the bar chart.
Marital – Most clients were married, as denoted by the bar chart and the mean and median of approximately 2. The number of married clients is roughly double the number of single clients and nearly five times the number of divorced clients. This feature is significantly skewed in favor of married clients.
Education – Most clients have secondary education as their highest education level. The number of clients with a secondary education is approximately double the number with a tertiary education and almost three times the number with a primary education. The feature is skewed in favor of clients whose highest education level is secondary.
Default – A significant majority of clients do not have credit in default. The mean and median of 1 and the low standard deviation (0.13) denote that the vast majority of clients do not have credit in default. The feature is heavily skewed toward clients with no credit default.
Housing – Clients are roughly evenly split on having a housing loan, as denoted by the bar chart: approximately 55% have a housing loan and approximately 45% do not.
Loan – Most clients do not have a personal loan: approximately 20% of clients have a personal loan, while ~80% do not.
Contact – Most clients were contacted via cellular phone, as denoted by the bar chart.
Month – There are significant differences in the months in which clients were called, with May being the month with the most calls, as denoted by the bar chart.
Poutcome – The ‘unknown’ category is significantly larger than the other categories, as denoted by the bar chart. Among clients who participated in past campaigns, failure is more common than success.
Y (subscription to term deposit) – A majority of clients have not subscribed (yes = ~5,000, no = ~40,000), meaning the dataset and target variable are imbalanced.
# Code for categorical variables
# Separating categorical features
bank_data_categorical<-bank_data %>%
select(where(is.character))
# Using describe() for summary stats
describe(bank_data_categorical)
## vars n mean sd median trimmed mad min max range skew
## job* 1 45211 5.34 3.27 5 5.25 4.45 1 12 11 0.26
## marital* 2 45211 2.17 0.61 2 2.21 0.00 1 3 2 -0.10
## education* 3 45211 2.22 0.75 2 2.23 0.00 1 4 3 0.20
## default* 4 45211 1.02 0.13 1 1.00 0.00 1 2 1 7.24
## housing* 5 45211 1.56 0.50 2 1.57 0.00 1 2 1 -0.22
## loan* 6 45211 1.16 0.37 1 1.08 0.00 1 2 1 1.85
## contact* 7 45211 1.64 0.90 1 1.55 0.00 1 3 2 0.77
## month* 8 45211 6.52 3.01 7 6.68 2.97 1 12 11 -0.48
## poutcome* 9 45211 3.56 0.99 4 3.82 0.00 1 4 3 -1.97
## y* 10 45211 1.12 0.32 1 1.02 0.00 1 2 1 2.38
## kurtosis se
## job* -1.27 0.02
## marital* -0.44 0.00
## education* -0.26 0.00
## default* 50.49 0.00
## housing* -1.95 0.00
## loan* 1.43 0.00
## contact* -1.32 0.00
## month* -1.00 0.01
## poutcome* 2.15 0.00
## y* 3.68 0.00
# Transform to long
bank_data_long <- bank_data_categorical %>%
pivot_longer(everything(), names_to = "variable", values_to = "category")
# Plot bar plots
p <- bank_data_long %>%
mutate(variable = fct_reorder(variable, category)) %>%
ggplot(aes(x = category, color = variable, fill = variable)) +
geom_bar(alpha = 0.6) +
scale_fill_viridis(discrete = TRUE) +
scale_color_viridis(discrete = TRUE) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
theme_minimal() +
theme(
legend.position = "none",
panel.spacing = unit(0.5, "lines"),
strip.text.x = element_text(size = 8),
axis.text.x = element_text(angle = 45, hjust = 1.0, size = 7.5),
axis.text.y = element_text(size = 10),
plot.margin = margin(10, 10, 10, 10)
) +
xlab("") +
ylab("Frequency") + # Adjust y-axis label
facet_wrap(~variable, scales = "free_x", ncol = 5)#+ theme(axis.text.y=element_text(size=rel(1.0))) # Limit to 2 grids per row
# Print
print(p)
Exploratory Data Analysis: Are there any outliers present?
Outliers were investigated via three methods:
1. Using the histograms and summary statistics from the previous step, including mean, median, and IQR
2. Via scatterplots
3. Via a threshold of 1.5 x the IQR
Outlier analysis via summary statistics
The distributions of the features and target variable examined in the previous step allude to outliers in the following numerical features/variables:
Pdays – Median value =-1, 1st and 3rd quartiles = -1 and mean of ~40 alludes to the presence of outliers.
Previous – Median value = 0, 1st and 3rd quartiles = 0 and the mean of 0.58 alludes to the possible presence of outliers since the mean is different than the median of 0.
Campaign – The median of 2, mean of 2.8, and IQR (1st IQR = 1 and 3rd IQR=3) allude to possible outliers, as the mean is higher than median.
Duration – Right skewed denoting that most calls were of short length, though some did last significantly longer and are most likely outliers as denoted by the histogram and IQR.
Balance – The IQR (1st= 72, 3rd=1428) coupled with the large mean (1362) and relatively small median (448), denotes that some clients have significantly higher balance which skews the balance feature and alludes to the presence of outliers.
Outlier analysis via scatter plots
Below, scatterplots are plotted to visually inspect the numerical features for outliers. Upon visual inspection, each feature appears to have outliers, with the exception of ‘day’.
Outlier analysis via 1.5 * IQR
Additionally, outliers were calculated for each numerical feature using 1.5 x the IQR. Each feature, with the exception of day, was determined to have outliers (age 487, balance 4729, campaign 3064, duration 3235, pdays 8257, previous 8257).
# scatterplots for outliers
# Select only numerical variables
numeric_data <- bank_data %>%
select(where(is.numeric))
# Pivot the data to long format for plotting
numeric_data_long <- numeric_data %>%
mutate(id = row_number()) %>%
pivot_longer(cols = -id, names_to = "variable", values_to = "value")
# Create scatterplots for each numeric variable
ggplot(numeric_data_long, aes(x = id, y = value)) +
geom_point(alpha = 0.6, color = "darkblue") +
facet_wrap(~variable, scales = "free", ncol = 5) +
theme_minimal() +
theme(
strip.text.x = element_text(size = 10),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
labs(
x = "ID (Row Number)",
y = "Value",
title = "Scatterplots of Numerical Variables"
)
# Using IQR to assess outliers
# IQR Calculation
calculate_iqr_outliers <- function(df, col_name) {
Q1 <- quantile(df[[col_name]], 0.25, na.rm = TRUE) # 25th
Q3 <- quantile(df[[col_name]], 0.75, na.rm = TRUE) # 75th
IQR <- Q3 - Q1 # Interquartile range
# Calculate outlier threshold
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Identify outliers
df %>%
filter(df[[col_name]] < lower_bound | df[[col_name]] > upper_bound) %>% # Keep only outliers
mutate(variable = col_name) %>%
select(variable)
}
# Apply the function to all numeric variables and count outliers
outlier_counts <- bank_data %>%
select(where(is.numeric)) %>%
names() %>%
map_df(~calculate_iqr_outliers(bank_data, .x)) %>%
group_by(variable) %>%
summarise(outlier_count = n())
# Print the outlier counts for each feature
cat("\n**Table 1: Count of Outliers in Each Numeric Variable**\n")
##
## **Table 1: Count of Outliers in Each Numeric Variable**
kable(outlier_counts, caption = "Count of Outliers in Each Numeric Variable")
| variable | outlier_count |
|---|---|
| age | 487 |
| balance | 4729 |
| campaign | 3064 |
| duration | 3235 |
| pdays | 8257 |
| previous | 8257 |
Exploratory Data Analysis: What are the relationships between different variables?
NUMERICAL VARIABLES
The correlation matrix below alludes to relationships among the numeric variables in the dataset. Overall, the numeric variables do not have strong linear relationships. The features pdays and previous have a mild relationship, which might indicate that people who were contacted before are more likely to be contacted again soon.
MILD RELATIONSHIPS
Pdays and Previous – correlation of 0.45: the strongest positive correlation in the matrix, between the number of contacts made before this campaign and the number of days since the client was last contacted in a previous campaign. This appears to be a positive linear correlation.
WEAK RELATIONSHIPS
Day and Duration – correlation of 0.16: suggests a weak positive correlation, with call duration increasing slightly with the day of the month, though the relationship is not strong.
Age and Balance – correlation of 0.1: suggests a weak correlation; older clients may have slightly higher account balances.
Other correlation values for the numerical features are close to zero, denoting no significant correlation.
CATEGORICAL VARIABLES
The Cramér's V matrix below alludes to relationships among the categorical variables in the dataset. Overall, the categorical variables do not have strong associations with each other.
MILD RELATIONSHIPS
Job and Education – Correlation of 0.46
Housing and Month – Correlation of 0.50
Contact and Month – Correlation of 0.51
WEAK RELATIONSHIPS
All other correlations are weak.
# --- Correlation Matrix for Numeric Variables ---
# Select numeric variables
numeric_data <- bank_data %>% select(where(is.numeric))
# Correlation matrix
correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
# corr matrix
ggcorrplot(correlation_matrix,
method = "circle",
type = "lower",
lab = TRUE,
title = "Correlation Matrix of Numeric Variables")
# Cramers
# Select categorical variables
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]
# Function for Cramers
cramers_v_matrix <- function(df, cat_vars) {
n <- length(cat_vars)
result <- matrix(0, n, n, dimnames = list(cat_vars, cat_vars))
for (i in 1:n) {
for (j in i:n) {
if (i == j) {
result[i, j] <- 1
} else {
result[i, j] <- cramerV(df[[cat_vars[i]]], df[[cat_vars[j]]])
result[j, i] <- result[i, j]
}
}
}
return(result)
}
cramers_matrix <- cramers_v_matrix(bank_data, categorical_cols)
cramers_df <- as.data.frame(cramers_matrix)
ggcorrplot(cramers_matrix,
method = "circle",
type = "lower",
lab = TRUE,
title = "Cramers V Correlation Between Categorical Variables")
Exploratory Data Analysis: How are categorical variables distributed?
# Convert categorical variables to factors
bank_data <- bank_data %>%
mutate(across(where(is.character), as.factor))
# Combine all numeric-categorical pairs into one long dataframe
numeric_vars <- names(select(bank_data, where(is.numeric)))
categorical_vars <- names(select(bank_data, where(is.factor)))
# Long format
bank_data_long <- bank_data %>%
pivot_longer(cols = all_of(numeric_vars), names_to = "Numeric_Variable", values_to = "Value")
# Boxplots
ggplot(bank_data_long, aes(x = .data[[categorical_vars[1]]], y = Value, fill = .data[[categorical_vars[1]]])) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~Numeric_Variable, scales = "free") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels
legend.position = "none" # Remove legend if not needed
) +
labs(
x = "Categorical Variable",
y = "Value",
title = "Boxplots of Numeric Variables by Categorical Groups"
)
Exploratory Data Analysis: What is the central tendency and spread of each variable?
Below, the central tendency of each numerical variable is denoted via mean, median, and mode. For categorical variables, the mode is reported.
The spread of the numerical variables is denoted by standard deviation, variance, IQR, and range (max - min).
The spread of the categorical variables is captured via the frequency of each category and its percentage.
# Display numeric summary
# Separate into numeric and categorical variables
numeric_vars <- bank_data %>%
select(where(is.numeric))
categorical_vars <- bank_data %>%
select(where(is.character) | where(is.factor))
# Function to calculate summary statistics for numeric variables
numeric_summary <- numeric_vars %>%
summarise_all(list(
Mean = mean,
Median = median,
SD = sd,
Variance = var,
IQR = IQR,
Range = ~max(., na.rm = TRUE) - min(., na.rm = TRUE)
), na.rm = TRUE) %>%
pivot_longer(cols = everything(), names_to = c("Variable", "Metric"), names_sep = "_") %>%
pivot_wider(names_from = Metric, values_from = value)
# Function to compute mode separately
compute_mode <- function(x) {
tab <- table(x)
names(tab)[which.max(tab)]
}
mode_summary <- numeric_vars %>%
summarise_all(list(Mode = compute_mode)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Mode") %>%
# Strip the "_Mode" suffix so the join below matches numeric_summary's Variable names
mutate(Variable = sub("_Mode$", "", Variable))
# Merge numeric summary with mode
numeric_summary <- left_join(numeric_summary, mode_summary, by = "Variable")
# Function to find mode for categorical variables
categorical_summary <- categorical_vars %>%
summarise_all(list(
Mode = compute_mode
)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Mode")
# Display numeric summary
print(numeric_summary)
## # A tibble: 7 × 8
## Variable Mean Median SD Variance IQR Range Mode
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 age 40.9 39 10.6 113. 15 77 <NA>
## 2 balance 1362. 448 3045. 9270599. 1356 110146 <NA>
## 3 day 15.8 16 8.32 69.3 13 30 <NA>
## 4 duration 258. 180 258. 66321. 216 4918 <NA>
## 5 campaign 2.76 2 3.10 9.60 2 62 <NA>
## 6 pdays 40.2 -1 100. 10026. 0 872 <NA>
## 7 previous 0.580 0 2.30 5.31 0 275 <NA>
# Display categorical summary
print(categorical_summary)
## # A tibble: 10 × 2
## Variable Mode
## <chr> <chr>
## 1 job_Mode blue-collar
## 2 marital_Mode married
## 3 education_Mode secondary
## 4 default_Mode no
## 5 housing_Mode yes
## 6 loan_Mode no
## 7 contact_Mode cellular
## 8 month_Mode may
## 9 poutcome_Mode unknown
## 10 y_Mode no
# Function to compute category counts and correct percentages
compute_category_counts <- function(df) {
df %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category") %>%
group_by(Variable, Category) %>%
summarise(Count = n(), .groups = "drop") %>%
group_by(Variable) %>%
mutate(Percentage = round((Count / sum(Count)) * 100, 2)) %>%
arrange(Variable, desc(Count))
}
# Apply function to categorical variables
categorical_vars <- bank_data %>%
select(where(is.character) | where(is.factor))
category_counts <- compute_category_counts(categorical_vars)
# Print
print(category_counts)
## # A tibble: 46 × 4
## # Groups: Variable [10]
## Variable Category Count Percentage
## <chr> <fct> <int> <dbl>
## 1 contact cellular 29285 64.8
## 2 contact unknown 13020 28.8
## 3 contact telephone 2906 6.43
## 4 default no 44396 98.2
## 5 default yes 815 1.8
## 6 education secondary 23202 51.3
## 7 education tertiary 13301 29.4
## 8 education primary 6851 15.2
## 9 education unknown 1857 4.11
## 10 housing yes 25130 55.6
## # ℹ 36 more rows
Exploratory Data Analysis: Are there any missing values and how significant are they?
There are no missing values.
# Missing values
missing_summary <- bank_data %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
arrange(desc(Missing_Count))
# Print
print(missing_summary)
## # A tibble: 17 × 2
## Variable Missing_Count
## <chr> <int>
## 1 age 0
## 2 job 0
## 3 marital 0
## 4 education 0
## 5 default 0
## 6 balance 0
## 7 housing 0
## 8 loan 0
## 9 contact 0
## 10 day 0
## 11 month 0
## 12 duration 0
## 13 campaign 0
## 14 pdays 0
## 15 previous 0
## 16 poutcome 0
## 17 y 0
Exploratory Data Analysis: Do any patterns or trends emerge in the data?
The most obvious patterns are:
1. The dataset is highly imbalanced as demonstrated by the bias in the dependent variable. A majority of clients have not subscribed (yes =~5,000, no=~40,000).
2. Many of the features are not normally distributed.
3. Many of the features do not have a linear relationship with the dependent variable.
4. Feature independence is most likely violated. The features job, age, education, marital status, and housing loan are likely to be correlated.
Algorithm Selection: Select two or more machine learning algorithms presented so far that could be used to train a model (no need to train models - I am only looking for your recommendations).
I would recommend using Random Forest and Logistic Regression since our dataset is:
1. imbalanced, as evidenced by the imbalance present in the dependent variable itself,
2. large, i.e. ~45,000 rows with 17 variables, and
3. consists of both numerical and categorical variables.
Algorithm Selection: What are the pros and cons of each algorithm you selected?
Logistic Regression
Pros
1. Provides probabilities for predictions and, as such, is easy to interpret
2. Works well with large datasets
3. Can handle imbalanced data when paired with class weighting or resampling, which our dependent variable definitely requires
4. Can handle non-linear relationships if non-linear terms or interactions are engineered
5. Works well with high-dimensional data, especially with regularization
Cons
1. Assumes a linear relationship between the independent variables and the log-odds of the dependent variable
2. Sensitive to outliers
3. Sensitive to multicollinearity
4. Does not handle categorical variables directly; requires encoding, e.g. one-hot encoding
Random Forest
Pros
1. Handles imbalanced data well when paired with class weighting or stratified sampling (see the sketch after the cons list below)
2. Handles categorical and numerical features well (though many implementations require categorical features to be one-hot encoded)
3. Reduces overfitting by averaging multiple decision trees
4. Tolerates missing data reasonably well (not a concern here, as our dataset has no missing values)
5. Helps to identify the most important features
6. Handles large datasets
Cons
1. Slower than simpler models such as logistic regression
2. More difficult to interpret than logistic regression
3. Individual trees may become overly complex with high-dimensional data
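If one did go on to train a model (the assignment only asks for recommendations), a minimal sketch of a class-weighted random forest might look like the following, assuming the randomForest package; the tree count and class weights are illustrative values, not tuned ones.
# Hypothetical illustration: random forest with class weights for the imbalanced target
library(randomForest)
set.seed(622)
rf_fit <- randomForest(
  y ~ ., data = bank_data,
  ntree = 500,         # illustrative tree count
  classwt = c(1, 8),   # priors in factor-level order ("no", "yes"); up-weights the rare class (illustrative)
  importance = TRUE
)
print(rf_fit)          # OOB error estimate and confusion matrix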
Algorithm Selection: Which algorithm would you recommend, and why? How does your choice of algorithm relate to the dataset?
I would recommend Random Forest because:
1. the dataset is highly imbalanced,
2. it contains numerous numerical and categorical variables,
3. random forest handles large datasets well, and
4. the dataset is not highly dimensional (17 variables; random forest can struggle with highly dimensional data).
I might also use logistic regression, but the dataset has a lot of outliers, which would need to be accounted for in preprocessing. I also briefly considered Naive Bayes, but the dataset has a large number of numerical features that are not normally distributed. Furthermore, the features are most likely not independent: job, age, education, marital status, and housing loan are likely to be correlated.
Algorithm Selection: Are there labels in your data? Did that impact your choice of algorithm?
Yes. The dependent variable is binary and labeled. As such, I mainly considered supervised learning algorithms.
Algorithm Selection: Would your choice of algorithm change if there were fewer than 1,000 data records, and why?
Yes, because random forest may overfit when trained on small datasets. In such a case, I would probably use a single decision tree or kNN, as sketched below.
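A minimal sketch of that small-data fallback, assuming the rpart and rpart.plot packages and a hypothetical 1,000-row sample bank_small drawn from the full data:
# Hypothetical illustration: fit a single decision tree on a small sample
library(rpart)
library(rpart.plot)
set.seed(622)
bank_small <- bank_data %>% slice_sample(n = 1000)   # assumed small dataset
tree_fit <- rpart(y ~ ., data = bank_small, method = "class")
rpart.plot(tree_fit)   # visualize the fitted tree
printcp(tree_fit)      # complexity-parameter table for pruning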
Pre-processing: Data Cleaning - improve data quality, address missing data, etc.
Missing Values
There are no missing values in the dataset, so no imputation for missing values is needed.
Class Imbalance
The target variable is highly imbalanced (not subscribed = 88.3% versus subscribed = 11.7%).
This means the model may struggle to correctly predict the minority class (yes), leading to a bias towards “no”.
To address this imbalance, we might oversample to generate more “yes” records or undersample to reduce the dominance of the “no” class.
Outlier Detection
There are many outliers in the dataset, as previously demonstrated via the scatterplots and the 1.5 × IQR detection threshold.
Since we are using random forest, outliers are not of significant concern: the algorithm aggregates the results of many decision trees, so individual extreme values have limited influence.
Pre-processing: Dimensionality Reduction - remove correlated/redundant data that will slow down training
Generally, random forest is not sensitive to high-dimensional data: it can handle large numbers of features without severe overfitting, implicitly down-weights irrelevant features through the random feature subsets chosen at each split (together with bagging), and does not require feature scaling. Feature importance can also be inspected directly, as sketched below.
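A minimal sketch of inspecting feature importance, assuming the randomForest package and the hypothetical fitted model rf_fit from the earlier class-weighting sketch:
# Hypothetical illustration: rank features by importance from a fitted forest
library(randomForest)
importance(rf_fit)   # MeanDecreaseGini (and accuracy-based measures if computed)
varImpPlot(rf_fit, main = "Random Forest Feature Importance")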
If another model were used, such as logistic regression, dimensionality reduction could be achieved via PCA (sketched below) and/or feature selection.
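A minimal sketch of PCA on the scaled numeric features, as one option if logistic regression were used; the number of components kept is illustrative only:
# Hypothetical illustration: PCA on the scaled numeric features
numeric_scaled <- bank_data %>%
  select(where(is.numeric)) %>%
  scale()
pca_fit <- prcomp(numeric_scaled)               # data already centered/scaled above
summary(pca_fit)                                # proportion of variance explained per component
pca_scores <- as.data.frame(pca_fit$x[, 1:4])   # keep the first 4 components (illustrative)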
Below, the numerical and categorical correlation matrices display mild correlations between pdays and previous, job and education, housing and month, and contact and month, but not enough correlation to support dimensionality reduction.
ggcorrplot(correlation_matrix,
method = "circle",
type = "lower",
lab = TRUE,
title = "Correlation Matrix of Numeric Variables")
ggcorrplot(cramers_matrix,
method = "circle",
type = "lower",
lab = TRUE,
title = "Cramers V Correlation Between Categorical Variables")
Feature Engineering - use of business knowledge to create new features
Although random forest is able to handle this dataset as-is, a few variables could be improved upon. Time-based features such as month could be engineered at a seasonal level, and the housing and personal loan features could be combined into an all-encompassing “loan” variable denoting the presence of any type of loan, as sketched below.
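A minimal sketch of these two engineered features; the season mapping and the any_loan name are assumptions made for illustration:
# Hypothetical illustration: derive a season feature and a combined loan indicator
bank_data_fe <- bank_data %>%
  mutate(
    season = case_when(
      month %in% c("dec", "jan", "feb") ~ "winter",
      month %in% c("mar", "apr", "may") ~ "spring",
      month %in% c("jun", "jul", "aug") ~ "summer",
      TRUE                              ~ "autumn"
    ),
    any_loan = if_else(housing == "yes" | loan == "yes", "yes", "no")
  ) %>%
  mutate(across(c(season, any_loan), as.factor))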
Sampling Data - using sampling to resize datasets and Imbalanced Data - reducing the imbalance between classes
There is significant imbalance in the target feature ‘y’.
Since the dataset is a moderate size of ~45,000 rows, undersampling should be used to reduce the majority class (no) to match the minority class (yes), as sketched below. If the dataset were small, i.e. fewer than 1,000 rows, oversampling or SMOTE might be utilized to address the imbalance.
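A minimal sketch of random undersampling with dplyr; the balanced data frame name is an assumption for illustration:
# Hypothetical illustration: undersample the majority class to match the minority class
set.seed(622)
n_yes <- sum(bank_data$y == "yes")
bank_balanced <- bank_data %>%
  group_by(y) %>%
  slice_sample(n = n_yes) %>%   # keeps all "yes" rows and an equal-sized sample of "no" rows
  ungroup()
table(bank_balanced$y)          # both classes now have the same count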
Data Transformation - regularization, normalization, handling categorical variables
Since many random forest implementations cannot handle categorical variables directly, one-hot encoding (for features with a small number of categories) and label encoding (for features with a larger number of categories) can be used to transform the data into numerical values, as sketched below. Additionally, highly skewed features could be transformed, for example via a log transformation. Regularization is not needed, since random forest implicitly performs feature selection and is not sensitive to multicollinearity.
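A minimal sketch of these transformations using base R's model.matrix() for one-hot encoding and log1p() for skewed, non-negative counts; balance is left untouched here because it contains negative values, and all object names are illustrative:
# Hypothetical illustration: log-transform skewed counts, then one-hot encode the factors
bank_transformed <- bank_data %>%
  mutate(
    duration_log = log1p(duration),   # non-negative and right-skewed
    campaign_log = log1p(campaign),
    previous_log = log1p(previous)
  )
# model.matrix() expands each factor into 0/1 dummy columns (the -1 drops the intercept)
X <- model.matrix(y ~ . - 1, data = bank_transformed)
y_target <- bank_transformed$y
dim(X)   # rows x expanded feature columns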