Introduction
This assignment focuses on one of the most important aspects of data science: Exploratory Data Analysis (EDA). Many surveys show that data scientists spend 60-80% of their time on data preparation. EDA allows you to identify data gaps and class imbalances, improve data quality, create better features, and gain a deep understanding of your data before training a model - all of which ultimately helps train better models. In machine learning there is a saying, “better data beats better algorithms,” meaning that it is often more productive to spend time improving data quality than refining the code that trains the model.
This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both datasets selected and compare the results.
Dataset
A Portuguese bank conducted a marketing campaign (phone calls) to determine whether clients would subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics to help the bank persuade more customers to subscribe to a term deposit in its next campaign. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
Assignment
PART I: Exploratory Data Analysis
Review the structure and content of the data and answer questions such as:
PART II: Algorithm Selection
Now that you have completed the EDA, which algorithms would suit the business purpose for this dataset? Answer questions such as:
PART III: Pre-processing
Now that you have done the EDA and selected an algorithm, what pre-processing (if any) would you require for:
Write a short essay summarizing your findings. Explain your selection of algorithms and how they relate to the data and what you are trying to do.
This analysis focused on Exploratory Data Analysis (EDA) and algorithm selection for bank marketing data. The dataset contains ~45,000 records and 17 variables, including the target variable (y), which denotes whether a client subscribed to a term deposit.
EDA
There was considerable class imbalance in the target variable (y): ~11% of clients subscribed (yes), while ~88% did not (no). There are seven numerical and ten categorical variables in the dataset. Most numerical features were right-skewed, and many had outliers, detected via the IQR and scatterplots. The correlation matrix showed no strong linear relationships between features, with most variables showing very weak correlations or none at all.
There was no missing data.
Algorithm Selection
Since the target variable is imbalanced, binary, and labeled, supervised learning algorithms were deemed appropriate. Random forest and logistic regression were investigated as potential models. Ultimately, random forest was chosen over logistic regression because:
1. logistic regression would have required extensive data pre-processing and transformations,
2. the relationships between the independent and dependent variables are non-linear (which random forest does not assume),
3. there are many outliers (which random forest can handle),
4. the dataset is not highly dimensional (17 variables; random forest can struggle with highly dimensional data), and
5. random forest implicitly performs feature selection, so feature reduction is not needed.
Pre-Processing
There was no missing data, so no imputation is needed. Random forest can handle outliers, so no modifications to outliers are required. Undersampling was advocated to reduce the dominance of the majority class of the target variable ‘y’. Additionally, random forest was advocated because it implicitly selects features and does not require dimensionality reduction. However, for random forest, categorical variables would need to be transformed, i.e. one-hot encoded. Finally, no feature engineering was strictly required.
Final Model Recommendation: Random Forest
Overall, class imbalance, skewed non-linear distributions, the presence of outliers, the mix of categorical and numerical variables, and feature dependence made random forest the most appropriate initial model with which to analyze the dataset.
# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(ggplot2)
library(ggcorrplot)
library(rcompanion)
##
## Attaching package: 'rcompanion'
##
## The following object is masked from 'package:psych':
##
## phi
library(dplyr)
library(viridis)
## Loading required package: viridisLite
library(hrbrthemes)
library(knitr)
library(skimr)
# Import the CSV file from github
bank_data <- read_csv2("https://raw.githubusercontent.com/greggmaloy/Data622/main/bank-full.csv", show_col_types = FALSE)
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
# View the first few rows
bank_data
## # A tibble: 45,211 × 17
## age job marital education default balance housing loan contact day
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 58 manageme… married tertiary no 2143 yes no unknown 5
## 2 44 technici… single secondary no 29 yes no unknown 5
## 3 33 entrepre… married secondary no 2 yes yes unknown 5
## 4 47 blue-col… married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 manageme… married tertiary no 231 yes no unknown 5
## 7 28 manageme… single tertiary no 447 yes yes unknown 5
## 8 42 entrepre… divorc… tertiary yes 2 yes no unknown 5
## 9 58 retired married primary no 121 yes no unknown 5
## 10 43 technici… single secondary no 593 yes no unknown 5
## # ℹ 45,201 more rows
## # ℹ 7 more variables: month <chr>, duration <dbl>, campaign <dbl>, pdays <dbl>,
## # previous <dbl>, poutcome <chr>, y <chr>
Exploratory Data Analysis: Are the features (columns) of your data correlated?
There are 17 columns and 45,211 rows. “y” is the target variable, which denotes whether a client subscribed to a term deposit. The 16 features (not including the target variable) are listed below:
colnames(bank_data)
## [1] "age" "job" "marital" "education" "default" "balance"
## [7] "housing" "loan" "contact" "day" "month" "duration"
## [13] "campaign" "pdays" "previous" "poutcome" "y"
Exploratory Data Analysis: What is the overall distribution of each variable?
NUMERICAL FEATURE DISTRIBUTION
Pdays – right-skewed, with a median of -1 and a mean of ~40. The 1st and 3rd quartiles are also -1, meaning, according to the data dictionary, that a majority of clients were not previously contacted. This was difficult to ascertain from the small histogram, but the interquartile range and median provided information regarding the skew. The mean of ~40 alludes to the presence of outliers, as the 1st, 2nd and 3rd quartiles (covering at least 75% of rows) were -1 (not previously contacted).
Previous – right-skewed, with most values at zero. Median = 0; 1st and 3rd quartiles = 0, meaning a majority of clients were not previously contacted. The mean of 0.58 alludes to the possible presence of outliers, since it differs from the median of 0.
Campaign – right-skewed, denoting few contacts performed during this campaign. The minimum value of 1 denotes that each client was contacted at least once as part of the present campaign. The median of 2, mean of 2.8, 1st quartile of 1 and 3rd quartile of 3 suggest that 50% of the data falls between 1 and 3. Possible outliers are alluded to, as the mean is higher than the median.
Day – appears to approximate a normal distribution, as denoted by the approximately equal mean and median (~16) and the quartiles (1st = 8, 3rd = 21).
Age – slightly right-skewed, as denoted by the histogram. The interquartile range denotes that 50% of clients are between 33 and 48 years old.
Duration – right-skewed, denoting that most calls were short, though some lasted significantly longer, as denoted by the histogram and IQR.
Balance – right-skewed, as denoted by the large difference between the mean (1362) and median (448). This difference denotes that most clients have lower balances, while a few have extremely high balances. Additionally, the IQR (1st quartile = 72, 3rd quartile = 1428), coupled with the large mean (1362) and relatively small median (448), denotes that some clients have significantly higher balances, which skews the balance feature and alludes to the presence of outliers.
# Separating numerical features
bank_data_numerical<-bank_data %>%
select(where(is.numeric))
# Using describe() for summary stats
describe(bank_data_numerical)
## vars n mean sd median trimmed mad min max range
## age 1 45211 40.94 10.62 39 40.25 10.38 18 95 77
## balance 2 45211 1362.27 3044.77 448 767.21 664.20 -8019 102127 110146
## day 3 45211 15.81 8.32 16 15.69 10.38 1 31 30
## duration 4 45211 258.16 257.53 180 210.87 137.88 0 4918 4918
## campaign 5 45211 2.76 3.10 2 2.12 1.48 1 63 62
## pdays 6 45211 40.20 100.13 -1 11.92 0.00 -1 871 872
## previous 7 45211 0.58 2.30 0 0.13 0.00 0 275 275
## skew kurtosis se
## age 0.68 0.32 0.05
## balance 8.36 140.73 14.32
## day 0.09 -1.06 0.04
## duration 3.14 18.15 1.21
## campaign 4.90 39.24 0.01
## pdays 2.62 6.93 0.47
## previous 41.84 4506.16 0.01
# Using summary() for interquartile calculations
summary(bank_data_numerical)
## age balance day duration
## Min. :18.00 Min. : -8019 Min. : 1.00 Min. : 0.0
## 1st Qu.:33.00 1st Qu.: 72 1st Qu.: 8.00 1st Qu.: 103.0
## Median :39.00 Median : 448 Median :16.00 Median : 180.0
## Mean :40.94 Mean : 1362 Mean :15.81 Mean : 258.2
## 3rd Qu.:48.00 3rd Qu.: 1428 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :95.00 Max. :102127 Max. :31.00 Max. :4918.0
## campaign pdays previous
## Min. : 1.000 Min. : -1.0 Min. : 0.0000
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000
## Median : 2.000 Median : -1.0 Median : 0.0000
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
# Convert to long format for plotting
bank_data_long <- bank_data_numerical %>%
select(where(is.numeric)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
mutate(value = round(as.numeric(value), 0))
# Plot histograms
p <- bank_data_long %>%
mutate(variable = fct_reorder(variable, value)) %>%
ggplot(aes(x = value, color = variable, fill = variable)) +
geom_histogram(alpha = 0.6, binwidth = 5) +
scale_fill_viridis(discrete = TRUE) +
scale_color_viridis(discrete = TRUE) +
theme_minimal() +
theme(
legend.position = "none",
panel.spacing = unit(0.1, "lines"),
strip.text.x = element_text(size = 8)
) +
xlab("") +
ylab("Frequency") +
facet_wrap(~variable, scales = "free")
# Print
print(p)
CATEGORICAL FEATURE DISTRIBUTION
Job – The distribution is fairly spread out, though some jobs (e.g. blue-collar, management, technician) appear more frequently than others, as denoted by the bar chart.
Marital – Most clients were married, as denoted by the bar chart and the mean and median of approximately 2. The number of married clients is roughly double the number of single clients and nearly five times the number of divorced clients. This feature is significantly skewed in favor of married clients.
Education – Most clients have secondary education as their highest education level. The number of clients with a secondary education is approximately double the number with a tertiary education and almost three times the number with a primary education. The feature is skewed in favor of clients whose highest education level is secondary.
Default – A significant majority of clients do not have credit in default. The mean and median of 1 and the low standard deviation (0.13) denote that the vast majority of clients do not have credit in default. The feature is heavily skewed toward clients with no credit default.
Housing – Clients are roughly evenly split on having a housing loan, as denoted by the bar chart: approximately 55% have a housing loan and approximately 45% do not.
Loan – Most clients do not have a personal loan: approximately 20% of clients have a personal loan, while ~80% do not.
Contact – Most clients were contacted via cellular phone, as denoted by the bar chart.
Month – There are significant differences in the months in which clients were called, with May being the month with the most calls, as denoted by the bar chart.
Poutcome – The ‘unknown’ category is significantly larger than the other categories, as denoted by the bar chart. Among clients who participated in past campaigns, failure is more common than success.
Y (subscription to term deposit) – A majority of clients have not subscribed (yes = ~5,000, no = ~40,000), meaning the dataset and target variable are imbalanced.
# Code for categorical variables
# Separating categorical features
bank_data_categorical<-bank_data %>%
select(where(is.character))
# Using describe() for summary stats
describe(bank_data_categorical)
## vars n mean sd median trimmed mad min max range skew
## job* 1 45211 5.34 3.27 5 5.25 4.45 1 12 11 0.26
## marital* 2 45211 2.17 0.61 2 2.21 0.00 1 3 2 -0.10
## education* 3 45211 2.22 0.75 2 2.23 0.00 1 4 3 0.20
## default* 4 45211 1.02 0.13 1 1.00 0.00 1 2 1 7.24
## housing* 5 45211 1.56 0.50 2 1.57 0.00 1 2 1 -0.22
## loan* 6 45211 1.16 0.37 1 1.08 0.00 1 2 1 1.85
## contact* 7 45211 1.64 0.90 1 1.55 0.00 1 3 2 0.77
## month* 8 45211 6.52 3.01 7 6.68 2.97 1 12 11 -0.48
## poutcome* 9 45211 3.56 0.99 4 3.82 0.00 1 4 3 -1.97
## y* 10 45211 1.12 0.32 1 1.02 0.00 1 2 1 2.38
## kurtosis se
## job* -1.27 0.02
## marital* -0.44 0.00
## education* -0.26 0.00
## default* 50.49 0.00
## housing* -1.95 0.00
## loan* 1.43 0.00
## contact* -1.32 0.00
## month* -1.00 0.01
## poutcome* 2.15 0.00
## y* 3.68 0.00
# Transform to long
bank_data_long <- bank_data_categorical %>%
pivot_longer(everything(), names_to = "variable", values_to = "category")
# Plot bar plots
p <- bank_data_long %>%
mutate(variable = fct_reorder(variable, category)) %>%
ggplot(aes(x = category, color = variable, fill = variable)) +
geom_bar(alpha = 0.6) +
scale_fill_viridis(discrete = TRUE) +
scale_color_viridis(discrete = TRUE) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
theme_minimal() +
theme(
legend.position = "none",
panel.spacing = unit(0.5, "lines"),
strip.text.x = element_text(size = 8),
axis.text.x = element_text(angle = 45, hjust = 1.0, size = 7.5),
axis.text.y = element_text(size = 10),
plot.margin = margin(10, 10, 10, 10)
) +
xlab("") +
ylab("Frequency") + # Adjust y-axis label
facet_wrap(~variable, scales = "free_x", ncol = 5)#+ theme(axis.text.y=element_text(size=rel(1.0))) # Limit to 2 grids per row
# Print
print(p)
Exploratory Data Analysis: Are there any outliers present?
Outliers were investigated via three methods:
1. Using the histograms and summary statistics from the previous step, including mean, median, and IQR
2. Via scatterplots
3. Via a threshold of 1.5 x the IQR
Outlier analysis via summary statistics
The distributions of the features and target variable examined in the previous step allude to outliers in the following numerical features/variables:
Pdays – Median value =-1, 1st and 3rd quartiles = -1 and mean of ~40 alludes to the presence of outliers.
Previous – Median value = 0, 1st and 3rd quartiles = 0 and the mean of 0.58 alludes to the possible presence of outliers since the mean is different than the median of 0.
Campaign – The median of 2, mean of 2.8, and IQR (1st IQR = 1 and 3rd IQR=3) allude to possible outliers, as the mean is higher than median.
Duration – Right skewed denoting that most calls were of short length, though some did last significantly longer and are most likely outliers as denoted by the histogram and IQR.
Balance – The IQR (1st= 72, 3rd=1428) coupled with the large mean (1362) and relatively small median (448), denotes that some clients have significantly higher balance which skews the balance feature and alludes to the presence of outliers.
Outlier analysis via scatter plots
Below, scatterplots are plotted to visually inspect the numerical features for outliers. Upon visual inspection, each feature appears to have outliers, with the exception of ‘day’.
Outlier analysis via 1.5 * IQR
Additionally, outliers were calculated for each numerical feature using 1.5 x the IQR. Each feature, with the exception of day, was determined to have outliers (age 487, balance 4729, campaign 3064, duration 3235, pdays 8257, previous 8257).
# scatterplots for outliers
# Select only numerical variables
numeric_data <- bank_data %>%
select(where(is.numeric))
# Pivot the data to long format for plotting
numeric_data_long <- numeric_data %>%
mutate(id = row_number()) %>%
pivot_longer(cols = -id, names_to = "variable", values_to = "value")
# Create scatterplots for each numeric variable
ggplot(numeric_data_long, aes(x = id, y = value)) +
geom_point(alpha = 0.6, color = "darkblue") +
facet_wrap(~variable, scales = "free", ncol = 5) +
theme_minimal() +
theme(
strip.text.x = element_text(size = 10),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
labs(
x = "ID (Row Number)",
y = "Value",
title = "Scatterplots of Numerical Variables"
)
# Using IQR to assess outliers
# IQR Calculation
calculate_iqr_outliers <- function(df, col_name) {
Q1 <- quantile(df[[col_name]], 0.25, na.rm = TRUE) # 25th
Q3 <- quantile(df[[col_name]], 0.75, na.rm = TRUE) # 75th
IQR <- Q3 - Q1 # Interquartile range
# Calculate outlier threshold
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Identify outliers
df %>%
filter(df[[col_name]] < lower_bound | df[[col_name]] > upper_bound) %>% # Keep only outliers
mutate(variable = col_name) %>%
select(variable)
}
# Apply the function to all numeric variables and count outliers
outlier_counts <- bank_data %>%
select(where(is.numeric)) %>%
names() %>%
map_df(~calculate_iqr_outliers(bank_data, .x)) %>%
group_by(variable) %>%
summarise(outlier_count = n())
# Print the outlier counts for each feature
cat("\n**Table 1: Count of Outliers in Each Numeric Variable**\n")
##
## **Table 1: Count of Outliers in Each Numeric Variable**
kable(outlier_counts, caption = "Count of Outliers in Each Numeric Variable")
| variable | outlier_count |
|---|---|
| age | 487 |
| balance | 4729 |
| campaign | 3064 |
| duration | 3235 |
| pdays | 8257 |
| previous | 8257 |
Exploratory Data Analysis: What are the relationships between different variables?
NUMERICAL VARIABLES
The correlation matrix below alludes to relationships among the numeric variables in the dataset. Overall, the numeric variables do not have strong linear relationships. The features pdays and previous have a mild relationship, which might indicate that people who were contacted before are more likely to be contacted again soon.
MILD RELATIONSHIPS
Pdays and Previous – correlation of 0.45: the strongest positive correlation in the matrix, between the number of contacts made before this campaign and the number of days since the client was last contacted in a previous campaign. This appears to be a positive linear correlation.
WEAK RELATIONSHIPS
Day and Duration – correlation of 0.16: suggests a weak positive correlation, with call duration increasing slightly with the day of the month, though the relationship is not strong.
Age and Balance – correlation of 0.1: suggests a weak correlation; older clients may have slightly higher account balances.
Other correlation values for the numerical features are close to zero, denoting no significant correlation.
CATEGORICAL VARIABLES
The Cramér's V matrix below alludes to relationships among the categorical variables in the dataset. Overall, the categorical variables do not have strong associations with each other.
MILD RELATIONSHIPS
Job and Education – Correlation of 0.46
Housing and Month – Correlation of 0.50
Contact and Month – Correlation of 0.51
WEAK RELATIONSHIPS
All other correlations are weak.
# --- Correlation Matrix for Numeric Variables ---
# Select numeric variables
numeric_data <- bank_data %>% select(where(is.numeric))
# Correlation matrix
correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
# corr matrix
ggcorrplot(correlation_matrix,
method = "circle",
type = "lower",
lab = TRUE,
title = "Correlation Matrix of Numeric Variables")
# Cramers
# Select categorical variables
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]
# Function for Cramers
cramers_v_matrix <- function(df, cat_vars) {
n <- length(cat_vars)
result <- matrix(0, n, n, dimnames = list(cat_vars, cat_vars))
for (i in 1:n) {
for (j in i:n) {
if (i == j) {
result[i, j] <- 1
} else {
result[i, j] <- cramerV(df[[cat_vars[i]]], df[[cat_vars[j]]])
result[j, i] <- result[i, j]
}
}
}
return(result)
}
cramers_matrix <- cramers_v_matrix(bank_data, categorical_cols)
cramers_df <- as.data.frame(cramers_matrix)
ggcorrplot(cramers_matrix,
method = "circle",
type = "lower",
lab = TRUE,
title = "Cramers V Correlation Between Categorical Variables")
Exploratory Data Analysis: How are categorical variables distributed?
# Convert categorical variables to factors
bank_data <- bank_data %>%
mutate(across(where(is.character), as.factor))
# Combine all numeric-categorical pairs into one long dataframe
numeric_vars <- names(select(bank_data, where(is.numeric)))
categorical_vars <- names(select(bank_data, where(is.factor)))
# Long format
bank_data_long <- bank_data %>%
pivot_longer(cols = all_of(numeric_vars), names_to = "Numeric_Variable", values_to = "Value")
# Boxplots
ggplot(bank_data_long, aes(x = .data[[categorical_vars[1]]], y = Value, fill = .data[[categorical_vars[1]]])) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~Numeric_Variable, scales = "free") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels
legend.position = "none" # Remove legend if not needed
) +
labs(
x = "Categorical Variable",
y = "Value",
title = "Boxplots of Numeric Variables by Categorical Groups"
)
Exploratory Data Analysis: What is the central tendency and spread of each variable?
Below, the central tendency of each numerical variable is denoted via mean, median, and mode. For categorical variables, the mode is reported.
The spread of the numerical variables is denoted by standard deviation, variance, IQR, and range (max - min).
The spread of the categorical variables is captured via the frequency of each category and its percentage.
# Display numeric summary
# Separate into numeric and categorical variables
numeric_vars <- bank_data %>%
select(where(is.numeric))
categorical_vars <- bank_data %>%
select(where(is.character) | where(is.factor))
# Function to calculate summary statistics for numeric variables
numeric_summary <- numeric_vars %>%
summarise_all(list(
Mean = mean,
Median = median,
SD = sd,
Variance = var,
IQR = IQR,
Range = ~max(., na.rm = TRUE) - min(., na.rm = TRUE)
), na.rm = TRUE) %>%
pivot_longer(cols = everything(), names_to = c("Variable", "Metric"), names_sep = "_") %>%
pivot_wider(names_from = Metric, values_from = value)
# Function to compute mode separately
compute_mode <- function(x) {
tab <- table(x)
names(tab)[which.max(tab)]
}
mode_summary <- numeric_vars %>%
summarise_all(list(Mode = compute_mode)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Mode") %>%
# Strip the "_Mode" suffix so the join below matches numeric_summary's Variable names
mutate(Variable = sub("_Mode$", "", Variable))
# Merge numeric summary with mode
numeric_summary <- left_join(numeric_summary, mode_summary, by = "Variable")
# Function to find mode for categorical variables
categorical_summary <- categorical_vars %>%
summarise_all(list(
Mode = compute_mode
)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Mode")
# Display numeric summary
print(numeric_summary)
## # A tibble: 7 × 8
## Variable Mean Median SD Variance IQR Range Mode
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 age 40.9 39 10.6 113. 15 77 <NA>
## 2 balance 1362. 448 3045. 9270599. 1356 110146 <NA>
## 3 day 15.8 16 8.32 69.3 13 30 <NA>
## 4 duration 258. 180 258. 66321. 216 4918 <NA>
## 5 campaign 2.76 2 3.10 9.60 2 62 <NA>
## 6 pdays 40.2 -1 100. 10026. 0 872 <NA>
## 7 previous 0.580 0 2.30 5.31 0 275 <NA>
# Display categorical summary
print(categorical_summary)
## # A tibble: 10 × 2
## Variable Mode
## <chr> <chr>
## 1 job_Mode blue-collar
## 2 marital_Mode married
## 3 education_Mode secondary
## 4 default_Mode no
## 5 housing_Mode yes
## 6 loan_Mode no
## 7 contact_Mode cellular
## 8 month_Mode may
## 9 poutcome_Mode unknown
## 10 y_Mode no
# Function to compute category counts and correct percentages
compute_category_counts <- function(df) {
df %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category") %>%
group_by(Variable, Category) %>%
summarise(Count = n(), .groups = "drop") %>%
group_by(Variable) %>%
mutate(Percentage = round((Count / sum(Count)) * 100, 2)) %>%
arrange(Variable, desc(Count))
}
# Apply function to categorical variables
categorical_vars <- bank_data %>%
select(where(is.character) | where(is.factor))
category_counts <- compute_category_counts(categorical_vars)
# Print
print(category_counts)
## # A tibble: 46 × 4
## # Groups: Variable [10]
## Variable Category Count Percentage
## <chr> <fct> <int> <dbl>
## 1 contact cellular 29285 64.8
## 2 contact unknown 13020 28.8
## 3 contact telephone 2906 6.43
## 4 default no 44396 98.2
## 5 default yes 815 1.8
## 6 education secondary 23202 51.3
## 7 education tertiary 13301 29.4
## 8 education primary 6851 15.2
## 9 education unknown 1857 4.11
## 10 housing yes 25130 55.6
## # ℹ 36 more rows
Exploratory Data Analysis: Are there any missing values and how significant are they?
There are no missing values.
# Missing values
missing_summary <- bank_data %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
arrange(desc(Missing_Count))
# Print
print(missing_summary)
## # A tibble: 17 × 2
## Variable Missing_Count
## <chr> <int>
## 1 age 0
## 2 job 0
## 3 marital 0
## 4 education 0
## 5 default 0
## 6 balance 0
## 7 housing 0
## 8 loan 0
## 9 contact 0
## 10 day 0
## 11 month 0
## 12 duration 0
## 13 campaign 0
## 14 pdays 0
## 15 previous 0
## 16 poutcome 0
## 17 y 0
Exploratory Data Analysis: Do any patterns or trends emerge in the data?
The most obvious patterns are:
1. The dataset is highly imbalanced as demonstrated by the bias in the dependent variable. A majority of clients have not subscribed (yes =~5,000, no=~40,000).
2. Many of the features are not normally distributed.
3. Many of the features do not have a linear relationship with the dependent variable.
4. Feature independence is most likely violated. The features job, age, education, marital status, and housing loan are likely to be correlated.
Algorithm Selection: Select two or more machine learning algorithms presented so far that could be used to train a model (no need to train models - I am only looking for your recommendations).
I would recommend using Random Forest and Logistic Regression since our dataset is:
1. imbalanced, as evidenced by the imbalance present in the dependent variable itself,
2. large, i.e. ~45,000 rows with 17 variables, and
3. consists of both numerical and categorical variables.
Algorithm Selection: What are the pros and cons of each algorithm you selected?
Logistic Regression
Pros
1. Provides probabilities for predictions and, as such, is easy to interpret
2. Works well with large datasets
3. Can handle imbalanced data when paired with class weighting or resampling, which our dependent variable definitely requires
4. Can handle non-linear relationships if non-linear terms or interactions are engineered
5. Works well with high-dimensional data, especially with regularization
Cons
1. Assumes a linear relationship between the independent variables and the log-odds of the dependent variable
2. Sensitive to outliers
3. Sensitive to multicollinearity
4. Does not handle categorical variables directly; requires encoding, e.g. one-hot encoding
Random Forest
Pros
1. Handles imbalanced data well when paired with class weighting or stratified sampling (see the sketch after the cons list below)
2. Handles categorical and numerical features well (though many implementations require categorical features to be one-hot encoded)
3. Reduces overfitting by averaging multiple decision trees
4. Tolerates missing data reasonably well (not a concern here, as our dataset has no missing values)
5. Helps to identify the most important features
6. Handles large datasets
Cons
1. Slower than simpler models such as logistic regression
2. More difficult to interpret than logistic regression
3. Individual trees may become overly complex with high-dimensional data
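If one did go on to train a model (the assignment only asks for recommendations), a minimal sketch of a class-weighted random forest might look like the following, assuming the randomForest package; the tree count and class weights are illustrative values, not tuned ones.
# Hypothetical illustration: random forest with class weights for the imbalanced target
library(randomForest)
set.seed(622)
rf_fit <- randomForest(
  y ~ ., data = bank_data,
  ntree = 500,         # illustrative tree count
  classwt = c(1, 8),   # priors in factor-level order ("no", "yes"); up-weights the rare class (illustrative)
  importance = TRUE
)
print(rf_fit)          # OOB error estimate and confusion matrix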
Algorithm Selection: Which algorithm would you recommend, and why? How does your choice of algorithm relate to the dataset?
I would recommend Random Forest because:
1. the dataset is highly imbalanced,
2. it contains numerous numerical and categorical variables,
3. random forest handles large datasets well, and
4. the dataset is not highly dimensional (17 variables; random forest can struggle with highly dimensional data).
I might also use logistic regression, but the dataset has a lot of outliers, which would need to be accounted for in preprocessing. I also briefly considered Naive Bayes, but the dataset has a large number of numerical features that are not normally distributed. Furthermore, the features are most likely not independent: job, age, education, marital status, and housing loan are likely to be correlated.
Algorithm Selection: Are there labels in your data? Did that impact your choice of algorithm?
Yes. The dependent variable is binary and labeled. As such, I mainly considered supervised learning algorithms.
Algorithm Selection: Would your choice of algorithm change if there were fewer than 1,000 data records, and why?
Yes, because random forest may overfit when trained on small datasets. In such a case, I would probably use a single decision tree or kNN, as sketched below.
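A minimal sketch of that small-data fallback, assuming the rpart and rpart.plot packages and a hypothetical 1,000-row sample bank_small drawn from the full data:
# Hypothetical illustration: fit a single decision tree on a small sample
library(rpart)
library(rpart.plot)
set.seed(622)
bank_small <- bank_data %>% slice_sample(n = 1000)   # assumed small dataset
tree_fit <- rpart(y ~ ., data = bank_small, method = "class")
rpart.plot(tree_fit)   # visualize the fitted tree
printcp(tree_fit)      # complexity-parameter table for pruning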
Pre-processing: Data Cleaning - improve data quality, address missing data, etc.
Missing Values
There are no missing values in the dataset, so no imputation for missing values is needed.
Class Imbalance
The target variable is highly imbalanced (not subscribed = 88.3% versus subscribed = 11.7%).
This means the model may struggle to correctly predict the minority class (yes), leading to a bias towards “no”.
To address this imbalance, we might oversample to generate more “yes” records or undersample to reduce the dominance of the “no” class.
Outlier Detection
There are many outliers in the dataset, as previously demonstrated via the scatterplots and the 1.5 × IQR detection threshold.
Since we are using random forest, outliers are not of significant concern: the algorithm aggregates the results of many decision trees, so individual extreme values have limited influence.
Pre-processing: Dimensionality Reduction - remove correlated/redundant data that will slow down training
Generally, random forest is not sensitive to high-dimensional data: it can handle large numbers of features without severe overfitting, implicitly down-weights irrelevant features through the random feature subsets chosen at each split (together with bagging), and does not require feature scaling. Feature importance can also be inspected directly, as sketched below.
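A minimal sketch of inspecting feature importance, assuming the randomForest package and the hypothetical fitted model rf_fit from the earlier class-weighting sketch:
# Hypothetical illustration: rank features by importance from a fitted forest
library(randomForest)
importance(rf_fit)   # MeanDecreaseGini (and accuracy-based measures if computed)
varImpPlot(rf_fit, main = "Random Forest Feature Importance")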
If another model were used, such as logistic regression, dimensionality reduction could be achieved via PCA (sketched below) and/or feature selection.
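A minimal sketch of PCA on the scaled numeric features, as one option if logistic regression were used; the number of components kept is illustrative only:
# Hypothetical illustration: PCA on the scaled numeric features
numeric_scaled <- bank_data %>%
  select(where(is.numeric)) %>%
  scale()
pca_fit <- prcomp(numeric_scaled)               # data already centered/scaled above
summary(pca_fit)                                # proportion of variance explained per component
pca_scores <- as.data.frame(pca_fit$x[, 1:4])   # keep the first 4 components (illustrative)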
Below, the numerical and categorical correlation matrices display mild correlations between pdays and previous, job and education, housing and month, and contact and month, but not enough correlation to support dimensionality reduction.
ggcorrplot(correlation_matrix,
method = "circle",
type = "lower",
lab = TRUE,
title = "Correlation Matrix of Numeric Variables")
ggcorrplot(cramers_matrix,
method = "circle",
type = "lower",
lab = TRUE,
title = "Cramers V Correlation Between Categorical Variables")
Feature Engineering - use of business knowledge to create new features
Although random forest is able to handle this dataset as-is, a few variables could be improved upon. Time-based features such as month could be engineered at a seasonal level, and the housing and personal loan features could be combined into an all-encompassing “loan” variable denoting the presence of any type of loan, as sketched below.
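A minimal sketch of these two engineered features; the season mapping and the any_loan name are assumptions made for illustration:
# Hypothetical illustration: derive a season feature and a combined loan indicator
bank_data_fe <- bank_data %>%
  mutate(
    season = case_when(
      month %in% c("dec", "jan", "feb") ~ "winter",
      month %in% c("mar", "apr", "may") ~ "spring",
      month %in% c("jun", "jul", "aug") ~ "summer",
      TRUE                              ~ "autumn"
    ),
    any_loan = if_else(housing == "yes" | loan == "yes", "yes", "no")
  ) %>%
  mutate(across(c(season, any_loan), as.factor))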
Sampling Data - using sampling to resize datasets and Imbalanced Data - reducing the imbalance between classes
There is significant imbalance in the target feature ‘y’.
Since the dataset is a moderate size of ~45,000 rows, undersampling should be used to reduce the majority class (no) to match the minority class (yes), as sketched below. If the dataset were small, i.e. fewer than 1,000 rows, oversampling or SMOTE might be utilized to address the imbalance.
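A minimal sketch of random undersampling with dplyr; the balanced data frame name is an assumption for illustration:
# Hypothetical illustration: undersample the majority class to match the minority class
set.seed(622)
n_yes <- sum(bank_data$y == "yes")
bank_balanced <- bank_data %>%
  group_by(y) %>%
  slice_sample(n = n_yes) %>%   # keeps all "yes" rows and an equal-sized sample of "no" rows
  ungroup()
table(bank_balanced$y)          # both classes now have the same count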
Data Transformation - regularization, normalization, handling categorical variables
Since many random forest implementations cannot handle categorical variables directly, one-hot encoding (for features with a small number of categories) and label encoding (for features with a larger number of categories) can be used to transform the data into numerical values, as sketched below. Additionally, highly skewed features could be transformed, for example via a log transformation. Regularization is not needed, since random forest implicitly performs feature selection and is not sensitive to multicollinearity.
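A minimal sketch of these transformations using base R's model.matrix() for one-hot encoding and log1p() for skewed, non-negative counts; balance is left untouched here because it contains negative values, and all object names are illustrative:
# Hypothetical illustration: log-transform skewed counts, then one-hot encode the factors
bank_transformed <- bank_data %>%
  mutate(
    duration_log = log1p(duration),   # non-negative and right-skewed
    campaign_log = log1p(campaign),
    previous_log = log1p(previous)
  )
# model.matrix() expands each factor into 0/1 dummy columns (the -1 drops the intercept)
X <- model.matrix(y ~ . - 1, data = bank_transformed)
y_target <- bank_transformed$y
dim(X)   # rows x expanded feature columns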