library(tidyverse)
library(caret)
library(ggplot2)
library(ggcorrplot)
library(dplyr)
The assignment states: “This assignment focuses on one of the most important aspects of data science, Exploratory Data Analysis (EDA). Many surveys show that data scientists spend 60-80% of their time on data preparation. EDA allows you to identify data gaps & data imbalances, improve data quality, create better features and gain a deep understanding of your data before doing model training - and that ultimately helps train better models. In machine learning, there is a saying - “better data beats better algorithms” - meaning that it is more productive to spend time improving data quality than improving the code to train the model.”
A Portuguese bank conducted a marketing campaign (phone calls) to predict if a client will subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and figure out the most effective tactics that will help the bank in the next campaign to persuade more customers to subscribe to the bank’s term deposit. Download the Bank Marketing Dataset.
This report contains the following sections:

* Exploratory Data Analysis
* Algorithm Selection
* Pre-processing
I downloaded the dataset and uploaded the files to my Google Cloud Platform instance. These are public URLs that automatically download the files when clicked.
bank_additional_full_df <- read_csv2('https://storage.googleapis.com/data_science_masters_files/data_622/bank%2Bmarketing/bank-additional-full.csv')
bank_additional_names_txt <- readLines('https://storage.googleapis.com/data_science_masters_files/data_622/bank%2Bmarketing/bank-additional-names.txt')
bank_additional_df <- read_csv2('https://storage.googleapis.com/data_science_masters_files/data_622/bank%2Bmarketing/bank-additional.csv')
bank_full_df <- read_csv2('https://storage.googleapis.com/data_science_masters_files/data_622/bank%2Bmarketing/bank-full.csv')
bank_names_txt <- readLines('https://storage.googleapis.com/data_science_masters_files/data_622/bank%2Bmarketing/bank-names.txt')
bank_df <- read_csv2('https://storage.googleapis.com/data_science_masters_files/data_622/bank%2Bmarketing/bank.csv')
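As a quick sanity check on the loads (my own addition, not part of the original assignment output), glimpse() and dim() confirm that the semicolon-delimited files parsed into the expected columns and row counts:
# Confirm the semicolon-delimited files parsed with the expected structure
glimpse(bank_additional_full_df)
dim(bank_full_df)
dim(bank_df)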
I will start with a review of correlation:
bank_addfull_num_df <- bank_additional_full_df[, sapply(bank_additional_full_df, is.numeric)]
bank_addfull_cat_df <- bank_additional_full_df[, sapply(bank_additional_full_df, is.character)]  # readr imports strings as character, not factor
bank_addfull_num_matrix <- cor(bank_addfull_num_df, use = "pairwise.complete.obs")
bank_addfull_high_corr <- colnames(bank_addfull_num_df)[findCorrelation(bank_addfull_num_matrix, cutoff = 0.75)]
bank_additional_num_df <- bank_additional_df[, sapply(bank_additional_df, is.numeric)]
bank_additional_cat_df <- bank_additional_df[, sapply(bank_additional_df, is.character)]
bank_additional_num_matrix <- cor(bank_additional_num_df, use = "pairwise.complete.obs")
bank_additional_high_corr <- colnames(bank_additional_num_df)[findCorrelation(bank_additional_num_matrix, cutoff = 0.75)]
bank_full_num_df <- bank_full_df[, sapply(bank_full_df, is.numeric)]
bank_full_cat_df <- bank_full_df[, sapply(bank_full_df, is.character)]
bank_full_num_matrix <- cor(bank_full_num_df, use = "pairwise.complete.obs")
bank_full_high_corr <- colnames(bank_full_num_df)[findCorrelation(bank_full_num_matrix, cutoff = 0.75)]
bank_num_df <- bank_df[, sapply(bank_df, is.numeric)]
bank_cat_df <- bank_df[, sapply(bank_df, is.character)]
bank_num_matrix <- cor(bank_num_df, use = "pairwise.complete.obs")
bank_high_corr <- colnames(bank_num_df)[findCorrelation(bank_num_matrix, cutoff = 0.75)]
The following features have reasonably high correlation, where reasonably high is greater than or equal to 75%:

* Consumer Price Index and Consumer Confidence Index
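To make that claim easy to verify, a small helper (my own sketch; high_corr_pairs is not part of the original analysis) can list every pair of numeric features whose absolute correlation meets the 0.75 cutoff in one of the matrices computed above:
# List the variable pairs at or above the cutoff from a correlation matrix
high_corr_pairs <- function(corr_matrix, cutoff = 0.75) {
  idx <- which(abs(corr_matrix) >= cutoff & upper.tri(corr_matrix), arr.ind = TRUE)
  data.frame(var1 = rownames(corr_matrix)[idx[, 1]],
             var2 = colnames(corr_matrix)[idx[, 2]],
             correlation = round(corr_matrix[idx], 2))
}
high_corr_pairs(bank_addfull_num_matrix)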
ggcorrplot(bank_addfull_num_matrix, type = "lower", lab = TRUE) + ggtitle("bank_additional_full_df")
ggcorrplot(bank_additional_num_matrix, type = "lower", lab = TRUE) + ggtitle("bank_additional_df")
ggcorrplot(bank_full_num_matrix, type = "lower", lab = TRUE) + ggtitle("bank_full_df")
ggcorrplot(bank_num_matrix, type = "lower", lab = TRUE) + ggtitle("bank_df")
The datasets show a relatively normal distribution for the age of the people contacted, while the number of contacts and call duration are skewed toward low typical values with long tails. It’s interesting that the biggest push for campaigns was during May; I wonder if that’s a popular time of year for new banking services in Portugal. I was at first confused by the “admin” job category and why it was so high, but then I realized it must be the category for non-management white-collar workers.
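To back up those observations, a couple of quick counts (my own addition, assuming the UCI column names month and job) show which months and job categories dominate the contacts:
# Contact volume by month and by job category
bank_additional_full_df %>% count(month, sort = TRUE)
bank_additional_full_df %>% count(job, sort = TRUE)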
# Plot a faceted bar chart for every categorical variable in a data frame.
cat_chart <- function(df, title) {
  df <- df %>% mutate(across(where(is.character), as.factor))  # readr leaves strings as character
  cat_df <- df %>% select(where(is.factor))
  if (ncol(cat_df) == 0) {
    message("No categorical variables")
    return(NULL)
  }
  cat_df %>%
    pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
    ggplot(aes(x = Value, fill = Variable)) +
    geom_bar() +
    facet_wrap(~ Variable, scales = "free") +
    ggtitle(title) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
cat_chart(bank_additional_full_df, "Categorical: bank_additional_full_df")
cat_chart(bank_additional_df, "Categorical: bank_additional_df")
cat_chart(bank_full_df, "Categorical: bank_full_df")
# Plot a faceted histogram for every numeric variable in a data frame.
dist_chart <- function(df, title) {
  df %>%
    select(where(is.numeric)) %>%
    pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
    ggplot(aes(x = Value)) +
    geom_histogram(bins = 30, fill = "blue", alpha = 0.6) +
    facet_wrap(~ Variable, scales = "free") +
    ggtitle(title)
}
dist_chart(bank_additional_full_df, "Numeric: bank_additional_full_df")
dist_chart(bank_additional_df, "Numeric: bank_additional_df")
dist_chart(bank_full_df, "Numeric: bank_full_df")
# Return the outlier values (per boxplot.stats) for each numeric column.
find_out <- function(df) {
  num_df <- df[, sapply(df, is.numeric)]
  lapply(num_df, function(x) boxplot.stats(x)$out)
}

# Count the outliers flagged by boxplot.stats for each numeric column.
summ_out <- function(df) {
  num_df <- df[, sapply(df, is.numeric)]
  data.frame(Variable = names(num_df),
             Outlier_Count = sapply(num_df, function(x) length(boxplot.stats(x)$out)))
}
out_addfull <- find_out(bank_additional_full_df)
summ_addfull <- summ_out(bank_additional_full_df)
cat("bank_additional_full_df outlier count:\n")
## bank_additional_full_df outlier count:
print(summ_addfull)
## Variable Outlier_Count
## age age 469
## duration duration 2963
## campaign campaign 2406
## pdays pdays 1515
## previous previous 5625
## emp.var.rate emp.var.rate 0
## cons.price.idx cons.price.idx 3616
## cons.conf.idx cons.conf.idx 4282
## nr.employed nr.employed 7763
out_additional <- find_out(bank_additional_df)
summ_additional <- summ_out(bank_additional_df)
cat("bank_additional_df outlier count:\n")
## bank_additional_df outlier count:
print(summ_additional)
## Variable Outlier_Count
## age age 39
## duration duration 291
## campaign campaign 235
## pdays pdays 160
## previous previous 596
## emp.var.rate emp.var.rate 0
## cons.price.idx cons.price.idx 386
## cons.conf.idx cons.conf.idx 455
## nr.employed nr.employed 758
out_full <- find_out(bank_full_df)
summ_full <- summ_out(bank_full_df)
cat("bank_full_df outlier count:\n")
## bank_full_df outlier count:
print(summ_full)
## Variable Outlier_Count
## age age 487
## balance balance 4729
## day day 0
## duration duration 3235
## campaign campaign 3064
## pdays pdays 8257
## previous previous 8257
out_bank <- find_out(bank_df)
summ_bank <- summ_out(bank_df)
cat("bank_df outlier count:\n")
## bank_df outlier count:
print(summ_bank)
## Variable Outlier_Count
## age age 38
## balance balance 506
## day day 0
## duration duration 330
## campaign campaign 318
## pdays pdays 816
## previous previous 816
There are a lot of outliers in the duration and campaign variables, speaking to the long tail of marketing campaigns. The high outlier count for previous contacts also suggests wide variety in how many times a given person is contacted.
The bank balances have the largest spread, with a variance in excess of 9 million. Age, by contrast, has a very tight interquartile range of 15, which reaffirms the normal-looking distribution seen above, while the campaign variable’s IQR of 2, together with a median of 2 and an SD of roughly 2.5, indicates extremely tight clustering of values around 2.
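One pre-processing option for these long tails, sketched here as my own illustration rather than a step taken in this analysis, is to winsorize: cap each numeric column at its Tukey fences (1.5 times the IQR beyond the quartiles) instead of dropping rows.
# Cap values outside the Tukey fences rather than removing the rows
cap_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * IQR(x, na.rm = TRUE)
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}
bank_addfull_capped_df <- bank_additional_full_df %>%
  mutate(across(where(is.numeric), cap_outliers))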
options(scipen = 999)
cat("bank_additional_full_df outlier count:\n")
## bank_additional_full_df outlier count:
bank_additional_full_df %>%
summarise(across(where(is.numeric), list(
mean = ~mean(., na.rm = TRUE),
median = ~median(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
var = ~var(., na.rm = TRUE),
IQR = ~IQR(., na.rm = TRUE),
range = ~max(., na.rm = TRUE) - min(., na.rm = TRUE)
))) %>%
pivot_longer(cols = everything(), names_to = "Metric", values_to = "Value")
## # A tibble: 54 × 2
## Metric Value
## <chr> <dbl>
## 1 age_mean 40.0
## 2 age_median 38
## 3 age_sd 10.4
## 4 age_var 109.
## 5 age_IQR 15
## 6 age_range 81
## 7 duration_mean 258.
## 8 duration_median 180
## 9 duration_sd 259.
## 10 duration_var 67226.
## # ℹ 44 more rows
cat("bank_additional_df outlier count:\n")
## bank_additional_df outlier count:
bank_additional_df %>%
summarise(across(where(is.numeric), list(
mean = ~mean(., na.rm = TRUE),
median = ~median(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
var = ~var(., na.rm = TRUE),
IQR = ~IQR(., na.rm = TRUE),
range = ~max(., na.rm = TRUE) - min(., na.rm = TRUE)
))) %>%
pivot_longer(cols = everything(), names_to = "Metric", values_to = "Value")
## # A tibble: 54 × 2
## Metric Value
## <chr> <dbl>
## 1 age_mean 40.1
## 2 age_median 38
## 3 age_sd 10.3
## 4 age_var 106.
## 5 age_IQR 15
## 6 age_range 70
## 7 duration_mean 257.
## 8 duration_median 181
## 9 duration_sd 255.
## 10 duration_var 64874.
## # ℹ 44 more rows
cat("bank_full_df outlier count:\n")
## bank_full_df outlier count:
bank_full_df %>%
summarise(across(where(is.numeric), list(
mean = ~mean(., na.rm = TRUE),
median = ~median(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
var = ~var(., na.rm = TRUE),
IQR = ~IQR(., na.rm = TRUE),
range = ~max(., na.rm = TRUE) - min(., na.rm = TRUE)
))) %>%
pivot_longer(cols = everything(), names_to = "Metric", values_to = "Value")
## # A tibble: 42 × 2
## Metric Value
## <chr> <dbl>
## 1 age_mean 40.9
## 2 age_median 39
## 3 age_sd 10.6
## 4 age_var 113.
## 5 age_IQR 15
## 6 age_range 77
## 7 balance_mean 1362.
## 8 balance_median 448
## 9 balance_sd 3045.
## 10 balance_var 9270599.
## # ℹ 32 more rows
cat("bank_df outlier count:\n")
## bank_df outlier count:
bank_df %>%
summarise(across(where(is.numeric), list(
mean = ~mean(., na.rm = TRUE),
median = ~median(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE),
var = ~var(., na.rm = TRUE),
IQR = ~IQR(., na.rm = TRUE),
range = ~max(., na.rm = TRUE) - min(., na.rm = TRUE)
))) %>%
pivot_longer(cols = everything(), names_to = "Metric", values_to = "Value")
## # A tibble: 42 × 2
## Metric Value
## <chr> <dbl>
## 1 age_mean 41.2
## 2 age_median 39
## 3 age_sd 10.6
## 4 age_var 112.
## 5 age_IQR 16
## 6 age_range 68
## 7 balance_mean 1423.
## 8 balance_median 444
## 9 balance_sd 3010.
## 10 balance_var 9057922.
## # ℹ 32 more rows
Logistic Regression: works well for classification problems and can handle categorical features when properly encoded.
Pros:

* Interpretable and easy to implement
* Works well with linearly separable data and provides probabilistic outputs
Cons:

* Struggles with non-linear relationships unless feature engineering is applied
* Requires categorical encoding
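To make the logistic regression option concrete, here is a minimal sketch (my own illustration, not a tuned or validated model) that assumes the target column is named y and uses a handful of the UCI columns, with no train/test split:
# Minimal logistic regression sketch on a few assumed columns
bank_glm_df <- bank_additional_full_df %>%
  mutate(across(where(is.character), as.factor))
glm_fit <- glm(y ~ age + job + marital + education + contact + campaign,
               data = bank_glm_df, family = binomial)
summary(glm_fit)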
K-Nearest Neighbors (KNN): A non-parametric model that can handle both numeric and categorical features.
Pros:

* Non-parametric, meaning it can capture complex relationships without assuming a distribution.
* Works well with mixed data types if properly preprocessed.
Cons:

* Computationally expensive with large datasets due to distance calculations.
* Performance depends heavily on the choice of k and feature scaling.
Overall, I would recommend using K-Nearest Neighbors (KNN). This is because the datasets have both numeric and categorical features and need a non-parametric model that can capture complex relationships. I just need to be mindful of feature scaling and ensure an appropriate k value is picked.
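A minimal caret sketch of that recommendation follows; it assumes the label column is y, lets the formula interface dummy-encode the categorical columns, centers and scales the features, and tunes k by five-fold cross-validation. The settings are illustrative, not final:
# KNN with scaling and a cross-validated choice of k (illustrative settings)
set.seed(622)
knn_df <- bank_additional_df %>%                     # smaller file keeps this quick
  mutate(across(where(is.character), as.factor))
knn_fit <- train(y ~ ., data = knn_df,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 trControl = trainControl(method = "cv", number = 5),
                 tuneLength = 10)
knn_fit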
The data had labels, which allowed me to choose a supervised learning model that could be trained toward known values. If there weren’t labels, I would have leaned towards a clustering model, like K-means.
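For that unlabeled scenario, a K-means sketch like the following (my own illustration; four centers is an arbitrary starting point) would cluster clients on the scaled numeric features alone:
# K-means on scaled numeric features; centers = 4 is arbitrary here
set.seed(622)
km_input <- bank_additional_full_df %>%
  select(where(is.numeric)) %>%
  scale()
km_fit <- kmeans(km_input, centers = 4, nstart = 25)
table(km_fit$cluster)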
For fewer than 1,000 data points, I would pick logistic regression and focus on binary outputs, given how sparse the data would be; otherwise, overfitting becomes a risk.
The numeric features need to be normalized, with different types of features getting different normalization.
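As a sketch of that idea, my own and not prescribed by the assignment, caret’s preProcess can center and scale the numeric columns; method = "range" could be substituted for heavily skewed variables:
# Center/scale the numeric columns (swap in "range" for skewed ones if desired)
num_cols <- bank_additional_full_df %>% select(where(is.numeric))
pp <- preProcess(num_cols, method = c("center", "scale"))
num_scaled <- predict(pp, num_cols)
summary(num_scaled$age)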
Missing values would be imputed where possible. For age, a combination of the Job, Housing, and Marital values can be used to form a reasonable estimate of someone’s age; the imputed value would be the median age for that combination of the three fields (a sketch follows below). I would drop the unknowns for Housing and Marital, since those counts look relatively small.
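Here is a sketch of that imputation rule (my own illustration; the bank files have no missing ages, so this is written as if some were NA), grouping on the job, housing, and marital columns and filling with the group median:
# Fill missing ages with the median age of the matching job/housing/marital group
bank_imputed_df <- bank_full_df %>%
  group_by(job, housing, marital) %>%
  mutate(age = ifelse(is.na(age), median(age, na.rm = TRUE), age)) %>%
  ungroup()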
I would not need to change the size of this dataset because it’s small to begin with. If it were 100X larger, I would randomly sample 10% to start and see if that was enough.
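If that larger-data scenario arose, the 10% sample could be drawn reproducibly with slice_sample(); this sketch just reuses the existing bank_full_df to show the call:
# Draw a reproducible 10% random sample
set.seed(622)
bank_sample_df <- bank_full_df %>% slice_sample(prop = 0.10)
nrow(bank_sample_df)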