Introduction
This assignment focuses on one of the most important aspects of data science, Exploratory Data Analysis (EDA). Many surveys show that data scientists spend 60-80% of their time on data preparation. EDA allows you to identify data gaps & data imbalances, improve data quality, create better features and gain a deep understanding of your data before doing model training - and that ultimately helps train better models. In machine learning, there is a saying - “better data beats better algorithms” - meaning that it is more productive to spend time improving data quality than improving the code to train the model.
This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both datasets selected and compare the results.
Dataset
A Portuguese bank conducted a marketing campaign (phone calls) to predict if a client will subscribe to a term deposit The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and figure out most effective tactics that will help the bank in next campaign to persuade more customers to subscribe to the bank’s term deposit. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
Assignment
Review the structure and content of the data and answer questions such as:
Are the features (columns) of your data correlated?
What is the overall distribution of each variable?
Are there any outliers present?
What are the relationships between different variables?
How are categorical variables distributed?
Do any patterns or trends emerge in the data?
What is the central tendency and spread of each variable?
Are there any missing values and how significant are they?
Now you have completed the EDA, what Algorithms would suit the business purpose for the dataset. Answer questions such as:
Select two or more machine learning algorithms presented so far that could be used to train a model (no need to train models - I am only looking for your recommendations).
What are the pros and cons of each algorithm you selected?
Which algorithm would you recommend, and why?
Are there labels in your data? Did that impact your choice of algorithm?
How does your choice of algorithm relates to the dataset?
Would your choice of algorithm change if there were fewer than 1,000 data records, and why?
Data Cleaning - improve data quality, address missing data, etc.
Dimensionality Reduction - remove correlated/redundant data than will slow down training
Feature Engineering - use of business knowledge to create new features
Sampling Data - using sampling to resize datasets
Data Transformation - regularization, normalization, handling categorical variables
Imbalanced Data - reducing the imbalance between classes
Deliverable
Write a short essay summarizing your findings. Explain your selection of algorithms and how they relate to the data and what you are trying to do. Format: PDF Code This should include your code, as well as the outputs of your code e.g. correlation chart Format:
This should be saved in https://rpubs.com. Please provide a link to your code in the submission.
library(dplyr)
library("tidyverse")
library(ggplot2)
library(corrplot)
library(forcats)
library(GGally)
There are 45,211 observations and 17 columns.
Below is the description of the variables from UCI Machine Learning.
Variable | Description |
---|---|
age | Age of client |
job | Type of job the client has (admin, blue-collar, entrepreneur, housemaid, management, retired, self -employed, services, student, technician, unemployed, unknown) |
martial | Martial status of client (divorced, married, single, unknown) |
education | Education level of client (primary, secondary, tertiary, unknown) |
default | Has the client’s credit default? (yes/no) |
balance | average yearly balance of client in euros (numeric) |
housing | Does the client have a housing loan? (yes/no) |
loan | Does the client have a personal loan? (yes/no) |
contact | contact communication type of last contact of the current campaign (telephone, cellular, unknown) |
day | last contact day of the month (numeric) |
month | last contact month of year (jan, feb, mar, …, dec) |
duration | last contact duration, in seconds |
campaign | number of contacts performed during this campaign and for this client (includes last contact) |
pdays | number of days that passed by after the client was last contacted from a previous campaign (-1 means client was not previously contacted) |
previous | number of contacts performed before this campaign and for this client |
poutcome | outcome of the previous marketing campaign (unknown, other, failure, success) |
y | has the client subscribed a term deposit? yes/no) |
raw_data <- read.csv("https://raw.githubusercontent.com/suswong/DATA-622/refs/heads/main/bank-full.csv",sep=";")
str(data)
## function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"),
## envir = .GlobalEnv, overwrite = TRUE)
Below is the summary of the data. It reveals that campaigns reached a broad demographic but was not very effective as indicated by the low percentage of clients that subscribed a term deposit.
Demographic of Client
Financial Information of Client
Contact and Campaign Information
factor <- c("job", "marital", "education", "contact", "month", "poutcome", "default", "housing", "loan", "y")
data <- raw_data %>%
mutate_at(factor, as.factor)
summary(data)
## age job marital education
## Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851
## 1st Qu.:33.00 management :9458 married :27214 secondary:23202
## Median :39.00 technician :7597 single :12790 tertiary :13301
## Mean :40.94 admin. :5171 unknown : 1857
## 3rd Qu.:48.00 services :4154
## Max. :95.00 retired :2264
## (Other) :6835
## default balance housing loan contact
## no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
## yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
## Median : 448 unknown :13020
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
##
## day month duration campaign
## Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
## Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
## (Other): 6060
## pdays previous poutcome y
## Min. : -1.0 Min. : 0.0000 failure: 4901 no :39922
## 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 yes: 5289
## Median : -1.0 Median : 0.0000 success: 1511
## Mean : 40.2 Mean : 0.5803 unknown:36959
## 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :871.0 Max. :275.0000
##
All of the numerical variables are right skewed except “day”.
data %>%
select_if(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
geom_histogram(bins=30, fill="lightblue", color="black") +
facet_wrap(~key, scales="free") +
theme_minimal() +
ggtitle("Distribution of Numeric Variables")
categorical_cols <- c("job", "marital", "education", "default", "housing", "loan", "contact", "poutcome", "month", "y")
for (col in categorical_cols) {
data %>%
count(!!sym(col)) %>%
mutate(percentage = n / sum(n) * 100) %>%
mutate(!!col := fct_reorder(!!sym(col), percentage, .desc = TRUE)) %>%
ggplot(aes_string(x = col, y = "percentage", fill = col)) +
geom_bar(stat = "identity") +
geom_text(aes(label = sprintf("%.1f%%", percentage)), vjust = -0.5) +
theme_minimal() +
ggtitle(paste("Distribution of", col)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set3") +
ylab("Percentage") -> p
print(p)
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
From the code, there are no missing values. However, some client data has unknown values in categorical variables. About 83% of the client data has at least 1 “unknown” value in their data. From the distribution of categorical variables, the percentage of unknown value for: - job is 0.6% - education is 4.1% - contact is 6.4% - poutcome is 81.7%
sum(is.na(data))
## [1] 0
unknown_rows <- sum(apply(data, 1, function(row) any(row == "unknown")))
total_rows <- nrow(data)
percentage_unknown <- (unknown_rows / total_rows) * 100
percentage_unknown
## [1] 82.65466
There are no duplicated data
sum(duplicated(data))
## [1] 0
Age and balance appear to have a very weak correlation.
pdays and previous have a strong postive correlation.
Campaign and duration has a slight negative correlation. This may indicate that the more times a client is contacted, the shorter the call duration becomes.
cor_matrix <- cor(data %>% select_if(is.numeric))
corrplot(cor_matrix, method="color", type="upper", tl.col="black", tl.srt=45)
The majority of clients have a balance below 25,000. There are some outliers with significant higher balance. The balance seem spread out across all ages.
ggplot(data, aes(x = age, y = balance)) +
geom_point(alpha = 0.5, color = "blue") +
labs(title = "Scatterplot of Age vs Balance",
x = "Age",
y = "Balance") +
theme_minimal()
Campaign and duration has a slight negative correlation. This may indicate that the more times a client is contacted, the shorter the call duration becomes.
ggplot(data, aes(x = campaign, y = duration)) +
geom_point(alpha = 0.5, color = "blue") +
labs(title = "Scatterplot of Campaign vs Duration",
x = "campaign",
y = "duration") +
theme_minimal()
Clients with no loan has a significant number of high-age outliers.
ggplot(data, aes(x = loan, y = age, fill = loan)) +
geom_boxplot() +
labs(title = "Age Distribution by Loan Status",
x = "Has Personal Loan?",
y = "Age",
fill = "Has Loan?") +
theme_minimal()
There are significantly more older age outliers in clients who does nothave a credit default. Older individuals are less likely to default as it has a higher median and broader age group.
ggplot(data, aes(x = default, y = age, fill = default)) +
geom_boxplot() +
labs(title = "Age Distribution by Default Status",
x = "Has Credit Defaulted?",
y = "Age",
fill = "Has Credit Defaulted?") +
theme_minimal()
Since the subscription rate (“yes” for term deposit) is 11.7%, there is class imbalance. Proportions normalization the data within each group in necessary to make a fair comparison.
There appears to be slight higher porportion in clients with teritary education level that did subscribe to the term deposit.
education_y_counts <- data %>%
group_by(education, y) %>%
summarize(n = n()) %>%
ungroup()
## `summarise()` has grouped output by 'education'. You can override using the
## `.groups` argument.
ggplot(education_y_counts, aes(x = education, y = n, fill = y)) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Term Deposit Subscription by Education Level (Normalized)",
x = "Education Level",
y = "Proportion of Clients",
fill = "Subscribed?") +
theme_minimal()
The students, and retired clients has a higher percentage that did subscribe to the term deposit
job_y_counts <- data %>%
group_by(job, y) %>%
summarize(n = n()) %>%
ungroup()
## `summarise()` has grouped output by 'job'. You can override using the `.groups`
## argument.
ggplot(job_y_counts, aes(x = job, y = n, fill = y)) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Term Deposit Subscription by Job Type (Normalized)",
x = "Job Type",
y = "Proportion of Clients",
fill = "Subscribed?") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Clients who are single has a higher percentage that subscribe to the term deposit.
marital_y_counts <- data %>%
group_by(marital, y) %>%
summarize(n = n()) %>%
ungroup()
## `summarise()` has grouped output by 'marital'. You can override using the
## `.groups` argument.
ggplot(marital_y_counts, aes(x = marital, y = n, fill = y)) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Term Deposit Subscription by Marital Status (Normalized)",
x = "Marital Status",
y = "Proportion of Clients",
fill = "Subscribed?") +
theme_minimal()
Although it is minimal, there appears to be a slighter higher percentage of clients with housing loans that did subscribe to the term deposit.
housing_y_counts <- data %>%
group_by(housing, y) %>%
summarize(n = n()) %>%
ungroup()
## `summarise()` has grouped output by 'housing'. You can override using the
## `.groups` argument.
ggplot(housing_y_counts, aes(x = housing, y = n, fill = y)) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Term Deposit Subscription by Housing Loan Status",
x = "Housing Loan",
y = "Proportion of Clients",
fill = "Subscribed?") +
theme_minimal()
There are higher subscription rate among the younger (18-25) and older age group (65+).
data <- data %>%
mutate(age_group = case_when(
age >= 18 & age <= 25 ~ "18-25",
age >= 26 & age <= 35 ~ "26-35",
age >= 36 & age <= 45 ~ "36-45",
age >= 46 & age <= 55 ~ "46-55",
age >= 56 & age <= 65 ~ "56-65",
age >= 66 & age <= 75 ~ "66-75",
age >= 76 ~ "76+",
TRUE ~ NA_character_
))
table(data$age_group)
##
## 18-25 26-35 36-45 46-55 56-65 66-75 76+
## 1336 15571 13856 9548 4149 490 261
age_group_summary <- data %>%
group_by(age_group) %>%
summarise(
total = n(),
subscribed_yes = sum(y == "yes"),
prop_subscribed = subscribed_yes / total
) %>%
arrange(age_group)
print(age_group_summary)
## # A tibble: 7 × 4
## age_group total subscribed_yes prop_subscribed
## <chr> <int> <int> <dbl>
## 1 18-25 1336 320 0.240
## 2 26-35 15571 1869 0.120
## 3 36-45 13856 1301 0.0939
## 4 46-55 9548 893 0.0935
## 5 56-65 4149 586 0.141
## 6 66-75 490 203 0.414
## 7 76+ 261 117 0.448
ggplot(age_group_summary, aes(x = age_group, y = prop_subscribed)) +
geom_bar(stat = "identity", fill = "light blue") +
geom_text(aes(label = scales::percent(prop_subscribed)), vjust = -0.5) +
labs(
title = "Subscription Rate by Age Group",
x = "Age Group",
y = "Proportion Subscribed"
) +
theme_minimal()
Extreme outliers are present in both subscribers and non-scribers. However, it is more promininet in non-subscribers. Both groups present a wide range of balances, but non-subscribers has more high-balance outliers.
ggplot(data, aes(x = y, y = balance, fill = y)) +
geom_boxplot() +
labs(title = "Balance Distribution by Term Deposit Subscription",
x = "Subscribed?",
y = "Balance",
fill = "Subscribed?") +
theme_minimal()
Celluar has the highest subscription rate, making it the most effective method of marketing campaign.
contact_y_counts <- data %>%
group_by(contact, y) %>%
summarize(n = n()) %>%
ungroup()
## `summarise()` has grouped output by 'contact'. You can override using the
## `.groups` argument.
ggplot(contact_y_counts, aes(x = contact, y = n, fill = y)) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Term Deposit Subscription by Contact Method (Normalized)",
x = "Contact Method",
y = "Proportion of Clients",
fill = "Subscribed?") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Subscription rate varies month to month. March, September, and October show higher proportions of term deposit subscriptions. Recall from earlier on the distribution of the months, campaign efforts were aggressive in May at it accounts for about 30% of the campaign, followed by July (15.3%) and Auguest (13.8%). Yet, these months were among the lowest subscription rates, suggesting a not very successful term deposit campaign.
month_y_counts <- data %>%
group_by(month, y) %>%
summarize(n = n()) %>%
ungroup()
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
# b. Create the normalized stacked bar chart
ggplot(month_y_counts, aes(x = month, y = n, fill = y)) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Term Deposit Subscription by Month (Normalized)",
x = "Month",
y = "Proportion of Clients",
fill = "Subscribed?") +
theme_minimal() +
scale_x_discrete(limits = c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec")) + # Ensure months are in order
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Both groups of subscribers and non-subscribers present outliers with very long contact duration. However, it more present in subscribers. The spread of contact duration is greater in subscribers.
ggplot(data, aes(x = y, y = duration, fill = y)) +
geom_boxplot() +
labs(title = "Duration Distribution by Term Deposit Subscription",
x = "Subscribed?",
y = "Duration (seconds)",
fill = "Subscribed?") +
theme_minimal()
Outliers are present in both groups but there are extreme outliers where clients are contacted over 40 times. Repeated contacts may not guarantee more subscribers.
ggplot(data, aes(x = y, y = campaign, fill = y)) +
geom_boxplot() +
labs(title = "Number of Contacts by Term Deposit Subscription",
x = "Subscribed?",
y = "Number of Contacts (Campaign)",
fill = "Subscribed?") +
theme_minimal()
Both groups show similar distribution. Most clients regardless if they subscribed or not have very few previous contact. The number of previous contacts may not have much influence on subscription rates.
ggplot(data, aes(x = y, y = previous, fill = y)) +
geom_boxplot() +
labs(title = "Number of Previous Contacts by Term Deposit Subscription",
x = "Subscribed?",
y = "Number of Previous Contacts",
fill = "Subscribed?") +
theme_minimal()
Since there are labels (term deposits subscriber: yes or no) in our dataset, I would recommend to use supervised learning algorithm such as Random Forest. About 83% of the data contains at 1 “unknown” value. Random Forest can handle missing values without the step of imputation. The dataset contains only about 11% subscribers which shows class imbalance. Random Forest is less sensitive to class imbalance. It uses ensemble learning, which reduces overfitting. However, it is computational slow to use as we have a large dataset. Alternatively, if there were fewer than 1,000 data records, we can use Naive Bases as it is simple and fast, and works well with categorical data. However, it may be sensitive to class imbalance.
For optimal model performance, we need to perform pre-processing steps such as data cleaning. We need to address the “unknown value” in the categorical variables. The variable poutcome (previous outcome) has the highest percentage (81.7%) of “unknown” values. I would remove it as most clients regardless if they subscribed or not have very few previous contact. The number of previous contacts may not have much influence on subscription rates. Due to the right skewness of all numerical variable except day, we would need to perform data transformation such as box cox transformation. Categorical variables will needed to be converted to numerical format using label encoding. Moreover, we should introduce new features such as age groups and contact duration (short, medium, long). Sampling techniques such as SMOTE should be use to address imbalance class.