This assignment focuses on one of the most important aspects of data science: Exploratory Data Analysis (EDA). Many surveys show that data scientists spend 60-80% of their time on data preparation. EDA allows you to identify data gaps and class imbalances, improve data quality, create better features, and gain a deep understanding of your data before model training, which ultimately helps train better models. In machine learning there is a saying, “better data beats better algorithms,” meaning that it is more productive to spend time improving data quality than tuning the model-training code.
library(tidyverse)
library(openintro)
library(infer)
library(dplyr)
library(knitr)
library(corrplot)
library(ggcorrplot)
library(ggthemes)
A Portuguese bank conducted a marketing campaign (phone calls) to predict whether a client would subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in its next campaign. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
# import data into R from my GitHub account and check column names
bank_data <- read.csv("https://raw.githubusercontent.com/vitugo23/DATA622/refs/heads/main/bank-full.csv", sep = ";")
# Preview the first few rows of the dataset
kable(head(bank_data, 10), caption = "Bank Dataset")
| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |
| 35 | management | married | tertiary | no | 231 | yes | no | unknown | 5 | may | 139 | 1 | -1 | 0 | unknown | no |
| 28 | management | single | tertiary | no | 447 | yes | yes | unknown | 5 | may | 217 | 1 | -1 | 0 | unknown | no |
| 42 | entrepreneur | divorced | tertiary | yes | 2 | yes | no | unknown | 5 | may | 380 | 1 | -1 | 0 | unknown | no |
| 58 | retired | married | primary | no | 121 | yes | no | unknown | 5 | may | 50 | 1 | -1 | 0 | unknown | no |
| 43 | technician | single | secondary | no | 593 | yes | no | unknown | 5 | may | 55 | 1 | -1 | 0 | unknown | no |
# review column names
colnames(bank_data)
## [1] "age" "job" "marital" "education" "default" "balance"
## [7] "housing" "loan" "contact" "day" "month" "duration"
## [13] "campaign" "pdays" "previous" "poutcome" "y"
The “Bank Dataset” contains rich data about the bank’s marketing campaigns and interactions with customers. The most important columns are:

- age: customer’s age.
- job: customer’s job description.
- marital: customer’s marital status.
- education: education level.
- default: whether the customer has credit in default.
- balance: average yearly balance of the customer’s account.
- housing: whether the customer has a housing loan.
- loan: whether the customer has a personal loan.
- contact: contact communication type.
- campaign: marketing campaign contact details.
- y: whether the customer subscribed to a term deposit.
# rename column y to term for readability
bank_data <- bank_data %>% rename(term = y)
# check dataset columns summary statistics
summary(bank_data)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## term
## Length:45211
## Class :character
## Mode :character
##
##
##
The dataset has 45,211 records. The mean, median, and quartiles vary widely between the numeric columns; for example, balance ranges from -8,019 to 102,127, and pdays has a median of -1, which in this dataset indicates the client was not contacted in a previous campaign.
# review data structure on dataset
str(bank_data)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ term : chr "no" "no" "no" "no" ...
The dataset contains a mix of integer and character columns, and everything looks ready to begin the EDA.
Review the structure and content of the data, starting with the distributions of the numeric variables.
# Select the numeric variables from the dataset
numeric_data <- bank_data %>% select(where(is.numeric))
# Change data from wide to long format for plotting
numeric_data_long <- numeric_data %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
# Plot histograms for numeric variables
ggplot(numeric_data_long, aes(x = Value)) +
geom_histogram(fill = "blue", color = "black", bins = 30) +
facet_wrap(~ Variable, scales = "free") +
theme_minimal() +
ggtitle("Distribution of Numeric Variables") +
labs(x = "Value", y = "Count")
Most of the numeric variables are right-skewed rather than normally distributed. For example, age is right-skewed, which tells us the client base leans younger and middle-aged, with fewer older clients. Balance, duration, campaign, pdays, and previous are even more heavily concentrated near small values, while day is spread fairly evenly across the month.
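To back up this reading of the histograms with numbers, a quick moment-based skewness calculation can be run on the numeric columns (a minimal sketch using the numeric_data frame created above; positive values indicate right skew):

# Moment-based skewness for each numeric column (sketch, no extra packages needed)
numeric_data %>%
  summarise(across(everything(),
                   ~ mean((.x - mean(.x))^3) / sd(.x)^3)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Skewness") %>%
  arrange(desc(Skewness)) %>%
  kable(caption = "Skewness of Numeric Variables")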
# Select numerical variables
numeric_data <- bank_data %>%
select(where(is.numeric))
# Pivot the data to long format for plotting
numeric_data_long <- numeric_data %>%
mutate(id = row_number()) %>%
pivot_longer(cols = -id, names_to = "variable", values_to = "value")
# Create scatterplots for each numeric variable
ggplot(numeric_data_long, aes(x = id, y = value)) +
geom_point(alpha = 0.6, color = "darkblue") +
facet_wrap(~variable, scales = "free", ncol = 5) +
theme_minimal() +
theme(
strip.text.x = element_text(size = 10),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
labs(
x = "ID (Row Number)",
y = "Value",
title = "Scatterplots of Numerical Variables"
)
Based on the scatterplots, each feature appears to have outliers, with the exception of ‘day’.
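As a rough check on this observation, the 1.5 × IQR rule can be used to count potential outliers per variable (a sketch; the 1.5 multiplier is the usual convention, not a requirement of the assignment):

# Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric variable
numeric_data %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
  group_by(Variable) %>%
  summarise(Outliers = sum(Value < quantile(Value, 0.25) - 1.5 * IQR(Value) |
                           Value > quantile(Value, 0.75) + 1.5 * IQR(Value))) %>%
  arrange(desc(Outliers)) %>%
  kable(caption = "Potential Outliers per Numeric Variable (1.5 * IQR rule)")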
# Correlation Matrix for Numeric Variables
numeric_data <- bank_data %>% select(where(is.numeric))
correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
# corr matrix
ggcorrplot(correlation_matrix,
method = "circle",
type = "lower",
lab = TRUE,
title = "Correlation Matrix of Numeric Variables")
The correlation matrix above only covers the numeric variables, and none of them show a strong linear relationship with one another. The most notable correlation is between pdays and previous, which is expected since both describe contact from a previous campaign. Relationships involving the categorical variables (such as job and education) are not captured by Pearson correlation and would require a different measure of association.
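To read the exact values off the matrix, the pairwise correlations can be flattened and sorted (a small sketch using the correlation_matrix object computed above):

# Flatten the correlation matrix and list each numeric pair once, strongest first
cor_pairs <- as.data.frame(as.table(correlation_matrix)) %>%
  rename(Var_1 = Var1, Var_2 = Var2, Correlation = Freq) %>%
  filter(as.character(Var_1) < as.character(Var_2)) %>%
  arrange(desc(abs(Correlation)))
head(cor_pairs)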
# Select categorical variables
categorical_vars <- bank_data %>% select_if(is.character)
# Bar plots for categorical variables
categorical_vars %>%
gather(key = "Variable", value = "Value") %>%
ggplot(aes(x = Value)) +
geom_bar(fill = "blue", color = "black") +
facet_wrap(~ Variable, scales = "free") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Distribution of Categorical Variables")
The plots show that most categorical variables are dominated by one or two levels: for example, cellular is by far the most common contact type, secondary is the most common education level, and the target variable term is heavily skewed toward “no”.
Based on the plots above, the most notable patterns are:

- The dataset is highly imbalanced with respect to the target variable.
- Most of the features are not normally distributed.
- Most of the features do not have a linear relationship with the dependent variable.
- Feature independence is most likely violated: job, age, education, marital status, and housing loan are likely to be correlated with one another.
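The imbalance can be quantified directly from the target variable (a quick check using the renamed term column):

# Proportion of clients who did and did not subscribe to a term deposit
bank_data %>%
  count(term) %>%
  mutate(Proportion = n / sum(n)) %>%
  kable(caption = "Class Balance of the Target Variable")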
# Central tendency and spread for numeric variables
numeric_data %>%
gather(key = "Variable", value = "Value") %>%
group_by(Variable) %>%
summarize(
Mean = mean(Value),
Median = median(Value),
SD = sd(Value),
IQR = IQR(Value)
) %>%
kable(caption = "Central Tendency and Spread of Numeric Variables")
| Variable | Mean | Median | SD | IQR |
|---|---|---|---|---|
| age | 40.9362102 | 39 | 10.618762 | 15 |
| balance | 1362.2720577 | 448 | 3044.765829 | 1356 |
| campaign | 2.7638407 | 2 | 3.098021 | 2 |
| day | 15.8064188 | 16 | 8.322476 | 13 |
| duration | 258.1630798 | 180 | 257.527812 | 216 |
| pdays | 40.1978280 | -1 | 100.128746 | 0 |
| previous | 0.5803234 | 0 | 2.303441 | 0 |
The table above reveals that the customer base is generally middle-aged and that most clients have modest account balances, with a few very high balances leading to a skewed distribution. The marketing strategy involved minimal contact per customer, with interactions occurring fairly evenly throughout the month. Calls were generally brief, but some lasted significantly longer, suggesting varying levels of engagement. Notably, the majority of customers were contacted for the first time during this campaign (pdays of -1 and previous of 0), highlighting a strategy focused on reaching new prospects.
missing_summary <- bank_data %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
arrange(desc(Missing_Count))
print(missing_summary)
## # A tibble: 17 × 2
## Variable Missing_Count
## <chr> <int>
## 1 age 0
## 2 job 0
## 3 marital 0
## 4 education 0
## 5 default 0
## 6 balance 0
## 7 housing 0
## 8 loan 0
## 9 contact 0
## 10 day 0
## 11 month 0
## 12 duration 0
## 13 campaign 0
## 14 pdays 0
## 15 previous 0
## 16 poutcome 0
## 17 term 0
The table above shows that there are no missing (NA) values in the dataset; however, several categorical columns use the placeholder “unknown”, which is examined in the pre-processing section below.
Based on what I have learned so far in this program, and on the dataset’s mix of numeric and categorical features, I suggest considering two algorithms: the Random Forest classifier and K-Nearest Neighbors (KNN).
I would recommend Random Forest for this project, since the dataset is highly imbalanced, contains both numerical and categorical variables, is relatively large, and is not highly dimensional.
KNN is computationally expensive with large datasets, so it is not recommended for this project.
Since the dataset has a labeled dependent variable (term), I strongly advise the use of supervised learning algorithms.
As mentioned before, Random Forest is the best fit for this project because it handles imbalanced data reasonably well and works naturally with a mix of numerical and categorical variables.
If the dataset had fewer than 1,000 records I would use KNN, because Random Forest can overfit when trained on small datasets, whereas KNN works well with small, mixed datasets.
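To illustrate the recommendation, below is a minimal Random Forest sketch. It assumes the randomForest package is installed, uses an arbitrary seed, and is fit on the lightly prepared data available at this point in the analysis; the pre-processing described in the next section would normally be applied first.

# Minimal Random Forest sketch (assumes the randomForest package is available)
library(randomForest)

set.seed(622)  # arbitrary seed, chosen only for reproducibility
rf_demo_data <- bank_data %>%
  mutate(across(where(is.character), as.factor))  # randomForest needs factors, not characters

rf_fit <- randomForest(term ~ ., data = rf_demo_data, ntree = 200, importance = TRUE)
rf_fit              # out-of-bag error and confusion matrix
varImpPlot(rf_fit)  # which features drive the predictions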
Now that the EDA is done and an algorithm has been selected, what pre-processing (if any) is required?
The dataset does not contain any missing (NA) values, so there is no imputation to perform; however, I can check for “unknown” placeholder values in the categorical columns with a simple loop.
for (col in colnames(bank_data)) {
if (is.factor(bank_data[[col]]) || is.character(bank_data[[col]])) {
unknown_count <- sum(bank_data[[col]] == "unknown", na.rm = TRUE)
if (unknown_count > 0) {
cat(sprintf("%s: %d unknown values (%.2f%%)\n", col, unknown_count, (unknown_count / nrow(bank_data) * 100)))
}
}
}
## job: 288 unknown values (0.64%)
## education: 1857 unknown values (4.11%)
## contact: 13020 unknown values (28.80%)
## poutcome: 36959 unknown values (81.75%)
The categorical variable with the most unknown values is poutcome, with more than 80% of its values unknown; contact is also affected, at roughly 29%.
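One option, which I do not apply here since poutcome is dropped later anyway, is to recode “unknown” as NA so it can be imputed or handled explicitly (a sketch):

# Treat "unknown" as missing in all character columns (sketch, not applied)
bank_data_na <- bank_data %>%
  mutate(across(where(is.character), ~ na_if(.x, "unknown")))
colSums(is.na(bank_data_na))  # NA counts now mirror the unknown counts above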
I can also create new features from the existing columns: an age group and a credit-risk category. In addition, I will remove some features that I do not consider necessary for this project (pdays, poutcome, and duration).
# create new feature called age_group from age column
bank_data <- bank_data %>%
mutate(age_group = case_when(
age < 25 ~ "Young",
age >= 25 & age < 50 ~ "Middle-aged",
age >= 50 ~ "Senior"
))
# create new feature called credit_risk from balance column
bank_data <- bank_data %>%
mutate(credit_risk = case_when(
balance < 0 | loan == "yes" ~ "High Risk",
balance >= 0 & balance < 5000 & loan == "no" ~ "Medium Risk",
balance >= 5000 & loan == "no" ~ "Low Risk"
))
# Remove specific columns from dataset
bank_final <- bank_data[ , !(names(bank_data) %in% c("pdays", "poutcome", "duration"))]
# Print column names of refined dataset
print(names(bank_final))
## [1] "age" "job" "marital" "education" "default"
## [6] "balance" "housing" "loan" "contact" "day"
## [11] "month" "campaign" "previous" "term" "age_group"
## [16] "credit_risk"
# Convert columns to factors
bank_final[c("day", "campaign", "previous")] <- lapply(bank_final[c("day", "campaign", "previous")], factor)
# check datatypes on columns
sapply(bank_final, class)
## age job marital education default balance
## "integer" "character" "character" "character" "character" "integer"
## housing loan contact day month campaign
## "character" "character" "character" "factor" "character" "factor"
## previous term age_group credit_risk
## "factor" "character" "character" "character"
# Convert the remaining categorical columns, including the target, to factors
bank_final[c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "campaign", "previous", "term")] <-
  lapply(bank_final[c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "campaign", "previous", "term")], factor)
# confirm datatypes of each column
sapply(bank_final, class)
## age job marital education default balance
## "integer" "factor" "factor" "factor" "factor" "integer"
## housing loan contact day month campaign
## "factor" "factor" "factor" "factor" "factor" "factor"
## previous term age_group credit_risk
## "factor" "factor" "character" "character"
After these modifications (creating new features, converting categorical columns to factors, and removing unnecessary columns), the dataset is ready for model training. Note that these steps do not by themselves reduce the imbalance between classes; the imbalance in the target variable still needs to be addressed at training time, for example through resampling or class weights.
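As a final sketch, below is one way to prepare for that step: a simple train/test split followed by down-sampling of the majority class in the training set. The 70/30 split, the seed, and the choice of down-sampling (rather than class weights) are assumptions made for illustration, not part of the assignment.

# Train/test split plus down-sampling of the majority class (sketch)
set.seed(622)  # arbitrary seed for reproducibility
train_idx <- sample(seq_len(nrow(bank_final)), size = floor(0.7 * nrow(bank_final)))
train <- bank_final[train_idx, ]
test  <- bank_final[-train_idx, ]

# Down-sample the majority class ("no") in the training set only
n_min <- min(table(train$term))
train_balanced <- train %>%
  group_by(term) %>%
  slice_sample(n = n_min) %>%
  ungroup()

table(train_balanced$term)  # both classes now equally represented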