Exploratory analysis and essay
Introduction
This assignment focuses on one of the most important aspects of data science: Exploratory Data Analysis (EDA). Many surveys show that data scientists spend 60-80% of their time on data preparation. EDA allows you to identify data gaps and data imbalances, improve data quality, create better features, and gain a deep understanding of your data before model training, which ultimately helps train better models. In machine learning there is a saying, “better data beats better algorithms”, meaning that it is usually more productive to spend time improving data quality than improving the code used to train the model.
This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both of the selected datasets and compare the results.
Dataset
A Portuguese bank conducted a marketing campaign (phone calls) to predict whether a client would subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in its next campaign. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
Assignment
Exploratory Data Analysis
Review the structure and content of the data and answer questions such as:
Are the features (columns) of your data correlated?
What is the overall distribution of each variable?
Are there any outliers present?
What are the relationships between different variables?
How are categorical variables distributed?
Do any patterns or trends emerge in the data?
What is the central tendency and spread of each variable?
Are there any missing values and how significant are they?
This exploratory analysis is based on a dataset from a Portuguese bank that launched a telemarketing campaign to offer term deposit products. The dataset contains 41,188 records and 21 features, including a mix of numeric and categorical variables. The goal is to identify patterns in client data and recommend machine learning models that can help predict whether a client will subscribe to a term deposit in future campaigns.
# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.95 loaded
library(knitr)
library(ggplot2)
library(ggcorrplot)
library(psych)
## Warning: package 'psych' was built under R version 4.3.3
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# Load the data
bank_data <- read_delim("/Users/zigcah/Downloads/bank+marketing/bank-additional/bank-additional-full.csv", delim = ";")
## Rows: 41188 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (11): job, marital, education, default, housing, loan, contact, month, d...
## dbl (10): age, duration, campaign, pdays, previous, emp.var.rate, cons.price...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Convert character columns to factors
bank_data <- bank_data %>% mutate(across(where(is.character), as.factor))
# Structure of the dataset
str(bank_data)
## tibble [41,188 × 21] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:41188] 56 57 37 40 56 45 59 41 24 25 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : num [1:41188] 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : num [1:41188] 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : num [1:41188] 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : num [1:41188] 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ emp.var.rate : num [1:41188] 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num [1:41188] 94 94 94 94 94 ...
## $ cons.conf.idx : num [1:41188] -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num [1:41188] 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num [1:41188] 5191 5191 5191 5191 5191 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# Check for duplicate rows
sum(duplicated(bank_data))
## [1] 12
# Check for missing values
colSums(is.na(bank_data))
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y
## 0
# Summary of the dataset
summary(bank_data)
## age job marital
## Min. :17.00 admin. :10422 divorced: 4612
## 1st Qu.:32.00 blue-collar: 9254 married :24928
## Median :38.00 technician : 6743 single :11568
## Mean :40.02 services : 3969 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1720
## (Other) : 6156
## education default housing loan
## university.degree :12168 no :32588 no :18622 no :33950
## high.school : 9515 unknown: 8597 unknown: 990 unknown: 990
## basic.9y : 6045 yes : 3 yes :21576 yes : 6248
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## contact month day_of_week duration
## cellular :26144 may :13769 fri:7827 Min. : 0.0
## telephone:15044 jul : 7174 mon:8514 1st Qu.: 102.0
## aug : 6178 thu:8623 Median : 180.0
## jun : 5318 tue:8090 Mean : 258.3
## nov : 4101 wed:8134 3rd Qu.: 319.0
## apr : 2632 Max. :4918.0
## (Other): 2016
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 0.0 Min. :0.000 failure : 4252
## 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 nonexistent:35563
## Median : 2.000 Median :999.0 Median :0.000 success : 1373
## Mean : 2.568 Mean :962.5 Mean :0.173
## 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :56.000 Max. :999.0 Max. :7.000
##
## emp.var.rate cons.price.idx cons.conf.idx euribor3m
## Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.634
## 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344
## Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
## Mean : 0.08189 Mean :93.58 Mean :-40.5 Mean :3.621
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961
## Max. : 1.40000 Max. :94.77 Max. :-26.9 Max. :5.045
##
## nr.employed y
## Min. :4964 no :36548
## 1st Qu.:5099 yes: 4640
## Median :5191
## Mean :5167
## 3rd Qu.:5228
## Max. :5228
##
# Skim summary
skim(bank_data)
| Name | bank_data |
| Number of rows | 41188 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| factor | 11 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| job | 0 | 1 | FALSE | 12 | adm: 10422, blu: 9254, tec: 6743, ser: 3969 |
| marital | 0 | 1 | FALSE | 4 | mar: 24928, sin: 11568, div: 4612, unk: 80 |
| education | 0 | 1 | FALSE | 8 | uni: 12168, hig: 9515, bas: 6045, pro: 5243 |
| default | 0 | 1 | FALSE | 3 | no: 32588, unk: 8597, yes: 3 |
| housing | 0 | 1 | FALSE | 3 | yes: 21576, no: 18622, unk: 990 |
| loan | 0 | 1 | FALSE | 3 | no: 33950, yes: 6248, unk: 990 |
| contact | 0 | 1 | FALSE | 2 | cel: 26144, tel: 15044 |
| month | 0 | 1 | FALSE | 10 | may: 13769, jul: 7174, aug: 6178, jun: 5318 |
| day_of_week | 0 | 1 | FALSE | 5 | thu: 8623, mon: 8514, wed: 8134, tue: 8090 |
| poutcome | 0 | 1 | FALSE | 3 | non: 35563, fai: 4252, suc: 1373 |
| y | 0 | 1 | FALSE | 2 | no: 36548, yes: 4640 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 40.02 | 10.42 | 17.00 | 32.00 | 38.00 | 47.00 | 98.00 | ▅▇▃▁▁ |
| duration | 0 | 1 | 258.29 | 259.28 | 0.00 | 102.00 | 180.00 | 319.00 | 4918.00 | ▇▁▁▁▁ |
| campaign | 0 | 1 | 2.57 | 2.77 | 1.00 | 1.00 | 2.00 | 3.00 | 56.00 | ▇▁▁▁▁ |
| pdays | 0 | 1 | 962.48 | 186.91 | 0.00 | 999.00 | 999.00 | 999.00 | 999.00 | ▁▁▁▁▇ |
| previous | 0 | 1 | 0.17 | 0.49 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 | ▇▁▁▁▁ |
| emp.var.rate | 0 | 1 | 0.08 | 1.57 | -3.40 | -1.80 | 1.10 | 1.40 | 1.40 | ▁▃▁▁▇ |
| cons.price.idx | 0 | 1 | 93.58 | 0.58 | 92.20 | 93.08 | 93.75 | 93.99 | 94.77 | ▁▆▃▇▂ |
| cons.conf.idx | 0 | 1 | -40.50 | 4.63 | -50.80 | -42.70 | -41.80 | -36.40 | -26.90 | ▅▇▁▇▁ |
| euribor3m | 0 | 1 | 3.62 | 1.73 | 0.63 | 1.34 | 4.86 | 4.96 | 5.04 | ▅▁▁▁▇ |
| nr.employed | 0 | 1 | 5167.04 | 72.25 | 4963.60 | 5099.10 | 5191.00 | 5228.10 | 5228.10 | ▁▁▃▁▇ |
EDA: The dataset does not contain any missing values in the traditional sense (no NAs). However, certain categorical fields such as default, education, and job contain the label “unknown”, which essentially reflects missing or undisclosed information. These records should not be immediately discarded, especially since some models, like Naive Bayes, can handle such categories reasonably well.
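As a quick check, the “unknown” labels can be tallied per factor column; a small sketch using the tidyverse functions already loaded above:
# Count "unknown" labels in each factor column
bank_data %>%
  summarise(across(where(is.factor), ~ sum(.x == "unknown"))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_unknown") %>%
  filter(n_unknown > 0) %>%
  arrange(desc(n_unknown))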
Exploring the target variable y, which indicates whether a client subscribed to a term deposit, reveals a significant class imbalance. Only about 11% of clients subscribed, while 89% did not. This imbalance will be an important consideration when selecting models and evaluation metrics.
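The imbalance can be quantified directly from the target counts:
# Proportion of clients who subscribed vs. did not
bank_data %>%
  count(y) %>%
  mutate(prop = round(n / sum(n), 3))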
Most numeric variables, including campaign, duration, pdays, and previous, show strongly right-skewed distributions, with a notable presence of outliers. For instance, duration ranges from 0 to 4,918 seconds while its median is just 180, and campaign reaches a maximum of 56 contacts against a median of 2. Outlier counts using the 1.5×IQR rule further confirm extreme values in multiple variables, especially duration and previous. Note that pdays is dominated by the sentinel value 999, which encodes clients who were never previously contacted, rather than by genuine outliers.
# Explore class imbalance
ggplot(bank_data, aes(x = y)) +
geom_bar(fill = "steelblue") +
theme_minimal() +
labs(title = "Target Variable Distribution", x = "Subscribed to Term Deposit", y = "Count")
# Numeric summary and histograms
num_data <- bank_data %>% select(where(is.numeric))
# Pivot for histogram plotting
num_long <- pivot_longer(num_data, cols = everything(), names_to = "variable", values_to = "value")
# Histograms
ggplot(num_long, aes(x = value)) +
geom_histogram(bins = 30, fill = "darkorange", color = "black") +
facet_wrap(~variable, scales = "free") +
theme_minimal() +
labs(title = "Distribution of Numeric Variables")
# Boxplots
ggplot(num_long, aes(x = variable, y = value)) +
geom_boxplot(fill = "skyblue") +
theme_minimal() +
labs(title = "Boxplots of Numeric Variables") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Scatterplots for numeric variables to visually inspect outliers
bank_data_scatter <- bank_data %>%
mutate(row_id = row_number()) %>%
select(row_id, where(is.numeric))
bank_data_scatter_long <- pivot_longer(
bank_data_scatter,
cols = -row_id,
names_to = "variable",
values_to = "value"
)
ggplot(bank_data_scatter_long, aes(x = row_id, y = value)) +
geom_point(alpha = 0.4, color = "darkblue") +
facet_wrap(~ variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(
title = "Scatterplots of Numeric Variables",
x = "Observation Index",
y = "Value"
) +
theme(
strip.text = element_text(size = 10),
axis.text.x = element_text(angle = 45, hjust = 1)
)
# Outlier counts using 1.5*IQR rule
get_outlier_count <- function(x) {
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1
sum(x < (q1 - 1.5 * iqr) | x > (q3 + 1.5 * iqr))
}
sapply(num_data, get_outlier_count)
## age duration campaign pdays previous
## 469 2963 2406 1515 5625
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 447 0 0
Correlation analysis on numeric features reveals multicollinearity between emp.var.rate, euribor3m, and nr.employed, with coefficients as high as 0.95. This may inflate variance in models like logistic regression if not addressed. Categorical variables such as job, education, and marital show uneven distributions, but these likely reflect the target audience of the bank’s marketing efforts.
# Correlation matrix
cor_matrix <- cor(num_data)
corrplot(cor_matrix, method = "color", type = "upper", tl.cex = 0.7, tl.col = "black")
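To make the multicollinearity explicit, the strongly correlated pairs can be extracted from the matrix; a small sketch (the 0.9 cutoff is an arbitrary choice, not a standard threshold):
# List variable pairs with |r| > 0.9
high_cor <- which(abs(cor_matrix) > 0.9 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[high_cor[, 1]],
           var2 = colnames(cor_matrix)[high_cor[, 2]],
           r = round(cor_matrix[high_cor], 2))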
# Explore categorical variables
cat_vars <- bank_data %>% select(where(is.factor), -y)
for (col in colnames(cat_vars)) {
print(ggplot(bank_data, aes(x = .data[[col]])) +
geom_bar(fill = "darkgreen") +
theme_minimal() +
labs(title = paste("Distribution of", col), x = col, y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)))
}
Algorithm Selection
Now that you have completed the EDA, which algorithms would suit the business purpose of the dataset? Answer questions such as:
Select two or more machine learning algorithms presented so far that could be used to train a model (no need to train models - I am only looking for your recommendations).
What are the pros and cons of each algorithm you selected?
Which algorithm would you recommend, and why?
Are there labels in your data? Did that impact your choice of algorithm?
How does your choice of algorithm relate to the dataset?
Would your choice of algorithm change if there were fewer than 1,000 data records, and why?
Given the structure of the data and the business goal, I recommend using Random Forest and Naive Bayes as primary modeling techniques.
Random Forest is well suited to this problem due to its ability to handle both categorical and numeric inputs, its resistance to overfitting, and its robustness to outliers. Importantly, it also performs reasonably well with imbalanced classes (especially with class weighting) and is capable of capturing non-linear interactions among variables. Since the dataset is relatively large (41k+ observations), Random Forest is a scalable choice that also provides feature importance scores, which aids decision transparency.
Naive Bayes, particularly a variant suited to categorical inputs, is also a strong candidate because it performs surprisingly well in high-dimensional categorical data settings. Since the dataset includes many categorical variables, and many of the numeric ones can be discretized (for example age, campaign, and previous), Naive Bayes can leverage that structure efficiently. Additionally, it is fast to train and relatively insensitive to outliers.
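Although the assignment asks only for recommendations, a minimal sketch of how each model could be fit in R follows (assumptions: the randomForest and e1071 packages are installed, and duration is excluded to avoid the target leakage discussed in the preprocessing plan below):
# Sketch only - illustrative, not a tuned pipeline
library(randomForest)
library(e1071)
set.seed(42)
# Random Forest: handles factor inputs natively; importance = TRUE enables varImpPlot()
rf_fit <- randomForest(y ~ . - duration, data = as.data.frame(bank_data),
                       ntree = 200, importance = TRUE)
# Naive Bayes: assumes conditional independence of features given the class
nb_fit <- naiveBayes(y ~ . - duration, data = bank_data)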
Pre-processing
Now that you have done an EDA and selected an algorithm, what pre-processing (if any) would you require for:
Data Cleaning - improve data quality, address missing data, etc.
Dimensionality Reduction - remove correlated/redundant data that will slow down training
Feature Engineering - use of business knowledge to create new features
Sampling Data - using sampling to resize datasets
Data Transformation - regularization, normalization, handling categorical variables
Imbalanced Data - reducing the imbalance between classes
Deliverable
Preprocessing Plan (a code sketch implementing these steps follows the list)
Remove duration: This variable reflects call length and directly leaks target information. It would not be known before making a prediction and should be excluded from modeling.
Feature Engineering: pdays and previous can be combined into a binary feature that flags whether a client was contacted before. This will reduce redundancy and make the model more generalizable.
Discretization: Highly skewed numeric features like age, campaign, and previous should be binned for Naive Bayes, while left as-is or scaled for Random Forest.
Encode categorical variables: For Random Forest, factors can be kept as-is (since many implementations in R handle them internally). For Naive Bayes, one-hot or frequency encoding may be applied.
Handle class imbalance: Use oversampling (e.g., SMOTE) or class weighting. Evaluation metrics like ROC AUC, F1-score, and precision/recall should be emphasized over accuracy.
Address multicollinearity: I will consider dropping one of the correlated features; for example, drop emp.var.rate if euribor3m and nr.employed are retained.
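A minimal sketch of the plan above using functions already loaded (the age cut points, the seed, and the simple undersampling are illustrative assumptions; SMOTE itself would require an additional package such as themis or smotefamily):
# Sketch of the preprocessing plan (illustrative choices, not a final pipeline)
set.seed(42)
bank_prep <- bank_data %>%
  select(-duration) %>%                # remove the leakage variable
  mutate(previously_contacted = factor(if_else(pdays == 999, "no", "yes")),  # 999 = never contacted
         age_band = cut(age, breaks = c(16, 32, 38, 47, 98),                 # quartile-based bins from summary()
                        labels = c("17-32", "33-38", "39-47", "48-98"))) %>%
  select(-pdays, -previous, -emp.var.rate)   # drop redundant / collinear columns
# Simple random undersampling of the majority class to balance y
bank_balanced <- bank_prep %>%
  group_by(y) %>%
  slice_sample(n = min(table(bank_prep$y))) %>%
  ungroup()
table(bank_balanced$y)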
Essay:
This project focuses on analyzing a Portuguese bank’s marketing dataset to better understand what influences a client’s decision to subscribe to a term deposit. The dataset includes customer demographics, financial information, and details from past marketing efforts. During the exploratory analysis, it became clear that most customers had not been contacted before and that certain variables, like call duration, had a strong relationship with the outcome. However, since duration is not known until after a call, it should not be used in model training. The analysis also revealed outliers in features like campaign frequency and previous contacts, and some categorical fields contained “unknown” values that may need to be cleaned or grouped.
For the modeling part, I have chosen Random Forest and Naive Bayes. Random Forest is a flexible model that works well with both numerical and categorical data, and it is good at handling non-linear relationships. Naive Bayes, while simpler, tends to perform surprisingly well on text-like or categorical data and is very efficient. Using both models allows for a balance: Random Forest offers more predictive power, while Naive Bayes is faster and easier to interpret. Running both will also provide a useful comparison when evaluating performance.
Before training the models, some preparation is needed. Categorical variables will need to be encoded, and numerical features might be standardized for consistency, especially for Naive Bayes. Since the dataset is imbalanced, with far more “no” responses than “yes”, resampling techniques or class weighting may be necessary. Overall, the goal is to build a model that not only performs well but can also offer practical insight for improving future marketing strategies.