Exploratory analysis and essay
Introduction
This assignment focuses on one of the most important aspects of data science: Exploratory Data Analysis (EDA). Many surveys show that data scientists spend 60-80% of their time on data preparation. EDA allows you to identify data gaps and data imbalances, improve data quality, create better features, and gain a deep understanding of your data before model training, which ultimately helps train better models. In machine learning there is a saying, “better data beats better algorithms”, meaning that it is usually more productive to spend time improving data quality than improving the code used to train the model.
This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis. Test the code with both of the selected datasets and compare the results.
Dataset
A Portuguese bank conducted a marketing campaign (phone calls) to predict whether a client would subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in its next campaign. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
Assignment
Exploratory Data Analysis
Review the structure and content of the data and answer questions such as:
Are the features (columns) of your data correlated?
What is the overall distribution of each variable?
Are there any outliers present?
What are the relationships between different variables?
How are categorical variables distributed?
Do any patterns or trends emerge in the data?
What is the central tendency and spread of each variable?
Are there any missing values and how significant are they?
This exploratory analysis is based on a dataset from a Portuguese bank that launched a telemarketing campaign to offer term deposit products. The dataset contains 41,188 records and 21 features, including a mix of numeric and categorical variables. The goal is to identify patterns in client data and recommend machine learning models that can help predict whether a client will subscribe to a term deposit in future campaigns.
# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.95 loaded
library(knitr)
library(ggplot2)
library(ggcorrplot)
library(psych)
## Warning: package 'psych' was built under R version 4.3.3
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# Load the data
bank_data <- read_delim("/Users/zigcah/Downloads/bank+marketing/bank-additional/bank-additional-full.csv", delim = ";")
## Rows: 41188 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (11): job, marital, education, default, housing, loan, contact, month, d...
## dbl (10): age, duration, campaign, pdays, previous, emp.var.rate, cons.price...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Convert character columns to factors
bank_data <- bank_data %>% mutate(across(where(is.character), as.factor))
# Structure of the dataset
str(bank_data)
## tibble [41,188 × 21] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:41188] 56 57 37 40 56 45 59 41 24 25 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : num [1:41188] 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : num [1:41188] 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : num [1:41188] 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : num [1:41188] 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ emp.var.rate : num [1:41188] 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num [1:41188] 94 94 94 94 94 ...
## $ cons.conf.idx : num [1:41188] -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num [1:41188] 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num [1:41188] 5191 5191 5191 5191 5191 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# Check for duplicate rows
sum(duplicated(bank_data))
## [1] 12
# Check for missing values
colSums(is.na(bank_data))
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y
## 0
# Summary of the dataset
summary(bank_data)
## age job marital
## Min. :17.00 admin. :10422 divorced: 4612
## 1st Qu.:32.00 blue-collar: 9254 married :24928
## Median :38.00 technician : 6743 single :11568
## Mean :40.02 services : 3969 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1720
## (Other) : 6156
## education default housing loan
## university.degree :12168 no :32588 no :18622 no :33950
## high.school : 9515 unknown: 8597 unknown: 990 unknown: 990
## basic.9y : 6045 yes : 3 yes :21576 yes : 6248
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## contact month day_of_week duration
## cellular :26144 may :13769 fri:7827 Min. : 0.0
## telephone:15044 jul : 7174 mon:8514 1st Qu.: 102.0
## aug : 6178 thu:8623 Median : 180.0
## jun : 5318 tue:8090 Mean : 258.3
## nov : 4101 wed:8134 3rd Qu.: 319.0
## apr : 2632 Max. :4918.0
## (Other): 2016
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 0.0 Min. :0.000 failure : 4252
## 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 nonexistent:35563
## Median : 2.000 Median :999.0 Median :0.000 success : 1373
## Mean : 2.568 Mean :962.5 Mean :0.173
## 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :56.000 Max. :999.0 Max. :7.000
##
## emp.var.rate cons.price.idx cons.conf.idx euribor3m
## Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.634
## 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344
## Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
## Mean : 0.08189 Mean :93.58 Mean :-40.5 Mean :3.621
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961
## Max. : 1.40000 Max. :94.77 Max. :-26.9 Max. :5.045
##
## nr.employed y
## Min. :4964 no :36548
## 1st Qu.:5099 yes: 4640
## Median :5191
## Mean :5167
## 3rd Qu.:5228
## Max. :5228
##
# Skim summary
skim(bank_data)
| Name | bank_data |
| Number of rows | 41188 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| factor | 11 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| job | 0 | 1 | FALSE | 12 | adm: 10422, blu: 9254, tec: 6743, ser: 3969 |
| marital | 0 | 1 | FALSE | 4 | mar: 24928, sin: 11568, div: 4612, unk: 80 |
| education | 0 | 1 | FALSE | 8 | uni: 12168, hig: 9515, bas: 6045, pro: 5243 |
| default | 0 | 1 | FALSE | 3 | no: 32588, unk: 8597, yes: 3 |
| housing | 0 | 1 | FALSE | 3 | yes: 21576, no: 18622, unk: 990 |
| loan | 0 | 1 | FALSE | 3 | no: 33950, yes: 6248, unk: 990 |
| contact | 0 | 1 | FALSE | 2 | cel: 26144, tel: 15044 |
| month | 0 | 1 | FALSE | 10 | may: 13769, jul: 7174, aug: 6178, jun: 5318 |
| day_of_week | 0 | 1 | FALSE | 5 | thu: 8623, mon: 8514, wed: 8134, tue: 8090 |
| poutcome | 0 | 1 | FALSE | 3 | non: 35563, fai: 4252, suc: 1373 |
| y | 0 | 1 | FALSE | 2 | no: 36548, yes: 4640 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 40.02 | 10.42 | 17.00 | 32.00 | 38.00 | 47.00 | 98.00 | ▅▇▃▁▁ |
| duration | 0 | 1 | 258.29 | 259.28 | 0.00 | 102.00 | 180.00 | 319.00 | 4918.00 | ▇▁▁▁▁ |
| campaign | 0 | 1 | 2.57 | 2.77 | 1.00 | 1.00 | 2.00 | 3.00 | 56.00 | ▇▁▁▁▁ |
| pdays | 0 | 1 | 962.48 | 186.91 | 0.00 | 999.00 | 999.00 | 999.00 | 999.00 | ▁▁▁▁▇ |
| previous | 0 | 1 | 0.17 | 0.49 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 | ▇▁▁▁▁ |
| emp.var.rate | 0 | 1 | 0.08 | 1.57 | -3.40 | -1.80 | 1.10 | 1.40 | 1.40 | ▁▃▁▁▇ |
| cons.price.idx | 0 | 1 | 93.58 | 0.58 | 92.20 | 93.08 | 93.75 | 93.99 | 94.77 | ▁▆▃▇▂ |
| cons.conf.idx | 0 | 1 | -40.50 | 4.63 | -50.80 | -42.70 | -41.80 | -36.40 | -26.90 | ▅▇▁▇▁ |
| euribor3m | 0 | 1 | 3.62 | 1.73 | 0.63 | 1.34 | 4.86 | 4.96 | 5.04 | ▅▁▁▁▇ |
| nr.employed | 0 | 1 | 5167.04 | 72.25 | 4963.60 | 5099.10 | 5191.00 | 5228.10 | 5228.10 | ▁▁▃▁▇ |
EDA: The dataset does not contain any missing values in the traditional sense (no NAs). However, certain categorical fields such as default, education, and job contain the label “unknown”, which essentially reflects missing or undisclosed information. These records should not be immediately discarded, especially since some models, like Naive Bayes, can handle such categories reasonably well.
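As a quick check, the “unknown” labels can be tallied per factor column; a small sketch using the tidyverse functions already loaded above:
# Count "unknown" labels in each factor column
bank_data %>%
  summarise(across(where(is.factor), ~ sum(.x == "unknown"))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_unknown") %>%
  filter(n_unknown > 0) %>%
  arrange(desc(n_unknown))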
Exploring the target variable y, which indicates whether a client subscribed to a term deposit, reveals a significant class imbalance. Only about 11% of clients subscribed, while 89% did not. This imbalance will be an important consideration when selecting models and evaluation metrics.
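The imbalance can be quantified directly from the target counts:
# Proportion of clients who subscribed vs. did not
bank_data %>%
  count(y) %>%
  mutate(prop = round(n / sum(n), 3))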
Most numeric variables, including campaign, duration, pdays, and previous, show strongly right-skewed distributions, with a notable presence of outliers. For instance, duration ranges from 0 to 4,918 seconds while its median is just 180, and campaign reaches a maximum of 56 contacts against a median of 2. Outlier counts using the 1.5×IQR rule further confirm extreme values in multiple variables, especially duration and previous. Note that pdays is dominated by the sentinel value 999, which encodes clients who were never previously contacted, rather than by genuine outliers.
# Explore class imbalance
ggplot(bank_data, aes(x = y)) +
geom_bar(fill = "steelblue") +
theme_minimal() +
labs(title = "Target Variable Distribution", x = "Subscribed to Term Deposit", y = "Count")
# Numeric summary and histograms
num_data <- bank_data %>% select(where(is.numeric))
# Pivot for histogram plotting
num_long <- pivot_longer(num_data, cols = everything(), names_to = "variable", values_to = "value")
# Histograms
ggplot(num_long, aes(x = value)) +
geom_histogram(bins = 30, fill = "darkorange", color = "black") +
facet_wrap(~variable, scales = "free") +
theme_minimal() +
labs(title = "Distribution of Numeric Variables")
# Boxplots
ggplot(num_long, aes(x = variable, y = value)) +
geom_boxplot(fill = "skyblue") +
theme_minimal() +
labs(title = "Boxplots of Numeric Variables") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Scatterplots for numeric variables to visually inspect outliers
bank_data_scatter <- bank_data %>%
mutate(row_id = row_number()) %>%
select(row_id, where(is.numeric))
bank_data_scatter_long <- pivot_longer(
bank_data_scatter,
cols = -row_id,
names_to = "variable",
values_to = "value"
)
ggplot(bank_data_scatter_long, aes(x = row_id, y = value)) +
geom_point(alpha = 0.4, color = "darkblue") +
facet_wrap(~ variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(
title = "Scatterplots of Numeric Variables",
x = "Observation Index",
y = "Value"
) +
theme(
strip.text = element_text(size = 10),
axis.text.x = element_text(angle = 45, hjust = 1)
)
# Outlier counts using 1.5*IQR rule
get_outlier_count <- function(x) {
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1
sum(x < (q1 - 1.5 * iqr) | x > (q3 + 1.5 * iqr))
}
sapply(num_data, get_outlier_count)
## age duration campaign pdays previous
## 469 2963 2406 1515 5625
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 447 0 0
Correlation analysis on numeric features reveals multicollinearity between emp.var.rate, euribor3m, and nr.employed, with coefficients as high as 0.95. This may inflate variance in models like logistic regression if not addressed. Categorical variables such as job, education, and marital show uneven distributions, but these likely reflect the target audience of the bank’s marketing efforts.
# Correlation matrix
cor_matrix <- cor(num_data)
corrplot(cor_matrix, method = "color", type = "upper", tl.cex = 0.7, tl.col = "black")
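To make the multicollinearity explicit, the strongly correlated pairs can be extracted from the matrix; a small sketch (the 0.9 cutoff is an arbitrary choice, not a standard threshold):
# List variable pairs with |r| > 0.9
high_cor <- which(abs(cor_matrix) > 0.9 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[high_cor[, 1]],
           var2 = colnames(cor_matrix)[high_cor[, 2]],
           r = round(cor_matrix[high_cor], 2))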
# Explore categorical variables
cat_vars <- bank_data %>% select(where(is.factor), -y)
for (col in colnames(cat_vars)) {
print(ggplot(bank_data, aes(x = .data[[col]])) +
geom_bar(fill = "darkgreen") +
theme_minimal() +
labs(title = paste("Distribution of", col), x = col, y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)))
}
Algorithm Selection
Now that you have completed the EDA, which algorithms would suit the business purpose of the dataset? Answer questions such as:
Select two or more machine learning algorithms presented so far that could be used to train a model (no need to train models - I am only looking for your recommendations).
What are the pros and cons of each algorithm you selected?
Which algorithm would you recommend, and why?
Are there labels in your data? Did that impact your choice of algorithm?
How does your choice of algorithm relate to the dataset?
Would your choice of algorithm change if there were fewer than 1,000 data records, and why?
Given the structure of the data and the business goal, I recommend using Random Forest and Naive Bayes as primary modeling techniques.
Random Forest is well suited to this problem due to its ability to handle both categorical and numeric inputs, its resistance to overfitting, and its robustness to outliers. Importantly, it also performs reasonably well with imbalanced classes (especially with class weighting) and is capable of capturing non-linear interactions among variables. Since the dataset is relatively large (41k+ observations), Random Forest is a scalable choice that also provides feature importance scores, which aids decision transparency.
Naive Bayes, particularly a variant suited to categorical inputs, is also a strong candidate because it performs surprisingly well in high-dimensional categorical data settings. Since the dataset includes many categorical variables, and many of the numeric ones can be discretized (for example age, campaign, and previous), Naive Bayes can leverage that structure efficiently. Additionally, it is fast to train and relatively insensitive to outliers.
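Although the assignment asks only for recommendations, a minimal sketch of how each model could be fit in R follows (assumptions: the randomForest and e1071 packages are installed, and duration is excluded to avoid the target leakage discussed in the preprocessing plan below):
# Sketch only - illustrative, not a tuned pipeline
library(randomForest)
library(e1071)
set.seed(42)
# Random Forest: handles factor inputs natively; importance = TRUE enables varImpPlot()
rf_fit <- randomForest(y ~ . - duration, data = as.data.frame(bank_data),
                       ntree = 200, importance = TRUE)
# Naive Bayes: assumes conditional independence of features given the class
nb_fit <- naiveBayes(y ~ . - duration, data = bank_data)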
Pre-processing
Now that you have done an EDA and selected an algorithm, what pre-processing (if any) would you require for:
Data Cleaning - improve data quality, address missing data, etc.
Dimensionality Reduction - remove correlated/redundant data that will slow down training
Feature Engineering - use of business knowledge to create new features
Sampling Data - using sampling to resize datasets
Data Transformation - regularization, normalization, handling categorical variables
Imbalanced Data - reducing the imbalance between classes
Deliverable
Preprocessing Plan (a code sketch implementing these steps follows the list)
Remove duration: This variable reflects call length and directly leaks target information. It would not be known before making a prediction and should be excluded from modeling.
Feature Engineering: pdays and previous can be combined into a binary feature that flags whether a client was contacted before. This will reduce redundancy and make the model more generalizable.
Discretization: Highly skewed numeric features like age, campaign, and previous should be binned for Naive Bayes, while left as-is or scaled for Random Forest.
Encode categorical variables: For Random Forest, factors can be kept as-is (since many implementations in R handle them internally). For Naive Bayes, one-hot or frequency encoding may be applied.
Handle class imbalance: Use oversampling (e.g., SMOTE) or class weighting. Evaluation metrics like ROC AUC, F1-score, and precision/recall should be emphasized over accuracy.
Address multicollinearity: I will consider dropping one of the correlated features; for example, drop emp.var.rate if euribor3m and nr.employed are retained.
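A minimal sketch of the plan above using functions already loaded (the age cut points, the seed, and the simple undersampling are illustrative assumptions; SMOTE itself would require an additional package such as themis or smotefamily):
# Sketch of the preprocessing plan (illustrative choices, not a final pipeline)
set.seed(42)
bank_prep <- bank_data %>%
  select(-duration) %>%                # remove the leakage variable
  mutate(previously_contacted = factor(if_else(pdays == 999, "no", "yes")),  # 999 = never contacted
         age_band = cut(age, breaks = c(16, 32, 38, 47, 98),                 # quartile-based bins from summary()
                        labels = c("17-32", "33-38", "39-47", "48-98"))) %>%
  select(-pdays, -previous, -emp.var.rate)   # drop redundant / collinear columns
# Simple random undersampling of the majority class to balance y
bank_balanced <- bank_prep %>%
  group_by(y) %>%
  slice_sample(n = min(table(bank_prep$y))) %>%
  ungroup()
table(bank_balanced$y)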
Essay:
This project focuses on analyzing a Portuguese bank’s marketing dataset to better understand what influences a client’s decision to subscribe to a term deposit. The dataset includes customer demographics, financial information, and details from past marketing efforts. During the exploratory analysis, it became clear that most customers had not been contacted before and that certain variables, like call duration, had a strong relationship with the outcome. However, since duration is not known until after a call, it should not be used in model training. The analysis also revealed outliers in features like campaign frequency and previous contacts, and some categorical fields contained “unknown” values that may need to be cleaned or grouped.
For the modeling part, I have chosen Random Forest and Naive Bayes. Random Forest is a flexible model that works well with both numerical and categorical data, and it is good at handling non-linear relationships. Naive Bayes, while simpler, tends to perform surprisingly well on text-like or categorical data and is very efficient. Using both models allows for a balance: Random Forest offers more predictive power, while Naive Bayes is faster and easier to interpret. Running both will also provide a useful comparison when evaluating performance.
Before training the models, some preparation is needed. Categorical variables will need to be encoded, and numerical features might be standardized for consistency, especially for Naive Bayes. Since the dataset is imbalanced, with far more “no” responses than “yes”, resampling techniques or class weighting may be necessary. Overall, the goal is to build a model that not only performs well but can also offer practical insight for improving future marketing strategies.