Introduction

This assignment focuses on one of the most important aspects of data science: Exploratory Data Analysis (EDA). Many surveys show that data scientists spend 60-80% of their time on data preparation. EDA allows you to identify data gaps and class imbalances, improve data quality, create better features, and gain a deep understanding of your data before model training - all of which ultimately helps train better models. In machine learning there is a saying, “better data beats better algorithms,” meaning that it is usually more productive to spend time improving data quality than tuning the model-training code.

Libraries required for this project

library(tidyverse)
library(openintro)
library(infer)
library(dplyr)
library(knitr)
library(corrplot)
library(ggcorrplot)
library(ggthemes)

Dataset

A Portuguese bank conducted a marketing campaign (phone calls) to predict whether a client will subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in its next campaign. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing

# import data into R from my GitHub account and check column names
bank_data <- read.csv("https://raw.githubusercontent.com/vitugo23/DATA622/refs/heads/main/bank-full.csv", sep = ";")

# Preview the first few rows of the dataset
kable(head(bank_data, 10), caption = "Bank Dataset")
Bank Dataset

| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|----:|-----|---------|-----------|---------|--------:|---------|------|---------|----:|-------|---------:|---------:|------:|---------:|----------|---|
| 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |
| 35 | management | married | tertiary | no | 231 | yes | no | unknown | 5 | may | 139 | 1 | -1 | 0 | unknown | no |
| 28 | management | single | tertiary | no | 447 | yes | yes | unknown | 5 | may | 217 | 1 | -1 | 0 | unknown | no |
| 42 | entrepreneur | divorced | tertiary | yes | 2 | yes | no | unknown | 5 | may | 380 | 1 | -1 | 0 | unknown | no |
| 58 | retired | married | primary | no | 121 | yes | no | unknown | 5 | may | 50 | 1 | -1 | 0 | unknown | no |
| 43 | technician | single | secondary | no | 593 | yes | no | unknown | 5 | may | 55 | 1 | -1 | 0 | unknown | no |
# review column names  
colnames(bank_data)
##  [1] "age"       "job"       "marital"   "education" "default"   "balance"  
##  [7] "housing"   "loan"      "contact"   "day"       "month"     "duration" 
## [13] "campaign"  "pdays"     "previous"  "poutcome"  "y"

The “Bank Dataset” contains rich data about the bank’s marketing campaigns and its interactions with customers. The most important columns are:

- age: the customer’s age.
- job: the customer’s job category.
- marital: the customer’s marital status.
- education: the customer’s education level.
- default: whether the customer has credit in default.
- balance: the customer’s average yearly account balance.
- housing: whether the customer has a housing loan.
- loan: whether the customer has a personal loan.
- contact: how the customer was contacted (cellular, telephone, or unknown).
- campaign: the number of contacts made during this campaign.
- y: whether the customer subscribed to a term deposit (the target variable).

# rename column name Y to term for better understanding.
bank_data <- bank_data %>% rename(term = y)
# check dataset columns summary statistics
summary(bank_data)
##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##      term          
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Every column has 45,211 observations (the “Length” entries above); the mean, median, and quartiles vary widely across the numeric columns. Note that pdays uses -1 as a sentinel for clients who were never contacted in a previous campaign, which is why its median is -1.

# review data structure on dataset
str(bank_data)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ term     : chr  "no" "no" "no" "no" ...

The dataset has a mix of integer and character data types, and so far everything looks good to start the EDA.

Sections

1. Exploratory Data Analysis

Review the structure and content of the data and answer questions such as:

Questions

Are the features (columns) of your data correlated?

# select the numeric columns and compute the correlation matrix
numeric_vars<- cor(bank_data %>% select(where(is.numeric)))
numeric_vars
##                   age      balance          day     duration     campaign
## age       1.000000000  0.097782739 -0.009120046 -0.004648428  0.004760312
## balance   0.097782739  1.000000000  0.004502585  0.021560380 -0.014578279
## day      -0.009120046  0.004502585  1.000000000 -0.030206341  0.162490216
## duration -0.004648428  0.021560380 -0.030206341  1.000000000 -0.084569503
## campaign  0.004760312 -0.014578279  0.162490216 -0.084569503  1.000000000
## pdays    -0.023758014  0.003435322 -0.093044074 -0.001564770 -0.088627668
## previous  0.001288319  0.016673637 -0.051710497  0.001203057 -0.032855290
##                 pdays     previous
## age      -0.023758014  0.001288319
## balance   0.003435322  0.016673637
## day      -0.093044074 -0.051710497
## duration -0.001564770  0.001203057
## campaign -0.088627668 -0.032855290
## pdays     1.000000000  0.454819635
## previous  0.454819635  1.000000000
# plot correlation between columns using corrplot
corrplot(numeric_vars
         , method = 'color' 
         , order = 'hclust'
         , addCoef.col = 'black'
         , number.cex = .6 
         )

The correlation plot shows weak correlations among the numeric variables, with most values close to zero. The strongest positive correlation is between “pdays” and “previous” (0.45), which is expected: both describe prior-campaign contact, and both use a “never contacted” default (-1 and 0, respectively), so the figure is partly an artifact of that coding. “day” and “campaign” show a modest positive correlation (0.16), suggesting that the day of the month may influence how many contacts a campaign makes. All other correlations are very weak, indicating minimal linear relationships among the remaining variables.
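One caveat worth checking: because pdays encodes “never contacted” as -1, the 0.45 figure mixes two different populations. A minimal sketch, assuming we recode the -1 sentinel to NA to look only at previously contacted clients:

# sketch: recode the -1 sentinel in pdays to NA and recompute the
# correlation among previously contacted clients only
bank_prev <- bank_data %>% mutate(pdays = na_if(pdays, -1L))
cor(bank_prev$pdays, bank_prev$previous, use = "complete.obs")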

What is the overall distribution of each variable?

# Transform numeric variables from the dataset
numeric_data <- bank_data %>% select(where(is.numeric))

# Change data from wide to long format for plotting
numeric_data_long <- numeric_data %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Plot histograms for numeric variables
ggplot(numeric_data_long, aes(x = Value)) +
  geom_histogram(fill = "blue", color = "black", bins = 30) +
  facet_wrap(~ Variable, scales = "free") +
  theme_minimal() +
  ggtitle("Distribution of Numeric Variables") +
  labs(x = "Value", y = "Count")

Most of the numeric variables are right-skewed rather than symmetrically distributed. For example, “age” is right-skewed, indicating the client base skews younger, with fewer older clients; “balance,” “duration,” “campaign,” and “previous” all have long right tails driven by a small number of extreme values.
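For heavily skewed variables, a log scale makes the shape easier to inspect. A minimal sketch for duration (log1p handles its zeros):

# sketch: inspect a right-skewed variable on a log scale
# (log1p handles the zeros in duration)
ggplot(bank_data, aes(x = log1p(duration))) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  theme_minimal() +
  labs(title = "Call Duration on a log1p Scale",
       x = "log1p(duration in seconds)", y = "Count")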

Are there any outliers present?

# Select numerical variables
numeric_data <- bank_data %>%
  select(where(is.numeric))

# Pivot the data to long format for plotting
numeric_data_long <- numeric_data %>%
  mutate(id = row_number()) %>%  
  pivot_longer(cols = -id, names_to = "variable", values_to = "value")

# Create scatterplots for each numeric variable
ggplot(numeric_data_long, aes(x = id, y = value)) +
  geom_point(alpha = 0.6, color = "darkblue") +  
  facet_wrap(~variable, scales = "free", ncol = 5) + 
  theme_minimal() +
  theme(
    strip.text.x = element_text(size = 10), 
    axis.text.x = element_text(angle = 45, hjust = 1)  
  ) +
  labs(
    x = "ID (Row Number)", 
    y = "Value", 
    title = "Scatterplots of Numerical Variables"
  )

Based on the scatterplots, every numeric feature appears to have outliers except ‘day’, which is bounded between 1 and 31 by construction.
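To quantify this impression, the 1.5 × IQR rule counts how many points would fall outside a boxplot’s whiskers for each variable. A minimal sketch:

# sketch: count points outside the 1.5 * IQR whiskers for each variable
numeric_data %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarize(
    lower = quantile(value, 0.25) - 1.5 * IQR(value),
    upper = quantile(value, 0.75) + 1.5 * IQR(value),
    n_outliers = sum(value < lower | value > upper)
  )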

What are the relationships between different variables?

# Correlation Matrix for Numeric Variables
numeric_data <- bank_data %>% select(where(is.numeric))

correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")

# corr matrix
ggcorrplot(correlation_matrix, 
           method = "circle", 
           type = "lower", 
           lab = TRUE, 
           title = "Correlation Matrix of Numeric Variables")

The correlation matrix above confirms that there are no strong linear relationships among the numeric variables: the strongest are pdays with previous (0.45) and day with campaign (0.16), and everything else is near zero. Note that this matrix covers only the numeric columns; measuring association between categorical variables (e.g., job and education) requires a different statistic, such as Cramér’s V.
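A minimal base-R sketch of Cramér’s V for a pair of categorical columns (cramers_v is my own helper, not a package function):

# sketch: Cramér's V for two categorical columns, derived from a
# chi-squared test of their contingency table
cramers_v <- function(x, y) {
  tbl <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tbl)$statistic)
  unname(sqrt(chi2 / (sum(tbl) * (min(dim(tbl)) - 1))))
}
cramers_v(bank_data$job, bank_data$education)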

How are categorical variables distributed?

# Select categorical variables
categorical_vars <- bank_data %>% select_if(is.character)

# Bar plots for categorical variables
categorical_vars %>%
  gather(key = "Variable", value = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_bar(fill = "blue", color = "black") +
  facet_wrap(~ Variable, scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Distribution of Categorical Variables")

The plots show that most categorical variables are dominated by a single level: cellular is by far the most common contact type, secondary is the most common education level, and the target variable term is heavily skewed toward “no”.
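A quick proportion table quantifies that dominance; for example, for contact:

# sketch: exact level proportions for one dominated variable
round(prop.table(table(bank_data$contact)), 3)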

What is the central tendency and spread of each variable?

# Central tendency and spread for numeric variables
numeric_data %>%
  gather(key = "Variable", value = "Value") %>%
  group_by(Variable) %>%
  summarize(
    Mean = mean(Value),
    Median = median(Value),
    SD = sd(Value),
    IQR = IQR(Value)
  ) %>%
  kable(caption = "Central Tendency and Spread of Numeric Variables")
Central Tendency and Spread of Numeric Variables

| Variable | Mean | Median | SD | IQR |
|----------|-----:|-------:|---:|----:|
| age | 40.9362102 | 39 | 10.618762 | 15 |
| balance | 1362.2720577 | 448 | 3044.765829 | 1356 |
| campaign | 2.7638407 | 2 | 3.098021 | 2 |
| day | 15.8064188 | 16 | 8.322476 | 13 |
| duration | 258.1630798 | 180 | 257.527812 | 216 |
| pdays | 40.1978280 | -1 | 100.128746 | 0 |
| previous | 0.5803234 | 0 | 2.303441 | 0 |

The table above shows that the customer base is generally middle-aged. Most clients have modest account balances, with a few high-balance exceptions producing a skewed distribution (mean 1362 vs. median 448). The marketing strategy involved minimal contact per customer (a median of 2 calls), with interactions spread fairly evenly through the month. Calls were generally brief (median 180 seconds), but some lasted much longer, suggesting varying engagement levels. Notably, the median pdays of -1 indicates that the majority of customers were contacted for the first time during this campaign, highlighting a strategy focused on reaching new prospects.

Are there any missing values and how significant are they?

missing_summary <- bank_data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
  arrange(desc(Missing_Count))  

print(missing_summary)
## # A tibble: 17 × 2
##    Variable  Missing_Count
##    <chr>             <int>
##  1 age                   0
##  2 job                   0
##  3 marital               0
##  4 education             0
##  5 default               0
##  6 balance               0
##  7 housing               0
##  8 loan                  0
##  9 contact               0
## 10 day                   0
## 11 month                 0
## 12 duration              0
## 13 campaign              0
## 14 pdays                 0
## 15 previous              0
## 16 poutcome              0
## 17 term                  0

The table above shows no NA values in the dataset. However, several categorical columns use the string “unknown” as a placeholder, which is addressed in the pre-processing section below.

2. Algorithm Selection

Sections

Select two or more machine learning algorithms presented so far that could be used to train a model (no need to train models - I am only looking for your recommendations).

Based on what I have learned so far in this program, and on the dataset’s mix of numeric and categorical features, I suggest using the Random Forest Classifier and K-Nearest Neighbors (KNN).

What are the pros and cons of each algorithm you selected?

Random Forest Classifier:
Pros: handles imbalanced data well via class weighting (see the sketch below); works with both categorical and numerical features; reduces overfitting by averaging many decision trees.
Cons: slower than simpler models such as logistic regression; harder to interpret than a single tree or a linear model; individual trees can become overly complex on high-dimensional data.

K-Nearest Neighbors (KNN):
Pros: captures complex relationships without assuming any distribution, since the algorithm is non-parametric; works well with mixed data types if the data is properly preprocessed (encoded and scaled).
Cons: computationally expensive on large datasets because of the distance calculations required at prediction time; performance depends heavily on the choice of k and on feature scaling.
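The assignment does not require training a model, but for illustration, here is a minimal sketch of the class weighting mentioned above, assuming the randomForest package is installed (the inverse-frequency weights and ntree value are my own choices, not part of the assignment):

# sketch only - not a deliverable; illustrates class weighting with
# the randomForest package. May take a while on all 45,211 rows.
library(randomForest)
bank_rf <- bank_data %>% mutate(term = factor(term))
wts <- as.vector(1 / table(bank_rf$term))   # inverse class frequencies
rf_fit <- randomForest(term ~ ., data = bank_rf,
                       classwt = wts, ntree = 200)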

Which algorithm would you recommend, and why?

I would recommend Random Forest for this project: the dataset is highly imbalanced, mixes numerical and categorical variables, is relatively large, and is not high-dimensional - all conditions that Random Forest handles well.

KNN is computationally expensive on datasets of this size (45,211 records) because every prediction requires distance calculations against the training set, so it is not recommended here.

Are there labels in your data? Did that impact your choice of algorithm?

Yes - the dataset has a labeled dependent variable (term, whether the client subscribed), so I strongly advise using supervised learning algorithms. Both of my recommendations are supervised classifiers.

How does your choice of algorithm relate to the dataset?

As mentioned above, Random Forest is the best fit for this project: the dataset is highly imbalanced and contains a mix of numerical and categorical variables, both of which the algorithm handles well.

Would your choice of algorithm change if there were fewer than 1,000 data records, and why?

If the dataset had fewer than 1,000 records, I would use KNN instead: Random Forest can overfit when trained on small datasets, while KNN’s main weakness - prediction-time cost on large data - disappears, and it works well on small, mixed datasets once the features are encoded and scaled.

3. Pre-processing

Now you have done an EDA and selected an Algorithm, what pre-processing (if any) would you require for:

Sections

Data Cleaning - improve data quality, address missing data, etc.

The dataset does not contain any NA values, so there is no missing data to impute. However, several categorical columns use the string “unknown” as a placeholder, which I can count with a simple loop:

for (col in colnames(bank_data)) {
  if (is.factor(bank_data[[col]]) || is.character(bank_data[[col]])) {
    unknown_count <- sum(bank_data[[col]] == "unknown", na.rm = TRUE)
    if (unknown_count > 0) {
      cat(sprintf("%s: %d unknown values (%.2f%%)\n", col, unknown_count, (unknown_count / nrow(bank_data) * 100)))
    }
  }
}
## job: 288 unknown values (0.64%)
## education: 1857 unknown values (4.11%)
## contact: 13020 unknown values (28.80%)
## poutcome: 36959 unknown values (81.75%)

The column with the most unknown values is poutcome, at more than 80%; contact is also largely unknown (28.8%), while job and education have only small fractions.
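One hedged option for handling these: treat “unknown” as missing and impute the low-rate columns (job, education) with their most frequent level, while keeping “unknown” as a legitimate category for contact and poutcome, where it is far too common to impute. A minimal sketch (impute_mode is my own helper; bank_imputed is a new object so the main pipeline is untouched):

# sketch: mode-impute "unknown" where it is rare; leave it as its own
# level where it is common (contact, poutcome)
impute_mode <- function(x) {
  x[x == "unknown"] <- names(which.max(table(x[x != "unknown"])))
  x
}
bank_imputed <- bank_data %>%
  mutate(across(c(job, education), impute_mode))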

Dimensionality Reduction - remove correlated/redundant data than will slow down training

ggcorrplot(correlation_matrix, 
           method = "circle", 
           type = "lower", 
           lab = TRUE, 
           title = "Correlation Matrix of Numeric Variables")

Since I am using Random Forest, there is no need to remove correlated or redundant features: the algorithm handles large numbers of features well and is relatively robust to multicollinearity. The plot above shows only mild correlations among the numeric variables, so no dimensionality reduction is needed.

Feature Engineering - use of business knowledge to create new features

I can create new features from the existing columns: an age_group feature derived from age, and a credit_risk feature derived from balance and loan. I will also remove some features that are not necessary for this project (pdays, poutcome, and duration - the last of which is only known after a call ends, so keeping it would leak the outcome into training).

# create new feature called age_group from age column
bank_data <- bank_data %>%
  mutate(age_group = case_when(
    age < 25 ~ "Young",
    age >= 25 & age < 50 ~ "Middle-aged",
    age >= 50 ~ "Senior"
  ))
# create new feature called credit_risk from balance column
bank_data <- bank_data %>%
  mutate(credit_risk = case_when(
    balance < 0 | loan == "yes" ~ "High Risk",
    balance >= 0 & balance < 5000 & loan == "no" ~ "Medium Risk",
    balance >= 5000 & loan == "no" ~ "Low Risk"
  ))
# Remove specific columns from dataset
bank_final <- bank_data[ , !(names(bank_data) %in% c("pdays", "poutcome", "duration"))]
# Print column names of refined dataset
print(names(bank_final))
##  [1] "age"         "job"         "marital"     "education"   "default"    
##  [6] "balance"     "housing"     "loan"        "contact"     "day"        
## [11] "month"       "campaign"    "previous"    "term"        "age_group"  
## [16] "credit_risk"

Sampling Data - using sampling to resize datasets

# Convert columns to factors
bank_final[c("day", "campaign", "previous")] <- lapply(bank_final[c("day", "campaign", "previous")], factor)
# check datatypes on columns
sapply(bank_final, class)
##         age         job     marital   education     default     balance 
##   "integer" "character" "character" "character" "character"   "integer" 
##     housing        loan     contact         day       month    campaign 
## "character" "character" "character"    "factor" "character"    "factor" 
##    previous        term   age_group credit_risk 
##    "factor" "character" "character" "character"

Data Transformation - regularization, normalization, handling categorical variables

# Convert the categorical columns, including the target, to factor
cat_cols <- c("job", "marital", "education", "default", "housing",
              "loan", "contact", "month", "campaign", "previous", "term")
bank_final[cat_cols] <- lapply(bank_final[cat_cols], factor)

# confirm datatypes of each column
sapply(bank_final, class)
##          age          job      marital    education      default      balance 
##    "integer"     "factor"     "factor"     "factor"     "factor"    "integer" 
##      housing         loan      contact          day        month     campaign 
##     "factor"     "factor"     "factor"     "factor"     "factor"     "factor" 
##     previous         term    age_group  credit_risk 
##     "factor"     "factor"  "character"  "character"

Imbalanced Data - reducing the imbalance between classes

Even after these refinements, the target variable term remains heavily imbalanced: roughly 88% of clients did not subscribe versus about 12% who did, and none of the steps above change that ratio. Before training, the imbalance should be addressed - for example with class weights in Random Forest, by downsampling the majority class, or by oversampling the minority class (e.g., SMOTE).
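A minimal sketch of one of those options - random downsampling of the majority class to a 1:1 ratio (the seed is arbitrary; caret::downSample or SMOTE-style oversampling are common alternatives):

# sketch: downsample the "no" class to match the number of "yes" records
set.seed(622)
n_yes <- sum(bank_final$term == "yes")
bank_balanced <- bank_final %>%
  group_by(term) %>%
  slice_sample(n = n_yes) %>%
  ungroup()
table(bank_balanced$term)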