Introduction

This assignment focuses on one of the most important aspects of data science: Exploratory Data Analysis (EDA). Many surveys show that data scientists spend 60-80% of their time on data preparation. EDA allows you to identify data gaps and class imbalances, improve data quality, create better features, and gain a deep understanding of your data before model training - all of which ultimately helps train better models. In machine learning there is a saying, “better data beats better algorithms,” meaning that it is usually more productive to spend time improving data quality than tuning the model-training code.

Libraries required for this project

library(tidyverse)
library(openintro)
library(infer)
library(dplyr)
library(knitr)
library(corrplot)
library(ggcorrplot)
library(ggthemes)

Dataset

A Portuguese bank conducted a marketing campaign (phone calls) to predict whether a client will subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in its next campaign. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing

# import data into R from my GitHub account and check column names
bank_data <- read.csv("https://raw.githubusercontent.com/vitugo23/DATA622/refs/heads/main/bank-full.csv", sep = ";")

# Preview the first few rows of the dataset
kable(head(bank_data, 10), caption = "Bank Dataset")
Bank Dataset

| age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|----:|-----|---------|-----------|---------|--------:|---------|------|---------|----:|-------|---------:|---------:|------:|---------:|----------|---|
| 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |
| 35 | management | married | tertiary | no | 231 | yes | no | unknown | 5 | may | 139 | 1 | -1 | 0 | unknown | no |
| 28 | management | single | tertiary | no | 447 | yes | yes | unknown | 5 | may | 217 | 1 | -1 | 0 | unknown | no |
| 42 | entrepreneur | divorced | tertiary | yes | 2 | yes | no | unknown | 5 | may | 380 | 1 | -1 | 0 | unknown | no |
| 58 | retired | married | primary | no | 121 | yes | no | unknown | 5 | may | 50 | 1 | -1 | 0 | unknown | no |
| 43 | technician | single | secondary | no | 593 | yes | no | unknown | 5 | may | 55 | 1 | -1 | 0 | unknown | no |
# review column names  
colnames(bank_data)
##  [1] "age"       "job"       "marital"   "education" "default"   "balance"  
##  [7] "housing"   "loan"      "contact"   "day"       "month"     "duration" 
## [13] "campaign"  "pdays"     "previous"  "poutcome"  "y"

The “Bank Dataset” contains rich data about the bank’s marketing campaigns and its interactions with customers. The most important columns are:

- age: the customer’s age.
- job: the customer’s job category.
- marital: the customer’s marital status.
- education: the customer’s education level.
- default: whether the customer has credit in default.
- balance: the customer’s average yearly account balance.
- housing: whether the customer has a housing loan.
- loan: whether the customer has a personal loan.
- contact: how the customer was contacted (cellular, telephone, or unknown).
- campaign: the number of contacts made during this campaign.
- y: whether the customer subscribed to a term deposit (the target variable).

# rename column name Y to term for better understanding.
bank_data <- bank_data %>% rename(term = y)
# check dataset columns summary statistics
summary(bank_data)
##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##      term          
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Every column has 45,211 observations (the “Length” entries above); the mean, median, and quartiles vary widely across the numeric columns. Note that pdays uses -1 as a sentinel for clients who were never contacted in a previous campaign, which is why its median is -1.

# review data structure on dataset
str(bank_data)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ term     : chr  "no" "no" "no" "no" ...

The dataset has a mix of integer and character data types, and so far everything looks good to start the EDA.

Sections

1. Exploratory Data Analysis

Review the structure and content of the data and answer questions such as:

Questions

Are the features (columns) of your data correlated?

# select the numeric columns and compute the correlation matrix
numeric_vars<- cor(bank_data %>% select(where(is.numeric)))
numeric_vars
##                   age      balance          day     duration     campaign
## age       1.000000000  0.097782739 -0.009120046 -0.004648428  0.004760312
## balance   0.097782739  1.000000000  0.004502585  0.021560380 -0.014578279
## day      -0.009120046  0.004502585  1.000000000 -0.030206341  0.162490216
## duration -0.004648428  0.021560380 -0.030206341  1.000000000 -0.084569503
## campaign  0.004760312 -0.014578279  0.162490216 -0.084569503  1.000000000
## pdays    -0.023758014  0.003435322 -0.093044074 -0.001564770 -0.088627668
## previous  0.001288319  0.016673637 -0.051710497  0.001203057 -0.032855290
##                 pdays     previous
## age      -0.023758014  0.001288319
## balance   0.003435322  0.016673637
## day      -0.093044074 -0.051710497
## duration -0.001564770  0.001203057
## campaign -0.088627668 -0.032855290
## pdays     1.000000000  0.454819635
## previous  0.454819635  1.000000000
# plot correlation between columns using corrplot
corrplot(numeric_vars
         , method = 'color' 
         , order = 'hclust'
         , addCoef.col = 'black'
         , number.cex = .6 
         )

The correlation plot shows weak correlations among the numeric variables, with most values close to zero. The strongest positive correlation is between “pdays” and “previous” (0.45), which is expected: both describe prior-campaign contact, and both use a “never contacted” default (-1 and 0, respectively), so the figure is partly an artifact of that coding. “day” and “campaign” show a modest positive correlation (0.16), suggesting that the day of the month may influence how many contacts a campaign makes. All other correlations are very weak, indicating minimal linear relationships among the remaining variables.
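One caveat worth checking: because pdays encodes “never contacted” as -1, the 0.45 figure mixes two different populations. A minimal sketch, assuming we recode the -1 sentinel to NA to look only at previously contacted clients:

# sketch: recode the -1 sentinel in pdays to NA and recompute the
# correlation among previously contacted clients only
bank_prev <- bank_data %>% mutate(pdays = na_if(pdays, -1L))
cor(bank_prev$pdays, bank_prev$previous, use = "complete.obs")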

What is the overall distribution of each variable?

# Transform numeric variables from the dataset
numeric_data <- bank_data %>% select(where(is.numeric))

# Change data from wide to long format for plotting
numeric_data_long <- numeric_data %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Plot histograms for numeric variables
ggplot(numeric_data_long, aes(x = Value)) +
  geom_histogram(fill = "blue", color = "black", bins = 30) +
  facet_wrap(~ Variable, scales = "free") +
  theme_minimal() +
  ggtitle("Distribution of Numeric Variables") +
  labs(x = "Value", y = "Count")

Most of the numeric variables are right-skewed rather than symmetrically distributed. For example, “age” is right-skewed, indicating the client base skews younger, with fewer older clients; “balance,” “duration,” “campaign,” and “previous” all have long right tails driven by a small number of extreme values.
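For heavily skewed variables, a log scale makes the shape easier to inspect. A minimal sketch for duration (log1p handles its zeros):

# sketch: inspect a right-skewed variable on a log scale
# (log1p handles the zeros in duration)
ggplot(bank_data, aes(x = log1p(duration))) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  theme_minimal() +
  labs(title = "Call Duration on a log1p Scale",
       x = "log1p(duration in seconds)", y = "Count")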

Are there any outliers present?

# Select numerical variables
numeric_data <- bank_data %>%
  select(where(is.numeric))

# Pivot the data to long format for plotting
numeric_data_long <- numeric_data %>%
  mutate(id = row_number()) %>%  
  pivot_longer(cols = -id, names_to = "variable", values_to = "value")

# Create scatterplots for each numeric variable
ggplot(numeric_data_long, aes(x = id, y = value)) +
  geom_point(alpha = 0.6, color = "darkblue") +  
  facet_wrap(~variable, scales = "free", ncol = 5) + 
  theme_minimal() +
  theme(
    strip.text.x = element_text(size = 10), 
    axis.text.x = element_text(angle = 45, hjust = 1)  
  ) +
  labs(
    x = "ID (Row Number)", 
    y = "Value", 
    title = "Scatterplots of Numerical Variables"
  )

Based on the scatterplots, every numeric feature appears to have outliers except ‘day’, which is bounded between 1 and 31 by construction.
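To quantify this impression, the 1.5 × IQR rule counts how many points would fall outside a boxplot’s whiskers for each variable. A minimal sketch:

# sketch: count points outside the 1.5 * IQR whiskers for each variable
numeric_data %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarize(
    lower = quantile(value, 0.25) - 1.5 * IQR(value),
    upper = quantile(value, 0.75) + 1.5 * IQR(value),
    n_outliers = sum(value < lower | value > upper)
  )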

What are the relationships between different variables?

# Correlation Matrix for Numeric Variables
numeric_data <- bank_data %>% select(where(is.numeric))

correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")

# corr matrix
ggcorrplot(correlation_matrix, 
           method = "circle", 
           type = "lower", 
           lab = TRUE, 
           title = "Correlation Matrix of Numeric Variables")

The correlation matrix above confirms that there are no strong linear relationships among the numeric variables: the strongest are pdays with previous (0.45) and day with campaign (0.16), and everything else is near zero. Note that this matrix covers only the numeric columns; measuring association between categorical variables (e.g., job and education) requires a different statistic, such as Cramér’s V.
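A minimal base-R sketch of Cramér’s V for a pair of categorical columns (cramers_v is my own helper, not a package function):

# sketch: Cramér's V for two categorical columns, derived from a
# chi-squared test of their contingency table
cramers_v <- function(x, y) {
  tbl <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tbl)$statistic)
  unname(sqrt(chi2 / (sum(tbl) * (min(dim(tbl)) - 1))))
}
cramers_v(bank_data$job, bank_data$education)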

How are categorical variables distributed?

# Select categorical variables
categorical_vars <- bank_data %>% select_if(is.character)

# Bar plots for categorical variables
categorical_vars %>%
  gather(key = "Variable", value = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_bar(fill = "blue", color = "black") +
  facet_wrap(~ Variable, scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Distribution of Categorical Variables")

The plots show that most categorical variables are dominated by a single level: cellular is by far the most common contact type, secondary is the most common education level, and the target variable term is heavily skewed toward “no”.
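A quick proportion table quantifies that dominance; for example, for contact:

# sketch: exact level proportions for one dominated variable
round(prop.table(table(bank_data$contact)), 3)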

What is the central tendency and spread of each variable?

# Central tendency and spread for numeric variables
numeric_data %>%
  gather(key = "Variable", value = "Value") %>%
  group_by(Variable) %>%
  summarize(
    Mean = mean(Value),
    Median = median(Value),
    SD = sd(Value),
    IQR = IQR(Value)
  ) %>%
  kable(caption = "Central Tendency and Spread of Numeric Variables")
Central Tendency and Spread of Numeric Variables

| Variable | Mean | Median | SD | IQR |
|----------|-----:|-------:|---:|----:|
| age | 40.9362102 | 39 | 10.618762 | 15 |
| balance | 1362.2720577 | 448 | 3044.765829 | 1356 |
| campaign | 2.7638407 | 2 | 3.098021 | 2 |
| day | 15.8064188 | 16 | 8.322476 | 13 |
| duration | 258.1630798 | 180 | 257.527812 | 216 |
| pdays | 40.1978280 | -1 | 100.128746 | 0 |
| previous | 0.5803234 | 0 | 2.303441 | 0 |

The table above shows that the customer base is generally middle-aged. Most clients have modest account balances, with a few high-balance exceptions producing a skewed distribution (mean 1362 vs. median 448). The marketing strategy involved minimal contact per customer (a median of 2 calls), with interactions spread fairly evenly through the month. Calls were generally brief (median 180 seconds), but some lasted much longer, suggesting varying engagement levels. Notably, the median pdays of -1 indicates that the majority of customers were contacted for the first time during this campaign, highlighting a strategy focused on reaching new prospects.

Are there any missing values and how significant are they?

missing_summary <- bank_data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
  arrange(desc(Missing_Count))  

print(missing_summary)
## # A tibble: 17 × 2
##    Variable  Missing_Count
##    <chr>             <int>
##  1 age                   0
##  2 job                   0
##  3 marital               0
##  4 education             0
##  5 default               0
##  6 balance               0
##  7 housing               0
##  8 loan                  0
##  9 contact               0
## 10 day                   0
## 11 month                 0
## 12 duration              0
## 13 campaign              0
## 14 pdays                 0
## 15 previous              0
## 16 poutcome              0
## 17 term                  0

The table above shows no NA values in the dataset. However, several categorical columns use the string “unknown” as a placeholder, which is addressed in the pre-processing section below.

2. Algorithm Selection

Sections

Select two or more machine learning algorithms presented so far that could be used to train a model (no need to train models - I am only looking for your recommendations).

Based on what I have learned so far in this program, and on the dataset’s mix of numeric and categorical features, I suggest using the Random Forest Classifier and K-Nearest Neighbors (KNN).

What are the pros and cons of each algorithm you selected?

Random Forest Classifier:
Pros: handles imbalanced data well via class weighting (see the sketch below); works with both categorical and numerical features; reduces overfitting by averaging many decision trees.
Cons: slower than simpler models such as logistic regression; harder to interpret than a single tree or a linear model; individual trees can become overly complex on high-dimensional data.

K-Nearest Neighbors (KNN):
Pros: captures complex relationships without assuming any distribution, since the algorithm is non-parametric; works well with mixed data types if the data is properly preprocessed (encoded and scaled).
Cons: computationally expensive on large datasets because of the distance calculations required at prediction time; performance depends heavily on the choice of k and on feature scaling.
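The assignment does not require training a model, but for illustration, here is a minimal sketch of the class weighting mentioned above, assuming the randomForest package is installed (the inverse-frequency weights and ntree value are my own choices, not part of the assignment):

# sketch only - not a deliverable; illustrates class weighting with
# the randomForest package. May take a while on all 45,211 rows.
library(randomForest)
bank_rf <- bank_data %>% mutate(term = factor(term))
wts <- as.vector(1 / table(bank_rf$term))   # inverse class frequencies
rf_fit <- randomForest(term ~ ., data = bank_rf,
                       classwt = wts, ntree = 200)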

Which algorithm would you recommend, and why?

I would recommend Random Forest for this project: the dataset is highly imbalanced, mixes numerical and categorical variables, is relatively large, and is not high-dimensional - all conditions that Random Forest handles well.

KNN is computationally expensive on datasets of this size (45,211 records) because every prediction requires distance calculations against the training set, so it is not recommended here.

Are there labels in your data? Did that impact your choice of algorithm?

Yes - the dataset has a labeled dependent variable (term, whether the client subscribed), so I strongly advise using supervised learning algorithms. Both of my recommendations are supervised classifiers.

How does your choice of algorithm relate to the dataset?

As mentioned above, Random Forest is the best fit for this project: the dataset is highly imbalanced and contains a mix of numerical and categorical variables, both of which the algorithm handles well.

Would your choice of algorithm change if there were fewer than 1,000 data records, and why?

If the dataset had fewer than 1,000 records, I would use KNN instead: Random Forest can overfit when trained on small datasets, while KNN’s main weakness - prediction-time cost on large data - disappears, and it works well on small, mixed datasets once the features are encoded and scaled.

3. Pre-processing

Now you have done an EDA and selected an Algorithm, what pre-processing (if any) would you require for:

Sections

Data Cleaning - improve data quality, address missing data, etc.

The dataset does not contain any NA values, so there is no missing data to impute. However, several categorical columns use the string “unknown” as a placeholder, which I can count with a simple loop:

for (col in colnames(bank_data)) {
  if (is.factor(bank_data[[col]]) || is.character(bank_data[[col]])) {
    unknown_count <- sum(bank_data[[col]] == "unknown", na.rm = TRUE)
    if (unknown_count > 0) {
      cat(sprintf("%s: %d unknown values (%.2f%%)\n", col, unknown_count, (unknown_count / nrow(bank_data) * 100)))
    }
  }
}
## job: 288 unknown values (0.64%)
## education: 1857 unknown values (4.11%)
## contact: 13020 unknown values (28.80%)
## poutcome: 36959 unknown values (81.75%)

The column with the most unknown values is poutcome, at more than 80%; contact is also largely unknown (28.8%), while job and education have only small fractions.
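One hedged option for handling these: treat “unknown” as missing and impute the low-rate columns (job, education) with their most frequent level, while keeping “unknown” as a legitimate category for contact and poutcome, where it is far too common to impute. A minimal sketch (impute_mode is my own helper; bank_imputed is a new object so the main pipeline is untouched):

# sketch: mode-impute "unknown" where it is rare; leave it as its own
# level where it is common (contact, poutcome)
impute_mode <- function(x) {
  x[x == "unknown"] <- names(which.max(table(x[x != "unknown"])))
  x
}
bank_imputed <- bank_data %>%
  mutate(across(c(job, education), impute_mode))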

Dimensionality Reduction - remove correlated/redundant data than will slow down training

ggcorrplot(correlation_matrix, 
           method = "circle", 
           type = "lower", 
           lab = TRUE, 
           title = "Correlation Matrix of Numeric Variables")

Since I am using Random Forest, there is no need to remove correlated or redundant features: the algorithm handles large numbers of features well and is relatively robust to multicollinearity. The plot above shows only mild correlations among the numeric variables, so no dimensionality reduction is needed.

Feature Engineering - use of business knowledge to create new features

I can create new features from the existing columns: an age_group feature derived from age, and a credit_risk feature derived from balance and loan. I will also remove some features that are not necessary for this project (pdays, poutcome, and duration - the last of which is only known after a call ends, so keeping it would leak the outcome into training).

# create new feature called age_group from age column
bank_data <- bank_data %>%
  mutate(age_group = case_when(
    age < 25 ~ "Young",
    age >= 25 & age < 50 ~ "Middle-aged",
    age >= 50 ~ "Senior"
  ))
# create new feature called credit_risk from balance column
bank_data <- bank_data %>%
  mutate(credit_risk = case_when(
    balance < 0 | loan == "yes" ~ "High Risk",
    balance >= 0 & balance < 5000 & loan == "no" ~ "Medium Risk",
    balance >= 5000 & loan == "no" ~ "Low Risk"
  ))
# Remove specific columns from dataset
bank_final <- bank_data[ , !(names(bank_data) %in% c("pdays", "poutcome", "duration"))]
# Print column names of refined dataset
print(names(bank_final))
##  [1] "age"         "job"         "marital"     "education"   "default"    
##  [6] "balance"     "housing"     "loan"        "contact"     "day"        
## [11] "month"       "campaign"    "previous"    "term"        "age_group"  
## [16] "credit_risk"

Sampling Data - using sampling to resize datasets

# Convert columns to factors
bank_final[c("day", "campaign", "previous")] <- lapply(bank_final[c("day", "campaign", "previous")], factor)
# check datatypes on columns
sapply(bank_final, class)
##         age         job     marital   education     default     balance 
##   "integer" "character" "character" "character" "character"   "integer" 
##     housing        loan     contact         day       month    campaign 
## "character" "character" "character"    "factor" "character"    "factor" 
##    previous        term   age_group credit_risk 
##    "factor" "character" "character" "character"

Data Transformation - regularization, normalization, handling categorical variables

# Convert the categorical columns, including the target, to factor
cat_cols <- c("job", "marital", "education", "default", "housing",
              "loan", "contact", "month", "campaign", "previous", "term")
bank_final[cat_cols] <- lapply(bank_final[cat_cols], factor)

# confirm datatypes of each column
sapply(bank_final, class)
##          age          job      marital    education      default      balance 
##    "integer"     "factor"     "factor"     "factor"     "factor"    "integer" 
##      housing         loan      contact          day        month     campaign 
##     "factor"     "factor"     "factor"     "factor"     "factor"     "factor" 
##     previous         term    age_group  credit_risk 
##     "factor"     "factor"  "character"  "character"

Imbalanced Data - reducing the imbalance between classes

Even after these refinements, the target variable term remains heavily imbalanced: roughly 88% of clients did not subscribe versus about 12% who did, and none of the steps above change that ratio. Before training, the imbalance should be addressed - for example with class weights in Random Forest, by downsampling the majority class, or by oversampling the minority class (e.g., SMOTE).
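A minimal sketch of one of those options - random downsampling of the majority class to a 1:1 ratio (the seed is arbitrary; caret::downSample or SMOTE-style oversampling are common alternatives):

# sketch: downsample the "no" class to match the number of "yes" records
set.seed(622)
n_yes <- sum(bank_final$term == "yes")
bank_balanced <- bank_final %>%
  group_by(term) %>%
  slice_sample(n = n_yes) %>%
  ungroup()
table(bank_balanced$term)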