Assignment 1: Exploratory Data Analysis

Exploratory Data Analysis

# Load libraries
library(tidyverse)
library(GGally)
library(gridExtra)
# Load the data
bank <- read.csv("https://raw.githubusercontent.com/yli1048/yli1048/refs/heads/622/bank-full.csv", sep=";")

# Check the structure of the dataset
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job       <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital   <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing   <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan      <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

Are the features (columns) of your data correlated?

numeric_vars <- bank[sapply(bank, is.numeric)]
cor_matrix <- cor(numeric_vars, use="complete.obs")
print(cor_matrix)
##                   age      balance          day     duration     campaign
## age       1.000000000  0.097782739 -0.009120046 -0.004648428  0.004760312
## balance   0.097782739  1.000000000  0.004502585  0.021560380 -0.014578279
## day      -0.009120046  0.004502585  1.000000000 -0.030206341  0.162490216
## duration -0.004648428  0.021560380 -0.030206341  1.000000000 -0.084569503
## campaign  0.004760312 -0.014578279  0.162490216 -0.084569503  1.000000000
## pdays    -0.023758014  0.003435322 -0.093044074 -0.001564770 -0.088627668
## previous  0.001288319  0.016673637 -0.051710497  0.001203057 -0.032855290
##                 pdays     previous
## age      -0.023758014  0.001288319
## balance   0.003435322  0.016673637
## day      -0.093044074 -0.051710497
## duration -0.001564770  0.001203057
## campaign -0.088627668 -0.032855290
## pdays     1.000000000  0.454819635
## previous  0.454819635  1.000000000
ggpairs(numeric_vars)

The correlation matrix shows that most numeric features are only weakly correlated (|r| < 0.2). The one exception is pdays and previous, which are moderately correlated (r ≈ 0.45).

What is the overall distribution of each variable?

for (col in colnames(numeric_vars)) {
  hist(numeric_vars[[col]], main=paste("Distribution of", col), col="darkorange", border="black", xlab=col)
}

The histograms indicate that most numerical variables are right-skewed. In contrast, the histogram for day displays a uniform distribution.

Are there any outliers present?

boxplot(bank$age, main="Boxplot of Age", ylab="Age", col="green4", horizontal = TRUE)

boxplot(bank$balance, main="Boxplot of Balance", ylab="Balance", col="green4", horizontal = TRUE)

boxplot(bank$day, main="Boxplot of Day", ylab="Day", col="green4", horizontal = TRUE)

boxplot(bank$duration, main="Boxplot of Duration", ylab="Duration", col="green4", horizontal = TRUE)

boxplot(bank$campaign, main="Boxplot of Campaign", ylab="Campaign", col="green4", horizontal = TRUE)

boxplot(bank$pdays, main="Boxplot of Pdays", ylab="Pdays", col="green4", horizontal = TRUE)

boxplot(bank$previous, main="Boxplot of Previous", ylab="Previous", col="green4", horizontal = TRUE)

The boxplots show that many of the numerical variables have outliers, especially the variables balance, duration, and pdays.
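
To put numbers on what the boxplots show, the sketch below counts values that fall outside the usual 1.5 × IQR fences. The helper name `count_iqr_outliers` and the toy data are illustrative assumptions, and `quantile()` can differ slightly from the hinges that `boxplot()` uses, so the counts may not match the plots exactly.

```r
# Count values outside the 1.5 * IQR fences (approximates the boxplot rule).
# `count_iqr_outliers` is a hypothetical helper, not part of the original analysis.
count_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# Toy data with made-up values: one obvious outlier in x, none in y
toy <- data.frame(x = c(1, 2, 3, 4, 5, 100),
                  y = c(10, 11, 12, 13, 14, 15))
outlier_counts <- sapply(toy, count_iqr_outliers)
print(outlier_counts)
```

On the real data, the same helper could be applied with `sapply(bank[sapply(bank, is.numeric)], count_iqr_outliers)`.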

What are the relationships between different variables?

Most pairs of numeric variables show almost no linear relationship: their correlation coefficients are close to zero. The exception is a moderate positive linear relationship between ‘previous’ and ‘pdays’, with a correlation of about 0.45. This suggests that as the value of ‘previous’ increases, the value of ‘pdays’ also tends to increase. Part of this association likely reflects the coding of the data, since clients who were never contacted before have pdays = -1 and previous = 0.
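
Reading the strongest relationships off a printed matrix is error-prone, so the following sketch ranks every pair of numeric columns by absolute Pearson correlation. The function name `rank_correlations` and the toy data are assumptions for illustration only.

```r
# Rank each pair of numeric columns by absolute Pearson correlation.
# `rank_correlations` is a hypothetical helper, not part of the original analysis.
rank_correlations <- function(df) {
  num <- df[sapply(df, is.numeric)]
  cm <- cor(num, use = "complete.obs")
  idx <- which(upper.tri(cm), arr.ind = TRUE)  # each pair appears once
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]
}

# Toy data with made-up values: b is an exact multiple of a
toy <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(2, 4, 6, 8, 10),
                  c = c(5, 3, 4, 1, 2))
ranked <- rank_correlations(toy)
print(ranked)
```

Calling `rank_correlations(bank)` should list the pdays/previous pair first, consistent with the 0.45 entry in the correlation matrix printed earlier.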

How are categorical variables distributed?

categorical_vars <- bank %>%
  select(where(is.character), -y)

for (col in colnames(categorical_vars)) {
  print(paste("Distribution for", col, ":"))
  print(table(categorical_vars[[col]]))
  cat("\n")
}
## [1] "Distribution for job :"
## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##          5171          9732          1487          1240          9458 
##       retired self-employed      services       student    technician 
##          2264          1579          4154           938          7597 
##    unemployed       unknown 
##          1303           288 
## 
## [1] "Distribution for marital :"
## 
## divorced  married   single 
##     5207    27214    12790 
## 
## [1] "Distribution for education :"
## 
##   primary secondary  tertiary   unknown 
##      6851     23202     13301      1857 
## 
## [1] "Distribution for default :"
## 
##    no   yes 
## 44396   815 
## 
## [1] "Distribution for housing :"
## 
##    no   yes 
## 20081 25130 
## 
## [1] "Distribution for loan :"
## 
##    no   yes 
## 37967  7244 
## 
## [1] "Distribution for contact :"
## 
##  cellular telephone   unknown 
##     29285      2906     13020 
## 
## [1] "Distribution for month :"
## 
##   apr   aug   dec   feb   jan   jul   jun   mar   may   nov   oct   sep 
##  2932  6247   214  2649  1403  6895  5341   477 13766  3970   738   579 
## 
## [1] "Distribution for poutcome :"
## 
## failure   other success unknown 
##    4901    1840    1511   36959
# Loop through each categorical variable and create a separate bar plot
for (col in colnames(categorical_vars)) {
  # aes_string() is deprecated, so index the .data pronoun by column name instead
  p <- ggplot(categorical_vars, aes(x = .data[[col]])) +
    geom_bar(fill = "steelblue", color = "black") +
    theme_minimal() +
    labs(title = paste("Distribution of", col), x = col, y = "Count") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels if needed

  print(p)  # Print each plot separately
}

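The categorical variables are unevenly distributed. For example, default is “no” for about 98% of clients (44,396 of 45,211), poutcome is “unknown” for about 82% (36,959), and May alone accounts for roughly 30% of the records (13,766).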
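
Raw counts are easier to compare as percentages. The sketch below converts a frequency table into percentage shares. The helper name `category_shares` and the toy vector are illustrative assumptions.

```r
# Convert category counts into percentage shares, rounded to one decimal.
# `category_shares` is a hypothetical helper, not part of the original analysis.
category_shares <- function(x) {
  round(100 * prop.table(table(x)), 1)
}

# Toy vector with made-up values
toy <- c("no", "no", "no", "yes")
shares <- category_shares(toy)
print(shares)
```

On the real data, `lapply(categorical_vars, category_shares)` would give the shares for every categorical column.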

# Filter dataset by 'y' variable
bank_yes <- filter(bank, y == "yes")
# Analyze demographics of 'yes' responses
demographic_analysis <- bank_yes %>%
  group_by(job, marital, education) %>%
  summarise(count = n(), .groups = 'drop') %>%
  arrange(desc(count))

# Print demographic insights
print(demographic_analysis)
## # A tibble: 129 × 4
##    job         marital education count
##    <chr>       <chr>   <chr>     <int>
##  1 management  married tertiary    563
##  2 management  single  tertiary    451
##  3 technician  married secondary   276
##  4 blue-collar married secondary   265
##  5 admin.      married secondary   243
##  6 technician  single  secondary   188
##  7 admin.      single  secondary   179
##  8 retired     married secondary   158
##  9 blue-collar single  secondary   147
## 10 technician  single  tertiary    147
## # ℹ 119 more rows

What is the central tendency and spread of each variable?

# Function to find mode of a categorical variable
find_mode <- function(x) {
  mode_value <- names(which.max(table(x)))  # Get most frequent category
  return(mode_value)
}

# Apply the function to each categorical column
modes <- sapply(categorical_vars, find_mode)

# Print results
print(modes)
##           job       marital     education       default       housing 
## "blue-collar"     "married"   "secondary"          "no"         "yes" 
##          loan       contact         month      poutcome 
##          "no"    "cellular"         "may"     "unknown"

For categorical variables, central tendency is summarized by the mode, which is listed above for each variable. For example, the most common job is blue-collar and the most common contact month is May.
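
The mode covers only the categorical columns. For the numeric columns, central tendency and spread could be summarized as in the sketch below, which reports mean, median, standard deviation and IQR. The helper name `summarise_numeric` and the toy data are assumptions for illustration.

```r
# Mean, median, standard deviation and IQR for every numeric column.
# `summarise_numeric` is a hypothetical helper, not part of the original analysis.
summarise_numeric <- function(df) {
  num <- df[sapply(df, is.numeric)]
  data.frame(
    variable = names(num),
    mean     = sapply(num, mean, na.rm = TRUE),
    median   = sapply(num, median, na.rm = TRUE),
    sd       = sapply(num, sd, na.rm = TRUE),
    iqr      = sapply(num, IQR, na.rm = TRUE),
    row.names = NULL
  )
}

# Toy data with made-up values; the character column is ignored
toy <- data.frame(age = c(20, 30, 40, 50, 60),
                  job = c("a", "b", "a", "c", "a"),
                  stringsAsFactors = FALSE)
stats <- summarise_numeric(toy)
print(stats)
```

For skewed columns, the median and IQR are usually more informative than the mean and standard deviation.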

Are there any missing values and how significant are they?

The dataset does not have any missing (NA) values; however, many entries are labeled “unknown”: 13,020 (about 29%) for the contact communication type and 36,959 (about 82%) for the outcome of the previous marketing campaign, plus smaller numbers for job (288) and education (1,857). This lack of information could affect the analysis, making it difficult to determine which communication type is the most effective and whether clients might change their preferences in future campaigns.
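
Both checks, true NA values and “unknown” placeholders, can be done in a few lines. The sketch below assumes the placeholder is the lowercase string "unknown", as in the tables above. The helper name `count_unknown` and the toy data are illustrative.

```r
# Count placeholder values per character column, with their share of all rows.
# `count_unknown` is a hypothetical helper, not part of the original analysis.
count_unknown <- function(df, label = "unknown") {
  chr <- df[sapply(df, is.character)]
  n_unknown <- sapply(chr, function(x) sum(x == label, na.rm = TRUE))
  data.frame(
    variable  = names(chr),
    n_unknown = n_unknown,
    pct       = round(100 * n_unknown / nrow(df), 1),
    row.names = NULL
  )
}

# Toy data with made-up values
toy <- data.frame(contact = c("cellular", "unknown", "unknown", "telephone"),
                  loan    = c("no", "yes", "no", "no"),
                  stringsAsFactors = FALSE)
unknowns <- count_unknown(toy)
na_total <- sum(is.na(toy))  # number of true NA cells
print(unknowns)
print(na_total)
```

On the real data, the equivalent calls would be `count_unknown(bank)` and `sum(is.na(bank))`.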

Algorithm Selection

Select two or more machine learning algorithms presented so far that could be used to train a model (no need to train models - I am only looking for your recommendations).

The goal is to predict whether a client will subscribe to a term deposit (the binary variable y) from the other features, which makes this a supervised classification problem. Decision trees and random forests are both suitable candidates for this task.

What are the pros and cons of each algorithm you selected?

Decision Trees: Pros

  • Easy to understand and interpret
  • Fast training and prediction
  • Handles both numerical and categorical data
  • Requires minimal data preprocessing

Decision Trees: Cons

  • Prone to overfitting
  • High variance
  • Biased with imbalanced data
  • Might not lead to a globally optimal model

Random Forests: Pros

  • Reduces overfitting
  • More stable and robust
  • Higher accuracy
  • Handles missing data and noise well

Random Forests: Cons

  • Computationally expensive
  • Less interpretable
  • Can be biased, requiring additional techniques
  • High complexity

Which algorithm would you recommend, and why?

I recommend random forests: the dataset contains many “unknown” entries, and random forests tolerate noisy and incomplete data well.

Are there labels in your data? Did that impact your choice of algorithm?

Yes, the labels in the dataset are in the column y. This impacts my choice of algorithms since the labels are used for supervised learning; therefore, I chose random forests.

How does your choice of algorithm relate to the dataset?

Random forests cope well with large datasets, and the full dataset has 45,211 observations, so there is enough data to train many trees. Random forests are also generally more accurate and stable than a single decision tree.

Would your choice of algorithm change if there were fewer than 1,000 data records, and why?

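For datasets with fewer than 1,000 records, I would switch to a single decision tree: it is cheaper to train and easier to interpret, and with so little data the extra complexity of a random forest is harder to justify. Because decision trees are prone to overfitting, the tree should be pruned or its depth limited.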

Pre-processing

1. Data Cleaning

  • There are no missing data in the dataset.
  • The pdays column contains negative values (-1), which indicate no previous contact. The column can therefore be converted to a categorical column.

2. Dimensionality Reduction

  • Duration is highly predictive but the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

3. Sampling Data

  • Convert pdays into categorical (“contacted before” vs. “not contacted”).
  • Extract seasonal information based on the month.
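
The two steps above could be sketched as follows. The new column names `contacted_before` and `season`, the helper name, and the month-to-season mapping (Northern Hemisphere meteorological seasons) are all assumptions for illustration. The source only states that negative pdays values mean no previous contact.

```r
# Derive a contact flag from pdays and a season from month.
# `add_contact_and_season` and both new column names are hypothetical.
add_contact_and_season <- function(df) {
  df$contacted_before <- factor(
    ifelse(df$pdays < 0, "not contacted", "contacted before"),
    levels = c("not contacted", "contacted before")
  )
  # Assumed mapping: Northern Hemisphere meteorological seasons
  season_of <- c(dec = "winter", jan = "winter", feb = "winter",
                 mar = "spring", apr = "spring", may = "spring",
                 jun = "summer", jul = "summer", aug = "summer",
                 sep = "autumn", oct = "autumn", nov = "autumn")
  df$season <- factor(unname(season_of[df$month]),
                      levels = c("winter", "spring", "summer", "autumn"))
  df
}

# Toy data with made-up values
toy <- data.frame(pdays = c(-1, 5, 180),
                  month = c("may", "dec", "oct"),
                  stringsAsFactors = FALSE)
toy2 <- add_contact_and_season(toy)
print(toy2)
```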

4. Data Transformation

  • Normalize numerical features using Min-Max scaling or standardization.
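
As a minimal sketch of the two options named above, the helpers below implement Min-Max scaling and standardization in base R. The function names and the toy vector are illustrative assumptions.

```r
# Min-Max scaling: map values onto [0, 1].
# `min_max` and `standardize` are hypothetical helpers.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant column: avoid 0/0
  (x - rng[1]) / (rng[2] - rng[1])
}

# Standardization: subtract the mean and divide by the standard deviation.
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Toy vector with made-up values
toy <- c(0, 5, 10)
scaled <- min_max(toy)
z <- standardize(toy)
print(scaled)
print(z)
```

When the data is later split for modeling, the scaling parameters (minimum, maximum, mean, standard deviation) should be computed on the training portion only and then reused on the test portion.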

5. Imbalanced Data

# Count responses in column 'y'
print(table(bank$y))
## 
##    no   yes 
## 39922  5289
  • The dataset is imbalanced: only 5,289 of the 45,211 clients (about 11.7%) subscribed. Resampling, for example undersampling the majority “no” class, is therefore needed to obtain a balanced dataset.
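
One simple way to undersample is to draw, at random, as many rows from each class as the smallest class has. The sketch below does this in base R. The helper name `undersample_majority`, the fixed seed, and the toy data are assumptions for illustration.

```r
# Random undersampling: keep n_min rows per class, where n_min is the size
# of the smallest class. `undersample_majority` is a hypothetical helper.
undersample_majority <- function(df, target = "y", seed = 42) {
  set.seed(seed)  # fixed seed so the draw is reproducible
  groups <- split(seq_len(nrow(df)), df[[target]])
  n_min <- min(lengths(groups))
  keep <- unlist(lapply(groups, function(idx) {
    idx[sample.int(length(idx), n_min)]
  }))
  df[sort(keep), , drop = FALSE]
}

# Toy data with made-up values: 8 "no" rows and 2 "yes" rows
toy <- data.frame(id = 1:10,
                  y  = c(rep("no", 8), rep("yes", 2)),
                  stringsAsFactors = FALSE)
balanced <- undersample_majority(toy)
print(table(balanced$y))
```

Undersampling should be applied to the training data only, so that the test data keeps the original class proportions.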