Exploratory Data Analysis (EDA) Assignment #1

Assignment

Exploratory Data Analysis

Review the structure and content of the data and answer questions such as:

Are the features (columns) of your data correlated?

Correlation matrix indicates almost all of the numeric variables are not correlated in any significant way (-0.11 to 0.16).
What is the overall distribution of each variable?

Histograms of numeric values indicate only “age” is somewhat normally distributed, while all the others do not demonstrate normality.
Are there any outliers present?

Boxplots identify extreme outliers in almost all variables with the exception of ‘day’, which is the last contact day of the month. This is constrained by 1 and 31 days of the month.
How are categorical variables distributed?

Bar charts identify potential differences in the variables from those that indicate ‘yes’ or ‘no’ for subscribing to a term deposit. In general, the categoricals are similar between those that say “yes” and “no”. However, there are some notable exceptions such those with higher education, single individuals, those that do not have a housing loan, individuals contacted by cell phone are more likely to say ‘yes’ to a term deposit.
Do any patterns or trends emerge in the data?

There is ~90% ‘no’ rate from the campaigns. March, September, and December have the highest success. Previous success correlates with possible current success. There may be possible age-job interactions that might benefit from further analysis. Longer duration calls are associated with possible greater success.
What is the central tendency and spread of each variable?

There are asymmetric distributions for balance, call duration, and campaign contacts. The job categories are not equally represented. There are significant spreads seen for balance, duration, and campaign contacts.
Are there any missing values and how significant are they?

Some variables have ‘other’ or ‘unknown’ effectively indicating a missing value. While these variables can be analyzed with these categories, any potential relationships between the other variables might be obscured. ‘pdays’ does have a significant number of NAs, accounting for over 80% of the observations.

str(df)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : num  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ balance  : num  2143 29 2 1506 1 ...
##  $ housing  : logi  TRUE TRUE TRUE TRUE FALSE TRUE ...
##  $ loan     : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : num  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ previous : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

summary(df)

##       age                 job           marital          education    
##  Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
##  1st Qu.:33.00   management :9458   married :27214   secondary:23202  
##  Median :39.00   technician :7597   single  :12790   tertiary :13301  
##  Mean   :40.94   admin.     :5171                    unknown  : 1857  
##  3rd Qu.:48.00   services   :4154                                     
##  Max.   :95.00   retired    :2264                                     
##                  (Other)    :6835                                     
##   default           balance        housing           loan        
##  Mode :logical   Min.   : -8019   Mode :logical   Mode :logical  
##  FALSE:44396     1st Qu.:    72   FALSE:20081     FALSE:37967    
##  TRUE :815       Median :   448   TRUE :25130     TRUE :7244     
##                  Mean   :  1362                                  
##                  3rd Qu.:  1428                                  
##                  Max.   :102127                                  
##                                                                  
##       contact           day            month          duration     
##  cellular :29285   Min.   : 1.00   may    :13766   Min.   :   0.0  
##  telephone: 2906   1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0  
##  unknown  :13020   Median :16.00   aug    : 6247   Median : 180.0  
##                    Mean   :15.81   jun    : 5341   Mean   : 258.2  
##                    3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0  
##                    Max.   :31.00   apr    : 2932   Max.   :4918.0  
##                                    (Other): 6060                   
##     campaign          pdays          previous           poutcome    
##  Min.   : 1.000   Min.   :  1.0   Min.   :  0.0000   failure: 4901  
##  1st Qu.: 1.000   1st Qu.:133.0   1st Qu.:  0.0000   other  : 1840  
##  Median : 2.000   Median :194.0   Median :  0.0000   success: 1511  
##  Mean   : 2.764   Mean   :224.6   Mean   :  0.5803   unknown:36959  
##  3rd Qu.: 3.000   3rd Qu.:327.0   3rd Qu.:  0.0000                  
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                  
##                   NA's   :36954                                     
##    y        
##  no :39922  
##  yes: 5289  
##             
##             
##             
##             
##

numeric_vars <- c("age", "balance", "day", "duration", "campaign", "pdays", "previous")

cor_matrix <- cor(df[, numeric_vars], use = "pairwise.complete.obs")
print(cor_matrix)

##                   age      balance          day     duration     campaign
## age       1.000000000  0.097782739 -0.009120046 -0.004648428  0.004760312
## balance   0.097782739  1.000000000  0.004502585  0.021560380 -0.014578279
## day      -0.009120046  0.004502585  1.000000000 -0.030206341  0.162490216
## duration -0.004648428  0.021560380 -0.030206341  1.000000000 -0.084569503
## campaign  0.004760312 -0.014578279  0.162490216 -0.084569503  1.000000000
## pdays    -0.107862882 -0.108122124 -0.090094644 -0.024406586  0.050533690
## previous  0.001288319  0.016673637 -0.051710497  0.001203057 -0.032855290
##                pdays     previous
## age      -0.10786288  0.001288319
## balance  -0.10812212  0.016673637
## day      -0.09009464 -0.051710497
## duration -0.02440659  0.001203057
## campaign  0.05053369 -0.032855290
## pdays     1.00000000 -0.021884865
## previous -0.02188487  1.000000000

corrplot(cor_matrix,
         method = "color",
         type = "lower",
         tl.col = "black",
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.7,
         col = colorRampPalette(c("#FC4E07", "#FFEDA0", "#00AFBB"))(200))

categorical_vars <- c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome")


for (var in categorical_vars) {
  plot <- ggplot(df, aes(x = y, fill = !!sym(var))) +
    geom_bar(position = "fill") +
    theme_minimal() +
    labs(title = paste("Distribution of", var, "by y"),
         x = "y",
         y = "Proportion",
         fill = var)
  print(plot)
}

# Create a function to generate box plots for each numeric variable
create_boxplot <- function(df, var) {
  ggplot(df, aes(x = "", y = !!sym(var))) +
    geom_boxplot(fill = "steelblue", color = "black") +
    labs(title = paste("Box Plot of", var),
         x = "",
         y = var) +
    theme_minimal()
}

# Create box plots for each numeric variable
box_plots <- lapply(numeric_vars, function(var) create_boxplot(df, var))

# Print the box plots
for (plot in box_plots) {
  print(plot)
}

## Warning: Removed 36954 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# Specify the numeric variables
numeric_vars <- c("age", "balance", "day", "duration", "campaign", "pdays", "previous")


# Create a function to generate histograms for each numeric variable
create_histogram <- function(df, var) {
  ggplot(df, aes(x = !!sym(var))) +
    geom_histogram(fill = "steelblue", color = "black", bins = 30) +
    labs(title = paste("Histogram of", var),
         x = var,
         y = "Frequency") +
    theme_minimal()
}

# Create histograms for each numeric variable
histograms <- lapply(numeric_vars, function(var) create_histogram(df, var))

# Print the histograms
for (plot in histograms) {
  print(plot)
}

## Warning: Removed 36954 rows containing non-finite outside the scale range
## (`stat_bin()`).

Algorithm Selection

Now you have completed the EDA, what Algorithms would suit the business purpose for the dataset.

The bank marketing dataset presents several complex characteristics that make traditional machine learning algorithms unsuitable. Since this assignment is due before we have fully explored alternative options of decision trees and random forest applications, we will focus on only the options currently covered. Here is an analysis of why those conventional approaches fall short:

The dataset’s fundamental structure poses significant challenges for traditional algorithms. With seventeen variables of mixed types (numeric, categorical, and binary), considerable missing values (81.7% in ‘pdays’), and severe class imbalance (11.7% yes vs 88.3% no), basic algorithms lack the sophistication to manage these complexities simultaneously.
Linear and logistic regression models, including their regularized variants (Lasso and Ridge), are particularly inappropriate due to their rigid assumptions. The binary outcome variable violates linear regression’s basic assumptions. There are complex conditional dependencies between variables, important interaction effects between categorical and numeric variables, the presence of hierarchical relationships in the data structure, and the need to capture multiple feature interactions simultaneously.
KNN’s implementation becomes problematic due to the “curse of dimensionality” across seventeen variables. The algorithm’s distance calculations become less meaningful in high-dimensional space, particularly with mixed variable types and even with the large dataset of 45,000+ observations. The large dataset size makes KNN computationally inefficient, while its sensitivity to feature scaling and poor handling of categorical variables further limit its applicability.
Linear Discriminant Analysis (LDA) fails to meet its core assumptions. The data does not follow normal distribution patterns, and the variance-covariance matrices are not equal across classes. The presence of multiple categorical predictors with varying levels and significant outliers violates LDA’s fundamental requirements.
While Naive Bayes could potentially work with the numeric variables due to their low correlations, it may still be sub-optimal due to its limitations in handling mixed data types, skewed distributions, and missing values, rather than due to violations of the independence assumption. Its simplified probability estimates cannot adequately capture the complex relationships between variables, particularly with the mixture of numeric and categorical features and the presence of informative missing values.
The significant proportion of missing values in ‘pdays’ (81.7%) poses a particular challenge. Traditional algorithms either require complete cases or simple imputation methods, which would either significantly reduce the dataset or potentially introduce bias. The missing values are informative (indicating no previous contact) and require more sophisticated handling than these algorithms can provide.
The presence of extreme outliers, particularly in financial variables, severely impacts traditional algorithms’ performance. While alternative versions of these algorithms exist, they often sacrifice other important aspects of model performance to manage outliers.
With over 45,000 observations, algorithms like KNN become computationally expensive, particularly during prediction phases. The need for extensive pre-processing (scaling, encoding, imputation) with traditional algorithms further increases computational workload.

These limitations collectively demonstrate why more sophisticated algorithms like Random Forest and AdaBoost, which can naturally manage these data characteristics, are more appropriate choices. Their ability to manage complex interactions, manage missing values, accommodate outliers, and deal with class imbalance makes them better suited for this prediction task.

Pre-Processing

For Random Forest with this banking dataset, the pre-processing needs are minimal compared to other algorithms, which is one of its advantages. The main pre-processing requirements would focus on managing the class imbalance (11.7% vs 88.3% for the target variable) through either minority over-sampling or class weights and addressing the missing values in ‘pdays’ (81.7% missing) by either creating a binary indicator variable for previous contact. Random Forest can manage both categorical and numerical variables without scaling or normalization, manages outliers well through its bootstrap sampling, and is not affected by non-linear relationships or interactions between variables. While feature engineering could still be beneficial, it is not strictly necessary for the algorithm to perform well. The categorical variables (job, education, marital status, etc.) can be used as-is without one-hot encoding, and the numerical variables (age, balance, duration) do not require standardization or normalization. The focus should be on ensuring the training data provides a balanced representation of both positive and negative cases to prevent the model from being biased toward the majority class.

Exploratory Data Analysis (EDA) Assignment #1

Anthony Conrardy

2025-03-01

Dataset

Assignment