1. Exploratory Data Analysis

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.92 loaded
library(DataExplorer)
df <- read.csv("bank-full.csv", sep = ";")
head(df)
##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0  unknown no
## 2   may      151        1    -1        0  unknown no
## 3   may       76        1    -1        0  unknown no
## 4   may       92        1    -1        0  unknown no
## 5   may      198        1    -1        0  unknown no
## 6   may      139        1    -1        0  unknown no
glimpse(df)
## Rows: 45,211
## Columns: 17
## $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job       <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital   <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing   <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan      <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
# Checking for missing values
colSums(is.na(df))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0

Are the features (columns) of your data correlated?

Based on the correlation matrix of the numerical variables in the dataset, most correlations are close to zero. This indicates that the features are not strongly related, which is beneficial for modeling as it reduces the issue of multicollinearity. A notable exception is the correlation between pdays and previous, which stands at 0.45. This suggests a moderate positive correlation.

# Correlation matrix for numerical variables
num_cols <- sapply(df, is.numeric)
cor_matrix <- cor(df[, num_cols])
corrplot(cor_matrix, method = "color", tl.col = "black", addCoef.col = "black", number.cex = 0.8)

What is the overall distribution of each variable?

For the numerical variables, the overall distributions are shown below as histograms. Most of the variables are right-skewed with some outliers.

  • age: right-skewed with most values between 25 and 60; one could also argue it is roughly normally distributed with a peak around 30-40 years old
  • balance: extremely right-skewed with most values near zero, indicating that most individuals have low balances while a few have very high balances
  • campaign, duration, pdays, previous: all heavily right-skewed, indicating most values are small, with a few large outliers
  • day: more uniformly distributed, with peaks around specific days in the middle of the month

Due to the heavy skewness of some features, transformation before modeling might be needed.

# Distribution of numerical variables
plot_histogram(df)

Are there any outliers present?

The boxplot of balance shows a large number of outliers. With the exception of age, the other variables (campaign, duration, pdays, previous) also appear to have outliers based on the boxplot, although they are difficult to see because balance exceeds 100,000 while the other variables are compressed near 0. The long tails in the histograms above for balance, campaign, duration, pdays, and previous likewise suggest outliers. These outliers may skew the model’s predictions and can affect the model’s overall performance. For example, an extreme balance value of 100,000 can disproportionately influence the model’s weight on the balance feature. In this case, it might be beneficial to use a log transformation to reduce the impact of these extreme values by bringing them closer to the rest of the data.

# Boxplots to detect outliers
df_long <- df %>%
  pivot_longer(cols = c(age, balance, duration, campaign, pdays, previous), 
               names_to = "Variable", values_to = "Value")

ggplot(df_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() + 
  ggtitle("Boxplots of Numerical Variables") +
  theme_minimal()

How are categorical variables distributed?

To look at the distribution of the categorical variables, we will use a bar plot.

  • job: the most frequent job category is “blue-collar,” followed by “management” and “technician”
  • marital: the majority of individuals are “married”
  • education: the most common education level is “secondary”
  • default: most individuals do not have a default
  • housing: most individuals have housing loans
  • loan: most individuals do not have personal loans
  • contact: the most common contact method is “unknown”
  • y: a large majority of the outcomes are “no”

# Categorical variable distributions
plot_bar(df)

What are the relationships between different variables?

The correlation heatmap showed no strong relationships between the numerical variables; only pdays and previous showed a moderate positive correlation. Since the outcome is yes or no, we can convert it to a numerical format (no = 0, yes = 1), which lets us check for correlation between the numerical variables and the outcome. Doing so, we see a positive correlation (0.39) between call duration and the likelihood of a “yes” outcome. The rest of the numerical variables have correlations of 0.1 or less with the outcome, indicating a very weak or negligible linear relationship.

While the correlation matrix only captures linear relationships between the variables, some of the features can be related based on domain knowledge. Balance, loan, and housing are similar in that they can be combined to form a financial stability indicator. Campaign, previous, and pdays are related in that a combination of these variables can represent a history of customer contact. Even with low correlation, combining these features based on domain knowledge could enhance the predictive power of the models.

# Convert the outcome variable to numeric (0 and 1)
df$y_numeric <- ifelse(df$y == "yes", 1, 0)

# Correlation matrix including the new numeric outcome variable
num_cols_with_y <- sapply(df, is.numeric)
cor_matrix_with_y <- cor(df[, num_cols_with_y])
corrplot(cor_matrix_with_y, method = "color", tl.col = "black", addCoef.col = "black", number.cex = 0.8)

What is the central tendency and spread of each variable?

Based on the summary statistics, age averages around 41 years, with a median of 39 and a range of 18 to 95, indicating that the clients are mostly middle-aged. Balance is extremely skewed, with a mean of 1,362 but a median of only 448 and a range of -8,019 to 102,127. Duration, which measures call length, has a wide range from 0 to 4,918 seconds, with a mean of 258.2 and a median of 180 seconds, suggesting most calls are relatively short but with some very long outliers.

summary(df)
##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##       y               y_numeric    
##  Length:45211       Min.   :0.000  
##  Class :character   1st Qu.:0.000  
##  Mode  :character   Median :0.000  
##                     Mean   :0.117  
##                     3rd Qu.:0.000  
##                     Max.   :1.000

Are there any missing values and how significant are they?

There are no missing values in the dataset. However, the ‘unknown’ category appears in features such as job, education, contact, and poutcome, and should be addressed so it does not mislead the model. For job and education, ‘unknown’ is among the least frequent categories, so we can recode it to each feature’s most frequent category; for example, unknown job values would be changed to ‘blue-collar’, the most frequent job category. This keeps the impact on the model small. For contact and poutcome, however, ‘unknown’ represents a significant portion of the data, so recoding it would not be appropriate; it may be best to exclude these variables from the model.
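
A minimal sketch of this recoding, assuming we replace “unknown” with each column’s most frequent known category (the helper function name is my own):

# Replace "unknown" with the most frequent non-unknown category of a column
impute_unknown_with_mode <- function(x) {
  known <- x[x != "unknown"]
  mode_value <- names(sort(table(known), decreasing = TRUE))[1]
  ifelse(x == "unknown", mode_value, x)
}

df$job       <- impute_unknown_with_mode(df$job)
df$education <- impute_unknown_with_mode(df$education)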

Duplicate or inconsistent values identified?

While there were no obvious duplicate records, there were some inconsistent values that might need to be addressed before modeling. There were negative values in the balance feature, which likely represent overdrafts. Additionally, the pdays feature uses -1 as a placeholder when there was no previous contact. These inconsistencies can introduce noise and affect the model’s predictions. For balance, we can shift the data so that all values are non-negative by subtracting the minimum value (-8,019), i.e., adding 8,019 to every balance, and then apply a log transformation to address the skewness of the balance data.
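
A minimal sketch of that transformation, shifting balance so its minimum becomes zero and then applying log1p to compress the long right tail:

# Shift balance so the minimum (-8019) becomes 0, then log-transform to reduce skewness
df$balance_shifted <- df$balance - min(df$balance)
df$balance_log     <- log1p(df$balance_shifted)  # log(1 + x) handles the zeros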

Does the data align with domain knowledge & expectations?

The data aligns well with what we would expect from a banking and marketing perspective. Banks typically use a customer’s age, job, education, and other financial indicators to understand their clients better. The distribution of ages, marital statuses, and job types closely resembles what you’d find in a bank’s customer base. Additionally, the positive relationship between call duration and the target variable fits with the idea that longer calls indicate higher customer interest, which aligns with a banker’s effort to sell a product to an interested customer.

2. Algorithm Selection

Select two or more machine learning algorithms presented so far that could be used to train a model.

Based on the EDA, I would select Logistic Regression and Random Forests for this dataset. Since the target is a binary outcome, Logistic Regression is a natural fit for the classification task. It also handles large datasets with categorical and numerical variables well, making it suitable for the mixed data types in our dataset. Random Forests likewise handle mixed data types well and can additionally identify non-linear relationships in the data. They are robust to outliers and noisy values, making them a strong choice for building a reliable model given that the dataset had many outliers and unknown values.
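
As a rough sketch of how the two candidates could be fit (assuming the randomForest package; no train/test split or tuning shown):

library(randomForest)

# Convert character columns to factors and drop the helper y_numeric column
df_model <- df %>%
  select(-y_numeric) %>%
  mutate(across(where(is.character), as.factor))

# Logistic regression on all predictors
logit_fit <- glm(y ~ ., data = df_model, family = binomial)

# Random forest with a modest number of trees
rf_fit <- randomForest(y ~ ., data = df_model, ntree = 200)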

What are the pros and cons of each algorithm you selected?

Logistic Regression

Pros:

  • Logistic regression is easy to understand and interpret
  • It is efficient and works well on large datasets with minimal training time
  • It is specifically designed for binary classification problems
  • The learned coefficients indicate which features are most influential in predicting the outcome
  • Logistic regression can handle both categorical and numerical features

Cons:

  • Logistic regression assumes a linear relationship between independent variables and the log-odds of the dependent variable
  • Less effective at capturing complex non-linear relationships
  • May struggle with imbalanced data without proper adjustments
  • May require feature scaling and transformation when dealing with skewed, unbalanced data
  • It cannot handle missing data directly, so it often requires data cleaning before training

Random Forests

Pros:

  • Random Forests can identify complex, non-linear relationships between features
  • It can handle datasets with mixed features without requiring extensive preprocessing
  • Random Forests are less sensitive to outliers compared to linear models like logistic regression
  • Can handle datasets with many features without requiring dimensionality reduction

Cons:

  • Training a Random Forest requires more resources and is much slower than logistic regression, especially with large datasets
  • Harder to understand and interpret than logistic regression due to the complexity of the model
  • Requires careful tuning of hyperparameters to achieve optimal performance

Which algorithm would you recommend, and why?

If cost, speed, and training time were not constraints, I would recommend Random Forests for this dataset. Given the complexity of the data, which includes both numerical and categorical variables as well as outliers and unknown values, Random Forests offer a more flexible model than logistic regression. The correlation plot in the EDA showed little to no linear correlation among the numerical features, so the ability of Random Forests to identify non-linear relationships and patterns is valuable here. Additionally, Random Forests provide insights into feature importance using the Gini index. Lastly, since Random Forests are resilient to outliers and unknown values, they offer a more accurate and reliable model for this dataset.

Are there labels in your data? Did that impact your choice of algorithm?

Yes, the dataset contains labels, specifically the target variable ‘y’, which indicates whether or not a client subscribed to a term deposit. The presence of labels means this is a supervised learning problem. While it did not significantly impact my choice of algorithm, it confirmed that both Logistic Regression and Random Forests are appropriate for the task. However, when compared to logistic regression, Random Forests are better for handling complex, non-linear relationships in the data, making them a strong choice for this dataset.

How does your choice of algorithm relate to the dataset?

The dataset includes both categorical variables, such as job, marital, and education, and numerical features with skewed distributions, such as balance, duration, and campaign. The correlation plot also showed little to no linear correlation among the numerical features, making Random Forests an ideal algorithm for modeling this data. Additionally, Random Forests handle skewed data effectively through bootstrapping and splitting the data at different points. This is particularly important because many of the numerical features (balance, duration, campaign) were right-skewed and had long tails. Splitting the data at different points also reduces the noise introduced by outliers in the numerical features and by unknown values in the categorical features (job, education, contact). Although Random Forests are the ideal model here, they cost significantly more and take longer to train than a logistic model, especially with a 45,000-row, 17-feature dataset. From a business perspective, I would still choose Random Forests, as I want to prioritize accuracy and a reliable model over lower cost and faster training.

Would your choice of algorithm change if there were fewer than 1,000 data records, and why?

If there were fewer than 1,000 data records, my choice of algorithm would definitely change. Random Forests do not perform as well on smaller datasets due to the reduced amount of data available to train each tree. In this case, simpler algorithms like logistic regression would be more appropriate: Logistic Regression is efficient and can perform well with smaller datasets, providing solid results. Another choice would be Decision Trees, as they are less computationally demanding and can handle small datasets effectively while still capturing non-linear relationships.

3. Pre-processing

Data Cleaning - improve data quality, address missing data, etc.

There were several categorical features with ‘unknown’ data, such as job, education, contact, and poutcome. For job and education, the ‘unknown’ value is negligible as it is the least frequent category, so we can impute it with the mode of the variable. As for contact and poutcome, ‘unknown’ is a more common category. Since it is one of the more dominant values in those categories, I would exclude these two features completely.

For the numerical features, there were some with outliers, including balance, campaign, and duration. Random Forests are less sensitive to outliers and noise, so it may not be necessary to address them. In this case, I would not remove or transform the outliers as there are not many of them and I doubt it will impact the performance of the model.

Dimensionality Reduction - remove correlated/redundant data that will slow down training

Given that the dataset shows almost no linear correlation among the features, dimensionality reduction may not be necessary. However, I would consider removing the variables contact and poutcome since unknown is a common category in these features. These variables may be redundant as they might not provide much predictive power and can slow down the training of the model.
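
A one-line sketch of that removal:

# Drop contact and poutcome, which are dominated by the "unknown" category
df_reduced <- df %>% select(-contact, -poutcome)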

Feature Engineering - use of business knowledge to create new features

Based on my business knowledge, one feature I would create is average campaign duration, calculated by dividing duration by the number of campaign contacts. This feature provides insight into the quality of each interaction, and I would predict a positive correlation between it and the target variable.

Another feature I would create combines job and marital status to capture employment and personal stability. This would be a categorical variable with categories such as ‘stable employment’, ‘unstable employment’, ‘stable employment and single’, and ‘unstable employment and married’. These categories would show how employment and marital stability can impact financial behavior.
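
Both engineered features could be sketched as follows; the job groupings used for the stability flag are illustrative assumptions, not part of the original data:

# Average duration per campaign contact (campaign is always >= 1 in this data)
df$avg_campaign_duration <- df$duration / df$campaign

# Hypothetical employment-stability flag combined with marital status
stable_jobs <- c("management", "admin.", "technician", "services")
df$job_marital <- paste(
  ifelse(df$job %in% stable_jobs, "stable_employment", "other_employment"),
  df$marital,
  sep = "_"
)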

Sampling Data - using sampling to resize datasets

There is a class imbalance in the target variable: only about 12% of the outcomes are ‘yes’ and roughly 88% are ‘no’ (the mean of y_numeric in the summary above is 0.117). A resampling technique is an effective way to address this class imbalance and resize the dataset to ensure balanced representation. We can use oversampling, which increases the number of minority-class (‘yes’) records either by duplicating existing ones or by creating synthetic samples. We can also use undersampling, which reduces the number of majority-class (‘no’) records by randomly removing samples, though this may lead to a loss of valuable information. Applying these resampling techniques can create a balanced and representative dataset that improves the model. I would start with oversampling, as it preserves all of the original information.
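
A minimal sketch of random oversampling with replacement (synthetic methods such as SMOTE would need an additional package, and in practice resampling should be applied to the training split only):

set.seed(42)
minority <- df %>% filter(y == "yes")
majority <- df %>% filter(y == "no")

# Resample the minority class with replacement until the classes are the same size
minority_up <- minority %>% slice_sample(n = nrow(majority), replace = TRUE)
df_balanced <- bind_rows(majority, minority_up)
table(df_balanced$y)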

Data Transformation - regularization, normalization, handling categorical variables

For transforming numerical variables, we can normalize features such as balance, duration, and campaign. We can use min-max scaling to transform these features to a range of 0 to 1, ensuring they are on a similar scale and aligned with the model’s requirements. Another option is to apply log transformations to heavily skewed data like balance to reduce the impact of extreme values and improve the model’s performance. For categorical variables, we can apply One-Hot Encoding to convert them into binary vectors, making them suitable for the Random Forest model. Also, for categorical features with many unique categories, Target Encoding can be considered to capture the relationship between the category and the target variable.
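
A minimal sketch of the min-max scaling and one-hot encoding steps (the scaling helper is my own; model.matrix is one base-R way to build dummy variables):

# Min-max scaling to the [0, 1] range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
df$balance_scaled  <- min_max(df$balance)
df$duration_scaled <- min_max(df$duration)
df$campaign_scaled <- min_max(df$campaign)

# One-hot encode selected categorical variables (the -1 drops the intercept column)
dummies <- model.matrix(~ job + marital + education - 1, data = df)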

Imbalanced Data - reducing the imbalance between classes

Given the significant class imbalance in the dataset, where roughly 12% of the outcomes are ‘yes’ and 88% are ‘no’, reducing the imbalance is important for improving the model. As mentioned before, we can use resampling techniques such as oversampling and undersampling; a hybrid of both can be effective, but oversampling is preferred here since it does not remove data from the training set. We can also employ cost-sensitive learning by adjusting class weights to give more importance to the minority class. By increasing the weight of the minority class, the Random Forest model becomes more attentive to minority-class predictions, improving its ability to correctly classify ‘yes’ outcomes.
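
One way to sketch the cost-sensitive idea, assuming the randomForest package: besides the classwt argument for explicit class weights, stratified sampling can force every tree to see a balanced draw of the two classes. The sample sizes below are illustrative:

library(randomForest)

# Characters to factors, drop the helper numeric copy of the target
df_rf <- df %>%
  select(-y_numeric) %>%
  mutate(across(where(is.character), as.factor))

# Each tree is grown on an equal number of "yes" and "no" cases
rf_balanced <- randomForest(
  y ~ ., data = df_rf,
  strata = df_rf$y,
  sampsize = c(4000, 4000),
  ntree = 200
)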