library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.92 loaded
library(DataExplorer)
df <- read.csv("bank-full.csv", sep = ";")
head(df)
## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
glimpse(df)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
# Checking for missing values
colSums(is.na(df))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
For the numerical variables, the overall distributions are shown below as histograms. Most of the variables are right-skewed, with some outliers.
Given the heavy skewness of some features, transformations may be needed before modeling.
# Distribution of numerical variables
plot_histogram(df)
The boxplot of balance shows a large number of outliers. With the exception of age, the other variables (campaign, duration, pdays, previous) also appear to have outliers, although this is hard to read from a single boxplot because balance exceeds 100,000 while the other variables are compressed near 0. The long tails in the histograms above for balance, campaign, duration, pdays, and previous point to the same conclusion. These outliers may skew the model's predictions and hurt its overall performance; for example, an extreme balance of 100,000 can disproportionately influence the weight the model places on the balance feature. A log transformation may therefore be useful to reduce the impact of these extreme values by pulling them closer to the rest of the data.
# Boxplots to detect outliers
df_long <- df %>%
  pivot_longer(cols = c(age, balance, duration, campaign, pdays, previous),
               names_to = "Variable", values_to = "Value")

ggplot(df_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() +
  ggtitle("Boxplots of Numerical Variables") +
  theme_minimal()
To look at the distribution of the categorical variables, we will use a bar plot.
# Categorical variable distributions
plot_bar(df)
The correlation heatmap shows no strong relationships among the numerical variables; only pdays and previous show a moderate positive correlation. Since the outcome is yes/no, we can encode it numerically (no = 0, yes = 1) to examine its correlation with the numerical variables. Call duration has a positive correlation of 0.39 with a "yes" outcome, while the remaining numerical variables have correlations of 0.1 or less, indicating very weak or negligible linear relationships with the outcome. The correlation matrix only captures linear relationships, but some features are related through domain knowledge: balance, loan, and housing could be combined into a financial stability indicator, and campaign, previous, and pdays together describe a client's contact history. Even with low individual correlations, combining these features based on domain knowledge could improve the models' predictive power.
# Convert the outcome variable to numeric (0 and 1)
df$y_numeric <- ifelse(df$y == "yes", 1, 0)
# Correlation matrix including the new numeric outcome variable
num_cols_with_y <- sapply(df, is.numeric)
cor_matrix_with_y <- cor(df[, num_cols_with_y])
corrplot(cor_matrix_with_y, method = "color", tl.col = "black", addCoef.col = "black", number.cex = 0.8)
Based on the heatmap above, there is a clear relationship between duration and the outcome, which makes sense since longer calls can indicate that the client is interested in the term deposit. There may also be a seasonal trend, as contacts are heavily concentrated in the warmer months (May, July, August) compared to the colder months (September, March, December). Subsetting the categorical variables by 'yes' and 'no' outcomes suggests that job type has a noticeable influence on the target variable; for example, job types like 'management' or 'blue-collar' may differ in their subscription rates compared to others.
# Subset the data for 'yes' and 'no' outcomes
df_yes <- df %>% filter(y == "yes")
df_yes$y <- NULL
df_no <- df %>% filter(y == "no")
df_no$y <- NULL
# Plot for 'yes' outcomes
plot_bar(
  df_yes,
  title = "Categorical Variable Distribution for 'Yes' Outcome",
  nrow = 3L,
  ncol = 3L
)
# Plot for 'no' outcomes
plot_bar(
  df_no,
  title = "Categorical Variable Distribution for 'No' Outcome",
  nrow = 3L,
  ncol = 3L
)
Based on the summary statistics, age averages around 41 years, with a median of 39 and a range of 18 to 95, indicating that the clients are mostly middle-aged. Balance is extremely skewed, with a mean of 1,362 but a median of only 448 and a range of -8,019 to 102,127. Duration, which measures call length, spans 0 to 4,918 seconds, with a mean of 258.2 and a median of 180 seconds, suggesting most calls are relatively short but a few are very long outliers.
summary(df)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## y y_numeric
## Length:45211 Min. :0.000
## Class :character 1st Qu.:0.000
## Mode :character Median :0.000
## Mean :0.117
## 3rd Qu.:0.000
## Max. :1.000
There are no missing values in the dataset. However, the 'unknown' category appears in features such as job, education, contact, and poutcome, and these should be addressed to avoid misleading the model. For job and education, 'unknown' is among the least frequent categories, so we can recode it to the most frequent category; for example, unknown jobs would be changed to 'management', the most frequent job category. This keeps the impact on the model small. For contact and poutcome, however, 'unknown' represents a significant portion of the data, so recoding it would not be ideal; it may be best to exclude these variables from the model.
While I could not identify any obvious duplicate records, there were some inconsistent values that might need to be addressed before modeling. The balance feature contains negative values, which likely represent overdrafts, and the pdays feature uses -1 as a placeholder for clients who were not contacted in a previous campaign, a sentinel value rather than a true elapsed time. These inconsistencies can introduce noise and affect the model's predictions. For balance, we can shift the data so all values are non-negative by subtracting the minimum value (-8,019) from every balance, i.e., adding 8,019. We can then apply a log transformation to address the skewness of the balance data.
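A minimal sketch of that adjustment is shown below; the new column names (balance_shifted, balance_log) are illustrative.

# Shift balance so the minimum value becomes 0, then compress the long right tail
df$balance_shifted <- df$balance - min(df$balance)
df$balance_log <- log1p(df$balance_shifted)   # log(1 + x) keeps the zeros valid
summary(df$balance_log)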
The data aligns well with what we would expect from a banking and marketing perspective. Banks typically use a customer’s age, job, education, and other financial indicators to understand their clients better. The distribution of ages, marital statuses, and job types closely resembles what you’d find in a bank’s customer base. Additionally, the positive relationship between call duration and the target variable fits with the idea that longer calls indicate higher customer interest, which aligns with a banker’s effort to sell a product to an interested customer.
Based on the EDA, I would select Logistic Regression and Random Forests for this dataset. Since the target is a binary outcome, Logistic Regression is a natural fit for the classification task, and it handles large datasets with both categorical and numerical variables well, making it suitable for the mixed data types here. Random Forests also handle mixed data types well and can additionally capture non-linear relationships in the data. They cope well with outliers and missing-style values, making them a good choice for a robust model given that this dataset has many outliers and 'unknown' values.
Logistic Regression Pros:
- Well suited to binary classification tasks like this one, where the target is yes/no
- Handles large datasets with a mix of categorical and numerical variables efficiently
- Fast to train and easier to interpret than more complex models

Logistic Regression Cons:
- Only captures linear relationships between the features and the log-odds of the outcome, and the EDA showed little linear correlation between the numerical features and the target
- More sensitive to outliers and heavily skewed features such as balance and duration

Random Forest Pros:
- Handles mixed data types and captures non-linear relationships and interactions
- Robust to outliers and to 'unknown'/missing-style values
- Provides feature importance measures (e.g., the Gini Index)

Random Forest Cons:
- Training a Random Forest requires more resources and is much slower than logistic regression, especially with large datasets
- Harder to understand and interpret than logistic regression due to the complexity of the model
- Requires careful tuning of hyperparameters to achieve optimal performance
If costs, speed, and time were not constraints, I would recommend Random Forests for this dataset. Given the complexity of the data, which includes both numerical and categorical variables as well as outliers and 'unknown' values, Random Forests offer a more flexible model than logistic regression. The correlation plot in the EDA showed little to no linear correlation between the numerical features and the outcome, so a model that can identify non-linear relationships and patterns is attractive. Additionally, Random Forests provide insight into feature importance via the Gini Index. Lastly, since Random Forests are resilient to outliers and missing values, they should give a more accurate and reliable model for this dataset.
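As a minimal sketch of fitting both candidate models, assuming the randomForest package is installed (the settings below are illustrative defaults, not tuned choices):

library(randomForest)

# Convert character columns to factors and drop the duplicate numeric target
df_model <- df %>%
  select(-y_numeric) %>%
  mutate(across(where(is.character), as.factor))

# Baseline logistic regression on the binary outcome
logit_fit <- glm(y ~ ., data = df_model, family = binomial)

# Random Forest with feature importance (Gini-based by default)
set.seed(123)
rf_fit <- randomForest(y ~ ., data = df_model, ntree = 500, importance = TRUE)
varImpPlot(rf_fit)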
Yes, the dataset contains labels, specifically the target variable ‘y’, which indicates whether or not a client subscribed to a term deposit. The presence of labels means this is a supervised learning problem. While it did not significantly impact my choice of algorithm, it confirmed that both Logistic Regression and Random Forests are appropriate for the task. However, when compared to logistic regression, Random Forests are better for handling complex, non-linear relationships in the data, making them a strong choice for this dataset.
The dataset includes both categorical variables, such as job, marital status, and education, and numerical features with skewed distributions, like balance, duration, and campaign. The correlation plot also showed little to no linear correlation among the numerical features, making Random Forests a good fit for modeling this data. Random Forests handle skewed data effectively through bootstrapping and by splitting the data at different thresholds, which matters here because many of the numerical features (balance, duration, campaign) were right-skewed with long tails. Splitting at different points also reduces the noise introduced by outliers in the numerical features and 'unknown' values in the categorical features (job, education, contact). Although Random Forests cost significantly more and take longer to train than a logistic model, especially with a 45,000-row, 17-feature dataset, from a business perspective I would still choose Random Forests, prioritizing accuracy and a reliable model over lower costs and faster speeds.
If there were fewer than 1,000 data records, my choice of algorithm would definitely change. Random Forests do not perform as well on smaller datasets due to the reduced amount of data available to train each tree. In that case, simpler algorithms like Logistic Regression would be more appropriate: it is efficient and can perform well with smaller datasets, providing solid results. Another choice would be a single Decision Tree, which is less computationally demanding and can handle small datasets effectively while still capturing non-linear relationships.
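For reference, a single decision tree can be fit in a couple of lines with the rpart package (a sketch, reusing the factor-converted df_model frame from the earlier modeling sketch):

library(rpart)

# A single classification tree as a lightweight alternative for small datasets
tree_fit <- rpart(y ~ ., data = df_model, method = "class")
printcp(tree_fit)   # complexity table to guide pruning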
There were several categorical features with 'unknown' values, such as job, education, contact, and poutcome. For job and education, 'unknown' is the least frequent category, so we can impute it with the mode of the variable. For contact and poutcome, 'unknown' is one of the dominant values, so rather than impute it I would exclude these two features completely.
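A sketch of this imputation and column removal; the modes are looked up from the data rather than hard-coded, and the helper name mode_of is hypothetical:

# Most frequent level of a categorical column, ignoring 'unknown'
mode_of <- function(x) names(which.max(table(x[x != "unknown"])))

df_clean <- df %>%
  mutate(
    job       = if_else(job == "unknown", mode_of(job), job),
    education = if_else(education == "unknown", mode_of(education), education)
  ) %>%
  select(-contact, -poutcome)   # 'unknown' dominates these two columns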
For the numerical features, there were some with outliers, including balance, campaign, and duration. Random Forests are less sensitive to outliers and noise, so it may not be necessary to address them. In this case, I would not remove or transform the outliers as there are not many of them and I doubt it will impact the performance of the model.
Based on my business knowledge, one feature I would create is average campaign duration, computed by dividing the call duration by the number of contacts in the current campaign (duration / campaign). This feature provides insight into the quality of each interaction, and I would predict a positive correlation between it and the target variable.
Another feature I would create combines job and marital status to capture employment and personal stability. This would be a categorical variable with categories such as 'stable employment and single' or 'unstable employment and married', reflecting how employment and marital stability can affect financial behavior.
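Both engineered features are sketched below; the split of jobs into 'stable' versus 'unstable' employment is a hypothetical grouping for illustration and would need business input:

df_features <- df %>%
  mutate(
    # Average call duration per contact in the current campaign
    avg_campaign_duration = duration / campaign,
    # Illustrative stability grouping (assumed, not defined by the dataset)
    employment  = if_else(job %in% c("management", "admin.", "technician", "services"),
                          "stable employment", "unstable employment"),
    job_marital = paste(employment, marital, sep = " and ")
  )
head(df_features[, c("avg_campaign_duration", "job_marital")])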
There is a class imbalance in the target variable: only about 12% of the outcomes are 'yes' and roughly 88% are 'no' (the mean of y_numeric is 0.117). A resampling technique is an effective way to address this imbalance and produce a balanced, representative training set. We can use oversampling, which increases the number of minority-class ('yes') examples by duplicating them or creating synthetic samples, or undersampling, which reduces the number of majority-class ('no') examples by randomly removing them, at the risk of losing valuable information. I would start with oversampling since it preserves all of the original information.
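A simple random-oversampling sketch (synthetic methods such as SMOTE would require an additional package):

set.seed(42)
yes_rows <- which(df$y == "yes")
no_rows  <- which(df$y == "no")

# Resample the minority class with replacement until it matches the majority class
oversampled_yes <- df[sample(yes_rows, length(no_rows), replace = TRUE), ]
df_balanced <- rbind(df[no_rows, ], oversampled_yes)
table(df_balanced$y)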
For transforming numerical variables, we can normalize features such as balance, duration, and campaign. We can use min-max scaling to transform these features to a range of 0 to 1, ensuring they are on a similar scale and aligned with the model’s requirements. Another option is to apply log transformations to heavily skewed data like balance to reduce the impact of extreme values and improve the model’s performance. For categorical variables, we can apply One-Hot Encoding to convert them into binary vectors, making them suitable for the Random Forest model. Also, for categorical features with many unique categories, Target Encoding can be considered to capture the relationship between the category and the target variable.
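A sketch of the min-max scaling and categorical encoding described above; the column choices are illustrative, and the log transformation of balance was already sketched earlier:

min_max <- function(x) (x - min(x)) / (max(x) - min(x))

df_scaled <- df %>%
  mutate(
    balance_scaled  = min_max(balance),
    duration_scaled = min_max(duration),
    campaign_scaled = min_max(campaign)
  )

# Indicator (dummy) columns for selected categorical variables with base R
X_encoded <- model.matrix(~ job + marital + education - 1, data = df)
dim(X_encoded)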
Given the significant class imbalance in the dataset, where only about 12% of the outcomes are 'yes' and about 88% are 'no', reducing the imbalance is important for improving the model. As mentioned above, we can use resampling techniques such as oversampling and undersampling; a hybrid of both can be effective, but oversampling is preferred here since it does not remove critical data from the training set. We can also employ cost-sensitive learning by adjusting class weights to give more importance to the minority class. Increasing the weight of the minority class makes the Random Forest model more attentive to minority-class predictions, improving its ability to correctly classify 'yes' outcomes.
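If cost-sensitive learning is used, the randomForest function exposes a classwt argument; the weights below are an illustrative guess rather than tuned values, and df_model refers to the factor-converted frame from the earlier modeling sketch:

# Up-weight the minority class so misclassifying 'yes' costs more (assumed ~1:8 prior)
rf_weighted <- randomForest(
  y ~ .,
  data    = df_model,
  ntree   = 500,
  classwt = c(no = 1, yes = 8)
)
rf_weighted$confusion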