# Load libraries
library(tidyverse)
library(GGally)
library(gridExtra)
# Load the data
bank <- read.csv("https://raw.githubusercontent.com/yli1048/yli1048/refs/heads/622/bank-full.csv", sep=";")
# Check the structure of the dataset
glimpse(bank)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
# Select the numeric variables and plot a histogram for each
numeric_vars <- bank %>%
  select(where(is.numeric))
for (col in colnames(numeric_vars)) {
  hist(numeric_vars[[col]], main = paste("Distribution of", col), col = "darkorange", border = "black", xlab = col)
}
The histograms indicate that most of the numeric variables are right-skewed; in contrast, the histogram for day is roughly uniform across the days of the month.
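As a quick numeric check of that skew, the moment-based skewness can be computed for each variable (a minimal sketch using base R; positive values indicate right skew, and day should sit close to zero):
# Skewness (third standardized moment) of each numeric variable
sapply(numeric_vars, function(x) round(mean((x - mean(x))^3) / sd(x)^3, 2))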
# Horizontal boxplots of each numeric variable (values run along the x-axis, so label with xlab)
boxplot(bank$age, main="Boxplot of Age", xlab="Age", col="green4", horizontal = TRUE)
boxplot(bank$balance, main="Boxplot of Balance", xlab="Balance", col="green4", horizontal = TRUE)
boxplot(bank$day, main="Boxplot of Day", xlab="Day", col="green4", horizontal = TRUE)
boxplot(bank$duration, main="Boxplot of Duration", xlab="Duration", col="green4", horizontal = TRUE)
boxplot(bank$campaign, main="Boxplot of Campaign", xlab="Campaign", col="green4", horizontal = TRUE)
boxplot(bank$pdays, main="Boxplot of Pdays", xlab="Pdays", col="green4", horizontal = TRUE)
boxplot(bank$previous, main="Boxplot of Previous", xlab="Previous", col="green4", horizontal = TRUE)
The boxplots show that many of the numerical variables have outliers, especially the variables balance, duration, and pdays.
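A rough count of the points that fall outside the standard 1.5 * IQR whiskers supports this (a sketch; the 1.5 multiplier is simply the boxplot default, not a tuned threshold):
# Number of observations beyond the 1.5 * IQR whiskers for each numeric variable
sapply(numeric_vars, function(x) {
  q <- quantile(x, c(0.25, 0.75))
  sum(x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x))
})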
There is nearly no linear relationship between most pairs of numeric variables, as the correlation matrix values are approximately zero. The notable exception is the pair 'previous' and 'pdays', with a correlation of about 0.58, suggesting that as 'previous' increases, 'pdays' also tends to increase.
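The matrix referenced here can be reproduced along the following lines (a sketch; the ggcorr() heatmap from GGally is just one way to visualize it, not necessarily how the values above were produced):
# Pairwise Pearson correlations among the numeric variables
round(cor(numeric_vars), 2)
# Heatmap of the same matrix with the coefficients printed on the tiles
ggcorr(numeric_vars, label = TRUE, label_round = 2)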
categorical_vars <- bank %>%
select(where(is.character), -y)
for (col in colnames(categorical_vars)) {
print(paste("Distribution for", col, ":"))
print(table(categorical_vars[[col]]))
cat("\n")
}
## [1] "Distribution for job :"
##
## admin. blue-collar entrepreneur housemaid management
## 5171 9732 1487 1240 9458
## retired self-employed services student technician
## 2264 1579 4154 938 7597
## unemployed unknown
## 1303 288
##
## [1] "Distribution for marital :"
##
## divorced married single
## 5207 27214 12790
##
## [1] "Distribution for education :"
##
## primary secondary tertiary unknown
## 6851 23202 13301 1857
##
## [1] "Distribution for default :"
##
## no yes
## 44396 815
##
## [1] "Distribution for housing :"
##
## no yes
## 20081 25130
##
## [1] "Distribution for loan :"
##
## no yes
## 37967 7244
##
## [1] "Distribution for contact :"
##
## cellular telephone unknown
## 29285 2906 13020
##
## [1] "Distribution for month :"
##
## apr aug dec feb jan jul jun mar may nov oct sep
## 2932 6247 214 2649 1403 6895 5341 477 13766 3970 738 579
##
## [1] "Distribution for poutcome :"
##
## failure other success unknown
## 4901 1840 1511 36959
# Loop through each categorical variable and create a separate bar plot
for (col in colnames(categorical_vars)) {
p <- ggplot(categorical_vars, aes(x = .data[[col]])) +
geom_bar(fill = "steelblue", color = "black") +
theme_minimal() +
labs(title = paste("Distribution of", col), x = col, y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels if needed
print(p) # Print each plot separately
}
The bar plots show that the categorical variables are unevenly distributed: in most of them a single category dominates, for example 'no' for default and loan and 'may' for month.
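One way to quantify that unevenness is to check how dominant the single most frequent category is within each variable (a minimal sketch):
# Share of observations taken by the most common category of each variable
sapply(categorical_vars, function(x) round(max(table(x)) / length(x), 2))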
# Filter dataset by 'y' variable
bank_yes <- filter(bank, y == "yes")
# Analyze demographics of 'yes' responses
demographic_analysis <- bank_yes %>%
group_by(job, marital, education) %>%
summarise(count = n(), .groups = 'drop') %>%
arrange(desc(count))
# Print demographic insights
print(demographic_analysis)
## # A tibble: 129 × 4
## job marital education count
## <chr> <chr> <chr> <int>
## 1 management married tertiary 563
## 2 management single tertiary 451
## 3 technician married secondary 276
## 4 blue-collar married secondary 265
## 5 admin. married secondary 243
## 6 technician single secondary 188
## 7 admin. single secondary 179
## 8 retired married secondary 158
## 9 blue-collar single secondary 147
## 10 technician single tertiary 147
## # ℹ 119 more rows
Among the clients who subscribed, the most frequent combinations involve management, technician, blue-collar, and admin. jobs, mostly with secondary or tertiary education. Across the full dataset, the vast majority of clients have no credit default or personal loan, and May is by far the most common month of last contact.
# Function to find mode of a categorical variable
find_mode <- function(x) {
mode_value <- names(which.max(table(x))) # Get most frequent category
return(mode_value)
}
# Apply the function to each column of the dataset
modes <- sapply(categorical_vars, find_mode)
# Print results
print(modes)
## job marital education default housing
## "blue-collar" "married" "secondary" "no" "yes"
## loan contact month poutcome
## "no" "cellular" "may" "unknown"
For categorical variables, the central tendency is given by the mode, i.e. the most frequent category; the mode of each variable is listed above.
The dataset has no missing values; however, numerous entries are labeled "unknown", particularly for the contact communication type and the outcome of the previous marketing campaign (poutcome). This lack of information could affect the analysis, making it difficult to determine which communication type is the most effective and whether clients might change their preferences in future campaigns.
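A quick tally shows how widespread those entries are (a sketch over the character columns; the label is lowercase "unknown" in the raw data):
# Count of "unknown" entries in each character column
sapply(bank %>% select(where(is.character)), function(x) sum(x == "unknown"))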
This assignment focuses on how well the available predictors can be used to predict whether a client will subscribe to a term deposit. It is therefore advisable to choose algorithms that excel at prediction, such as decision trees and random forests; their main trade-offs are listed below.
Decision Trees:

Pros
- Easy to understand and interpret
- Fast training and prediction
- Handles both numerical and categorical data
- Requires minimal data preprocessing

Cons
- Prone to overfitting
- High variance
- Biased with imbalanced data
- Might not lead to a globally optimal model

Random Forests:

Pros
- Reduces overfitting
- More stable and robust
- Higher accuracy
- Handles missing data and noise well

Cons
- Computationally expensive
- Less interpretable
- Can be biased, requiring additional techniques
- High model complexity
I recommend random forests because they cope well with noise and missing information, and this dataset contains many 'unknown' entries.
Yes, the dataset is labeled: the target variable is the column y. This affects the choice of algorithm because labels make supervised learning possible, and random forests are a supervised method well suited to this setting.
Random forests also scale well to large datasets; since the full dataset has more than 45,000 observations, they are a sensible choice, and they tend to be more accurate and stable than a single decision tree.
For a dataset with fewer than 1,000 records, a single decision tree would be the wiser choice, as it is far less computationally expensive and easier to interpret, while the scalability advantages of random forests matter far less at that size.
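A minimal sketch of the recommended workflow follows, assuming the randomForest package is available; the seed, the 80/20 split, and ntree = 200 are illustrative choices rather than tuned settings:
# Fit a random forest to predict y (illustrative settings, not tuned)
library(randomForest)
set.seed(622)
# randomForest() expects factors, so convert the character columns first
bank_rf <- bank %>% mutate(across(where(is.character), as.factor))
train_idx <- sample(nrow(bank_rf), size = round(0.8 * nrow(bank_rf)))
rf_fit <- randomForest(y ~ ., data = bank_rf[train_idx, ], ntree = 200, importance = TRUE)
# Evaluate on the held-out 20%
rf_pred <- predict(rf_fit, newdata = bank_rf[-train_idx, ])
table(Predicted = rf_pred, Actual = bank_rf$y[-train_idx])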
# Count responses in column 'y'
print(table(bank$y))
##
## no yes
## 39922 5289
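The counts above point to a clear class imbalance, which is relevant to the bias concern noted in the cons lists; as a quick check of the proportions (a minimal sketch):
# Roughly 88% "no" versus 12% "yes"
round(prop.table(table(bank$y)), 3)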