“This assignment focuses on one of the most important aspects of data science, Exploratory Data Analysis (EDA). Many surveys show that data scientists spend 60-80% of their time on data preparation. EDA allows you to identify data gaps & data imbalances, improve data quality, create better features and gain a deep understanding of your data before doing model training - and that ultimately helps train better models. In machine learning, there is a saying - “better data beats better algorithms” - meaning that it is more productive to spend time improving data quality than improving the code to train the model. This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis.” (pasted directly from Assignment for Machine Learning)
The goal of this analysis is to help the bank improve its marketing campaign effectiveness by identifying key factors that influence customer subscriptions to term deposits. By performing Exploratory Data Analysis (EDA), we aim to uncover patterns, trends, and insights that can guide the bank in targeting the right customers and optimizing its marketing strategies. This analysis will also prepare the data for machine learning modeling, ensuring that the models are built on high-quality, well-understood data.
“A Portuguese bank conducted a marketing campaign (phone calls) to predict if a client will subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and figure out most effective tactics that will help the bank in next campaign to persuade more customers to subscribe to the bank’s term deposit. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing” (pasted directly from Assignment for Machine Learning)
The dataset contains 45,211 observations and 17 variables, including demographic information (e.g., age, job, education), financial details (e.g., balance, loans), and campaign-related features (e.g., duration of calls, number of contacts). The target variable y indicates whether a customer subscribed to a term deposit (‘yes’ or ‘no’). This dataset is particularly valuable because it captures real-world marketing campaign data, allowing us to derive actionable insights for future campaigns.
First, we load the dataset into RStudio using the read.csv() function, specifying sep = ";" because the file is semicolon-delimited.
# Load the dataset
bank_data <- read.csv("C:/Users/taham/OneDrive/Desktop/Assignment 1/bank+marketing/bank/bank-full.csv", sep = ";")
# Display the first few rows of the dataset
head(bank_data)
## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
The dataset was loaded successfully, and the first few rows were displayed to ensure proper parsing. No encoding issues or file corruption were detected during the loading process.
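As an aside, the absolute Windows path above ties the report to a single machine. A more portable sketch (not executed here) assumes bank-full.csv sits in the project’s working directory:
# Portable alternative to the hard-coded absolute path (not run)
# bank_data <- read.csv("bank-full.csv", sep = ";")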
We will use the str() function to understand the structure of the dataset, including the types of variables and the number of observations.
# Check the structure of the dataset
str(bank_data)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
The dataset contains 45,211 observations and 17 variables, including both numerical (e.g., age, balance, duration) and categorical (e.g., job, education, marital) features. The target variable y is categorical, indicating whether a customer subscribed to a term deposit. This structure suggests that we will need to handle both numerical and categorical variables appropriately during preprocessing.
We will use the summary() function to get a summary of the dataset, which includes the mean, median, min, max, and quartiles for numerical variables, and frequency counts for categorical variables.
# Get summary statistics
summary(bank_data)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## y
## Length:45211
## Class :character
## Mode :character
##
##
##
The summary statistics reveal key insights about the dataset. The average customer age is 41 (median 39), and account balance is heavily right-skewed, with a median of €448 against a mean of €1,362 and a maximum of €102,127. The duration variable has a wide range (0 to 4,918 seconds), indicating significant variability in call lengths. The pdays variable uses -1 as a sentinel meaning the client was never previously contacted, which explains its otherwise odd quartiles. Additionally, the target variable y is highly imbalanced, with only about 11.7% of customers subscribing to term deposits. This imbalance will need to be addressed during preprocessing to ensure accurate model performance.
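Since summary() only reports the type of the character column y, a one-line check confirms the 11.7% figure cited above:
# Class balance of the target variable (5,289 of 45,211 are "yes")
prop.table(table(bank_data$y))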
We will check for missing values in the dataset using the is.na() function and sum up the missing values for each column.
# Check for missing values
colSums(is.na(bank_data))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
No missing values were found in the dataset, which simplifies the analysis as no imputation or removal of records is required. If missing values were present, we would consider techniques such as mean/median imputation or removal of incomplete records, depending on the extent of missingness.
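Although is.na() reports nothing, this dataset encodes missing information as the literal string ‘unknown’ (visible above in job, education, contact, and poutcome). A quick sketch to count those placeholders per column:
# "unknown" is a valid string, so colSums(is.na()) does not flag it;
# count it per character column instead
sapply(bank_data[sapply(bank_data, is.character)],
       function(x) sum(x == "unknown"))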
We will visualize the data to understand the distribution of variables, identify outliers, and explore relationships between variables.
We will use histograms to visualize the distribution of numerical variables.
# Load ggplot2 for visualization
library(ggplot2)
# Plot histograms for numerical variables
ggplot(bank_data, aes(x = age)) + geom_histogram(binwidth = 5, fill = "blue", color = "black") + ggtitle("Distribution of Age")
ggplot(bank_data, aes(x = balance)) + geom_histogram(binwidth = 1000, fill = "blue", color = "black") + ggtitle("Distribution of Balance")
ggplot(bank_data, aes(x = duration)) + geom_histogram(binwidth = 100, fill = "blue", color = "black") + ggtitle("Distribution of Duration")
The histograms reveal clear patterns in the numerical variables. The distribution of age is right-skewed: most customers are in their 30s and 40s, with a long tail of older clients. The balance histogram peaks near zero, indicating that many customers hold low account balances. The duration variable also has a long right tail, showing that a minority of calls ran far longer than the typical 180-second median.
We will use bar plots to visualize the distribution of categorical variables.
# Plot bar plots for categorical variables
ggplot(bank_data, aes(x = job)) + geom_bar(fill = "blue") + ggtitle("Distribution of Job") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(bank_data, aes(x = marital)) + geom_bar(fill = "blue") + ggtitle("Distribution of Marital Status")
ggplot(bank_data, aes(x = education)) + geom_bar(fill = "blue") + ggtitle("Distribution of Education")
The bar plots highlight the distribution of categorical variables. The majority of customers are married, and most have a secondary education. The most common job type is ‘blue-collar’, followed by ‘management’. These insights help us understand the demographic profile of the bank’s customers and identify potential target segments.
We will create a correlation matrix to understand the relationships between numerical variables.
# Calculate correlation matrix
correlation_matrix <- cor(bank_data[, sapply(bank_data, is.numeric)])
# Plot the correlation matrix
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
corrplot(correlation_matrix, method = "circle")
Because y is categorical, it is not included in this correlation matrix; the plot covers the numerical variables only. The most visible relationship is a moderate positive correlation between pdays and previous, which makes sense since both describe contact from earlier campaigns. No pair of numerical variables is strongly correlated, so multicollinearity is not a concern in this dataset (confirmed by the findCorrelation() check below). The relationship between call duration and subscription, suggested by domain knowledge, is examined directly in the visualization and feature-importance sections.
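To read the strongest pair off the matrix numerically rather than visually, a small sketch over the correlation_matrix computed above:
# Locate the most correlated pair of numerical variables
cm <- abs(correlation_matrix)
diag(cm) <- 0                                 # ignore self-correlations
idx <- which(cm == max(cm), arr.ind = TRUE)[1, ]
rownames(cm)[idx]                             # the variable pair
max(cm)                                       # its absolute correlation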
We will use boxplots to identify outliers in the numerical variables.
# Plot boxplots for numerical variables
ggplot(bank_data, aes(y = age)) + geom_boxplot(fill = "blue") + ggtitle("Boxplot of Age")
ggplot(bank_data, aes(y = balance)) + geom_boxplot(fill = "blue") + ggtitle("Boxplot of Balance")
ggplot(bank_data, aes(y = duration)) + geom_boxplot(fill = "blue") + ggtitle("Boxplot of Duration")
The boxplots reveal heavy-tailed numerical variables. The balance variable in particular contains thousands of outliers under the 1.5×IQR rule (4,729, as computed in the further-analysis section below), which may skew the analysis and distort the model’s understanding of customer financial behavior. To address this, we recommend winsorization or removal of extreme outliers during preprocessing.
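As a concrete example of the winsorization suggested above, a minimal sketch (illustrative only: the 1st/99th percentile caps are an assumption, and the result is not applied to bank_data):
# Cap balance at its 1st and 99th percentiles
caps <- quantile(bank_data$balance, probs = c(0.01, 0.99))
balance_wins <- pmin(pmax(bank_data$balance, caps[[1]]), caps[[2]])
summary(balance_wins)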
We would normally handle missing values at this stage; since none were found in this dataset, no imputation or record removal was required.
We will check for highly correlated features and remove them if necessary.
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
# Check for highly correlated features
highly_correlated <- findCorrelation(correlation_matrix, cutoff = 0.7)
print(highly_correlated)
## integer(0)
No features were found to be highly correlated, so no dimensionality reduction was performed.
We will create new features if necessary. For example, we can create a new feature age_group based on the age of the customers.
# Create age groups
bank_data$age_group <- cut(bank_data$age, breaks = c(0, 20, 40, 60, 80, 100), labels = c("0-20", "20-40", "40-60", "60-80", "80-100"))
# Check the new feature
table(bank_data$age_group)
##
## 0-20 20-40 40-60 60-80 80-100
## 97 24620 19306 1089 99
The age_group feature was created to capture age-related trends in subscription behavior. This new feature divides customers into five age groups: 0-20, 20-40, 40-60, 60-80, and 80-100. This categorization will help us analyze subscription rates across different age groups and identify potential target segments.
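To check whether the new feature is informative, a quick sketch of the subscription rate within each age group:
# Row-wise proportions: share of "yes" within each age band
prop.table(table(bank_data$age_group, bank_data$y), margin = 1)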
We will convert categorical variables into factors.
# Convert all categorical (character) variables to factors in one pass
cat_vars <- c("job", "marital", "education", "default", "housing",
              "loan", "contact", "month", "poutcome", "y")
bank_data[cat_vars] <- lapply(bank_data[cat_vars], as.factor)
Categorical variables were converted to factors to ensure proper handling by machine learning algorithms.
We will standardize numerical variables where appropriate (z-scores via scale()).
# Normalize numerical variables
bank_data$balance <- scale(bank_data$balance)
bank_data$duration <- scale(bank_data$duration)
Balance and duration were standardized to zero mean and unit variance to ensure consistent scaling. Note that scale() returns a one-column matrix, which is why later summary() output labels these columns balance.V1 and duration.V1.
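For reference, min-max normalization is the other common scaling option; a sketch of a helper that is not applied in this report (min_max is a hypothetical name):
# Min-max normalization to [0, 1], an alternative to z-score standardization
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
# Example usage (not run): bank_data$campaign <- min_max(bank_data$campaign)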
We will check if the target variable y is imbalanced and handle it if necessary.
# Check the distribution of the target variable
table(bank_data$y)
##
## no yes
## 39922 5289
# If imbalanced, we can use techniques like SMOTE or undersampling -- this will be expanded upon in the further-analysis section
The target variable y is highly imbalanced, with only 11.7% of customers subscribing to term deposits. This imbalance could lead to biased model performance, as the model may prioritize the majority class (‘no’). To address this, we recommend using techniques like SMOTE (Synthetic Minority Oversampling Technique) or undersampling during further analysis.
# Enhanced correlation heatmap
library(corrplot)
corrplot(correlation_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 45)
# Ratio feature: balance divided by call duration (the name income_to_balance
# is a misnomer -- the dataset has no income variable -- but it is kept
# because later output refers to it; both inputs are z-scores at this point)
bank_data$income_to_balance <- bank_data$balance / bank_data$duration
# Density plot for balance
ggplot(bank_data, aes(x = balance)) + geom_density(fill = "blue", alpha = 0.5) + ggtitle("Density Plot of Balance")
# Identify outliers using IQR
outliers <- boxplot.stats(bank_data$balance)$out
print(paste("Number of outliers in balance:", length(outliers)))
## [1] "Number of outliers in balance: 4729"
# Stacked bar plot for job vs subscription
ggplot(bank_data, aes(x = job, fill = y)) + geom_bar(position = "fill") + ggtitle("Job vs Subscription Rate") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Summary of central tendency
summary(bank_data[, c("age", "balance", "duration")])
## age balance.V1 duration.V1
## Min. :18.00 Min. :-3.08111 Min. :-1.002467
## 1st Qu.:33.00 1st Qu.:-0.42377 1st Qu.:-0.602510
## Median :39.00 Median :-0.30028 Median :-0.303513
## Mean :40.94 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.:48.00 3rd Qu.: 0.02159 3rd Qu.: 0.236234
## Max. :95.00 Max. :33.09441 Max. :18.094500
# Check for missing values
colSums(is.na(bank_data))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y age_group income_to_balance
## 0 0 0
# Check for duplicates
sum(duplicated(bank_data))
## [1] 0
Based on the EDA, we will select appropriate machine learning algorithms; for this binary classification problem we consider Logistic Regression and Random Forest.
After performing the EDA and pre-processing steps, we have a better understanding of the dataset: we have examined the distribution of variables, checked for missing values, and converted categorical variables to factors. Based on the analysis, we recommend the Random Forest algorithm for this classification problem, with Logistic Regression as a baseline for comparison.
We can explore the relationship between categorical variables and the target variable (y). This will help us understand which categories are more likely to lead to a “yes” (subscription to the term deposit).
# Load necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Explore the relationship between job and the target variable
job_summary <- bank_data %>%
group_by(job, y) %>%
summarise(count = n()) %>%
mutate(percentage = count / sum(count) * 100)
## `summarise()` has grouped output by 'job'. You can override using the `.groups`
## argument.
print(job_summary)
## # A tibble: 24 × 4
## # Groups: job [12]
## job y count percentage
## <fct> <fct> <int> <dbl>
## 1 admin. no 4540 87.8
## 2 admin. yes 631 12.2
## 3 blue-collar no 9024 92.7
## 4 blue-collar yes 708 7.27
## 5 entrepreneur no 1364 91.7
## 6 entrepreneur yes 123 8.27
## 7 housemaid no 1131 91.2
## 8 housemaid yes 109 8.79
## 9 management no 8157 86.2
## 10 management yes 1301 13.8
## # ℹ 14 more rows
# Visualize the relationship between job and the target variable
ggplot(job_summary, aes(x = job, y = percentage, fill = y)) +
geom_bar(stat = "identity", position = "dodge") +
ggtitle("Job vs Subscription Rate") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Explore the relationship between education and the target variable
education_summary <- bank_data %>%
group_by(education, y) %>%
summarise(count = n()) %>%
mutate(percentage = count / sum(count) * 100)
## `summarise()` has grouped output by 'education'. You can override using the
## `.groups` argument.
print(education_summary)
## # A tibble: 8 × 4
## # Groups: education [4]
## education y count percentage
## <fct> <fct> <int> <dbl>
## 1 primary no 6260 91.4
## 2 primary yes 591 8.63
## 3 secondary no 20752 89.4
## 4 secondary yes 2450 10.6
## 5 tertiary no 11305 85.0
## 6 tertiary yes 1996 15.0
## 7 unknown no 1605 86.4
## 8 unknown yes 252 13.6
# Visualize the relationship between education and the target variable
ggplot(education_summary, aes(x = education, y = percentage, fill = y)) +
geom_bar(stat = "identity", position = "dodge") +
ggtitle("Education vs Subscription Rate") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The analysis of categorical variables reveals that customers with ‘management’ jobs and ‘tertiary’ education are more likely to subscribe to term deposits. This suggests that the bank should focus its marketing efforts on these segments to improve campaign effectiveness.
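To back the visual comparison with a statistical check, a chi-squared test of independence between job and the target is a natural sketch (expected cell counts are comfortably large at this sample size):
# Test whether subscription status is independent of job category
chisq.test(table(bank_data$job, bank_data$y))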
We will manually undersample the majority class to balance the dataset.
# Separate the majority and minority classes
majority_class <- bank_data[bank_data$y == "no", ]
minority_class <- bank_data[bank_data$y == "yes", ]
# Undersample the majority class
set.seed(123)
undersampled_majority <- majority_class[sample(nrow(majority_class), nrow(minority_class)), ]
# Combine the undersampled majority class with the minority class
bank_data_balanced <- rbind(undersampled_majority, minority_class)
# Check the distribution after undersampling
table(bank_data_balanced$y)
##
## no yes
## 5289 5289
Manual undersampling was performed to balance the dataset, resulting in an equal distribution of ‘yes’ and ‘no’ classes. While this approach addresses the imbalance, it may lead to a loss of information. Future work could explore advanced techniques like SMOTE to generate synthetic samples of the minority class.
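For completeness, a hedged sketch of that SMOTE alternative using the smotefamily package (assumptions: the package is installed, and its SMOTE() accepts numeric features only, so factor columns are left out here):
# Oversample the minority class with synthetic examples (numeric columns only)
library(smotefamily)
num_X <- as.data.frame(lapply(bank_data[sapply(bank_data, is.numeric)],
                              as.numeric))  # flatten scale()'s matrix columns
smote_out <- SMOTE(X = num_X, target = as.character(bank_data$y), K = 5)
table(smote_out$data$class)                 # class counts after oversampling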
We can use a Random Forest model to determine the importance of each feature in predicting the target variable.
# Load necessary libraries
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Check the number of levels in each categorical variable
sapply(bank_data[, sapply(bank_data, is.factor)], function(x) length(levels(x)))
## job marital education default housing loan contact month
## 12 3 4 2 2 2 3 12
## poutcome y age_group
## 4 2 5
# pdays is already an integer; this conversion is a harmless safeguard
bank_data$pdays <- as.numeric(as.character(bank_data$pdays))
# Combine less frequent job types into an "other" category
job_counts <- table(bank_data$job)
infrequent_jobs <- names(job_counts[job_counts < 100]) # Adjust the threshold as needed
bank_data$job <- as.character(bank_data$job)
bank_data$job[bank_data$job %in% infrequent_jobs] <- "other"
bank_data$job <- as.factor(bank_data$job)
# Check the updated number of levels
length(levels(bank_data$job))
## [1] 12
Every job category has at least 100 observations, so no levels were actually merged and job retains all 12 levels.
# Train a Random Forest model
set.seed(123)
rf_model <- randomForest(y ~ ., data = bank_data, importance = TRUE)
# Extract feature importance
importance_scores <- randomForest::importance(rf_model)
print(importance_scores)
## no yes MeanDecreaseAccuracy MeanDecreaseGini
## age 40.1753413 8.728761 40.538808 655.95174
## job 43.6473638 -5.770195 36.055862 579.57607
## marital 7.9999661 16.225497 16.273354 163.54898
## education 21.4670805 1.964390 20.183703 210.94957
## default 0.2292103 5.581169 3.220365 13.29984
## balance 30.1913800 5.042709 30.421711 797.29390
## housing 48.3895919 28.757100 57.589107 166.02350
## loan 1.4510053 13.818699 8.873318 64.75661
## contact 48.7201212 10.228293 50.695477 158.28183
## day 63.3405025 3.358741 62.794992 643.96273
## month 94.5661873 35.594607 100.301739 1043.18047
## duration 70.6614788 134.161489 114.412748 2208.02693
## campaign 21.9798523 11.588055 24.666817 287.61373
## pdays 25.4624318 24.909563 27.739927 368.38997
## previous 16.4222577 12.472049 16.582050 176.16792
## poutcome 36.5005221 15.559642 52.913558 589.42274
## age_group 29.1995886 8.942348 29.851051 164.50241
## income_to_balance 32.9608458 12.429542 33.228432 890.92114
# Plot feature importance
randomForest::varImpPlot(rf_model, main = "Feature Importance")
The Random Forest model identified duration, age, and balance as the
most important features for predicting subscriptions. This aligns with
our earlier findings from the EDA and highlights the importance of these
variables in the bank’s marketing strategy.
In this section, we’ll prepare the data, split it into training and testing sets, and select algorithms (Logistic Regression and Random Forest). After preparing the data, we will train and evaluate the models.
# Load necessary libraries
library(caret)
library(randomForest)
# Treat 'previous' (count of prior contacts) as a factor so rare values can be grouped below
bank_data$previous <- as.factor(bank_data$previous)
# Identify and replace infrequent levels before splitting
infrequent_levels <- names(table(bank_data$previous))[table(bank_data$previous) < 10]
bank_data$previous <- as.character(bank_data$previous)
bank_data$previous[bank_data$previous %in% infrequent_levels] <- "other"
bank_data$previous <- as.factor(bank_data$previous) # Convert back to factor
# Split the data into training (80%) and testing (20%) sets
set.seed(123)
train_index <- createDataPartition(bank_data$y, p = 0.8, list = FALSE)
train_data <- bank_data[train_index, ]
test_data <- bank_data[-train_index, ]
# Ensure test_data$previous has the same levels as train_data$previous
train_levels <- levels(train_data$previous)
test_data$previous <- as.character(test_data$previous)
# Replace unseen levels in test_data with "other"
test_data$previous[!(test_data$previous %in% train_levels)] <- "other"
# Convert back to factor with matching levels
test_data$previous <- factor(test_data$previous, levels = train_levels)
# ✅ Model Training
# Logistic Regression
logistic_model <- glm(y ~ ., data = train_data, family = binomial)
# Random Forest with feature importance enabled
rf_model <- randomForest(y ~ ., data = train_data, importance = TRUE)
# ✅ Model Predictions
# Logistic Regression Predictions
logistic_predictions <- predict(logistic_model, test_data, type = "response")
logistic_predictions <- ifelse(logistic_predictions > 0.5, "yes", "no")
# Random Forest Predictions
rf_predictions <- predict(rf_model, test_data)
# ✅ Model Evaluation
# Confusion Matrices
logistic_confusion <- confusionMatrix(as.factor(logistic_predictions), as.factor(test_data$y))
rf_confusion <- confusionMatrix(as.factor(rf_predictions), as.factor(test_data$y))
# Output confusion matrices
print(logistic_confusion)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7788 708
## yes 196 349
##
## Accuracy : 0.9
## 95% CI : (0.8936, 0.9061)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 1.712e-07
##
## Kappa : 0.3869
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9755
## Specificity : 0.3302
## Pos Pred Value : 0.9167
## Neg Pred Value : 0.6404
## Prevalence : 0.8831
## Detection Rate : 0.8614
## Detection Prevalence : 0.9397
## Balanced Accuracy : 0.6528
##
## 'Positive' Class : no
##
print(rf_confusion)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7695 563
## yes 289 494
##
## Accuracy : 0.9058
## 95% CI : (0.8996, 0.9117)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 2.709e-12
##
## Kappa : 0.4858
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9638
## Specificity : 0.4674
## Pos Pred Value : 0.9318
## Neg Pred Value : 0.6309
## Prevalence : 0.8831
## Detection Rate : 0.8511
## Detection Prevalence : 0.9134
## Balanced Accuracy : 0.7156
##
## 'Positive' Class : no
##
# ✅ Feature Importance from Random Forest
importance_values <- randomForest::importance(rf_model) # Explicitly call randomForest package function
print(importance_values)
## no yes MeanDecreaseAccuracy MeanDecreaseGini
## age 37.64613630 5.6491470 37.615578 511.17889
## job 35.23132895 -3.9194078 29.873341 467.78617
## marital 6.75134815 13.5084418 13.341282 131.43264
## education 20.60299853 -0.7618914 18.675695 168.38171
## default 1.14795025 6.6621684 4.660655 11.61677
## balance 25.54838875 4.7964949 25.973171 623.77206
## housing 44.78510331 28.9581754 49.691150 135.16969
## loan -0.00621064 16.6511075 9.979721 49.48086
## contact 46.24569271 10.5671800 48.579724 124.87997
## day 53.80811018 1.7502511 53.313930 504.31762
## month 85.33182425 35.8925025 90.885423 834.05771
## duration 55.29038890 118.1629985 93.994314 1749.56186
## campaign 23.08708937 8.9516549 24.588447 226.58330
## pdays 22.46518072 17.6318867 23.986958 273.53883
## previous 27.88089735 -4.0581843 24.402047 227.58993
## poutcome 41.25323851 10.1026354 49.930701 481.13373
## age_group 29.76158940 5.6457749 29.792332 127.88537
## income_to_balance 27.73385758 11.0331050 28.712361 718.38742
# Visualize feature importance
varImpPlot(rf_model)
The Random Forest model achieved an accuracy of 90.6%, slightly outperforming Logistic Regression at 90.0%. More importantly, because the ‘positive’ class in these confusion matrices is ‘no’, the specificity row measures recall on the ‘yes’ class: Random Forest recovers 46.7% of actual subscribers versus 33.0% for Logistic Regression. This better performance in identifying potential subscribers makes Random Forest the stronger candidate for the final model.
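Because the reported ‘positive’ class is ‘no’, another inexpensive lever for the minority class is the logistic model’s decision threshold; a sketch with an illustrative (untuned) cutoff of 0.3:
# Trade some "no" accuracy for more "yes" recall by lowering the 0.5 cutoff
logistic_probs <- predict(logistic_model, test_data, type = "response")
logistic_pred_03 <- factor(ifelse(logistic_probs > 0.3, "yes", "no"),
                           levels = levels(test_data$y))
confusionMatrix(logistic_pred_03, test_data$y)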
Visualizations help to understand the relationship between variables and the target variable (y, subscription status). Here are visualizations for relationships between age, duration, balance, and subscription status.
# Visualize the relationship between age and subscription status
ggplot(bank_data, aes(x = age, fill = y)) +
geom_histogram(binwidth = 5, position = "dodge") +
ggtitle("Age vs Subscription") +
theme_minimal() +
labs(x = "Age", y = "Count")
# Visualize the relationship between duration and subscription status
ggplot(bank_data, aes(x = duration, fill = y)) +
geom_histogram(binwidth = 100, position = "dodge") +
ggtitle("Duration vs Subscription") +
theme_minimal() +
labs(x = "Duration (seconds)", y = "Count")
# Visualize the relationship between balance and subscription status
ggplot(bank_data, aes(x = balance, fill = y)) +
geom_histogram(binwidth = 1000, position = "dodge") +
ggtitle("Balance vs Subscription") +
theme_minimal() +
labs(x = "Account Balance", y = "Count")
The visualizations suggest that subscriptions are concentrated among younger customers and those with higher balances. Note that these are raw counts rather than rates, and the dominant ‘no’ class makes proportions easier to read from the earlier stacked bar plots; still, the pattern supports targeting younger, financially stable customers in future marketing campaigns.
In conclusion, the extended analysis confirmed that duration, month, and balance are among the strongest predictors of subscription. The Random Forest model outperformed Logistic Regression and provided actionable insight through its feature importances. Based on these findings, we recommend that the bank aim for longer, higher-quality conversations and target younger customers with higher balances, while noting that duration is only known once a call ends, so it reflects customer engagement rather than a pre-call targeting criterion. Future work could explore advanced balancing techniques and hyperparameter tuning to further improve model performance.
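As a starting point for the hyperparameter tuning mentioned above, a caret sketch (the mtry grid and fold count are assumptions, and this is slow on the full training set):
# 5-fold cross-validated tuning of mtry for the Random Forest
ctrl <- trainControl(method = "cv", number = 5)
rf_tuned <- train(y ~ ., data = train_data, method = "rf",
                  trControl = ctrl,
                  tuneGrid = data.frame(mtry = c(2, 4, 8)))
print(rf_tuned)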