PART I: Exploratory Data Analysis

Introduction

“This assignment focuses on one of the most important aspects of data science, Exploratory Data Analysis (EDA). Many surveys show that data scientists spend 60-80% of their time on data preparation. EDA allows you to identify data gaps & data imbalances, improve data quality, create better features and gain a deep understanding of your data before doing model training - and that ultimately helps train better models. In machine learning, there is a saying - “better data beats better algorithms” - meaning that it is more productive to spend time improving data quality than improving the code to train the model. This will be an exploratory exercise, so feel free to show errors and warnings that arise during the analysis.” (pasted directly from Assignment for Machine Learning)

The goal of this analysis is to help the bank improve its marketing campaign effectiveness by identifying key factors that influence customer subscriptions to term deposits. By performing Exploratory Data Analysis (EDA), we aim to uncover patterns, trends, and insights that can guide the bank in targeting the right customers and optimizing its marketing strategies. This analysis will also prepare the data for machine learning modeling, ensuring that the models are built on high-quality, well-understood data.

Dataset

“A Portuguese bank conducted a marketing campaign (phone calls) to predict if a client will subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and figure out the most effective tactics that will help the bank in the next campaign to persuade more customers to subscribe to the bank’s term deposit. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing” (pasted directly from Assignment for Machine Learning)

The dataset contains 45,211 observations and 17 variables, including demographic information (e.g., age, job, education), financial details (e.g., balance, loans), and campaign-related features (e.g., duration of calls, number of contacts). The target variable y indicates whether a customer subscribed to a term deposit (‘yes’ or ‘no’). This dataset is particularly valuable because it captures real-world marketing campaign data, allowing us to derive actionable insights for future campaigns.

Step 1: Load the Data

First, we need to load the dataset into RStudio. Because the file is semicolon-delimited, we use the read.csv function with sep = ";".

# Load the dataset
bank_data <- read.csv("C:/Users/taham/OneDrive/Desktop/Assignment 1/bank+marketing/bank/bank-full.csv", sep = ";")

# Display the first few rows of the dataset
head(bank_data)
##   age          job marital education default balance housing loan contact day
## 1  58   management married  tertiary      no    2143     yes   no unknown   5
## 2  44   technician  single secondary      no      29     yes   no unknown   5
## 3  33 entrepreneur married secondary      no       2     yes  yes unknown   5
## 4  47  blue-collar married   unknown      no    1506     yes   no unknown   5
## 5  33      unknown  single   unknown      no       1      no   no unknown   5
## 6  35   management married  tertiary      no     231     yes   no unknown   5
##   month duration campaign pdays previous poutcome  y
## 1   may      261        1    -1        0  unknown no
## 2   may      151        1    -1        0  unknown no
## 3   may       76        1    -1        0  unknown no
## 4   may       92        1    -1        0  unknown no
## 5   may      198        1    -1        0  unknown no
## 6   may      139        1    -1        0  unknown no

The dataset was loaded successfully, and the first few rows were displayed to ensure proper parsing. No encoding issues or file corruption were detected during the loading process.

Step 2: Explore the Structure of the Data

We will use the str() function to understand the structure of the dataset, including the types of variables and the number of observations.

# Check the structure of the dataset
str(bank_data)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

The dataset contains 45,211 observations and 17 variables, including both numerical (e.g., age, balance, duration) and categorical (e.g., job, education, marital) features. The target variable y is categorical, indicating whether a customer subscribed to a term deposit. This structure suggests that we will need to handle both numerical and categorical variables appropriately during preprocessing.

Step 3: Summary Statistics

We will use the summary() function to get a summary of the dataset, which includes the mean, median, min, max, and quartiles for numerical variables, and frequency counts for categorical variables.

# Get summary statistics
summary(bank_data)
##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##       y            
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

The summary statistics reveal key insights about the dataset. For example, the average age of customers is about 41, and the median account balance is €448 (the third quartile is €1,428). The duration variable has a wide range (0 to 4,918 seconds), indicating significant variability in call durations. The pdays variable uses -1 as a sentinel for customers who were never contacted in a previous campaign, which is why its quartiles are all -1. Additionally, the target variable y is highly imbalanced, with only about 11.7% of customers subscribing to term deposits (quantified in Step 7.6). This imbalance will need to be addressed during preprocessing to ensure accurate model performance.
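The imbalance noted above can be quantified directly; y is still a character column at this stage, so base R's table() is enough:

# Class proportions for the target variable (roughly 88% 'no' vs 12% 'yes')
prop.table(table(bank_data$y))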

Step 4: Check for Missing Values

We will check for missing values using the is.na() function, summing the missing values for each column with colSums().

# Check for missing values
colSums(is.na(bank_data))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0

No missing values were found in the dataset, which simplifies the analysis as no imputation or removal of records is required. Note, however, that several categorical variables (job, education, contact, poutcome) encode missingness as the string 'unknown' rather than NA, so it is invisible to is.na(). If true missing values were present, we would consider techniques such as mean/median imputation or removal of incomplete records, depending on the extent of missingness.
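As a minimal sketch of what imputation could look like (not applied to our data, which has no NA values), the hypothetical copy bank_na below recodes 'unknown' in education to NA and then fills it with the most frequent level:

# Illustration only: treat 'unknown' as missing, then impute the modal category
bank_na <- bank_data
bank_na$education[bank_na$education == "unknown"] <- NA
mode_level <- names(which.max(table(bank_na$education)))  # most frequent level
bank_na$education[is.na(bank_na$education)] <- mode_level
table(bank_na$education)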

Step 5: Data Visualization

We will visualize the data to understand the distribution of variables, identify outliers, and explore relationships between variables.

5.1 Distribution of Numerical Variables

We will use histograms to visualize the distribution of numerical variables.

# Load ggplot2 for visualization
library(ggplot2)

# Plot histograms for numerical variables
ggplot(bank_data, aes(x = age)) + geom_histogram(binwidth = 5, fill = "blue", color = "black") + ggtitle("Distribution of Age")

ggplot(bank_data, aes(x = balance)) + geom_histogram(binwidth = 1000, fill = "blue", color = "black") + ggtitle("Distribution of Balance")

ggplot(bank_data, aes(x = duration)) + geom_histogram(binwidth = 100, fill = "blue", color = "black") + ggtitle("Distribution of Duration")

The histograms reveal interesting patterns in the numerical variables. The distribution of age is right-skewed: most customers are in their 30s and 40s, with a long tail of older customers. The balance histogram shows a sharp peak near zero, suggesting that many customers have low account balances. The duration variable also has a long right tail, indicating that some calls were significantly longer than others.
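Given this skew, a log-scale view can make the shapes easier to read. A small sketch for duration, using log1p() since it tolerates the zero-second calls (balance would additionally need shifting because it can be negative):

# Log-scale histogram of call duration
ggplot(bank_data, aes(x = log1p(duration))) + geom_histogram(bins = 50, fill = "blue", color = "black") + ggtitle("Distribution of log(1 + Duration)")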

5.2 Distribution of Categorical Variables

We will use bar plots to visualize the distribution of categorical variables.

# Plot bar plots for categorical variables
ggplot(bank_data, aes(x = job)) + geom_bar(fill = "blue") + ggtitle("Distribution of Job") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(bank_data, aes(x = marital)) + geom_bar(fill = "blue") + ggtitle("Distribution of Marital Status")

ggplot(bank_data, aes(x = education)) + geom_bar(fill = "blue") + ggtitle("Distribution of Education")

The bar plots highlight the distribution of categorical variables. For example, the majority of customers are married, and most have a secondary education. The most common job type is ‘blue-collar,’ followed by ‘management.’ These insights will help us understand the demographic profile of the bank’s customers and identify potential target segments.

5.3 Correlation Matrix

We will create a correlation matrix to understand the relationships between numerical variables.

# Calculate correlation matrix
correlation_matrix <- cor(bank_data[, sapply(bank_data, is.numeric)])

# Plot the correlation matrix
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
corrplot(correlation_matrix, method = "circle")

Note that y is categorical and therefore does not appear in the correlation matrix, which covers only the numerical variables (age, balance, day, duration, campaign, pdays, previous). The most visible relationship is between pdays and previous, which is expected since both describe contact from earlier campaigns. No pair of numerical variables is strongly correlated, indicating that multicollinearity is not a concern in this dataset; the widely held link between call duration and subscription must be checked against y directly, as in the sketch below.
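To relate a numerical variable to the binary target, one simple option is a point-biserial correlation, i.e., Pearson correlation against a 0/1 encoding of y (still a character vector at this point):

# Point-biserial correlation between call duration and subscription
cor(bank_data$duration, as.numeric(bank_data$y == "yes"))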

Step 6: Identify Outliers

We will use boxplots to identify outliers in the numerical variables.

# Plot boxplots for numerical variables
ggplot(bank_data, aes(y = age)) + geom_boxplot(fill = "blue") + ggtitle("Boxplot of Age")

ggplot(bank_data, aes(y = balance)) + geom_boxplot(fill = "blue") + ggtitle("Boxplot of Balance")

ggplot(bank_data, aes(y = duration)) + geom_boxplot(fill = "blue") + ggtitle("Boxplot of Duration")

The boxplots identify outliers in the numerical variables. For example, the balance variable contains 4,729 outliers by the 1.5×IQR rule (confirmed with boxplot.stats() later in this report), which may skew the analysis. These outliers could distort the model’s understanding of customer financial behavior. To address this, we recommend winsorization or removal of extreme outliers during preprocessing.
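As a hedged sketch of the winsorization we recommend (shown only for illustration and not applied, so all downstream results are unchanged), we can cap balance at its 1st and 99th percentiles:

# Winsorize balance at the 1st/99th percentiles (illustration only)
caps <- quantile(bank_data$balance, probs = c(0.01, 0.99))
balance_wins <- pmin(pmax(bank_data$balance, caps[1]), caps[2])
summary(balance_wins)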

Step 7: Pre-processing

7.1 Data Cleaning

We will handle missing values if any. As established in Step 4, this dataset contains no NA values, so we can skip this step.

No cleaning was required, though the 'unknown' category labels noted earlier remain as legitimate factor levels for the models to learn from.

7.2 Dimensionality Reduction

We will check for highly correlated features and remove them if necessary.

library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
# Check for highly correlated features
highly_correlated <- findCorrelation(correlation_matrix, cutoff = 0.7)
print(highly_correlated)
## integer(0)

No features were found to be highly correlated, so no dimensionality reduction was performed.

7.3 Feature Engineering

We will create new features if necessary. For example, we can create a new feature age_group based on the age of the customers.

# Create age groups
bank_data$age_group <- cut(bank_data$age, breaks = c(0, 20, 40, 60, 80, 100), labels = c("0-20", "20-40", "40-60", "60-80", "80-100"))

# Check the new feature
table(bank_data$age_group)
## 
##   0-20  20-40  40-60  60-80 80-100 
##     97  24620  19306   1089     99

The age_group feature was created to capture age-related trends in subscription behavior. This new feature divides customers into five age groups: 0-20, 20-40, 40-60, 60-80, and 80-100. This categorization will help us analyze subscription rates across different age groups and identify potential target segments.
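Since the grouping exists to study subscription behavior, a quick per-group subscription rate is a natural follow-up (y is still a character vector here, so we compare against "yes"):

# Proportion of subscribers within each age group
tapply(bank_data$y == "yes", bank_data$age_group, mean)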

7.4 Handling Categorical Variables

We will convert categorical variables into factors.

# Convert categorical variables to factors
bank_data$job <- as.factor(bank_data$job)
bank_data$marital <- as.factor(bank_data$marital)
bank_data$education <- as.factor(bank_data$education)
bank_data$default <- as.factor(bank_data$default)
bank_data$housing <- as.factor(bank_data$housing)
bank_data$loan <- as.factor(bank_data$loan)
bank_data$contact <- as.factor(bank_data$contact)
bank_data$month <- as.factor(bank_data$month)
bank_data$poutcome <- as.factor(bank_data$poutcome)
bank_data$y <- as.factor(bank_data$y)

Categorical variables were converted to factors to ensure proper handling by machine learning algorithms.
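An equivalent, more compact version of the block above, shown only as an idiomatic alternative (running it after the block is harmless, since as.factor() leaves factors unchanged):

# Convert all the categorical columns to factors in one pass
cat_cols <- c("job", "marital", "education", "default", "housing",
              "loan", "contact", "month", "poutcome", "y")
bank_data[cat_cols] <- lapply(bank_data[cat_cols], as.factor)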

7.5 Data Transformation

We will normalize or standardize numerical variables if necessary.

# Standardize numerical variables (z-scores); note that scale() returns a one-column matrix
bank_data$balance <- scale(bank_data$balance)
bank_data$duration <- scale(bank_data$duration)

Numerical variables were standardized (mean 0, standard deviation 1) to ensure consistent scaling and to improve the performance of scale-sensitive algorithms.
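One side effect worth knowing: scale() returns a one-column matrix, which is why summary() later prints balance.V1 and duration.V1. Wrapping the call in as.numeric() would keep plain vectors; we leave the matrices in place so the outputs below remain reproducible.

# Alternative that keeps plain numeric vectors (not run here):
# bank_data$balance <- as.numeric(scale(bank_data$balance))
# bank_data$duration <- as.numeric(scale(bank_data$duration))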

7.6 Handling Imbalanced Data

We will check if the target variable y is imbalanced and handle it if necessary.

# Check the distribution of the target variable
table(bank_data$y)
## 
##    no   yes 
## 39922  5289
# If imbalanced, we can use techniques like SMOTE or undersampling -- this will be expanded upon in the further analysis section (Part II)

The target variable y is highly imbalanced, with only 11.7% of customers subscribing to term deposits. This imbalance could lead to biased model performance, as the model may prioritize the majority class (‘no’). To address this, we recommend using techniques like SMOTE (Synthetic Minority Oversampling Technique) or undersampling during further analysis.

Other Important Code for Refinement

# Enhanced correlation heatmap
library(corrplot)
corrplot(correlation_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 45)

# Create a new ratio feature (named income_to_balance, though it is in fact the
# ratio of standardized balance to standardized duration)
bank_data$income_to_balance <- bank_data$balance / bank_data$duration

# Density plot for balance
ggplot(bank_data, aes(x = balance)) + geom_density(fill = "blue", alpha = 0.5) + ggtitle("Density Plot of Balance")

# Identify outliers using IQR
outliers <- boxplot.stats(bank_data$balance)$out
print(paste("Number of outliers in balance:", length(outliers)))
## [1] "Number of outliers in balance: 4729"
# Stacked bar plot for job vs subscription
ggplot(bank_data, aes(x = job, fill = y)) + geom_bar(position = "fill") + ggtitle("Job vs Subscription Rate") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Summary of central tendency
summary(bank_data[, c("age", "balance", "duration")])
##       age            balance.V1         duration.V1    
##  Min.   :18.00   Min.   :-3.08111   Min.   :-1.002467  
##  1st Qu.:33.00   1st Qu.:-0.42377   1st Qu.:-0.602510  
##  Median :39.00   Median :-0.30028   Median :-0.303513  
##  Mean   :40.94   Mean   : 0.00000   Mean   : 0.000000  
##  3rd Qu.:48.00   3rd Qu.: 0.02159   3rd Qu.: 0.236234  
##  Max.   :95.00   Max.   :33.09441   Max.   :18.094500
# Check for missing values
colSums(is.na(bank_data))
##       age       job   marital education   default             housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month            campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y age_group           
##         0         0         0
# Check for duplicates
sum(duplicated(bank_data))
## [1] 0

Step 8: Algorithm Selection

Based on the EDA, we will select appropriate machine learning algorithms. For this dataset, we can consider the following algorithms:

  1. Logistic Regression: Suitable for binary classification problems.
  • Pros: Simple, interpretable, and works well with linearly separable data.
  • Cons: Assumes a linear relationship between the features and the log-odds of the target.
  2. Random Forest: Suitable for both classification and regression tasks.
  • Pros: Handles non-linear relationships, robust to outliers, and provides feature importance.
  • Cons: Can be computationally expensive and less interpretable.
  3. Gradient Boosting (e.g., XGBoost): Suitable for complex datasets.
  • Pros: High accuracy, handles missing values, and provides feature importance.
  • Cons: Can be prone to overfitting and computationally expensive.

For this dataset, we recommend Random Forest as the primary algorithm: it handles both numerical and categorical variables effectively, is robust to outliers, and provides feature importance, which is valuable for understanding the impact of different features on the target variable. Logistic Regression is also a good choice for its simplicity and interpretability, especially for understanding the relationship between predictors and the target variable.

Step 9: Conclusion for PART I

After performing the EDA and pre-processing steps, we have a better understanding of the dataset. We have identified the distribution of variables, checked for missing values, and handled categorical variables. Based on the analysis, we recommend using the Random Forest algorithm for this classification problem.

PART II: Further Analysis and Modeling

1. Further Exploration of Categorical Variables

We can explore the relationship between categorical variables and the target variable (y). This will help us understand which categories are more likely to lead to a “yes” (subscription to the term deposit).

# Load necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Explore the relationship between job and the target variable
job_summary <- bank_data %>%
  group_by(job, y) %>%
  summarise(count = n()) %>%
  mutate(percentage = count / sum(count) * 100)
## `summarise()` has grouped output by 'job'. You can override using the `.groups`
## argument.
print(job_summary)
## # A tibble: 24 × 4
## # Groups:   job [12]
##    job          y     count percentage
##    <fct>        <fct> <int>      <dbl>
##  1 admin.       no     4540      87.8 
##  2 admin.       yes     631      12.2 
##  3 blue-collar  no     9024      92.7 
##  4 blue-collar  yes     708       7.27
##  5 entrepreneur no     1364      91.7 
##  6 entrepreneur yes     123       8.27
##  7 housemaid    no     1131      91.2 
##  8 housemaid    yes     109       8.79
##  9 management   no     8157      86.2 
## 10 management   yes    1301      13.8 
## # ℹ 14 more rows
# Visualize the relationship between job and the target variable
ggplot(job_summary, aes(x = job, y = percentage, fill = y)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Job vs Subscription Rate") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Explore the relationship between education and the target variable
education_summary <- bank_data %>%
  group_by(education, y) %>%
  summarise(count = n()) %>%
  mutate(percentage = count / sum(count) * 100)
## `summarise()` has grouped output by 'education'. You can override using the
## `.groups` argument.
print(education_summary)
## # A tibble: 8 × 4
## # Groups:   education [4]
##   education y     count percentage
##   <fct>     <fct> <int>      <dbl>
## 1 primary   no     6260      91.4 
## 2 primary   yes     591       8.63
## 3 secondary no    20752      89.4 
## 4 secondary yes    2450      10.6 
## 5 tertiary  no    11305      85.0 
## 6 tertiary  yes    1996      15.0 
## 7 unknown   no     1605      86.4 
## 8 unknown   yes     252      13.6
# Visualize the relationship between education and the target variable
ggplot(education_summary, aes(x = education, y = percentage, fill = y)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Education vs Subscription Rate") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The analysis of categorical variables reveals that customers with ‘management’ jobs and ‘tertiary’ education are more likely to subscribe to term deposits. This suggests that the bank should focus its marketing efforts on these segments to improve campaign effectiveness.
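To check that these differences are statistically meaningful rather than sampling noise, a chi-square test of independence is a simple follow-up:

# Chi-square test: is subscription status independent of job category?
chisq.test(table(bank_data$job, bank_data$y))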

2. Handling Imbalanced Data using ROSE, DMwR, or Manual Undersampling

Manual Undersampling

Manually undersample the majority class to balance the dataset.

# Separate the majority and minority classes
majority_class <- bank_data[bank_data$y == "no", ]
minority_class <- bank_data[bank_data$y == "yes", ]

# Undersample the majority class
set.seed(123)
undersampled_majority <- majority_class[sample(nrow(majority_class), nrow(minority_class)), ]

# Combine the undersampled majority class with the minority class
bank_data_balanced <- rbind(undersampled_majority, minority_class)

# Check the distribution after undersampling
table(bank_data_balanced$y)
## 
##   no  yes 
## 5289 5289

Manual undersampling was performed to balance the dataset, resulting in an equal distribution of ‘yes’ and ‘no’ classes. While this approach addresses the imbalance, it may lead to a loss of information. Future work could explore advanced techniques like SMOTE to generate synthetic samples of the minority class.
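The section heading mentions ROSE and DMwR; as a hedged sketch (assuming the ROSE package is installed, which this report does not do), the same undersampling can be reproduced with ovun.sample():

# Packaged alternative to the manual undersampling above (requires ROSE)
# library(ROSE)
# under <- ovun.sample(y ~ ., data = bank_data, method = "under",
#                      N = 2 * sum(bank_data$y == "yes"), seed = 123)$data
# table(under$y)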

3. Feature Importance Analysis

We can use a Random Forest model to determine the importance of each feature in predicting the target variable.

# Load necessary libraries
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Check the number of levels in each categorical variable
sapply(bank_data[, sapply(bank_data, is.factor)], function(x) length(levels(x)))
##       job   marital education   default   housing      loan   contact     month 
##        12         3         4         2         2         2         3        12 
##  poutcome         y age_group 
##         4         2         5
# Ensure pdays is numeric (it is already an integer column, so this is only a safeguard)
bank_data$pdays <- as.numeric(as.character(bank_data$pdays))

# Combine less frequent job types into an "other" category
job_counts <- table(bank_data$job)
infrequent_jobs <- names(job_counts[job_counts < 100])  # Adjust the threshold as needed
bank_data$job <- as.character(bank_data$job)
bank_data$job[bank_data$job %in% infrequent_jobs] <- "other"
bank_data$job <- as.factor(bank_data$job)

# Check the updated number of levels (still 12: no job type had fewer than
# 100 observations, so nothing was actually combined into "other")
length(levels(bank_data$job))
## [1] 12
# Train a Random Forest model
set.seed(123)
rf_model <- randomForest(y ~ ., data = bank_data, importance = TRUE)

# Extract feature importance
importance_scores <- randomForest::importance(rf_model)
print(importance_scores)
##                           no        yes MeanDecreaseAccuracy MeanDecreaseGini
## age               40.1753413   8.728761            40.538808        655.95174
## job               43.6473638  -5.770195            36.055862        579.57607
## marital            7.9999661  16.225497            16.273354        163.54898
## education         21.4670805   1.964390            20.183703        210.94957
## default            0.2292103   5.581169             3.220365         13.29984
## balance           30.1913800   5.042709            30.421711        797.29390
## housing           48.3895919  28.757100            57.589107        166.02350
## loan               1.4510053  13.818699             8.873318         64.75661
## contact           48.7201212  10.228293            50.695477        158.28183
## day               63.3405025   3.358741            62.794992        643.96273
## month             94.5661873  35.594607           100.301739       1043.18047
## duration          70.6614788 134.161489           114.412748       2208.02693
## campaign          21.9798523  11.588055            24.666817        287.61373
## pdays             25.4624318  24.909563            27.739927        368.38997
## previous          16.4222577  12.472049            16.582050        176.16792
## poutcome          36.5005221  15.559642            52.913558        589.42274
## age_group         29.1995886   8.942348            29.851051        164.50241
## income_to_balance 32.9608458  12.429542            33.228432        890.92114
# Plot feature importance
randomForest::varImpPlot(rf_model, main = "Feature Importance")

The Random Forest importance scores identify duration as by far the most important feature, followed by month, the balance-based features (balance and the derived income_to_balance ratio), and age. This largely aligns with our earlier findings. One caveat for the bank’s marketing strategy: duration is only observed after a call has taken place, so it explains outcomes well but cannot be used to decide whom to call.
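For a quick ranked view of the importance table above, one column can be sorted directly:

# Rank features by mean decrease in Gini impurity
sort(importance_scores[, "MeanDecreaseGini"], decreasing = TRUE)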

4. Model Preparation and Algorithm Selection

In this section, we’ll prepare the data, split it into training and testing sets, and select algorithms (Logistic Regression and Random Forest). After preparing the data, we will train and evaluate the models.

# Load necessary libraries
library(caret)
library(randomForest)

# Ensure 'previous' is treated as a factor
bank_data$previous <- as.factor(bank_data$previous)

# Identify and replace infrequent levels before splitting
infrequent_levels <- names(table(bank_data$previous))[table(bank_data$previous) < 10]
bank_data$previous <- as.character(bank_data$previous)
bank_data$previous[bank_data$previous %in% infrequent_levels] <- "other"
bank_data$previous <- as.factor(bank_data$previous)  # Convert back to factor

# Split the data into training (80%) and testing (20%) sets
set.seed(123)
train_index <- createDataPartition(bank_data$y, p = 0.8, list = FALSE)
train_data <- bank_data[train_index, ]
test_data <- bank_data[-train_index, ]

# Ensure test_data$previous has the same levels as train_data$previous
train_levels <- levels(train_data$previous)
test_data$previous <- as.character(test_data$previous)

# Replace unseen levels in test_data with "other"
test_data$previous[!(test_data$previous %in% train_levels)] <- "other"

# Convert back to factor with matching levels
test_data$previous <- factor(test_data$previous, levels = train_levels)

# ✅ Model Training

# Logistic Regression
logistic_model <- glm(y ~ ., data = train_data, family = binomial)

# Random Forest with feature importance enabled
rf_model <- randomForest(y ~ ., data = train_data, importance = TRUE)

# ✅ Model Predictions

# Logistic Regression Predictions
logistic_predictions <- predict(logistic_model, test_data, type = "response")
logistic_predictions <- ifelse(logistic_predictions > 0.5, "yes", "no")
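# Note: 0.5 is just the default cutoff. With an ~88/12 class split, a lower
# threshold (tuned on held-out data) would typically recover more "yes" cases
# at the cost of some false positives.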

# Random Forest Predictions
rf_predictions <- predict(rf_model, test_data)

# ✅ Model Evaluation

# Confusion Matrices
logistic_confusion <- confusionMatrix(as.factor(logistic_predictions), as.factor(test_data$y))
rf_confusion <- confusionMatrix(as.factor(rf_predictions), as.factor(test_data$y))

# Output confusion matrices
print(logistic_confusion)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7788  708
##        yes  196  349
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.8936, 0.9061)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : 1.712e-07       
##                                           
##                   Kappa : 0.3869          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9755          
##             Specificity : 0.3302          
##          Pos Pred Value : 0.9167          
##          Neg Pred Value : 0.6404          
##              Prevalence : 0.8831          
##          Detection Rate : 0.8614          
##    Detection Prevalence : 0.9397          
##       Balanced Accuracy : 0.6528          
##                                           
##        'Positive' Class : no              
## 
print(rf_confusion)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7695  563
##        yes  289  494
##                                           
##                Accuracy : 0.9058          
##                  95% CI : (0.8996, 0.9117)
##     No Information Rate : 0.8831          
##     P-Value [Acc > NIR] : 2.709e-12       
##                                           
##                   Kappa : 0.4858          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9638          
##             Specificity : 0.4674          
##          Pos Pred Value : 0.9318          
##          Neg Pred Value : 0.6309          
##              Prevalence : 0.8831          
##          Detection Rate : 0.8511          
##    Detection Prevalence : 0.9134          
##       Balanced Accuracy : 0.7156          
##                                           
##        'Positive' Class : no              
## 
# ✅ Feature Importance from Random Forest
importance_values <- randomForest::importance(rf_model)  # Explicitly call randomForest package function
print(importance_values)
##                            no         yes MeanDecreaseAccuracy MeanDecreaseGini
## age               37.64613630   5.6491470            37.615578        511.17889
## job               35.23132895  -3.9194078            29.873341        467.78617
## marital            6.75134815  13.5084418            13.341282        131.43264
## education         20.60299853  -0.7618914            18.675695        168.38171
## default            1.14795025   6.6621684             4.660655         11.61677
## balance           25.54838875   4.7964949            25.973171        623.77206
## housing           44.78510331  28.9581754            49.691150        135.16969
## loan              -0.00621064  16.6511075             9.979721         49.48086
## contact           46.24569271  10.5671800            48.579724        124.87997
## day               53.80811018   1.7502511            53.313930        504.31762
## month             85.33182425  35.8925025            90.885423        834.05771
## duration          55.29038890 118.1629985            93.994314       1749.56186
## campaign          23.08708937   8.9516549            24.588447        226.58330
## pdays             22.46518072  17.6318867            23.986958        273.53883
## previous          27.88089735  -4.0581843            24.402047        227.58993
## poutcome          41.25323851  10.1026354            49.930701        481.13373
## age_group         29.76158940   5.6457749            29.792332        127.88537
## income_to_balance 27.73385758  11.0331050            28.712361        718.38742
# Visualize feature importance
varImpPlot(rf_model)

The Random Forest model achieved an accuracy of 90.6%, slightly ahead of Logistic Regression at 90.0%. Accuracy alone is flattering here because of the class imbalance (the no-information rate is 88.3%); the more meaningful difference is that Random Forest identifies a larger share of actual subscribers (specificity of 0.467 vs 0.330, with ‘no’ as the positive class) and has a higher Kappa (0.486 vs 0.387). This makes Random Forest a strong candidate for the final model.
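A threshold-independent comparison via ROC curves and AUC would strengthen this conclusion. A minimal sketch, assuming the pROC package is installed (it is not loaded elsewhere in this report):

# ROC/AUC comparison sketch (requires install.packages("pROC"))
# library(pROC)
# logit_probs <- predict(logistic_model, test_data, type = "response")
# rf_probs    <- predict(rf_model, test_data, type = "prob")[, "yes"]
# auc(roc(test_data$y, logit_probs))
# auc(roc(test_data$y, rf_probs))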

5. Additional Visualizations

Visualizations help to understand the relationship between variables and the target variable (y, subscription status). Here are visualizations for relationships between age, duration, balance, and subscription status.

5.1 Relationship Between Age and Subscription

# Visualize the relationship between age and subscription status
ggplot(bank_data, aes(x = age, fill = y)) +
  geom_histogram(binwidth = 5, position = "dodge") +
  ggtitle("Age vs Subscription") +
  theme_minimal() +
  labs(x = "Age", y = "Count")

5.2 Relationship Between Duration and Subscription

# Visualize the relationship between duration and subscription status
# (duration was standardized in Step 7.5, so bins are on the z-score scale)
ggplot(bank_data, aes(x = as.numeric(duration), fill = y)) +
  geom_histogram(binwidth = 0.25, position = "dodge") +
  ggtitle("Duration vs Subscription") +
  theme_minimal() +
  labs(x = "Duration (standardized)", y = "Count")

5.3 Relationship Between Balance and Subscription

# Visualize the relationship between balance and subscription status
# (balance was standardized in Step 7.5, so bins are on the z-score scale)
ggplot(bank_data, aes(x = as.numeric(balance), fill = y)) +
  geom_histogram(binwidth = 0.25, position = "dodge") +
  ggtitle("Balance vs Subscription") +
  theme_minimal() +
  labs(x = "Account Balance (standardized)", y = "Count")

The visualizations suggest that younger customers and those with higher balances are more likely to subscribe to term deposits. Accordingly, the bank may benefit from targeting younger, financially stable customers in its marketing campaigns.

6. Conclusion of PART II

In conclusion, the extended analysis confirmed that duration, age, and balance are key predictors of subscription. The Random Forest model outperformed Logistic Regression, providing actionable insights through feature importance. Based on these findings, we recommend that the bank focus on increasing call durations and targeting younger customers with higher balances. Future work could explore advanced balancing techniques and hyperparameter tuning to further improve model performance.