Assignment1_Data622.knit

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(DataExplorer)

## Warning: package 'DataExplorer' was built under R version 4.4.2

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.4.2

## corrplot 0.95 loaded

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.4.2

## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(caret)

## Warning: package 'caret' was built under R version 4.4.2

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

bank_full_url <- "https://raw.githubusercontent.com/zachrose97/Assignment1Data622/main/bank-full.csv"
bank_url <- "https://raw.githubusercontent.com/zachrose97/Assignment1Data622/main/bank.csv"
bank_additional_full_url <- "https://raw.githubusercontent.com/zachrose97/Assignment1Data622/main/bank-additional-full.csv"
bank_additional_url <- "https://raw.githubusercontent.com/zachrose97/Assignment1Data622/main/bank-additional.csv"

bank_full <- read.csv(bank_full_url, sep = ";")
bank <- read.csv(bank_url, sep = ";")
bank_additional_full <- read.csv(bank_additional_full_url, sep = ";")
bank_additional <- read.csv(bank_additional_url, sep = ";")

str(bank_full)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

summary(bank_full)

##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##       y            
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

colSums(is.na(bank_full))

##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0

sum(duplicated(bank_full))

## [1] 0

numeric_bank <- bank_full %>%
  mutate_if(is.character, as.factor) %>%
  mutate_if(is.factor, as.integer)

cor_matrix <- cor(numeric_bank, use = "complete.obs")

corrplot(cor_matrix, method = "color", tl.cex = 0.7)

# Plot histograms
library(DataExplorer)
plot_histogram(bank_full)

# Bar plot
ggplot(bank_full, aes(x = job)) + 
  geom_bar(fill = "steelblue") + 
  theme_minimal() +
  labs(title = "Distribution of Job Types", x = "Job Type", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate labels

bank_full <- bank_full %>%
  mutate(balance = ifelse(balance > quantile(balance, 0.99), quantile(balance, 0.99), balance),
         duration = ifelse(duration > quantile(duration, 0.99), quantile(duration, 0.99), duration),
         previous = ifelse(previous > quantile(previous, 0.99), quantile(previous, 0.99), previous))

summary(bank_full)

##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance        housing              loan          
##  Length:45211       Min.   :-8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:   72   Class :character   Class :character  
##  Mode  :character   Median :  448   Mode  :character   Mode  :character  
##                     Mean   : 1274                                        
##                     3rd Qu.: 1428                                        
##                     Max.   :13165                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 254.3  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :1269.0  
##     campaign          pdays          previous        poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :0.5247                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :8.9000                     
##       y            
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

bank_full <- bank_full %>%
  mutate_if(is.character, as.factor)

bank_full <- bank_full %>%
  mutate_if(is.numeric, scale)

table(bank_full$y)

## 
##    no   yes 
## 39922  5289

# Convert y to binary (1 = "yes", 0 = "no") for Logistic Regression
bank_full$y <- ifelse(bank_full$y == "yes", 1, 0)

bank_full$y <- as.factor(bank_full$y)

log_model <- glm(y ~ ., data = bank_full, family = binomial)

summary(log_model)

## 
## Call:
## glm(formula = y ~ ., family = binomial, data = bank_full)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.674387   0.143382 -11.678  < 2e-16 ***
## age                -0.003742   0.023711  -0.158 0.874606    
## jobblue-collar     -0.327588   0.073518  -4.456 8.35e-06 ***
## jobentrepreneur    -0.380416   0.127672  -2.980 0.002886 ** 
## jobhousemaid       -0.529584   0.138523  -3.823 0.000132 ***
## jobmanagement      -0.177888   0.074304  -2.394 0.016663 *  
## jobretired          0.245041   0.098598   2.485 0.012946 *  
## jobself-employed   -0.317009   0.113295  -2.798 0.005141 ** 
## jobservices        -0.228423   0.084996  -2.687 0.007200 ** 
## jobstudent          0.391102   0.110487   3.540 0.000400 ***
## jobtechnician      -0.183596   0.069801  -2.630 0.008532 ** 
## jobunemployed      -0.199633   0.112892  -1.768 0.077002 .  
## jobunknown         -0.341358   0.237005  -1.440 0.149783    
## maritalmarried     -0.188918   0.059629  -3.168 0.001534 ** 
## maritalsingle       0.085460   0.068065   1.256 0.209274    
## educationsecondary  0.182252   0.065396   2.787 0.005322 ** 
## educationtertiary   0.383599   0.076173   5.036 4.76e-07 ***
## educationunknown    0.247570   0.105202   2.353 0.018608 *  
## defaultyes         -0.008104   0.165546  -0.049 0.960955    
## balance             0.058937   0.017039   3.459 0.000542 ***
## housingyes         -0.693472   0.044455 -15.600  < 2e-16 ***
## loanyes            -0.425325   0.060625  -7.016 2.29e-12 ***
## contacttelephone   -0.150437   0.076152  -1.975 0.048213 *  
## contactunknown     -1.625826   0.073322 -22.174  < 2e-16 ***
## day                 0.085478   0.021027   4.065 4.80e-05 ***
## monthaug           -0.684442   0.079740  -8.583  < 2e-16 ***
## monthdec            0.711000   0.178868   3.975 7.04e-05 ***
## monthfeb           -0.121506   0.090721  -1.339 0.180461    
## monthjan           -1.264332   0.123008 -10.278  < 2e-16 ***
## monthjul           -0.850344   0.078749 -10.798  < 2e-16 ***
## monthjun            0.461342   0.094776   4.868 1.13e-06 ***
## monthmar            1.645125   0.121295  13.563  < 2e-16 ***
## monthmay           -0.401301   0.073491  -5.461 4.75e-08 ***
## monthnov           -0.889958   0.085750 -10.379  < 2e-16 ***
## monthoct            0.911923   0.109487   8.329  < 2e-16 ***
## monthsep            0.895191   0.121228   7.384 1.53e-13 ***
## duration            1.093074   0.015682  69.704  < 2e-16 ***
## campaign           -0.287709   0.031851  -9.033  < 2e-16 ***
## pdays              -0.003027   0.031166  -0.097 0.922623    
## previous            0.079854   0.022736   3.512 0.000444 ***
## poutcomeother       0.180027   0.091442   1.969 0.048980 *  
## poutcomesuccess     2.318524   0.083467  27.778  < 2e-16 ***
## poutcomeunknown     0.034641   0.105144   0.329 0.741808    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 32631  on 45210  degrees of freedom
## Residual deviance: 21020  on 45168  degrees of freedom
## AIC: 21106
## 
## Number of Fisher Scoring iterations: 6

library(tidyverse)
library(recipes)

## Warning: package 'recipes' was built under R version 4.4.2

## 
## Attaching package: 'recipes'

## The following object is masked from 'package:stringr':
## 
##     fixed

## The following object is masked from 'package:stats':
## 
##     step

library(themis)

## Warning: package 'themis' was built under R version 4.4.2

# Convert categorical variables to numeric encoding, but keep y as a factor
bank_full_encoded <- bank_full %>%
  mutate_if(is.character, as.factor) %>%  # Convert characters to factors
  mutate_if(is.factor, as.integer) %>%    # Convert all factors (except y) to integers
  mutate(y = as.factor(bank_full$y))      # Ensure y remains a factor

# Create a recipe to apply SMOTE with a lower oversampling ratio
smote_recipe <- recipe(y ~ ., data = bank_full_encoded) %>%
  step_smote(y, over_ratio = 0.5)  # Prevent excessive oversampling

# apply SMOTE
bank_balanced <- prep(smote_recipe, training = bank_full_encoded) %>%
  bake(new_data = NULL)

# Check class distribution after balancing
table(bank_balanced$y)

## 
##     0     1 
## 39922 19961

bank_balanced$y <- as.factor(bank_balanced$y)

rf_model_balanced <- randomForest(y ~ ., data = bank_balanced, ntree = 100)

print(rf_model_balanced)

## 
## Call:
##  randomForest(formula = y ~ ., data = bank_balanced, ntree = 100) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 6.87%
## Confusion matrix:
##       0     1 class.error
## 0 37811  2111  0.05287811
## 1  2003 17958  0.10034567

importance(rf_model_balanced)

##           MeanDecreaseGini
## age              1615.2433
## job               790.7036
## marital           774.1067
## education         731.7040
## default            43.7810
## balance          1666.6408
## housing          2297.3724
## loan              556.0717
## contact          1094.8041
## day              1449.7013
## month            1950.0629
## duration         8298.5010
## campaign         2039.6012
## pdays             982.0890
## previous          593.6566
## poutcome         1162.2022

varImpPlot(rf_model_balanced)

log_preds <- predict(log_model, bank_full, type = "response")
log_preds_class <- ifelse(log_preds > 0.5, 1, 0)

rf_balanced_preds <- predict(rf_model_balanced, bank_balanced, type = "class")

log_conf_matrix <- confusionMatrix(as.factor(log_preds_class), as.factor(bank_full$y))
rf_balanced_conf_matrix <- confusionMatrix(as.factor(rf_balanced_preds), as.factor(bank_balanced$y))

print(log_conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 38861  3325
##          1  1061  1964
##                                           
##                Accuracy : 0.903           
##                  95% CI : (0.9002, 0.9057)
##     No Information Rate : 0.883           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4234          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9734          
##             Specificity : 0.3713          
##          Pos Pred Value : 0.9212          
##          Neg Pred Value : 0.6493          
##              Prevalence : 0.8830          
##          Detection Rate : 0.8595          
##    Detection Prevalence : 0.9331          
##       Balanced Accuracy : 0.6724          
##                                           
##        'Positive' Class : 0               
##

print(rf_balanced_conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 39920    22
##          1     2 19939
##                                           
##                Accuracy : 0.9996          
##                  95% CI : (0.9994, 0.9997)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9991          
##                                           
##  Mcnemar's Test P-Value : 0.0001052       
##                                           
##             Sensitivity : 0.9999          
##             Specificity : 0.9989          
##          Pos Pred Value : 0.9994          
##          Neg Pred Value : 0.9999          
##              Prevalence : 0.6667          
##          Detection Rate : 0.6666          
##    Detection Prevalence : 0.6670          
##       Balanced Accuracy : 0.9994          
##                                           
##        'Positive' Class : 0               
##

The Bank Marketing Dataset contains customer interaction data from a telemarketing campaign designed to persuade clients to subscribe to a term deposit. The dataset includes customer demographics, economic factors, and previous marketing interactions. The analysis above applies Exploratory Data Analysis and machine learning to identify key factors influencing a customer’s decision to subscribe. By understanding correlations, data distributions, and feature importance, we can help the bank improve its marketing strategies to increase term deposit subscription rates.

To begin, a correlation matrix was created to examine feature relationships. The analysis revealed that the strongest predictor of subscription was call duration, longer calls tended to increase the likelihood of a subscription. Another key predictor was previous successful marketing outreach, indicating that customers who had subscribed in past outreach campaigns were more likely to do so again. A histogram analysis showed that age is normally distributed, with most clients between 30 and 50 years old, whereas balance, duration, and pdays are highly right skewed, indicating potential outliers. The campaign variable revealed that most individuals were contacted only once or twice. Outliers were detected in balance and previous contacts, where some clients had extremely high savings or were contacted an unusually high number of times. Most clients belonged to the blue collar, management, or technician job categories, but students and unemployed individuals had higher conversion rates. Additionally, clients over 50 years old were more likely to subscribe.

Since this is a binary classification problem, we used Logistic Regression and Random Forest models. Logistic Regression is a simple and interpretable statistical model that estimates the probability of a binary outcome. It is advantageous due to its efficiency and explainability, however, it assumes a linear relationship between features and the target variable. The results from the logistic regression showed that call duration, past campaign success, balance, and the month of contact were significant predictors. Certain factors, such as having a housing loan and being contacted in May or November, were negatively associated with subscription likelihood. Random Forest, an ensemble method that constructs multiple decision trees, was chosen because of its ability to handle non linear relationships and complex feature interactions. Initially, the Random Forest model struggled with class imbalance, as only 11.7% of clients subscribed (y = 1), while 88.3% did not (y = 0). This caused the model to misclassify many actual subscribers (y = 1) as non-subscribers (y = 0), leading to a bias towards predicting ‘No.’. To address class imbalance, Synthetic Minority Oversampling (SMOTE) was applied. SMOTE was applied with an oversampling ratio of 0.5, rebalancing the dataset to 39,922 ‘No’ cases (y = 0) and 19,961 ‘Yes’ cases (y = 1). This helped the model learn more effectively from actual subscribers. After SMOTE balancing, the Random Forest model achieved an accuracy of 99.96%, with high sensitivity (99.86%) and specificity (99.88%), significantly improving its ability to classify both subscribers (y = 1) and non-subscribers (y = 0). This significantly reduced the misclassification rate for subscribers (y = 1), improving the model’s overall reliability. Feature importance analysis from Random Forest revealed that the most influential features for predicting a subscription were call duration, loan status, number of contacts, month of contact, and customer balance.

Both models had strengths and weaknesses. Logistic Regression provided clear interpretability, helping to understand how individual factors influenced term deposit subscriptions. Logistic Regression had a predictive accuracy of 90.3%, with a high sensitivity (97.34%) but a low specificity (37.13%). This means the model correctly identified most subscribers (y = 1), but frequently misclassified non-subscribers (y = 0) as subscribers. Random Forest performed better overall, especially after SMOTE balancing, but is more difficult to interpret than Logistic Regression. In conclusion, the most important factors for term deposit subscription were call duration, past campaign success, and balance. These insights suggest that longer and more engaged customer interactions could lead to higher conversion rates. To further optimize marketing efforts, my recommendation to the bank would be to focus on clients with previous successful engagements and implement call duration strategies. By incorporating these findings, the bank can enhance its telemarketing approach, improve customer engagement, and drive term deposit subscription rates.