Data_622_Assignment_3: Chun Shing Leung

A Portuguese bank conducted a marketing campaign (phone calls) to predict if a client will subscribe to a term deposit The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and figure out most effective tactics that will help the bank in next campaign to persuade more customers to subscribe to the bank’s term deposit.

df<- read.csv("https://raw.githubusercontent.com/tonyCUNY/tonyCUNY/refs/heads/main/Data_622/bank-full.csv", sep = ";", stringsAsFactors = T)

Variable overview

age: Age

job: type of job

marital: marital status

education: Education Level

default: has credit in default?

balance: average yearly balance

housing: has housing loan?

loan: has personal loan?

contact: contact communication type (categorical: ‘cellular’,‘telephone’)

day: last contact day of the week

month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)

duration: last contact duration, in seconds (numeric).

campaign: number of contacts performed during this campaign and for this client

pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted)

previous: number of contacts performed before this campaign and for this client

poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’) y: has the client subscribed a term deposit?

Exploratory Data Analysis

The dataset consists of 45,211 instances and 17 features, with most variables being categorical. Among the 10 categorical variables, “contact”, “loan”, “default”, and “y” exhibit highly imbalanced distributions, with more than 80% of the instances belonging to a single category. In contrast, “education”, “housing”, “poutcome”, and “marital” contain three categories each and appear to be fairly balanced. The “job” and “month” variables have 11 and 12 categories, respectively, and show a relatively even distribution.

The dataset also includes seven numeric variables, namely “age”, “balance”, “campaign”, “day”, “duration”, “pdays”, and “previous”. The distributions of “age”, “balance”, “campaign”, and “duration” are right-skewed, indicating that most values are concentrated toward the lower end, with a few extreme values pulling the distribution to the right. The “day” variable exhibits a uniform distribution, while “pdays” and “previous” are extremely right-skewed, suggesting that most data points are clustered near the lower end, with a few outliers significantly increasing the range.

Outlier analysis reveals that “pdays” and “previous” contain the highest proportion of outliers (18%), followed by “balance” (10.5%), “duration” (7.2%), and “campaign” (6.8%). The “age” variable has only 1.1% outliers, making it relatively stable for analysis. Additionally, correlation analysis indicates that “pdays” and “previous” exhibit a weak positive correlation, while the remaining numeric variables show little to no correlation with each other.

The dataset also contains missing values, with “unknown” values treated as missing data. The most significant missing data is found in “poutcome”, where 81.7% of values are missing, followed by “contact” (28.8% missing values). The “education” and “job” variables contain a small number of missing values, which may have minimal impact on analysis. Importantly, there are no duplicated values in the dataset.

Additionally, a Near Zero Variance (NZV) check identified “default” and “pdays” as NZV variables, meaning that a single category dominates almost all rows. These variables provide little to no useful information for modeling and can negatively impact machine learning models by introducing multicollinearity and unstable coefficients.

Overall, the dataset presents several key characteristics, including imbalanced categorical distributions, right-skewed numeric variables, significant outliers, and missing values. These insights help guide further data preprocessing and model selection to ensure accurate and meaningful analysis.

glimpse(df)

## Rows: 45,211
## Columns: 17
## $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job       <fct> management, technician, entrepreneur, blue-collar, unknown, …
## $ marital   <fct> married, single, married, married, single, married, single, …
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
## $ default   <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
## $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing   <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
## $ loan      <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
## $ contact   <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month     <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
## $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome  <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …

str(df)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

df.clean <- df %>%
  mutate(across(where(is.character), ~ na_if(., "unknown"))) %>%
  mutate(across(where(is.factor), ~ na_if(as.character(.), "unknown"))) 
df.clean <- df.clean %>%
  mutate(across(c(job, marital, education, default, housing, loan, contact, month, poutcome, y), as.factor))

str(df.clean)

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 11 levels "admin.","blue-collar",..: 5 10 3 2 NA 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 3 levels "primary","secondary",..: 3 2 2 NA NA 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 2 levels "cellular","telephone": NA NA NA NA NA NA NA NA NA NA ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 3 levels "failure","other",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Checking correlation using Correlation Matrix

cor_matrix <- cor(df.clean %>% select_if(where(is.numeric)), use = "complete.obs")

cor_long <- melt(cor_matrix)
cor_long <- cor_long[as.numeric(cor_long$Var1) > as.numeric(cor_long$Var2), ]


ggplot(cor_long, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = ifelse(value != 0, round(value, 2), "")), 
            color = "black", size = 3, face="bold") +  
  scale_fill_gradient2(low = "coral", high = "lightblue", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), 
                       space = "Lab", name = "Correlation") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 10),  
    axis.text.y = element_text(size = 10),                                   
    axis.title = element_blank(),                                            
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14)         
  ) +
  ggtitle("Correlation Matrix")

## Warning in geom_text(aes(label = ifelse(value != 0, round(value, 2), "")), :
## Ignoring unknown parameters: `face`

Category and Numeric Variables distribution

# Category variable distribution

category_percentages <- df.clean |>
  summarise(across(where(is.factor), ~ list(prop.table(table(.)) * 100))) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category_Percentage") |>
  drop_na() |>
  unnest(cols = c(Category_Percentage)) |>
  mutate(Category = names(Category_Percentage)) |>  
  arrange(Variable, desc(Category_Percentage)) |>  
  mutate(Variable = factor(Variable, levels = unique(Variable))) 


# Define new variable groups manually
group_1 <- c("contact", "loan", "default", "y")  
group_2 <- c("education", "housing", "poutcome", "marital")  
group_3 <- c("job", "month") 

# Filter data for each chart
category_percentages_1 <- category_percentages |> filter(Variable %in% group_1)
category_percentages_2 <- category_percentages |> filter(Variable %in% group_2)
category_percentages_3 <- category_percentages |> filter(Variable %in% group_3)

# First Plot (Contact, Loan, Default, y)
ggplot(category_percentages_1, aes(x = Category, 
                                   y = Category_Percentage, fill = Variable)) +
  geom_bar(stat = "identity", color = "black") +

  # Percentage on top of the bar
  geom_text(aes(label = paste0(round(Category_Percentage, 1), "%")), 
            vjust = 0.92, size = 3, color = "black", fontface = "bold") +  
  
  facet_wrap(~ Variable, scales = "free_x") +  
  theme_minimal() +
  labs(title = "Category Distribution (Part 1: Contact, Loan, Default, y)", x = "Category", y = "Percentage") +
  scale_fill_brewer(palette = "Set3") +  
  theme(axis.text.x = element_text(size = 8, angle = 45, hjust = 1))

## Don't know how to automatically pick scale for object of type <table>.
## Defaulting to continuous.

# Second Plot (Education, Housing, Poutcome, marital)
ggplot(category_percentages_2, aes(x = Category, 
                                   y = Category_Percentage, fill = Variable)) +
  geom_bar(stat = "identity", color = "black") +

  # Percentage on top of the bar
  geom_text(aes(label = paste0(round(Category_Percentage, 1), "%")), 
            vjust = 1.5, size = 3, color = "black", fontface = "bold") +  

  facet_wrap(~ Variable, scales = "free_x") +  
  theme_minimal() +
  labs(title = "Category Distribution (Part 2: Education, Housing, Poutcome)", x = "Category", y = "Percentage") +
  scale_fill_brewer(palette = "Set3") +  
  theme(axis.text.x = element_text(size = 8, angle = 45, hjust = 1))

## Don't know how to automatically pick scale for object of type <table>.
## Defaulting to continuous.

# Third Plot (Job & Month)
ggplot(category_percentages_3, aes(x = Category, 
                                   y = Category_Percentage, fill = Variable)) +
  geom_bar(stat = "identity", color = "black") +

  # Percentage on top of the bar
  geom_text(aes(label = paste0(round(Category_Percentage, 1), "%")), 
            vjust = -0.5, size = 2.5, color = "black", fontface = "bold") +  

  facet_wrap(~ Variable, scales = "free_x") +  
  theme_minimal() +
  labs(title = "Category Distribution (Part 3: Job & Month)", x = "Category", y = "Percentage") +
  scale_fill_brewer(palette = "Set3") +  
  theme(axis.text.x = element_text(size = 8, angle = 45, hjust = 1))

## Don't know how to automatically pick scale for object of type <table>.
## Defaulting to continuous.

# Numeric Variable Distribution

df.clean |>
  select(where(is.numeric)) |>
    pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
  drop_na() |>
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "coral", color = "black") +
  facet_wrap(~ Variable, scales = "free", ncol = 4) +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  labs(title = "Histograms of Numeric Variables", x = "Value", y = "Frequency")

Checking for Outliers

df.clean |>
  select(where(is.numeric)) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
  drop_na() |>  
  ggplot(aes(x = Variable, y = Value)) +  # Boxplots need categorical x-axis
  geom_boxplot(fill = "lightblue", color = "black") + 
  facet_wrap(~ Variable, scales = "free", ncol = 3) +  # Facet for multiple numeric variables
  theme_minimal() +
  labs(title = "Boxplots of Numeric Variables", x = "Variable", y = "Value") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Finding outliers
outliers <- df.clean |>
  select(where(is.numeric)) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
  drop_na() |> 
  group_by(Variable) |>
  mutate(
    Q1 = quantile(Value, 0.25, na.rm = TRUE),
    Q3 = quantile(Value, 0.75, na.rm = TRUE),
    IQR = Q3 - Q1,
    Lower_Bound = Q1 - 1.5 * IQR,
    Upper_Bound = Q3 + 1.5 * IQR,
    Outlier = Value < Lower_Bound | Value > Upper_Bound
  ) |>
  filter(Outlier) |>
  ungroup()


print(outliers)

## # A tibble: 28,029 × 8
##    Variable Value    Q1    Q3   IQR Lower_Bound Upper_Bound Outlier
##    <chr>    <int> <dbl> <dbl> <dbl>       <dbl>       <dbl> <lgl>  
##  1 balance  10635    72  1428  1356       -1962        3462 TRUE   
##  2 duration  1666   103   319   216        -221         643 TRUE   
##  3 duration  1492   103   319   216        -221         643 TRUE   
##  4 duration   787   103   319   216        -221         643 TRUE   
##  5 duration  1778   103   319   216        -221         643 TRUE   
##  6 duration   812   103   319   216        -221         643 TRUE   
##  7 balance   6530    72  1428  1356       -1962        3462 TRUE   
##  8 balance  12223    72  1428  1356       -1962        3462 TRUE   
##  9 balance   5935    72  1428  1356       -1962        3462 TRUE   
## 10 duration  1042   103   319   216        -221         643 TRUE   
## # ℹ 28,019 more rows

# Visualize outliers

outliers |>
  group_by(Variable) |>
  summarise(
    Outlier_Count = n(),
    Total_Count = nrow(df.clean),  
    Outlier_Percentage = (Outlier_Count / Total_Count) * 100  
  ) |>
  ungroup() |>
  ggplot(aes(x = reorder(Variable, -Outlier_Count), y = Outlier_Count, fill = Outlier_Percentage)) +
  geom_bar(stat = "identity", color = "black") +
  
  # Display outlier count on top of the bar
  geom_text(aes(label = Outlier_Count), vjust = -0.5, size = 3, fontface = "bold") +  
  
  # Display percentage inside the bar (centered)
  geom_text(aes(label = paste0(round(Outlier_Percentage, 1), "%")), 
            vjust = 1.5, size = 3, color = "black", fontface = "bold") +  

  scale_fill_gradient(low = "lightblue", high = "coral", name = "Outlier %") +  
  theme_minimal() +
  labs(
    title = "Outlier Count and Percentage per Numeric Variable",
    x = "Variable",
    y = "Number of Outliers"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

df.clean |>
  select(where(is.numeric)) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
  drop_na() |>
  ggplot(aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "lightblue", color = "black", outlier.color = "coral", outlier.shape = 16) + 
  facet_wrap(~ Variable, scales = "free", ncol = 3) +  
  theme_minimal() +
  labs(title = "Boxplots of Numeric Variables with Outliers", x = "Variable", y = "Value") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Missing Values

missing_data <- df.clean |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") |>
  mutate(Missing_Percentage = (Missing_Count / nrow(df.clean)) * 100) |> 
  arrange(desc(Missing_Count))

print(missing_data)

## # A tibble: 17 × 3
##    Variable  Missing_Count Missing_Percentage
##    <chr>             <int>              <dbl>
##  1 poutcome          36959             81.7  
##  2 contact           13020             28.8  
##  3 education          1857              4.11 
##  4 job                 288              0.637
##  5 age                   0              0    
##  6 marital               0              0    
##  7 default               0              0    
##  8 balance               0              0    
##  9 housing               0              0    
## 10 loan                  0              0    
## 11 day                   0              0    
## 12 month                 0              0    
## 13 duration              0              0    
## 14 campaign              0              0    
## 15 pdays                 0              0    
## 16 previous              0              0    
## 17 y                     0              0

ggplot(missing_data, aes(x = reorder(Variable, -Missing_Count), y = Missing_Count, fill = Missing_Percentage)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = Missing_Count), vjust = -0.5, size = 2.5, fontface = "bold") +  
  geom_text(aes(label = paste0(round(Missing_Percentage, 1), "%")), 
            vjust = 1.5, size = 2.3, color = "black", fontface = "bold") +  
  scale_fill_gradient(low = "lightblue", high = "coral", name = "Missing %") +  
  theme_minimal() +
  labs(title = "Missing Values Count and Percentage per Variable",
       x = "Variable", y = "Number of Missing Values") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Check for Duplicated Values

sum(duplicated(df.clean))

## [1] 0

Check for Near Zero Variance

nzv <- nearZeroVar(df.clean, saveMetrics = TRUE)
print(nzv)

##            freqRatio percentUnique zeroVar   nzv
## age         1.044589   0.170312535   FALSE FALSE
## job         1.028970   0.024330362   FALSE FALSE
## marital     2.127756   0.006635553   FALSE FALSE
## education   1.744380   0.006635553   FALSE FALSE
## default    54.473620   0.004423702   FALSE  TRUE
## balance    18.020513  15.854548672   FALSE FALSE
## housing     1.251432   0.004423702   FALSE FALSE
## loan        5.241165   0.004423702   FALSE FALSE
## contact    10.077426   0.004423702   FALSE FALSE
## day         1.192374   0.068567384   FALSE FALSE
## month       1.996519   0.026542213   FALSE FALSE
## duration    1.021739   3.479241777   FALSE FALSE
## campaign    1.402959   0.106168853   FALSE FALSE
## pdays     221.281437   1.236424764   FALSE  TRUE
## previous   13.331169   0.090685895   FALSE FALSE
## poutcome    2.663587   0.006635553   FALSE FALSE
## y           7.548119   0.004423702   FALSE FALSE

Data Prepration

Removing Redundant Features

The correlation matrix from EDA showed that “pdays” and “previous” have a weak positive correlation. Plus, Near Zero Variance (NZV) variables such as “default” and “pdays” contribute very little information and should be removed to avoid model instability. poutcome consists of large amount of missing value and shall be removed as well.

df.clean.final <- df.clean |> 
                select(-pdays, -default, -poutcome)

Removing row

According to the data description, when duration is equal to 0, response variable is always equal to no and should be discarded if the intention is to have a realistic predictive model. There are 3 rows of data consist of duration value equal to 0

df.clean.final <- df.clean.final |> filter(duration != 0)

Imputation for missing Value

Contact, education and job have missing value. Here, we use Random Forest Imputation to fill in these value.

library(missForest)

## Warning: package 'missForest' was built under R version 4.2.3

df.imputed <- missForest(df.clean.final)
df.clean.final <- df.imputed$ximp

missing_data_check <- df.clean.final |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") |>
  mutate(Missing_Percentage = (Missing_Count / nrow(df.clean)) * 100) |> 
  arrange(desc(Missing_Count))

print(missing_data_check)

## # A tibble: 14 × 3
##    Variable  Missing_Count Missing_Percentage
##    <chr>             <int>              <dbl>
##  1 age                   0                  0
##  2 job                   0                  0
##  3 marital               0                  0
##  4 education             0                  0
##  5 balance               0                  0
##  6 housing               0                  0
##  7 loan                  0                  0
##  8 contact               0                  0
##  9 day                   0                  0
## 10 month                 0                  0
## 11 duration              0                  0
## 12 campaign              0                  0
## 13 previous              0                  0
## 14 y                     0                  0

Feature Engineering

Age can be converted into categorical data to capture different behaviors.

# Create Age_Group based on defined age bins
df.clean.final <- df.clean.final |> 
  mutate(Age_Group = case_when(
    age < 30 ~ "Young",
    age >= 30 & age <= 50 ~ "Middle",
    age > 50 ~ "Senior"
  ))

# Convert Age_Group into a factor for better modeling
df.clean.final$Age_Group <- factor(df.clean.final$Age_Group, levels = c("Young", "Middle", "Senior"))

# Remove 'age' column from the dataset after creating 'Age_Group'

df.clean.final <- df.clean.final |> 
                        select(-age, -day)

# Check the distribution of Age_Group
table(df.clean.final$Age_Group)

## 
##  Young Middle Senior 
##   5273  30681   9254

Handling imbalanced and skewed data

The df.clean.final data is ready for Random Forest model since Random Forest is less sensitive to class imbalance compared to Logistic Regression but may still favor the majority class.Also,Random Forest is not significantly affected by right-skewed numeric variables. Unlike models that assume a linear relationship (such as Logistic Regression, Linear Regression, or KNN), Random Forest is a tree-based model and does not rely on feature distributions being normal.

Assignment 2 - Experimentation & Model Training

library(caret)
library(rpart)
library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(e1071)  # For accuracy and metrics
library(ada)  # For Adaboost

## Warning: package 'ada' was built under R version 4.2.3

library(pROC)  # For AUC-ROC curve

## Warning: package 'pROC' was built under R version 4.2.3

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Split the data

str(df.clean.final)

## 'data.frame':    45208 obs. of  13 variables:
##  $ job      : Factor w/ 11 levels "admin.","blue-collar",..: 5 10 3 2 1 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 3 levels "primary","secondary",..: 3 2 2 1 2 3 3 3 1 2 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 2 levels "cellular","telephone": 1 1 1 1 1 1 1 1 1 1 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Age_Group: Factor w/ 3 levels "Young","Middle",..: 3 2 2 2 2 2 1 2 3 2 ...

set.seed(42)
sample_set <- sample(nrow(df.clean.final), round(nrow(df.clean.final) * .80), replace = FALSE)
df_train <- df.clean.final[sample_set, ]  # 80% of the data for training
df_test <- df.clean.final[-sample_set, ]  # 20% of the data for testing

##Algorithm Selection

Decision Tree:

# Base Model - Train the decision tree model with the default cp (0.05) and maxdepth (15)
set.seed(42)
base_tree_model <- rpart(
  y ~ ., 
  data = df_train,  # Training on the entire dataset
  method = "class",  # Classification tree
  control = rpart.control(cp = 0.05, maxdepth = 15)  # Default cp and maxdepth
)

# Predict on the test set (class predictions)
y_pred_base <- predict(base_tree_model, newdata = df_test, type = "class")

# Confusion Matrix for Base Model
cm_base <- confusionMatrix(as.factor(y_pred_base), as.factor(df_test$y))
print("Base Model Confusion Matrix:")

## [1] "Base Model Confusion Matrix:"

print(cm_base)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7997 1045
##        yes    0    0
##                                           
##                Accuracy : 0.8844          
##                  95% CI : (0.8777, 0.8909)
##     No Information Rate : 0.8844          
##     P-Value [Acc > NIR] : 0.5082          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8844          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8844          
##          Detection Rate : 0.8844          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : no              
##

# AUC-ROC for Base Model
y_pred_base_prob <- predict(base_tree_model, newdata = df_test, type = "prob")[, 2]  # Get the probability for class 'yes'
roc_curve_base <- roc(df_test$y, y_pred_base_prob)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

print(paste("Base Model AUC-ROC:", roc_curve_base$auc))

## [1] "Base Model AUC-ROC: 0.5"

Experiment 1: Decreasing cp

Objective (Hypothesis):

Lowering the complexity parameter (cp) allows the decision tree to grow deeper, potentially capturing more complex relationships in the data. However, a very small cp may lead to overfitting.

Variable Change:

The cp parameter is reduced from 0.05 to 0.01 to allow deeper tree growth with less pruning.

set.seed(43)

# Train the model with cp = 0.01
tree_model_1 <- rpart(
  y ~ ., 
  data = df_train,  # Training on the entire dataset
  method = "class",  # Classification tree
  control = rpart.control(cp = 0.01)  # Set cp value to 0.1
)

# Predict on the test set (class predictions)
y_pred_tree_1 <- predict(tree_model_1, newdata = df_test, type = "class")

# Confusion Matrix for Experiment 1
cm_tree_1 <- confusionMatrix(as.factor(y_pred_tree_1), as.factor(df_test$y))
print("Decision Tree with cp = 0.01 Confusion Matrix:")

## [1] "Decision Tree with cp = 0.01 Confusion Matrix:"

print(cm_tree_1)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7782  703
##        yes  215  342
##                                           
##                Accuracy : 0.8985          
##                  95% CI : (0.8921, 0.9046)
##     No Information Rate : 0.8844          
##     P-Value [Acc > NIR] : 1.149e-05       
##                                           
##                   Kappa : 0.3769          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9731          
##             Specificity : 0.3273          
##          Pos Pred Value : 0.9171          
##          Neg Pred Value : 0.6140          
##              Prevalence : 0.8844          
##          Detection Rate : 0.8607          
##    Detection Prevalence : 0.9384          
##       Balanced Accuracy : 0.6502          
##                                           
##        'Positive' Class : no              
##

# AUC-ROC for Experiment 1
y_pred_tree_1_prob <- predict(tree_model_1, newdata = df_test, type = "prob")[, 2]  # Get the probability for class 'yes'
roc_curve_tree_1 <- roc(df_test$y, y_pred_tree_1_prob)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

print(paste("Decision Tree with cp = 0.01 AUC-ROC:", roc_curve_tree_1$auc))

## [1] "Decision Tree with cp = 0.01 AUC-ROC: 0.745104653479504"

Experiment 2: Increasing maxdepth

Objective (Hypothesis):

The maxdepth parameter controls how deep the tree can grow. A higher value allows the tree to model more complexity, while a smaller value restricts depth and helps prevent overfitting.

Variable Change:

maxdepth is increased from 15 to 30, while cp remains fixed at 0.05.

set.seed(44)

# Train the model with maxdepth = 30
tree_model_2 <- rpart(
  y ~ ., 
  data = df_train,  # Training on the entire dataset
  method = "class",  # Classification tree
  control = rpart.control(cp = 0.05, maxdepth = 30)  # Set maxdepth value to 30
)

# Predict on the test set (class predictions)
y_pred_tree_2 <- predict(tree_model_2, newdata = df_test, type = "class")

# Confusion Matrix for Experiment 2
cm_tree_2 <- confusionMatrix(as.factor(y_pred_tree_2), as.factor(df_test$y))
print("Decision Tree with maxdepth = 30 Confusion Matrix:")

## [1] "Decision Tree with maxdepth = 30 Confusion Matrix:"

print(cm_tree_2)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7997 1045
##        yes    0    0
##                                           
##                Accuracy : 0.8844          
##                  95% CI : (0.8777, 0.8909)
##     No Information Rate : 0.8844          
##     P-Value [Acc > NIR] : 0.5082          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8844          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8844          
##          Detection Rate : 0.8844          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : no              
##

# AUC-ROC for Experiment 2
y_pred_tree_2_prob <- predict(tree_model_2, newdata = df_test, type = "prob")[, 2]  # Get the probability for class 'yes'
roc_curve_tree_2 <- roc(df_test$y, y_pred_tree_2_prob)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

print(paste("Decision Tree with maxdepth = 30 AUC-ROC:", roc_curve_tree_2$auc))

## [1] "Decision Tree with maxdepth = 30 AUC-ROC: 0.5"

Random Forest:

Random Forest is an ensemble method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting. It’s a natural extension of decision tree models.

library(doParallel)

## Warning: package 'doParallel' was built under R version 4.2.3

## Loading required package: foreach

## Warning: package 'foreach' was built under R version 4.2.3

## 
## Attaching package: 'foreach'

## The following objects are masked from 'package:purrr':
## 
##     accumulate, when

## Loading required package: iterators

## Warning: package 'iterators' was built under R version 4.2.3

## Loading required package: parallel

# Set the number of CPU cores to use
num_cores <- detectCores() - 1  # Use all available cores except one

# Register the parallel backend
cl <- makeCluster(num_cores)   # Create a cluster with available cores
registerDoParallel(cl)         # Register the parallel backend

set.seed(45)

# Train Random Forest base model with default ntree and mtry
base_rf_model <- randomForest(
  y ~ ., 
  data = df_train,
  ntree = 100, # set the default value
  mtry = 4, # set the default value (sqrt(p) = sqrt(14) = 3.74 ~ 4)
  importance = TRUE,  # Show feature importance
  nthreads = num_cores
)

# Predict on the test set
y_pred_base_rf <- predict(base_rf_model, newdata = df_test)

# Confusion Matrix for Base Model
cm_base_rf <- confusionMatrix(y_pred_base_rf, df_test$y)
print("Base Model Confusion Matrix:")

## [1] "Base Model Confusion Matrix:"

print(cm_base_rf)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7669  567
##        yes  328  478
##                                           
##                Accuracy : 0.901           
##                  95% CI : (0.8947, 0.9071)
##     No Information Rate : 0.8844          
##     P-Value [Acc > NIR] : 2.556e-07       
##                                           
##                   Kappa : 0.4624          
##                                           
##  Mcnemar's Test P-Value : 1.785e-15       
##                                           
##             Sensitivity : 0.9590          
##             Specificity : 0.4574          
##          Pos Pred Value : 0.9312          
##          Neg Pred Value : 0.5931          
##              Prevalence : 0.8844          
##          Detection Rate : 0.8482          
##    Detection Prevalence : 0.9109          
##       Balanced Accuracy : 0.7082          
##                                           
##        'Positive' Class : no              
##

# AUC-ROC for Base Model
y_pred_base_rf_prob <- predict(base_rf_model, newdata = df_test, type = "prob")[, 2]
roc_curve_base_rf <- roc(df_test$y, y_pred_base_rf_prob)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

print(paste("Base Model AUC-ROC:", roc_curve_base_rf$auc))

## [1] "Base Model AUC-ROC: 0.911444004420318"

# Stop the cluster after training
stopCluster(cl)

Experiment 1: Varying ntree

Objective (Hypothesis):

Increasing the number of trees (ntree) should improve model performance by reducing variance. However, improvements may plateau beyond a certain number of trees.

Variable Change: Increase ntree from 100 (base) to 200, keeping mtry and other parameters at default.

What Stays the Same:

mtry remains at its default value (usually √p, where p = number of features).

Other parameters like nodesize and cp remain unchanged.

# Set the number of CPU cores to use
num_cores <- detectCores() - 1  # Use all available cores except one

# Register the parallel backend
cl <- makeCluster(num_cores)   # Create a cluster with available cores
registerDoParallel(cl)         # Register the parallel backend

set.seed(46)

# Train Random Forest model with ntree = 200 and mtry = 4 for Experiment 1
rf_model_experiment_1 <- randomForest(
  y ~ ., 
  data = df_train, 
  ntree = 200,    # ntree increase  to 200
  mtry = 4,       # mtry remain as 4
  importance = TRUE,  # Show feature importance
  nthreads = num_cores
)

# Predict on the test set (class predictions)
y_pred_rf_experiment_1 <- predict(rf_model_experiment_1, newdata = df_test)

# Confusion Matrix for Experiment 1
cm_rf_experiment_1 <- confusionMatrix(y_pred_rf_experiment_1, df_test$y)
print("Experiment 1 (ntree=200, mtry=6) Confusion Matrix:")

## [1] "Experiment 1 (ntree=200, mtry=6) Confusion Matrix:"

print(cm_rf_experiment_1)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7680  569
##        yes  317  476
##                                           
##                Accuracy : 0.902           
##                  95% CI : (0.8957, 0.9081)
##     No Information Rate : 0.8844          
##     P-Value [Acc > NIR] : 4.853e-08       
##                                           
##                   Kappa : 0.4646          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9604          
##             Specificity : 0.4555          
##          Pos Pred Value : 0.9310          
##          Neg Pred Value : 0.6003          
##              Prevalence : 0.8844          
##          Detection Rate : 0.8494          
##    Detection Prevalence : 0.9123          
##       Balanced Accuracy : 0.7079          
##                                           
##        'Positive' Class : no              
##

# AUC-ROC for Experiment 1
y_pred_rf_prob_experiment_1 <- predict(rf_model_experiment_1, newdata = df_test, type = "prob")[, 2]
roc_curve_rf_experiment_1 <- roc(df_test$y, y_pred_rf_prob_experiment_1)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

print(paste("Experiment 1 AUC-ROC:", roc_curve_rf_experiment_1$auc))

## [1] "Experiment 1 AUC-ROC: 0.91600432698147"

# Stop the cluster after training
stopCluster(cl)

Experiment 2: Varying mtry

Objective (Hypothesis):

The mtry parameter controls how many features are considered at each split. Lower mtry values add randomness and may reduce overfitting, while higher values may lead to more complex trees and possible overfitting.

Variable Change:

Decrease mtry from 4 to 3, keeping ntree fixed at 100.

What Stays the Same:

ntree is fixed at 100.

Other parameters are unchanged.

# Set the number of CPU cores to use
num_cores <- detectCores() - 1  # Use all available cores except one

# Register the parallel backend
cl <- makeCluster(num_cores)   # Create a cluster with available cores
registerDoParallel(cl)         # Register the parallel backend

set.seed(47)

# Train Random Forest model with ntree = 200 and mtry = 6 for Experiment 2
rf_model_experiment_2 <- randomForest(
  y ~ ., 
  data = df_train, 
  ntree = 100,    # ntree remain as 100
  mtry = 3,       # decrease mtry to 3
  importance = TRUE,  # Show feature importance
  nthreads = num_cores
)

# Predict on the test set (class predictions)
y_pred_rf_experiment_2 <- predict(rf_model_experiment_2, newdata = df_test)

# Confusion Matrix for Experiment 2
cm_rf_experiment_2 <- confusionMatrix(y_pred_rf_experiment_2, df_test$y)
print("Experiment 2 (mtry=3) Confusion Matrix:")

## [1] "Experiment 2 (mtry=3) Confusion Matrix:"

print(cm_rf_experiment_2)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7685  589
##        yes  312  456
##                                          
##                Accuracy : 0.9004         
##                  95% CI : (0.894, 0.9065)
##     No Information Rate : 0.8844         
##     P-Value [Acc > NIR] : 7.327e-07      
##                                          
##                   Kappa : 0.4491         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9610         
##             Specificity : 0.4364         
##          Pos Pred Value : 0.9288         
##          Neg Pred Value : 0.5938         
##              Prevalence : 0.8844         
##          Detection Rate : 0.8499         
##    Detection Prevalence : 0.9151         
##       Balanced Accuracy : 0.6987         
##                                          
##        'Positive' Class : no             
##

# AUC-ROC for Experiment 2
y_pred_rf_prob_experiment_2 <- predict(rf_model_experiment_2, newdata = df_test, type = "prob")[, 2]
roc_curve_rf_experiment_2 <- roc(df_test$y, y_pred_rf_prob_experiment_2)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

print(paste("Experiment 2 AUC-ROC:", roc_curve_rf_experiment_2$auc))

## [1] "Experiment 2 AUC-ROC: 0.912468970122169"

# Stop the cluster after training
stopCluster(cl)

Adaboost:

Adaboost (Adaptive Boosting) trains weak learners sequentially, placing more weight on previously misclassified examples. It builds a strong classifier by focusing on the hardest-to-classify data points

# Set seed for reproducibility
set.seed(48)

# 1. Base Model - Adaboost with default parameters
base_adaboost_model <- ada(
  y ~ ., 
  data = df_train, 
  iter = 100,   # n_estimators = 100 (default)
  nu = 0.1,     # learning_rate = 0.1 (default)
  type = "real", # SAMME.R (real version)
)

# Predict on the test set
y_pred_base_adaboost <- predict(base_adaboost_model, newdata = df_test)

# Confusion Matrix for Base Model
cm_base_adaboost <- confusionMatrix(y_pred_base_adaboost, df_test$y)
print("Base Model Confusion Matrix:")

## [1] "Base Model Confusion Matrix:"

print(cm_base_adaboost)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7747  649
##        yes  250  396
##                                           
##                Accuracy : 0.9006          
##                  95% CI : (0.8942, 0.9067)
##     No Information Rate : 0.8844          
##     P-Value [Acc > NIR] : 5.182e-07       
##                                           
##                   Kappa : 0.4169          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9687          
##             Specificity : 0.3789          
##          Pos Pred Value : 0.9227          
##          Neg Pred Value : 0.6130          
##              Prevalence : 0.8844          
##          Detection Rate : 0.8568          
##    Detection Prevalence : 0.9286          
##       Balanced Accuracy : 0.6738          
##                                           
##        'Positive' Class : no              
##

# AUC-ROC for Base Model
y_pred_base_adaboost_prob <- predict(base_adaboost_model, newdata = df_test, type = "prob")[, 2]  # Get the probability for class 'yes'

# Compute ROC curve and AUC
roc_curve_base_adaboost <- roc(df_test$y, y_pred_base_adaboost_prob)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

print(paste("Base Model AUC-ROC:", roc_curve_base_adaboost$auc))

## [1] "Base Model AUC-ROC: 0.9176476465756"

Experiment 1: Varying n_estimators (Number of Trees)

Objective (Hypothesis):

Increasing the number of estimators improves performance by giving the model more chances to correct errors. However, too many trees may result in diminishing returns.

What Will Change:

n_estimators: Test higher values (e.g., 200 instead of 100).

learning_rate remains fixed at 0.1.

What Will Stay the Same:

learning_rate = 0.1

# Set seed for reproducibility
set.seed(49)

# Train Adaboost model with n_estimators = 200 and learning_rate = 0.1
experiment_1_adaboost <- ada(
  y ~ ., 
  data = df_train, 
  iter = 200,   # n_estimators = 200
  nu = 0.1,     # learning_rate = 0.1
  type = "real" # SAMME.R (real version)
)

# Predict on the test set
y_pred_experiment_1_adaboost <- predict(experiment_1_adaboost, newdata = df_test)

# Confusion Matrix for Experiment 1
cm_experiment_1_adaboost <- confusionMatrix(y_pred_experiment_1_adaboost, df_test$y)
print("Experiment 1 (n_estimators = 200) Confusion Matrix:")

## [1] "Experiment 1 (n_estimators = 200) Confusion Matrix:"

print(cm_experiment_1_adaboost)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7739  623
##        yes  258  422
##                                           
##                Accuracy : 0.9026          
##                  95% CI : (0.8963, 0.9086)
##     No Information Rate : 0.8844          
##     P-Value [Acc > NIR] : 1.848e-08       
##                                           
##                   Kappa : 0.4381          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9677          
##             Specificity : 0.4038          
##          Pos Pred Value : 0.9255          
##          Neg Pred Value : 0.6206          
##              Prevalence : 0.8844          
##          Detection Rate : 0.8559          
##    Detection Prevalence : 0.9248          
##       Balanced Accuracy : 0.6858          
##                                           
##        'Positive' Class : no              
##

# AUC-ROC for Experiment 1
y_pred_experiment_1_adaboost_prob <- predict(experiment_1_adaboost, newdata = df_test, type = "prob")[, 2]
roc_curve_experiment_1_adaboost <- roc(df_test$y, y_pred_experiment_1_adaboost_prob)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

print(paste("Experiment 1 AUC-ROC:", roc_curve_experiment_1_adaboost$auc))

## [1] "Experiment 1 AUC-ROC: 0.918409953971974"

Experiment 2: Varying learning_rate

Objective (Hypothesis):

The learning_rate determines how much each new tree contributes to the final model. A smaller rate may generalize better but requires more estimators, while a higher rate may speed up learning but risk overfitting.

What Will Change:

learning_rate: Test a higher value (e.g., 0.5).

n_estimators remains fixed at 100.

What Will Stay the Same:

n_estimators = 100

# Set seed for reproducibility
set.seed(50)

# Train Adaboost model with n_estimators = 100 and learning_rate = 0.5
experiment_2_adaboost <- ada(
  y ~ ., 
  data = df_train, 
  iter = 100,   # n_estimators = 100
  nu = 0.5,     # learning_rate = 0.5
  type = "real" # SAMME.R (real version)
)

# Predict on the test set
y_pred_experiment_2_adaboost <- predict(experiment_2_adaboost, newdata = df_test)

# Confusion Matrix for Experiment 2
cm_experiment_2_adaboost <- confusionMatrix(y_pred_experiment_2_adaboost, df_test$y)
print("Experiment 2 (learning_rate = 0.5) Confusion Matrix:")

## [1] "Experiment 2 (learning_rate = 0.5) Confusion Matrix:"

print(cm_experiment_2_adaboost)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7705  589
##        yes  292  456
##                                           
##                Accuracy : 0.9026          
##                  95% CI : (0.8963, 0.9086)
##     No Information Rate : 0.8844          
##     P-Value [Acc > NIR] : 1.848e-08       
##                                           
##                   Kappa : 0.4562          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9635          
##             Specificity : 0.4364          
##          Pos Pred Value : 0.9290          
##          Neg Pred Value : 0.6096          
##              Prevalence : 0.8844          
##          Detection Rate : 0.8521          
##    Detection Prevalence : 0.9173          
##       Balanced Accuracy : 0.6999          
##                                           
##        'Positive' Class : no              
##

# AUC-ROC for Experiment 2
y_pred_experiment_2_adaboost_prob <- predict(experiment_2_adaboost, newdata = df_test, type = "prob")[, 2]
roc_curve_experiment_2_adaboost <- roc(df_test$y, y_pred_experiment_2_adaboost_prob)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

print(paste("Experiment 2 AUC-ROC:", roc_curve_experiment_2_adaboost$auc))

## [1] "Experiment 2 AUC-ROC: 0.918613020552564"

Assignment 3 Support Vector Machines

Train a Base SVM with radial kernel

# Train a Base SVM with default radial kernel
svm_model <- svm(y ~ ., data = df_train, kernel = "radial", probability = TRUE)

# Predict class
svm_pred_class <- predict(svm_model, newdata = df_test)

# Predict probabilities (for ROC/AUC)
svm_pred_prob <- attr(predict(svm_model, newdata = df_test, probability = TRUE), "probabilities")[, "yes"]

# Confusion matrix
svm_base_cm <- confusionMatrix(svm_pred_class, as.factor(df_test$y), positive = "yes")

# AUC-ROC
svm_base_roc <- roc(df_test$y, svm_pred_prob)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

svm_base_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7910  907
##        yes   87  138
##                                           
##                Accuracy : 0.8901          
##                  95% CI : (0.8834, 0.8964)
##     No Information Rate : 0.8844          
##     P-Value [Acc > NIR] : 0.04758         
##                                           
##                   Kappa : 0.1839          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.13206         
##             Specificity : 0.98912         
##          Pos Pred Value : 0.61333         
##          Neg Pred Value : 0.89713         
##              Prevalence : 0.11557         
##          Detection Rate : 0.01526         
##    Detection Prevalence : 0.02488         
##       Balanced Accuracy : 0.56059         
##                                           
##        'Positive' Class : yes             
##

auc(svm_base_roc)

## Area under the curve: 0.8794

Train a SVM with linear kernel

# Train linear SVM
svm_linear <- svm(y ~ ., data = df_train, kernel = "linear", probability = TRUE)

# Predict class and probabilities
svm_linear_pred_class <- predict(svm_linear, newdata = df_test)
svm_linear_pred_prob <- attr(predict(svm_linear, newdata = df_test, probability = TRUE), "probabilities")[, "yes"]

# Evaluate
svm_linear_cm <- confusionMatrix(svm_linear_pred_class, as.factor(df_test$y), positive = "yes")
svm_linear_roc <- roc(df_test$y, svm_linear_pred_prob)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

svm_linear_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7955  987
##        yes   42   58
##                                           
##                Accuracy : 0.8862          
##                  95% CI : (0.8795, 0.8927)
##     No Information Rate : 0.8844          
##     P-Value [Acc > NIR] : 0.3062          
##                                           
##                   Kappa : 0.0828          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.055502        
##             Specificity : 0.994748        
##          Pos Pred Value : 0.580000        
##          Neg Pred Value : 0.889622        
##              Prevalence : 0.115572        
##          Detection Rate : 0.006415        
##    Detection Prevalence : 0.011060        
##       Balanced Accuracy : 0.525125        
##                                           
##        'Positive' Class : yes             
##

auc(svm_linear_roc)

## Area under the curve: 0.8943

Train a SVM with polynomial kernel

# Train polynomial SVM (default degree = 3)
svm_poly <- svm(y ~ ., data = df_train, kernel = "polynomial", probability = TRUE)

# Predict class and probabilities
svm_poly_pred_class <- predict(svm_poly, newdata = df_test)
svm_poly_pred_prob <- attr(predict(svm_poly, newdata = df_test, probability = TRUE), "probabilities")[, "yes"]

# Evaluate
svm_poly_cm <- confusionMatrix(svm_poly_pred_class, as.factor(df_test$y), positive = "yes")
svm_poly_roc <- roc(df_test$y, svm_poly_pred_prob)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

svm_poly_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7966 1004
##        yes   31   41
##                                          
##                Accuracy : 0.8855         
##                  95% CI : (0.8788, 0.892)
##     No Information Rate : 0.8844         
##     P-Value [Acc > NIR] : 0.3788         
##                                          
##                   Kappa : 0.0594         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.039234       
##             Specificity : 0.996124       
##          Pos Pred Value : 0.569444       
##          Neg Pred Value : 0.888071       
##              Prevalence : 0.115572       
##          Detection Rate : 0.004534       
##    Detection Prevalence : 0.007963       
##       Balanced Accuracy : 0.517679       
##                                          
##        'Positive' Class : yes            
##

auc(svm_poly_roc)

## Area under the curve: 0.8629

Tune with Hyperparameter grid on Radial Kernel

cost - Controls the trade-off between misclassification and margin size

gamma - Defines how much influence a single training point has.

Performs grid search + cross-validation to find the best combination of hyperparameters

set.seed(51)

# Use a random subset just for tuning, Once we find the best cost and gamma, retrain the model on the full training set
tune_subset <- df_train[sample(nrow(df_train), size = 2000), ]

# Tune over a grid of cost and gamma values
tune_result <- tune(
  svm,
  y ~ .,
  data = tune_subset,
  kernel = "radial",
  ranges = list(
    cost = c(0.1, 1, 10),
    gamma = c(0.01, 0.1)
  ),
  probability = TRUE
)

tune_result$best.parameters

##   cost gamma
## 5    1   0.1

svm_tuned <- svm(
  y ~ ., 
  data = df_train, 
  kernel = "radial", 
  cost = tune_result$best.parameters$cost, 
  gamma = tune_result$best.parameters$gamma, 
  probability = TRUE
)

# Predict
svm_tuned_pred_class <- predict(svm_tuned, newdata = df_test)
svm_tuned_pred_prob <- attr(predict(svm_tuned, newdata = df_test, probability = TRUE), "probabilities")[, "yes"]

# Confusion matrix
svm_tuned_cm <- confusionMatrix(svm_tuned_pred_class, as.factor(df_test$y), positive = "yes")

# AUC-ROC

svm_tuned_roc <- roc(df_test$y, svm_tuned_pred_prob)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

svm_tuned_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7831  786
##        yes  166  259
##                                          
##                Accuracy : 0.8947         
##                  95% CI : (0.8882, 0.901)
##     No Information Rate : 0.8844         
##     P-Value [Acc > NIR] : 0.001039       
##                                          
##                   Kappa : 0.306          
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.24785        
##             Specificity : 0.97924        
##          Pos Pred Value : 0.60941        
##          Neg Pred Value : 0.90878        
##              Prevalence : 0.11557        
##          Detection Rate : 0.02864        
##    Detection Prevalence : 0.04700        
##       Balanced Accuracy : 0.61354        
##                                          
##        'Positive' Class : yes            
##

auc(svm_tuned_roc)

## Area under the curve: 0.875

Result

# Load required package
library(knitr)

# Create the table with Model/Experiment names, Accuracy, and AUC-ROC scores
results_df <- data.frame(
  "Model and Experiment" = c(
    "Decision Tree - Base Model (cp=0.05, maxdepth=15)", 
    "Decision Tree - Experiment 1 (cp=0.01)", 
    "Decision Tree - Experiment 2 (maxdepth=30)",
    "Random Forest - Base Model (ntree=100, mtry=4)", 
    "Random Forest - Experiment 1 (ntree=200)", 
    "Random Forest - Experiment 2 (mtry=3)",
    "Adaboost - Base Model (n_estimators=100, learning_rate=0.1)", 
    "Adaboost - Experiment 1 (n_estimators=200)", 
    "Adaboost - Experiment 2 (learning_rate=0.5)",
    "SVM - Base Model (radial, default parameters)",
    "SVM - Tuned Model (radial, tuned on subset, retrained on full set)",
    "SVM - Linear Kernel",
    "SVM - Polynomial Kernel"
  ),
  "Accuracy" = c(
    cm_base$overall['Accuracy'], 
    cm_tree_1$overall['Accuracy'], 
    cm_tree_2$overall['Accuracy'],
    cm_base_rf$overall['Accuracy'], 
    cm_rf_experiment_1$overall['Accuracy'], 
    cm_rf_experiment_2$overall['Accuracy'],
    cm_base_adaboost$overall['Accuracy'], 
    cm_experiment_1_adaboost$overall['Accuracy'], 
    cm_experiment_2_adaboost$overall['Accuracy'],
    svm_base_cm$overall['Accuracy'],
    svm_tuned_cm$overall['Accuracy'],
    svm_linear_cm$overall['Accuracy'],
    svm_poly_cm$overall['Accuracy']
  ),
  "AUC-ROC" = c(
    roc_curve_base$auc, 
    roc_curve_tree_1$auc, 
    roc_curve_tree_2$auc,
    roc_curve_base_rf$auc, 
    roc_curve_rf_experiment_1$auc, 
    roc_curve_rf_experiment_2$auc,
    roc_curve_base_adaboost$auc, 
    roc_curve_experiment_1_adaboost$auc, 
    roc_curve_experiment_2_adaboost$auc,
    svm_base_roc$auc,
    svm_tuned_roc$auc,
    svm_linear_roc$auc,
    svm_poly_roc$auc
  )
)

# Print a clean table using kable
kable(results_df, caption = "Summary of Model Experiments with Accuracy and AUC-ROC")

Summary of Model Experiments with Accuracy and AUC-ROC
Model.and.Experiment	Accuracy	AUC.ROC
Decision Tree - Base Model (cp=0.05, maxdepth=15)	0.8844282	0.5000000
Decision Tree - Experiment 1 (cp=0.01)	0.8984738	0.7451047
Decision Tree - Experiment 2 (maxdepth=30)	0.8844282	0.5000000
Random Forest - Base Model (ntree=100, mtry=4)	0.9010175	0.9114440
Random Forest - Experiment 1 (ntree=200)	0.9020128	0.9160043
Random Forest - Experiment 2 (mtry=3)	0.9003539	0.9124690
Adaboost - Base Model (n_estimators=100, learning_rate=0.1)	0.9005751	0.9176476
Adaboost - Experiment 1 (n_estimators=200)	0.9025658	0.9184100
Adaboost - Experiment 2 (learning_rate=0.5)	0.9025658	0.9186130
SVM - Base Model (radial, default parameters)	0.8900686	0.8793619
SVM - Tuned Model (radial, tuned on subset, retrained on full set)	0.8947136	0.8749523
SVM - Linear Kernel	0.8861977	0.8942953
SVM - Polynomial Kernel	0.8855342	0.8628738

# Set seed for reproducibility
set.seed(1234)

# Set up the plotting area
par(pty = "s")

# Plot the ROC curve for the Decision Tree - Base Model
roc_curve_base <- roc(df_test$y, y_pred_base_prob)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

plot(roc_curve_base,
     col = "#0ABAB5",
     main = "ROC Curve Comparison: Decision Tree & SVM",
     percent = TRUE,
     xlab = "False Positive Percentage",
     ylab = "True Positive Percentage",
     print.auc = TRUE,
     legacy.axes = TRUE)

# Add ROC for Decision Tree - Experiment 1 (cp = 0.01)
roc_curve_tree_1 <- roc(df_test$y, y_pred_tree_1_prob)

## Setting levels: control = no, case = yes
## Setting direction: controls < cases

plot(roc_curve_tree_1, col = "#FA8072", add = TRUE, print.auc = TRUE, print.auc.y = 40)

# Add ROC for Decision Tree - Experiment 2 (maxdepth = 30)
roc_curve_tree_2 <- roc(df_test$y, y_pred_tree_2_prob)

## Setting levels: control = no, case = yes
## Setting direction: controls < cases

plot(roc_curve_tree_2, col = "#FFD700", add = TRUE, print.auc = TRUE, print.auc.y = 60)

# Add ROC for SVM - Base Model (radial)
plot(svm_base_roc, col = "#1E90FF", add = TRUE, print.auc = TRUE, print.auc.y = 20)

# Add ROC for SVM - Tuned Model (radial, tuned on subset)
plot(svm_tuned_roc, col = "#32CD32", add = TRUE, print.auc = TRUE, print.auc.y = 10)

# Add ROC for SVM - Linear Kernel
plot(svm_linear_roc, col = "#FF69B4", add = TRUE, print.auc = TRUE, print.auc.y = 70)

# Add ROC for SVM - Polynomial Kernel
plot(svm_poly_roc, col = "#8A2BE2", add = TRUE, print.auc = TRUE, print.auc.y = 80)

# Add legend
legend("bottomright",
       legend = c(
         "DT Base (cp = 0.05, maxdepth = 15)",
         "DT Exp 1 (cp = 0.01)",
         "DT Exp 2 (maxdepth = 30)",
         "SVM Base (radial)",
         "SVM Tuned (radial, subset tuning)",
         "SVM Linear Kernel",
         "SVM Polynomial Kernel"
       ),
       col = c("#0ABAB5", "#FA8072", "#FFD700", "#1E90FF", "#32CD32", "#FF69B4", "#8A2BE2"),
       lwd = 2,
       cex = 0.85)

# Set seed for reproducibility
set.seed(1235)

# Set up the plotting area
par(pty="s")

# Plot the Base Model ROC curve (Random Forest Base Model)
roc_curve_base_rf <- roc(df_test$y, y_pred_base_rf_prob)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

plot(roc_curve_base_rf, col="#0ABAB5", main="Random Forest ROC Curve Comparison", percent=TRUE,
     xlab="False Positive Percentage", ylab="True Positive Percentage", print.auc=TRUE)

# Plot the ROC curve for Experiment 1 (Varying ntree)
roc_curve_rf_experiment_1 <- roc(df_test$y, y_pred_rf_prob_experiment_1)

## Setting levels: control = no, case = yes
## Setting direction: controls < cases

plot(roc_curve_rf_experiment_1, col="#FA8072", print.auc=TRUE, add=TRUE, print.auc.y=40)

# Plot the ROC curve for Experiment 2 (Varying mtry)
roc_curve_rf_experiment_2 <- roc(df_test$y, y_pred_rf_prob_experiment_2)

## Setting levels: control = no, case = yes
## Setting direction: controls < cases

plot(roc_curve_rf_experiment_2, col="#FFD700", print.auc=TRUE, add=TRUE, print.auc.y=60)

# Add legend to the plot
legend("bottomright", legend=c("Base Model (Random Forest)", "Experiment 1 (Varying ntree)", "Experiment 2 (Varying mtry)"),
       col=c("#0ABAB5", "#FA8072", "#FFD700"), lwd=2)

# Set seed for reproducibility
set.seed(1236)

# Set up the plotting area
par(pty="s")

# Plot the Base Model ROC curve (Adaboost Base Model)
roc_curve_base_adaboost <- roc(df_test$y, y_pred_base_adaboost_prob)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

plot(roc_curve_base_adaboost, col="#0ABAB5", main="Adaboost ROC Curve Comparison", percent=TRUE,
     xlab="False Positive Percentage", ylab="True Positive Percentage", print.auc=TRUE)

# Plot the ROC curve for Experiment 1 (Varying n_estimators)
roc_curve_experiment_1_adaboost <- roc(df_test$y, y_pred_experiment_1_adaboost_prob)

## Setting levels: control = no, case = yes
## Setting direction: controls < cases

plot(roc_curve_experiment_1_adaboost, col="#FA8072", print.auc=TRUE, add=TRUE, print.auc.y=40)

# Plot the ROC curve for Experiment 2 (Varying learning_rate)
roc_curve_experiment_2_adaboost <- roc(df_test$y, y_pred_experiment_2_adaboost_prob)

## Setting levels: control = no, case = yes
## Setting direction: controls < cases

plot(roc_curve_experiment_2_adaboost, col="#FFD700", print.auc=TRUE, add=TRUE, print.auc.y=60)

# Add legend to the plot
legend("bottomright", legend=c("Base Model (Adaboost)", "Experiment 1 (Varying n_estimators)", "Experiment 2 (Varying learning_rate)"),
       col=c("#0ABAB5", "#FA8072", "#FFD700"), lwd=2)

Data_622_Assignment_3: Chun Shing Leung

2025-02-27

Variable overview

Exploratory Data Analysis

Checking correlation using Correlation Matrix

Category and Numeric Variables distribution

Checking for Outliers

Missing Values

Check for Duplicated Values

Check for Near Zero Variance

Data Prepration

Removing Redundant Features

Removing row

Imputation for missing Value

Feature Engineering

Handling imbalanced and skewed data

Assignment 2 - Experimentation & Model Training

Split the data

Decision Tree:

Random Forest:

Adaboost:

Assignment 3 Support Vector Machines

Train a Base SVM with radial kernel

Train a SVM with linear kernel

Train a SVM with polynomial kernel

Tune with Hyperparameter grid on Radial Kernel

Result