A Portuguese bank conducted a direct marketing campaign (phone calls) to determine whether clients would subscribe to a term deposit. The records of these efforts are available as a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics to help the bank persuade more customers to subscribe to a term deposit in its next campaign.
# Packages used below (tidyverse for wrangling/plots, reshape2 for melt(), caret for nearZeroVar())
library(tidyverse)
library(reshape2)
library(caret)
df <- read.csv("https://raw.githubusercontent.com/tonyCUNY/tonyCUNY/refs/heads/main/Data_622/bank-full.csv", sep = ";", stringsAsFactors = T)
age: Age
job: type of job
marital: marital status
education: Education Level
default: has credit in default?
balance: average yearly balance
housing: has housing loan?
loan: has personal loan?
contact: contact communication type (categorical: ‘cellular’,‘telephone’)
day: last contact day of the month
month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
duration: last contact duration, in seconds (numeric).
campaign: number of contacts performed during this campaign and for this client
pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client
poutcome: outcome of the previous marketing campaign (categorical: ‘failure’, ‘other’, ‘success’, ‘unknown’)
y: has the client subscribed to a term deposit?
The dataset consists of 45,211 instances and 17 features, with most variables being categorical. Among the 10 categorical variables, “contact”, “loan”, “default”, and “y” exhibit highly imbalanced distributions, with more than 80% of instances belonging to a single category. In contrast, “education”, “poutcome”, and “marital” each contain three main categories and appear fairly balanced, while “housing” is a roughly balanced binary variable. The “job” and “month” variables have 11 and 12 categories, respectively, and show a relatively even distribution.
The dataset also includes seven numeric variables, namely “age”, “balance”, “campaign”, “day”, “duration”, “pdays”, and “previous”. The distributions of “age”, “balance”, “campaign”, and “duration” are right-skewed, indicating that most values are concentrated toward the lower end, with a few extreme values pulling the distribution to the right. The “day” variable exhibits a uniform distribution, while “pdays” and “previous” are extremely right-skewed, suggesting that most data points are clustered near the lower end, with a few outliers significantly increasing the range.
Outlier analysis reveals that “pdays” and “previous” contain the highest proportion of outliers (18%), followed by “balance” (10.5%), “duration” (7.2%), and “campaign” (6.8%). The “age” variable has only 1.1% outliers, making it relatively stable for analysis. Additionally, correlation analysis indicates that “pdays” and “previous” exhibit a weak positive correlation, while the remaining numeric variables show little to no correlation with each other.
The dataset also contains missing values, with “unknown” values treated as missing data. The most significant missing data is found in “poutcome”, where 81.7% of values are missing, followed by “contact” (28.8% missing values). The “education” and “job” variables contain a small number of missing values, which may have minimal impact on analysis. Importantly, there are no duplicated values in the dataset.
Additionally, a Near Zero Variance (NZV) check identified “default” and “pdays” as NZV variables, meaning that a single value dominates almost all rows. Such variables provide little to no useful information for modeling and can destabilize machine learning models, for example by producing unstable coefficients in regression-based methods.
Overall, the dataset presents several key characteristics, including imbalanced categorical distributions, right-skewed numeric variables, significant outliers, and missing values. These insights help guide further data preprocessing and model selection to ensure accurate and meaningful analysis.
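The skewness claims above can also be quantified directly. Below is a short sketch (assuming the e1071 package, which is also used later for model metrics); sample skewness well above zero indicates a right-skewed variable.
library(e1071)
df |>
  select(where(is.numeric)) |>
  summarise(across(everything(), ~ skewness(.x, na.rm = TRUE)))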
glimpse(df)
## Rows: 45,211
## Columns: 17
## $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <fct> management, technician, entrepreneur, blue-collar, unknown, …
## $ marital <fct> married, single, married, married, single, married, single, …
## $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
## $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
## $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
## $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
## $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
## $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
str(df)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# Recode "unknown" entries as missing values; factors are converted to character first so na_if() applies cleanly
df.clean <- df %>%
  mutate(across(where(is.factor), ~ na_if(as.character(.), "unknown")))
df.clean <- df.clean %>%
mutate(across(c(job, marital, education, default, housing, loan, contact, month, poutcome, y), as.factor))
str(df.clean)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 11 levels "admin.","blue-collar",..: 5 10 3 2 NA 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 3 levels "primary","secondary",..: 3 2 2 NA NA 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": NA NA NA NA NA NA NA NA NA NA ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","other",..: NA NA NA NA NA NA NA NA NA NA ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
cor_matrix <- cor(df.clean %>% select(where(is.numeric)), use = "complete.obs")
cor_long <- melt(cor_matrix)
cor_long <- cor_long[as.numeric(cor_long$Var1) > as.numeric(cor_long$Var2), ]
ggplot(cor_long, aes(Var2, Var1, fill = value)) +
geom_tile(color = "white") +
geom_text(aes(label = ifelse(value != 0, round(value, 2), "")),
color = "black", size = 3, fontface = "bold") +
scale_fill_gradient2(low = "coral", high = "lightblue", mid = "white",
midpoint = 0, limit = c(-1, 1),
space = "Lab", name = "Correlation") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 10),
axis.text.y = element_text(size = 10),
axis.title = element_blank(),
plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
) +
ggtitle("Correlation Matrix")
# Category variable distribution
category_percentages <- df.clean |>
summarise(across(where(is.factor), ~ list(prop.table(table(.)) * 100))) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category_Percentage") |>
drop_na() |>
unnest(cols = c(Category_Percentage)) |>
mutate(Category = names(Category_Percentage),
Category_Percentage = as.numeric(Category_Percentage)) |> # convert from table to plain numeric for ggplot
arrange(Variable, desc(Category_Percentage)) |>
mutate(Variable = factor(Variable, levels = unique(Variable)))
# Define new variable groups manually
group_1 <- c("contact", "loan", "default", "y")
group_2 <- c("education", "housing", "poutcome", "marital")
group_3 <- c("job", "month")
# Filter data for each chart
category_percentages_1 <- category_percentages |> filter(Variable %in% group_1)
category_percentages_2 <- category_percentages |> filter(Variable %in% group_2)
category_percentages_3 <- category_percentages |> filter(Variable %in% group_3)
# First Plot (Contact, Loan, Default, y)
ggplot(category_percentages_1, aes(x = Category,
y = Category_Percentage, fill = Variable)) +
geom_bar(stat = "identity", color = "black") +
# Percentage on top of the bar
geom_text(aes(label = paste0(round(Category_Percentage, 1), "%")),
vjust = 0.92, size = 3, color = "black", fontface = "bold") +
facet_wrap(~ Variable, scales = "free_x") +
theme_minimal() +
labs(title = "Category Distribution (Part 1: Contact, Loan, Default, y)", x = "Category", y = "Percentage") +
scale_fill_brewer(palette = "Set3") +
theme(axis.text.x = element_text(size = 8, angle = 45, hjust = 1))
# Second Plot (Education, Housing, Poutcome, marital)
ggplot(category_percentages_2, aes(x = Category,
y = Category_Percentage, fill = Variable)) +
geom_bar(stat = "identity", color = "black") +
# Percentage on top of the bar
geom_text(aes(label = paste0(round(Category_Percentage, 1), "%")),
vjust = 1.5, size = 3, color = "black", fontface = "bold") +
facet_wrap(~ Variable, scales = "free_x") +
theme_minimal() +
labs(title = "Category Distribution (Part 2: Education, Housing, Poutcome)", x = "Category", y = "Percentage") +
scale_fill_brewer(palette = "Set3") +
theme(axis.text.x = element_text(size = 8, angle = 45, hjust = 1))
# Third Plot (Job & Month)
ggplot(category_percentages_3, aes(x = Category,
y = Category_Percentage, fill = Variable)) +
geom_bar(stat = "identity", color = "black") +
# Percentage on top of the bar
geom_text(aes(label = paste0(round(Category_Percentage, 1), "%")),
vjust = -0.5, size = 2.5, color = "black", fontface = "bold") +
facet_wrap(~ Variable, scales = "free_x") +
theme_minimal() +
labs(title = "Category Distribution (Part 3: Job & Month)", x = "Category", y = "Percentage") +
scale_fill_brewer(palette = "Set3") +
theme(axis.text.x = element_text(size = 8, angle = 45, hjust = 1))
# Numeric Variable Distribution
df.clean |>
select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
drop_na() |>
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "coral", color = "black") +
facet_wrap(~ Variable, scales = "free", ncol = 4) +
theme(
panel.grid = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
labs(title = "Histograms of Numeric Variables", x = "Value", y = "Frequency")
df.clean |>
select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
drop_na() |>
ggplot(aes(x = Variable, y = Value)) + # Boxplots need categorical x-axis
geom_boxplot(fill = "lightblue", color = "black") +
facet_wrap(~ Variable, scales = "free", ncol = 3) + # Facet for multiple numeric variables
theme_minimal() +
labs(title = "Boxplots of Numeric Variables", x = "Variable", y = "Value") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Finding outliers
outliers <- df.clean |>
select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
drop_na() |>
group_by(Variable) |>
mutate(
Q1 = quantile(Value, 0.25, na.rm = TRUE),
Q3 = quantile(Value, 0.75, na.rm = TRUE),
IQR = Q3 - Q1,
Lower_Bound = Q1 - 1.5 * IQR,
Upper_Bound = Q3 + 1.5 * IQR,
Outlier = Value < Lower_Bound | Value > Upper_Bound
) |>
filter(Outlier) |>
ungroup()
print(outliers)
## # A tibble: 28,029 × 8
## Variable Value Q1 Q3 IQR Lower_Bound Upper_Bound Outlier
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 balance 10635 72 1428 1356 -1962 3462 TRUE
## 2 duration 1666 103 319 216 -221 643 TRUE
## 3 duration 1492 103 319 216 -221 643 TRUE
## 4 duration 787 103 319 216 -221 643 TRUE
## 5 duration 1778 103 319 216 -221 643 TRUE
## 6 duration 812 103 319 216 -221 643 TRUE
## 7 balance 6530 72 1428 1356 -1962 3462 TRUE
## 8 balance 12223 72 1428 1356 -1962 3462 TRUE
## 9 balance 5935 72 1428 1356 -1962 3462 TRUE
## 10 duration 1042 103 319 216 -221 643 TRUE
## # ℹ 28,019 more rows
# Visualize outliers
outliers |>
group_by(Variable) |>
summarise(
Outlier_Count = n(),
Total_Count = nrow(df.clean),
Outlier_Percentage = (Outlier_Count / Total_Count) * 100
) |>
ungroup() |>
ggplot(aes(x = reorder(Variable, -Outlier_Count), y = Outlier_Count, fill = Outlier_Percentage)) +
geom_bar(stat = "identity", color = "black") +
# Display outlier count on top of the bar
geom_text(aes(label = Outlier_Count), vjust = -0.5, size = 3, fontface = "bold") +
# Display percentage inside the bar (centered)
geom_text(aes(label = paste0(round(Outlier_Percentage, 1), "%")),
vjust = 1.5, size = 3, color = "black", fontface = "bold") +
scale_fill_gradient(low = "lightblue", high = "coral", name = "Outlier %") +
theme_minimal() +
labs(
title = "Outlier Count and Percentage per Numeric Variable",
x = "Variable",
y = "Number of Outliers"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
df.clean |>
select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") |>
drop_na() |>
ggplot(aes(x = Variable, y = Value)) +
geom_boxplot(fill = "lightblue", color = "black", outlier.color = "coral", outlier.shape = 16) +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(title = "Boxplots of Numeric Variables with Outliers", x = "Variable", y = "Value") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
missing_data <- df.clean |>
summarise(across(everything(), ~ sum(is.na(.)))) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") |>
mutate(Missing_Percentage = (Missing_Count / nrow(df.clean)) * 100) |>
arrange(desc(Missing_Count))
print(missing_data)
## # A tibble: 17 × 3
## Variable Missing_Count Missing_Percentage
## <chr> <int> <dbl>
## 1 poutcome 36959 81.7
## 2 contact 13020 28.8
## 3 education 1857 4.11
## 4 job 288 0.637
## 5 age 0 0
## 6 marital 0 0
## 7 default 0 0
## 8 balance 0 0
## 9 housing 0 0
## 10 loan 0 0
## 11 day 0 0
## 12 month 0 0
## 13 duration 0 0
## 14 campaign 0 0
## 15 pdays 0 0
## 16 previous 0 0
## 17 y 0 0
ggplot(missing_data, aes(x = reorder(Variable, -Missing_Count), y = Missing_Count, fill = Missing_Percentage)) +
geom_bar(stat = "identity", color = "black") +
geom_text(aes(label = Missing_Count), vjust = -0.5, size = 2.5, fontface = "bold") +
geom_text(aes(label = paste0(round(Missing_Percentage, 1), "%")),
vjust = 1.5, size = 2.3, color = "black", fontface = "bold") +
scale_fill_gradient(low = "lightblue", high = "coral", name = "Missing %") +
theme_minimal() +
labs(title = "Missing Values Count and Percentage per Variable",
x = "Variable", y = "Number of Missing Values") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
sum(duplicated(df.clean))
## [1] 0
nzv <- nearZeroVar(df.clean, saveMetrics = TRUE)
print(nzv)
## freqRatio percentUnique zeroVar nzv
## age 1.044589 0.170312535 FALSE FALSE
## job 1.028970 0.024330362 FALSE FALSE
## marital 2.127756 0.006635553 FALSE FALSE
## education 1.744380 0.006635553 FALSE FALSE
## default 54.473620 0.004423702 FALSE TRUE
## balance 18.020513 15.854548672 FALSE FALSE
## housing 1.251432 0.004423702 FALSE FALSE
## loan 5.241165 0.004423702 FALSE FALSE
## contact 10.077426 0.004423702 FALSE FALSE
## day 1.192374 0.068567384 FALSE FALSE
## month 1.996519 0.026542213 FALSE FALSE
## duration 1.021739 3.479241777 FALSE FALSE
## campaign 1.402959 0.106168853 FALSE FALSE
## pdays 221.281437 1.236424764 FALSE TRUE
## previous 13.331169 0.090685895 FALSE FALSE
## poutcome 2.663587 0.006635553 FALSE FALSE
## y 7.548119 0.004423702 FALSE FALSE
The correlation matrix from the EDA showed that “pdays” and “previous” have a weak positive correlation. In addition, Near Zero Variance (NZV) variables such as “default” and “pdays” contribute very little information and should be removed to avoid model instability. “poutcome” contains a large proportion of missing values (81.7%) and is removed as well.
df.clean.final <- df.clean |>
select(-pdays, -default, -poutcome)
According to the data description, duration is only known after a call has ended, and when duration equals 0 the response is always “no”; the description therefore advises caution with this feature when the goal is a realistic predictive model. Three rows have a duration of 0 and are removed here.
df.clean.final <- df.clean.final |> filter(duration != 0)
“contact”, “education”, and “job” have missing values. Here, we use Random Forest imputation (missForest) to fill in these values.
library(missForest)
## Warning: package 'missForest' was built under R version 4.2.3
# missForest imputation is stochastic; calling set.seed() beforehand would make the imputed values reproducible
df.imputed <- missForest(df.clean.final)
df.clean.final <- df.imputed$ximp
missing_data_check <- df.clean.final |>
summarise(across(everything(), ~ sum(is.na(.)))) |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") |>
mutate(Missing_Percentage = (Missing_Count / nrow(df.clean.final)) * 100) |>
arrange(desc(Missing_Count))
print(missing_data_check)
## # A tibble: 14 × 3
## Variable Missing_Count Missing_Percentage
## <chr> <int> <dbl>
## 1 age 0 0
## 2 job 0 0
## 3 marital 0 0
## 4 education 0 0
## 5 balance 0 0
## 6 housing 0 0
## 7 loan 0 0
## 8 contact 0 0
## 9 day 0 0
## 10 month 0 0
## 11 duration 0 0
## 12 campaign 0 0
## 13 previous 0 0
## 14 y 0 0
Age can be binned into categories to capture behavioral differences across age groups.
# Create Age_Group based on defined age bins
df.clean.final <- df.clean.final |>
mutate(Age_Group = case_when(
age < 30 ~ "Young",
age >= 30 & age <= 50 ~ "Middle",
age > 50 ~ "Senior"
))
# Convert Age_Group into a factor for better modeling
df.clean.final$Age_Group <- factor(df.clean.final$Age_Group, levels = c("Young", "Middle", "Senior"))
# Remove 'age' (now captured by Age_Group) and 'day' (roughly uniform, hence uninformative)
df.clean.final <- df.clean.final |>
select(-age, -day)
# Check the distribution of Age_Group
table(df.clean.final$Age_Group)
##
## Young Middle Senior
## 5273 30681 9254
The df.clean.final data is ready for tree-based models. Random Forest is less sensitive to class imbalance than Logistic Regression, although it may still favor the majority class. Random Forest is also not significantly affected by right-skewed numeric variables: unlike models that assume a linear relationship (such as Logistic Regression, Linear Regression, or KNN), it is a tree-based model and does not rely on features being normally distributed.
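As a quick check on the class imbalance mentioned above, the target distribution can be tabulated directly:
round(prop.table(table(df.clean.final$y)) * 100, 1) # percentage of 'no' vs 'yes'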
library(caret)
library(rpart)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(e1071) # For accuracy and metrics
library(ada) # For Adaboost
## Warning: package 'ada' was built under R version 4.2.3
library(pROC) # For AUC-ROC curve
## Warning: package 'pROC' was built under R version 4.2.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
str(df.clean.final)
## 'data.frame': 45208 obs. of 13 variables:
## $ job : Factor w/ 11 levels "admin.","blue-collar",..: 5 10 3 2 8 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 3 levels "primary","secondary",..: 3 2 2 1 2 3 3 3 1 2 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 1 1 1 1 1 1 1 1 1 1 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Age_Group: Factor w/ 3 levels "Young","Middle",..: 3 2 2 2 2 2 1 2 3 2 ...
set.seed(42)
sample_set <- sample(nrow(df.clean.final), round(nrow(df.clean.final) * .80), replace = FALSE)
df_train <- df.clean.final[sample_set, ] # 80% of the data for training
df_test <- df.clean.final[-sample_set, ] # 20% of the data for testing
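Note that this simple random split does not explicitly preserve the class ratio of y. If a stratified split is preferred, caret's createDataPartition is a drop-in alternative (a sketch; it samples within each class, so df_train_strat and df_test_strat here are illustrative names):
set.seed(42)
idx <- createDataPartition(df.clean.final$y, p = 0.80, list = FALSE)
df_train_strat <- df.clean.final[idx, ]
df_test_strat <- df.clean.final[-idx, ]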
Algorithm Selection
# Base Model - Train the decision tree with baseline cp = 0.05 and maxdepth = 15
# (note: rpart's actual defaults are cp = 0.01 and maxdepth = 30)
set.seed(42)
base_tree_model <- rpart(
y ~ .,
data = df_train, # Fit on the training set
method = "class", # Classification tree
control = rpart.control(cp = 0.05, maxdepth = 15) # Baseline cp and maxdepth
)
# Predict on the test set (class predictions)
y_pred_base <- predict(base_tree_model, newdata = df_test, type = "class")
# Confusion Matrix for Base Model
cm_base <- confusionMatrix(as.factor(y_pred_base), as.factor(df_test$y))
print("Base Model Confusion Matrix:")
## [1] "Base Model Confusion Matrix:"
print(cm_base)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7997 1045
## yes 0 0
##
## Accuracy : 0.8844
## 95% CI : (0.8777, 0.8909)
## No Information Rate : 0.8844
## P-Value [Acc > NIR] : 0.5082
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.8844
## Neg Pred Value : NaN
## Prevalence : 0.8844
## Detection Rate : 0.8844
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : no
##
# AUC-ROC for Base Model
y_pred_base_prob <- predict(base_tree_model, newdata = df_test, type = "prob")[, 2] # Get the probability for class 'yes'
roc_curve_base <- roc(df_test$y, y_pred_base_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(paste("Base Model AUC-ROC:", roc_curve_base$auc))
## [1] "Base Model AUC-ROC: 0.5"
Experiment 1: Decreasing cp
Objective (Hypothesis):
Lowering the complexity parameter (cp) allows the decision tree to grow deeper, potentially capturing more complex relationships in the data. However, a very small cp may lead to overfitting.
Variable Change:
The cp parameter is reduced from 0.05 to 0.01 to allow deeper tree growth with less pruning.
set.seed(43)
# Train the model with cp = 0.01
tree_model_1 <- rpart(
y ~ .,
data = df_train, # Fit on the training set
method = "class", # Classification tree
control = rpart.control(cp = 0.01) # Lower cp to 0.01
)
# Predict on the test set (class predictions)
y_pred_tree_1 <- predict(tree_model_1, newdata = df_test, type = "class")
# Confusion Matrix for Experiment 1
cm_tree_1 <- confusionMatrix(as.factor(y_pred_tree_1), as.factor(df_test$y))
print("Decision Tree with cp = 0.01 Confusion Matrix:")
## [1] "Decision Tree with cp = 0.01 Confusion Matrix:"
print(cm_tree_1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7782 703
## yes 215 342
##
## Accuracy : 0.8985
## 95% CI : (0.8921, 0.9046)
## No Information Rate : 0.8844
## P-Value [Acc > NIR] : 1.149e-05
##
## Kappa : 0.3769
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9731
## Specificity : 0.3273
## Pos Pred Value : 0.9171
## Neg Pred Value : 0.6140
## Prevalence : 0.8844
## Detection Rate : 0.8607
## Detection Prevalence : 0.9384
## Balanced Accuracy : 0.6502
##
## 'Positive' Class : no
##
# AUC-ROC for Experiment 1
y_pred_tree_1_prob <- predict(tree_model_1, newdata = df_test, type = "prob")[, 2] # Get the probability for class 'yes'
roc_curve_tree_1 <- roc(df_test$y, y_pred_tree_1_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(paste("Decision Tree with cp = 0.01 AUC-ROC:", roc_curve_tree_1$auc))
## [1] "Decision Tree with cp = 0.01 AUC-ROC: 0.745104653479504"
Experiment 2: Increasing maxdepth
Objective (Hypothesis):
The maxdepth parameter controls how deep the tree can grow. A higher value allows the tree to model more complexity, while a smaller value restricts depth and helps prevent overfitting.
Variable Change:
maxdepth is increased from 15 to 30, while cp remains fixed at 0.05.
set.seed(44)
# Train the model with maxdepth = 30
tree_model_2 <- rpart(
y ~ .,
data = df_train, # Fit on the training set
method = "class", # Classification tree
control = rpart.control(cp = 0.05, maxdepth = 30) # Raise maxdepth to 30, cp stays at 0.05
)
# Predict on the test set (class predictions)
y_pred_tree_2 <- predict(tree_model_2, newdata = df_test, type = "class")
# Confusion Matrix for Experiment 2
cm_tree_2 <- confusionMatrix(as.factor(y_pred_tree_2), as.factor(df_test$y))
print("Decision Tree with maxdepth = 30 Confusion Matrix:")
## [1] "Decision Tree with maxdepth = 30 Confusion Matrix:"
print(cm_tree_2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7997 1045
## yes 0 0
##
## Accuracy : 0.8844
## 95% CI : (0.8777, 0.8909)
## No Information Rate : 0.8844
## P-Value [Acc > NIR] : 0.5082
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.8844
## Neg Pred Value : NaN
## Prevalence : 0.8844
## Detection Rate : 0.8844
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : no
##
# AUC-ROC for Experiment 2
y_pred_tree_2_prob <- predict(tree_model_2, newdata = df_test, type = "prob")[, 2] # Get the probability for class 'yes'
roc_curve_tree_2 <- roc(df_test$y, y_pred_tree_2_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(paste("Decision Tree with maxdepth = 30 AUC-ROC:", roc_curve_tree_2$auc))
## [1] "Decision Tree with maxdepth = 30 AUC-ROC: 0.5"
Random Forest is an ensemble method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting. It’s a natural extension of decision tree models.
library(doParallel)
## Warning: package 'doParallel' was built under R version 4.2.3
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 4.2.3
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loading required package: iterators
## Warning: package 'iterators' was built under R version 4.2.3
## Loading required package: parallel
# Set the number of CPU cores to use
num_cores <- detectCores() - 1 # Use all available cores except one
# Register the parallel backend
cl <- makeCluster(num_cores) # Create a cluster with available cores
registerDoParallel(cl) # Register the parallel backend
set.seed(45)
# Train Random Forest base model with ntree = 100 and mtry = 4
base_rf_model <- randomForest(
y ~ .,
data = df_train,
ntree = 100, # baseline number of trees
mtry = 4, # roughly sqrt(p) for the 12 predictors (sqrt(12) ≈ 3.5, rounded to 4)
importance = TRUE # compute feature importance
)
# Note: randomForest() has no nthreads argument; training runs single-threaded
# unless the forest is built in chunks with foreach (see the sketch after this section)
# Predict on the test set
y_pred_base_rf <- predict(base_rf_model, newdata = df_test)
# Confusion Matrix for Base Model
cm_base_rf <- confusionMatrix(y_pred_base_rf, df_test$y)
print("Base Model Confusion Matrix:")
## [1] "Base Model Confusion Matrix:"
print(cm_base_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7676 560
## yes 321 485
##
## Accuracy : 0.9026
## 95% CI : (0.8963, 0.9086)
## No Information Rate : 0.8844
## P-Value [Acc > NIR] : 1.848e-08
##
## Kappa : 0.4708
##
## Mcnemar's Test P-Value : 1.071e-15
##
## Sensitivity : 0.9599
## Specificity : 0.4641
## Pos Pred Value : 0.9320
## Neg Pred Value : 0.6017
## Prevalence : 0.8844
## Detection Rate : 0.8489
## Detection Prevalence : 0.9109
## Balanced Accuracy : 0.7120
##
## 'Positive' Class : no
##
# AUC-ROC for Base Model
y_pred_base_rf_prob <- predict(base_rf_model, newdata = df_test, type = "prob")[, 2]
roc_curve_base_rf <- roc(df_test$y, y_pred_base_rf_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(paste("Base Model AUC-ROC:", roc_curve_base_rf$auc))
## [1] "Base Model AUC-ROC: 0.913111555589327"
# Stop the cluster after training
stopCluster(cl)
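Because randomForest() trains single-threaded, the doParallel cluster registered above is only exercised through foreach. Below is a minimal sketch of a parallelized forest, to be run while a cluster is still registered (i.e., before the stopCluster call); note that randomForest::combine() merges the trees but does not recompute OOB error or variable importance across sub-forests.
# Grow ~100 trees split across the workers, then merge into one forest
rf_parallel <- foreach(n = rep(ceiling(100 / num_cores), num_cores),
.combine = randomForest::combine,
.packages = "randomForest") %dopar% {
randomForest(y ~ ., data = df_train, ntree = n, mtry = 4)
}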
Experiment 1: Varying ntree
Objective (Hypothesis):
Increasing the number of trees (ntree) should improve model performance by reducing variance. However, improvements may plateau beyond a certain number of trees.
Variable Change: Increase ntree from 100 (base) to 200, keeping all other parameters unchanged.
What Stays the Same:
mtry remains at 4 (approximately √p, where p = number of features).
Other parameters such as nodesize remain at their defaults.
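The plateau hypothesis can be checked directly: a fitted randomForest object stores the OOB error after each added tree in err.rate, and its default plot method graphs those curves (a quick sketch using the base model from above):
plot(base_rf_model) # OOB and per-class error vs. number of trees; curves flatten as trees accumulate
head(base_rf_model$err.rate) # error matrix, one row per added tree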
# Set the number of CPU cores to use
num_cores <- detectCores() - 1 # Use all available cores except one
# Register the parallel backend
cl <- makeCluster(num_cores) # Create a cluster with available cores
registerDoParallel(cl) # Register the parallel backend
set.seed(46)
# Train Random Forest model with ntree = 200 and mtry = 4 for Experiment 1
rf_model_experiment_1 <- randomForest(
y ~ .,
data = df_train,
ntree = 200, # ntree increased to 200
mtry = 4, # mtry remains 4
importance = TRUE # compute feature importance
)
# Predict on the test set (class predictions)
y_pred_rf_experiment_1 <- predict(rf_model_experiment_1, newdata = df_test)
# Confusion Matrix for Experiment 1
cm_rf_experiment_1 <- confusionMatrix(y_pred_rf_experiment_1, df_test$y)
print("Experiment 1 (ntree=200, mtry=6) Confusion Matrix:")
## [1] "Experiment 1 (ntree=200, mtry=6) Confusion Matrix:"
print(cm_rf_experiment_1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7679 562
## yes 318 483
##
## Accuracy : 0.9027
## 95% CI : (0.8964, 0.9087)
## No Information Rate : 0.8844
## P-Value [Acc > NIR] : 1.518e-08
##
## Kappa : 0.4702
##
## Mcnemar's Test P-Value : 2.579e-16
##
## Sensitivity : 0.9602
## Specificity : 0.4622
## Pos Pred Value : 0.9318
## Neg Pred Value : 0.6030
## Prevalence : 0.8844
## Detection Rate : 0.8493
## Detection Prevalence : 0.9114
## Balanced Accuracy : 0.7112
##
## 'Positive' Class : no
##
# AUC-ROC for Experiment 1
y_pred_rf_prob_experiment_1 <- predict(rf_model_experiment_1, newdata = df_test, type = "prob")[, 2]
roc_curve_rf_experiment_1 <- roc(df_test$y, y_pred_rf_prob_experiment_1)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(paste("Experiment 1 AUC-ROC:", roc_curve_rf_experiment_1$auc))
## [1] "Experiment 1 AUC-ROC: 0.91576990893116"
# Stop the cluster after training
stopCluster(cl)
Experiment 2: Varying mtry
Objective (Hypothesis):
The mtry parameter controls how many features are considered at each split. Lower mtry values add randomness and may reduce overfitting, while higher values may lead to more complex trees and possible overfitting.
Variable Change:
Decrease mtry from 4 to 3, keeping ntree fixed at 100.
What Stays the Same:
ntree is fixed at 100.
Other parameters are unchanged.
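As an alternative to comparing mtry values by hand, the randomForest package ships a helper, tuneRF, that searches over mtry using OOB error (a sketch; it starts from the package default and multiplies mtry by stepFactor at each step):
set.seed(47)
tuneRF(x = df_train[, setdiff(names(df_train), "y")],
y = df_train$y,
ntreeTry = 100, stepFactor = 1.5, improve = 0.01)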
# Set the number of CPU cores to use
num_cores <- detectCores() - 1 # Use all available cores except one
# Register the parallel backend
cl <- makeCluster(num_cores) # Create a cluster with available cores
registerDoParallel(cl) # Register the parallel backend
set.seed(47)
# Train Random Forest model with ntree = 100 and mtry = 3 for Experiment 2
rf_model_experiment_2 <- randomForest(
y ~ .,
data = df_train,
ntree = 100, # ntree remains 100
mtry = 3, # mtry decreased to 3
importance = TRUE # compute feature importance
)
# Predict on the test set (class predictions)
y_pred_rf_experiment_2 <- predict(rf_model_experiment_2, newdata = df_test)
# Confusion Matrix for Experiment 2
cm_rf_experiment_2 <- confusionMatrix(y_pred_rf_experiment_2, df_test$y)
print("Experiment 2 (mtry=3) Confusion Matrix:")
## [1] "Experiment 2 (mtry=3) Confusion Matrix:"
print(cm_rf_experiment_2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7698 598
## yes 299 447
##
## Accuracy : 0.9008
## 95% CI : (0.8944, 0.9069)
## No Information Rate : 0.8844
## P-Value [Acc > NIR] : 3.648e-07
##
## Kappa : 0.4458
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9626
## Specificity : 0.4278
## Pos Pred Value : 0.9279
## Neg Pred Value : 0.5992
## Prevalence : 0.8844
## Detection Rate : 0.8514
## Detection Prevalence : 0.9175
## Balanced Accuracy : 0.6952
##
## 'Positive' Class : no
##
# AUC-ROC for Experiment 2
y_pred_rf_prob_experiment_2 <- predict(rf_model_experiment_2, newdata = df_test, type = "prob")[, 2]
roc_curve_rf_experiment_2 <- roc(df_test$y, y_pred_rf_prob_experiment_2)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(paste("Experiment 2 AUC-ROC:", roc_curve_rf_experiment_2$auc))
## [1] "Experiment 2 AUC-ROC: 0.915906862202513"
# Stop the cluster after training
stopCluster(cl)
Adaboost (Adaptive Boosting) trains weak learners sequentially, placing more weight on previously misclassified examples. It builds a strong classifier by focusing on the hardest-to-classify data points.
# Set seed for reproducibility
set.seed(48)
# 1. Base Model - Adaboost with baseline parameters
base_adaboost_model <- ada(
y ~ .,
data = df_train,
iter = 100, # number of boosting iterations (n_estimators = 100)
nu = 0.1, # shrinkage / learning rate = 0.1
type = "real" # Real AdaBoost (analogous to SAMME.R)
)
# Predict on the test set
y_pred_base_adaboost <- predict(base_adaboost_model, newdata = df_test)
# Confusion Matrix for Base Model
cm_base_adaboost <- confusionMatrix(y_pred_base_adaboost, df_test$y)
print("Base Model Confusion Matrix:")
## [1] "Base Model Confusion Matrix:"
print(cm_base_adaboost)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7751 649
## yes 246 396
##
## Accuracy : 0.901
## 95% CI : (0.8947, 0.9071)
## No Information Rate : 0.8844
## P-Value [Acc > NIR] : 2.556e-07
##
## Kappa : 0.4183
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9692
## Specificity : 0.3789
## Pos Pred Value : 0.9227
## Neg Pred Value : 0.6168
## Prevalence : 0.8844
## Detection Rate : 0.8572
## Detection Prevalence : 0.9290
## Balanced Accuracy : 0.6741
##
## 'Positive' Class : no
##
# AUC-ROC for Base Model
y_pred_base_adaboost_prob <- predict(base_adaboost_model, newdata = df_test, type = "prob")[, 2] # Get the probability for class 'yes'
# Compute ROC curve and AUC
roc_curve_base_adaboost <- roc(df_test$y, y_pred_base_adaboost_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(paste("Base Model AUC-ROC:", roc_curve_base_adaboost$auc))
## [1] "Base Model AUC-ROC: 0.917324080262156"
Experiment 1: Varying n_estimators (Number of Trees)
Objective (Hypothesis):
Increasing the number of estimators improves performance by giving the model more chances to correct errors. However, too many trees may result in diminishing returns.
What Will Change:
n_estimators: Test higher values (e.g., 200 instead of 100).
learning_rate remains fixed at 0.1.
What Will Stay the Same:
learning_rate = 0.1
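The diminishing-returns hypothesis can be eyeballed with ada's plot method, which draws the training error as iterations accumulate (a sketch using the base model; passing test.x/test.y to ada() would add a test-error curve):
plot(base_adaboost_model) # training error by boosting iteration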
# Set seed for reproducibility
set.seed(49)
# Train Adaboost model with n_estimators = 200 and learning_rate = 0.1
experiment_1_adaboost <- ada(
y ~ .,
data = df_train,
iter = 200, # n_estimators = 200
nu = 0.1, # learning_rate = 0.1
type = "real" # SAMME.R (real version)
)
# Predict on the test set
y_pred_experiment_1_adaboost <- predict(experiment_1_adaboost, newdata = df_test)
# Confusion Matrix for Experiment 1
cm_experiment_1_adaboost <- confusionMatrix(y_pred_experiment_1_adaboost, df_test$y)
print("Experiment 1 (n_estimators = 200) Confusion Matrix:")
## [1] "Experiment 1 (n_estimators = 200) Confusion Matrix:"
print(cm_experiment_1_adaboost)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7730 611
## yes 267 434
##
## Accuracy : 0.9029
## 95% CI : (0.8966, 0.9089)
## No Information Rate : 0.8844
## P-Value [Acc > NIR] : 1.02e-08
##
## Kappa : 0.4457
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9666
## Specificity : 0.4153
## Pos Pred Value : 0.9267
## Neg Pred Value : 0.6191
## Prevalence : 0.8844
## Detection Rate : 0.8549
## Detection Prevalence : 0.9225
## Balanced Accuracy : 0.6910
##
## 'Positive' Class : no
##
# AUC-ROC for Experiment 1
y_pred_experiment_1_adaboost_prob <- predict(experiment_1_adaboost, newdata = df_test, type = "prob")[, 2]
roc_curve_experiment_1_adaboost <- roc(df_test$y, y_pred_experiment_1_adaboost_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(paste("Experiment 1 AUC-ROC:", roc_curve_experiment_1_adaboost$auc))
## [1] "Experiment 1 AUC-ROC: 0.918296274978715"
Experiment 2: Varying learning_rate
Objective (Hypothesis):
The learning_rate determines how much each new tree contributes to the final model. A smaller rate may generalize better but requires more estimators, while a higher rate may speed up learning but risk overfitting.
What Will Change:
learning_rate: Test a higher value (e.g., 0.5).
n_estimators remains fixed at 100.
What Will Stay the Same:
n_estimators = 100
# Set seed for reproducibility
set.seed(50)
# Train Adaboost model with n_estimators = 100 and learning_rate = 0.5
experiment_2_adaboost <- ada(
y ~ .,
data = df_train,
iter = 100, # n_estimators = 100
nu = 0.5, # learning_rate = 0.5
type = "real" # SAMME.R (real version)
)
# Predict on the test set
y_pred_experiment_2_adaboost <- predict(experiment_2_adaboost, newdata = df_test)
# Confusion Matrix for Experiment 2
cm_experiment_2_adaboost <- confusionMatrix(y_pred_experiment_2_adaboost, df_test$y)
print("Experiment 2 (learning_rate = 0.5) Confusion Matrix:")
## [1] "Experiment 2 (learning_rate = 0.5) Confusion Matrix:"
print(cm_experiment_2_adaboost)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7709 593
## yes 288 452
##
## Accuracy : 0.9026
## 95% CI : (0.8963, 0.9086)
## No Information Rate : 0.8844
## P-Value [Acc > NIR] : 1.848e-08
##
## Kappa : 0.4541
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9640
## Specificity : 0.4325
## Pos Pred Value : 0.9286
## Neg Pred Value : 0.6108
## Prevalence : 0.8844
## Detection Rate : 0.8526
## Detection Prevalence : 0.9182
## Balanced Accuracy : 0.6983
##
## 'Positive' Class : no
##
# AUC-ROC for Experiment 2
y_pred_experiment_2_adaboost_prob <- predict(experiment_2_adaboost, newdata = df_test, type = "prob")[, 2]
roc_curve_experiment_2_adaboost <- roc(df_test$y, y_pred_experiment_2_adaboost_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(paste("Experiment 2 AUC-ROC:", roc_curve_experiment_2_adaboost$auc))
## [1] "Experiment 2 AUC-ROC: 0.91608791095704"
# Load required package
library(knitr)
# Create the table with Model/Experiment names, Accuracy, and AUC-ROC scores
results_df <- data.frame(
"Model and Experiment" = c(
"Decision Tree - Base Model (cp=0.05, maxdepth=15)",
"Decision Tree - Experiment 1 (cp=0.01)",
"Decision Tree - Experiment 2 (maxdepth=30)",
"Random Forest - Base Model (ntree=100, mtry=4)",
"Random Forest - Experiment 1 (ntree=200)",
"Random Forest - Experiment 2 (mtry=3)",
"Adaboost - Base Model (n_estimators=100, learning_rate=0.1)",
"Adaboost - Experiment 1 (n_estimators=200)",
"Adaboost - Experiment 2 (learning_rate=0.5)"
),
"Accuracy" = c(
cm_base$overall['Accuracy'],
cm_tree_1$overall['Accuracy'],
cm_tree_2$overall['Accuracy'],
cm_base_rf$overall['Accuracy'],
cm_rf_experiment_1$overall['Accuracy'],
cm_rf_experiment_2$overall['Accuracy'],
cm_base_adaboost$overall['Accuracy'],
cm_experiment_1_adaboost$overall['Accuracy'],
cm_experiment_2_adaboost$overall['Accuracy']
),
"AUC-ROC" = c(
roc_curve_base$auc,
roc_curve_tree_1$auc,
roc_curve_tree_2$auc,
roc_curve_base_rf$auc,
roc_curve_rf_experiment_1$auc,
roc_curve_rf_experiment_2$auc,
roc_curve_base_adaboost$auc,
roc_curve_experiment_1_adaboost$auc,
roc_curve_experiment_2_adaboost$auc
)
)
# Print a clean table using kable
kable(results_df, caption = "Summary of Model Experiments with Accuracy and AUC-ROC")
| Model.and.Experiment | Accuracy | AUC.ROC |
|---|---|---|
| Decision Tree - Base Model (cp=0.05, maxdepth=15) | 0.8844282 | 0.5000000 |
| Decision Tree - Experiment 1 (cp=0.01) | 0.8984738 | 0.7451047 |
| Decision Tree - Experiment 2 (maxdepth=30) | 0.8844282 | 0.5000000 |
| Random Forest - Base Model (ntree=100, mtry=4) | 0.9025658 | 0.9131116 |
| Random Forest - Experiment 1 (ntree=200) | 0.9026764 | 0.9157699 |
| Random Forest - Experiment 2 (mtry=3) | 0.9007963 | 0.9159069 |
| Adaboost - Base Model (n_estimators=100, learning_rate=0.1) | 0.9010175 | 0.9173241 |
| Adaboost - Experiment 1 (n_estimators=200) | 0.9028976 | 0.9182963 |
| Adaboost - Experiment 2 (learning_rate=0.5) | 0.9025658 | 0.9160879 |
Across all nine runs, accuracy stays close to the 0.8844 no-information rate, so AUC-ROC is the more informative metric: the ensemble methods clearly outperform the single decision tree, with Adaboost Experiment 1 (n_estimators = 200) achieving the best AUC of about 0.918.
# Set seed for reproducibility
set.seed(1234)
# Set up the plotting area
par(pty="s")
# Plot the Base Model ROC curve
roc_curve_base <- roc(df_test$y, y_pred_base_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(roc_curve_base, col="#0ABAB5", main="Decision Tree ROC Curve Comparison", percent=TRUE,
xlab="False Positive Percentage", ylab="True Positive Percentage", print.auc=TRUE)
# Plot the ROC curve for Experiment 1 (Varying cp)
roc_curve_tree_1 <- roc(df_test$y, y_pred_tree_1_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(roc_curve_tree_1, col="#FA8072", print.auc=TRUE, add=TRUE, print.auc.y=40)
# Plot the ROC curve for Experiment 2 (Varying maxdepth)
roc_curve_tree_2 <- roc(df_test$y, y_pred_tree_2_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(roc_curve_tree_2, col="#FFD700", print.auc=TRUE, add=TRUE, print.auc.y=60)
# Add legend to the plot
legend("bottomright", legend=c("Base Model (Decision Tree)", "Experiment 1 (Varying cp)", "Experiment 2 (Varying maxdepth)"),
col=c("#0ABAB5", "#FA8072", "#FFD700"), lwd=2)
# Set seed for reproducibility
set.seed(1235)
# Set up the plotting area
par(pty="s")
# Plot the Base Model ROC curve (Random Forest Base Model)
roc_curve_base_rf <- roc(df_test$y, y_pred_base_rf_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(roc_curve_base_rf, col="#0ABAB5", main="Random Forest ROC Curve Comparison", percent=TRUE,
xlab="False Positive Percentage", ylab="True Positive Percentage", print.auc=TRUE)
# Plot the ROC curve for Experiment 1 (Varying ntree)
roc_curve_rf_experiment_1 <- roc(df_test$y, y_pred_rf_prob_experiment_1)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(roc_curve_rf_experiment_1, col="#FA8072", print.auc=TRUE, add=TRUE, print.auc.y=40)
# Plot the ROC curve for Experiment 2 (Varying mtry)
roc_curve_rf_experiment_2 <- roc(df_test$y, y_pred_rf_prob_experiment_2)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(roc_curve_rf_experiment_2, col="#FFD700", print.auc=TRUE, add=TRUE, print.auc.y=60)
# Add legend to the plot
legend("bottomright", legend=c("Base Model (Random Forest)", "Experiment 1 (Varying ntree)", "Experiment 2 (Varying mtry)"),
col=c("#0ABAB5", "#FA8072", "#FFD700"), lwd=2)
# Set seed for reproducibility
set.seed(1236)
# Set up the plotting area
par(pty="s")
# Plot the Base Model ROC curve (Adaboost Base Model)
roc_curve_base_adaboost <- roc(df_test$y, y_pred_base_adaboost_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(roc_curve_base_adaboost, col="#0ABAB5", main="Adaboost ROC Curve Comparison", percent=TRUE,
xlab="False Positive Percentage", ylab="True Positive Percentage", print.auc=TRUE)
# Plot the ROC curve for Experiment 1 (Varying n_estimators)
roc_curve_experiment_1_adaboost <- roc(df_test$y, y_pred_experiment_1_adaboost_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(roc_curve_experiment_1_adaboost, col="#FA8072", print.auc=TRUE, add=TRUE, print.auc.y=40)
# Plot the ROC curve for Experiment 2 (Varying learning_rate)
roc_curve_experiment_2_adaboost <- roc(df_test$y, y_pred_experiment_2_adaboost_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
plot(roc_curve_experiment_2_adaboost, col="#FFD700", print.auc=TRUE, add=TRUE, print.auc.y=60)
# Add legend to the plot
legend("bottomright", legend=c("Base Model (Adaboost)", "Experiment 1 (Varying n_estimators)", "Experiment 2 (Varying learning_rate)"),
col=c("#0ABAB5", "#FA8072", "#FFD700"), lwd=2)