Instructions
1. Choose a dataset
You get to decide which dataset you want to work on. The dataset must
be different from the ones used in previous homework assignments. You can work on a
problem from your job, or something you are interested in. You may also
obtain a dataset from sites such as Kaggle, Data.gov, the Census Bureau,
USGS, or other open data portals.
2. Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15, to apply to the newly selected dataset.
Introduction
According to the CDC, about 38 million Americans, roughly 1 in 10, have diabetes, the vast majority of it Type 2.[^1] In 2023, the CDC estimated the prevalence of diagnosed diabetes among American Indian/Alaska Native adults at 13.6%, about 1 in 7, compared with 6.9% among non-Hispanic white adults.[^2] This high prevalence places the American Indian demographic group in a high-risk population for the development of Type 2 Diabetes.
This analysis seeks to identify health markers that might contribute to the prediction of Type 2 Diabetes among one group of American Indian females aged 21 years or older, the Pima Indians. The dataset used for this analysis was collected by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) as part of an effort to understand factors related to Type 2 Diabetes. Understanding health markers associated with an increased risk of Type 2 Diabetes in a high-risk population can potentially clarify associated risk factors in the overall population.
From a business perspective, the ability to predict diabetes and prevent onset can save money, contribute to a healthier population and lead to improved patient outcomes.
The dataset contains 768 rows and nine columns, including the target variable ‘Outcome’. All features are numeric. All individuals in the dataset are female, of Pima Indian heritage, and twenty-one years old or older. The eight predictor features are Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age; the target variable, Outcome, denotes a diagnosis of Type 2 Diabetes (1 = diabetic, 0 = non-diabetic).
Distribution
At first glance, the features appear to have no missing values in the form
of NA or NULL. Initial histograms, however, suggested that several of the
features have missing values coded as zero. More specifically, the
features BloodPressure, Glucose, BMI, DiabetesPedigreeFunction,
SkinThickness, and Insulin all contain data points recorded as zero.
From a medical perspective, values of zero for these features are not
clinically plausible (e.g., a blood pressure of zero is not possible
unless a patient is deceased). Because some of the features, such as
SkinThickness, have a large number of missing values coded as zero
(~250), these values were imputed with the mean of the non-zero values
rather than excluding the affected rows from the analysis.
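The snippet below is a minimal sketch of this zero-as-missing mean imputation, assuming the raw data frame has been read in as dm_data (as in the code section at the end of this report, where the full pre-processing appears).

```r
library(dplyr)

# Sketch: treat clinically impossible zeroes as missing and replace them
# with the mean of the non-zero values for each affected column
zero_as_missing <- c("Glucose", "BloodPressure", "SkinThickness",
                     "Insulin", "BMI", "DiabetesPedigreeFunction")

dm_imputed <- dm_data %>%
  mutate(across(all_of(zero_as_missing),
                ~ ifelse(.x == 0, mean(.x[.x != 0], na.rm = TRUE), .x)))
```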
Distribution Post Imputation
After mean imputation, the features BloodPressure, BMI, Glucose, and
SkinThickness approximated a normal distribution, while Age,
DiabetesPedigreeFunction, Pregnancies, and Insulin remained
right-skewed. The right skew in these variables suggests the sample
leans toward younger patients, individuals with few or no pregnancies,
lower DiabetesPedigreeFunction values, and lower insulin values.
Correlation Matrix
A correlation matrix was used to investigate relationships between
features. The matrix showed that most features have weak to moderate
positive correlations. The strongest relationships were between Glucose
and Insulin (r = 0.42), BMI and SkinThickness (r = 0.43), and Age and
Pregnancies (r = 0.54). The scatterplots support the notion that most of
the relationships between features are non-linear, with the exception of
Glucose and Insulin, BMI and SkinThickness, and Age and Pregnancies.
Overall, the predominantly non-linear relationships make the data
suitable for SVM and neural network models.
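The pairwise correlations cited above can be reproduced with a short call such as the sketch below, which assumes the imputed data frame dm_data_cleaned from the code section; GGally::ggpairs (used there) produces the full matrix plot.

```r
# Sketch: Pearson correlations among the imputed features
round(cor(dm_data_cleaned[, c("Glucose", "Insulin", "BMI", "SkinThickness",
                              "Age", "Pregnancies")]), 2)
```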
Linearity
Since the target variable is binary, linearity between the features and
the target was assessed using LOESS curves, which visualize the
relationship between each feature and the probability of the outcome
(here, diabetes, Outcome = 1). The plots revealed non-linear
relationships for most of the features, with the exception of Glucose
and, to a lesser extent, BMI. Overall, the relationships can be
considered non-linear and suitable for SVM and neural network models.
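As an illustration of the LOESS check, the sketch below plots the probability of diabetes against a single feature (Glucose); the code section loops the same idea over every feature.

```r
library(ggplot2)

# Sketch: LOESS curve of diabetes probability against one feature
ggplot(dm_data_cleaned, aes(x = Glucose, y = Outcome)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess") +
  labs(y = "Probability of diabetes (Outcome = 1)")
```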
Outliers
Outliers were inspected using scatter plots and the interquartile range
(IQR). A large number of outliers were detected for Insulin (164) and
SkinThickness (87). Because these values are medically plausible, the
outliers were retained rather than removed; instead, an IQR-based
coercion (capping) method was used to limit the influence of extreme
values on the models. More specifically, values more than 1.5 times the
IQR below the 25th percentile or above the 75th percentile were replaced
with the nearest boundary value.
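A minimal sketch of this IQR-based capping is shown below; the code section applies an equivalent function (coerce_iqr) to every feature, so the cap_iqr helper here is purely illustrative.

```r
# Sketch: cap values beyond 1.5 * IQR at the nearest bound
cap_iqr <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  lower <- q[1] - 1.5 * diff(q)
  upper <- q[2] + 1.5 * diff(q)
  pmin(pmax(x, lower), upper)
}

# e.g., applied to the two features with the most outliers
dm_capped <- dplyr::mutate(dm_data_cleaned,
                           dplyr::across(c(Insulin, SkinThickness), cap_iqr))
```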
More Pre-Processing & Class Imbalance
After coercion, min-max scaling was used to normalize feature values to
the range 0 to 1. This pre-processing was necessary to prepare the data
for the SVM and neural network analyses.
After scaling, class imbalance in the target variable Outcome was noted (500 patients with no diagnosis of diabetes versus 268 patients with a diabetic diagnosis). To address this imbalance, boosted oversampling (via the ROSE package) was applied to the training data, after which the training-set class distribution improved to 319 non-diabetic versus 296 diabetic cases.
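The sketch below mirrors these scaling and rebalancing steps, assuming the capped data frame dm_data_coerced and the same 80/20 split used in the code section; only the training partition is oversampled.

```r
library(dplyr)
library(caret)
library(ROSE)

# Sketch: min-max scale predictors to [0, 1], split 80/20, then rebalance
# only the training partition with ROSE
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))

dm_scaled <- dm_data_coerced %>%
  mutate(across(where(is.numeric) & !Outcome, scale01))

set.seed(123)
idx            <- createDataPartition(dm_scaled$Outcome, p = 0.8, list = FALSE)
train_balanced <- ROSE(Outcome ~ ., data = dm_scaled[idx, ], seed = 123)$data
table(train_balanced$Outcome)
```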
| Model | Y=0 | Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Bal. Acc. | Precision | F1 | AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Initial Data Distribution | |||||||||||||
| Before Boost Over-Sampling | 500 | 268 | |||||||||||
| After Boost Over-Sampling | 394 | 374 | |||||||||||
| SVM | |||||||||||||
| Initial SVM (linear) | 394 | 374 | 33 | 19 | 19 | 82 | 0.7516 | 0.8119 | 0.6346 | 0.7232 | 0.6346 | 0.6346 | 0.8437 |
| Initial SVM (radial) | 394 | 374 | 37 | 15 | 22 | 79 | 0.7582 | 0.7822 | 0.7115 | 0.7469 | 0.8404 | 0.6667 | 0.8505 |
| Radial & Cost = 0.1 | 394 | 374 | 36 | 16 | 24 | 77 | 0.7386 | 0.7624 | 0.6923 | 0.7273 | 0.6923 | 0.6429 | 0.8446 |
| Radial & Cost = 10 | 394 | 374 | 35 | 17 | 19 | 82 | 0.7647 | 0.8119 | 0.6731 | 0.7425 | 0.8283 | 0.6604 | 0.8334 |
| Neural Network | |||||||||||||
| Initial | 394 | 364 | 37 | 15 | 22 | 79 | 0.7582 | 0.7822 | 0.7115 | 0.7469 | 0.8404 | 0.6667 | 0.7489 |
| Size=3 | 394 | 364 | 36 | 16 | 27 | 74 | 0.7190 | 0.7327 | 0.6923 | 0.7125 | 0.8222 | 0.6261 | 0.7125 |
| Size=5 | 394 | 364 | 39 | 13 | 23 | 78 | 0.7647 | 0.7723 | 0.7500 | 0.7611 | 0.6290 | 0.6842 | 0.7611 |
| Size=7 | 394 | 364 | 37 | 15 | 27 | 74 | 0.7255 | 0.7327 | 0.7115 | 0.7221 | 0.8315 | 0.6379 | 0.7221 |
| Size=10 | 394 | 364 | 38 | 14 | 23 | 78 | 0.7582 | 0.7723 | 0.7308 | 0.7515 | 0.8478 | 0.6726 | 0.7515 |
| Size=20 | 394 | 364 | 31 | 21 | 19 | 82 | 0.7386 | 0.8119 | 0.5962 | 0.7040 | 0.7961 | 0.6078 | 0.7040 |
| Size=5 & maxit =500 | 394 | 364 | 39 | 13 | 27 | 74 | 0.7386 | 0.7327 | 0.7500 | 0.7413 | 0.7500 | 0.6610 | 0.7413 |
| Size=5 & maxit =1000 | 394 | 364 | 39 | 13 | 27 | 74 | 0.7386 | 0.7327 | 0.7500 | 0.7413 | 0.7500 | 0.6610 | 0.7413 |
| Size=5 & decay= 0.01 | 394 | 364 | 38 | 14 | 26 | 75 | 0.7386 | 0.7426 | 0.7308 | 0.7367 | 0.7083 | 0.6552 | 0.7367 |
| Size=5 & decay= 0.001 | 394 | 364 | 36 | 16 | 22 | 79 | 0.7516 | 0.7822 | 0.6923 | 0.7372 | 0.6923 | 0.6542 | 0.7372 |
A Note about Healthcare Data
Because the analysis seeks to accurately predict a diagnosis of diabetes,
the preferred model is one that minimizes false negatives. In health
care, a missed diagnosis (false negative) is generally more detrimental
than a false alarm (false positive).
Why SVM and Neural Network Models
SVM and Neural network models were chosen because:
Feature Interaction
The features are likely highly interrelated and interact with each other
in complex ways. For instance, developing one chronic condition, such as
hypertension (high blood pressure), has been found to influence the
development of subsequent chronic diseases while also affecting health
markers, including the eight features in this study. SVM and neural
networks have been used to capture such complex interactions because of
their ability to model multi-dimensional relationships.
Non-Linear Decision Boundaries
The LOESS curves between the features and the target variable, as well
as the correlation matrix between features, both indicate that a
majority of the relationships in the dataset are non-linear.
Furthermore, interaction between features is almost certain given that
all features are health markers with the potential to affect one
another. SVM with a radial kernel and neural network models are both
able to handle non-linear relationships.
Sensitivity to Imbalanced Data
The dataset suffers from mild class imbalance, with more non-diabetic
cases than diabetic ones. To address this, boosted oversampling was
applied during the pre-processing phase. While SVM and neural network
models do not inherently handle imbalanced data, the oversampling
balanced the class distribution of the target variable, making the
training data better suited for SVM and neural network models.
Results of SVM
Initially, the SVM model used a linear kernel as a sanity check and
reference point. A radial kernel was then tested to account for
potential non-linear relationships between features. The radial kernel
was preferred because it achieved better overall performance, including
higher AUC (0.8505 vs 0.8437), F1 (0.6667 vs 0.6346), and specificity
(0.7115 vs 0.6346), at the cost of slightly lower sensitivity (0.7822 vs
0.8119). Next, the cost parameter, which controls the penalty for
misclassification, was adjusted. Increasing the cost to 10 raised
sensitivity back to 0.8119 (from 0.7822) while maintaining reasonable
specificity (0.6731) and precision (0.8283), with only a small drop in
AUC (0.8334 vs 0.8505). Since minimizing false negatives is of utmost
importance, the Radial & Cost = 10 model was the preferred SVM. While
the linear model and the Radial & Cost = 10 model both had a sensitivity
of 0.8119, the Radial & Cost = 10 model was more balanced across the
other metrics, including accuracy, F1, precision, specificity, and
balanced accuracy.
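For reference, a sketch of the preferred SVM configuration is shown below; train_data_boosted and test_data are the objects created in the code section at the end of the report.

```r
library(e1071)
library(caret)

# Sketch: radial-kernel SVM with cost = 10, the preferred configuration
svm_rbf <- svm(factor(Outcome) ~ ., data = train_data_boosted,
               kernel = "radial", cost = 10, probability = TRUE)

svm_pred <- predict(svm_rbf, newdata = test_data)
confusionMatrix(factor(svm_pred, levels = c(0, 1)),
                factor(test_data$Outcome, levels = c(0, 1)))
```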
Results of Neural Network
An initial neural network was trained with a single hidden unit
(size = 1). The number of hidden units was then increased (size = 3, 5,
7, 10, and 20) to assess for improvements. The model with size = 5
emerged as the preferred configuration. It had a sensitivity of 0.7723,
meaning it correctly identified over 77% of diabetic cases, and it
maintained strong performance across the other metrics, including
specificity (0.7500), accuracy (0.7647), F1 score (0.6842), and AUC
(0.7611). While some configurations (e.g., size = 20 or the decay
variations) slightly improved sensitivity, they also reduced precision
or overall balance relative to size = 5. Additional models were also
trained with different maxit and decay settings; the size = 5 model
nonetheless remained the preferred neural network, as it offered the
best overall trade-off between sensitivity and specificity while
maintaining strong overall balance and generalizability.
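Similarly, a sketch of the preferred neural network configuration follows; size, decay, and maxit can be varied exactly as in the code section.

```r
library(nnet)
library(caret)

# Sketch: single-hidden-layer network with the preferred size = 5
set.seed(123)
nn_fit <- nnet(Outcome ~ ., data = train_data_boosted, size = 5, trace = FALSE)

nn_probs <- predict(nn_fit, newdata = test_data, type = "raw")
nn_pred  <- factor(ifelse(nn_probs > 0.5, 1, 0), levels = c(0, 1))
confusionMatrix(nn_pred, factor(test_data$Outcome, levels = c(0, 1)))
```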
Final Model Selection (SVM vs. NN) & Conclusion
| Model | Y=0 | Y=1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Bal. Acc. | Precision | F1 | AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVM | |||||||||||||
| Radial & Cost = 10 | 394 | 364 | 35 | 17 | 19 | 82 | 0.7647 | 0.8119 | 0.6731 | 0.7425 | 0.8283 | 0.6604 | 0.8334 |
| Neural Network | |||||||||||||
| Size=5 | 394 | 364 | 39 | 13 | 23 | 78 | 0.7647 | 0.7723 | 0.7500 | 0.7611 | 0.6290 | 0.6842 | 0.7611 |
Two models were selected for final evaluation: the SVM with a radial kernel and cost = 10, and a neural network with size = 5. Both models achieved an identical accuracy of 0.7647. However, with regard to sensitivity and minimizing false negatives, the SVM was the preferred model (sensitivity 0.8119 vs 0.7723 for the neural network). The SVM’s higher sensitivity ensures better identification of diabetic patients. The neural network did exhibit higher specificity and F1 score; however, in clinical settings where missing a positive diagnosis is riskier than a false alarm, sensitivity is paramount, and the SVM is therefore the preferred model.
[^1]: Centers for Disease Control and Prevention. About Type 2 Diabetes. U.S. Department of Health and Human Services; 2023. Available at: https://www.cdc.gov/diabetes/about/about-type-2-diabetes.html
[^2]: Centers for Disease Control and Prevention. National Diabetes Statistics Report, 2023: Estimates of Diabetes and Its Burden in the United States. U.S. Department of Health and Human Services; 2023. Available at: https://www.cdc.gov/diabetes/data/statistics-report/index.html
########################################## ONLY CODE BELOW###############################################################
library(ggplot2)
library(tidyr)
library(tidyverse)## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
## Loading required package: viridisLite
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## Loaded ROSE 0.0-4
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# remaining packages used in this analysis (their startup messages appear above);
# e1071 provides svm() and knitr provides kable()
library(caret)
library(randomForest)
library(viridis)
library(GGally)
library(ROSE)
library(pROC)
library(e1071)
library(knitr)
library(nnet)
library(dplyr)
#read dataset in
dm_data <- read_csv("https://raw.githubusercontent.com/greggmaloy/Data622/refs/heads/main/diabetes.csv", show_col_types = FALSE)
#visualize data
dm_data
# Pivot to long
dm_data_long <- pivot_longer(dm_data, cols = everything(), names_to = "Variable", values_to = "Value")
# histograms
ggplot(dm_data_long, aes(x = Value, fill = Variable)) +
geom_histogram(bins = 30, color = "black", alpha = 0.8) +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(title = "Histograms of Each Variable", x = "Value", y = "Count") + scale_fill_viridis_d() +
theme(legend.position = "none") set.seed(123)
# Impute values using mean
dm_data_cleaned <- dm_data %>%
mutate(
BMI = ifelse(BMI == 0, mean(BMI[BMI != 0], na.rm = TRUE), BMI),
BloodPressure = ifelse(BloodPressure == 0, mean(BloodPressure[BloodPressure != 0], na.rm = TRUE), BloodPressure),
Glucose = ifelse(Glucose == 0, mean(Glucose[Glucose != 0], na.rm = TRUE), Glucose),
Insulin = ifelse(Insulin == 0, mean(Insulin[Insulin != 0], na.rm = TRUE), Insulin),
SkinThickness = ifelse(SkinThickness == 0, mean(SkinThickness[SkinThickness != 0], na.rm = TRUE), SkinThickness),
DiabetesPedigreeFunction = ifelse(DiabetesPedigreeFunction == 0, mean(DiabetesPedigreeFunction[DiabetesPedigreeFunction != 0], na.rm = TRUE), DiabetesPedigreeFunction)
)
cat("\nNumber of rows after imputing zeroes:", nrow(dm_data_cleaned), "\n")##
## Number of rows after imputing zeroes: 768
set.seed(123)
# long format
dm_data_long_cleaned <- dm_data_cleaned %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
# new updated histograms
final_histo <-ggplot(dm_data_long_cleaned, aes(x = Value, fill = Variable)) +
geom_histogram(bins = 30, color = "black", alpha = 0.8) +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(title = "Histograms of Each Variable (After Imputation)", x = "Value", y = "Count") +
scale_fill_viridis_d() +
theme(legend.position = "none")
final_histo
# correlation matrix
set.seed(123)
GGally::ggpairs(
dm_data_cleaned,
columns = c("Glucose", "BMI", "BloodPressure", "Insulin", "Age", "Pregnancies", "DiabetesPedigreeFunction", "SkinThickness"),
aes(color = factor(Outcome))
)# Compute the log-odds
logistic_model <- glm(Outcome ~ ., data = dm_data_cleaned, family = binomial())
# Extract the fitted probabilities
dm_data_cleaned$log_odds <- predict(logistic_model, type = "response")
dm_data_cleaned$log_odds <- log(dm_data_cleaned$log_odds / (1 - dm_data_cleaned$log_odds))
# Plot
plots <- list()
for (feature in names(dm_data_cleaned)[-c(9, 10)]) {
p <- ggplot(dm_data_cleaned, aes_string(x = feature, y = "log_odds")) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", color = "blue") +
labs(title = paste("Log-Odds vs", feature),
x = feature,
y = "Log-Odds DM") +
theme_minimal() +
theme(
plot.title = element_text(size = 10),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10)
)
plots[[feature]] <- p
}## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
library(ggplot2)
# Loop through each feature and create a LOESS plot
plots <- list()
for (feature in names(dm_data_cleaned)[-9]) { # Exclude Outcome column
p <- ggplot(dm_data_cleaned, aes_string(x = feature, y = "Outcome")) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", color = "blue") +
labs(title = paste("LOESS Curve for", feature),
x = feature,
y = "Probability of Diabetes") +
theme_minimal() +
theme(
plot.title = element_text(size = 10),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10)
)
plots[[feature]] <- p
}
# Display the plots in a grid
library(patchwork)
wrap_plots(plots, ncol = 3)## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
#scatterplots
set.seed(123)
# Add a row ID
dm_data_with_id <- dm_data_cleaned %>%
mutate(Row = row_number())
# Pivot
dm_data_long <- dm_data_with_id %>%
pivot_longer(cols = -Row, names_to = "Variable", values_to = "Value")
# scatterplots
ggplot(dm_data_long, aes(x = Row, y = Value, fill = Variable)) +
geom_point(shape = 21, size = 2, color = "black") +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
scale_fill_viridis_d() +
theme_minimal() +
labs(title = "Scatterplots for Each Variable (After Imputation - Outlier Detection)",
x = "Row Index", y = "Value") +
theme(legend.position = "none")# outliers detection using IQR
set.seed(123)
calculate_iqr_outliers_for_count <- function(df, col_name) {
Q1 <- quantile(df[[col_name]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[col_name]], 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
df %>%
filter(.data[[col_name]] < lower_bound | .data[[col_name]] > upper_bound) %>%
mutate(variable = col_name) %>%
select(variable)
}
outlier_counts_cleaned <- dm_data_cleaned %>%
select(where(is.numeric)) %>%
names() %>%
map_df(~calculate_iqr_outliers_for_count(dm_data_cleaned, .x)) %>%
group_by(variable) %>%
summarise(outlier_count = n(), .groups = 'drop')
# print
cat("\n**Table 1: Count of Outliers in Each Numeric Variable (After Cleaning)**\n")##
## **Table 1: Count of Outliers in Each Numeric Variable (After Cleaning)**
kable(outlier_counts_cleaned, caption = "Count of Outliers in Each Numeric Variable (After Cleaning)")

| variable | outlier_count |
|---|---|
| Age | 9 |
| BMI | 8 |
| BloodPressure | 14 |
| DiabetesPedigreeFunction | 29 |
| Insulin | 164 |
| Pregnancies | 4 |
| SkinThickness | 87 |
| log_odds | 2 |
set.seed(123)
# cap outliers using iqr
coerce_iqr <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_val
upper_bound <- Q3 + 1.5 * IQR_val
x <- ifelse(x < lower_bound, lower_bound, x)
x <- ifelse(x > upper_bound, upper_bound, x)
return(x)
}
dm_data_coerced <- dm_data_cleaned %>%
mutate(
Glucose = coerce_iqr(Glucose),
BloodPressure = coerce_iqr(BloodPressure),
SkinThickness = coerce_iqr(SkinThickness),
BMI = coerce_iqr(BMI),
Pregnancies = coerce_iqr(Pregnancies),
Insulin = coerce_iqr(Insulin),
DiabetesPedigreeFunction = coerce_iqr(DiabetesPedigreeFunction),
Age = coerce_iqr(Age)
  )
# histograms again, this time with the outliers and NA values accounted for
set.seed(123)
dm_data_cleaned_with_id <- dm_data_cleaned %>% mutate(Row = row_number())
dm_data_coerced_with_id <- dm_data_coerced %>% mutate(Row = row_number())
# Pivot to long format
dm_data_cleaned_long <- dm_data_cleaned_with_id %>%
pivot_longer(cols = -Row, names_to = "Variable", values_to = "Value") %>%
mutate(Dataset = "Before Coercion")
dm_data_coerced_long <- dm_data_coerced_with_id %>%
pivot_longer(cols = -Row, names_to = "Variable", values_to = "Value") %>%
mutate(Dataset = "After Coercion")
combined_data_long <- bind_rows(dm_data_cleaned_long, dm_data_coerced_long)
# histograms before and after coercion
ggplot(combined_data_long, aes(x = Value, fill = Dataset)) +
geom_histogram(bins = 30, position = "identity", alpha = 0.6, color = "black") +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
scale_fill_viridis_d(option = "C") +
theme_minimal() +
labs(title = "Before vs After Coercion of Outliers",
x = "Value",
y = "Count") +
theme(legend.position = "top")library(dplyr)
set.seed(123)
# min max scaling in prep for SVM and neural networks
min_max_scaler <- function(x) {
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
# Apply Min-Max to all features except outcome
dm_data_scaled <- dm_data_coerced %>%
mutate(across(where(is.numeric) & !Outcome, min_max_scaler))
# visualize
head(dm_data_scaled)
# class distribution of the target variable (counts, then proportions)
table(dm_data_scaled$Outcome)##
## 0 1
## 500 268
prop.table(table(dm_data_scaled$Outcome))##
## 0 1
## 0.6510417 0.3489583
# Boosted Oversampling for balance target variable
set.seed(123) # for reproducibility
dm_data_boosted <- ROSE(Outcome ~ ., data = dm_data_scaled, seed = 123)$data
table(dm_data_boosted$Outcome)##
## 0 1
## 394 374
set.seed(123)
# test train 80/20
train_index <- createDataPartition(dm_data_scaled$Outcome, p = 0.8, list = FALSE)
train_data <- dm_data_scaled[train_index, ]
test_data <- dm_data_scaled[-train_index, ]
# Boost Oversampling on the training data only!!!
train_data_boosted <- ROSE(Outcome ~ ., data = train_data, seed = 123)$data
table(train_data_boosted$Outcome)##
## 0 1
## 319 296
# train svm
#svm_model <- svm(factor(Outcome) ~ ., data = train_data_boosted, kernel = "radial", probability = TRUE)
#summary(svm_model)
set.seed(1234)
# svm linear kernel
svm_model <- svm(factor(Outcome) ~ ., data = train_data_boosted, kernel = "linear", probability = TRUE)
# Predict
svm_predictions <- predict(svm_model, newdata = test_data)
# Ensure factors levels
svm_predictions <- factor(svm_predictions, levels = c(0, 1))
test_labels <- factor(test_data$Outcome, levels = c(0, 1))
# Confusion Matrix
conf_matrix <- confusionMatrix(svm_predictions, test_labels)
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 82 18
## 1 19 34
##
## Accuracy : 0.7582
## 95% CI : (0.6824, 0.8237)
## No Information Rate : 0.6601
## P-Value [Acc > NIR] : 0.005655
##
## Kappa : 0.4636
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.8119
## Specificity : 0.6538
## Pos Pred Value : 0.8200
## Neg Pred Value : 0.6415
## Prevalence : 0.6601
## Detection Rate : 0.5359
## Detection Prevalence : 0.6536
## Balanced Accuracy : 0.7329
##
## 'Positive' Class : 0
##
precision <- posPredValue(svm_predictions, test_labels, positive = "1")
recall <- sensitivity(svm_predictions, test_labels, positive = "1")
# F1 score
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 1): %.4f\n", f1_score))##
## F1 Score (class = 1): 0.6476
#ROC Curve & AUC
svm_probabilities <- predict(svm_model, newdata = test_data, probability = TRUE)
svm_prob_df <- attr(svm_probabilities, "probabilities")
probs_class1 <- svm_prob_df[, "1"]
# ROC
roc_curve <- roc(test_data$Outcome, probs_class1)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## AUC: 0.8450
set.seed(1234)
# radial this time
svm_model <- svm(factor(Outcome) ~ ., data = train_data_boosted, kernel = "radial", probability = TRUE)
svm_predictions <- predict(svm_model, newdata = test_data)
svm_predictions <- factor(svm_predictions, levels = c(0, 1))
test_labels <- factor(test_data$Outcome, levels = c(0, 1))
# Confusion Matrix
conf_matrix <- confusionMatrix(svm_predictions, test_labels)
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 84 18
## 1 17 34
##
## Accuracy : 0.7712
## 95% CI : (0.6965, 0.8352)
## No Information Rate : 0.6601
## P-Value [Acc > NIR] : 0.001883
##
## Kappa : 0.4878
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.8317
## Specificity : 0.6538
## Pos Pred Value : 0.8235
## Neg Pred Value : 0.6667
## Prevalence : 0.6601
## Detection Rate : 0.5490
## Detection Prevalence : 0.6667
## Balanced Accuracy : 0.7428
##
## 'Positive' Class : 0
##
precision <- posPredValue(svm_predictions, test_labels, positive = "1")
recall <- sensitivity(svm_predictions, test_labels, positive = "1")
# F1 score
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 1): %.4f\n", f1_score))##
## F1 Score (class = 1): 0.6602
# ROC Curve & AUC
#
svm_probabilities <- predict(svm_model, newdata = test_data, probability = TRUE)
svm_prob_df <- attr(svm_probabilities, "probabilities")
probs_class1 <- svm_prob_df[, "1"]
# ROC
roc_curve <- roc(test_data$Outcome, probs_class1)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## AUC: 0.8494
library(pROC)
# Predict probabilities
svm_probabilities <- predict(svm_model, newdata = test_data, probability = TRUE)
svm_prob_df <- attr(svm_probabilities, "probabilities")
probs_class1 <- svm_prob_df[, "1"]
# Compute ROC
roc_curve <- roc(test_data$Outcome, probs_class1)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC
plot(roc_curve, col = "blue", main = "SVM ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red") # Random guess line## Area under the curve: 0.8494
set.seed(1234)
# radial and cost=0.1
svm_model <- svm(factor(Outcome) ~ ., data = train_data_boosted, kernel = "radial", cost = 0.1, probability = TRUE)
svm_predictions <- predict(svm_model, newdata = test_data)
svm_predictions <- factor(svm_predictions, levels = c(0, 1))
test_labels <- factor(test_data$Outcome, levels = c(0, 1))
# Confusion Matrix
conf_matrix <- confusionMatrix(svm_predictions, test_labels)
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 79 18
## 1 22 34
##
## Accuracy : 0.7386
## 95% CI : (0.6615, 0.8062)
## No Information Rate : 0.6601
## P-Value [Acc > NIR] : 0.02309
##
## Kappa : 0.428
##
## Mcnemar's Test P-Value : 0.63526
##
## Sensitivity : 0.7822
## Specificity : 0.6538
## Pos Pred Value : 0.8144
## Neg Pred Value : 0.6071
## Prevalence : 0.6601
## Detection Rate : 0.5163
## Detection Prevalence : 0.6340
## Balanced Accuracy : 0.7180
##
## 'Positive' Class : 0
##
#F1
precision <- posPredValue(svm_predictions, test_labels, positive = "1")
recall <- sensitivity(svm_predictions, test_labels, positive = "1")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 1): %.4f\n", f1_score))##
## F1 Score (class = 1): 0.6296
# ROC Curve & AUC
# Predict probabilities
svm_probabilities <- predict(svm_model, newdata = test_data, probability = TRUE)
svm_prob_df <- attr(svm_probabilities, "probabilities")
probs_class1 <- svm_prob_df[, "1"]
roc_curve <- roc(test_data$Outcome, probs_class1)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## AUC: 0.8420
set.seed(1234)
# radial and cost = 10
svm_model <- svm(factor(Outcome) ~ ., data = train_data_boosted, kernel = "radial", cost = 10, probability = TRUE)
svm_predictions <- predict(svm_model, newdata = test_data)
svm_predictions <- factor(svm_predictions, levels = c(0, 1))
test_labels <- factor(test_data$Outcome, levels = c(0, 1))
# Confusion Matrix
conf_matrix <- confusionMatrix(svm_predictions, test_labels)
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 83 18
## 1 18 34
##
## Accuracy : 0.7647
## 95% CI : (0.6894, 0.8294)
## No Information Rate : 0.6601
## P-Value [Acc > NIR] : 0.003318
##
## Kappa : 0.4756
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.8218
## Specificity : 0.6538
## Pos Pred Value : 0.8218
## Neg Pred Value : 0.6538
## Prevalence : 0.6601
## Detection Rate : 0.5425
## Detection Prevalence : 0.6601
## Balanced Accuracy : 0.7378
##
## 'Positive' Class : 0
##
# --- F1 Score Calculation for Class 1 (Positive Class) ---
precision <- posPredValue(svm_predictions, test_labels, positive = "1")
recall <- sensitivity(svm_predictions, test_labels, positive = "1")
# F1 score
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 1): %.4f\n", f1_score))##
## F1 Score (class = 1): 0.6538
svm_probabilities <- predict(svm_model, newdata = test_data, probability = TRUE)
svm_prob_df <- attr(svm_probabilities, "probabilities")
probs_class1 <- svm_prob_df[, "1"]
roc_curve <- roc(test_data$Outcome, probs_class1)## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## AUC: 0.8454
# Train a simple Neural Network
set.seed(123)
nn_model <- nnet(
Outcome ~ .,
data = train_data_boosted,
size = 1 # number of neurons in hidden layer
# maxit = 500,
# decay = 0.01,
# trace = FALSE
)## # weights: 12
## initial value 170.393763
## iter 10 value 98.262402
## iter 20 value 97.002545
## iter 30 value 96.804595
## iter 40 value 96.710389
## iter 50 value 96.674100
## iter 60 value 96.599528
## iter 70 value 96.559544
## final value 96.558566
## converged
# Predict probabilities on test set
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")
# Convert probabilities into class predictions
nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))
# Make sure test labels match
test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))
# Confusion Matrix
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 79 16
## Yes 22 36
##
## Accuracy : 0.7516
## 95% CI : (0.6754, 0.8179)
## No Information Rate : 0.6601
## P-Value [Acc > NIR] : 0.009329
##
## Kappa : 0.4616
##
## Mcnemar's Test P-Value : 0.417304
##
## Sensitivity : 0.7822
## Specificity : 0.6923
## Pos Pred Value : 0.8316
## Neg Pred Value : 0.6207
## Prevalence : 0.6601
## Detection Rate : 0.5163
## Detection Prevalence : 0.6209
## Balanced Accuracy : 0.7372
##
## 'Positive' Class : No
##
# --- F1 Score Calculation for Class "Yes" (Positive Class = 1) ---
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))##
## F1 Score (class = 'Yes'): 0.6545
# compute ROC for the neural network predictions
roc_nn <- roc(test_labels, as.vector(nn_probs))## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")## AUC: 0.7372
#neural network initial
set.seed(123)
nn_model <- nnet(
Outcome ~ .,
data = train_data_boosted,
size = 3
)## # weights: 34
## initial value 218.660078
## iter 10 value 100.423834
## iter 20 value 96.411732
## iter 30 value 95.342117
## iter 40 value 90.192556
## iter 50 value 87.174369
## iter 60 value 85.471492
## iter 70 value 84.020839
## iter 80 value 83.322474
## iter 90 value 82.756079
## iter 100 value 82.413573
## final value 82.413573
## stopped after 100 iterations
# Predict probabilities
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")
# Convert probabilities to class prediction
nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))
# labels must match
test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))
# Confusion Matrix
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 81 15
## Yes 20 37
##
## Accuracy : 0.7712
## 95% CI : (0.6965, 0.8352)
## No Information Rate : 0.6601
## P-Value [Acc > NIR] : 0.001883
##
## Kappa : 0.5018
##
## Mcnemar's Test P-Value : 0.498962
##
## Sensitivity : 0.8020
## Specificity : 0.7115
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.6491
## Prevalence : 0.6601
## Detection Rate : 0.5294
## Detection Prevalence : 0.6275
## Balanced Accuracy : 0.7568
##
## 'Positive' Class : No
##
# F1
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))##
## F1 Score (class = 'Yes'): 0.6789
# compute ROC for the neural network predictions
roc_nn <- roc(test_labels, as.vector(nn_probs))## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")## AUC: 0.7568
#this time size= 5
set.seed(123)
nn_model <- nnet(
Outcome ~ .,
data = train_data_boosted,
size = 5
)## # weights: 56
## initial value 172.943257
## iter 10 value 98.821023
## iter 20 value 93.446649
## iter 30 value 88.696700
## iter 40 value 87.828847
## iter 50 value 87.114495
## iter 60 value 86.327113
## iter 70 value 85.668788
## iter 80 value 85.241567
## iter 90 value 84.978636
## iter 100 value 84.740028
## final value 84.740028
## stopped after 100 iterations
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")
nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))
test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))
# Confusion Matrix
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 75 13
## Yes 26 39
##
## Accuracy : 0.7451
## 95% CI : (0.6684, 0.812)
## No Information Rate : 0.6601
## P-Value [Acc > NIR] : 0.01490
##
## Kappa : 0.4644
##
## Mcnemar's Test P-Value : 0.05466
##
## Sensitivity : 0.7426
## Specificity : 0.7500
## Pos Pred Value : 0.8523
## Neg Pred Value : 0.6000
## Prevalence : 0.6601
## Detection Rate : 0.4902
## Detection Prevalence : 0.5752
## Balanced Accuracy : 0.7463
##
## 'Positive' Class : No
##
# F1
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))##
## F1 Score (class = 'Yes'): 0.6667
# compute ROC for the neural network predictions
roc_nn <- roc(test_labels, as.vector(nn_probs))## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")## AUC: 0.7463
## this time size = 7
set.seed(123)
nn_model <- nnet(
Outcome ~ .,
data = train_data_boosted,
size = 7
)## # weights: 78
## initial value 152.940178
## iter 10 value 98.024816
## iter 20 value 88.864770
## iter 30 value 82.740575
## iter 40 value 77.669749
## iter 50 value 72.193129
## iter 60 value 69.543274
## iter 70 value 66.989799
## iter 80 value 65.626394
## iter 90 value 64.985152
## iter 100 value 64.567175
## final value 64.567175
## stopped after 100 iterations
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")
nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))
test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))
# Confusion Matrix
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 68 12
## Yes 33 40
##
## Accuracy : 0.7059
## 95% CI : (0.6269, 0.7767)
## No Information Rate : 0.6601
## P-Value [Acc > NIR] : 0.133101
##
## Kappa : 0.403
##
## Mcnemar's Test P-Value : 0.002869
##
## Sensitivity : 0.6733
## Specificity : 0.7692
## Pos Pred Value : 0.8500
## Neg Pred Value : 0.5479
## Prevalence : 0.6601
## Detection Rate : 0.4444
## Detection Prevalence : 0.5229
## Balanced Accuracy : 0.7212
##
## 'Positive' Class : No
##
# F1
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))##
## F1 Score (class = 'Yes'): 0.6400
# compute ROC for the neural network predictions
roc_nn <- roc(test_labels, as.vector(nn_probs))## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")## AUC: 0.7212
## this time size = 10
set.seed(123)
nn_model <- nnet(
Outcome ~ .,
data = train_data_boosted,
size = 10
)## # weights: 111
## initial value 178.943974
## iter 10 value 97.031731
## iter 20 value 90.655966
## iter 30 value 83.644821
## iter 40 value 76.408499
## iter 50 value 68.223722
## iter 60 value 61.980388
## iter 70 value 60.671453
## iter 80 value 59.477138
## iter 90 value 58.969516
## iter 100 value 58.764513
## final value 58.764513
## stopped after 100 iterations
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")
nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))
test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))
# Confusion Matrix
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 71 13
## Yes 30 39
##
## Accuracy : 0.719
## 95% CI : (0.6407, 0.7886)
## No Information Rate : 0.6601
## P-Value [Acc > NIR] : 0.07182
##
## Kappa : 0.4197
##
## Mcnemar's Test P-Value : 0.01469
##
## Sensitivity : 0.7030
## Specificity : 0.7500
## Pos Pred Value : 0.8452
## Neg Pred Value : 0.5652
## Prevalence : 0.6601
## Detection Rate : 0.4641
## Detection Prevalence : 0.5490
## Balanced Accuracy : 0.7265
##
## 'Positive' Class : No
##
# F1 Score
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))##
## F1 Score (class = 'Yes'): 0.6446
# compute ROC for the neural network predictions
roc_nn <- roc(test_labels, as.vector(nn_probs))## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")## AUC: 0.7265
## this time size = 20
set.seed(123)
nn_model <- nnet(
Outcome ~ .,
data = train_data_boosted,
size = 20
)## # weights: 221
## initial value 180.104898
## iter 10 value 97.621738
## iter 20 value 90.450339
## iter 30 value 82.485305
## iter 40 value 69.128451
## iter 50 value 52.111810
## iter 60 value 43.067824
## iter 70 value 38.521076
## iter 80 value 37.044397
## iter 90 value 36.330469
## iter 100 value 36.083502
## final value 36.083502
## stopped after 100 iterations
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")
nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))
test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 69 14
## Yes 32 38
##
## Accuracy : 0.6993
## 95% CI : (0.62, 0.7707)
## No Information Rate : 0.6601
## P-Value [Acc > NIR] : 0.17421
##
## Kappa : 0.3819
##
## Mcnemar's Test P-Value : 0.01219
##
## Sensitivity : 0.6832
## Specificity : 0.7308
## Pos Pred Value : 0.8313
## Neg Pred Value : 0.5429
## Prevalence : 0.6601
## Detection Rate : 0.4510
## Detection Prevalence : 0.5425
## Balanced Accuracy : 0.7070
##
## 'Positive' Class : No
##
# F1 Score
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))##
## F1 Score (class = 'Yes'): 0.6230
# compute ROC for the neural network predictions
roc_nn <- roc(test_labels, as.vector(nn_probs))## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")## AUC: 0.7070
## this time size = 5 and maxit=500
set.seed(123)
nn_model <- nnet(
Outcome ~ .,
data = train_data_boosted,
size = 5,
maxit = 500
)## # weights: 56
## initial value 172.943257
## iter 10 value 98.821023
## iter 20 value 93.446649
## iter 30 value 88.696700
## iter 40 value 87.828847
## iter 50 value 87.114495
## iter 60 value 86.327113
## iter 70 value 85.668788
## iter 80 value 85.241567
## iter 90 value 84.978636
## iter 100 value 84.740028
## iter 110 value 84.314935
## iter 120 value 84.158814
## iter 130 value 84.114142
## iter 140 value 84.036798
## iter 150 value 83.869205
## iter 160 value 83.796893
## iter 170 value 83.597416
## iter 180 value 83.515043
## iter 190 value 83.343145
## iter 200 value 83.124413
## iter 210 value 83.008431
## iter 220 value 82.881592
## iter 230 value 82.836411
## iter 240 value 82.806466
## iter 250 value 82.772590
## iter 260 value 82.763363
## iter 270 value 82.742984
## iter 280 value 82.730980
## iter 290 value 82.726194
## iter 300 value 82.711006
## iter 310 value 82.698620
## iter 320 value 82.694094
## iter 330 value 82.678021
## iter 340 value 82.637182
## iter 350 value 82.635818
## iter 360 value 82.626370
## iter 370 value 82.619551
## iter 380 value 82.611546
## iter 390 value 82.606563
## iter 400 value 82.603729
## iter 410 value 82.599506
## iter 420 value 82.596834
## iter 430 value 82.582355
## iter 440 value 82.574533
## iter 450 value 82.559039
## iter 460 value 82.547956
## iter 470 value 82.543840
## iter 480 value 82.534981
## iter 490 value 82.530448
## iter 500 value 82.527157
## final value 82.527157
## stopped after 500 iterations
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")
nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))
test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 75 13
## Yes 26 39
##
## Accuracy : 0.7451
## 95% CI : (0.6684, 0.812)
## No Information Rate : 0.6601
## P-Value [Acc > NIR] : 0.01490
##
## Kappa : 0.4644
##
## Mcnemar's Test P-Value : 0.05466
##
## Sensitivity : 0.7426
## Specificity : 0.7500
## Pos Pred Value : 0.8523
## Neg Pred Value : 0.6000
## Prevalence : 0.6601
## Detection Rate : 0.4902
## Detection Prevalence : 0.5752
## Balanced Accuracy : 0.7463
##
## 'Positive' Class : No
##
# -f1
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))##
## F1 Score (class = 'Yes'): 0.6667
# compute ROC for the neural network predictions
roc_nn <- roc(test_labels, as.vector(nn_probs))## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")## AUC: 0.7463
## this time size = 5 and maxit=1000
set.seed(123)
nn_model <- nnet(
Outcome ~ .,
data = train_data_boosted,
size = 5,
maxit = 1000
)## # weights: 56
## initial value 172.943257
## iter 10 value 98.821023
## iter 20 value 93.446649
## iter 30 value 88.696700
## iter 40 value 87.828847
## iter 50 value 87.114495
## iter 60 value 86.327113
## iter 70 value 85.668788
## iter 80 value 85.241567
## iter 90 value 84.978636
## iter 100 value 84.740028
## iter 110 value 84.314935
## iter 120 value 84.158814
## iter 130 value 84.114142
## iter 140 value 84.036798
## iter 150 value 83.869205
## iter 160 value 83.796893
## iter 170 value 83.597416
## iter 180 value 83.515043
## iter 190 value 83.343145
## iter 200 value 83.124413
## iter 210 value 83.008431
## iter 220 value 82.881592
## iter 230 value 82.836411
## iter 240 value 82.806466
## iter 250 value 82.772590
## iter 260 value 82.763363
## iter 270 value 82.742984
## iter 280 value 82.730980
## iter 290 value 82.726194
## iter 300 value 82.711006
## iter 310 value 82.698620
## iter 320 value 82.694094
## iter 330 value 82.678021
## iter 340 value 82.637182
## iter 350 value 82.635818
## iter 360 value 82.626370
## iter 370 value 82.619551
## iter 380 value 82.611546
## iter 390 value 82.606563
## iter 400 value 82.603729
## iter 410 value 82.599506
## iter 420 value 82.596834
## iter 430 value 82.582355
## iter 440 value 82.574533
## iter 450 value 82.559039
## iter 460 value 82.547956
## iter 470 value 82.543840
## iter 480 value 82.534981
## iter 490 value 82.530448
## iter 500 value 82.527157
## iter 510 value 82.525797
## iter 520 value 82.523118
## iter 530 value 82.516482
## iter 540 value 82.500384
## iter 550 value 82.486391
## iter 560 value 82.467402
## iter 570 value 82.463631
## iter 580 value 82.459413
## iter 590 value 82.446863
## iter 600 value 82.440162
## iter 610 value 82.436340
## iter 620 value 82.428994
## iter 630 value 82.425332
## iter 640 value 82.423424
## iter 650 value 82.421382
## iter 660 value 82.418861
## iter 670 value 82.416263
## iter 680 value 82.411440
## iter 690 value 82.411228
## iter 700 value 82.410378
## iter 710 value 82.408850
## iter 720 value 82.407687
## iter 730 value 82.405354
## iter 740 value 82.403672
## iter 750 value 82.401724
## iter 760 value 82.391967
## iter 770 value 82.381427
## iter 780 value 82.359972
## iter 790 value 82.346906
## final value 82.346107
## converged
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")
#
nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))
test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))
# Confusion Matrix
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 75 13
## Yes 26 39
##
## Accuracy : 0.7451
## 95% CI : (0.6684, 0.812)
## No Information Rate : 0.6601
## P-Value [Acc > NIR] : 0.01490
##
## Kappa : 0.4644
##
## Mcnemar's Test P-Value : 0.05466
##
## Sensitivity : 0.7426
## Specificity : 0.7500
## Pos Pred Value : 0.8523
## Neg Pred Value : 0.6000
## Prevalence : 0.6601
## Detection Rate : 0.4902
## Detection Prevalence : 0.5752
## Balanced Accuracy : 0.7463
##
## 'Positive' Class : No
##
# F1
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))##
## F1 Score (class = 'Yes'): 0.6667
# compute ROC for the neural network predictions
roc_nn <- roc(test_labels, as.vector(nn_probs))## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")## AUC: 0.7463
## this time size = 5 and decay=0.01
nn_model <- nnet(
Outcome ~ .,
data = train_data_boosted,
size = 5,
#maxit = 1000,
decay = 0.01
# trace = FALSE
)## # weights: 56
## initial value 168.936570
## iter 10 value 98.728650
## iter 20 value 94.434797
## iter 30 value 92.024197
## iter 40 value 91.065216
## iter 50 value 90.202569
## iter 60 value 88.304614
## iter 70 value 86.525344
## iter 80 value 85.710244
## iter 90 value 85.069997
## iter 100 value 84.193195
## final value 84.193195
## stopped after 100 iterations
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")
nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))
test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 77 13
## Yes 24 39
##
## Accuracy : 0.7582
## 95% CI : (0.6824, 0.8237)
## No Information Rate : 0.6601
## P-Value [Acc > NIR] : 0.005655
##
## Kappa : 0.4874
##
## Mcnemar's Test P-Value : 0.100178
##
## Sensitivity : 0.7624
## Specificity : 0.7500
## Pos Pred Value : 0.8556
## Neg Pred Value : 0.6190
## Prevalence : 0.6601
## Detection Rate : 0.5033
## Detection Prevalence : 0.5882
## Balanced Accuracy : 0.7562
##
## 'Positive' Class : No
##
# f1
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))##
## F1 Score (class = 'Yes'): 0.6783
# compute ROC for the neural network predictions
roc_nn <- roc(test_labels, as.vector(nn_probs))## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")## AUC: 0.7562
## this time size = 5 and decay=0.001
set.seed(123)
nn_model <- nnet(
Outcome ~ .,
data = train_data_boosted,
size = 5,
# maxit = 1000,
decay = 0.001
# trace = FALSE
)## # weights: 56
## initial value 172.952656
## iter 10 value 99.139488
## iter 20 value 94.689617
## iter 30 value 89.119111
## iter 40 value 88.083046
## iter 50 value 87.114890
## iter 60 value 84.435316
## iter 70 value 82.477161
## iter 80 value 80.278521
## iter 90 value 77.958373
## iter 100 value 77.036219
## final value 77.036219
## stopped after 100 iterations
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")
nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))
test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))
# Confusion Matrix
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 72 11
## Yes 29 41
##
## Accuracy : 0.7386
## 95% CI : (0.6615, 0.8062)
## No Information Rate : 0.6601
## P-Value [Acc > NIR] : 0.02309
##
## Kappa : 0.4625
##
## Mcnemar's Test P-Value : 0.00719
##
## Sensitivity : 0.7129
## Specificity : 0.7885
## Pos Pred Value : 0.8675
## Neg Pred Value : 0.5857
## Prevalence : 0.6601
## Detection Rate : 0.4706
## Detection Prevalence : 0.5425
## Balanced Accuracy : 0.7507
##
## 'Positive' Class : No
##
# F1
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))##
## F1 Score (class = 'Yes'): 0.6721
# compute ROC for the neural network predictions
roc_nn <- roc(test_labels, as.vector(nn_probs))## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")## AUC: 0.7507