Part I: Assignment

Instructions
1. Choose a dataset
You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homework assignments. You can work on a problem from your job or on something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.gov, the Census Bureau, USGS, or other open data portals.

2. Select one of the methodologies studied in weeks 1-10 and another methodology from weeks 11-15 to apply to the newly selected dataset.

  1. To complete this task:
  • Describe the problem you are trying to solve.
  • Describe your dataset and what you did to prepare the data for analysis.
  • Describe the methodologies you used for analyzing the data.
  • Describe the purpose of the analysis performed.
  • Draw your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.

Part II: Data Selection & Problem

According to the CDC, 38 million Americans, about 1 in 10, have been diagnosed with Type 2 Diabetes.[^1] In 2023, the CDC estimated the prevalence of Type 2 Diabetes among American Indian/Alaska Native adults to be 13.6%, roughly 1 in 7, compared with 6.9% among non-Hispanic white adults.[^2] This high prevalence places American Indian adults in a high-risk population for the development of Type 2 Diabetes.

This analysis seeks to identify health markers that might contribute to the prediction of Type 2 Diabetes among one group of American Indian females aged 21 years or older, the Pima Indians. The dataset used for this analysis was collected by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) as part of an effort to understand factors related to Type 2 Diabetes. Understanding health markers associated with an increased risk of Type 2 Diabetes in a high-risk population can potentially clarify the associated risk factors in the overall population.

From a business perspective, the ability to predict diabetes and prevent its onset can reduce costs, contribute to a healthier population, and improve patient outcomes.

PART III: EDA & Pre-Processing

The dataset contains 768 rows and nine columns, all numeric: eight features (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age) and one target variable, Outcome, which denotes a diagnosis of Type 2 Diabetes. All individuals in the dataset are female, of Pima Indian heritage, and twenty-one years old or older.

Distribution
At first glance, the features appear to have no missing values in the form of NA or NULL entries. Initial histograms, however, suggested that some features have missing values coded as zero. More specifically, BloodPressure, Glucose, BMI, DiabetesPedigreeFunction, SkinThickness, and Insulin all contain data points recorded as zero. From a medical perspective, values of zero for these features are not clinically possible (e.g., a blood pressure of zero is not possible unless a patient is deceased). Because some of the features, such as SkinThickness, have a large number of missing values coded as zero (~250), it was decided to impute these values using the mean of the non-zero observations rather than exclude the affected rows from the analysis.

knitr::include_graphics("histo.jpg")

Distribution Post Imputation
After mean imputation, the features BloodPressure, BMI, Glucose, and SkinThickness approximated a normal distribution, while Age, DiabetesPedigreeFunction, Pregnancies, and Insulin remained right-skewed. The right skew in these variables suggests that the underlying sample is weighted toward younger patients, individuals with few or no pregnancies, lower diabetes pedigree function values, and lower insulin values.

knitr::include_graphics("histo_after_imputation.jpg")

Correlation Matrix
A correlation matrix was used to investigate relationships between features. The matrix showed that most features have weak to moderate positive correlations. The strongest relationships were between Glucose and Insulin (r = 0.42), BMI and SkinThickness (r = 0.43), and Age and Pregnancies (r = 0.54). The scatterplots suggest that, apart from these three pairs, the relationships between features are largely non-linear.

knitr::include_graphics("matrix.jpg")

Linearity
Since the target variable is binary, linearity between each feature and the target was assessed using LOESS curves, which visualize the relationship between a feature and the estimated probability of the outcome (here, diabetes, with Outcome = 1). The plots revealed non-linear relationships for most features, with the exception of Glucose and, to a lesser extent, BMI. Overall, the relationships can be considered non-linear and suitable for SVM and neural network models.

knitr::include_graphics("loess.jpg")

Outliers
Outliers were inspected using scatter plots and the interquartile range (IQR). A large number of outliers were detected for Insulin (164) and SkinThickness (87). Because these values are plausible from a medical perspective, the decision was made to keep the outlying observations but cap extreme values using an IQR-based coercion method: values more than 1.5 times the IQR below the 25th percentile or above the 75th percentile were replaced with the nearest boundary value. Min-max normalization (described in the next section) was then applied so that remaining extreme values would not unduly influence the models.

knitr::include_graphics("scat.jpg")

knitr::include_graphics("count.jpg")

More Pre-Processing & Class Imbalance
After coercion, min-max scaling was used to normalize feature values to the range 0 to 1. This pre-processing step was necessary to prepare the data for SVM and neural network analysis.

After min-max scaling, class imbalance was identified in the target variable Outcome (500 patients without a diabetes diagnosis versus 268 patients with one). To address this imbalance, oversampling with the ROSE package (referred to in the tables below as boost over-sampling) was applied to the training data. After oversampling, the class distribution of the training set improved to 319 non-diabetic versus 296 diabetic cases.
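
A minimal sketch of the scaling and oversampling steps, assuming the coerced data frame and the 80/20 training split created in Part VI; the helper and object names here are illustrative.

library(dplyr)
library(ROSE)
min_max <- function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))  # rescale to [0, 1]
scaled_sketch   <- dm_data_coerced %>% mutate(across(-Outcome, min_max))       # scale every column except the target
balanced_sketch <- ROSE(Outcome ~ ., data = train_data, seed = 123)$data       # oversample the training split only
table(balanced_sketch$Outcome)                                                  # check the new class counts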

PART IV: Essay

| Model | Y = 0 | Y = 1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Bal. Acc. | Precision | F1 | AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Initial Data Distribution |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Before Boost Over-Sampling | 500 | 268 |  |  |  |  |  |  |  |  |  |  |  |
| After Boost Over-Sampling | 394 | 374 |  |  |  |  |  |  |  |  |  |  |  |
| SVM |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Initial SVM (linear) | 394 | 374 | 33 | 19 | 19 | 82 | 0.7516 | 0.8119 | 0.6346 | 0.7232 | 0.6346 | 0.6346 | 0.8437 |
| Initial SVM (radial) | 394 | 374 | 37 | 15 | 22 | 79 | 0.7582 | 0.7822 | 0.7115 | 0.7469 | 0.8404 | 0.6667 | 0.8505 |
| Radial & Cost = 0.1 | 394 | 374 | 36 | 16 | 24 | 77 | 0.7386 | 0.7624 | 0.6923 | 0.7273 | 0.6923 | 0.6429 | 0.8446 |
| Radial & Cost = 10 | 394 | 374 | 35 | 17 | 19 | 82 | 0.7647 | 0.8119 | 0.6731 | 0.7425 | 0.8283 | 0.6604 | 0.8334 |
| Neural Network |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Initial | 394 | 364 | 37 | 15 | 22 | 79 | 0.7582 | 0.7822 | 0.7115 | 0.7469 | 0.8404 | 0.6667 | 0.7489 |
| Size = 3 | 394 | 364 | 36 | 16 | 27 | 74 | 0.7190 | 0.7327 | 0.6923 | 0.7125 | 0.8222 | 0.6261 | 0.7125 |
| Size = 5 | 394 | 364 | 39 | 13 | 23 | 78 | 0.7647 | 0.7723 | 0.7500 | 0.7611 | 0.6290 | 0.6842 | 0.7611 |
| Size = 7 | 394 | 364 | 37 | 15 | 27 | 74 | 0.7255 | 0.7327 | 0.7115 | 0.7221 | 0.8315 | 0.6379 | 0.7221 |
| Size = 10 | 394 | 364 | 38 | 14 | 23 | 78 | 0.7582 | 0.7723 | 0.7308 | 0.7515 | 0.8478 | 0.6726 | 0.7515 |
| Size = 20 | 394 | 364 | 31 | 21 | 19 | 82 | 0.7386 | 0.8119 | 0.5962 | 0.7040 | 0.7961 | 0.6078 | 0.7040 |
| Size = 5 & maxit = 500 | 394 | 364 | 39 | 13 | 27 | 74 | 0.7386 | 0.7327 | 0.7500 | 0.7413 | 0.7500 | 0.6610 | 0.7413 |
| Size = 5 & maxit = 1000 | 394 | 364 | 39 | 13 | 27 | 74 | 0.7386 | 0.7327 | 0.7500 | 0.7413 | 0.7500 | 0.6610 | 0.7413 |
| Size = 5 & decay = 0.01 | 394 | 364 | 38 | 14 | 26 | 75 | 0.7386 | 0.7426 | 0.7308 | 0.7367 | 0.7083 | 0.6552 | 0.7367 |
| Size = 5 & decay = 0.001 | 394 | 364 | 36 | 16 | 22 | 79 | 0.7516 | 0.7822 | 0.6923 | 0.7372 | 0.6923 | 0.6542 | 0.7372 |

A Note about Healthcare Data
Because the analysis seeks to accurately predict a diagnosis of diabetes, the preferred model is one that minimizes false negatives. In health care, a missed diagnosis (false negative) is generally more detrimental than a false alarm (false positive).

Why SVM and Neural Network Models
SVM and neural network models were chosen for the following reasons:
Feature Interaction
The features are most likely highly interrelated and interact with each other in complex ways. For instance, developing one chronic condition, such as high blood pressure, has been found to influence the development of subsequent chronic diseases while also affecting health markers, including the eight features in this study. SVM and neural network models have been used in the past to capture such complex interactions because of their ability to model multi-dimensional relationships.

Non-Linear Decision Boundaries
The LOESS curves between the target variable and the features, as well as the correlation matrix between features, both indicate that the majority of the relationships in the dataset are non-linear. Furthermore, interaction between features is almost certain given that all features are health markers with the potential to affect one another. SVM with a radial kernel and neural network models are both able to handle non-linear decision boundaries.

Sensitivity to Imbalanced Data
The dataset suffers from mild class imbalance, with more non-diabetic cases than diabetic ones. To address this, ROSE oversampling was applied during the pre-processing phase. While SVM and neural network models do not inherently handle imbalanced data, the oversampling balanced the class distribution of the target variable, making the training data better suited to both model types.

Results of SVM
Initially, an SVM with a linear kernel was trained as a sanity check and reference point. A radial kernel was then tested to account for potential non-linear relationships between features. The radial kernel was preferred because it achieved higher overall performance, including a higher AUC (0.8505 vs. 0.8437), precision (0.8404 vs. 0.6346), and F1 score (0.6667 vs. 0.6346), despite a slightly lower sensitivity (0.7822 vs. 0.8119). Next, the cost parameter, which controls the penalty for misclassification, was adjusted. Increasing the cost to 10 raised sensitivity to 0.8119 (from 0.7822 for the default radial model) at the expense of some specificity (0.6731 vs. 0.7115), precision (0.8283 vs. 0.8404), and AUC (0.8334 vs. 0.8505). Since minimizing false negatives is of utmost importance, the Radial & Cost = 10 model was selected as the preferred SVM. While the linear model and the Radial & Cost = 10 model both achieved a sensitivity of 0.8119, the Radial & Cost = 10 model was more balanced across the remaining metrics, including F1, precision, accuracy, specificity, and balanced accuracy.
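
As an aside, a hedged sketch of how the cost parameter could instead be selected with e1071's cross-validated tuning helper (tune.svm); this is an illustrative alternative to the manual comparison above, not the procedure that produced the table, and the object name cost_search is hypothetical.

library(e1071)
set.seed(1234)
# 10-fold cross-validation over a small grid of candidate cost values (sketch only)
cost_search <- tune.svm(factor(Outcome) ~ ., data = train_data_boosted,
                        kernel = "radial", cost = c(0.1, 1, 10))
summary(cost_search)          # cross-validation error for each cost value
cost_search$best.parameters   # cost with the lowest cross-validation error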

Results of Neural Network
An initial neural network was trained with a single hidden layer containing one unit (size = 1). The number of hidden units was then increased (size = 3, 5, 7, 10, and 20) to assess for improvement. The model with size = 5 emerged as the preferred configuration. It had a sensitivity of 0.7723, meaning it correctly identified over 77% of diabetic cases, while maintaining strong performance across the other metrics, including specificity (0.7500), accuracy (0.7647), F1 score (0.6842), and AUC (0.7611). Some configurations (e.g., size = 20 or the decay variations) achieved slightly higher sensitivity, but at the cost of precision or overall balance. Of note, additional neural network models were also trained with different values of maxit and decay; the size = 5 model nevertheless remained the preferred configuration, offering the best overall trade-off between sensitivity and specificity while maintaining a strong F1 score and good generalizability.
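
A hedged sketch of how the size comparison above could be written as a single loop, assuming the boosted training set and test set from Part VI; the helper names are illustrative, and sensitivity here is computed with Outcome = 1 (diabetic) as the positive class.

library(nnet)
set.seed(123)
sizes <- c(1, 3, 5, 7, 10, 20)
sensitivity_by_size <- sapply(sizes, function(s) {
  fit   <- nnet(Outcome ~ ., data = train_data_boosted, size = s, trace = FALSE)
  preds <- as.numeric(predict(fit, newdata = test_data, type = "raw") > 0.5)
  sum(preds == 1 & test_data$Outcome == 1) / sum(test_data$Outcome == 1)  # recall for the diabetic class
})
data.frame(size = sizes, sensitivity = sensitivity_by_size)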

Final Model Selection SVM vs NN & Conclusion

| Model | Y = 0 | Y = 1 | TP | FN | FP | TN | Accuracy | Sensitivity | Specificity | Bal. Acc. | Precision | F1 | AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SVM: Radial & Cost = 10 | 394 | 364 | 35 | 17 | 19 | 82 | 0.7647 | 0.8119 | 0.6731 | 0.7425 | 0.8283 | 0.6604 | 0.8334 |
| Neural Network: Size = 5 | 394 | 364 | 39 | 13 | 23 | 78 | 0.7647 | 0.7723 | 0.7500 | 0.7611 | 0.6290 | 0.6842 | 0.7611 |

Two models were selected for final evaluation: the SVM with a radial kernel and cost = 10, and the neural network with size = 5. Both models achieved an identical accuracy of 0.7647. With regard to sensitivity and minimizing false negatives, however, the SVM was preferred (sensitivity of 0.8119 vs. 0.7723 for the neural network); its higher sensitivity ensures better identification of diabetic patients. The neural network did exhibit higher specificity and F1 score. In clinical settings, where missing a positive diagnosis is riskier than a false positive, sensitivity is paramount, and the SVM is therefore the preferred model.

PART V: References

[^1]: Centers for Disease Control and Prevention. About Type 2 Diabetes. U.S. Department of Health and Human Services; 2023. Available at: https://www.cdc.gov/diabetes/about/about-type-2-diabetes.html

[^2]: Centers for Disease Control and Prevention. National Diabetes Statistics Report, 2023: Estimates of Diabetes and Its Burden in the United States. U.S. Department of Health and Human Services; 2023. Available at: https://www.cdc.gov/diabetes/data/statistics-report/index.html

PART VI: Code

########################################## ONLY CODE BELOW###############################################################
library(ggplot2)
library(tidyr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(fastDummies)
library(rpart)
library(rpart.plot)
library(smotefamily)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(mlbench)
library(e1071)
library(dplyr)
library(viridis)
## Loading required package: viridisLite
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(purrr)
library(knitr)
library(ROSE)
## Loaded ROSE 0.0-4
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(nnet)
library(dplyr)

#read dataset in
dm_data <- read_csv("https://raw.githubusercontent.com/greggmaloy/Data622/refs/heads/main/diabetes.csv", show_col_types = FALSE)

#visualize data
dm_data
#features
data.frame(
  Variable = names(dm_data),
  Data_Type = sapply(dm_data, class)
)
#summarize na's
dm_data %>%
  summarise(across(everything(), ~sum(is.na(.))))
# Pivot to long
dm_data_long <- pivot_longer(dm_data, cols = everything(), names_to = "Variable", values_to = "Value")

# histograms
ggplot(dm_data_long, aes(x = Value, fill = Variable)) +
  geom_histogram(bins = 30, color = "black", alpha = 0.8) +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Histograms of Each Variable", x = "Value", y = "Count") +  scale_fill_viridis_d() +
  theme(legend.position = "none")  

set.seed(123)
# Impute values using mean
dm_data_cleaned <- dm_data %>%
  mutate(
    BMI = ifelse(BMI == 0, mean(BMI[BMI != 0], na.rm = TRUE), BMI),
    BloodPressure = ifelse(BloodPressure == 0, mean(BloodPressure[BloodPressure != 0], na.rm = TRUE), BloodPressure),
    Glucose = ifelse(Glucose == 0, mean(Glucose[Glucose != 0], na.rm = TRUE), Glucose),
    Insulin = ifelse(Insulin == 0, mean(Insulin[Insulin != 0], na.rm = TRUE), Insulin),
    SkinThickness = ifelse(SkinThickness == 0, mean(SkinThickness[SkinThickness != 0], na.rm = TRUE), SkinThickness),
    DiabetesPedigreeFunction = ifelse(DiabetesPedigreeFunction == 0, mean(DiabetesPedigreeFunction[DiabetesPedigreeFunction != 0], na.rm = TRUE), DiabetesPedigreeFunction)
  )


cat("\nNumber of rows after imputing zeroes:", nrow(dm_data_cleaned), "\n")
## 
## Number of rows after imputing zeroes: 768
#visualize
dm_data_cleaned
set.seed(123)
# long format
dm_data_long_cleaned <- dm_data_cleaned %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# new updated histograms
final_histo <-ggplot(dm_data_long_cleaned, aes(x = Value, fill = Variable)) +
  geom_histogram(bins = 30, color = "black", alpha = 0.8) +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Histograms of Each Variable (After Imputation)", x = "Value", y = "Count") +
  scale_fill_viridis_d() +
  theme(legend.position = "none")

final_histo

#correlation matrix
set.seed(123)
GGally::ggpairs(
  dm_data_cleaned,
  columns = c("Glucose", "BMI", "BloodPressure", "Insulin", "Age", "Pregnancies", "DiabetesPedigreeFunction", "SkinThickness"),
  aes(color = factor(Outcome))
)

# Compute the log-odds 
logistic_model <- glm(Outcome ~ ., data = dm_data_cleaned, family = binomial())

# Extract the fitted probabilities
dm_data_cleaned$log_odds <- predict(logistic_model, type = "response")
dm_data_cleaned$log_odds <- log(dm_data_cleaned$log_odds / (1 - dm_data_cleaned$log_odds))

# Plot 
plots <- list()
for (feature in names(dm_data_cleaned)[-c(9, 10)]) {
  p <- ggplot(dm_data_cleaned, aes_string(x = feature, y = "log_odds")) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "loess", color = "blue") +
    labs(title = paste("Log-Odds vs", feature),
         x = feature,
         y = "Log-Odds DM") +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 10),
      axis.title = element_text(size = 10),
      axis.text = element_text(size = 10)
    )
  
  plots[[feature]] <- p
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Display 
library(patchwork)
wrap_plots(plots, ncol = 3)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

library(ggplot2)

# Loop through each feature and create a LOESS plot
plots <- list()
for (feature in names(dm_data_cleaned)[-9]) {  # Exclude Outcome column
  p <- ggplot(dm_data_cleaned, aes_string(x = feature, y = "Outcome")) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "loess", color = "blue") +
    labs(title = paste("LOESS Curve for", feature),
         x = feature,
         y = "Probability of Diabetes") +
    theme_minimal() +
    theme(
      plot.title = element_text(size = 10),
      axis.title = element_text(size = 10),
      axis.text = element_text(size = 10)
    )
  
  plots[[feature]] <- p
}

# Display the plots in a grid
library(patchwork)
wrap_plots(plots, ncol = 3)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

#scatterplots
set.seed(123)
# Add a row ID 
dm_data_with_id <- dm_data_cleaned %>%
  mutate(Row = row_number())

# Pivot 
dm_data_long <- dm_data_with_id %>%
  pivot_longer(cols = -Row, names_to = "Variable", values_to = "Value")

# scatterplots
ggplot(dm_data_long, aes(x = Row, y = Value, fill = Variable)) +
  geom_point(shape = 21, size = 2, color = "black") +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  scale_fill_viridis_d() +
  theme_minimal() +
  labs(title = "Scatterplots for Each Variable (After Imputation - Outlier Detection)",
       x = "Row Index", y = "Value") +
  theme(legend.position = "none")

# outliers detection using IQR
set.seed(123)

calculate_iqr_outliers_for_count <- function(df, col_name) {
  Q1 <- quantile(df[[col_name]], 0.25, na.rm = TRUE)
  Q3 <- quantile(df[[col_name]], 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  
  df %>%
    filter(.data[[col_name]] < lower_bound | .data[[col_name]] > upper_bound) %>%
    mutate(variable = col_name) %>%
    select(variable)
}


outlier_counts_cleaned <- dm_data_cleaned %>%
  select(where(is.numeric)) %>%
  names() %>%
  map_df(~calculate_iqr_outliers_for_count(dm_data_cleaned, .x)) %>%
  group_by(variable) %>%
  summarise(outlier_count = n(), .groups = 'drop')

# print 
cat("\n**Table 1: Count of Outliers in Each Numeric Variable (After Cleaning)**\n")
## 
## **Table 1: Count of Outliers in Each Numeric Variable (After Cleaning)**
kable(outlier_counts_cleaned, caption = "Count of Outliers in Each Numeric Variable (After Cleaning)")
Count of Outliers in Each Numeric Variable (After Cleaning)

| variable | outlier_count |
|---|---|
| Age | 9 |
| BMI | 8 |
| BloodPressure | 14 |
| DiabetesPedigreeFunction | 29 |
| Insulin | 164 |
| Pregnancies | 4 |
| SkinThickness | 87 |
| log_odds | 2 |
set.seed(123)
# cap outliers using iqr 
coerce_iqr <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR_val <- Q3 - Q1
  
  lower_bound <- Q1 - 1.5 * IQR_val
  upper_bound <- Q3 + 1.5 * IQR_val
  
  x <- ifelse(x < lower_bound, lower_bound, x)
  x <- ifelse(x > upper_bound, upper_bound, x)
  return(x)
}

dm_data_coerced <- dm_data_cleaned %>%
  mutate(
   
    Glucose = coerce_iqr(Glucose),
    BloodPressure = coerce_iqr(BloodPressure),
    SkinThickness = coerce_iqr(SkinThickness),
    BMI = coerce_iqr(BMI),
    Pregnancies = coerce_iqr(Pregnancies),
    Insulin = coerce_iqr(Insulin),
    DiabetesPedigreeFunction = coerce_iqr(DiabetesPedigreeFunction),
    Age = coerce_iqr(Age)
  )
# histograms again, this time with the outliers and NA values accounted for
set.seed(123)

dm_data_cleaned_with_id <- dm_data_cleaned %>% mutate(Row = row_number())
dm_data_coerced_with_id <- dm_data_coerced %>% mutate(Row = row_number())

# Pivot to long format
dm_data_cleaned_long <- dm_data_cleaned_with_id %>%
  pivot_longer(cols = -Row, names_to = "Variable", values_to = "Value") %>%
  mutate(Dataset = "Before Coercion")

dm_data_coerced_long <- dm_data_coerced_with_id %>%
  pivot_longer(cols = -Row, names_to = "Variable", values_to = "Value") %>%
  mutate(Dataset = "After Coercion")

combined_data_long <- bind_rows(dm_data_cleaned_long, dm_data_coerced_long)


# histograms before and after coercion
ggplot(combined_data_long, aes(x = Value, fill = Dataset)) +
  geom_histogram(bins = 30, position = "identity", alpha = 0.6, color = "black") +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  scale_fill_viridis_d(option = "C") +
  theme_minimal() +
  labs(title = "Before vs After Coercion of Outliers",
       x = "Value",
       y = "Count") +
  theme(legend.position = "top")

library(dplyr)
set.seed(123)
# min max scaling in prep for SVM and neural networks
min_max_scaler <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Apply Min-Max to all features except outcome
dm_data_scaled <- dm_data_coerced %>%
  mutate(across(where(is.numeric) & !Outcome, min_max_scaler))
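# note: the derived log_odds column (created earlier from a model that saw Outcome) is still in this data frame,
# so it is scaled here and carried into the classifiers below; dropping it before modeling would avoid potential leakage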

# visualize
head(dm_data_scaled)
# checking class distribution counts and %
table(dm_data$Outcome)
## 
##   0   1 
## 500 268
prop.table(table(dm_data$Outcome))
## 
##         0         1 
## 0.6510417 0.3489583
# ROSE oversampling to balance the target variable
set.seed(123)  # for reproducibility
dm_data_boosted <- ROSE(Outcome ~ ., data = dm_data_scaled, seed = 123)$data
table(dm_data_boosted$Outcome)
## 
##   0   1 
## 394 374
set.seed(123)  

# test train 80/20
train_index <- createDataPartition(dm_data_scaled$Outcome, p = 0.8, list = FALSE)
train_data <- dm_data_scaled[train_index, ]
test_data <- dm_data_scaled[-train_index, ]
# Boost Oversampling on the training data only!!!
train_data_boosted <- ROSE(Outcome ~ ., data = train_data, seed = 123)$data
table(train_data_boosted$Outcome)
## 
##   0   1 
## 319 296
# train svm
#svm_model <- svm(factor(Outcome) ~ ., data = train_data_boosted, kernel = "radial", probability = TRUE)
#summary(svm_model)
set.seed(1234)


# svm linear kernel
svm_model <- svm(factor(Outcome) ~ ., data = train_data_boosted, kernel = "linear", probability = TRUE)

# Predict 
svm_predictions <- predict(svm_model, newdata = test_data)

# Ensure factor levels
svm_predictions <- factor(svm_predictions, levels = c(0, 1))
test_labels <- factor(test_data$Outcome, levels = c(0, 1))

# Confusion Matrix
conf_matrix <- confusionMatrix(svm_predictions, test_labels)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 82 18
##          1 19 34
##                                           
##                Accuracy : 0.7582          
##                  95% CI : (0.6824, 0.8237)
##     No Information Rate : 0.6601          
##     P-Value [Acc > NIR] : 0.005655        
##                                           
##                   Kappa : 0.4636          
##                                           
##  Mcnemar's Test P-Value : 1.000000        
##                                           
##             Sensitivity : 0.8119          
##             Specificity : 0.6538          
##          Pos Pred Value : 0.8200          
##          Neg Pred Value : 0.6415          
##              Prevalence : 0.6601          
##          Detection Rate : 0.5359          
##    Detection Prevalence : 0.6536          
##       Balanced Accuracy : 0.7329          
##                                           
##        'Positive' Class : 0               
## 
precision <- posPredValue(svm_predictions, test_labels, positive = "1")
recall <- sensitivity(svm_predictions, test_labels, positive = "1")

# F1 score
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 1): %.4f\n", f1_score))
## 
## F1 Score (class = 1): 0.6476
#ROC Curve & AUC

svm_probabilities <- predict(svm_model, newdata = test_data, probability = TRUE)
svm_prob_df <- attr(svm_probabilities, "probabilities")
probs_class1 <- svm_prob_df[, "1"]

# ROC
roc_curve <- roc(test_data$Outcome, probs_class1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, col = "blue", main = "SVM ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

# Print AUC
auc_val <- auc(roc_curve)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.8450
set.seed(1234)


# radial this time
svm_model <- svm(factor(Outcome) ~ ., data = train_data_boosted, kernel = "radial", probability = TRUE)


svm_predictions <- predict(svm_model, newdata = test_data)


svm_predictions <- factor(svm_predictions, levels = c(0, 1))
test_labels <- factor(test_data$Outcome, levels = c(0, 1))

# Confusion Matrix
conf_matrix <- confusionMatrix(svm_predictions, test_labels)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 84 18
##          1 17 34
##                                           
##                Accuracy : 0.7712          
##                  95% CI : (0.6965, 0.8352)
##     No Information Rate : 0.6601          
##     P-Value [Acc > NIR] : 0.001883        
##                                           
##                   Kappa : 0.4878          
##                                           
##  Mcnemar's Test P-Value : 1.000000        
##                                           
##             Sensitivity : 0.8317          
##             Specificity : 0.6538          
##          Pos Pred Value : 0.8235          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.6601          
##          Detection Rate : 0.5490          
##    Detection Prevalence : 0.6667          
##       Balanced Accuracy : 0.7428          
##                                           
##        'Positive' Class : 0               
## 
precision <- posPredValue(svm_predictions, test_labels, positive = "1")
recall <- sensitivity(svm_predictions, test_labels, positive = "1")

# F1 score
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 1): %.4f\n", f1_score))
## 
## F1 Score (class = 1): 0.6602
# ROC Curve & AUC
# 
svm_probabilities <- predict(svm_model, newdata = test_data, probability = TRUE)
svm_prob_df <- attr(svm_probabilities, "probabilities")
probs_class1 <- svm_prob_df[, "1"]

# ROC
roc_curve <- roc(test_data$Outcome, probs_class1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, col = "blue", main = "SVM ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

# Print AUC
auc_val <- auc(roc_curve)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.8494
library(pROC)

# Predict probabilities 
svm_probabilities <- predict(svm_model, newdata = test_data, probability = TRUE)
svm_prob_df <- attr(svm_probabilities, "probabilities")
probs_class1 <- svm_prob_df[, "1"]

# Compute ROC
roc_curve <- roc(test_data$Outcome, probs_class1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC 
plot(roc_curve, col = "blue", main = "SVM ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")  # Random guess line

auc(roc_curve)
## Area under the curve: 0.8494
set.seed(1234)


# radial and cost=0.1
svm_model <- svm(factor(Outcome) ~ ., data = train_data_boosted, kernel = "radial", cost = 0.1, probability = TRUE)


svm_predictions <- predict(svm_model, newdata = test_data)


svm_predictions <- factor(svm_predictions, levels = c(0, 1))
test_labels <- factor(test_data$Outcome, levels = c(0, 1))

# Confusion Matrix
conf_matrix <- confusionMatrix(svm_predictions, test_labels)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 79 18
##          1 22 34
##                                           
##                Accuracy : 0.7386          
##                  95% CI : (0.6615, 0.8062)
##     No Information Rate : 0.6601          
##     P-Value [Acc > NIR] : 0.02309         
##                                           
##                   Kappa : 0.428           
##                                           
##  Mcnemar's Test P-Value : 0.63526         
##                                           
##             Sensitivity : 0.7822          
##             Specificity : 0.6538          
##          Pos Pred Value : 0.8144          
##          Neg Pred Value : 0.6071          
##              Prevalence : 0.6601          
##          Detection Rate : 0.5163          
##    Detection Prevalence : 0.6340          
##       Balanced Accuracy : 0.7180          
##                                           
##        'Positive' Class : 0               
## 
#F1
precision <- posPredValue(svm_predictions, test_labels, positive = "1")
recall <- sensitivity(svm_predictions, test_labels, positive = "1")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 1): %.4f\n", f1_score))
## 
## F1 Score (class = 1): 0.6296
# ROC Curve & AUC
# Predict probabilities
svm_probabilities <- predict(svm_model, newdata = test_data, probability = TRUE)
svm_prob_df <- attr(svm_probabilities, "probabilities")
probs_class1 <- svm_prob_df[, "1"]

roc_curve <- roc(test_data$Outcome, probs_class1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, col = "blue", main = "SVM ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

auc_val <- auc(roc_curve)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.8420
set.seed(1234)


# radial and cost = 10
svm_model <- svm(factor(Outcome) ~ ., data = train_data_boosted, kernel = "radial", cost = 10, probability = TRUE)


svm_predictions <- predict(svm_model, newdata = test_data)


svm_predictions <- factor(svm_predictions, levels = c(0, 1))
test_labels <- factor(test_data$Outcome, levels = c(0, 1))

# Confusion Matrix
conf_matrix <- confusionMatrix(svm_predictions, test_labels)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 83 18
##          1 18 34
##                                           
##                Accuracy : 0.7647          
##                  95% CI : (0.6894, 0.8294)
##     No Information Rate : 0.6601          
##     P-Value [Acc > NIR] : 0.003318        
##                                           
##                   Kappa : 0.4756          
##                                           
##  Mcnemar's Test P-Value : 1.000000        
##                                           
##             Sensitivity : 0.8218          
##             Specificity : 0.6538          
##          Pos Pred Value : 0.8218          
##          Neg Pred Value : 0.6538          
##              Prevalence : 0.6601          
##          Detection Rate : 0.5425          
##    Detection Prevalence : 0.6601          
##       Balanced Accuracy : 0.7378          
##                                           
##        'Positive' Class : 0               
## 
# --- F1 Score Calculation for Class 1 (Positive Class) ---

precision <- posPredValue(svm_predictions, test_labels, positive = "1")
recall <- sensitivity(svm_predictions, test_labels, positive = "1")

# F1 score
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 1): %.4f\n", f1_score))
## 
## F1 Score (class = 1): 0.6538
svm_probabilities <- predict(svm_model, newdata = test_data, probability = TRUE)
svm_prob_df <- attr(svm_probabilities, "probabilities")
probs_class1 <- svm_prob_df[, "1"]


roc_curve <- roc(test_data$Outcome, probs_class1)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve, col = "blue", main = "SVM ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

auc_val <- auc(roc_curve)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.8454
# Train a simple Neural Network
set.seed(123)

nn_model <- nnet(
  Outcome ~ ., 
  data = train_data_boosted, 
  size = 1         # number of neurons in hidden layer
  # maxit = 500,
  # decay = 0.01,
  # trace = FALSE
)
## # weights:  12
## initial  value 170.393763 
## iter  10 value 98.262402
## iter  20 value 97.002545
## iter  30 value 96.804595
## iter  40 value 96.710389
## iter  50 value 96.674100
## iter  60 value 96.599528
## iter  70 value 96.559544
## final  value 96.558566 
## converged
# Predict probabilities on test set
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")

# Convert probabilities into class predictions
nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))

# Make sure test labels match
test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))

# Confusion Matrix
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  79  16
##        Yes 22  36
##                                           
##                Accuracy : 0.7516          
##                  95% CI : (0.6754, 0.8179)
##     No Information Rate : 0.6601          
##     P-Value [Acc > NIR] : 0.009329        
##                                           
##                   Kappa : 0.4616          
##                                           
##  Mcnemar's Test P-Value : 0.417304        
##                                           
##             Sensitivity : 0.7822          
##             Specificity : 0.6923          
##          Pos Pred Value : 0.8316          
##          Neg Pred Value : 0.6207          
##              Prevalence : 0.6601          
##          Detection Rate : 0.5163          
##    Detection Prevalence : 0.6209          
##       Balanced Accuracy : 0.7372          
##                                           
##        'Positive' Class : No              
## 
# --- F1 Score Calculation for Class "Yes" (Positive Class = 1) ---
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))
## 
## F1 Score (class = 'Yes'): 0.6545
# --- ROC Curve & AUC ---
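# note: the ROC below is built from the hard class predictions; passing nn_probs instead would give a probability-based curve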
roc_nn <- roc(test_labels, as.numeric(nn_predictions == "Yes"))
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

# Print AUC
auc_val <- auc(roc_nn)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.7372
#neural network initial
set.seed(123)

nn_model <- nnet(
  Outcome ~ ., 
  data = train_data_boosted, 
  size = 3       

)
## # weights:  34
## initial  value 218.660078 
## iter  10 value 100.423834
## iter  20 value 96.411732
## iter  30 value 95.342117
## iter  40 value 90.192556
## iter  50 value 87.174369
## iter  60 value 85.471492
## iter  70 value 84.020839
## iter  80 value 83.322474
## iter  90 value 82.756079
## iter 100 value 82.413573
## final  value 82.413573 
## stopped after 100 iterations
# Predict probabilities 
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")

# Convert probabilities to class prediction
nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))

# labels must match
test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))

# Confusion Matrix
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  81  15
##        Yes 20  37
##                                           
##                Accuracy : 0.7712          
##                  95% CI : (0.6965, 0.8352)
##     No Information Rate : 0.6601          
##     P-Value [Acc > NIR] : 0.001883        
##                                           
##                   Kappa : 0.5018          
##                                           
##  Mcnemar's Test P-Value : 0.498962        
##                                           
##             Sensitivity : 0.8020          
##             Specificity : 0.7115          
##          Pos Pred Value : 0.8438          
##          Neg Pred Value : 0.6491          
##              Prevalence : 0.6601          
##          Detection Rate : 0.5294          
##    Detection Prevalence : 0.6275          
##       Balanced Accuracy : 0.7568          
##                                           
##        'Positive' Class : No              
## 
# F1
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))
## 
## F1 Score (class = 'Yes'): 0.6789
#  ROC Curve & AUC
roc_nn <- roc(test_labels, as.numeric(nn_predictions == "Yes"))
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

# Print AUC
auc_val <- auc(roc_nn)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.7568
#this time size= 5
set.seed(123)

nn_model <- nnet(
  Outcome ~ ., 
  data = train_data_boosted, 
  size = 5        
)
## # weights:  56
## initial  value 172.943257 
## iter  10 value 98.821023
## iter  20 value 93.446649
## iter  30 value 88.696700
## iter  40 value 87.828847
## iter  50 value 87.114495
## iter  60 value 86.327113
## iter  70 value 85.668788
## iter  80 value 85.241567
## iter  90 value 84.978636
## iter 100 value 84.740028
## final  value 84.740028 
## stopped after 100 iterations
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")


nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))


test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))

# Confusion Matrix
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  75  13
##        Yes 26  39
##                                          
##                Accuracy : 0.7451         
##                  95% CI : (0.6684, 0.812)
##     No Information Rate : 0.6601         
##     P-Value [Acc > NIR] : 0.01490        
##                                          
##                   Kappa : 0.4644         
##                                          
##  Mcnemar's Test P-Value : 0.05466        
##                                          
##             Sensitivity : 0.7426         
##             Specificity : 0.7500         
##          Pos Pred Value : 0.8523         
##          Neg Pred Value : 0.6000         
##              Prevalence : 0.6601         
##          Detection Rate : 0.4902         
##    Detection Prevalence : 0.5752         
##       Balanced Accuracy : 0.7463         
##                                          
##        'Positive' Class : No             
## 
#  F1 
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))
## 
## F1 Score (class = 'Yes'): 0.6667
# ROC Curve & AUC 
roc_nn <- roc(test_labels, as.numeric(nn_predictions == "Yes"))
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

# Print AUC
auc_val <- auc(roc_nn)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.7463
## this time size = 7   
set.seed(123)

nn_model <- nnet(
  Outcome ~ ., 
  data = train_data_boosted, 
  size = 7      
)
## # weights:  78
## initial  value 152.940178 
## iter  10 value 98.024816
## iter  20 value 88.864770
## iter  30 value 82.740575
## iter  40 value 77.669749
## iter  50 value 72.193129
## iter  60 value 69.543274
## iter  70 value 66.989799
## iter  80 value 65.626394
## iter  90 value 64.985152
## iter 100 value 64.567175
## final  value 64.567175 
## stopped after 100 iterations
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")


nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))


test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))

# Confusion Matrix
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  68  12
##        Yes 33  40
##                                           
##                Accuracy : 0.7059          
##                  95% CI : (0.6269, 0.7767)
##     No Information Rate : 0.6601          
##     P-Value [Acc > NIR] : 0.133101        
##                                           
##                   Kappa : 0.403           
##                                           
##  Mcnemar's Test P-Value : 0.002869        
##                                           
##             Sensitivity : 0.6733          
##             Specificity : 0.7692          
##          Pos Pred Value : 0.8500          
##          Neg Pred Value : 0.5479          
##              Prevalence : 0.6601          
##          Detection Rate : 0.4444          
##    Detection Prevalence : 0.5229          
##       Balanced Accuracy : 0.7212          
##                                           
##        'Positive' Class : No              
## 
# F1 
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))
## 
## F1 Score (class = 'Yes'): 0.6400
# ROC Curve & AUC
roc_nn <- roc(test_labels, as.numeric(nn_predictions == "Yes"))
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

# Print AUC
auc_val <- auc(roc_nn)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.7212
## this time size = 10 
set.seed(123)

nn_model <- nnet(
  Outcome ~ ., 
  data = train_data_boosted, 
  size = 10     
)
## # weights:  111
## initial  value 178.943974 
## iter  10 value 97.031731
## iter  20 value 90.655966
## iter  30 value 83.644821
## iter  40 value 76.408499
## iter  50 value 68.223722
## iter  60 value 61.980388
## iter  70 value 60.671453
## iter  80 value 59.477138
## iter  90 value 58.969516
## iter 100 value 58.764513
## final  value 58.764513 
## stopped after 100 iterations
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")


nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))


test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))

# Confusion Matrix
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  71  13
##        Yes 30  39
##                                           
##                Accuracy : 0.719           
##                  95% CI : (0.6407, 0.7886)
##     No Information Rate : 0.6601          
##     P-Value [Acc > NIR] : 0.07182         
##                                           
##                   Kappa : 0.4197          
##                                           
##  Mcnemar's Test P-Value : 0.01469         
##                                           
##             Sensitivity : 0.7030          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.8452          
##          Neg Pred Value : 0.5652          
##              Prevalence : 0.6601          
##          Detection Rate : 0.4641          
##    Detection Prevalence : 0.5490          
##       Balanced Accuracy : 0.7265          
##                                           
##        'Positive' Class : No              
## 
# F1 Score 
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))
## 
## F1 Score (class = 'Yes'): 0.6446
# ROC Curve & AUC 
roc_nn <- roc(test_labels, as.numeric(nn_predictions == "Yes"))
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

# Print AUC
auc_val <- auc(roc_nn)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.7265
## this time size = 20
set.seed(123)

nn_model <- nnet(
  Outcome ~ ., 
  data = train_data_boosted, 
  size = 20      
)
## # weights:  221
## initial  value 180.104898 
## iter  10 value 97.621738
## iter  20 value 90.450339
## iter  30 value 82.485305
## iter  40 value 69.128451
## iter  50 value 52.111810
## iter  60 value 43.067824
## iter  70 value 38.521076
## iter  80 value 37.044397
## iter  90 value 36.330469
## iter 100 value 36.083502
## final  value 36.083502 
## stopped after 100 iterations
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")


nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))


test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))


conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  69  14
##        Yes 32  38
##                                         
##                Accuracy : 0.6993        
##                  95% CI : (0.62, 0.7707)
##     No Information Rate : 0.6601        
##     P-Value [Acc > NIR] : 0.17421       
##                                         
##                   Kappa : 0.3819        
##                                         
##  Mcnemar's Test P-Value : 0.01219       
##                                         
##             Sensitivity : 0.6832        
##             Specificity : 0.7308        
##          Pos Pred Value : 0.8313        
##          Neg Pred Value : 0.5429        
##              Prevalence : 0.6601        
##          Detection Rate : 0.4510        
##    Detection Prevalence : 0.5425        
##       Balanced Accuracy : 0.7070        
##                                         
##        'Positive' Class : No            
## 
# F1 Score 
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))
## 
## F1 Score (class = 'Yes'): 0.6230
# ROC Curve & AUC
roc_nn <- roc(test_labels, as.numeric(nn_predictions == "Yes"))
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

# Print AUC
auc_val <- auc(roc_nn)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.7070
## this time size = 5 and maxit=500
set.seed(123)

nn_model <- nnet(
  Outcome ~ ., 
  data = train_data_boosted, 
  size = 5,      
  maxit = 500

)
## # weights:  56
## initial  value 172.943257 
## iter  10 value 98.821023
## iter  20 value 93.446649
## iter  30 value 88.696700
## iter  40 value 87.828847
## iter  50 value 87.114495
## iter  60 value 86.327113
## iter  70 value 85.668788
## iter  80 value 85.241567
## iter  90 value 84.978636
## iter 100 value 84.740028
## iter 110 value 84.314935
## iter 120 value 84.158814
## iter 130 value 84.114142
## iter 140 value 84.036798
## iter 150 value 83.869205
## iter 160 value 83.796893
## iter 170 value 83.597416
## iter 180 value 83.515043
## iter 190 value 83.343145
## iter 200 value 83.124413
## iter 210 value 83.008431
## iter 220 value 82.881592
## iter 230 value 82.836411
## iter 240 value 82.806466
## iter 250 value 82.772590
## iter 260 value 82.763363
## iter 270 value 82.742984
## iter 280 value 82.730980
## iter 290 value 82.726194
## iter 300 value 82.711006
## iter 310 value 82.698620
## iter 320 value 82.694094
## iter 330 value 82.678021
## iter 340 value 82.637182
## iter 350 value 82.635818
## iter 360 value 82.626370
## iter 370 value 82.619551
## iter 380 value 82.611546
## iter 390 value 82.606563
## iter 400 value 82.603729
## iter 410 value 82.599506
## iter 420 value 82.596834
## iter 430 value 82.582355
## iter 440 value 82.574533
## iter 450 value 82.559039
## iter 460 value 82.547956
## iter 470 value 82.543840
## iter 480 value 82.534981
## iter 490 value 82.530448
## iter 500 value 82.527157
## final  value 82.527157 
## stopped after 500 iterations
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")


nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))


test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))

conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  75  13
##        Yes 26  39
##                                          
##                Accuracy : 0.7451         
##                  95% CI : (0.6684, 0.812)
##     No Information Rate : 0.6601         
##     P-Value [Acc > NIR] : 0.01490        
##                                          
##                   Kappa : 0.4644         
##                                          
##  Mcnemar's Test P-Value : 0.05466        
##                                          
##             Sensitivity : 0.7426         
##             Specificity : 0.7500         
##          Pos Pred Value : 0.8523         
##          Neg Pred Value : 0.6000         
##              Prevalence : 0.6601         
##          Detection Rate : 0.4902         
##    Detection Prevalence : 0.5752         
##       Balanced Accuracy : 0.7463         
##                                          
##        'Positive' Class : No             
## 
# -f1
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))
## 
## F1 Score (class = 'Yes'): 0.6667
# ROC Curve & AUC 
roc_nn <- roc(test_labels, as.numeric(nn_predictions == "Yes"))
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

# Print AUC
auc_val <- auc(roc_nn)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.7463
## this time size = 5 and maxit=1000
set.seed(123)

nn_model <- nnet(
  Outcome ~ ., 
  data = train_data_boosted, 
  size = 5,      
  maxit = 1000

)
## # weights:  56
## initial  value 172.943257 
## iter  10 value 98.821023
## iter  20 value 93.446649
## iter  30 value 88.696700
## iter  40 value 87.828847
## iter  50 value 87.114495
## iter  60 value 86.327113
## iter  70 value 85.668788
## iter  80 value 85.241567
## iter  90 value 84.978636
## iter 100 value 84.740028
## iter 110 value 84.314935
## iter 120 value 84.158814
## iter 130 value 84.114142
## iter 140 value 84.036798
## iter 150 value 83.869205
## iter 160 value 83.796893
## iter 170 value 83.597416
## iter 180 value 83.515043
## iter 190 value 83.343145
## iter 200 value 83.124413
## iter 210 value 83.008431
## iter 220 value 82.881592
## iter 230 value 82.836411
## iter 240 value 82.806466
## iter 250 value 82.772590
## iter 260 value 82.763363
## iter 270 value 82.742984
## iter 280 value 82.730980
## iter 290 value 82.726194
## iter 300 value 82.711006
## iter 310 value 82.698620
## iter 320 value 82.694094
## iter 330 value 82.678021
## iter 340 value 82.637182
## iter 350 value 82.635818
## iter 360 value 82.626370
## iter 370 value 82.619551
## iter 380 value 82.611546
## iter 390 value 82.606563
## iter 400 value 82.603729
## iter 410 value 82.599506
## iter 420 value 82.596834
## iter 430 value 82.582355
## iter 440 value 82.574533
## iter 450 value 82.559039
## iter 460 value 82.547956
## iter 470 value 82.543840
## iter 480 value 82.534981
## iter 490 value 82.530448
## iter 500 value 82.527157
## iter 510 value 82.525797
## iter 520 value 82.523118
## iter 530 value 82.516482
## iter 540 value 82.500384
## iter 550 value 82.486391
## iter 560 value 82.467402
## iter 570 value 82.463631
## iter 580 value 82.459413
## iter 590 value 82.446863
## iter 600 value 82.440162
## iter 610 value 82.436340
## iter 620 value 82.428994
## iter 630 value 82.425332
## iter 640 value 82.423424
## iter 650 value 82.421382
## iter 660 value 82.418861
## iter 670 value 82.416263
## iter 680 value 82.411440
## iter 690 value 82.411228
## iter 700 value 82.410378
## iter 710 value 82.408850
## iter 720 value 82.407687
## iter 730 value 82.405354
## iter 740 value 82.403672
## iter 750 value 82.401724
## iter 760 value 82.391967
## iter 770 value 82.381427
## iter 780 value 82.359972
## iter 790 value 82.346906
## final  value 82.346107 
## converged
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")

#
nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))


test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))

# Confusion Matrix
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  75  13
##        Yes 26  39
##                                          
##                Accuracy : 0.7451         
##                  95% CI : (0.6684, 0.812)
##     No Information Rate : 0.6601         
##     P-Value [Acc > NIR] : 0.01490        
##                                          
##                   Kappa : 0.4644         
##                                          
##  Mcnemar's Test P-Value : 0.05466        
##                                          
##             Sensitivity : 0.7426         
##             Specificity : 0.7500         
##          Pos Pred Value : 0.8523         
##          Neg Pred Value : 0.6000         
##              Prevalence : 0.6601         
##          Detection Rate : 0.4902         
##    Detection Prevalence : 0.5752         
##       Balanced Accuracy : 0.7463         
##                                          
##        'Positive' Class : No             
## 
# F1 
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))
## 
## F1 Score (class = 'Yes'): 0.6667
#ROC Curve & AUC
roc_nn <- roc(test_labels, as.numeric(nn_predictions == "Yes"))
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

# Print AUC
auc_val <- auc(roc_nn)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.7463
## this time size = 5 and decay=0.01
nn_model <- nnet(
  Outcome ~ ., 
  data = train_data_boosted, 
  size = 5,      
  #maxit = 1000,
  decay = 0.01
  # trace = FALSE
)
## # weights:  56
## initial  value 168.936570 
## iter  10 value 98.728650
## iter  20 value 94.434797
## iter  30 value 92.024197
## iter  40 value 91.065216
## iter  50 value 90.202569
## iter  60 value 88.304614
## iter  70 value 86.525344
## iter  80 value 85.710244
## iter  90 value 85.069997
## iter 100 value 84.193195
## final  value 84.193195 
## stopped after 100 iterations
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")

nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))

test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))


conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  77  13
##        Yes 24  39
##                                           
##                Accuracy : 0.7582          
##                  95% CI : (0.6824, 0.8237)
##     No Information Rate : 0.6601          
##     P-Value [Acc > NIR] : 0.005655        
##                                           
##                   Kappa : 0.4874          
##                                           
##  Mcnemar's Test P-Value : 0.100178        
##                                           
##             Sensitivity : 0.7624          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.8556          
##          Neg Pred Value : 0.6190          
##              Prevalence : 0.6601          
##          Detection Rate : 0.5033          
##    Detection Prevalence : 0.5882          
##       Balanced Accuracy : 0.7562          
##                                           
##        'Positive' Class : No              
## 
# f1
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))
## 
## F1 Score (class = 'Yes'): 0.6783
roc_nn <- roc(test_labels, as.numeric(nn_predictions == "Yes"))
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

# Print AUC
auc_val <- auc(roc_nn)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.7562
## this time size = 5 and decay=0.001
set.seed(123)

nn_model <- nnet(
  Outcome ~ ., 
  data = train_data_boosted, 
  size = 5,      
 # maxit = 1000,
  decay = 0.001
  # trace = FALSE
)
## # weights:  56
## initial  value 172.952656 
## iter  10 value 99.139488
## iter  20 value 94.689617
## iter  30 value 89.119111
## iter  40 value 88.083046
## iter  50 value 87.114890
## iter  60 value 84.435316
## iter  70 value 82.477161
## iter  80 value 80.278521
## iter  90 value 77.958373
## iter 100 value 77.036219
## final  value 77.036219 
## stopped after 100 iterations
nn_probs <- predict(nn_model, newdata = test_data, type = "raw")


nn_predictions <- ifelse(nn_probs > 0.5, "Yes", "No")
nn_predictions <- factor(nn_predictions, levels = c("No", "Yes"))


test_labels <- factor(test_data$Outcome, levels = c(0,1), labels = c("No", "Yes"))

# Confusion Matrix
conf_matrix <- confusionMatrix(nn_predictions, test_labels)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  72  11
##        Yes 29  41
##                                           
##                Accuracy : 0.7386          
##                  95% CI : (0.6615, 0.8062)
##     No Information Rate : 0.6601          
##     P-Value [Acc > NIR] : 0.02309         
##                                           
##                   Kappa : 0.4625          
##                                           
##  Mcnemar's Test P-Value : 0.00719         
##                                           
##             Sensitivity : 0.7129          
##             Specificity : 0.7885          
##          Pos Pred Value : 0.8675          
##          Neg Pred Value : 0.5857          
##              Prevalence : 0.6601          
##          Detection Rate : 0.4706          
##    Detection Prevalence : 0.5425          
##       Balanced Accuracy : 0.7507          
##                                           
##        'Positive' Class : No              
## 
#  F1 
precision <- posPredValue(nn_predictions, test_labels, positive = "Yes")
recall <- sensitivity(nn_predictions, test_labels, positive = "Yes")
f1_score <- 2 * (precision * recall) / (precision + recall)
cat(sprintf("\nF1 Score (class = 'Yes'): %.4f\n", f1_score))
## 
## F1 Score (class = 'Yes'): 0.6721
# ROC Curve & AU
roc_nn <- roc(test_labels, as.numeric(nn_predictions == "Yes"))
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_nn, col = "darkgreen", main = "Neural Network ROC Curve")
abline(a = 0, b = 1, lty = 2, col = "red")

# Print AUC
auc_val <- auc(roc_nn)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.7507