---
title: 'Project Two: Supervised Classification'
author: 'Jeff Delva'
date: "October 29, 2025"
output:
  html_document:
    toc: yes
    toc_float: yes
    toc_depth: 4
    fig_width: 8
    fig_height: 5
    fig_caption: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
    highlight: tango
---

```{css, echo = FALSE}
h1.title {
  font-size: 24px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
}
h4.author, h4.date {
  font-size: 18px;
  font-weight: bold;
  font-family: "Times New Roman", Times, serif;
  color: DarkBlue;
  text-align: center;
}
h1 {
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: center;
}
h2 {
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}
h3 {
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}
.header-section-number::after {
  content: ".";
}
```

```{r setup, include=FALSE}
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("pander")) {
   install.packages("pander")
   library(pander)
}
if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)
}
if (!require("neuralnet")) {
   install.packages("neuralnet")
   library(neuralnet)
}

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, results = TRUE, message = FALSE, comment = NA )
```

# Introduction

This report details the development of two types of supervised classification models to predict diabetes status: logistic regression and a single-layer neural network (perceptron). The project is divided into three parts:

1.  **Part 1:** Data loading, exploratory data analysis (EDA) and preprocessing.
2.  **Part 2:** Development of three logistic regression models and one perceptron model.
3.  **Part 3:** Comparison of all four models using ROC-AUC analysis to determine the best-performing model.

-----

# Part 1: Data Preparation and EDA

The first step in any modeling process is to load, understand, and clean the data. This section covers our data preparation and feature engineering.

## Data Loading and Initial Inspection

```{r load-data}
setwd("/Users/jeffery/Library/Mobile Documents/com~apple~CloudDocs/Documents/Documents - jMacP/WCUPA/Classes/Fall 2025/STA551/Project 2/Data")
# Load the CSV file
diabetes.data <- read.csv("diabetes_prediction_dataset.csv")
```

After loading the data, we inspect its dimensions (`dim()`) and structure (`str()`) to understand the variables and their types.

```{r initial-review}
# Display the dimensions (rows, columns) of the data
dim(diabetes.data)

# Structure of the data including variable names and types
str(diabetes.data)
```

`summary()` provides a quick overview of each variable.

```{r summary}
# Summary of each variable
summary(diabetes.data)
```

## Data Preprocessing

The next step is to clean and prepare the data for modeling. This involves handling missing values and ensuring all variables are in the correct format.

### Missing Value Handling

To check for missing values, we use `colSums(is.na(diabetes.data))` to count the number of `NA` values in each column.

```{r missing-values}
# Check for missing (NA) values in each column
colSums(is.na(diabetes.data))

```

**Observation**: The output shows 0 missing values for all columns, so no adjustment is needed.

### Feature Encoding

The dataset contains several character-based variables (`gender`, `smoking_history`) that need to be converted to factors for R's modeling functions (like `glm()`) to interpret them correctly as categorical predictors.

The target variable, `diabetes`, is also converted to a factor with clear 'Yes'/'No' labels for better interpretability in our results.

A frequency table of `gender` shows that the 'Other' category has very few observations (only 18 of 100,000). To prevent model instability, we remove these rows and then use `droplevels()` to drop 'Other' from the factor levels.
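
A quick frequency count makes this concrete (`gender` is still a character vector at this point, which `table()` handles directly):

```{r gender-table}
# Count observations in each gender category; 'Other' is very rare
table(diabetes.data$gender)
```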

```{r factor-conversion}
# Convert categorical string variables to factors
diabetes.data$gender <- as.factor(diabetes.data$gender)
diabetes.data$smoking_history <- as.factor(diabetes.data$smoking_history)

# Convert the target variable 'diabetes' to a factor
diabetes.data$diabetes <- factor(diabetes.data$diabetes,
                                 levels = c(0, 1),
                                 labels = c("No", "Yes"))

# The 'Other' value in 'gender' has few observations so we remove it for model stability.
diabetes.data <- diabetes.data[diabetes.data$gender != "Other", ]

# Remove 'Other' level from the factor
diabetes.data$gender <- droplevels(diabetes.data$gender)

# Check the structure to confirm all changes have been applied
str(diabetes.data)
```

-----

# Part 2: Model Development

Now we develop the candidate models. This involves building three `glm` models and one `neuralnet` model on the **full dataset** so that their performance can be compared directly.

## Logistic Regression (Baseline)

We will build three logistic regression models to serve as our baseline and for comparison.
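
Each model estimates the log-odds of diabetes as a linear function of its predictors,

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \qquad p = P(\text{diabetes} = \text{Yes}),$$

so the fitted coefficients are directly interpretable as changes in the log-odds.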

1.  **`reducedModel`**: A simple model with only the variables we hypothesize are most clinically significant.
2.  **`fullModel`**: A complex model that includes all available predictors.
3.  **`forwards`**: A model selected by forward stepwise selection, starting from the `reducedModel` and adding predictors from the `fullModel` whenever doing so lowers the AIC (defined below).
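
The Akaike Information Criterion balances goodness of fit against model complexity,

$$\mathrm{AIC} = 2k - 2\ln\hat{L},$$

where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood; `step()` stops once no remaining predictor lowers the AIC.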

```{r models}
# Define a reduced model with variables we assume are significant
reducedModel <- glm(diabetes ~ age + bmi + blood_glucose_level + HbA1c_level,
                    family = binomial(link = logit),
                    data = diabetes.data)

# Define the full model with all variables
fullModel <- glm(diabetes ~ .,
                 family = binomial(link = logit),
                 data = diabetes.data)

# Use forward selection to find the best model between reduced and full
# trace = FALSE hides the step-by-step output of the selection process
forwards <- step(reducedModel,
                 scope = list(lower = formula(reducedModel), upper = formula(fullModel)),
                 direction = "forward",
                 trace = FALSE)

# Display the summary of the final, forward-selected model
summary(forwards)
```

## Perceptron (Single-Layer Neural Network)

Neural networks are sensitive to the scale of the input data and require all inputs to be numeric. Before building our perceptron, we must preprocess the data in two key steps: manually scaling the numeric features and building a design matrix that dummy-codes the factors for the `neuralnet` function.

### Data Prep for `neuralnet`

We will use **min-max normalization** to scale all numeric predictors to a range of [0, 1]. This ensures that variables with large magnitudes (like `blood_glucose_level`) do not disproportionately influence the model's weights compared to variables with small magnitudes.
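
Concretely, each numeric value $x$ is rescaled as

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)},$$

which maps the observed minimum to 0 and the observed maximum to 1.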

```{r manual-scaling}
# Create a copy of the data for neural network preprocessing
neuralData <- diabetes.data

# Identify numeric variables for scaling
numeric.vars <- c("age", "bmi", "HbA1c_level", "blood_glucose_level")

# Loop through numeric variables, scale them using min-max normalization
for (col in numeric.vars) {
  min.val <- min(neuralData[[col]])
  max.val <- max(neuralData[[col]])
  
  # The min-max formula
  neuralData[[col]] <- (neuralData[[col]] - min.val) / (max.val - min.val)
}
```

Next, the `neuralnet` package requires a formula and a data frame that does *not* contain factors. We use `model.matrix()` to automatically create dummy variables for all our factors (like `genderFemale`, `smoking_historynever`, etc.).

This creates a new data frame of only numeric values. We must also clean the column names (using `make.names()`) to remove spaces or special characters (e.g., "No Info" becomes "No.Info") and then dynamically build a formula string that includes all these new dummy predictors.
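
As a small illustration of what the dummy coding produces (a toy example, separate from the analysis pipeline):

```{r dummy-coding-example}
# Toy data: model.matrix() expands a factor into 0/1 indicator columns,
# using the first level ("Female") as the baseline absorbed by the intercept
toy <- data.frame(gender = factor(c("Female", "Male", "Female")))
model.matrix(~ ., data = toy)
```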

```{r nn-formula}
# Create the design matrix, which automatically dummifies factor variables
# The '~ .' formula includes all variables
neuralData.matrix <- model.matrix(~ ., data = neuralData)
neuralData.nn <- as.data.frame(neuralData.matrix)

# Clean the column names to make them valid R variables
# This fixes errors from factor levels with spaces like "No Info"
valid.names <- make.names(colnames(neuralData.nn))
colnames(neuralData.nn) <- valid.names

# Add the numeric response variable (0/1) for neuralnet
# The neuralnet function requires a numeric target
neuralData.nn$diabetes_num <- ifelse(neuralData$diabetes == "Yes", 1, 0)

# Get all column names from the new data frame
columnNames <- colnames(neuralData.nn)

# Create the list of predictors: drop the intercept (first column) and
# both encodings of the response. In particular, 'diabetesYes' (the dummy
# column model.matrix() built from the factor response) must be excluded,
# otherwise the outcome would leak into the predictors.
predictorNames <- setdiff(columnNames[-1], c("diabetesYes", "diabetes_num"))
columnList <- paste(predictorNames, collapse = "+")

# Create the final formula string
modelFormula <- as.formula(paste("diabetes_num ~", columnList))

# Print the formula to check
print(modelFormula)
```

### Build Perceptron Model

With the scaled and dummified data prepared, we can now train the perceptron.

  * We set `hidden = 1` so the network has a single hidden node, keeping the architecture as close to a simple perceptron as possible.
  * We use `act.fct = "logistic"` (the sigmoid function shown below) because this is a binary classification problem, which mirrors our logistic regression.
  * We set `linear.output = FALSE` so the activation function is applied to the output node, giving us a probability between 0 and 1.
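
The logistic activation squashes any real-valued input into $(0, 1)$,

$$\sigma(z) = \frac{1}{1 + e^{-z}},$$

which is the inverse of the logit link used by our `glm` models.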

```{r nn-train}
# Train the perceptron
set.seed(123)
perceptron.model <- neuralnet(modelFormula,
                              data = neuralData.nn,
                              hidden = 1, 
                              act.fct = "logistic",
                              linear.output = FALSE) 
```

# Part 3: Model Comparison

To determine the best model, we compare the models' predictive performance on the **entire dataset**. The primary comparison metric is the **Area Under the Curve (AUC)** of the Receiver Operating Characteristic (ROC) curve. An AUC of 1.0 represents a perfect classifier, while an AUC of 0.5 represents a model with no discriminatory power (equivalent to random guessing).
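
Equivalently (ignoring ties), the AUC is the probability that a randomly chosen diabetic case receives a higher predicted probability than a randomly chosen non-diabetic case:

$$\mathrm{AUC} = P\left(\hat{p}_{\text{Yes}} > \hat{p}_{\text{No}}\right).$$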

We will generate predictions from all four models and plot their ROC curves on a single graph for direct comparison.

```{r final-roc-comparison}
# 1. Get predictions (as probabilities) for all models
predReduced <- predict(reducedModel, newdata = diabetes.data, type = "response")
predFull <- predict(fullModel, newdata = diabetes.data, type = "response")
predForwards <- predict(forwards, newdata = diabetes.data, type = "response")

# Neural network model predictions
predNN.raw <- predict(perceptron.model, newdata = neuralData.nn)
predNN <- as.vector(predNN.raw) # Ensure it's a vector for pROC

# 2. Create ROC objects for all models
category <- diabetes.data$diabetes == "Yes"
ROCobj.reduced <- roc(category, predReduced)
ROCobj.full <- roc(category, predFull)
ROCobj.forwards <- roc(category, predForwards)
ROCobj.NN <- roc(category, predNN)

# 3. Get AUC values from each ROC object
reducedAUC <- ROCobj.reduced$auc
fullAUC <- ROCobj.full$auc
forwardsAUC <- ROCobj.forwards$auc
NNAUC <- ROCobj.NN$auc

# 4. Plot all ROC curves on one graph for comparison
colors <- c("#8B4500", "#00008B", "#8B008B", "#055d03")

plot(ROCobj.reduced, col = colors[1], lwd = 2, main = "ROC Curves of Candidate Models (Full Dataset)")
lines(ROCobj.full, col = colors[2], lwd = 2, lty = 2)
lines(ROCobj.forwards, col = colors[3], lwd = 1)
lines(ROCobj.NN, col = colors[4], lwd = 1)

# Add legend
legend("bottomright", c("reduced", "full", "forwards", "NN"),
       col = colors, lwd = c(2, 2, 1, 1), lty = c(1, 2, 1, 1), bty = "n")

# AUC text annotations for clarity
text(0.4, 0.4, paste("AUC.reduced =", round(reducedAUC, 4)), col = colors[1], adj = 0)
text(0.4, 0.35, paste("AUC.full =", round(fullAUC, 4)), col = colors[2], adj = 0)
text(0.4, 0.3, paste("AUC.forwards =", round(forwardsAUC, 4)), col = colors[3], adj = 0)
text(0.4, 0.25, paste("AUC.NN =", round(NNAUC, 4)), col = colors[4], adj = 0)

```

# Conclusion

  * **Part 1 (Data Prep)**: The data was loaded, reviewed, and prepared. No missing values were found, and the `gender` variable was cleaned by removing the rare 'Other' category.
  * **Part 2 (Model Development)**: We developed four models:
      * **Logistic Regression**: We built a `reducedModel`, a `fullModel`, and a `forwards` selection model using `glm()` and `step()`.
      * **Perceptron (`neuralnet`)**: We created a separate, scaled and dummified dataset using `model.matrix`. We built a `neuralnet` model with one hidden node (`hidden = 1`) and a logistic activation function.
  * **Part 3 (Model Comparison)**: We evaluated all four models by calculating their ROC-AUC on the *entire dataset*. The final AUC scores were:
      * **Reduced GLM AUC**: `r round(reducedAUC, 4)`
      * **Full GLM AUC**: `r round(fullAUC, 4)`
      * **Forwards GLM AUC**: `r round(forwardsAUC, 4)`
      * **Perceptron (NN) AUC**: `r round(NNAUC, 4)`

Based on this analysis, the **Full Logistic Model** and the **Forwards-Selected Logistic Model** show nearly identical, superior predictive performance (AUC ≈ 0.962) compared to the simpler reduced model. This suggests that the additional variables in the full/forwards models (such as `smoking_history` and `gender`) provide valuable predictive information.