library(tidyverse)
library(corrplot)
library(DataExplorer)
library(caret)
library(nnet)
library(ROSE)
df <- read.csv("diabetes.csv", sep = ",")
As a pharmacist, I see diabetes as one of the most common health concerns in my community and at my hospital. According to the Centers for Disease Control and Prevention (CDC), 38.4 million people in the United States have diabetes (11.6% of the population), and 8.7 million of these cases are undiagnosed. This is a growing problem, as uncontrolled diabetes can lead to serious heart, kidney, and foot complications. However, early identification of individuals at risk of diabetes allows for timely intervention with lifestyle modifications and medication.
Given the importance of early detection, one problem we can address is: how can we use clinical data to identify patients at high risk for diabetes? From a business standpoint, early identification of diabetes can reduce long-term healthcare costs for patients, hospitals, and insurance companies by improving patient outcomes. Translating this into a data science problem, the goal becomes: can we build a predictive model using clinical features to determine whether a patient is likely to have diabetes?
To explore this, I selected the Pima Indians Diabetes Dataset from the National Institute of Diabetes and Digestive and Kidney Diseases. This dataset contains diagnostic data from 768 female patients aged 21 and older of Pima Indian heritage. It includes variables such as glucose level, BMI, insulin, age, blood pressure, and number of pregnancies. The outcome variable indicates whether or not a patient was diagnosed with diabetes, making this dataset well-suited for binary classification modeling.
head(df)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
# Summary Statistics
summary(df)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
# Checking for missing values
colSums(is.na(df))
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
# No missing values but 0s may represent missing in some columns
colSums(df == 0)
## Pregnancies Glucose BloodPressure
## 111 5 35
## SkinThickness Insulin BMI
## 227 374 11
## DiabetesPedigreeFunction Age Outcome
## 0 0 500
Several variables contain zero values, and for some of them a zero is not physiologically possible. Glucose, BloodPressure, SkinThickness, Insulin, and BMI contain zeros that most likely represent missing data. These variables may need to be imputed later on with the mean or median, especially SkinThickness and Insulin, where zeros account for roughly 30% and 49% of the rows, respectively.
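As a quick sanity check (a minimal sketch using only base R), the zero counts can also be expressed as a share of the 768 rows, which makes the severity of the problem for Insulin and SkinThickness more obvious:
# Share of zero values in each column (Outcome is excluded from interpretation,
# since 0 there is a valid class label rather than a missing value)
round(colMeans(df == 0), 3)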
# Correlation Matrix
cor_matrix <- cor(df)
corrplot(cor_matrix, method = "color", tl.col = "black", addCoef.col = "black", number.cex = 0.5)
Based on the correlation matrix, Glucose, BMI, Age, and
DiabetesPedigreeFunction stand out as the most informative features for
predicting diabetes, with correlation scores of 0.47, 0.31, 0.24, and
0.22, respectively. In contrast, BloodPressure, SkinThickness, and
Insulin show the weakest correlation with the outcome variable. The
strongest inter-feature correlation is between Age and Pregnancies (r =
0.54). Aside from this, there are no strong correlations between the
features, so multicollinearity is not a major concern for modeling.
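To make this ranking explicit, here is a small sketch that pulls the Outcome column out of the correlation matrix computed above and sorts the features by the strength of their absolute correlation with the outcome:
# Rank features by absolute correlation with Outcome
outcome_cor <- cor_matrix[, "Outcome"]
sort(abs(outcome_cor[names(outcome_cor) != "Outcome"]), decreasing = TRUE)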
# Distribution of numerical variables
plot_histogram(df)
The Age variable is right-skewed; most patients are between 20 and 40 years old. The distributions of BloodPressure, BMI, Glucose, and SkinThickness look roughly normal, apart from the spikes at zero that correspond to the missing values noted above. The distributions of DiabetesPedigreeFunction and Pregnancies are also right-skewed: most patients have a DiabetesPedigreeFunction below 1 and fewer than 5 pregnancies.
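To back up the visual impression with a number, here is a rough moment-based skewness calculation in base R (positive values indicate right skew; this is a quick sketch rather than a formal test):
# Rough skewness estimate for each predictor (no extra packages required)
skewness_est <- function(x) mean((x - mean(x))^3) / sd(x)^3
round(sapply(df[, setdiff(names(df), "Outcome")], skewness_est), 2)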
# Boxplots (by Outcome)
df_long <- reshape2::melt(df, id.vars = "Outcome")
ggplot(df_long, aes(x = factor(Outcome), y = value)) +
geom_boxplot() +
facet_wrap(~ variable, scales = "free") +
labs(x = "Diabetes Outcome", y = "Value")
The boxplots show how each feature varies by diabetes status. Glucose and BMI are clearly higher in patients with diabetes, suggesting these two variables are strong predictors. Pregnancies, DiabetesPedigreeFunction, and Age also tend to be higher in the diabetic group. The boxplots also show that Insulin has many outliers, which may need to be addressed later on.
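To quantify the outlier problem for Insulin, here is a quick sketch that counts observations falling outside the usual 1.5 × IQR boxplot whiskers:
# Count Insulin values beyond the 1.5 * IQR whiskers used by geom_boxplot()
q <- quantile(df$Insulin, c(0.25, 0.75))
iqr <- diff(q)
sum(df$Insulin < q[1] - 1.5 * iqr | df$Insulin > q[2] + 1.5 * iqr)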
# Proportion of each outcome
table(df$Outcome)
##
## 0 1
## 500 268
prop.table(table(df$Outcome))
##
## 0 1
## 0.6510417 0.3489583
This table shows a moderate class imbalance in the dataset: about 35% of patients have diabetes, while the majority (65%) do not. Prediction models may therefore favor the majority class, indicating a possible need for oversampling or undersampling.
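Before modeling, it helps to note what the imbalance implies for a naive baseline and what a simple balancing step could look like. This is a minimal illustration using caret's downSample() (we use ROSE later instead), shown only to make the idea concrete:
# Baseline accuracy from always predicting the majority class (the "no information rate")
max(prop.table(table(df$Outcome)))
# Illustration only: undersample the majority class with caret so both classes are equal
down_df <- downSample(x = df[, setdiff(names(df), "Outcome")],
y = as.factor(df$Outcome), yname = "Outcome")
table(down_df$Outcome)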
For the zero values in Glucose, BMI, and Age, we can impute with the median (Age in fact contains no zeros, so it is unaffected). As for Insulin and SkinThickness, these variables have a large share of zero values (roughly 49% and 30% of rows, as noted above), so imputing them with the median could skew the data. In this case, these two variables will be dropped.
# Imputing the zero values with the median
df_clean <- df
vars_to_impute <- c("Glucose", "BMI", "Age")
for (var in vars_to_impute) {
median_val <- median(df_clean[[var]][df_clean[[var]] != 0], na.rm = TRUE)
df_clean[[var]][df_clean[[var]] == 0] <- median_val
}
#Dropping SkinThickness and Insulin variables
df_clean <- df_clean %>%
select(-SkinThickness, -Insulin)
# Convert Outcome to factor for classification
df_clean$Outcome <- as.factor(df_clean$Outcome)
summary(df_clean)
## Pregnancies Glucose BloodPressure BMI
## Min. : 0.000 Min. : 44.00 Min. : 0.00 Min. :18.20
## 1st Qu.: 1.000 1st Qu.: 99.75 1st Qu.: 62.00 1st Qu.:27.50
## Median : 3.000 Median :117.00 Median : 72.00 Median :32.30
## Mean : 3.845 Mean :121.66 Mean : 69.11 Mean :32.46
## 3rd Qu.: 6.000 3rd Qu.:140.25 3rd Qu.: 80.00 3rd Qu.:36.60
## Max. :17.000 Max. :199.00 Max. :122.00 Max. :67.10
## DiabetesPedigreeFunction Age Outcome
## Min. :0.0780 Min. :21.00 0:500
## 1st Qu.:0.2437 1st Qu.:24.00 1:268
## Median :0.3725 Median :29.00
## Mean :0.4719 Mean :33.24
## 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :2.4200 Max. :81.00
Based on this dataset, I will build two models, logistic regression and a neural network, to predict diabetes status from the clinical features. Logistic regression is a widely used algorithm for binary classification tasks, making it well suited here, where the outcome variable is binary (diabetic or not); it is also easy to interpret and works well with numerical data. Neural networks, in addition, are capable of capturing complex nonlinear relationships among variables. Given that the correlation matrix did not reveal strong linear relationships for most features, a neural network may be better equipped to detect subtle interactions within the data. To compare the models, I will look at test accuracy as well as sensitivity and specificity, since these metrics are especially important for medical diagnoses.
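For reference, and because caret's confusionMatrix() output below treats the first factor level ("0", non-diabetic) as the positive class, here is a minimal, hypothetical helper showing how the three metrics are computed by hand when the diabetic class ("1") is treated as positive:
# Hypothetical helper (not used below): accuracy, sensitivity, and specificity
# computed from predicted and true labels, with "1" (diabetic) as the positive class
metrics_by_hand <- function(pred, truth) {
pred <- as.numeric(as.character(pred)); truth <- as.numeric(as.character(truth))
tp <- sum(pred == 1 & truth == 1); tn <- sum(pred == 0 & truth == 0)
fp <- sum(pred == 1 & truth == 0); fn <- sum(pred == 0 & truth == 1)
c(Accuracy = (tp + tn) / length(truth),
Sensitivity = tp / (tp + fn),
Specificity = tn / (tn + fp))
}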
# Train/Test Split
set.seed(123)
trainIndex <- createDataPartition(df_clean$Outcome, p = 0.7, list = FALSE)
trainData <- df_clean[trainIndex, ]
testData <- df_clean[-trainIndex, ]
log_model <- glm(Outcome ~ ., data = trainData, family = "binomial")
log_probs <- predict(log_model, testData, type = "response")
log_preds <- ifelse(log_probs > 0.5, 1, 0)
log_conf <- confusionMatrix(as.factor(log_preds), testData$Outcome)
print("Logistic Regression Confusion Matrix:")
## [1] "Logistic Regression Confusion Matrix:"
print(log_conf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 128 32
## 1 22 48
##
## Accuracy : 0.7652
## 95% CI : (0.705, 0.8184)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 0.0001394
##
## Kappa : 0.467
##
## Mcnemar's Test P-Value : 0.2206714
##
## Sensitivity : 0.8533
## Specificity : 0.6000
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.6857
## Prevalence : 0.6522
## Detection Rate : 0.5565
## Detection Prevalence : 0.6957
## Balanced Accuracy : 0.7267
##
## 'Positive' Class : 0
##
This logistic regression model correctly predicted diabetes status for 76.5% of the test cases. This is better than the baseline (No Information Rate) of 65.2%, which is what one would get by always predicting the majority class, and the p-value of 0.0001394 indicates that the model performs significantly better than that baseline. Note that caret treats "0" (non-diabetic) as the positive class here, so the reported sensitivity of 85.33% is the rate at which non-diabetic patients are correctly identified, while the specificity of 60.0% is the rate at which diabetic patients are correctly identified. In other words, the model misses about 40% of true diabetic cases. From a healthcare perspective these false negatives are the bigger concern, since missed diabetes often leads to delayed interventions and higher long-term health costs, whereas false positives mainly trigger unnecessary follow-up testing. To address this, we can consider balancing the dataset, tuning the parameters, or using a more complex model like a neural network. Let's try oversampling the dataset first to see if we can improve the model.
# Create a data frame with accuracy, sensitivity, and specificity
results <- data.frame(
Model = c("Logistic Regression"),
Accuracy = c("76.5%"),
Sensitivity = c("85.3%"),
Specificity = c("60.0%")
)
print(results)
## Model Accuracy Sensitivity Specificity
## 1 Logistic Regression 76.5% 85.3% 60.0%
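Rather than typing the percentages by hand, the same numbers can also be pulled straight out of the confusionMatrix object created above; a small sketch:
# Extract the metrics programmatically from the caret confusionMatrix object
data.frame(
Model = "Logistic Regression",
Accuracy = round(log_conf$overall["Accuracy"], 3),
Sensitivity = round(log_conf$byClass["Sensitivity"], 3),
Specificity = round(log_conf$byClass["Specificity"], 3),
row.names = NULL
)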
# Oversample training data
balanced_train <- ROSE(Outcome ~ ., data = trainData, seed = 1)$data
# Check class distribution after balancing
table(balanced_train$Outcome)
##
## 0 1
## 286 252
prop.table(table(balanced_train$Outcome))
##
## 0 1
## 0.5315985 0.4684015
After oversampling the training data with ROSE, the class distribution is much closer to 50/50 (roughly 53% vs 47%).
# Train logistic regression on balanced data
model_balanced_log <- glm(Outcome ~ ., data = balanced_train, family = binomial)
# Predict on test set
pred_probs <- predict(model_balanced_log, newdata = testData, type = "response")
pred_class <- ifelse(pred_probs > 0.5, 1, 0)
# Confusion matrix
confusionMatrix(as.factor(pred_class), as.factor(testData$Outcome))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 119 22
## 1 31 58
##
## Accuracy : 0.7696
## 95% CI : (0.7097, 0.8224)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 7.748e-05
##
## Kappa : 0.5051
##
## Mcnemar's Test P-Value : 0.2718
##
## Sensitivity : 0.7933
## Specificity : 0.7250
## Pos Pred Value : 0.8440
## Neg Pred Value : 0.6517
## Prevalence : 0.6522
## Detection Rate : 0.5174
## Detection Prevalence : 0.6130
## Balanced Accuracy : 0.7592
##
## 'Positive' Class : 0
##
After balancing the training data, the overall accuracy improved slightly to 77.0%, and the trade-off between the classes shifted: sensitivity for the non-diabetic class dropped to 79.3%, while specificity (the rate at which diabetic patients are correctly identified) rose from 60.0% to 72.5%. In a healthcare setting this is a desirable shift, since the model now misses fewer true diabetic cases, at the cost of flagging more non-diabetic patients for unnecessary follow-up.
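Since a single accuracy number hides this trade-off, a threshold-free comparison such as the area under the ROC curve can also be useful. The ROSE package provides a roc.curve() helper; a quick sketch, assuming the probability vectors created above:
# Compare the original and balanced logistic models by ROC AUC on the test set
roc.curve(testData$Outcome, log_probs, plotit = FALSE) # original training data
roc.curve(testData$Outcome, pred_probs, plotit = FALSE) # ROSE-balanced training data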
results <- data.frame(
Model = c("Logistic Regression", "Logistic Regression - oversamling"),
Accuracy = c("76.5%", "77.0%"),
Sensitivity = c("85.3%", "79.3%"),
Specificity = c("60.0%", "72.5%")
)
print(results)
## Model Accuracy Sensitivity Specificity
## 1 Logistic Regression 76.5% 85.3% 60.0%
## 2 Logistic Regression - oversampling 77.0% 79.3% 72.5%
Given that the original dataset reflects the population it was drawn from, where roughly a third of individuals have diabetes, it provides a more realistic baseline for model evaluation and generalization. While balancing the data improved the detection of diabetic cases, it did not significantly improve accuracy and may have introduced sampling bias that does not mirror real-world prevalence. Therefore, rather than resampling further, let's explore threshold tuning on the predicted test-set probabilities to see if we can improve model performance.
# Create a function to evaluate model at different thresholds
evaluate_thresholds <- function(probs, true_labels, thresholds) {
results <- data.frame(
Threshold = numeric(),
Accuracy = numeric(),
Sensitivity = numeric(),
Specificity = numeric()
)
for (t in thresholds) {
preds <- ifelse(probs > t, 1, 0)
cm <- confusionMatrix(as.factor(preds), as.factor(true_labels), positive = "1")
results <- rbind(results, data.frame(
Threshold = t,
Accuracy = cm$overall["Accuracy"],
Sensitivity = cm$byClass["Sensitivity"],
Specificity = cm$byClass["Specificity"]
))
}
return(results)
}
# Define thresholds to test
thresholds <- seq(0.3, 1, by = 0.1)
# Run evaluation
threshold_results <- evaluate_thresholds(pred_probs, testData$Outcome, thresholds)
## Warning in confusionMatrix.default(as.factor(preds), as.factor(true_labels), :
## Levels are not in the same order for reference and data. Refactoring data to
## match.
# View the results
print(threshold_results)
## Threshold Accuracy Sensitivity Specificity
## Accuracy 0.3 0.7347826 0.9125 0.6400000
## Accuracy1 0.4 0.7347826 0.7750 0.7133333
## Accuracy2 0.5 0.7695652 0.7250 0.7933333
## Accuracy3 0.6 0.7652174 0.5750 0.8666667
## Accuracy4 0.7 0.7739130 0.4875 0.9266667
## Accuracy5 0.8 0.7434783 0.3375 0.9600000
## Accuracy6 0.9 0.6956522 0.1500 0.9866667
## Accuracy7 1.0 0.6521739 0.0000 1.0000000
Iterating over several thresholds shows that a value of 0.7 yields slightly higher accuracy than the default 0.5. Note that, unlike the earlier confusion matrices, this evaluation treats "1" (diabetic) as the positive class, so sensitivity here measures how many true diabetic cases are detected. The small gain in accuracy comes at a steep cost: sensitivity drops sharply to about 49%, which substantially increases the risk of missing true cases of diabetes. It is therefore better to keep the 0.5 threshold, since the trade-off for the small increase in accuracy is not worth the loss in sensitivity.
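If we wanted a data-driven threshold instead of eyeballing the table, one simple option is Youden's J statistic (sensitivity + specificity - 1), which weights both error types equally; a quick sketch on the results computed above:
# Pick the threshold that maximizes Youden's J = sensitivity + specificity - 1
threshold_results$YoudenJ <- threshold_results$Sensitivity + threshold_results$Specificity - 1
threshold_results[which.max(threshold_results$YoudenJ), ]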
results <- data.frame(
Model = c("Logistic Regression", "Logistic Regression - oversamling","Logistic Regression - threshold"),
Accuracy = c("76.5%", "77.0%", "77.4%"),
Sensitivity = c("85.3%", "79.3%","48.8%"),
Specificity = c("60.0%", "72.5%", "92.3%")
)
print(results)
## Model Accuracy Sensitivity Specificity
## 1 Logistic Regression 76.5% 85.3% 60.0%
## 2 Logistic Regression - oversampling 77.0% 79.3% 72.5%
## 3 Logistic Regression - threshold 77.4% 48.8% 92.7%
# Normalize data
preProc <- preProcess(trainData[, -ncol(trainData)], method = c("center", "scale"))
train_scaled <- predict(preProc, trainData[, -ncol(trainData)])
test_scaled <- predict(preProc, testData[, -ncol(testData)])
# Add labels back
train_scaled$Outcome <- trainData$Outcome
test_scaled$Outcome <- testData$Outcome
# Train neural net with 1 hidden layer, 5 neurons
nn_model <- nnet(Outcome ~ ., data = train_scaled, size = 5, maxit = 500, decay = 0.01)
## # weights: 41
## initial value 353.257049
## iter 10 value 240.291371
## iter 20 value 229.145321
## iter 30 value 220.633011
## iter 40 value 215.492336
## iter 50 value 213.652463
## iter 60 value 212.836950
## iter 70 value 212.611377
## iter 80 value 212.598612
## iter 90 value 212.593230
## iter 100 value 212.589165
## iter 110 value 212.588930
## final value 212.588882
## converged
# Predict and evaluate
nn_preds <- predict(nn_model, test_scaled[, -ncol(test_scaled)], type = "class")
nn_conf <- confusionMatrix(as.factor(nn_preds), test_scaled$Outcome)
print("Neural Network Confusion Matrix:")
## [1] "Neural Network Confusion Matrix:"
print(nn_conf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 118 35
## 1 32 45
##
## Accuracy : 0.7087
## 95% CI : (0.6454, 0.7666)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 0.04036
##
## Kappa : 0.3522
##
## Mcnemar's Test P-Value : 0.80697
##
## Sensitivity : 0.7867
## Specificity : 0.5625
## Pos Pred Value : 0.7712
## Neg Pred Value : 0.5844
## Prevalence : 0.6522
## Detection Rate : 0.5130
## Detection Prevalence : 0.6652
## Balanced Accuracy : 0.6746
##
## 'Positive' Class : 0
##
The neural network converged after about 110 iterations, with a final loss value of 212.59. The test accuracy was 70.9%, while sensitivity and specificity were 78.7% and 56.3%, respectively. Based on these results, the neural network underperformed compared to the logistic regression model. Let's tune the model's hidden layer to see if accuracy can be improved.
results <- data.frame(
Model = c("Logistic Regression", "Logistic Regression - oversamling","Logistic Regression - threshold", "Neural Network"),
Accuracy = c("76.5%", "77.0%", "77.4%", "70.9%"),
Sensitivity = c("85.3%", "79.3%","48.8%", "76.7%"),
Specificity = c("60.0%", "72.5%", "92.3%", "56.3%")
)
print(results)
## Model Accuracy Sensitivity Specificity
## 1 Logistic Regression 76.5% 85.3% 60.0%
## 2 Logistic Regression - oversampling 77.0% 79.3% 72.5%
## 3 Logistic Regression - threshold 77.4% 48.8% 92.7%
## 4 Neural Network 70.9% 78.7% 56.3%
# Set up cross-validation
control <- trainControl(method = "cv", number = 5)
# Define tuning grid
tune_grid <- expand.grid(
size = c(2, 5, 10), # number of neurons
decay = c(0, 0.01, 0.1) # regularization
)
# Train neural network model
set.seed(123)
nn_tuned <- train(
Outcome ~ .,
data = trainData,
method = "nnet",
trControl = control,
tuneGrid = tune_grid,
maxit = 200,
trace = FALSE
)
# View best model
print(nn_tuned)
## Neural Network
##
## 538 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 431, 430, 431, 430, 430
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 2 0.00 0.6543094 0.02536722
## 2 0.01 0.7267567 0.33735750
## 2 0.10 0.7843891 0.51141097
## 5 0.00 0.7024749 0.33060717
## 5 0.01 0.7398235 0.41233694
## 5 0.10 0.7454483 0.43019255
## 10 0.00 0.6859121 0.30218014
## 10 0.01 0.7323641 0.39192881
## 10 0.10 0.7454656 0.43394246
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 2 and decay = 0.1.
# Predict on test set
nn_preds <- predict(nn_tuned, newdata = testData)
confusionMatrix(nn_preds, as.factor(testData$Outcome))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 126 27
## 1 24 53
##
## Accuracy : 0.7783
## 95% CI : (0.719, 0.8302)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 2.232e-05
##
## Kappa : 0.5069
##
## Mcnemar's Test P-Value : 0.7794
##
## Sensitivity : 0.8400
## Specificity : 0.6625
## Pos Pred Value : 0.8235
## Neg Pred Value : 0.6883
## Prevalence : 0.6522
## Detection Rate : 0.5478
## Detection Prevalence : 0.6652
## Balanced Accuracy : 0.7512
##
## 'Positive' Class : 0
##
Here we tuned the size (number of neurons in the hidden layer) and decay (weight decay) parameters and used cross-validation to see which combination gives the highest average accuracy. Among the combinations tested, the best cross-validated accuracy (78.4%) came from the network with size = 2 and decay = 0.1. On the test data, this model achieved an accuracy of 77.8%, a sensitivity of 84.0%, and a specificity of 66.3%. These results indicate that the tuned neural network outperforms the logistic regression model and handles the imbalanced dataset more effectively, making it a strong candidate for predicting diabetes in this population.
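For completeness, the selected hyperparameters and their cross-validated performance can be read directly from the caret object rather than from the full grid printout; a quick sketch:
# Hyperparameters chosen by caret and the matching cross-validation row
nn_tuned$bestTune
merge(nn_tuned$bestTune, nn_tuned$results)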
results <- data.frame(
Model = c("Logistic Regression", "Logistic Regression (oversampled)","Logistic Regression (threshold = 0.7)", "Neural Network", "Neural Network (size = 2, decay = 0.1)"),
Accuracy = c("76.5%", "77.0%", "77.4%", "70.9%", "77.8%"),
Sensitivity = c("85.3%", "79.3%","48.8%", "78.7%", "84.0%"),
Specificity = c("60.0%", "72.5%", "92.7%", "56.3%", "66.3%")
)
print(results)
## Model Accuracy Sensitivity Specificity
## 1 Logistic Regression 76.5% 85.3% 60.0%
## 2 Logistic Regression (oversampled) 77.0% 79.3% 72.5%
## 3 Logistic Regression (threshold = 0.7) 77.4% 48.8% 92.7%
## 4 Neural Network 70.9% 78.7% 56.3%
## 5 Neural Network (size = 2, decay = 0.1) 77.8% 84.0% 66.3%
Note: initially I wanted to add a second hidden layer to this model, but I had trouble installing the keras package. Instead, I will run a grid search over a broader range of size and decay values.
# Define training control with 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)
# Define a broader grid of tuning parameters
grid <- expand.grid(
size = c(1, 2, 3, 5, 7, 10),
decay = c(0, 0.001, 0.01, 0.05, 0.1, 0.2)
)
# Train the model using grid search
nnet_tuned <- train(
Outcome ~ .,
data = trainData,
method = "nnet",
metric = "Accuracy",
trControl = ctrl,
tuneGrid = grid,
maxit = 200,
trace = FALSE
)
# Print the best model and results
print(nnet_tuned)
## Neural Network
##
## 538 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 430, 430, 431, 431, 430
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 0.000 0.6505711 0.00000000
## 1 0.001 0.7322776 0.31144872
## 1 0.010 0.6913465 0.17653192
## 1 0.050 0.6915196 0.23001771
## 1 0.100 0.6748702 0.09690887
## 1 0.200 0.7470924 0.38400670
## 2 0.000 0.6617861 0.05839915
## 2 0.001 0.7231222 0.31097805
## 2 0.010 0.7508480 0.42467040
## 2 0.050 0.7285566 0.33051058
## 2 0.100 0.7324161 0.38690915
## 2 0.200 0.6841468 0.19283469
## 3 0.000 0.6728280 0.19108128
## 3 0.001 0.7454136 0.38725255
## 3 0.010 0.7140187 0.27475176
## 3 0.050 0.6766182 0.19857330
## 3 0.100 0.7433887 0.41810316
## 3 0.200 0.7471443 0.41423518
## 5 0.000 0.7174282 0.29587886
## 5 0.001 0.7140014 0.31866876
## 5 0.010 0.7528730 0.44895524
## 5 0.050 0.7509692 0.44532217
## 5 0.100 0.7547075 0.44853987
## 5 0.200 0.7621495 0.45863480
## 7 0.000 0.7472136 0.39021991
## 7 0.001 0.7418484 0.40571508
## 7 0.010 0.7547075 0.46807599
## 7 0.050 0.7417445 0.42610597
## 7 0.100 0.7640187 0.45868568
## 7 0.200 0.7713915 0.47748194
## 10 0.000 0.7418138 0.44791057
## 10 0.001 0.7416926 0.40991124
## 10 0.010 0.7826757 0.50904880
## 10 0.050 0.7659571 0.47597755
## 10 0.100 0.7528037 0.44080201
## 10 0.200 0.7528037 0.43873404
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 10 and decay = 0.01.
plot(nnet_tuned)
# Make predictions on the test set
pred <- predict(nnet_tuned, newdata = testData)
# Evaluate performance
confusionMatrix(pred, as.factor(testData$Outcome))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 128 31
## 1 22 49
##
## Accuracy : 0.7696
## 95% CI : (0.7097, 0.8224)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 7.748e-05
##
## Kappa : 0.4784
##
## Mcnemar's Test P-Value : 0.2718
##
## Sensitivity : 0.8533
## Specificity : 0.6125
## Pos Pred Value : 0.8050
## Neg Pred Value : 0.6901
## Prevalence : 0.6522
## Detection Rate : 0.5565
## Detection Prevalence : 0.6913
## Balanced Accuracy : 0.7329
##
## 'Positive' Class : 0
##
The plot of cross-validated accuracy for different combinations of hidden units and weight decay shows that accuracy generally increases with more hidden units. However, the configuration selected here (size = 10, decay = 0.01) achieved a test accuracy of 77.0%, a sensitivity of 85.3%, and a specificity of 61.3%, slightly lower in overall accuracy than the earlier tuned model (size = 2, decay = 0.1), which reached 77.8% on the test set. This suggests that although adding hidden units can improve sensitivity, a smaller network with stronger regularization appears to generalize better to unseen data.
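For a compact side-by-side check of the two tuned networks on the held-out test set, caret's postResample() returns accuracy and kappa directly; a small sketch using the fitted objects above:
# Test-set accuracy and kappa for both tuned networks
rbind(
"size = 2, decay = 0.1" = postResample(predict(nn_tuned, newdata = testData), testData$Outcome),
"size = 10, decay = 0.01" = postResample(predict(nnet_tuned, newdata = testData), testData$Outcome)
)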
results <- data.frame(
Model = c("Logistic Regression", "Logistic Regression (oversampled)","Logistic Regression (threshold = 0.7)", "Neural Network", "Neural Network (size = 2, decay = 0.1)", "Neural Network (size = 10, decay = 0.01)"),
Accuracy = c("76.5%", "77.0%", "77.4%", "70.9%", "77.8%", "77.0%"),
Sensitivity = c("85.3%", "79.3%","48.8%", "78.7%", "84.0%", "85.3%"),
Specificity = c("60.0%", "72.5%", "92.7%", "56.3%", "66.3%", "61.3%")
)
print(results)
## Model Accuracy Sensitivity Specificity
## 1 Logistic Regression 76.5% 85.3% 60.0%
## 2 Logistic Regression (oversampled) 77.0% 79.3% 72.5%
## 3 Logistic Regression (threshold = 0.7) 77.4% 48.8% 92.7%
## 4 Neural Network 70.9% 78.7% 56.3%
## 5 Neural Network (size = 2, decay = 0.1) 77.8% 84.0% 66.3%
## 6 Neural Network (size = 10, decay = 0.01) 77.0% 85.3% 61.3%
Of all the models tested, the tuned neural network with size = 2 and decay = 0.1 achieved the highest test accuracy (77.8%) with a strong balance of performance metrics: 84.0% sensitivity and 66.3% specificity. In comparison, the baseline logistic regression model achieved 76.5% accuracy with 85.3% sensitivity and 60.0% specificity, while the oversampled logistic regression model improved specificity (72.5%) at the cost of sensitivity (79.3%). In contrast, raising the classification threshold to 0.7 boosted specificity (92.7%) but severely reduced sensitivity for the diabetic class (48.8%), posing a high risk of missing true diabetic cases. These findings indicate that although individual modifications can shift the balance between metrics, the tuned neural network offers the most balanced trade-off between overall accuracy and the critical measures of sensitivity and specificity for a clinical screening context.
From a business perspective, despite the tuned neural network’s slightly superior performance on paper, its increased complexity and higher resource demands may not justify its usage compared to a simpler model. Logistic regression models are easier to implement, require fewer computational resources, and are more interpretable, all of which can translate into lower operational costs and quicker decision-making in a real-world clinical setting. Given these factors, especially when scaling across large populations, the logistic regression model may be preferred due to its cost-effectiveness and robustness, even if it sacrifices a small margin of accuracy relative to the tuned neural network.