Introduction:

In this homework, you will apply logistic regression to a real-world dataset: the Pima Indians Diabetes Database. This dataset contains medical records from 768 women of Pima Indian heritage, aged 21 or older, and is used to predict the onset of diabetes (binary outcome: 0 = no diabetes, 1 = diabetes) based on physiological measurements.

The data was originally distributed through the UCI Machine Learning Repository; a public copy can be imported directly from the URL below.

Dataset URL: https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv

Columns (no header in the CSV, so we need to assign them manually):

  1. Pregnancies: Number of times pregnant
  2. Glucose: Plasma glucose concentration (2-hour test)
  3. BloodPressure: Diastolic blood pressure (mm Hg)
  4. SkinThickness: Triceps skin fold thickness (mm)
  5. Insulin: 2-hour serum insulin (mu U/ml)
  6. BMI: Body mass index (weight in kg/(height in m)^2)
  7. DiabetesPedigreeFunction: Diabetes pedigree function (a function scoring genetic risk)
  8. Age: Age in years
  9. Outcome: Class variable (0 = no diabetes, 1 = diabetes)

Task Overview: You will load the data, build a logistic regression model to predict diabetes onset using a subset of predictors (Glucose, BMI, Age), interpret the model, evaluate it with a confusion matrix and metrics, and analyze the ROC curve and AUC.

Cleaning the dataset (don’t change the following code)

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

data <- read.csv(url, header = FALSE)

colnames(data) <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome")

data$Outcome <- as.factor(data$Outcome)

# Handle missing values (replace 0s with NA because 0 makes no sense here)
data$Glucose[data$Glucose == 0] <- NA
data$BloodPressure[data$BloodPressure == 0] <- NA
data$BMI[data$BMI == 0] <- NA


colSums(is.na(data))
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        5                       35 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                       11 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

Question 1: Create and Interpret a Logistic Regression Model - Fit a logistic regression model to predict Outcome using Glucose, BMI, and Age.

# Citations/Disclaimer: This code and analysis follow what I learned from the course/class notes

# Here, I fit a logistic regression model to predict whether a person has diabetes (Outcome)
# I chose Glucose, BMI, and Age as my predictors since they are important health indicators related to diabetes risk
model <- glm(Outcome ~ Glucose + BMI + Age, 
             data = data, 
             family = "binomial")

# I used the summary() function to examine the results of my model
# It shows the coefficients, which tell me how each variable changes the log-odds of diabetes
# It also shows the p-values, which help me determine whether each predictor is statistically significant
summary(model)
## 
## Call:
## glm(formula = Outcome ~ Glucose + BMI + Age, family = "binomial", 
##     data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.032377   0.711037 -12.703  < 2e-16 ***
## Glucose      0.035548   0.003481  10.212  < 2e-16 ***
## BMI          0.089753   0.014377   6.243  4.3e-10 ***
## Age          0.028699   0.007809   3.675 0.000238 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 974.75  on 751  degrees of freedom
## Residual deviance: 724.96  on 748  degrees of freedom
##   (16 observations deleted due to missingness)
## AIC: 732.96
## 
## Number of Fisher Scoring iterations: 4
# Logistic regression does not have a true R-squared like linear regression, so I calculate a pseudo R-squared instead
# This deviance-based formula estimates how much my model reduces the deviance compared to the null (intercept-only) model
R2 <- 1 - (model$deviance / model$null.deviance)

# Finally, I print the pseudo R-squared value so I can interpret how well my model fits the data
R2
## [1] 0.25626

What does the intercept represent (log-odds of diabetes when predictors are zero)?

The intercept is the model’s starting point: the predicted log-odds of diabetes (about -9.03) when Glucose, BMI, and Age are all zero. Those values are unrealistic for a person, so the intercept mainly anchors the model rather than describing a real patient.
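
As a rough illustration, a value on the log-odds scale can be converted to a probability with the logistic function, which base R provides as plogis(). The sketch below uses the coefficients copied from the summary() output above. The example profile (Glucose = 120, BMI = 32, Age = 33) is a hypothetical patient chosen only for illustration, not a row from the data.

```r
# Coefficients copied from the summary(model) output above
b <- c(Intercept = -9.032377, Glucose = 0.035548, BMI = 0.089753, Age = 0.028699)

# Probability implied by the intercept alone (all predictors equal to zero)
p_zero <- plogis(b[["Intercept"]])

# Hypothetical profile, chosen only for illustration (not a row from the data)
profile <- c(Glucose = 120, BMI = 32, Age = 33)

# Linear predictor (log-odds) for that profile, then its probability
log_odds <- b[["Intercept"]] + sum(b[names(profile)] * profile)
p_profile <- plogis(log_odds)

round(c(p_zero = p_zero, p_profile = p_profile), 4)
```

The intercept-only probability is very close to zero, which fits the idea that the intercept is an anchor for the model and not a realistic prediction.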

For each predictor (Glucose, BMI, Age), does a one-unit increase raise or lower the odds of diabetes? Are they significant (p-value < 0.05)?

For Glucose, a one-unit increase raises the log-odds of diabetes by about 0.0355, so the odds of diabetes increase as glucose increases. The p-value (< 2e-16) is less than 0.05, so Glucose is statistically significant.

For BMI, a one-unit increase raises the log-odds of diabetes by about 0.0898, so the odds of diabetes increase as BMI increases. The p-value (4.3e-10) is less than 0.05, so BMI is statistically significant and has an impact on the prediction.

For Age, a one-year increase raises the log-odds of diabetes by about 0.0287, so the odds of diabetes increase as age increases. The p-value (0.000238) is less than 0.05, so Age is statistically significant and plays a role in predicting diabetes.
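
Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. A minimal sketch, using the estimates copied from the summary() output above (the ten-unit comparison is added here only as an illustration):

```r
# Coefficients (log-odds per one-unit increase) copied from the summary(model) output above
coefs <- c(Glucose = 0.035548, BMI = 0.089753, Age = 0.028699)

# Odds ratio for a one-unit increase in each predictor
or_1 <- exp(coefs)

# Odds ratio for a ten-unit increase (the log-odds change is multiplied by 10)
or_10 <- exp(10 * coefs)

round(rbind(one_unit = or_1, ten_units = or_10), 3)
```

An odds ratio above 1 means the odds of diabetes go up as the predictor increases. With the fitted model in memory, exp(coef(model)) should give the same one-unit odds ratios, along with the exponentiated intercept.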

Question 2: Confusion Matrix and Important Metric

Calculate and report the metrics:

Accuracy: (TP + TN) / Total
Sensitivity (Recall): TP / (TP + FN)
Specificity: TN / (TN + FP)
Precision: TP / (TP + FP)

Use the following starter code

# Citations/Disclaimer: This code and analysis follow what I learned from the course/class notes
# Keep only rows with no missing values in Glucose, BMI, or Age
data_subset <- data[complete.cases(data[, c("Glucose", "BMI", "Age")]), ]

# Create a numeric version of the outcome (0 = no diabetes, 1 = diabetes). This is used for the confusion matrix and the ROC curve.
data_subset$Outcome_num <- ifelse(data_subset$Outcome == "1", 1, 0)


# Predicted probabilities

# I used the predict() function to estimate the probability that each person in my dataset has diabetes
prob <- predict(model, data_subset, type = "response")


# Predicted classes

# Here I use a 0.5 threshold: if the predicted probability is greater than 0.5 I predict diabetes (1), otherwise I predict no diabetes (0)
pred <- ifelse(prob > 0.5, 1, 0)

# Confusion matrix

# I create a confusion matrix in order to compare my model’s predictions with the actual outcomes. 
# This helps me see how many predictions were correct and where the model made mistakes like the false positives or the false negatives 

cm <- table(Predicted = pred, Actual = data_subset$Outcome_num)
cm
##          Actual
## Predicted   0   1
##         0 429 114
##         1  59 150
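# Citations/Disclaimer: This code and analysis follow what I learned from the course/class notes

# Here, I extract the values from my confusion matrix (rows = Predicted, columns = Actual) so I can calculate performance metrics
# TN (True Negatives): predicted no diabetes, and there was actually no diabetes
# FP (False Positives): predicted diabetes, but there was actually no diabetes
# FN (False Negatives): predicted no diabetes, but there was actually diabetes
# TP (True Positives): predicted diabetes, and there was actually diabetes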

TN <- cm[1,1]  
FP <- cm[2,1]
FN <- cm[1,2]
TP <- cm[2,2]

# Metrics

# Accuracy tells me the proportion of all predictions that my model got correct

# Sensitivity measures the proportion of actual diabetes cases that my model identifies

# Specificity measures the proportion of actual non-diabetes cases that my model identifies

# Precision tells me, out of all predicted diabetes cases, how many were correct

accuracy <- (TP + TN) / sum(cm)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision <- TP / (TP + FP)

# Here I print all the metrics, rounded to 3 decimal places so they are easier to interpret
cat("Accuracy:", round(accuracy, 3), "\nSensitivity:", round(sensitivity, 3), "\nSpecificity:", round(specificity, 3), "\nPrecision:", round(precision, 3))
## Accuracy: 0.77 
## Sensitivity: 0.568 
## Specificity: 0.879 
## Precision: 0.718

Interpret: How well does the model perform? Is it better at detecting diabetes (sensitivity) or non-diabetes (specificity)? Why might this matter for medical diagnosis?

My model has an accuracy of 0.77, which means it correctly predicts about 77% of the outcomes. Its specificity (about 0.879) is higher than its sensitivity (about 0.568), so the model is more effective at identifying people without diabetes than people who have diabetes.

This means the model misses many actual diabetes cases: 114 of the 264 people with diabetes are false negatives. This matters for medical diagnosis because missing a real case can delay treatment, which is why I would want to improve sensitivity.
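
To make the two kinds of error concrete, the error rates can be computed directly from the confusion matrix. The sketch below rebuilds the matrix from the counts printed above, so it does not need the model or the data; the names cm_counts, fn_rate, and fp_rate are introduced here only for illustration.

```r
# Counts copied from the confusion matrix printed above (rows = Predicted, columns = Actual)
cm_counts <- matrix(c(429, 59, 114, 150), nrow = 2,
                    dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

# False negative rate: share of actual diabetes cases predicted as no diabetes
fn_rate <- cm_counts["0", "1"] / sum(cm_counts[, "1"])

# False positive rate: share of actual non-diabetes cases predicted as diabetes
fp_rate <- cm_counts["1", "0"] / sum(cm_counts[, "0"])

round(c(false_negative_rate = fn_rate, false_positive_rate = fp_rate), 3)
```

The false negative rate equals 1 - sensitivity and the false positive rate equals 1 - specificity, so these two numbers restate the metrics above in terms of mistakes.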

Question 3: ROC Curve, AUC, and Interpretation

# Citations/Disclaimer: This code and analysis follow what I learned from the course/class notes

# Here, I load the pROC library, which I use to create the ROC curve and to calculate the AUC (Area Under the Curve) for my model.
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# I created an ROC object using the actual outcomes and the predicted probabilities from my model.
# This helps me evaluate how well my model separates people with and without diabetes.

roc_obj <- roc(data_subset$Outcome_num, prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# I plotted the ROC curve to visually see the model’s performance.

# The curve shows the trade-off between sensitivity (the true positive rate) and specificity (the true negative rate, where 1 - specificity is the false positive rate).

plot(roc_obj, col = "blue")

# I calculate the AUC, which summarizes the overall performance of the model.

# A higher AUC means that the model does a better job of distinguishing between the two classes.

auc_value <- auc(roc_obj)

# I printed the AUC value so I would be able to interpret the strength of my model overall. 
auc_value
## Area under the curve: 0.828

What does AUC indicate (0.5 = random, 1.0 = perfect)?

The AUC measures how well my model can distinguish between people who have diabetes and people who do not. An AUC of 0.5 means the model does no better than random guessing, while 1.0 means perfect classification.

My model's AUC of 0.828 is good, which shows that it does a solid job of separating the two groups.
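
One way to read the AUC is as the probability that a randomly chosen person with diabetes gets a higher predicted probability than a randomly chosen person without it (ties count as half). A minimal sketch of that pairwise calculation follows. The toy_actual and toy_prob values are made up for illustration and are not taken from the homework data.

```r
# Made-up toy values for illustration only (not the homework data)
toy_actual <- c(0, 0, 0, 0, 1, 1, 1, 1)
toy_prob   <- c(0.05, 0.15, 0.30, 0.60, 0.10, 0.25, 0.55, 0.90)

# Predicted probabilities split by true class
cases    <- toy_prob[toy_actual == 1]
controls <- toy_prob[toy_actual == 0]

# Compare every case with every control: 1 if the case scores higher, 0.5 for a tie
pair_scores <- outer(cases, controls, ">") + 0.5 * outer(cases, controls, "==")

# The AUC is the average over all case-control pairs
auc_by_pairs <- mean(pair_scores)
auc_by_pairs
```

On this toy example, 10 of the 16 case-control pairs are ordered correctly, so the result is 0.625. Applying the same idea to prob and data_subset$Outcome_num should agree with the value reported by auc(roc_obj).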

For diabetes diagnosis, prioritize sensitivity (catching cases) or specificity (avoiding false positives)? Suggest a threshold and explain.

For diabetes diagnosis, I would prioritize sensitivity, because it is more important to correctly identify the people who have diabetes.

Missing a real case (a false negative) could delay treatment. I would choose a lower threshold of around 0.2 to increase sensitivity, even though it reduces specificity and produces more false positives.
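
The effect of lowering the threshold can be checked by recomputing sensitivity and specificity at several cutoffs. The sketch below defines a small helper, metrics_at(), which is a hypothetical function written for this illustration and not part of any package. It is demonstrated on made-up toy values rather than the homework data.

```r
# Hypothetical helper: sensitivity and specificity at a given threshold
metrics_at <- function(prob, actual, threshold) {
  pred <- as.integer(prob > threshold)   # same rule as above: predict 1 if prob > threshold
  TP <- sum(pred == 1 & actual == 1)
  FN <- sum(pred == 0 & actual == 1)
  TN <- sum(pred == 0 & actual == 0)
  FP <- sum(pred == 1 & actual == 0)
  c(threshold = threshold,
    sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP))
}

# Made-up toy values for illustration only (not the homework data)
toy_actual <- c(0, 0, 0, 0, 1, 1, 1, 1)
toy_prob   <- c(0.05, 0.15, 0.30, 0.60, 0.10, 0.25, 0.55, 0.90)

# Compare the proposed 0.2 threshold with the default 0.5
results <- t(sapply(c(0.2, 0.5), function(th) metrics_at(toy_prob, toy_actual, th)))
results
```

On the toy values, moving from 0.5 to 0.2 raises sensitivity from 0.5 to 0.75 and lowers specificity from 0.75 to 0.5. Calling metrics_at(prob, data_subset$Outcome_num, 0.2) would show the actual trade-off for the fitted model.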