Introduction:

In this homework, you will apply logistic regression to a real-world dataset: the Pima Indians Diabetes Database. This dataset contains medical records from 768 women of Pima Indian heritage, aged 21 or older, and is used to predict the onset of diabetes (binary outcome: 0 = no diabetes, 1 = diabetes) based on physiological measurements.

The dataset was originally distributed through the UCI Machine Learning Repository; the copy used here is hosted on GitHub and can be imported directly from the URL below.

Dataset URL: https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv

Columns (no header in the CSV, so we need to assign them manually):

  1. Pregnancies: Number of times pregnant
  2. Glucose: Plasma glucose concentration (2-hour test)
  3. BloodPressure: Diastolic blood pressure (mm Hg)
  4. SkinThickness: Triceps skin fold thickness (mm)
  5. Insulin: 2-hour serum insulin (mu U/ml)
  6. BMI: Body mass index (weight in kg/(height in m)^2)
  7. DiabetesPedigreeFunction: Diabetes pedigree function (a function scoring genetic risk)
  8. Age: Age in years
  9. Outcome: Class variable (0 = no diabetes, 1 = diabetes)

Task Overview: You will load the data, build a logistic regression model to predict diabetes onset using a subset of predictors (Glucose, BMI, Age), interpret the model, evaluate it with a confusion matrix and metrics, and analyze the ROC curve and AUC.

Cleaning the dataset: Don’t change the following code.

library(tidyverse)
url <- "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

data <- read.csv(url, header = FALSE)

colnames(data) <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome")

data$Outcome <- as.factor(data$Outcome)

# Handle missing values (replace 0s with NA because 0 makes no sense here)
data$Glucose[data$Glucose == 0] <- NA
data$BloodPressure[data$BloodPressure == 0] <- NA
data$BMI[data$BMI == 0] <- NA


colSums(is.na(data))
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        5                       35 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                       11 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

Question 1: Create and Interpret a Logistic Regression Model - Fit a logistic regression model to predict Outcome using Glucose, BMI, and Age.

# Here, I fit a logistic regression model to predict whether a person has diabetes (Outcome)
# I use Glucose, BMI, and Age as predictors because they are important health indicators related to diabetes risk
model <- glm(Outcome ~ Glucose + BMI + Age, 
             data = data, 
             family = "binomial")

# I use the summary() function to examine the results of my model
# It shows the coefficients, which describe how each variable affects the log-odds of diabetes
# It also shows the p-values, which help me determine whether each predictor is statistically significant
summary(model)
## 
## Call:
## glm(formula = Outcome ~ Glucose + BMI + Age, family = "binomial", 
##     data = data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.032377   0.711037 -12.703  < 2e-16 ***
## Glucose      0.035548   0.003481  10.212  < 2e-16 ***
## BMI          0.089753   0.014377   6.243  4.3e-10 ***
## Age          0.028699   0.007809   3.675 0.000238 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 974.75  on 751  degrees of freedom
## Residual deviance: 724.96  on 748  degrees of freedom
##   (16 observations deleted due to missingness)
## AIC: 732.96
## 
## Number of Fisher Scoring iterations: 4
# Here, I calculate a pseudo R-squared value, since logistic regression does not have a true R-squared like linear regression
# This deviance-based formula (McFadden's pseudo R-squared) measures how much my model reduces the deviance relative to the intercept-only (null) model
R2 <- 1 - (model$deviance / model$null.deviance)

# Finally, I print the pseudo R-squared value so I can interpret how well my model fits the data
R2
## [1] 0.25626

What does the intercept represent (log-odds of diabetes when predictors are zero)?

The intercept is the model’s baseline: the predicted log-odds of diabetes when Glucose, BMI, and Age are all zero. Here it is about -9.03, but those predictor values are unrealistic for a person, so the intercept has little practical meaning on its own.
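To make the log-odds scale more concrete, the intercept can be converted to a probability with the inverse logit. The sketch below copies the intercept estimate from the summary output above instead of reading it from the fitted model, and the variable names are illustrative only.

```r
# Intercept estimate copied from the summary() output (log-odds scale)
intercept <- -9.032377

# plogis() is base R's inverse logit: exp(x) / (1 + exp(x))
p_at_zero <- plogis(intercept)

# Predicted probability of diabetes when Glucose, BMI, and Age are all 0
print(signif(p_at_zero, 3))
```

The resulting probability is far below 1%, which reflects how far the all-zero point lies from any realistic patient, not a meaningful baseline risk.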

For each predictor (Glucose, BMI, Age), does a one-unit increase raise or lower the odds of diabetes? Are they significant (p-value < 0.05)?

For Glucose, a one-unit increase raises the log-odds of diabetes by about 0.0355, so the odds of diabetes increase as glucose increases. The p-value is below 2e-16, so the effect is statistically significant.

For BMI, a one-unit increase raises the log-odds of diabetes by about 0.0898, which indicates an increase in risk. The p-value (4.3e-10) is less than 0.05, so BMI is statistically significant and contributes to the prediction.

For Age, a one-unit (one-year) increase raises the log-odds of diabetes by about 0.0287, so the odds of diabetes increase with age. The p-value (0.000238) is less than 0.05, so Age is statistically significant and plays a role in predicting diabetes.
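Log-odds changes are easier to interpret as odds ratios, which come from exponentiating each coefficient. The following is a minimal sketch that hard-codes the estimates from the summary output above; the names coefs, odds_ratios, and pct_change are my own and not part of the assignment.

```r
# Coefficient estimates copied from the summary() output (log-odds scale)
coefs <- c(Glucose = 0.035548, BMI = 0.089753, Age = 0.028699)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios <- exp(coefs)

# Percent change in the odds of diabetes per one-unit increase,
# holding the other two predictors fixed
pct_change <- (odds_ratios - 1) * 100

print(round(odds_ratios, 3))
print(round(pct_change, 1))
```

All three odds ratios are above 1, which agrees with the direction of the coefficients. With the fitted model in scope, exp(coef(model)) should give the same ratios, plus the exponentiated intercept.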

Question 2: Confusion Matrix and Important Metric

Calculate and report the metrics:

Accuracy: (TP + TN) / Total
Sensitivity (Recall): TP / (TP + FN)
Specificity: TN / (TN + FP)
Precision: TP / (TP + FP)

Use the following starter code

# Keep only rows with no missing values in Glucose, BMI, or Age
data_subset <- data[complete.cases(data[, c("Glucose", "BMI", "Age")]), ]

# Create a numeric version of the outcome (0 = no diabetes, 1 = diabetes). This is required for calculating confusion matrices.
data_subset$Outcome_num <- ifelse(data_subset$Outcome == "1", 1, 0)


# Predicted probabilities

# I used the predict() function to estimate the probability that each person in my dataset has diabetes
prob <- predict(model, data_subset, type = "response")


# Predicted classes

# Here I apply a 0.5 threshold: if the predicted probability is greater than 0.5, I predict diabetes (1); otherwise I predict no diabetes (0)
pred <- ifelse(prob > 0.5, 1, 0)

# Confusion matrix

# I create a confusion matrix to compare my model’s predictions with the actual outcomes. This helps me see how many predictions were correct and where the model made mistakes (false positives or false negatives)
cm <- table(Predicted = pred, Actual = data_subset$Outcome_num)
cm
##          Actual
## Predicted   0   1
##         0 429 114
##         1  59 150
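# Extract values:
# Here, I extract the values from my confusion matrix so I can calculate performance metrics
# TN (True Negatives): predicted no diabetes, and there was actually no diabetes
# FP (False Positives): predicted diabetes, but there was actually no diabetes
# FN (False Negatives): predicted no diabetes, but there was actually diabetes
# TP (True Positives): predicted diabetes, and there was actually diabetes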
TN <- cm[1,1]  
FP <- cm[2,1]
FN <- cm[1,2]
TP <- cm[2,2]

# Metrics
# Accuracy is the proportion of all predictions that my model got right
# Sensitivity is the proportion of actual diabetes cases that my model correctly identifies
# Specificity is the proportion of actual non-diabetes cases that my model correctly identifies
# Precision is the proportion of predicted diabetes cases that actually were diabetes
accuracy <- (TP + TN) / sum(cm)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision <- TP / (TP + FP)

# Here I print all the metrics, rounded to 3 decimal places so they are easier to read
cat("Accuracy:", round(accuracy, 3), "\nSensitivity:", round(sensitivity, 3), "\nSpecificity:", round(specificity, 3), "\nPrecision:", round(precision, 3))
## Accuracy: 0.77 
## Sensitivity: 0.568 
## Specificity: 0.879 
## Precision: 0.718

Interpret: How well does the model perform? Is it better at detecting diabetes (sensitivity) or non-diabetes (specificity)? Why might this matter for medical diagnosis?

My model has an accuracy of 0.77, which means it correctly predicts about 77% of the outcomes. Its specificity (about 0.879) is higher than its sensitivity (about 0.568), so it is better at identifying people without diabetes than people with diabetes. In the confusion matrix, 114 of the 264 actual diabetes cases are false negatives, so the model misses about 43% of real cases. This matters for medical diagnosis because missing a real case can delay treatment, which is why I would want to improve sensitivity.

Question 3: ROC Curve, AUC, and Interpretation

# Here, I load the pROC library, which I use to create the ROC curve and calculate the AUC (Area Under the Curve) for my model
library(pROC)
# I create a ROC object from the actual outcomes and the predicted probabilities from my model. This helps me evaluate how well my model separates people with and without diabetes.
roc_obj <- roc(data_subset$Outcome_num, prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# I plot the ROC curve to see the model’s performance visually. The curve shows the trade-off between sensitivity (the true positive rate) and specificity (the true negative rate, which is 1 minus the false positive rate).
plot(roc_obj, col = "blue")

# I calculate the AUC, which summarizes the overall performance of the model. A higher AUC essentially means that the model does a better job of distinguishing between the two classes
auc_value <- auc(roc_obj)

# Finally, I print the AUC value so I can interpret the overall strength of my model
auc_value
## Area under the curve: 0.828

What does AUC indicate (0.5 = random, 1.0 = perfect)?

The AUC measures how well my model can distinguish between people who have diabetes and people who do not. An AUC of 0.5 means the model does no better than random guessing, while 1.0 means perfect classification. My model’s AUC is 0.828, which is much closer to 1.0 than to 0.5, so it does a solid job of separating the two groups.
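Another way to read the AUC is as the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case. The sketch below computes that quantity from ranks in base R. The scores and labels are toy values invented for illustration, not the Pima data, and auc_by_rank is a hypothetical helper, not a pROC function.

```r
# Toy scores and labels, invented for illustration (not from the Pima data)
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0, 0, 1, 1)

# Rank-based AUC: the share of (positive, negative) pairs in which the
# positive case has the higher score; ties receive average ranks
auc_by_rank <- function(scores, labels) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  rank_sum_pos <- sum(rank(scores)[labels == 1])
  (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_auc <- auc_by_rank(scores, labels)
print(toy_auc)
```

In the toy data, three of the four positive-negative pairs are ordered correctly, so the function returns 0.75. Applied to prob and data_subset$Outcome_num, the same calculation should agree with the value reported by auc().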

For diabetes diagnosis, prioritize sensitivity (catching cases) or specificity (avoiding false positives)? Suggest a threshold and explain.

For diabetes diagnosis, I would prioritize sensitivity, since it is more important to identify people who actually have diabetes: a false negative (missing a real case) could delay treatment. I would choose a lower threshold of around 0.2 to increase sensitivity, even though this reduces specificity and produces more false positives.
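One way to check a threshold choice is to recompute sensitivity and specificity at several cutoffs and compare them side by side. The sketch below does this with a hypothetical helper, metrics_at, on toy probabilities invented for illustration; the numbers it prints say nothing about the Pima model itself.

```r
# Sensitivity and specificity of a 0/1 classifier at a given cutoff
metrics_at <- function(prob, actual, threshold) {
  pred <- ifelse(prob > threshold, 1, 0)
  TP <- sum(pred == 1 & actual == 1)
  TN <- sum(pred == 0 & actual == 0)
  FP <- sum(pred == 1 & actual == 0)
  FN <- sum(pred == 0 & actual == 1)
  c(threshold = threshold,
    sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP))
}

# Toy predicted probabilities and true classes (assumed values, not real data)
toy_prob <- c(0.05, 0.15, 0.30, 0.60, 0.25, 0.45, 0.70, 0.90)
toy_actual <- c(0, 0, 0, 0, 1, 1, 1, 1)

# One row per threshold: lowering the cutoff trades specificity for sensitivity
results <- t(sapply(c(0.2, 0.5), function(th) metrics_at(toy_prob, toy_actual, th)))
print(results)
```

On the real data, calling metrics_at(prob, data_subset$Outcome_num, 0.2) would show how much specificity is actually given up at the lower threshold before committing to it.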