This demo covers some fundamental classification techniques: Logistic Regression, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA). We’ll also focus on a crucial aspect of model building: robust evaluation using cross-validation.

Learning Objectives:

  • Understand the theoretical underpinnings of Logistic Regression, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA).
  • Implement these models in R.
  • Apply k-fold cross-validation for robust model evaluation.
  • Compare the performance of different classification models.

What are Classification Models?

  • Goal: Predict a categorical outcome variable (e.g., spam/not spam, disease/no disease, customer churn/no churn).
  • Supervised Learning: We have labeled data (features and known outcomes).

Brief Theoretical Overview

1. Logistic Regression:

  • What’s the core idea? (Predicts the probability of belonging to a class using a logistic function).
  • What type of decision boundary does it create? (Linear).
  • Assumptions? (No strong distributional assumptions on predictors, but linearity in the log-odds is assumed).
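For reference, logistic regression models the class probability with the logistic function

\(P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}\)

so the log-odds, \(\log\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p\), are linear in the predictors.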

2. Linear Discriminant Analysis (LDA):

  • What’s the core idea? (Assumes predictors within each class follow a multivariate Gaussian distribution with equal covariance matrices).
  • How does it classify? (Finds linear combinations of predictors that best separate the classes).
  • Decision boundary? (Linear).
  • When might it be preferred over logistic regression? (When classes are well-separated, or when the normality assumption holds).
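For reference, with class means \(\mu_k\), shared covariance matrix \(\Sigma\), and prior probabilities \(\pi_k\), LDA assigns an observation \(x\) to the class with the largest discriminant score

\(\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k\)

Because \(\delta_k(x)\) is linear in \(x\), the boundary between any two classes is linear.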

3. Quadratic Discriminant Analysis (QDA):

  • How does it differ from LDA? (Assumes predictors within each class follow a multivariate Gaussian distribution but with unequal covariance matrices).
  • Decision boundary? (Quadratic).
  • When might it be preferred over LDA? (When the assumption of equal covariance matrices is violated, and the true decision boundary is non-linear).
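For reference, QDA replaces the shared \(\Sigma\) with class-specific covariance matrices \(\Sigma_k\), giving the discriminant score

\(\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) - \frac{1}{2}\log|\Sigma_k| + \log \pi_k\)

which is quadratic in \(x\), hence the quadratic decision boundary.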

Why Cross-Validation?

  • Why don’t we just split our data once into training and testing sets? (Discuss limitations: sensitivity to split, less robust performance estimate).
  • Cross-validation: A technique to get a more reliable estimate of a model’s performance on unseen data by repeatedly splitting the data into training and validation sets.
  • k-Fold Cross-Validation: Explain the process – divide data into k folds, train on k-1 folds, validate on the remaining fold, repeat k times (a minimal sketch of the fold mechanics follows this list).
  • Benefits: More robust performance estimate, uses all data for both training and validation.
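As referenced above, here is a minimal sketch of the fold mechanics using caret's createFolds() on a toy outcome (illustration only; in the demo below, caret manages the folds internally):

library(caret)
set.seed(1)
y_toy <- factor(sample(c("No", "Yes"), 100, replace = TRUE)) # toy outcome for illustration
folds <- createFolds(y_toy, k = 10) # list of 10 held-out index vectors, stratified by class
str(folds[1:2])                     # each element holds the row indices of one validation fold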

R Demo

Data Prep

We’ll use the PimaIndiansDiabetes2 dataset from the mlbench package.

# Install necessary packages (if not already installed)
# install.packages("mlbench")
# install.packages("caret") # For cross-validation and convenient model training
# install.packages("MASS")  # For LDA and QDA

# Load packages
library(mlbench)   # For PimaIndiansDiabetes2 dataset
library(caret)     # For cross-validation and model training utilities
library(MASS)      # For lda() and qda() functions

# Load the dataset
data(PimaIndiansDiabetes2)
my_data <- PimaIndiansDiabetes2

# Explore the data
head(my_data)
summary(my_data)
    pregnant         glucose         pressure     
 Min.   : 0.000   Min.   : 44.0   Min.   : 24.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00  
 Median : 3.000   Median :117.0   Median : 72.00  
 Mean   : 3.845   Mean   :121.7   Mean   : 72.41  
 3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00  
                  NA's   :5       NA's   :35      
    triceps         insulin            mass      
 Min.   : 7.00   Min.   : 14.00   Min.   :18.20  
 1st Qu.:22.00   1st Qu.: 76.25   1st Qu.:27.50  
 Median :29.00   Median :125.00   Median :32.30  
 Mean   :29.15   Mean   :155.55   Mean   :32.46  
 3rd Qu.:36.00   3rd Qu.:190.00   3rd Qu.:36.60  
 Max.   :99.00   Max.   :846.00   Max.   :67.10  
 NA's   :227     NA's   :374      NA's   :11     
    pedigree           age        diabetes 
 Min.   :0.0780   Min.   :21.00   neg:500  
 1st Qu.:0.2437   1st Qu.:24.00   pos:268  
 Median :0.3725   Median :29.00            
 Mean   :0.4719   Mean   :33.24            
 3rd Qu.:0.6262   3rd Qu.:41.00            
 Max.   :2.4200   Max.   :81.00            
                                           
str(my_data)
'data.frame':   768 obs. of  9 variables:
 $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
 $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
 $ pressure: num  72 66 64 66 40 74 50 NA 70 96 ...
 $ triceps : num  35 29 NA 23 35 NA 32 NA 45 NA ...
 $ insulin : num  NA NA NA 94 168 NA 88 NA 543 NA ...
 $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
 $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
 $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
 $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
# Check for missing values
colSums(is.na(my_data))
pregnant  glucose pressure  triceps  insulin     mass pedigree 
       0        5       35      227      374       11        0 
     age diabetes 
       0        0 
# Handle missing values: For simplicity in this demo, we'll remove rows with NA.
# In a real-world scenario, imputation would be a better approach for missing data.
my_data <- na.omit(my_data)
colSums(is.na(my_data)) # Verify no more NAs
pregnant  glucose pressure  triceps  insulin     mass pedigree 
       0        0        0        0        0        0        0 
     age diabetes 
       0        0 
# Convert 'diabetes' to a factor with desired levels if not already
# The 'outcome' variable needs to be a factor for classification models in R.
# Let's ensure the levels are meaningful.
my_data$diabetes <- factor(my_data$diabetes, levels = c("neg", "pos"), labels = c("No", "Yes"))

# Ensure other variables are numeric where appropriate
# (PimaIndiansDiabetes2 is generally clean, but good practice to check)
  • What did we just do with na.omit()? What are the implications?
  • Why is it important for our outcome variable (diabetes) to be a factor?
  • What are some other strategies for handling missing data besides na.omit()? (Imputation, etc.)
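As an alternative to na.omit(), here is a minimal imputation sketch using caret's preProcess() with median imputation of the numeric predictors (a simple stand-in; model-based imputation, e.g. via the mice package, is usually preferable):

pre <- preProcess(PimaIndiansDiabetes2[, -9], method = "medianImpute") # column 9 is the outcome
imputed <- predict(pre, PimaIndiansDiabetes2[, -9])                    # fill NAs with column medians
imputed$diabetes <- PimaIndiansDiabetes2$diabetes                      # reattach the outcome
colSums(is.na(imputed))                                                # verify: no NAs remain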

Model Fitting and Evaluation with Cross-Validation

Now that our data is ready, let’s fit our three models and evaluate them using k-fold cross-validation. We’ll use the caret package, which simplifies the cross-validation process significantly.

Setting up Cross-Validation Control

# Set a random seed for reproducibility
set.seed(123)

# Define cross-validation control
# We'll use 10-fold cross-validation
# 'repeats = 3' means the entire 10-fold CV process is repeated 3 times.
# This gives an even more stable estimate of performance.
train_control <- trainControl(
  method = "repeatedcv",
  number = 10,       # k = 10 folds
  repeats = 3,       # Repeat the 10-fold CV 3 times
  classProbs = TRUE, # Needed for ROC curve later if desired
  summaryFunction = twoClassSummary # For classification, provides AUC, sensitivity, specificity
)
  • Why set.seed()?
  • What’s the difference between number and repeats in trainControl?
  • What other method options are available in trainControl? (e.g., "boot", "LOOCV"; minimal sketches below)
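For illustration, minimal sketches of two alternative resampling controls (not used in this demo):

train_control_boot  <- trainControl(method = "boot", number = 25) # 25 bootstrap resamples
train_control_loocv <- trainControl(method = "LOOCV")             # leave-one-out cross-validation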

Logistic Regression (glm)

# Fit Logistic Regression model using caret's train function
# The 'method = "glm"' specifies a generalized linear model (logistic regression in this case)
# 'family = "binomial"' is passed through to the underlying glm() call
# (caret would also infer it for a two-class factor outcome).
# 'metric = "ROC"' tells caret to optimize/report based on Area Under the ROC Curve.
# This is a good metric for imbalanced classes.

cat("\n--- Fitting Logistic Regression Model ---\n")

--- Fitting Logistic Regression Model ---
model_glm <- train(
  diabetes ~ .,         # Predict 'diabetes' using all other variables
  data = my_data,       # Our pre-processed dataset
  method = "glm",       # Specify Logistic Regression
  family = "binomial",  # For binary classification
  trControl = train_control, # Our defined CV control
  metric = "ROC"        # Evaluate using ROC AUC
)

# Print the model results
print(model_glm)
Generalized Linear Model 

392 samples
  8 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 352, 353, 353, 353, 353, 353, ... 
Resampling results:

  ROC        Sens       Spec     
  0.8505187  0.8877968  0.5538462
# Access individual folds' performance (optional, for deeper insight)
# model_glm$resample

Linear Discriminant Analysis (LDA)

# Fit LDA model using caret's train function
# 'method = "lda"' for Linear Discriminant Analysis

cat("\n--- Fitting LDA Model ---\n")

--- Fitting LDA Model ---
model_lda <- train(
  diabetes ~ .,
  data = my_data,
  method = "lda",
  trControl = train_control,
  metric = "ROC"
)

# Print the model results
print(model_lda)
Linear Discriminant Analysis 

392 samples
  8 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 352, 353, 353, 352, 353, 353, ... 
Resampling results:

  ROC        Sens       Spec     
  0.8397984  0.8841406  0.5692308

Quadratic Discriminant Analysis (QDA)

# Fit QDA model using caret's train function
# 'method = "qda"' for Quadratic Discriminant Analysis

cat("\n--- Fitting QDA Model ---\n")

--- Fitting QDA Model ---
model_qda <- train(
  diabetes ~ .,
  data = my_data,
  method = "qda",
  trControl = train_control,
  metric = "ROC"
)

# Print the model results
print(model_qda)
Quadratic Discriminant Analysis 

392 samples
  8 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 353, 353, 353, 352, 353, 353, ... 
Resampling results:

  ROC        Sens       Spec     
  0.8265505  0.8643875  0.6153846

Comparing Models

# Compare the models using resamples
# This provides a statistical comparison of performance metrics across models.
resample_results <- resamples(list(
  Logistic = model_glm,
  LDA = model_lda,
  QDA = model_qda
))

# Summarize the comparison
summary(resample_results)

Call:
summary.resamples(object = resample_results)

Models: Logistic, LDA, QDA 
Number of resamples: 30 

ROC 
              Min.   1st Qu.    Median      Mean   3rd Qu.
Logistic 0.7485207 0.7995562 0.8431953 0.8505187 0.9075444
LDA      0.7236467 0.7914201 0.8284024 0.8397984 0.9016272
QDA      0.6210826 0.7877219 0.8317993 0.8265505 0.8823964
              Max. NA's
Logistic 0.9792899    0
LDA      0.9467456    0
QDA      0.9556213    0

Sens 
              Min.   1st Qu.    Median      Mean   3rd Qu.
Logistic 0.7307692 0.8475783 0.8888889 0.8877968 0.9230769
LDA      0.7407407 0.8461538 0.9038462 0.8841406 0.9230769
QDA      0.7037037 0.8461538 0.8490028 0.8643875 0.9145299
         Max. NA's
Logistic    1    0
LDA         1    0
QDA         1    0

Spec 
              Min.   1st Qu.    Median      Mean   3rd Qu.
Logistic 0.2307692 0.4615385 0.5384615 0.5538462 0.6153846
LDA      0.3076923 0.4615385 0.5384615 0.5692308 0.6730769
QDA      0.3846154 0.4807692 0.6153846 0.6153846 0.6923077
              Max. NA's
Logistic 0.8461538    0
LDA      0.8461538    0
QDA      0.9230769    0
# Visualize the comparison (e.g., box plots of ROC, Sensitivity, Specificity)
scales <- list(x = list(relation = "free"), y = list(relation = "free"))
bwplot(resample_results, scales = scales)


# You can also use a dotplot for a different visualization
dotplot(resample_results, scales = scales)

  • Looking at the output of print(model_glm), what do the ‘ROC’, ‘Sens’, and ‘Spec’ values represent? (Area Under the ROC Curve, Sensitivity/Recall, Specificity). Note that twoClassSummary treats the first factor level (“No”) as the event class, so ‘Sens’ here is the rate of correctly identifying non-diabetic cases.
  • How do the ROC values compare across the three models? Which one performs best on average according to this metric?
  • What does the bwplot or dotplot tell us about the variability of the model’s performance across the folds?
  • Based on these results, which model would you recommend for this dataset and why? (Encourage discussion on trade-offs, interpretability, and assumptions).
  • If a model has a much higher ROC but lower Sens compared to another, what might that imply about its performance on different types of errors (false positives vs. false negatives)?

Predictions and Interpretation

While caret handles much of the complexity, it’s important to understand how to make predictions with your trained models and interpret their coefficients.

Making Predictions

# Let's use the best performing model (or pick one for demonstration)
# For example, let's use the Logistic Regression model for prediction.
# We'll predict on the original training data for simplicity,
# but in a real scenario, you'd use a completely unseen test set.

# Predict class labels
predictions_glm <- predict(model_glm, newdata = my_data)
head(predictions_glm)
[1] No  Yes No  Yes Yes Yes
Levels: No Yes
# Predict class probabilities
probabilities_glm <- predict(model_glm, newdata = my_data, type = "prob")
head(probabilities_glm)

# Create a confusion matrix for the logistic regression model on training data
# This is NOT cross-validated performance, but good for understanding.
confusionMatrix(predictions_glm, my_data$diabetes)
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  233  56
       Yes  29  74
                                        
               Accuracy : 0.7832        
                 95% CI : (0.739, 0.823)
    No Information Rate : 0.6684        
    P-Value [Acc > NIR] : 3.836e-07     
                                        
                  Kappa : 0.4839        
                                        
 Mcnemar's Test P-Value : 0.004801      
                                        
            Sensitivity : 0.8893        
            Specificity : 0.5692        
         Pos Pred Value : 0.8062        
         Neg Pred Value : 0.7184        
             Prevalence : 0.6684        
         Detection Rate : 0.5944        
   Detection Prevalence : 0.7372        
      Balanced Accuracy : 0.7293        
                                        
       'Positive' Class : No            
                                        

Interpreting Coefficients (Logistic Regression)

# For Logistic Regression, we can inspect the coefficients
# Note: For LDA/QDA, interpretation is more about discriminant functions,
# not direct coefficient interpretation like GLM.

# Access the fitted GLM model object from the caret train object
glm_object <- model_glm$finalModel
summary(glm_object)

Call:
NULL

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.004e+01  1.218e+00  -8.246  < 2e-16 ***
pregnant     8.216e-02  5.543e-02   1.482  0.13825    
glucose      3.827e-02  5.768e-03   6.635 3.24e-11 ***
pressure    -1.420e-03  1.183e-02  -0.120  0.90446    
triceps      1.122e-02  1.708e-02   0.657  0.51128    
insulin     -8.253e-04  1.306e-03  -0.632  0.52757    
mass         7.054e-02  2.734e-02   2.580  0.00989 ** 
pedigree     1.141e+00  4.274e-01   2.669  0.00760 ** 
age          3.395e-02  1.838e-02   1.847  0.06474 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 498.10  on 391  degrees of freedom
Residual deviance: 344.02  on 383  degrees of freedom
AIC: 362.02

Number of Fisher Scoring iterations: 5
# Interpret coefficients (e.g., odds ratios)
# For a coefficient 'b', the odds ratio is exp(b)
# Example: For 'glucose', if the coefficient is 'x', then for every unit increase in glucose,
# the odds of having diabetes are multiplied by exp(x), assuming other predictors are constant.
exp(coef(glm_object))
 (Intercept)     pregnant      glucose     pressure 
4.358754e-05 1.085629e+00 1.039011e+00 9.985807e-01 
     triceps      insulin         mass     pedigree 
1.011285e+00 9.991750e-01 1.073085e+00 3.129611e+00 
         age 
1.034535e+00 
  • What’s the difference between type = "raw" (default predict output) and type = "prob"?
  • Why is interpreting coefficients more straightforward for Logistic Regression than for LDA/QDA?
  • What does an odds ratio less than 1 mean? Greater than 1?
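For a concrete reading of the exp(coef()) output above: glucose has an odds ratio of about 1.039, so each one-unit increase in glucose multiplies the odds of diabetes by roughly 1.039 (about a 3.9% increase), holding the other predictors constant. Insulin's odds ratio is about 0.999 (just below 1), meaning each additional unit is associated with a very slight decrease in the odds.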

Recap and Advanced Topics

  • Logistic Regression: Linear decision boundary, predicts probabilities.
  • LDA: Linear decision boundary, assumes equal covariance matrices.
  • QDA: Quadratic decision boundary, assumes unequal covariance matrices.
  • Cross-validation: Essential for reliable model performance estimation.

Further Exploration

  • Hyperparameter Tuning: For more complex models (like Random Forests, SVMs), caret can also tune hyperparameters automatically during cross-validation. Explore tuneGrid and tuneLength arguments in train() (a sketch follows this list).
  • Other Metrics: Investigate other evaluation metrics like Precision, Recall, F1-score, especially for imbalanced datasets.
  • Imbalanced Data: The PimaIndiansDiabetes2 dataset has some class imbalance. Research techniques for handling imbalanced data (e.g., SMOTE, undersampling/oversampling) and how they integrate with caret.
  • Feature Engineering/Selection: How would feature engineering or feature selection impact these models?
  • Visualization: Create ROC curves for each model to visualize their trade-offs between sensitivity and specificity.
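As referenced in the hyperparameter tuning bullet above, a hypothetical sketch of automatic tuning with caret (the glmnet method and the tuneLength value are illustrative choices, not part of this demo):

# install.packages("glmnet")
model_glmnet <- train(
  diabetes ~ .,
  data = my_data,
  method = "glmnet",         # penalized logistic regression; caret tunes alpha and lambda
  trControl = train_control,
  metric = "ROC",
  tuneLength = 5             # try 5 candidate values per tuning parameter
)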

Appendix

ROC Curve Plot (requires pROC package or manual calculation)

# install.packages("pROC")
library(pROC)

# Get predictions from one of the models (e.g., Logistic Regression)
# Note: This is on the training data, for illustration. For true ROC, use a held-out test set.
prob_glm <- predict(model_glm, newdata = my_data, type = "prob")

# Create ROC object
roc_glm <- roc(response = my_data$diabetes, predictor = prob_glm$Yes, levels = c("No", "Yes"))
Setting direction: controls < cases
# Plot ROC curve
plot(roc_glm, main = "ROC Curve for Logistic Regression", col = "blue", lwd = 2)

auc(roc_glm) # Area Under Curve
Area under the curve: 0.8624
# You could loop this for all models and plot on the same graph for comparison

Without caret

It’s certainly possible to perform cross-validation and fit these models without caret. While caret simplifies the process significantly, doing it manually provides a deeper understanding of the underlying mechanics.

Here’s the same demo without using the caret package. This version focuses more on manual loops for cross-validation and direct calls to glm, lda, and qda.

Manual K-Fold Cross-Validation Setup

# Set a random seed for reproducibility
set.seed(123)

# Define number of folds (k)
k_folds <- 10
num_repeats <- 3 # We'll repeat the entire CV process multiple times for more stable results

# Create empty lists to store results for each model and repeat
results_glm <- list()
results_lda <- list()
results_qda <- list()

# Store performance metrics (ROC, Sensitivity, Specificity) for each fold/repeat
metrics_glm <- data.frame(ROC = numeric(), Sensitivity = numeric(), Specificity = numeric())
metrics_lda <- data.frame(ROC = numeric(), Sensitivity = numeric(), Specificity = numeric())
metrics_qda <- data.frame(ROC = numeric(), Sensitivity = numeric(), Specificity = numeric())

# Get the number of observations
n_obs <- nrow(my_data)

# Outer loop for repeats
for (rep in 1:num_repeats) {
  cat(paste0("\n--- Cross-Validation Repeat ", rep, " ---\n"))

  # Randomly permute the data to ensure folds are diverse
  shuffled_data <- my_data[sample(n_obs), ]

  # Create folds
  # This creates a vector indicating which fold each row belongs to
  folds <- cut(seq(1, n_obs), breaks = k_folds, labels = FALSE)

  # Inner loop for folds
  for (i in 1:k_folds) {
    cat(paste0("  Fold ", i, "...\n"))

    # Define training and testing sets for this fold
    test_indices <- which(folds == i)
    train_data <- shuffled_data[-test_indices, ]
    test_data <- shuffled_data[test_indices, ]

    # --- Fit Logistic Regression ---
    tryCatch({ # Use tryCatch to handle potential errors (e.g., perfect separation)
      model_glm_fold <- glm(diabetes ~ ., data = train_data, family = "binomial")
      pred_prob_glm <- predict(model_glm_fold, newdata = test_data, type = "response")
      pred_class_glm <- factor(ifelse(pred_prob_glm > 0.5, "Yes", "No"), levels = c("No", "Yes"))

      # Ensure test_data$diabetes has consistent levels
      test_data$diabetes <- factor(test_data$diabetes, levels = c("No", "Yes"))

      # Calculate confusion matrix for GLM
      cm_glm <- table(Predicted = pred_class_glm, Actual = test_data$diabetes)
      # Calculate metrics manually or using pROC for AUC
      # Need to handle cases where classes might be missing in a fold
      if (ncol(cm_glm) == 2 && nrow(cm_glm) == 2) {
          TP_glm <- cm_glm[2,2]
          TN_glm <- cm_glm[1,1]
          FP_glm <- cm_glm[2,1]
          FN_glm <- cm_glm[1,2]

          sensitivity_glm <- TP_glm / (TP_glm + FN_glm)
          specificity_glm <- TN_glm / (TN_glm + FP_glm)

          # Calculate ROC AUC (requires at least two distinct predictions and two classes in test set)
          if (length(unique(test_data$diabetes)) > 1 && length(unique(pred_prob_glm)) > 1) {
              roc_obj_glm <- roc(response = test_data$diabetes, predictor = pred_prob_glm, levels = c("No", "Yes"), quiet = TRUE)
              auc_glm <- auc(roc_obj_glm)
          } else {
              auc_glm <- NA # Cannot compute AUC if only one class or constant predictions
          }
      } else {
          sensitivity_glm <- NA
          specificity_glm <- NA
          auc_glm <- NA
      }

      metrics_glm <- rbind(metrics_glm, data.frame(ROC = auc_glm, Sensitivity = sensitivity_glm, Specificity = specificity_glm))

    }, error = function(e) {
      cat("  GLM Error in fold ", i, ": ", e$message, "\n")
      # Note: <<- is needed here; a plain <- inside the handler function would
      # only modify a local copy and the NA row would be silently lost.
      metrics_glm <<- rbind(metrics_glm, data.frame(ROC = NA, Sensitivity = NA, Specificity = NA))
    })

    # --- Fit LDA ---
    tryCatch({
      # Ensure there are at least two distinct classes in the training data for LDA
      if (length(unique(train_data$diabetes)) > 1) {
        model_lda_fold <- lda(diabetes ~ ., data = train_data)
        pred_lda <- predict(model_lda_fold, newdata = test_data)
        pred_prob_lda_yes <- pred_lda$posterior[, "Yes"] # Probability of 'Yes' class
        pred_class_lda <- pred_lda$class

        cm_lda <- table(Predicted = pred_class_lda, Actual = test_data$diabetes)
        if (ncol(cm_lda) == 2 && nrow(cm_lda) == 2) {
            TP_lda <- cm_lda[2,2]
            TN_lda <- cm_lda[1,1]
            FP_lda <- cm_lda[2,1]
            FN_lda <- cm_lda[1,2]

            sensitivity_lda <- TP_lda / (TP_lda + FN_lda)
            specificity_lda <- TN_lda / (TN_lda + FP_lda)

            if (length(unique(test_data$diabetes)) > 1 && length(unique(pred_prob_lda_yes)) > 1) {
                roc_obj_lda <- roc(response = test_data$diabetes, predictor = pred_prob_lda_yes, levels = c("No", "Yes"), quiet = TRUE)
                auc_lda <- auc(roc_obj_lda)
            } else {
                auc_lda <- NA
            }
        } else {
            sensitivity_lda <- NA
            specificity_lda <- NA
            auc_lda <- NA
        }
      } else {
          sensitivity_lda <- NA
          specificity_lda <- NA
          auc_lda <- NA
          cat("  LDA skipped in fold ", i, ": Only one class present in training data.\n")
      }
      metrics_lda <- rbind(metrics_lda, data.frame(ROC = auc_lda, Sensitivity = sensitivity_lda, Specificity = specificity_lda))

    }, error = function(e) {
      cat("  LDA Error in fold ", i, ": ", e$message, "\n")
      # <<- so the assignment reaches metrics_lda outside this handler function
      metrics_lda <<- rbind(metrics_lda, data.frame(ROC = NA, Sensitivity = NA, Specificity = NA))
    })

    # --- Fit QDA ---
    tryCatch({
      # QDA often requires more data per class, can fail if not enough observations
      if (length(unique(train_data$diabetes)) > 1 &&
          sum(train_data$diabetes == "No") > ncol(train_data) &&
          sum(train_data$diabetes == "Yes") > ncol(train_data)) { # Rule of thumb for QDA
        model_qda_fold <- qda(diabetes ~ ., data = train_data)
        pred_qda <- predict(model_qda_fold, newdata = test_data)
        pred_prob_qda_yes <- pred_qda$posterior[, "Yes"]
        pred_class_qda <- pred_qda$class

        cm_qda <- table(Predicted = pred_class_qda, Actual = test_data$diabetes)
        if (ncol(cm_qda) == 2 && nrow(cm_qda) == 2) {
            TP_qda <- cm_qda[2,2]
            TN_qda <- cm_qda[1,1]
            FP_qda <- cm_qda[2,1]
            FN_qda <- cm_qda[1,2]

            sensitivity_qda <- TP_qda / (TP_qda + FN_qda)
            specificity_qda <- TN_qda / (TN_qda + FP_qda)

            if (length(unique(test_data$diabetes)) > 1 && length(unique(pred_prob_qda_yes)) > 1) {
                roc_obj_qda <- roc(response = test_data$diabetes, predictor = pred_prob_qda_yes, levels = c("No", "Yes"), quiet = TRUE)
                auc_qda <- auc(roc_obj_qda)
            } else {
                auc_qda <- NA
            }
        } else {
            sensitivity_qda <- NA
            specificity_qda <- NA
            auc_qda <- NA
        }
      } else {
          sensitivity_qda <- NA
          specificity_qda <- NA
          auc_qda <- NA
          cat("  QDA skipped in fold ", i, ": Not enough observations per class in training data.\n")
      }
      metrics_qda <- rbind(metrics_qda, data.frame(ROC = auc_qda, Sensitivity = sensitivity_qda, Specificity = specificity_qda))

    }, error = function(e) {
      cat("  QDA Error in fold ", i, ": ", e$message, "\n")
      # <<- so the assignment reaches metrics_qda outside this handler function
      metrics_qda <<- rbind(metrics_qda, data.frame(ROC = NA, Sensitivity = NA, Specificity = NA))
    })
  } # End of inner fold loop
} # End of outer repeat loop

--- Cross-Validation Repeat 1 ---
  Fold 1...
  Fold 2...
  Fold 3...
  Fold 4...
  Fold 5...
  Fold 6...
  Fold 7...
  Fold 8...
  Fold 9...
  Fold 10...

--- Cross-Validation Repeat 2 ---
  Fold 1...
  Fold 2...
  Fold 3...
  Fold 4...
  Fold 5...
  Fold 6...
  Fold 7...
  Fold 8...
  Fold 9...
  Fold 10...

--- Cross-Validation Repeat 3 ---
  Fold 1...
  Fold 2...
  Fold 3...
  Fold 4...
  Fold 5...
  Fold 6...
  Fold 7...
  Fold 8...
  Fold 9...
  Fold 10...

Summarizing and Comparing Results

After running all the folds and repeats, we’ll have a collection of performance metrics. Let’s summarize them and compare our models.

# Summarize the cross-validation results
cat("\n--- Summary of Logistic Regression Performance ---\n")

--- Summary of Logistic Regression Performance ---
summary(metrics_glm)
      ROC          Sensitivity      Specificity    
 Min.   :0.6759   Min.   :0.1429   Min.   :0.7727  
 1st Qu.:0.7695   1st Qu.:0.5000   1st Qu.:0.8519  
 Median :0.8514   Median :0.5774   Median :0.8889  
 Mean   :0.8379   Mean   :0.5534   Mean   :0.8877  
 3rd Qu.:0.8992   3rd Qu.:0.6774   3rd Qu.:0.9223  
 Max.   :0.9423   Max.   :0.8000   Max.   :0.9655  
cat("Average ROC: ", mean(metrics_glm$ROC, na.rm = TRUE), "\n")
Average ROC:  0.8378616 
cat("Average Sensitivity: ", mean(metrics_glm$Sensitivity, na.rm = TRUE), "\n")
Average Sensitivity:  0.5534245 
cat("Average Specificity: ", mean(metrics_glm$Specificity, na.rm = TRUE), "\n")
Average Specificity:  0.8877038 
cat("\n--- Summary of LDA Performance ---\n")

--- Summary of LDA Performance ---
summary(metrics_lda)
      ROC          Sensitivity      Specificity    
 Min.   :0.6728   Min.   :0.1429   Min.   :0.7619  
 1st Qu.:0.7708   1st Qu.:0.4659   1st Qu.:0.8519  
 Median :0.8485   Median :0.5833   Median :0.8889  
 Mean   :0.8374   Mean   :0.5521   Mean   :0.8831  
 3rd Qu.:0.8942   3rd Qu.:0.6618   3rd Qu.:0.9200  
 Max.   :0.9444   Max.   :0.8000   Max.   :0.9643  
cat("Average ROC: ", mean(metrics_lda$ROC, na.rm = TRUE), "\n")
Average ROC:  0.8374091 
cat("Average Sensitivity: ", mean(metrics_lda$Sensitivity, na.rm = TRUE), "\n")
Average Sensitivity:  0.5520997 
cat("Average Specificity: ", mean(metrics_lda$Specificity, na.rm = TRUE), "\n")
Average Specificity:  0.8831438 
cat("\n--- Summary of QDA Performance ---\n")

--- Summary of QDA Performance ---
summary(metrics_qda)
      ROC          Sensitivity      Specificity    
 Min.   :0.6667   Min.   :0.1250   Min.   :0.7600  
 1st Qu.:0.7771   1st Qu.:0.5000   1st Qu.:0.8180  
 Median :0.8240   Median :0.6339   Median :0.8596  
 Mean   :0.8235   Mean   :0.5729   Mean   :0.8543  
 3rd Qu.:0.8819   3rd Qu.:0.7044   3rd Qu.:0.8878  
 Max.   :0.9517   Max.   :0.8333   Max.   :0.9655  
cat("Average ROC: ", mean(metrics_qda$ROC, na.rm = TRUE), "\n")
Average ROC:  0.8234612 
cat("Average Sensitivity: ", mean(metrics_qda$Sensitivity, na.rm = TRUE), "\n")
Average Sensitivity:  0.5729406 
cat("Average Specificity: ", mean(metrics_qda$Specificity, na.rm = TRUE), "\n")
Average Specificity:  0.8543147 
# Combine results for easier comparison and visualization
all_metrics <- rbind(
  data.frame(Model = "Logistic", metrics_glm),
  data.frame(Model = "LDA", metrics_lda),
  data.frame(Model = "QDA", metrics_qda)
)

# Remove rows with NA values (from failed folds) for plotting
all_metrics_clean <- na.omit(all_metrics)

# --- Visualization (requires ggplot2, not loaded by default here, but good for discussion) ---
# install.packages("ggplot2")
library(ggplot2)

# Boxplot of ROC values
ggplot(all_metrics_clean, aes(x = Model, y = ROC, fill = Model)) +
  geom_boxplot() +
  labs(title = "Model Performance Comparison (ROC)",
       y = "ROC AUC", x = "Model") +
  theme_minimal()


# Boxplot of Sensitivity values
ggplot(all_metrics_clean, aes(x = Model, y = Sensitivity, fill = Model)) +
  geom_boxplot() +
  labs(title = "Model Performance Comparison (Sensitivity)",
       y = "Sensitivity", x = "Model") +
  theme_minimal()


# Boxplot of Specificity values
ggplot(all_metrics_clean, aes(x = Model, y = Specificity, fill = Model)) +
  geom_boxplot() +
  labs(title = "Model Performance Comparison (Specificity)",
       y = "Specificity", x = "Model") +
  theme_minimal()

Here’s a summary of the average performance metrics for each model. (Note: this manual loop treats “Yes” as the positive class, so Sensitivity and Specificity are swapped relative to caret’s Sens/Spec output earlier, where “No” was the event class.)

Model                 Average ROC   Average Sensitivity   Average Specificity
Logistic Regression   0.8379        0.5534                0.8877
LDA                   0.8374        0.5521                0.8831
QDA                   0.8235        0.5729                0.8543

Overall Performance (ROC AUC):

  • Logistic Regression (0.8379) and LDA (0.8374) have almost identical average ROC AUCs. The difference is negligible.
  • QDA (0.8235) has a slightly lower average ROC AUC compared to Logistic Regression and LDA.
  • Interpretation: Based purely on overall discriminative power across all possible thresholds, Logistic Regression and LDA are performing equally well and slightly better than QDA. An AUC above 0.8 is generally considered good, indicating that the models have strong discriminatory power.

Sensitivity (True Positive Rate):

  • QDA (0.5729) has the highest average Sensitivity.
  • Logistic Regression (0.5534) and LDA (0.5521) have very similar and slightly lower average Sensitivities compared to QDA.
  • Interpretation: QDA is slightly better at correctly identifying actual positive cases (e.g., people with diabetes). This means it has fewer false negatives.

Specificity (True Negative Rate):

  • Logistic Regression (0.8877) has the highest average Specificity.
  • LDA (0.8831) is very close to Logistic Regression.
  • QDA (0.8543) has a lower average Specificity compared to Logistic Regression and LDA.
  • Logistic Regression is best at correctly identifying actual negative cases (e.g., people without diabetes). This means it has fewer false positives.

Which one is the “best”?

The “best” model depends on the specific goals and costs associated with False Positives and False Negatives in your application.

  • If overall discriminative power (considering all thresholds) is paramount, and you don’t have a strong preference for minimizing one type of error over another:
    • Logistic Regression or LDA are the top contenders, as their average ROC AUCs are virtually identical and the highest. There’s not enough difference here to declare one definitively better than the other based solely on this metric.
  • If minimizing False Negatives (catching as many positive cases as possible) is more critical (i.e., high Sensitivity is preferred):
    • QDA might be considered slightly better due to its marginally higher average Sensitivity (0.5729). However, keep in mind that this comes at the cost of lower Specificity.
  • If minimizing False Positives (correctly identifying negative cases and avoiding false alarms) is more critical (i.e., high Specificity is preferred):
    • Logistic Regression (or very closely, LDA) is clearly the winner. It has the highest average Specificity, meaning it’s less likely to incorrectly classify a negative case as positive.

Based on these results:

  • For overall balanced performance (highest ROC AUC) and strongest control over False Positives (highest Specificity), Logistic Regression is marginally the “best,” closely followed by LDA.
  • If you prioritize slightly higher Sensitivity (catching more true positives), QDA offers a small advantage, but at the cost of lower Specificity and a slightly lower overall ROC.

Logistic Regression appears to be the most robust choice among these three models for this dataset. The differences between Logistic Regression and LDA are so small that they could be considered effectively equivalent. You’d likely choose between them based on ease of implementation, interpretability, or minor theoretical preferences.

How to interpret ROC, Sensitivity, and Specificity

ROC, Sensitivity, and Specificity are three metrics for evaluating the performance of classification models, especially in situations where you care about different types of errors. To understand these, first understand the four basic outcomes in a binary classification problem, often represented in a confusion matrix:

                     Actual Positive       Actual Negative
Predicted Positive   True Positive (TP)    False Positive (FP)
Predicted Negative   False Negative (FN)   True Negative (TN)

  • True Positive (TP): The model correctly predicted a positive outcome (e.g., predicted “has diabetes,” and the person actually has diabetes).
  • False Positive (FP): The model incorrectly predicted a positive outcome (e.g., predicted “has diabetes,” but the person actually does not have diabetes). This is also known as a Type I error or a “false alarm.”
  • False Negative (FN): The model incorrectly predicted a negative outcome (e.g., predicted “does not have diabetes,” but the person actually has diabetes). This is also known as a Type II error or a “miss.”
  • True Negative (TN): The model correctly predicted a negative outcome (e.g., predicted “does not have diabetes,” and the person actually does not have diabetes).

Sensitivity (Recall, True Positive Rate - TPR)

Sensitivity measures the proportion of actual positive cases that were correctly identified by the model. It answers the question: “Out of all the people who truly have the condition, how many did our model correctly find?”

\(Sensitivity = \frac{TP}{TP+FN}\)

  • High Sensitivity (closer to 1.0): Means the model is very good at catching positive cases. It has a low rate of False Negatives (missing actual positives).
  • Low Sensitivity (closer to 0.0): Means the model misses many actual positive cases, leading to a high rate of False Negatives.

Sensitivity is crucial when the cost of a False Negative is high.

  • Medical Diagnosis (e.g., detecting a serious disease like cancer): You want very high sensitivity because missing a disease (false negative) can have severe consequences for the patient. You’d rather have some false alarms (false positives) that can be further investigated.
  • Security Systems: You want to detect all intruders (high sensitivity), even if it means some false alarms.
  • Fraud Detection: You want to catch as many fraudulent transactions as possible.
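A quick worked example from the logistic model's confusion matrix above (treating "Yes" as the positive class): TP = 74 and FN = 56, so \(Sensitivity = \frac{74}{74 + 56} \approx 0.569\), i.e. the model catches about 57% of the actual diabetes cases.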

SnNOut (Sensitive test, Negative result rules OUT)

If a highly sensitive test comes back negative, you can confidently rule out the disease/condition.

Specificity (True Negative Rate - TNR)

Specificity measures the proportion of actual negative cases that were correctly identified by the model. It answers the question: “Out of all the people who truly do NOT have the condition, how many did our model correctly identify as not having it?”

\(Specificity = \frac{TN}{TN+FP}\)

  • High Specificity (closer to 1.0): Means the model is very good at correctly identifying negative cases. It has a low rate of False Positives (incorrectly classifying negatives as positives).
  • Low Specificity (closer to 0.0): Means the model incorrectly flags many negative cases as positive, leading to a high rate of False Positives.

Specificity is crucial when the cost of a False Positive is high.

  • Spam Detection: You want high specificity to avoid marking legitimate emails as spam (false positive), which would be very annoying to users. You might tolerate a few spam emails getting through (false negatives).
  • Legal/Criminal Justice: Incorrectly identifying an innocent person as guilty (false positive) has high societal costs.
  • Expensive or Invasive Follow-up Tests: If a positive test leads to a costly, painful, or risky follow-up procedure, you want high specificity to minimize unnecessary procedures.
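From the same confusion matrix (again treating "Yes" as the positive class): TN = 233 and FP = 29, so \(Specificity = \frac{233}{233 + 29} \approx 0.889\), i.e. about 89% of the non-diabetic cases are correctly identified.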

SpPIn (Specific test, Positive result rules IN)

If a highly specific test comes back positive, you can confidently rule in the disease/condition.

Receiver Operating Characteristic (ROC) Curve

The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It plots the True Positive Rate (Sensitivity) on the y-axis against the False Positive Rate (1 - Specificity) on the x-axis for different possible thresholds.

  • Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular classification threshold.

    • Top-left corner (0,1): This represents a perfect classifier: 100% Sensitivity (no false negatives) and 100% Specificity (no false positives). The closer your curve is to this corner, the better the model’s overall performance.
    • Diagonal line (from 0,0 to 1,1): This represents a random classifier (equivalent to flipping a coin). A model whose ROC curve falls along this diagonal is performing no better than chance.
    • Curve below the diagonal: Indicates a model that performs worse than random chance. If this happens, you can often just invert the model’s predictions to get a better-than-chance model!
  • The AUC is a single scalar value that summarizes the overall performance of the classifier across all possible thresholds. It ranges from 0 to 1.

    • AUC = 1: Perfect classification.
    • AUC = 0.5: Performance is no better than random guessing.
    • AUC < 0.5: Performance is worse than random guessing.
    • A higher AUC indicates a better model. It can be interpreted as the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance.
# Install necessary packages (if not already installed)
# install.packages("mlbench")
# install.packages("MASS")
# install.packages("dplyr")
# install.packages("ggplot2") # For plotting
# install.packages("pROC")    # For ROC curves

# Load packages
library(mlbench)
library(MASS)
library(dplyr)
library(ggplot2)
library(pROC) # Load pROC for ROC curves and AUC

# Load the dataset
data(PimaIndiansDiabetes2)
my_data <- PimaIndiansDiabetes2

# Data Preparation (using dplyr)
my_data <- my_data %>%
  na.omit() %>%
  mutate(diabetes = factor(diabetes, levels = c("neg", "pos"), labels = c("No", "Yes")))

# --- 1. Fit each model on the *entire* dataset ---
# This is a final model, not part of the cross-validation loop.
# It's used for plotting the "overall" ROC curve for the full dataset.

cat("\n--- Fitting Final Models on Full Dataset for ROC Curve ---\n")

--- Fitting Final Models on Full Dataset for ROC Curve ---
# Logistic Regression
final_model_glm <- glm(diabetes ~ ., data = my_data, family = "binomial")

# LDA
final_model_lda <- lda(diabetes ~ ., data = my_data)

# QDA
# Add a check to ensure QDA can be fitted (sufficient data per class relative to predictors)
# This is similar to the check in the CV loop, but applied to the full dataset.
# The `ncol(my_data) - 1` accounts for all predictors
if (sum(my_data$diabetes == "No") >= (ncol(my_data) - 1) &&
    sum(my_data$diabetes == "Yes") >= (ncol(my_data) - 1)) {
  final_model_qda <- qda(diabetes ~ ., data = my_data)
} else {
  warning("QDA could not be fitted on the full dataset due to insufficient observations per class.")
  final_model_qda <- NULL # Set to NULL if it cannot be fitted
}


# --- 2. Get predicted probabilities for the 'Yes' class from each model ---

# Logistic Regression probabilities
# type = "response" gives probabilities directly for glm
prob_glm <- predict(final_model_glm, newdata = my_data, type = "response")

# LDA probabilities
# predict() for lda/qda returns a list; $posterior contains probabilities
prob_lda <- predict(final_model_lda, newdata = my_data)$posterior[, "Yes"]

# QDA probabilities (check if QDA model was successfully fitted)
prob_qda <- if (!is.null(final_model_qda)) {
  predict(final_model_qda, newdata = my_data)$posterior[, "Yes"]
} else {
  NULL # If QDA wasn't fitted, no probabilities
}

# --- 3. Create ROC objects for each model ---

# For Logistic Regression
roc_obj_glm <- roc(response = my_data$diabetes, predictor = prob_glm,
                   levels = c("No", "Yes")) # Ensure levels are in order: negative first
Setting direction: controls < cases
# For LDA
roc_obj_lda <- roc(response = my_data$diabetes, predictor = prob_lda,
                   levels = c("No", "Yes"))
Setting direction: controls < cases
# For QDA (only if the model was fitted)
if (!is.null(prob_qda)) {
  roc_obj_qda <- roc(response = my_data$diabetes, predictor = prob_qda,
                     levels = c("No", "Yes"))
} else {
  roc_obj_qda <- NULL
}
Setting direction: controls < cases
# --- 4. Plot all three ROC curves on the same graph ---

# Start with the first ROC curve (e.g., Logistic Regression)
plot(roc_obj_glm,
     col = "#3182bd", # Color for GLM
     lwd = 2,      # Line width
     main = "ROC Curves for Classification Models",
     sub = paste0("Logistic Regression AUC: ", round(auc(roc_obj_glm), 3),
                  "\nLDA AUC: ", round(auc(roc_obj_lda), 3),
                  if (!is.null(roc_obj_qda)) { paste0("\nQDA AUC: ", round(auc(roc_obj_qda), 3)) }
                  else { "" }
                  ),
     grid = TRUE # Add a grid for better readability
     )

# Add LDA curve to the existing plot
plot(roc_obj_lda,
     col = "#de2d26", # Color for LDA
     lwd = 2,
     add = TRUE # Add to the current plot
     )

# Add QDA curve to the existing plot (if it was fitted)
if (!is.null(roc_obj_qda)) {
  plot(roc_obj_qda,
       col = "#31a354", # Color for QDA
       lwd = 2,
       add = TRUE
       )
}

# Add a legend to distinguish the curves
legend("bottomright",
       legend = c(paste0("Logistic Regression (AUC = ", round(auc(roc_obj_glm), 3), ")"),
                  paste0("LDA (AUC = ", round(auc(roc_obj_lda), 3), ")"),
                  if (!is.null(roc_obj_qda)) { paste0("QDA (AUC = ", round(auc(roc_obj_qda), 3), ")") }
                  else { "QDA (Not Fitted)" }
                  ),
       col = c("#3182bd", "#de2d26", "#31a354"),
       lwd = 2,
       cex = 0.8 # Font size for legend
       )

# You can also add the diagonal line representing random chance
abline(a = 0, b = 1, lty = 2, col = "#bdbdbd") # Dashed gray line


# --- Optional: Print individual AUCs again for easy reference ---
cat("\n--- AUC Values for Final Models ---\n")

--- AUC Values for Final Models ---
cat("Logistic Regression AUC: ", auc(roc_obj_glm), "\n")
Logistic Regression AUC:  0.8623605 
cat("LDA AUC: ", auc(roc_obj_lda), "\n")
LDA AUC:  0.8605109 
if (!is.null(roc_obj_qda)) {
  cat("QDA AUC: ", auc(roc_obj_qda), "\n")
} else {
  cat("QDA AUC: Not applicable (model not fitted)\n")
}
QDA AUC:  0.8652965 

Why it’s important:

  • Unlike Accuracy, Sensitivity, and Specificity (which depend on a chosen threshold), the ROC curve and AUC provide a view of the model’s performance across all possible thresholds. This is especially useful for models that output probabilities (like logistic regression), as you can choose the optimal threshold later based on your specific needs.
  • AUC is generally a more reliable metric than accuracy when dealing with imbalanced datasets (where one class is much more prevalent than the other). A classifier can achieve high accuracy by simply predicting the majority class all the time, but its AUC would reveal its poor discriminatory power.
  • AUC is excellent for comparing different models because it summarizes their overall discriminative power in a single number. The model with a higher AUC is generally preferred.

Trade-offs between Sensitivity and Specificity (and how ROC shows it)

There’s often a trade-off between sensitivity and specificity.

  • If you set a very low threshold for classifying an outcome as “positive” (e.g., if any probability above 0.1 is considered positive), you’ll likely increase your sensitivity (catch more true positives) but also increase your false positive rate (decrease specificity).
  • Conversely, if you set a very high threshold (e.g., only probabilities above 0.9 are positive), you’ll increase your specificity (reduce false positives) but likely decrease your sensitivity (miss more true positives).
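A minimal sketch of this trade-off, reusing prob_glm (the full-data glm probabilities computed above; in-sample, for illustration only):

for (thr in c(0.1, 0.5, 0.9)) {
  pred <- factor(ifelse(prob_glm > thr, "Yes", "No"), levels = c("No", "Yes"))
  cm <- table(Predicted = pred, Actual = my_data$diabetes)
  sens <- cm["Yes", "Yes"] / sum(cm[, "Yes"]) # TP / (TP + FN)
  spec <- cm["No", "No"] / sum(cm[, "No"])    # TN / (TN + FP)
  cat(sprintf("Threshold %.1f: Sensitivity = %.3f, Specificity = %.3f\n", thr, sens, spec))
}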
---
title: "Logistic Regression, LDA, and QDA with Cross-Validation"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

This demo covers some fundamental classification techniques: [Logistic Regression](https://stats.oarc.ucla.edu/r/dae/logit-regression/), [Linear Discriminant Analysis (LDA)](https://www.sthda.com/english/articles/36-classification-methods-essentials/146-discriminant-analysis-essentials-in-r/), and [Quadratic Discriminant Analysis (QDA)](https://www.sthda.com/english/articles/36-classification-methods-essentials/146-discriminant-analysis-essentials-in-r/). We'll also focus on a crucial aspect of model building: robust evaluation using [cross-validation](https://www.geeksforgeeks.org/r-language/cross-validation-in-r-programming/). 

## Learning Objectives:  
 - Understand the theoretical underpinnings of Logistic Regression, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA).  
 - Implement these models in `R`.  
 - Apply *k*-fold cross-validation for robust model evaluation.  
 - Compare the performance of different classification models.  

## What are Classification Models?

 - __Goal:__ Predict a categorical outcome variable (e.g., spam/not spam, disease/no disease, customer churn/no churn).  
 - __Supervised Learning:__ We have labeled data (features and known outcomes).  

## Brief Theoretical Overview 

__1. Logistic Regression:__   

 - What's the core idea? (Predicts the probability of belonging to a class using a logistic function).  
 - What type of decision boundary does it create? (Linear).  
 - Assumptions? (No strong distributional assumptions on predictors, but linearity in the log-odds is assumed).  

__2. Linear Discriminant Analysis (LDA):__

 - What's the core idea? (Assumes predictors within each class follow a multivariate Gaussian distribution with equal covariance matrices).  
 - How does it classify? (Finds linear combinations of predictors that best separate the classes).  
 - Decision boundary? (Linear).  
 - When might it be preferred over logistic regression? (When classes are well-separated, or when the normality assumption holds).  

__3. Quadratic Discriminant Analysis (QDA):__  

 - How does it differ from LDA? (Assumes predictors within each class follow a multivariate Gaussian distribution but with unequal covariance matrices).  
 - Decision boundary? (Quadratic).  
 - When might it be preferred over LDA? (When the assumption of equal covariance matrices is violated, and the true decision boundary is non-linear).  

## Why Cross-Validation?
 - Why don't we just split our data once into training and testing sets?" (Discuss limitations: sensitivity to split, less robust performance estimate).  
 - __Cross-validation:__ A technique to get a more reliable estimate of a model's performance on unseen data by repeatedly splitting the data into training and validation sets.  
 - __*k*-Fold Cross-Validation:__ Explain the process – divide data into *k* folds, train on *k*-1 folds, validate on the remaining fold, repeat *k* times.  
 - __Benefits:__ More robust performance estimate, uses all data for both training and validation.   

# R Demo

## Data Prep  
We'll use the [PimaIndiansDiabetes2](https://rdrr.io/cran/mlbench/man/PimaIndiansDiabetes.html) dataset from the `mlbench` package.

```{r}
# Install necessary packages (if not already installed)
# install.packages("mlbench")
# install.packages("caret") # For cross-validation and convenient model training
# install.packages("MASS")  # For LDA and QDA

# Load packages
library(mlbench)   # For PimaIndiansDiabetes2 dataset
library(caret)     # For cross-validation and model training utilities
library(MASS)      # For lda() and qda() functions

# Load the dataset
data(PimaIndiansDiabetes2)
my_data <- PimaIndiansDiabetes2

# Explore the data
head(my_data)
summary(my_data)
str(my_data)

# Check for missing values
colSums(is.na(my_data))

# Handle missing values: For simplicity in this demo, we'll remove rows with NA.
# In a real-world scenario, imputation would be a better approach for missing data.
my_data <- na.omit(my_data)
colSums(is.na(my_data)) # Verify no more NAs

# Convert 'diabetes' to a factor with desired levels if not already
# The 'outcome' variable needs to be a factor for classification models in R.
# Let's ensure the levels are meaningful.
my_data$diabetes <- factor(my_data$diabetes, levels = c("neg", "pos"), labels = c("No", "Yes"))

# Ensure other variables are numeric where appropriate
# (PimaIndiansDiabetes2 is generally clean, but good practice to check)
```

 - What did we just do with `na.omit()`? What are the implications?  
 - Why is it important for our outcome variable (`diabetes`) to be a factor?  
 - What are some other strategies for handling missing data besides `na.omit()`? (Imputation, etc.)  

## Model Fitting and Evaluation with Cross-Validation 
Now that our data is ready, let's fit our three models and evaluate them using k-fold cross-validation. We'll use the caret package, which simplifies the cross-validation process significantly.

### Setting up Cross-Validations Control  
```{r}
# Set a random seed for reproducibility
set.seed(123)

# Define cross-validation control
# We'll use 10-fold cross-validation
# 'repeats = 3' means the entire 10-fold CV process is repeated 3 times.
# This gives an even more stable estimate of performance.
train_control <- trainControl(
  method = "repeatedcv",
  number = 10,       # k = 10 folds
  repeats = 3,       # Repeat the 10-fold CV 3 times
  classProbs = TRUE, # Needed for ROC curve later if desired
  summaryFunction = twoClassSummary # For classification, provides AUC, sensitivity, specificity
)
```

 - Why `set.seed()`?  
 - What's the difference between `number` and `repeats` in `trainControl`?  
 - What other method options are available in `trainControl`? (e.g., `boot`, `loocv`)    
 
### Logistic Regression (glm)
```{r}
# Fit Logistic Regression model using caret's train function
# The 'method = "glm"' specifies a generalized linear model (logistic regression in this case)
# The 'family = "binomial"' is implicitly handled by caret for factor outcomes.
# 'metric = "ROC"' tells caret to optimize/report based on Area Under the ROC Curve.
# This is a good metric for imbalanced classes.

cat("\n--- Fitting Logistic Regression Model ---\n")
model_glm <- train(
  diabetes ~ .,         # Predict 'diabetes' using all other variables
  data = my_data,       # Our pre-processed dataset
  method = "glm",       # Specify Logistic Regression
  family = "binomial",  # For binary classification
  trControl = train_control, # Our defined CV control
  metric = "ROC"        # Evaluate using ROC AUC
)

# Print the model results
print(model_glm)

# Access individual folds' performance (optional, for deeper insight)
# model_glm$resample
```
### Linear Discriminant Analysis (LDA)
```{r}
# Fit LDA model using caret's train function
# 'method = "lda"' for Linear Discriminant Analysis

cat("\n--- Fitting LDA Model ---\n")
model_lda <- train(
  diabetes ~ .,
  data = my_data,
  method = "lda",
  trControl = train_control,
  metric = "ROC"
)

# Print the model results
print(model_lda)
```
### Quadratic Discriminant (QDA)
```{r}
# Fit QDA model using caret's train function
# 'method = "qda"' for Quadratic Discriminant Analysis

cat("\n--- Fitting QDA Model ---\n")
model_qda <- train(
  diabetes ~ .,
  data = my_data,
  method = "qda",
  trControl = train_control,
  metric = "ROC"
)

# Print the model results
print(model_qda)
```

### Comparing Models
```{r}
# Compare the models using resamples
# This provides a statistical comparison of performance metrics across models.
resample_results <- resamples(list(
  Logistic = model_glm,
  LDA = model_lda,
  QDA = model_qda
))

# Summarize the comparison
summary(resample_results)

# Visualize the comparison (e.g., box plots of ROC, Sensitivity, Specificity)
scales <- list(x = list(relation = "free"), y = list(relation = "free"))
bwplot(resample_results, scales = scales)

# You can also use a dotplot for a different visualization
dotplot(resample_results, scales = scales)
```

 - Looking at the output of `print(model_glm)`, what do the 'ROC', 'Sens', and 'Spec' values represent? (Area Under the ROC Curve, Sensitivity/Recall, Specificity).  
 - How do the ROC values compare across the three models? Which one performs best on average according to this metric?  
 - What does the `bwplot` or `dotplot` tell us about the variability of the model's performance across the folds?  
 - Based on these results, which model would you recommend for this dataset and why? (Encourage discussion on trade-offs, interpretability, and assumptions).  
 - If a model has a much higher ROC but lower Sens compared to another, what might that imply about its performance on different types of errors (false positives vs. false negatives)?  

## Predictions and Interpretation  
`caret` handles much of the complexity, it's important to understand how to make predictions with your trained models and interpret their coefficients.  

### Making Predictions
```{r}
# Let's use the best performing model (or pick one for demonstration)
# For example, let's use the Logistic Regression model for prediction.
# We'll predict on the original training data for simplicity,
# but in a real scenario, you'd use a completely unseen test set.

# Predict class labels
predictions_glm <- predict(model_glm, newdata = my_data)
head(predictions_glm)

# Predict class probabilities
probabilities_glm <- predict(model_glm, newdata = my_data, type = "prob")
head(probabilities_glm)

# Create a confusion matrix for the logistic regression model on training data
# This is NOT cross-validated performance, but good for understanding.
confusionMatrix(predictions_glm, my_data$diabetes)
```

### Interpreting Coefficients (Logistic Regression)
```{r}
# For Logistic Regression, we can inspect the coefficients
# Note: For LDA/QDA, interpretation is more about discriminant functions,
# not direct coefficient interpretation like GLM.

# Access the fitted GLM model object from the caret train object
glm_object <- model_glm$finalModel
summary(glm_object)

# Interpret coefficients (e.g., odds ratios)
# For a coefficient 'b', the odds ratio is exp(b)
# Example: For 'glucose', if the coefficient is 'x', then for every unit increase in glucose,
# the odds of having diabetes are multiplied by exp(x), assuming other predictors are constant.
exp(coef(glm_object))
```

 - What's the difference between `type = "raw"` (default predict output) and `type = "prob"`?   
 - Why is interpreting coefficients more straightforward for Logistic Regression than for LDA/QDA?  
 - What does an odds ratio less than 1 mean? Greater than 1?  
 
## Recap and Advanced Topics  
 - Logistic Regression: Linear decision boundary, predicts probabilities.  
 - LDA: Linear decision boundary, assumes equal covariance matrices.  
 - QDA: Quadratic decision boundary, assumes unequal covariance matrices.  
 - Cross-validation: Essential for reliable model performance estimation.  

### Further Exploration 
 - __Hyperparameter Tuning:__ For more complex models (like Random Forests, SVMs), `caret` can also tune hyperparameters automatically during cross-validation. Explore the `tuneGrid` and `tuneLength` arguments in `train()` (see the sketch after this list). 
 - __Other Metrics:__ Investigate other evaluation metrics like Precision, Recall, and F1-score, especially for imbalanced datasets.  
 - __Imbalanced Data:__ The `PimaIndiansDiabetes2` dataset has some class imbalance. Research techniques for handling imbalanced data (e.g., SMOTE, undersampling/oversampling) and how they integrate with `caret`.  
 - __Feature Engineering/Selection:__ How would feature engineering or feature selection impact these models?  
 - __Visualization:__ Create ROC curves for each model to visualize their trade-offs between sensitivity and specificity.  
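
To make the hyperparameter-tuning bullet concrete, here is a hedged sketch using k-nearest neighbours (`method = "knn"`), which has a single tuning parameter `k`. It assumes the cleaned `my_data` and the `train_control` object (with `classProbs = TRUE` and `summaryFunction = twoClassSummary`) defined earlier; `tuneLength = 5` asks `train()` to evaluate five candidate values of `k` under the same cross-validation scheme.

```{r}
# Sketch only: automatic tuning of k for a k-NN classifier with caret.
set.seed(123)
model_knn <- train(
  diabetes ~ .,
  data = my_data,
  method = "knn",
  preProcess = c("center", "scale"),  # distance-based models need comparable scales
  trControl = train_control,
  tuneLength = 5,                     # evaluate 5 candidate values of k
  metric = "ROC"
)

print(model_knn)    # ROC/Sens/Spec for each candidate k
model_knn$bestTune  # the k that maximized cross-validated ROC
```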
 
# Appendix 

## ROC Curve Plot (requires `pROC` package or manual calculation)

```{r}
# install.packages("pROC")
library(pROC)

# Get predictions from one of the models (e.g., Logistic Regression)
# Note: This is on the training data, for illustration. For true ROC, use a held-out test set.
prob_glm <- predict(model_glm, newdata = my_data, type = "prob")

# Create ROC object
roc_glm <- roc(response = my_data$diabetes, predictor = prob_glm$Yes, levels = c("No", "Yes"))

# Plot ROC curve
plot(roc_glm, main = "ROC Curve for Logistic Regression", col = "blue", lwd = 2)
auc(roc_glm) # Area Under Curve

# You could loop this for all models and plot on the same graph for comparison
```


## Without `caret`  
It's certainly possible to perform cross-validation and fit these models without `caret`. While `caret` simplifies the process significantly, doing it manually provides a deeper understanding of the underlying mechanics.

Here's the same demo without using the `caret` package. This version focuses more on manual loops for cross-validation and direct calls to `glm`, `lda`, and `qda`.

## Manual K-Fold Cross-Validation Setup
```{r}
# Set a random seed for reproducibility
set.seed(123)

# Define number of folds (k)
k_folds <- 10
num_repeats <- 3 # We'll repeat the entire CV process multiple times for more stable results

# Store performance metrics (ROC, Sensitivity, Specificity) for each fold/repeat
metrics_glm <- data.frame(ROC = numeric(), Sensitivity = numeric(), Specificity = numeric())
metrics_lda <- data.frame(ROC = numeric(), Sensitivity = numeric(), Specificity = numeric())
metrics_qda <- data.frame(ROC = numeric(), Sensitivity = numeric(), Specificity = numeric())

# Get the number of observations
n_obs <- nrow(my_data)

# Outer loop for repeats
for (rep in 1:num_repeats) {
  cat(paste0("\n--- Cross-Validation Repeat ", rep, " ---\n"))

  # Randomly permute the data to ensure folds are diverse
  shuffled_data <- my_data[sample(n_obs), ]

  # Create folds
  # This creates a vector indicating which fold each row belongs to
  folds <- cut(seq(1, n_obs), breaks = k_folds, labels = FALSE)

  # Inner loop for folds
  for (i in 1:k_folds) {
    cat(paste0("  Fold ", i, "...\n"))

    # Define training and testing sets for this fold
    test_indices <- which(folds == i)
    train_data <- shuffled_data[-test_indices, ]
    test_data <- shuffled_data[test_indices, ]

    # --- Fit Logistic Regression ---
    tryCatch({ # Use tryCatch to handle potential errors (e.g., perfect separation)
      model_glm_fold <- glm(diabetes ~ ., data = train_data, family = "binomial")
      pred_prob_glm <- predict(model_glm_fold, newdata = test_data, type = "response")
      pred_class_glm <- factor(ifelse(pred_prob_glm > 0.5, "Yes", "No"), levels = c("No", "Yes"))

      # Ensure test_data$diabetes has consistent levels
      test_data$diabetes <- factor(test_data$diabetes, levels = c("No", "Yes"))

      # Calculate confusion matrix for GLM
      cm_glm <- table(Predicted = pred_class_glm, Actual = test_data$diabetes)
      # Calculate metrics manually or using pROC for AUC
      # Need to handle cases where classes might be missing in a fold
      if (ncol(cm_glm) == 2 && nrow(cm_glm) == 2) {
          TP_glm <- cm_glm[2,2]
          TN_glm <- cm_glm[1,1]
          FP_glm <- cm_glm[2,1]
          FN_glm <- cm_glm[1,2]

          sensitivity_glm <- TP_glm / (TP_glm + FN_glm)
          specificity_glm <- TN_glm / (TN_glm + FP_glm)

          # Calculate ROC AUC (requires at least two distinct predictions and two classes in test set)
          if (length(unique(test_data$diabetes)) > 1 && length(unique(pred_prob_glm)) > 1) {
              roc_obj_glm <- roc(response = test_data$diabetes, predictor = pred_prob_glm, levels = c("No", "Yes"), quiet = TRUE)
              auc_glm <- auc(roc_obj_glm)
          } else {
              auc_glm <- NA # Cannot compute AUC if only one class or constant predictions
          }
      } else {
          sensitivity_glm <- NA
          specificity_glm <- NA
          auc_glm <- NA
      }

      metrics_glm <- rbind(metrics_glm, data.frame(ROC = auc_glm, Sensitivity = sensitivity_glm, Specificity = specificity_glm))

    }, error = function(e) {
      cat("  GLM Error in fold ", i, ": ", e$message, "\n")
      # Use <<- here: a plain <- inside this handler function would only
      # modify a local copy, and the NA row would silently be lost.
      metrics_glm <<- rbind(metrics_glm, data.frame(ROC = NA, Sensitivity = NA, Specificity = NA))
    })


    # --- Fit LDA ---
    tryCatch({
      # Ensure there are at least two distinct classes in the training data for LDA
      if (length(unique(train_data$diabetes)) > 1) {
        model_lda_fold <- lda(diabetes ~ ., data = train_data)
        pred_lda <- predict(model_lda_fold, newdata = test_data)
        pred_prob_lda_yes <- pred_lda$posterior[, "Yes"] # Probability of 'Yes' class
        pred_class_lda <- pred_lda$class

        cm_lda <- table(Predicted = pred_class_lda, Actual = test_data$diabetes)
        if (ncol(cm_lda) == 2 && nrow(cm_lda) == 2) {
            TP_lda <- cm_lda[2,2]
            TN_lda <- cm_lda[1,1]
            FP_lda <- cm_lda[2,1]
            FN_lda <- cm_lda[1,2]

            sensitivity_lda <- TP_lda / (TP_lda + FN_lda)
            specificity_lda <- TN_lda / (TN_lda + FP_lda)

            if (length(unique(test_data$diabetes)) > 1 && length(unique(pred_prob_lda_yes)) > 1) {
                roc_obj_lda <- roc(response = test_data$diabetes, predictor = pred_prob_lda_yes, levels = c("No", "Yes"), quiet = TRUE)
                auc_lda <- auc(roc_obj_lda)
            } else {
                auc_lda <- NA
            }
        } else {
            sensitivity_lda <- NA
            specificity_lda <- NA
            auc_lda <- NA
        }
      } else {
          sensitivity_lda <- NA
          specificity_lda <- NA
          auc_lda <- NA
          cat("  LDA skipped in fold ", i, ": Only one class present in training data.\n")
      }
      metrics_lda <- rbind(metrics_lda, data.frame(ROC = auc_lda, Sensitivity = sensitivity_lda, Specificity = specificity_lda))

    }, error = function(e) {
      cat("  LDA Error in fold ", i, ": ", e$message, "\n")
      # <<- for the same scoping reason as in the GLM handler above
      metrics_lda <<- rbind(metrics_lda, data.frame(ROC = NA, Sensitivity = NA, Specificity = NA))
    })


    # --- Fit QDA ---
    tryCatch({
      # QDA often requires more data per class, can fail if not enough observations
      if (length(unique(train_data$diabetes)) > 1 &&
          sum(train_data$diabetes == "No") > ncol(train_data) &&
          sum(train_data$diabetes == "Yes") > ncol(train_data)) { # Rule of thumb for QDA
        model_qda_fold <- qda(diabetes ~ ., data = train_data)
        pred_qda <- predict(model_qda_fold, newdata = test_data)
        pred_prob_qda_yes <- pred_qda$posterior[, "Yes"]
        pred_class_qda <- pred_qda$class

        cm_qda <- table(Predicted = pred_class_qda, Actual = test_data$diabetes)
        if (ncol(cm_qda) == 2 && nrow(cm_qda) == 2) {
            TP_qda <- cm_qda[2,2]
            TN_qda <- cm_qda[1,1]
            FP_qda <- cm_qda[2,1]
            FN_qda <- cm_qda[1,2]

            sensitivity_qda <- TP_qda / (TP_qda + FN_qda)
            specificity_qda <- TN_qda / (TN_qda + FP_qda)

            if (length(unique(test_data$diabetes)) > 1 && length(unique(pred_prob_qda_yes)) > 1) {
                roc_obj_qda <- roc(response = test_data$diabetes, predictor = pred_prob_qda_yes, levels = c("No", "Yes"), quiet = TRUE)
                auc_qda <- auc(roc_obj_qda)
            } else {
                auc_qda <- NA
            }
        } else {
            sensitivity_qda <- NA
            specificity_qda <- NA
            auc_qda <- NA
        }
      } else {
          sensitivity_qda <- NA
          specificity_qda <- NA
          auc_qda <- NA
          cat("  QDA skipped in fold ", i, ": Not enough observations per class in training data.\n")
      }
      metrics_qda <- rbind(metrics_qda, data.frame(ROC = auc_qda, Sensitivity = sensitivity_qda, Specificity = specificity_qda))

    }, error = function(e) {
      cat("  QDA Error in fold ", i, ": ", e$message, "\n")
      # <<- for the same scoping reason as in the GLM handler above
      metrics_qda <<- rbind(metrics_qda, data.frame(ROC = NA, Sensitivity = NA, Specificity = NA))
    })
  } # End of inner fold loop
} # End of outer repeat loop
```

## Summarizing and Comparing Results  
After running all the folds and repeats, we'll have a collection of performance metrics. Let's summarize them and compare our models.

```{r}
# Summarize the cross-validation results
cat("\n--- Summary of Logistic Regression Performance ---\n")
summary(metrics_glm)
cat("Average ROC: ", mean(metrics_glm$ROC, na.rm = TRUE), "\n")
cat("Average Sensitivity: ", mean(metrics_glm$Sensitivity, na.rm = TRUE), "\n")
cat("Average Specificity: ", mean(metrics_glm$Specificity, na.rm = TRUE), "\n")

cat("\n--- Summary of LDA Performance ---\n")
summary(metrics_lda)
cat("Average ROC: ", mean(metrics_lda$ROC, na.rm = TRUE), "\n")
cat("Average Sensitivity: ", mean(metrics_lda$Sensitivity, na.rm = TRUE), "\n")
cat("Average Specificity: ", mean(metrics_lda$Specificity, na.rm = TRUE), "\n")

cat("\n--- Summary of QDA Performance ---\n")
summary(metrics_qda)
cat("Average ROC: ", mean(metrics_qda$ROC, na.rm = TRUE), "\n")
cat("Average Sensitivity: ", mean(metrics_qda$Sensitivity, na.rm = TRUE), "\n")
cat("Average Specificity: ", mean(metrics_qda$Specificity, na.rm = TRUE), "\n")

# Combine results for easier comparison and visualization
all_metrics <- rbind(
  data.frame(Model = "Logistic", metrics_glm),
  data.frame(Model = "LDA", metrics_lda),
  data.frame(Model = "QDA", metrics_qda)
)

# Remove rows with NA values (from failed folds) for plotting
all_metrics_clean <- na.omit(all_metrics)

# --- Visualization with ggplot2 (attached explicitly in case this section is run standalone) ---
# install.packages("ggplot2")
library(ggplot2)

# Boxplot of ROC values
ggplot(all_metrics_clean, aes(x = Model, y = ROC, fill = Model)) +
  geom_boxplot() +
  labs(title = "Model Performance Comparison (ROC)",
       y = "ROC AUC", x = "Model") +
  theme_minimal()

# Boxplot of Sensitivity values
ggplot(all_metrics_clean, aes(x = Model, y = Sensitivity, fill = Model)) +
  geom_boxplot() +
  labs(title = "Model Performance Comparison (Sensitivity)",
       y = "Sensitivity", x = "Model") +
  theme_minimal()

# Boxplot of Specificity values
ggplot(all_metrics_clean, aes(x = Model, y = Specificity, fill = Model)) +
  geom_boxplot() +
  labs(title = "Model Performance Comparison (Specificity)",
       y = "Specificity", x = "Model") +
  theme_minimal()
```
Here's a summary of the average performance metrics for each model:

| **Model**               | **Average ROC** | **Average Sensitivity** | **Average Specificity** |
|-------------------------|:-----------------:|:-------------------------:|:-------------------------:|
| **Logistic Regression** | **_0.8379_**    | 0.5534                  | **_0.8877_**            |
| **LDA**                 | 0.8374          | 0.5521                  | 0.8831                  |
| **QDA**                 | 0.8235          | **_0.5729_**            | 0.8543                  |

__Overall Performance (ROC AUC):__  

 - Logistic Regression (0.8379) and LDA (0.8374) have almost identical average ROC AUCs. The difference is negligible.
 - QDA (0.8235) has a slightly lower average ROC AUC compared to Logistic Regression and LDA.  
 - Interpretation: Based purely on overall discriminative power across all possible thresholds, Logistic Regression and LDA are performing equally well and slightly better than QDA. An AUC above 0.8 is generally considered good, indicating that the models have strong discriminatory power.  

__Sensitivity (True Positive Rate):__  

 - QDA (0.5729) has the highest average Sensitivity.  
 - Logistic Regression (0.5534) and LDA (0.5521) have very similar and slightly lower average Sensitivities compared to QDA.  
 - Interpretation: QDA is slightly better at correctly identifying actual positive cases (e.g., people with diabetes). This means it has fewer false negatives.  
 
__Specificity (True Negative Rate):__  

 - Logistic Regression (0.8877) has the highest average Specificity.
 - LDA (0.8831) is very close to Logistic Regression.
 - QDA (0.8543) has a lower average Specificity compared to Logistic Regression and LDA.
 - Interpretation: Logistic Regression is best at correctly identifying actual negative cases (e.g., people without diabetes). This means it has fewer false positives.

__Which one is the "best"?__  

The "best" model depends on the *specific goals and costs* associated with False Positives and False Negatives in your application.  

 - If overall discriminative power (considering all thresholds) is paramount, and you don't have a strong preference for minimizing one type of error over another:  
   - Logistic Regression or LDA are the top contenders, as their average ROC AUCs are virtually identical and the highest. There's not enough difference here to declare one definitively better than the other based solely on this metric.  
 - If minimizing False Negatives (catching as many positive cases as possible) is more critical (i.e., high Sensitivity is preferred):  
   - QDA might be considered slightly better due to its marginally higher average Sensitivity (0.5729). However, keep in mind that this comes at the cost of lower Specificity.  
 - If minimizing False Positives (correctly identifying negative cases and avoiding false alarms) is more critical (i.e., high Specificity is preferred):  
   - Logistic Regression (or very closely, LDA) is clearly the winner. It has the highest average Specificity, meaning it's less likely to incorrectly classify a negative case as positive.   

__Based on these results:__  

 - For overall balanced performance (highest ROC AUC) and strongest control over False Positives (highest Specificity), Logistic Regression is marginally the "best," closely followed by LDA.  
 - If you prioritize slightly higher Sensitivity (catching more true positives), QDA offers a small advantage, but at the cost of lower Specificity and a slightly lower overall ROC.  

Logistic Regression appears to be the most robust choice among these three models for this dataset. The differences between Logistic Regression and LDA are so small that they could be considered effectively equivalent. You'd likely choose between them based on ease of implementation, interpretability, or minor theoretical preferences.

## How to interpret ROC, Sensitivity, and Specificity

ROC, Sensitivity, and Specificity are three metrics for evaluating the performance of classification models, especially when you care about different types of errors. To interpret them, first consider the four basic outcomes in a binary classification problem, often represented in a confusion matrix:  

|                        | **Actual Positive** | **Actual Negative** |
|------------------------|:---------------------:|:---------------------:|
| **Predicted Positive** | True Positive (TP)  | False Positive (FP) |
| **Predicted Negative** | False Negative (FN) | True Negative (TN)  |  


 - __True Positive (TP):__ The model correctly predicted a positive outcome (e.g., predicted "has diabetes," and the person actually has diabetes).  
 - __False Positive (FP):__ The model incorrectly predicted a positive outcome (e.g., predicted "has diabetes," but the person actually does not have diabetes). This is also known as a Type I error or a "false alarm."  
 - __False Negative (FN):__ The model incorrectly predicted a negative outcome (e.g., predicted "does not have diabetes," but the person actually has diabetes). This is also known as a Type II error or a "miss."  
 - __True Negative (TN):__ The model correctly predicted a negative outcome (e.g., predicted "does not have diabetes," and the person actually does not have diabetes).  
 
### Sensitivity (Recall, True Positive Rate - TPR)  
Sensitivity measures the proportion of actual positive cases that were correctly identified by the model. It answers the question: "Out of all the people who truly have the condition, how many did our model correctly find?"  

$Sensitivity = \frac{TP}{TP+FN}$

 - __High Sensitivity (closer to 1.0):__ Means the model is very good at catching positive cases. It has a low rate of False Negatives (missing actual positives).  
 - __Low Sensitivity (closer to 0.0):__ Means the model misses many actual positive cases, leading to a high rate of False Negatives.  
 
__Sensitivity is crucial when the cost of a False Negative is high.__  

 - __Medical Diagnosis (e.g., detecting a serious disease like cancer):__ You want very high sensitivity because missing a disease (false negative) can have severe consequences for the patient. You'd rather have some false alarms (false positives) that can be further investigated.  
 - __Security Systems:__ You want to detect all intruders (high sensitivity), even if it means some false alarms.  
 - __Fraud Detection:__ You want to catch as many fraudulent transactions as possible.  

#### SnNout (Sensitive test, Negative result = rule OUT)  
If a highly sensitive test comes back negative, you can confidently rule out the disease/condition.

### Specificity (True Negative Rate - TNR)  
Specificity measures the proportion of actual negative cases that were correctly identified by the model. It answers the question: "Out of all the people who truly do NOT have the condition, how many did our model correctly identify as not having it?"  

$Specificity = \frac{TN}{TN+FP}$

 - __High Specificity (closer to 1.0):__ Means the model is very good at correctly identifying negative cases. It has a low rate of False Positives (incorrectly classifying negatives as positives).  
 - __Low Specificity (closer to 0.0):__ Means the model incorrectly flags many negative cases as positive, leading to a high rate of False Positives.  
 
__Specificity is crucial when the cost of a False Positive is high.__  

 - __Spam Detection:__ You want high specificity to avoid marking legitimate emails as spam (false positive), which would be very annoying to users. You might tolerate a few spam emails getting through (false negatives).  
 - __Legal/Criminal Justice:__ Incorrectly identifying an innocent person as guilty (false positive) has high societal costs.  
 - __Expensive or Invasive Follow-up Tests:__ If a positive test leads to a costly, painful, or risky follow-up procedure, you want high specificity to minimize unnecessary procedures.  

#### SpPin (Specific test, Positive result = rule IN)  
If a highly specific test comes back positive, you can confidently rule in the disease/condition. (A toy numeric check of both formulas follows.)
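
Both formulas are easy to verify by hand. The counts below are made up purely for illustration (they are not from the Pima data) and follow the layout of the confusion matrix table above:

```{r}
# Toy confusion matrix with made-up counts:
#                  Actual
# Predicted         No  Yes
cm <- matrix(c(80, 10,    # Predicted No : 80 TN, 10 FN
               15, 45),   # Predicted Yes: 15 FP, 45 TP
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("No", "Yes"),
                             Actual    = c("No", "Yes")))

TP <- cm["Yes", "Yes"]; TN <- cm["No", "No"]
FP <- cm["Yes", "No"];  FN <- cm["No", "Yes"]

TP / (TP + FN)  # Sensitivity = 45 / (45 + 10), about 0.818
TN / (TN + FP)  # Specificity = 80 / (80 + 15), about 0.842
```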

### Receiver Operating Characteristic (ROC) Curve  
The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It plots the *True Positive Rate (Sensitivity)* on the y-axis against the *False Positive Rate (1 − Specificity)* on the x-axis across all possible thresholds.

 - Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular classification threshold.  
 
   - *Top-left corner (0,1):* This represents a perfect classifier: 100% Sensitivity (no false negatives) and 100% Specificity (no false positives). The closer your curve is to this corner, the better the model's overall performance.  
   - *Diagonal line (from 0,0 to 1,1):* This represents a random classifier (equivalent to flipping a coin). A model whose ROC curve falls along this diagonal is performing no better than chance.  
   - *Curve below the diagonal:* Indicates a model that performs worse than random chance. If this happens, you can often just invert the model's predictions to get a better-than-chance model!  
   
 - The Area Under the Curve (AUC) is a single scalar value that summarizes the overall performance of the classifier across all possible thresholds. It ranges from 0 to 1.  
 
   - *AUC = 1:* Perfect classification.  
   - *AUC = 0.5:* Performance is no better than random guessing.  
   - *AUC < 0.5:* Performance is worse than random guessing.  
   - *Higher is better:* AUC can be interpreted as the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance (checked numerically in the sketch below).  
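
That rank interpretation is easy to check numerically. The scores below are made up (they do not come from the Pima models):

```{r}
# Toy check of the rank interpretation of AUC:
# AUC = P(score of a random positive > score of a random negative),
# with ties counted as one half.
pos_scores <- c(0.9, 0.8, 0.6, 0.55)        # hypothetical scores for class "Yes"
neg_scores <- c(0.7, 0.4, 0.3, 0.2, 0.1)    # hypothetical scores for class "No"

higher <- outer(pos_scores, neg_scores, ">")   # 4 x 5 pairwise comparisons
tied   <- outer(pos_scores, neg_scores, "==")
mean(higher + 0.5 * tied)  # 18 of 20 pairs ranked correctly -> AUC = 0.9
```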

```{r}
# Install necessary packages (if not already installed)
# install.packages("mlbench")
# install.packages("MASS")
# install.packages("dplyr")
# install.packages("ggplot2") # For plotting
# install.packages("pROC")    # For ROC curves

# Load packages
library(mlbench)
library(MASS)
library(dplyr)
library(ggplot2)
library(pROC) # Load pROC for ROC curves and AUC

# Load the dataset
data(PimaIndiansDiabetes2)
my_data <- PimaIndiansDiabetes2

# Data Preparation (using dplyr)
my_data <- my_data %>%
  na.omit() %>%
  mutate(diabetes = factor(diabetes, levels = c("neg", "pos"), labels = c("No", "Yes")))

# --- 1. Fit each model on the *entire* dataset ---
# This is a final model, not part of the cross-validation loop.
# It's used for plotting the "overall" ROC curve for the full dataset.

cat("\n--- Fitting Final Models on Full Dataset for ROC Curve ---\n")

# Logistic Regression
final_model_glm <- glm(diabetes ~ ., data = my_data, family = "binomial")

# LDA
final_model_lda <- lda(diabetes ~ ., data = my_data)

# QDA
# Add a check to ensure QDA can be fitted (sufficient data per class relative to predictors)
# This is similar to the check in the CV loop, but applied to the full dataset.
# The `ncol(my_data) - 1` accounts for all predictors
if (sum(my_data$diabetes == "No") >= (ncol(my_data) - 1) &&
    sum(my_data$diabetes == "Yes") >= (ncol(my_data) - 1)) {
  final_model_qda <- qda(diabetes ~ ., data = my_data)
} else {
  warning("QDA could not be fitted on the full dataset due to insufficient observations per class.")
  final_model_qda <- NULL # Set to NULL if it cannot be fitted
}


# --- 2. Get predicted probabilities for the 'Yes' class from each model ---

# Logistic Regression probabilities
# type = "response" gives probabilities directly for glm
prob_glm <- predict(final_model_glm, newdata = my_data, type = "response")

# LDA probabilities
# predict() for lda/qda returns a list; $posterior contains probabilities
prob_lda <- predict(final_model_lda, newdata = my_data)$posterior[, "Yes"]

# QDA probabilities (check if QDA model was successfully fitted)
prob_qda <- if (!is.null(final_model_qda)) {
  predict(final_model_qda, newdata = my_data)$posterior[, "Yes"]
} else {
  NULL # If QDA wasn't fitted, no probabilities
}

# --- 3. Create ROC objects for each model ---

# For Logistic Regression
roc_obj_glm <- roc(response = my_data$diabetes, predictor = prob_glm,
                   levels = c("No", "Yes")) # Ensure levels are in order: negative first

# For LDA
roc_obj_lda <- roc(response = my_data$diabetes, predictor = prob_lda,
                   levels = c("No", "Yes"))

# For QDA (only if the model was fitted)
if (!is.null(prob_qda)) {
  roc_obj_qda <- roc(response = my_data$diabetes, predictor = prob_qda,
                     levels = c("No", "Yes"))
} else {
  roc_obj_qda <- NULL
}


# --- 4. Plot all three ROC curves on the same graph ---

# Start with the first ROC curve (e.g., Logistic Regression)
plot(roc_obj_glm,
     col = "#3182bd", # Color for GLM
     lwd = 2,      # Line width
     main = "ROC Curves for Classification Models",
     sub = paste0("Logistic Regression AUC: ", round(auc(roc_obj_glm), 3),
                  "\nLDA AUC: ", round(auc(roc_obj_lda), 3),
                  if (!is.null(roc_obj_qda)) { paste0("\nQDA AUC: ", round(auc(roc_obj_qda), 3)) }
                  else { "" }
                  ),
     grid = TRUE # Add a grid for better readability
     )

# Add LDA curve to the existing plot
plot(roc_obj_lda,
     col = "#de2d26", # Color for LDA
     lwd = 2,
     add = TRUE # Add to the current plot
     )

# Add QDA curve to the existing plot (if it was fitted)
if (!is.null(roc_obj_qda)) {
  plot(roc_obj_qda,
       col = "#31a354", # Color for QDA
       lwd = 2,
       add = TRUE
       )
}

# Add a legend to distinguish the curves
legend("bottomright",
       legend = c(paste0("Logistic Regression (AUC = ", round(auc(roc_obj_glm), 3), ")"),
                  paste0("LDA (AUC = ", round(auc(roc_obj_lda), 3), ")"),
                  if (!is.null(roc_obj_qda)) { paste0("QDA (AUC = ", round(auc(roc_obj_qda), 3), ")") }
                  else { "QDA (Not Fitted)" }
                  ),
       col = c("#3182bd", "#de2d26", "#31a354"),
       lwd = 2,
       cex = 0.8 # Font size for legend
       )

# You can also add the diagonal line representing random chance
abline(a = 0, b = 1, lty = 2, col = "#bdbdbd") # Dashed gray line

# --- Optional: Print individual AUCs again for easy reference ---
cat("\n--- AUC Values for Final Models ---\n")
cat("Logistic Regression AUC: ", auc(roc_obj_glm), "\n")
cat("LDA AUC: ", auc(roc_obj_lda), "\n")
if (!is.null(roc_obj_qda)) {
  cat("QDA AUC: ", auc(roc_obj_qda), "\n")
} else {
  cat("QDA AUC: Not applicable (model not fitted)\n")
}
```

__Why it's important:__  

 - Unlike Accuracy, Sensitivity, and Specificity (which depend on a chosen threshold), the ROC curve and AUC provide a view of the model's performance across *all possible thresholds.* This is especially useful for models that output probabilities (like logistic regression), as you can choose the optimal threshold later based on your specific needs.  
 - AUC is generally a more reliable metric than accuracy when dealing with [imbalanced datasets](https://developers.google.com/machine-learning/crash-course/overfitting/imbalanced-datasets) (where one class is much more prevalent than the other). A classifier can achieve high accuracy by simply predicting the majority class all the time, but its AUC would reveal its poor discriminatory power (a tiny sketch after this list makes this concrete).  
 - AUC is excellent for comparing different models because it summarizes their overall discriminative power in a single number. The model with a higher AUC is generally preferred.
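
Here is the tiny sketch promised above, with entirely made-up data: a "classifier" that always predicts the majority class looks accurate on an imbalanced sample while discriminating nothing.

```{r}
# Made-up sketch: 95 negatives, 5 positives, constant predictions.
truth      <- factor(c(rep("No", 95), rep("Yes", 5)), levels = c("No", "Yes"))
pred_class <- factor(rep("No", 100), levels = c("No", "Yes"))

mean(pred_class == truth)  # accuracy = 0.95, which looks strong...

# ...but a constant score cannot rank any positive above any negative,
# so by the rank interpretation of AUC it sits at exactly 0.5 (chance).
score <- rep(0.2, 100)
mean(outer(score[truth == "Yes"], score[truth == "No"], ">") +
     0.5 * outer(score[truth == "Yes"], score[truth == "No"], "=="))
```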

#### Trade-offs between Sensitivity and Specificity (and how ROC shows it)  

There's often a trade-off between sensitivity and specificity.  

 - If you set a very low threshold for classifying an outcome as "positive" (e.g., if any probability above 0.1 is considered positive), you'll likely increase your sensitivity (catch more true positives) but also increase your false positive rate (decrease specificity).  
 - Conversely, if you set a very high threshold (e.g., only probabilities above 0.9 are positive), you'll increase your specificity (reduce false positives) but likely decrease your sensitivity (miss more true positives). The sketch below reads these trade-offs directly off a fitted ROC curve.  
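
To see this trade-off numerically, `pROC::coords()` can read sensitivity and specificity off a fitted ROC curve at chosen thresholds. This sketch assumes the `roc_obj_glm` object fitted on the full data in the appendix chunk above:

```{r}
# Sketch: sensitivity/specificity at three probability thresholds.
coords(roc_obj_glm,
       x = c(0.1, 0.5, 0.9),
       input = "threshold",
       ret = c("threshold", "sensitivity", "specificity"))
# Expect sensitivity to fall and specificity to rise as the threshold
# moves from 0.1 up to 0.9.
```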