Classification: Logistic Regression, LDA, and QDA

STA 6543: Predictive Modeling

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE, fig.width = 8, fig.height = 5)

# Required libraries for today's graduate lab
library(mlbench)    # For the Pima Indians dataset
library(tidyverse)  # For data manipulation and ggplot2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(modelr)     # For elegant data splitting workflows
library(MASS)       # For lda() and qda() functions

Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select
library(gtsummary)  # For publication-grade statistical tables

Attaching package: 'gtsummary'

The following object is masked from 'package:MASS':

    select
library(pROC)       # For evaluating ROC curves
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'

The following objects are masked from 'package:stats':

    cov, smooth, var

1. Context & The Philosophy of Data Splitting

We are working with the PimaIndiansDiabetes dataset, which contains 768 observations on female patients of Pima Indian heritage. Our goal is to predict binary diabetes status (diabetes: pos/neg) using diagnostic signs such as glucose tolerance plasma concentration (glucose), body mass index (mass), age (age), and blood pressure (pressure).

Before fitting a single model, we must partition our data.

The Optimism Bias of Training Error

Training error is a fundamentally biased, overly optimistic estimator of the true generalization error. Because models minimize loss directly on the training architecture, they naturally adapt to its specific stochastic noise.

Out-of-sample validation via a Train/Test split ensures we evaluate models on a data generating process they have not yet seen, effectively testing their true predictive generalizes-ability.

We will use the modelr package, which generates memory-efficient pointer objects (resample objects) rather than immediately duplicating data frames in RAM—a critical habit when scaling to massive datasets.

# Load data
data(PimaIndiansDiabetes)

set.seed(42) # Crucial for reproducibility in stochastic partitioning

# Partitioning data: 75% Training, 25% Testing
partitions <- resample_partition(PimaIndiansDiabetes, c(train = 0.75, test = 0.25))

# Convert the pointers to explicit data frames for model fitting
train_df <- as.data.frame(partitions$train)
test_df  <- as.data.frame(partitions$test)

# Check the baseline distribution using gtsummary
train_df %>%
  tbl_summary(by = diabetes, include = c(glucose, mass, age, pressure)) %>%
  add_p() %>%
  modify_caption("**Table 1: Baseline Clinical Metrics Stratified by Diabetes Status (Training Set)**")
Table 1: Baseline Clinical Metrics Stratified by Diabetes Status (Training Set)
Characteristic neg
N = 3741
pos
N = 2011
p-value2
glucose 108 (94, 124) 141 (119, 167) <0.001
mass 30 (25, 35) 34 (30, 38) <0.001
age 27 (23, 37) 35 (29, 44) <0.001
pressure 70 (62, 78) 74 (68, 82) <0.001
1 Median (Q1, Q3)
2 Wilcoxon rank sum test

2. Theoretical Overview & Model Fitting

We are pitting three distinct mathematical frameworks against each other:

  1. Logistic Regression: A direct, discriminative approach. It models the conditional probability \(P(Y = 1 \mid X)\) directly using the log-odds (logit) link function. It makes no assumptions about the distribution of the predictors \(X\).

  2. Linear Discriminant Analysis (LDA): A generative approach. It assumes that \(X\) stems from a multivariate Gaussian distribution (\(X \mid Y = k \sim N(\mu_k, \Sigma)\)). Crucially, LDA assumes that all classes share an identical covariance matrix (\(\Sigma\)). This results in a linear decision boundary.

  3. Quadratic Discriminant Analysis (QDA): Also generative and multivariate Gaussian, but it relaxes the homoscedastic assumption. Each class gets its own unique covariance matrix (\(\Sigma_k\)). This yields a quadratic (curved) decision boundary.

The fundamental divide between these models lies in the underlying probabilistic frameworks, asymptotic properties, and inductive biases.

Generative vs. Discriminative

  • Logistic Regression is a Discriminative Classifier. It directly models the conditional probability \(P(Y=k \mid X)\) using the logistic/softmax function. It makes no assumptions about the distribution of the predictors \(X\), treating them as fixed or arbitrary. It optimizes parameters using maximum likelihood estimation (MLE) via iteratively reweighted least squares (IRLS).

  • LDA and QDA are Generative Classifiers. They model the joint distribution \(P(X, Y) = P(X \mid Y)P(Y)\) by modeling the class-conditional densities \(P(X \mid Y=k)\) and applying Bayes’ theorem to find \(P(Y=k \mid X)\). Both assume that the predictors follow a multivariate normal distribution:

\[X \mid Y=k \sim N(\mu_k, \Sigma_k)\]

Mathematical Comparison & Boundary Geography

The geometry of the decision boundaries is directly dictated by how the models handle the covariance structures of the predictors.

Metric / Feature Logistic Regression LDA QDA
Decision Boundary Linear Linear Quadratic (Conic sections)
Assumed Density None Multivariate Normal Multivariate Normal
Covariance Structure N/A Pooled: Assumes \(\Sigma_1 = \Sigma_2 = \dots = \Sigma_K\) Class-Specific: Each class \(k\) gets its own \(\Sigma_k\)
Parameter Complexity \(p + 1\) parameters \(Kp + p(p+1)/2\) parameters \(Kp + Kp(p+1)/2\) parameters

The Algebraic Connection of Logistic vs. LDA

If the conditional distribution assumptions of LDA hold true (multivariate normality with equal covariance), the log-posterior odds of LDA can be simplified to:

\[\log \left( \frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} \right) = \alpha_0 + \beta^T X\]

This is the exact functional form of logistic regression. However, the parameter estimates \((\alpha_0, \beta)\) will differ because LDA optimizes the joint likelihood (using pooled empirical means and covariances), whereas logistic regression optimizes the conditional likelihood.

Asymptotic Efficiency and Robustness

When to prefer Logistic Regression

  • Non-Normal Predictors: Because logistic regression relies on fewer distributional assumptions, it is far more robust to heavy tails, extreme skewness, or qualitative/categorical features.

  • The Mismatched Case: If the true distribution \(P(X \mid Y)\) is drastically non-normal, LDA/QDA parameter estimates become inconsistent. Asymptotically, logistic regression will outperform LDA under model mismatch.

When to prefer LDA

  • Strict Normality & Small Samples: If the predictors are indeed multivariate normal, LDA is asymptotically efficient—it requires fewer data points than logistic regression to achieve the same error rate because it leverages the structural information contained in \(P(X)\).

  • Perfect Separation: In datasets where classes are perfectly linearly separable, logistic regression MLEs are unstable and drive coefficients toward infinity (the Hauck-Donner effect). LDA remains perfectly stable because it relies on mean vectors and pooled covariance matrices, which are always well-defined.

  • Multi-class Settings: While multinomial logistic regression handles \(K > 2\) classes well, LDA scales effortlessly and offers low-dimensional projection capabilities.

When to prefer QDA

  • Heteroscedasticity: When the variance-covariance structures differ markedly between classes (\(\Sigma_1 \neq \Sigma_2\)), drawing a linear boundary introduces massive bias. QDA allows the boundary to bend into parabolas, ellipses, or hyperbolas.

  • Sample Size Caveat: QDA requires estimating a separate covariance matrix for each class. Because parameter count scales quadratically with the number of predictors (\(O(Kp^2)\)), QDA suffers severely from the curse of dimensionality and will overfit drastically if the sample size \(N\) is small relative to \(p\). Under high dimensions, regularization techniques (like Friedman’s Regularized Discriminant Analysis) are typically required to compromise between LDA and QDA.

The Bias-Variance Tradeoff in Discriminant Analysis

If the assumption of equal covariance matrices is true, LDA is the Minimum Variance Unbiased Estimator (MVUE) among classifiers. QDA requires estimating a separate covariance matrix for each class. If you have \(p\) predictors and \(K\) classes, QDA must estimate \(K \times p(p+1)/2\) parameters.

For large \(p\), QDA can suffer from massive variance and overfit drastically. Choose LDA if sample size is small or variance structure is stable; choose QDA only if you have the sample size to support parameter estimation and clear evidence of heteroscedasticity.

Let’s fit all three models to our training data:

# 1. Logistic Regression
fit_logistic <- glm(diabetes ~ glucose + mass + age + pressure, data = train_df, family = binomial)

# 2. Linear Discriminant Analysis (MASS package)
fit_lda <- lda(diabetes ~ glucose + mass + age + pressure, data = train_df)

# 3. Quadratic Discriminant Analysis (MASS package)
fit_qda <- qda(diabetes ~ glucose + mass + age + pressure, data = train_df)

Reviewing Model Parameter Summaries

Let’s leverage gtsummary to inspect the structural coefficients of our Logistic Regression model.

Logistic Regression

tbl_regression(fit_logistic, exponentiate = TRUE) %>%
  modify_caption("**Table 2: Logistic Regression Odds Ratios (Training Data)**")
Table 2: Logistic Regression Odds Ratios (Training Data)
Characteristic OR 95% CI p-value
glucose 1.04 1.03, 1.05 <0.001
mass 1.08 1.05, 1.12 <0.001
age 1.03 1.02, 1.05 <0.001
pressure 0.99 0.98, 1.00 0.049
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

In epidemiologic and predictive modeling contexts, interpreting odds ratios (\(OR\)) requires evaluating three components: the point estimate, the direction of the effect relative to the baseline (\(OR = 1\)), and the statistical precision or significance (reflected by the 95% Confidence Interval and the \(p\)-value).

Because these variables are continuous physical metrics rather than binary indicators, each odds ratio represents the multiplicative change in the odds of the outcome (testing positive for diabetes) for a one-unit increase in that specific predictor, holding all other covariates constant.

1. Covariate-by-Covariate Breakdown
glucose (\(OR = 1.04\), \(95\%\text{ CI: } [1.03, 1.05]\), \(p < 0.001\))
  • Interpretation: Controlling for body mass index, age, and blood pressure, each 1-unit increase in plasma glucose concentration is associated with a 4% increase in the odds of testing positive for diabetes (\(\text{Odds Ratio} - 1 = 0.04\)).
  • Inference: This effect is highly statistically significant (\(p < 0.001\)). The narrow confidence interval \([1.03, 1.05]\) indicates a highly precise estimate, confirming that higher glucose levels are reliably tied to an upward shift in the conditional probability profile.
mass (\(OR = 1.08\), \(95\%\text{ CI: } [1.05, 1.12]\), \(p < 0.001\))
  • Interpretation: Adjusting for all other clinical indicators in the model, a 1-unit increase in body mass index (BMI) is associated with an 8% increase in the odds of testing positive for diabetes.
  • Inference: Out of all the continuous predictors evaluated, mass exhibits the largest magnitude of effect per single unit of change. The bounds of the 95% confidence interval are completely bounded well above 1.0, signifying a robust, highly significant positive association (\(p < 0.001\)).
age (\(OR = 1.03\), \(95\%\text{ CI: } [1.02, 1.05]\), \(p < 0.001\))
  • Interpretation: Holding glucose, mass, and blood pressure constant, each additional year of age is associated with a 3% increase in the odds of a positive diabetes diagnosis.

  • Inference: Though a 3% change per year appears modest superficially, it represents a compounding risk over time (e.g., a 10-year increase in age shifts the odds by a factor of \(1.03^{10} \approx 1.34\), or a 34% increase in odds). The effect is highly statistically significant.

pressure (\(OR = 0.99\), \(95\%\text{ CI: } [0.98, 1.00]\), \(p = 0.049\))
  • Interpretation: Controlling for glucose, mass, and age, a 1-unit increase in diastolic blood pressure is associated with a 1% decrease in the odds of testing positive for diabetes (\(\text{Odds Ratio} - 1 = -0.01\)).

  • Inference: This covariate is on the margin of statistical significance (\(p = 0.049\)). Notice that the upper bound of the 95% confidence interval touches \(1.00\) when rounded. Because the interval includes the null value (\(1.0\)), the protective effect of blood pressure in this specific multivariate configuration is highly unstable and weak.

Some Notes

  • Conditional Nature of Estimates: Emphasize that these are conditional odds ratios. For example, the 8% risk inflation per unit of mass is true specifically for individuals who share the exact same age, glucose, and pressure profiles.

  • The Scale Caveat: When comparing the magnitude of continuous odds ratios (e.g., mass at 1.08 vs. glucose at 1.04), the point estimates depend entirely on the measurement units. A 1-unit change in BMI (mass) represents a much larger physical shift relative to its sample distribution than a 1-unit change in plasma glucose concentration (glucose).

If you want to make these magnitudes directly comparable, you would need to standardize these predictors by their standard deviations (\(z\)-scores) before fitting the logistic regression model.

Discriminant Analysis

Analyzing the structural output of Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) isn’t common in simple classification accuracy tasks, but allows you to evaluate the underlying geometry and mechanics of how the models are slicing your feature space.

When you print an lda or qda object from the MASS package in R, the console gives you specific mathematical components. Here is a guide on how to extract, interpret, and discuss those outputs, utilizing the PimaIndiansDiabetes framework.

Interpreting the Text Output in R

When you execute print(fit_lda) or print(fit_qda), R returns a structured summary. Let’s look at what each section means.

print(fit_lda)
Call:
lda(diabetes ~ glucose + mass + age + pressure, data = train_df)

Prior probabilities of groups:
      neg       pos 
0.6504348 0.3495652 

Group means:
     glucose     mass      age pressure
neg 109.7219 30.15668 31.15775 67.96524
pos 142.1244 34.87512 36.84080 71.16418

Coefficients of linear discriminants:
                 LD1
glucose   0.02925412
mass      0.05774842
age       0.02618366
pressure -0.01007862
Prior Probabilities of Groups
  • What it is: The estimated prior probabilities (\(\pi_k\)) for each class. By default, R calculates these as the empirical proportions of the classes in your training data (e.g., neg: 0.651, pos: 0.349).

  • Look whether the priors reflect class imbalance. Because LDA and QDA incorporate these priors directly into Bayes’ theorem to calculate the posterior probability, a higher prior shifts the decision boundary to favor the majority class. This is the model’s “baseline intuition” before seeing any physical metrics.

Group Means
  • What it is: A matrix displaying the average value of each predictor vector (\(\mu_k\)) split by class.

  • Look for the univariate separation between groups. For instance, note: “The group mean for glucose is noticeably higher in the diabetic group (142.1) than the non-diabetic group (109.7), signaling that glucose will likely carry significant weight in the discriminant function.”

Coefficients of Linear Discriminants (LDA Only)
  • What it is: The scaling constants (\(v\)) that define the linear combination of predictors used to maximize between-class variance relative to within-class variance. The output gives you the formula for the discriminant score:

\(LD_1 = v_1X_1 + v_2X_2 + \dots + v_pX_p\)

  • Discuss these like standardized coordinates or regression weights (if your predictors are on similar scales). A large positive or negative coefficient indicates that the specific predictor heavily drives the assignment to a particular class.

  • Note: QDA does not have linear discriminant coefficients because its decision boundary is quadratic—it cannot be expressed as a simple linear combination of features.

print(fit_qda)
Call:
qda(diabetes ~ glucose + mass + age + pressure, data = train_df)

Prior probabilities of groups:
      neg       pos 
0.6504348 0.3495652 

Group means:
     glucose     mass      age pressure
neg 109.7219 30.15668 31.15775 67.96524
pos 142.1244 34.87512 36.84080 71.16418
Extracting and Discussing Hidden Matrices

To fully critique the models—especially when comparing LDA vs. QDA—graduate-level analysis requires inspecting the variance-covariance structures directly. R hides these inside the model objects, but you can compute or extract them to discuss the homoscedasticity assumption.

The LDA Pooled Covariance Matrix

LDA assumes that both groups share an identical covariance matrix (\(\Sigma\)). You can calculate this pooled matrix from your training data to inspect the correlation structures the model forces both classes to share:

# Calculate the pooled variance-covariance matrix used implicitly by LDA
with(train_df, {
  # Split predictors by group
  X_neg <- as.matrix(subset(train_df, diabetes == "neg", select = c(glucose, mass, age)))
  X_pos <- as.matrix(subset(train_df, diabetes == "pos", select = c(glucose, mass, age)))
  
  # Calculate group covariances
  cov_neg <- cov(X_neg)
  cov_pos <- cov(X_pos)
  
  # Pool them based on degrees of freedom
  n_neg <- nrow(X_neg)
  n_pos <- nrow(X_pos)
  pooled_cov <- ((n_neg - 1) * cov_neg + (n_pos - 1) * cov_pos) / (n_neg + n_pos - 2)
  print(pooled_cov)
})
          glucose     mass       age
glucose 762.18669 22.93965  53.00416
mass     22.93965 59.05123  -3.09117
age      53.00416 -3.09117 128.83176

This pooled covariance matrix (\(\Sigma\)) represents the foundational geometric framework that your Linear Discriminant Analysis (LDA) model uses to understand the relationships between your predictors across the entire dataset.

In LDA, the model assumes that both groups (pos and neg diabetes cohorts) share this exact same covariance structure. It uses these values to calculate Mahalanobis distances and to project the data into a lower-dimensional space that maximizes class separation.

The diagonal elements (\(762.19\), \(59.05\), and \(128.83\)) represent the variance (\(\sigma^2\)) for glucose, mass, and age respectively. Taking the square root of these values gives us their pooled standard deviations (\(\sigma\)):

  • glucose: \(\sigma = \sqrt{762.18669} \approx 27.61\) units
  • age: \(\sigma = \sqrt{128.83176} \approx 11.35\) years
  • mass: \(\sigma = \sqrt{59.05123} \approx 7.68\) units
Statistical Implications for the Model

The massive scale discrepancy between the variance of glucose (\(762.19\)) and mass (\(59.05\)) explains why looking at unstandardized coefficients in a model can be highly misleading. Because glucose spans a much larger numerical range, its unstandardized effect size or discriminant coefficient will appear artificially small compared to mass, even though its predictive power might be equivalent or greater. LDA accounts for these varied variances by standardizing the space dynamically via this matrix.

Analyzing the Off-Diagonal Elements: Covariances and Directions

The off-diagonal elements represent the covariances (\(\sigma_{jk}\)) between pairs of features. They tell us how these clinical metrics move together across the pooled sample.

glucose and age (\(\sigma_{13} = 53.00416\))

  • The Finding: There is a pronounced, positive covariance between a patient’s plasma glucose concentration and their age.

  • Model Impact: As age increases, glucose levels systematically shift upward. LDA uses this positive covariance to create a slanted decision boundary rather than an axis-aligned one. If a patient is older, the model “expects” a higher baseline glucose level, adjusting its classification threshold accordingly.

glucose and mass (\(\sigma_{12} = 22.93965\))

  • The Finding: There is a moderate positive covariance between glucose levels and body mass index.

  • Model Impact: Higher body mass shifts the expected distribution of glucose upward, indicating that these two features carry some redundant structural information. LDA’s mathematical formulation naturally handles this multicollinearity by penalizing overlapping variance.

mass and age (\(\sigma_{23} = -3.09117\))

  • The Finding: There is a small, negative covariance between body mass index and age in this sample.

  • Model Impact: This is a fascinating architectural nuance. Holding glucose constant, an increase in age is slightly associated with a decrease in body mass index within this pooled framework. Because this value is quite close to zero relative to the variances, it suggests that the linear relationship between age and mass is nearly orthogonal (uncorrelated) once pooled across classes.

How to Frame This in a Presentation or Paper

When writing up your methods or discussing this output in a lecture notebook, you can synthesize these findings to explain why LDA handles the data the way it does:

“The empirical pooled covariance matrix (\(\Sigma\)) demonstrates that the feature space is highly anisotropic, dominated by a large variance in plasma glucose (\(\sigma^2 = 762.19\)) relative to body mass index (\(\sigma^2 = 59.05\)). Furthermore, the positive covariances (\(\sigma_{\text{glucose, age}} = 53.00\); \(\sigma_{\text{glucose, mass}} = 22.94\)) establish that these clinical predictors are non-orthogonal. LDA successfully leverages these off-diagonal structures to construct an optimal linear decision boundary that accounts for feature correlation, effectively preventing redundant biometric signals from doubly influencing the posterior probability allocations.”

The Next Diagnostic Step:

To turn this into a rigorous critique of the model, your next step should be to calculate the class-specific covariance matrices (\(\Sigma_{\text{pos}}\) and \(\Sigma_{\text{neg}}\)) separately. If those two individual matrices diverge significantly from this pooled matrix (e.g., if the covariance between glucose and age is negative in one class but highly positive in another), it means the LDA model’s homoscedastic assumption is violated, providing a strong mathematical justification for switching to Quadratic Discriminant Analysis (QDA).

The QDA Class-Specific Covariance Matrices

QDA relaxes this assumption and calculates an individual covariance matrix (\(\Sigma_k\)) for each class. R stores the scaling of these matrices within the fit_qda object as singular values (scaling). You can look at the raw empirical matrices to evaluate if QDA is justified:

# Evaluate heteroscedasticity by comparing individual group matrices
cov_neg_actual <- cov(subset(train_df, diabetes == "neg", select = c(glucose, mass, age)))
cov_pos_actual <- cov(subset(train_df, diabetes == "pos", select = c(glucose, mass, age)))

print(cov_neg_actual)
          glucose      mass        age
glucose 661.20397 30.490872  62.531921
mass     30.49087 60.516725   2.281382
age      62.53192  2.281382 140.567540
print(cov_pos_actual)
           glucose       mass       age
glucose 950.519453   8.856609  35.23490
mass      8.856609  56.318078 -13.11098
age      35.234900 -13.110980 106.94453

This complete structural layout gives you everything required for a defense of why one discriminant model should be favored over the other.

By stacking the Group Means, the Pooled Covariance Matrix (\(\Sigma\)), and the Class-Specific Covariance Matrices (\(\Sigma_{neg}\) and \(\Sigma_{pos}\)) side by side, we can explicitly diagnose the geometric and statistical trade-offs between LDA and QDA.

Identifying the Univariate Signals

The class conditional means \(\mu_k\) are identical across both models and isolate where the center of mass sits for each population:

  • The Drivers: Plasma glucose shows massive univariate separation, jumping from 109.72 in the non-diabetic cohort to 142.12 in the diabetic cohort. Similarly, mass pushes from 30.16 to 34.88.

  • The Geometry: Because the mean vectors differ significantly across all four features, the center of the two multivariate normal distributions are distinctly separated in \(\mathbb{R}^4\).

Comparing Covariance Architectures: Testing the Homoscedastic Assumption

The core mathematical distinction between LDA and QDA hinges on whether you can justify pooling the variance structures. Let’s critique the empirical covariance matrices (\(3 \times 3\) subsets for glucose, mass, and age) to evaluate this choice.

Spatial Expansion (The Diagonal Variances)

Looking at the standalone variances (\(\sigma^2\)) on the main diagonals reveals a stark structural divergence between the two classes:

\[\text{Glucose Variance: } \sigma^2_{\text{neg}} = \mathbf{661.20} \quad \text{vs.} \quad \sigma^2_{\text{pos}} = \mathbf{950.52}\]

\[\text{Age Variance: } \sigma^2_{\text{neg}} = \mathbf{140.57} \quad \text{vs.} \quad \sigma^2_{\text{pos}} = \mathbf{106.94}\]

  • The Takeaway: The variance of glucose expands by nearly 44% within the positive cohort. Conversely, the variance of age shrinks significantly.

  • The LDA Flaw: LDA forces these values into a pooled compromise (\(\sigma^2_{\text{glucose}} = 762.19\)). This means LDA systematically overestimates the variance of glucose for healthy patients and underestimates it for diabetic patients.

Rotational Shifts (The Off-Diagonal Covariances)

The off-diagonal terms (\(\sigma_{jk}\)) define the orientation and rotation of the distribution ellipses in space. Comparing the classes reveals a complete breakdown of parallel structure:

  • glucose and mass: In the healthy cohort, they share a strong positive covariance of +30.49. In the diabetic cohort, this relationship collapses entirely to +8.86 (nearly orthogonal).

  • mass and age: In the healthy cohort, mass and age have a slight positive covariance (+2.28). In the diabetic cohort, they shift into a sharp negative covariance (-13.11).

Structural & Geometric Synthesis

Why LDA’s Assumptions Fail Visually

Because the orientation (covariances) and size (variances) shift drastically between neg and pos, the true underlying distributions are highly heteroscedastic.

Geometrically, the healthy population forms a tighter, distinct ellipse rotated along a strong glucose-mass axis. The diabetic population forms a massive, highly dispersed hyper-ellipsoid where mass and age are negatively aligned, and glucose variance is highly volatile.

When LDA forces a pooled covariance matrix (\(\Sigma\)), it assumes these two ellipsoids are identical in size and orientation, differing only in their center points. Visually, LDA forces a single, average “compromise” orientation onto both groups, resulting in a rigid, straight decision line.

Why QDA is Mathematically Justified

QDA relaxes this constraint and assigns each group its unique covariance matrix (\(\Sigma_k\)). Because \(\Sigma_{neg} \neq \Sigma_{pos}\), the quadratic terms in the log-posterior odds do not cancel out. This allows the decision boundary to bend into a curved, parabolic hyperplane that bends around the tighter healthy cohort and expands outward to accommodate the highly dispersed diabetic cohort.

The Parameter Penalty Caveat (The Bias-Variance Tradeoff)

While QDA is clearly better aligned with the true physics of this data, it comes at a steep cost in parameter complexity.

With \(p = 4\) predictors and \(K = 2\) classes:

  • LDA estimates a single pooled covariance matrix requiring only \(\frac{p(p+1)}{2} = \mathbf{10}\) covariance parameters.

  • QDA must estimate a separate covariance matrix for each class, requiring \(K \times \frac{p(p+1)}{2} = \mathbf{20}\) covariance parameters.

Conclusion Given that our training set is sufficiently large (\(N \approx 576\) observations in a 75% split), you have enough data points per class to stably estimate these 20 parameters without suffering from the curse of dimensionality. Therefore, because the class-specific matrices provide undeniable empirical evidence of heteroscedasticity, QDA holds the stronger theoretical justification for this dataset, even if its out-of-sample predictive returns are closely matched by a parsimonious Logistic Regression model.

Visualizing Covariance Ellipses

To ground this discussion visually, you can use the GGally package or standard ggplot2 to draw data density ellipses for each class. This lets you see if the data matches the geometric assumptions of the models.

library(ggplot2)

ggplot(train_df, aes(x = glucose, y = mass, color = diabetes)) +
  geom_point(alpha = 0.4) +
  # Draw normal data ellipses (representing the covariance shape)
  stat_ellipse(type = "norm", linetype = 2, size = 1) +
  labs(title = "Visualizing Class Covariance Structures",
       subtitle = "Parallel ellipses suggest LDA; differing orientations/sizes suggest QDA") +
  theme_minimal()

Visual Diagnostics:

  • If the ellipses have identical orientations and sizes: The homoscedastic assumption holds. LDA is perfectly sufficient, and QDA will likely overfit.

  • If the ellipses are rotated differently or one is significantly larger than the other: The covariance structures are distinct. LDA’s linear boundary will suffer from systemic bias, justifying QDA’s quadratic complexity.

3. Out-of-Sample Evaluation

Now, we evaluate performance on the held-out test set. We will extract the predicted probabilities for the “pos” (diabetic) class across all three architectures.

# Generate predictions on the test dataset
evaluation_results <- test_df %>%
  mutate(
    # Probabilities (Target class is "pos")
    prob_logistic = predict(fit_logistic, newdata = ., type = "response"),
    prob_lda      = predict(fit_lda, newdata = .)$posterior[, "pos"],
    prob_qda      = predict(fit_qda, newdata = .)$posterior[, "pos"],
    
    # Classifications (using a standard 0.5 decision threshold)
    class_logistic = factor(ifelse(prob_logistic > 0.5, "pos", "neg"), levels = c("neg", "pos")),
    class_lda      = predict(fit_lda, newdata = .)$class,
    class_qda      = predict(fit_qda, newdata = .)$class
  )

Comparing Predictive Behavior via gtsummary

Let’s construct a summary showing how often each model’s predictions align with actual outcome data.

evaluation_results %>%
  mutate(
    Correct_Logistic = ifelse(class_logistic == diabetes, "Correct", "Incorrect"),
    Correct_LDA      = ifelse(class_lda == diabetes, "Correct", "Incorrect"),
    Correct_QDA      = ifelse(class_qda == diabetes, "Correct", "Incorrect")
  ) %>%
  select(Correct_Logistic, Correct_LDA, Correct_QDA) %>%
  tbl_summary(
    label = list(
      Correct_Logistic ~ "Logistic Regression Success",
      Correct_LDA      ~ "LDA Success",
      Correct_QDA      ~ "QDA Success"
    )
  ) %>%
  modify_caption("**Table 3: Out-of-Sample Classification Accuracy Across Models**")
Table 3: Out-of-Sample Classification Accuracy Across Models
Characteristic N = 1931
Logistic Regression Success
    Correct 149 (77%)
    Incorrect 44 (23%)
LDA Success
    Correct 148 (77%)
    Incorrect 45 (23%)
QDA Success
    Correct 142 (74%)
    Incorrect 51 (26%)
1 n (%)

Beware Class Imbalance

Notice that all three models hover around a 75-80% success rate. While class imbalance is less extreme here than in default tracking (roughly ~35% of this population is positive for diabetes), raw accuracy still masks class-specific performance. Look beyond simple accuracy to Area Under the ROC Curve (AUC), Sensitivity (True Positive Rate), and Specificity (True Negative Rate).

4. Visualizing Performance with ggplot2

To truly evaluate these models, let’s look at how cleanly they separate the classes by plotting their predicted probabilities against the true diabetes state.

# Reshape data for clean faceting in ggplot2
density_data <- evaluation_results %>%
  select(diabetes, prob_logistic, prob_lda, prob_qda) %>%
  pivot_longer(
    cols = starts_with("prob_"),
    names_to = "Model",
    names_prefix = "prob_",
    values_to = "Predicted_Probability"
  ) %>%
  mutate(Model = toupper(Model))

# Create the density visualization
ggplot(density_data, aes(x = Predicted_Probability, fill = diabetes)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~Model, scales = "free_y") +
  scale_fill_manual(values = c("#67a9cf", "#ef8a62")) +
  labs(
    title = "Distribution of Predicted Probabilities by True Class",
    subtitle = "Evaluating how effectively classifiers push diabetes instances toward 1 and non-diabetes toward 0",
    x = "Predicted Probability of Diabetes",
    y = "Density",
    fill = "Actual Status"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom", strip.text = element_text(face = "bold", size = 11))

ROC Curve Comparison

Let’s construct Receiver Operating Characteristic (ROC) curves to evaluate performance across all potential decision thresholds simultaneously.

# Compute ROC curves using pROC
roc_log <- roc(evaluation_results$diabetes, evaluation_results$prob_logistic, quiet = TRUE)
roc_lda <- roc(evaluation_results$diabetes, evaluation_results$prob_lda, quiet = TRUE)
roc_qda <- roc(evaluation_results$diabetes, evaluation_results$prob_qda, quiet = TRUE)

# Bind coordinates for ggplot
roc_df <- bind_rows(
  coords(roc_log, "all", ret = c("threshold", "specificity", "sensitivity")) %>% mutate(Model = paste0("Logistic (AUC = ", round(auc(roc_log), 4), ")")),
  coords(roc_lda, "all", ret = c("threshold", "specificity", "sensitivity")) %>% mutate(Model = paste0("LDA (AUC = ", round(auc(roc_lda), 4), ")")),
  coords(roc_qda, "all", ret = c("threshold", "specificity", "sensitivity")) %>% mutate(Model = paste0("QDA (AUC = ", round(auc(roc_qda), 4), ")"))
)

# Plot ROC curves
ggplot(roc_df, aes(x = 1 - specificity, y = sensitivity, color = Model)) +
  geom_line(size = 1) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "#636363") +
  labs(
    title = "Out-of-Sample ROC Curves",
    x = "False Positive Rate (1 - Specificity)",
    y = "True Positive Rate (Sensitivity)"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  theme(
    legend.position = "inside", 
    legend.position.inside = c(0.70, 0.25), 
    legend.background = element_rect(fill = "white", color = "#bdbdbd")
  )

Conclusion: Which Model Fits Best?

Looking closely at both the classification summaries and the ROC visualization:

  • Logistic Regression and LDA perform almost identically (\(\text{AUC} \approx 0.80\)). This occurs because the true underlying log-odds relationship is highly linear with respect to balance, and the variance structures do not shift violently between classes.

  • QDA shows a marginal drop in performance or a comparable AUC but introduces a higher risk of variance because it estimates parameters for multiple covariance matrices.

Verdict: Given the principle of parsimony (Occam’s Razor), Logistic Regression or LDA is preferred here over QDA. Logistic regression wins a slight practical edge because we can easily interpret its coefficients as conditional odds ratios and the fact that LDA is misspecified.