knitr::opts_chunk$set(echo =TRUE, warning =FALSE, message =FALSE, fig.width =8, fig.height =5)# Required libraries for today's graduate lablibrary(mlbench) # For the Pima Indians datasetlibrary(tidyverse) # For data manipulation and ggplot2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.1 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.3 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(modelr) # For elegant data splitting workflowslibrary(MASS) # For lda() and qda() functions
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
library(gtsummary) # For publication-grade statistical tables
Attaching package: 'gtsummary'
The following object is masked from 'package:MASS':
select
library(pROC) # For evaluating ROC curves
Type 'citation("pROC")' for a citation.
Attaching package: 'pROC'
The following objects are masked from 'package:stats':
cov, smooth, var
1. Context & The Philosophy of Data Splitting
We are working with the PimaIndiansDiabetes dataset, which contains 768 observations on female patients of Pima Indian heritage. Our goal is to predict binary diabetes status (diabetes: pos/neg) using diagnostic signs such as glucose tolerance plasma concentration (glucose), body mass index (mass), age (age), and blood pressure (pressure).
Before fitting a single model, we must partition our data.
The Optimism Bias of Training Error
Training error is a fundamentally biased, overly optimistic estimator of the true generalization error. Because models minimize loss directly on the training architecture, they naturally adapt to its specific stochastic noise.
Out-of-sample validation via a Train/Test split ensures we evaluate models on a data generating process they have not yet seen, effectively testing their true predictive generalizes-ability.
We will use the modelr package, which generates memory-efficient pointer objects (resample objects) rather than immediately duplicating data frames in RAM—a critical habit when scaling to massive datasets.
# Load datadata(PimaIndiansDiabetes)set.seed(42) # Crucial for reproducibility in stochastic partitioning# Partitioning data: 75% Training, 25% Testingpartitions <-resample_partition(PimaIndiansDiabetes, c(train =0.75, test =0.25))# Convert the pointers to explicit data frames for model fittingtrain_df <-as.data.frame(partitions$train)test_df <-as.data.frame(partitions$test)# Check the baseline distribution using gtsummarytrain_df %>%tbl_summary(by = diabetes, include =c(glucose, mass, age, pressure)) %>%add_p() %>%modify_caption("**Table 1: Baseline Clinical Metrics Stratified by Diabetes Status (Training Set)**")
Table 1: Baseline Clinical Metrics Stratified by Diabetes Status (Training Set)
Characteristic
neg
N = 3741
pos
N = 2011
p-value2
glucose
108 (94, 124)
141 (119, 167)
<0.001
mass
30 (25, 35)
34 (30, 38)
<0.001
age
27 (23, 37)
35 (29, 44)
<0.001
pressure
70 (62, 78)
74 (68, 82)
<0.001
1 Median (Q1, Q3)
2 Wilcoxon rank sum test
2. Theoretical Overview & Model Fitting
We are pitting three distinct mathematical frameworks against each other:
Logistic Regression: A direct, discriminative approach. It models the conditional probability \(P(Y = 1 \mid X)\) directly using the log-odds (logit) link function. It makes no assumptions about the distribution of the predictors \(X\).
Linear Discriminant Analysis (LDA): A generative approach. It assumes that \(X\) stems from a multivariate Gaussian distribution (\(X \mid Y = k \sim N(\mu_k, \Sigma)\)). Crucially, LDA assumes that all classes share an identical covariance matrix (\(\Sigma\)). This results in a linear decision boundary.
Quadratic Discriminant Analysis (QDA): Also generative and multivariate Gaussian, but it relaxes the homoscedastic assumption. Each class gets its own unique covariance matrix (\(\Sigma_k\)). This yields a quadratic (curved) decision boundary.
The fundamental divide between these models lies in the underlying probabilistic frameworks, asymptotic properties, and inductive biases.
Generative vs. Discriminative
Logistic Regression is a Discriminative Classifier. It directly models the conditional probability \(P(Y=k \mid X)\) using the logistic/softmax function. It makes no assumptions about the distribution of the predictors \(X\), treating them as fixed or arbitrary. It optimizes parameters using maximum likelihood estimation (MLE) via iteratively reweighted least squares (IRLS).
LDA and QDA are Generative Classifiers. They model the joint distribution \(P(X, Y) = P(X \mid Y)P(Y)\) by modeling the class-conditional densities \(P(X \mid Y=k)\) and applying Bayes’ theorem to find \(P(Y=k \mid X)\). Both assume that the predictors follow a multivariate normal distribution:
\[X \mid Y=k \sim N(\mu_k, \Sigma_k)\]
Mathematical Comparison & Boundary Geography
The geometry of the decision boundaries is directly dictated by how the models handle the covariance structures of the predictors.
Class-Specific: Each class \(k\) gets its own \(\Sigma_k\)
Parameter Complexity
\(p + 1\) parameters
\(Kp + p(p+1)/2\) parameters
\(Kp + Kp(p+1)/2\) parameters
The Algebraic Connection of Logistic vs. LDA
If the conditional distribution assumptions of LDA hold true (multivariate normality with equal covariance), the log-posterior odds of LDA can be simplified to:
This is the exact functional form of logistic regression. However, the parameter estimates \((\alpha_0, \beta)\) will differ because LDA optimizes the joint likelihood (using pooled empirical means and covariances), whereas logistic regression optimizes the conditional likelihood.
Asymptotic Efficiency and Robustness
When to prefer Logistic Regression
Non-Normal Predictors: Because logistic regression relies on fewer distributional assumptions, it is far more robust to heavy tails, extreme skewness, or qualitative/categorical features.
The Mismatched Case: If the true distribution \(P(X \mid Y)\) is drastically non-normal, LDA/QDA parameter estimates become inconsistent. Asymptotically, logistic regression will outperform LDA under model mismatch.
When to prefer LDA
Strict Normality & Small Samples: If the predictors are indeed multivariate normal, LDA is asymptotically efficient—it requires fewer data points than logistic regression to achieve the same error rate because it leverages the structural information contained in \(P(X)\).
Perfect Separation: In datasets where classes are perfectly linearly separable, logistic regression MLEs are unstable and drive coefficients toward infinity (the Hauck-Donner effect). LDA remains perfectly stable because it relies on mean vectors and pooled covariance matrices, which are always well-defined.
Multi-class Settings: While multinomial logistic regression handles \(K > 2\) classes well, LDA scales effortlessly and offers low-dimensional projection capabilities.
When to prefer QDA
Heteroscedasticity: When the variance-covariance structures differ markedly between classes (\(\Sigma_1 \neq \Sigma_2\)), drawing a linear boundary introduces massive bias. QDA allows the boundary to bend into parabolas, ellipses, or hyperbolas.
Sample Size Caveat: QDA requires estimating a separate covariance matrix for each class. Because parameter count scales quadratically with the number of predictors (\(O(Kp^2)\)), QDA suffers severely from the curse of dimensionality and will overfit drastically if the sample size \(N\) is small relative to \(p\). Under high dimensions, regularization techniques (like Friedman’s Regularized Discriminant Analysis) are typically required to compromise between LDA and QDA.
The Bias-Variance Tradeoff in Discriminant Analysis
If the assumption of equal covariance matrices is true, LDA is the Minimum Variance Unbiased Estimator (MVUE) among classifiers. QDA requires estimating a separate covariance matrix for each class. If you have \(p\) predictors and \(K\) classes, QDA must estimate \(K \times p(p+1)/2\) parameters.
For large \(p\), QDA can suffer from massive variance and overfit drastically. Choose LDA if sample size is small or variance structure is stable; choose QDA only if you have the sample size to support parameter estimation and clear evidence of heteroscedasticity.
Let’s fit all three models to our training data:
# 1. Logistic Regressionfit_logistic <-glm(diabetes ~ glucose + mass + age + pressure, data = train_df, family = binomial)# 2. Linear Discriminant Analysis (MASS package)fit_lda <-lda(diabetes ~ glucose + mass + age + pressure, data = train_df)# 3. Quadratic Discriminant Analysis (MASS package)fit_qda <-qda(diabetes ~ glucose + mass + age + pressure, data = train_df)
Reviewing Model Parameter Summaries
Let’s leverage gtsummary to inspect the structural coefficients of our Logistic Regression model.
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
In epidemiologic and predictive modeling contexts, interpreting odds ratios (\(OR\)) requires evaluating three components: the point estimate, the direction of the effect relative to the baseline (\(OR = 1\)), and the statistical precision or significance (reflected by the 95% Confidence Interval and the \(p\)-value).
Because these variables are continuous physical metrics rather than binary indicators, each odds ratio represents the multiplicative change in the odds of the outcome (testing positive for diabetes) for a one-unit increase in that specific predictor, holding all other covariates constant.
Interpretation: Controlling for body mass index, age, and blood pressure, each 1-unit increase in plasma glucose concentration is associated with a 4% increase in the odds of testing positive for diabetes (\(\text{Odds Ratio} - 1 = 0.04\)).
Inference: This effect is highly statistically significant (\(p < 0.001\)). The narrow confidence interval \([1.03, 1.05]\) indicates a highly precise estimate, confirming that higher glucose levels are reliably tied to an upward shift in the conditional probability profile.
Interpretation: Adjusting for all other clinical indicators in the model, a 1-unit increase in body mass index (BMI) is associated with an 8% increase in the odds of testing positive for diabetes.
Inference: Out of all the continuous predictors evaluated, mass exhibits the largest magnitude of effect per single unit of change. The bounds of the 95% confidence interval are completely bounded well above 1.0, signifying a robust, highly significant positive association (\(p < 0.001\)).
Interpretation: Holding glucose, mass, and blood pressure constant, each additional year of age is associated with a 3% increase in the odds of a positive diabetes diagnosis.
Inference: Though a 3% change per year appears modest superficially, it represents a compounding risk over time (e.g., a 10-year increase in age shifts the odds by a factor of \(1.03^{10} \approx 1.34\), or a 34% increase in odds). The effect is highly statistically significant.
Interpretation: Controlling for glucose, mass, and age, a 1-unit increase in diastolic blood pressure is associated with a 1% decrease in the odds of testing positive for diabetes (\(\text{Odds Ratio} - 1 = -0.01\)).
Inference: This covariate is on the margin of statistical significance (\(p = 0.049\)). Notice that the upper bound of the 95% confidence interval touches \(1.00\) when rounded. Because the interval includes the null value (\(1.0\)), the protective effect of blood pressure in this specific multivariate configuration is highly unstable and weak.
Some Notes
Conditional Nature of Estimates: Emphasize that these are conditional odds ratios. For example, the 8% risk inflation per unit of mass is true specifically for individuals who share the exact same age, glucose, and pressure profiles.
The Scale Caveat: When comparing the magnitude of continuous odds ratios (e.g., mass at 1.08 vs. glucose at 1.04), the point estimates depend entirely on the measurement units. A 1-unit change in BMI (mass) represents a much larger physical shift relative to its sample distribution than a 1-unit change in plasma glucose concentration (glucose).
If you want to make these magnitudes directly comparable, you would need to standardize these predictors by their standard deviations (\(z\)-scores) before fitting the logistic regression model.
Discriminant Analysis
Analyzing the structural output of Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) isn’t common in simple classification accuracy tasks, but allows you to evaluate the underlying geometry and mechanics of how the models are slicing your feature space.
When you print an lda or qda object from the MASS package in R, the console gives you specific mathematical components. Here is a guide on how to extract, interpret, and discuss those outputs, utilizing the PimaIndiansDiabetes framework.
Interpreting the Text Output in R
When you execute print(fit_lda) or print(fit_qda), R returns a structured summary. Let’s look at what each section means.
print(fit_lda)
Call:
lda(diabetes ~ glucose + mass + age + pressure, data = train_df)
Prior probabilities of groups:
neg pos
0.6504348 0.3495652
Group means:
glucose mass age pressure
neg 109.7219 30.15668 31.15775 67.96524
pos 142.1244 34.87512 36.84080 71.16418
Coefficients of linear discriminants:
LD1
glucose 0.02925412
mass 0.05774842
age 0.02618366
pressure -0.01007862
Prior Probabilities of Groups
What it is: The estimated prior probabilities (\(\pi_k\)) for each class. By default, R calculates these as the empirical proportions of the classes in your training data (e.g., neg: 0.651, pos: 0.349).
Look whether the priors reflect class imbalance. Because LDA and QDA incorporate these priors directly into Bayes’ theorem to calculate the posterior probability, a higher prior shifts the decision boundary to favor the majority class. This is the model’s “baseline intuition” before seeing any physical metrics.
Group Means
What it is: A matrix displaying the average value of each predictor vector (\(\mu_k\)) split by class.
Look for the univariate separation between groups. For instance, note: “The group mean for glucose is noticeably higher in the diabetic group (142.1) than the non-diabetic group (109.7), signaling that glucose will likely carry significant weight in the discriminant function.”
Coefficients of Linear Discriminants (LDA Only)
What it is: The scaling constants (\(v\)) that define the linear combination of predictors used to maximize between-class variance relative to within-class variance. The output gives you the formula for the discriminant score:
\(LD_1 = v_1X_1 + v_2X_2 + \dots + v_pX_p\)
Discuss these like standardized coordinates or regression weights (if your predictors are on similar scales). A large positive or negative coefficient indicates that the specific predictor heavily drives the assignment to a particular class.
Note: QDA does not have linear discriminant coefficients because its decision boundary is quadratic—it cannot be expressed as a simple linear combination of features.
print(fit_qda)
Call:
qda(diabetes ~ glucose + mass + age + pressure, data = train_df)
Prior probabilities of groups:
neg pos
0.6504348 0.3495652
Group means:
glucose mass age pressure
neg 109.7219 30.15668 31.15775 67.96524
pos 142.1244 34.87512 36.84080 71.16418
Extracting and Discussing Hidden Matrices
To fully critique the models—especially when comparing LDA vs. QDA—graduate-level analysis requires inspecting the variance-covariance structures directly. R hides these inside the model objects, but you can compute or extract them to discuss the homoscedasticity assumption.
The LDA Pooled Covariance Matrix
LDA assumes that both groups share an identical covariance matrix (\(\Sigma\)). You can calculate this pooled matrix from your training data to inspect the correlation structures the model forces both classes to share:
# Calculate the pooled variance-covariance matrix used implicitly by LDAwith(train_df, {# Split predictors by group X_neg <-as.matrix(subset(train_df, diabetes =="neg", select =c(glucose, mass, age))) X_pos <-as.matrix(subset(train_df, diabetes =="pos", select =c(glucose, mass, age)))# Calculate group covariances cov_neg <-cov(X_neg) cov_pos <-cov(X_pos)# Pool them based on degrees of freedom n_neg <-nrow(X_neg) n_pos <-nrow(X_pos) pooled_cov <- ((n_neg -1) * cov_neg + (n_pos -1) * cov_pos) / (n_neg + n_pos -2)print(pooled_cov)})
glucose mass age
glucose 762.18669 22.93965 53.00416
mass 22.93965 59.05123 -3.09117
age 53.00416 -3.09117 128.83176
This pooled covariance matrix (\(\Sigma\)) represents the foundational geometric framework that your Linear Discriminant Analysis (LDA) model uses to understand the relationships between your predictors across the entire dataset.
In LDA, the model assumes that both groups (pos and neg diabetes cohorts) share this exact same covariance structure. It uses these values to calculate Mahalanobis distances and to project the data into a lower-dimensional space that maximizes class separation.
The diagonal elements (\(762.19\), \(59.05\), and \(128.83\)) represent the variance (\(\sigma^2\)) for glucose, mass, and age respectively. Taking the square root of these values gives us their pooled standard deviations (\(\sigma\)):
glucose:\(\sigma = \sqrt{762.18669} \approx 27.61\) units
age:\(\sigma = \sqrt{128.83176} \approx 11.35\) years
mass:\(\sigma = \sqrt{59.05123} \approx 7.68\) units
Statistical Implications for the Model
The massive scale discrepancy between the variance of glucose (\(762.19\)) and mass (\(59.05\)) explains why looking at unstandardized coefficients in a model can be highly misleading. Because glucose spans a much larger numerical range, its unstandardized effect size or discriminant coefficient will appear artificially small compared to mass, even though its predictive power might be equivalent or greater. LDA accounts for these varied variances by standardizing the space dynamically via this matrix.
Analyzing the Off-Diagonal Elements: Covariances and Directions
The off-diagonal elements represent the covariances (\(\sigma_{jk}\)) between pairs of features. They tell us how these clinical metrics move together across the pooled sample.
glucose and age (\(\sigma_{13} = 53.00416\))
The Finding: There is a pronounced, positive covariance between a patient’s plasma glucose concentration and their age.
Model Impact: As age increases, glucose levels systematically shift upward. LDA uses this positive covariance to create a slanted decision boundary rather than an axis-aligned one. If a patient is older, the model “expects” a higher baseline glucose level, adjusting its classification threshold accordingly.
glucose and mass (\(\sigma_{12} = 22.93965\))
The Finding: There is a moderate positive covariance between glucose levels and body mass index.
Model Impact: Higher body mass shifts the expected distribution of glucose upward, indicating that these two features carry some redundant structural information. LDA’s mathematical formulation naturally handles this multicollinearity by penalizing overlapping variance.
mass and age (\(\sigma_{23} = -3.09117\))
The Finding: There is a small, negative covariance between body mass index and age in this sample.
Model Impact: This is a fascinating architectural nuance. Holding glucose constant, an increase in age is slightly associated with a decrease in body mass index within this pooled framework. Because this value is quite close to zero relative to the variances, it suggests that the linear relationship between age and mass is nearly orthogonal (uncorrelated) once pooled across classes.
How to Frame This in a Presentation or Paper
When writing up your methods or discussing this output in a lecture notebook, you can synthesize these findings to explain why LDA handles the data the way it does:
“The empirical pooled covariance matrix (\(\Sigma\)) demonstrates that the feature space is highly anisotropic, dominated by a large variance in plasma glucose (\(\sigma^2 = 762.19\)) relative to body mass index (\(\sigma^2 = 59.05\)). Furthermore, the positive covariances (\(\sigma_{\text{glucose, age}} = 53.00\); \(\sigma_{\text{glucose, mass}} = 22.94\)) establish that these clinical predictors are non-orthogonal. LDA successfully leverages these off-diagonal structures to construct an optimal linear decision boundary that accounts for feature correlation, effectively preventing redundant biometric signals from doubly influencing the posterior probability allocations.”
The Next Diagnostic Step:
To turn this into a rigorous critique of the model, your next step should be to calculate the class-specific covariance matrices (\(\Sigma_{\text{pos}}\) and \(\Sigma_{\text{neg}}\)) separately. If those two individual matrices diverge significantly from this pooled matrix (e.g., if the covariance between glucose and age is negative in one class but highly positive in another), it means the LDA model’s homoscedastic assumption is violated, providing a strong mathematical justification for switching to Quadratic Discriminant Analysis (QDA).
The QDA Class-Specific Covariance Matrices
QDA relaxes this assumption and calculates an individual covariance matrix (\(\Sigma_k\)) for each class. R stores the scaling of these matrices within the fit_qda object as singular values (scaling). You can look at the raw empirical matrices to evaluate if QDA is justified:
glucose mass age
glucose 661.20397 30.490872 62.531921
mass 30.49087 60.516725 2.281382
age 62.53192 2.281382 140.567540
print(cov_pos_actual)
glucose mass age
glucose 950.519453 8.856609 35.23490
mass 8.856609 56.318078 -13.11098
age 35.234900 -13.110980 106.94453
This complete structural layout gives you everything required for a defense of why one discriminant model should be favored over the other.
By stacking the Group Means, the Pooled Covariance Matrix (\(\Sigma\)), and the Class-Specific Covariance Matrices (\(\Sigma_{neg}\) and \(\Sigma_{pos}\)) side by side, we can explicitly diagnose the geometric and statistical trade-offs between LDA and QDA.
Identifying the Univariate Signals
The class conditional means \(\mu_k\) are identical across both models and isolate where the center of mass sits for each population:
The Drivers: Plasma glucose shows massive univariate separation, jumping from 109.72 in the non-diabetic cohort to 142.12 in the diabetic cohort. Similarly, mass pushes from 30.16 to 34.88.
The Geometry: Because the mean vectors differ significantly across all four features, the center of the two multivariate normal distributions are distinctly separated in \(\mathbb{R}^4\).
Comparing Covariance Architectures: Testing the Homoscedastic Assumption
The core mathematical distinction between LDA and QDA hinges on whether you can justify pooling the variance structures. Let’s critique the empirical covariance matrices (\(3 \times 3\) subsets for glucose, mass, and age) to evaluate this choice.
Spatial Expansion (The Diagonal Variances)
Looking at the standalone variances (\(\sigma^2\)) on the main diagonals reveals a stark structural divergence between the two classes:
The Takeaway: The variance of glucose expands by nearly 44% within the positive cohort. Conversely, the variance of age shrinks significantly.
The LDA Flaw: LDA forces these values into a pooled compromise (\(\sigma^2_{\text{glucose}} = 762.19\)). This means LDA systematically overestimates the variance of glucose for healthy patients and underestimates it for diabetic patients.
Rotational Shifts (The Off-Diagonal Covariances)
The off-diagonal terms (\(\sigma_{jk}\)) define the orientation and rotation of the distribution ellipses in space. Comparing the classes reveals a complete breakdown of parallel structure:
glucose and mass: In the healthy cohort, they share a strong positive covariance of +30.49. In the diabetic cohort, this relationship collapses entirely to +8.86 (nearly orthogonal).
mass and age: In the healthy cohort, mass and age have a slight positive covariance (+2.28). In the diabetic cohort, they shift into a sharp negative covariance (-13.11).
Structural & Geometric Synthesis
Why LDA’s Assumptions Fail Visually
Because the orientation (covariances) and size (variances) shift drastically between neg and pos, the true underlying distributions are highly heteroscedastic.
Geometrically, the healthy population forms a tighter, distinct ellipse rotated along a strong glucose-mass axis. The diabetic population forms a massive, highly dispersed hyper-ellipsoid where mass and age are negatively aligned, and glucose variance is highly volatile.
When LDA forces a pooled covariance matrix (\(\Sigma\)), it assumes these two ellipsoids are identical in size and orientation, differing only in their center points. Visually, LDA forces a single, average “compromise” orientation onto both groups, resulting in a rigid, straight decision line.
Why QDA is Mathematically Justified
QDA relaxes this constraint and assigns each group its unique covariance matrix (\(\Sigma_k\)). Because \(\Sigma_{neg} \neq \Sigma_{pos}\), the quadratic terms in the log-posterior odds do not cancel out. This allows the decision boundary to bend into a curved, parabolic hyperplane that bends around the tighter healthy cohort and expands outward to accommodate the highly dispersed diabetic cohort.
The Parameter Penalty Caveat (The Bias-Variance Tradeoff)
While QDA is clearly better aligned with the true physics of this data, it comes at a steep cost in parameter complexity.
With \(p = 4\) predictors and \(K = 2\) classes:
LDA estimates a single pooled covariance matrix requiring only \(\frac{p(p+1)}{2} = \mathbf{10}\) covariance parameters.
QDA must estimate a separate covariance matrix for each class, requiring \(K \times \frac{p(p+1)}{2} = \mathbf{20}\) covariance parameters.
Conclusion Given that our training set is sufficiently large (\(N \approx 576\) observations in a 75% split), you have enough data points per class to stably estimate these 20 parameters without suffering from the curse of dimensionality. Therefore, because the class-specific matrices provide undeniable empirical evidence of heteroscedasticity, QDA holds the stronger theoretical justification for this dataset, even if its out-of-sample predictive returns are closely matched by a parsimonious Logistic Regression model.
Visualizing Covariance Ellipses
To ground this discussion visually, you can use the GGally package or standard ggplot2 to draw data density ellipses for each class. This lets you see if the data matches the geometric assumptions of the models.
library(ggplot2)ggplot(train_df, aes(x = glucose, y = mass, color = diabetes)) +geom_point(alpha =0.4) +# Draw normal data ellipses (representing the covariance shape)stat_ellipse(type ="norm", linetype =2, size =1) +labs(title ="Visualizing Class Covariance Structures",subtitle ="Parallel ellipses suggest LDA; differing orientations/sizes suggest QDA") +theme_minimal()
Visual Diagnostics:
If the ellipses have identical orientations and sizes: The homoscedastic assumption holds. LDA is perfectly sufficient, and QDA will likely overfit.
If the ellipses are rotated differently or one is significantly larger than the other: The covariance structures are distinct. LDA’s linear boundary will suffer from systemic bias, justifying QDA’s quadratic complexity.
3. Out-of-Sample Evaluation
Now, we evaluate performance on the held-out test set. We will extract the predicted probabilities for the “pos” (diabetic) class across all three architectures.
# Generate predictions on the test datasetevaluation_results <- test_df %>%mutate(# Probabilities (Target class is "pos")prob_logistic =predict(fit_logistic, newdata = ., type ="response"),prob_lda =predict(fit_lda, newdata = .)$posterior[, "pos"],prob_qda =predict(fit_qda, newdata = .)$posterior[, "pos"],# Classifications (using a standard 0.5 decision threshold)class_logistic =factor(ifelse(prob_logistic >0.5, "pos", "neg"), levels =c("neg", "pos")),class_lda =predict(fit_lda, newdata = .)$class,class_qda =predict(fit_qda, newdata = .)$class )
Comparing Predictive Behavior via gtsummary
Let’s construct a summary showing how often each model’s predictions align with actual outcome data.
Table 3: Out-of-Sample Classification Accuracy Across Models
Characteristic
N = 1931
Logistic Regression Success
Correct
149 (77%)
Incorrect
44 (23%)
LDA Success
Correct
148 (77%)
Incorrect
45 (23%)
QDA Success
Correct
142 (74%)
Incorrect
51 (26%)
1 n (%)
Beware Class Imbalance
Notice that all three models hover around a 75-80% success rate. While class imbalance is less extreme here than in default tracking (roughly ~35% of this population is positive for diabetes), raw accuracy still masks class-specific performance. Look beyond simple accuracy to Area Under the ROC Curve (AUC), Sensitivity (True Positive Rate), and Specificity (True Negative Rate).
4. Visualizing Performance with ggplot2
To truly evaluate these models, let’s look at how cleanly they separate the classes by plotting their predicted probabilities against the true diabetes state.
# Reshape data for clean faceting in ggplot2density_data <- evaluation_results %>%select(diabetes, prob_logistic, prob_lda, prob_qda) %>%pivot_longer(cols =starts_with("prob_"),names_to ="Model",names_prefix ="prob_",values_to ="Predicted_Probability" ) %>%mutate(Model =toupper(Model))# Create the density visualizationggplot(density_data, aes(x = Predicted_Probability, fill = diabetes)) +geom_density(alpha =0.5) +facet_wrap(~Model, scales ="free_y") +scale_fill_manual(values =c("#67a9cf", "#ef8a62")) +labs(title ="Distribution of Predicted Probabilities by True Class",subtitle ="Evaluating how effectively classifiers push diabetes instances toward 1 and non-diabetes toward 0",x ="Predicted Probability of Diabetes",y ="Density",fill ="Actual Status" ) +theme_minimal() +theme(legend.position ="bottom", strip.text =element_text(face ="bold", size =11))
ROC Curve Comparison
Let’s construct Receiver Operating Characteristic (ROC) curves to evaluate performance across all potential decision thresholds simultaneously.
Looking closely at both the classification summaries and the ROC visualization:
Logistic Regression and LDA perform almost identically (\(\text{AUC} \approx 0.80\)). This occurs because the true underlying log-odds relationship is highly linear with respect to balance, and the variance structures do not shift violently between classes.
QDA shows a marginal drop in performance or a comparable AUC but introduces a higher risk of variance because it estimates parameters for multiple covariance matrices.
Verdict: Given the principle of parsimony (Occam’s Razor), Logistic Regression or LDA is preferred here over QDA. Logistic regression wins a slight practical edge because we can easily interpret its coefficients as conditional odds ratios and the fact that LDA is misspecified.