Quadratic Discriminant Analysis

Author

Avery Holloman

Quadratic Discriminant Analysis for eCommerce Logistic Shipping

In my exploration of classification techniques for eCommerce logistic shipping, I found Quadratic Discriminant Analysis (QDA) particularly intriguing. While Linear Discriminant Analysis (LDA) assumes that observations within each class are drawn from a multivariate Gaussian distribution with a shared covariance matrix, QDA relaxes that assumption: each class has its own covariance matrix, providing greater flexibility when classifying data with differing variance structures.

QDA Assumptions and Mechanics

QDA assumes that observations within each shipping category—such as “On-Time,” “Delayed,” or “At Risk”—are drawn from Gaussian distributions. Each category $k$ is characterized by its own mean vector $\mu_k$ and covariance matrix $\Sigma_k$. Using Bayes’ theorem, QDA calculates the posterior probability for each class and assigns a shipment to the class that maximizes the discriminant function:

$$\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert - \tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$$

Here, the prior probability $\pi_k$ of each category plays an essential role in classification. Unlike LDA, where the discriminant function is linear in $x$, the QDA function includes quadratic terms, because QDA accounts for class-specific covariance matrices. This flexibility makes QDA ideal for problems where shipping categories, such as “On-Time” and “Delayed,” exhibit varying relationships among features like shipment distance, package weight, and processing time.
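To make these mechanics concrete, below is a minimal R sketch that evaluates the discriminant for a single class. The feature values, class mean, covariance matrix, and prior are made-up numbers for illustration, not estimates from real shipping data.

# Minimal sketch of the QDA discriminant for one class k
# (all inputs below are assumed illustrative values)
qda_discriminant <- function(x, mu_k, Sigma_k, pi_k) {
  -0.5 * log(det(Sigma_k)) -
    0.5 * t(x - mu_k) %*% solve(Sigma_k) %*% (x - mu_k) +
    log(pi_k)
}

x       <- c(120, 4.5)                        # one shipment: distance, weight
mu_k    <- c(100, 5.0)                        # assumed mean for "On-Time"
Sigma_k <- matrix(c(400, 20, 20, 4), 2, 2)    # assumed covariance for "On-Time"
pi_k    <- 0.6                                # assumed prior for "On-Time"
qda_discriminant(x, mu_k, Sigma_k, pi_k)      # compute per class; the largest value wins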

Choosing Between QDA and LDA

Deciding whether to use QDA or LDA depends on the trade-off between bias and variance, as well as the dataset’s characteristics. LDA assumes a shared covariance matrix across all classes, which simplifies the model and reduces the number of parameters that need to be estimated. This makes LDA particularly effective for smaller training datasets since it keeps variance low. However, when the assumption of a shared covariance matrix does not hold, LDA suffers from high bias, resulting in less accurate classifications.

In contrast, QDA does not assume a shared covariance matrix and instead estimates a separate covariance matrix for each class. While this approach provides greater flexibility and captures more complex patterns in the data, it also requires estimating far more parameters: with $p$ predictors, each covariance matrix contributes $p(p+1)/2$ parameters, so $K$ classes require $Kp(p+1)/2$ of them, versus a single $p(p+1)/2$ for LDA’s shared matrix. As a result, QDA tends to have higher variance and may perform poorly with limited data. QDA is most effective for larger datasets or when it is evident that each shipping category has a distinct covariance structure.
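A quick helper function makes this scaling concrete. It is just arithmetic on the formulas above, not part of any library:

# Sketch: covariance parameters each model must estimate,
# for p predictors and K classes
covariance_params <- function(p, K) {
  c(LDA = p * (p + 1) / 2,       # one shared covariance matrix
    QDA = K * p * (p + 1) / 2)   # one covariance matrix per class
}
covariance_params(p = 3, K = 3)   # e.g., distance, weight, processing time
covariance_params(p = 10, K = 3)  # the gap widens quickly as predictors grow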

Application to Logistic Shipping

Let’s consider a scenario where I classify shipping outcomes, such as “On-Time” or “Delayed,” using predictors like shipping distance and package weight:

  • Scenario 1: If all shipping categories share a similar relationship between distance and weight, such as a strong positive correlation, the true decision boundary is linear. In this situation, LDA would perform well because its simplicity closely aligns with the underlying structure.

  • Scenario 2: If one category (e.g., “Delayed”) shows a positive correlation between distance and weight, while another category (e.g., “On-Time”) exhibits a negative correlation, the decision boundary becomes curved and nonlinear. Here, QDA would outperform LDA because it can account for these more complex, quadratic patterns in the data, as the simulation sketch below illustrates.
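The following R sketch simulates Scenario 2 and compares the two models on held-out data. The means, correlations, and sample sizes are arbitrary assumptions chosen only to make the contrast visible.

# Sketch of Scenario 2: opposite correlation structures per class
library(MASS)
set.seed(1)
Sigma_pos <- matrix(c(1,  0.8,  0.8, 1), 2, 2)   # "Delayed": positive correlation
Sigma_neg <- matrix(c(1, -0.8, -0.8, 1), 2, 2)   # "On-Time": negative correlation

delayed <- mvrnorm(200, mu = c(0, 0), Sigma = Sigma_pos)
on_time <- mvrnorm(200, mu = c(1, 1), Sigma = Sigma_neg)
sim <- data.frame(
  Distance = c(delayed[, 1], on_time[, 1]),
  Weight   = c(delayed[, 2], on_time[, 2]),
  Outcome  = factor(rep(c("Delayed", "On-Time"), each = 200))
)

idx   <- sample(nrow(sim), 0.7 * nrow(sim))
train <- sim[idx, ]
test  <- sim[-idx, ]

lda_acc <- mean(predict(lda(Outcome ~ ., train), test)$class == test$Outcome)
qda_acc <- mean(predict(qda(Outcome ~ ., train), test)$class == test$Outcome)
c(LDA = lda_acc, QDA = qda_acc)  # QDA typically wins under these assumptions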

Illustration of Decision Boundaries

To better understand the differences, imagine the following scenarios:

  1. Linear Scenario: Suppose both “On-Time” and “Delayed” shipments have a consistent correlation between distance and weight. In this case, the Bayes decision boundary, which represents the theoretical optimal separation, is linear. LDA would closely approximate this boundary, while QDA might unnecessarily add complexity, leading to overfitting and poorer generalization (see the small-sample sketch after this list).

  2. Quadratic Scenario: Now consider a case where “On-Time” shipments display a positive correlation between distance and weight, while “Delayed” shipments show a negative correlation. This results in a curved Bayes decision boundary. Here, QDA would excel because its quadratic functions can more accurately model this nonlinear structure, significantly outperforming LDA.
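The sketch below illustrates the linear scenario’s overfitting risk: both classes share one covariance matrix, the training set is kept deliberately small, and generalization is measured on a large test set. Every number here is again an illustrative assumption.

# Sketch of the linear scenario: shared covariance, small training set
library(MASS)
set.seed(2)
Sigma_shared <- matrix(c(1, 0.6, 0.6, 1), 2, 2)

make_set <- function(n_per_class) {
  a <- mvrnorm(n_per_class, mu = c(0, 0),     Sigma = Sigma_shared)
  b <- mvrnorm(n_per_class, mu = c(1.5, 1.5), Sigma = Sigma_shared)
  data.frame(Distance = c(a[, 1], b[, 1]),
             Weight   = c(a[, 2], b[, 2]),
             Outcome  = factor(rep(c("On-Time", "Delayed"), each = n_per_class)))
}

small_train <- make_set(15)    # few shipments per class
big_test    <- make_set(2000)  # large set to estimate true error

c(LDA = mean(predict(lda(Outcome ~ ., small_train), big_test)$class == big_test$Outcome),
  QDA = mean(predict(qda(Outcome ~ ., small_train), big_test)$class == big_test$Outcome))
# With a linear Bayes boundary and little data, QDA's extra flexibility often hurts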

Summary

QDA provides a flexible and effective approach for classification problems in eCommerce logistic shipping, especially when categories exhibit distinct covariance structures. However, its utility depends on the dataset size and complexity. For smaller datasets, LDA’s simplicity and lower variance make it a better choice. On the other hand, for larger datasets with evident differences in covariance patterns, QDA offers superior performance by closely approximating complex decision boundaries, ensuring more accurate classifications.

# Load necessary libraries for QDA analysis and visualization
library(MASS)       # For the qda() implementation
library(ggplot2)    # For advanced visualization

# Simulating shipping data for three outcome categories
set.seed(42)  # Setting a seed so I can reproduce this exact dataset
n <- 300      # Total number of points
categories <- c("On-Time", "Delayed", "At Risk")  # Shipping outcome classes
colors <- c("red", "green", "blue")

# Create synthetic data with clear separation for QDA
data <- data.frame(
  Distance = c(rnorm(n / 3, mean = 50, sd = 5), rnorm(n / 3, mean = 70, sd = 5), rnorm(n / 3, mean = 60, sd = 5)),
  Weight   = c(rnorm(n / 3, mean = 40, sd = 3), rnorm(n / 3, mean = 30, sd = 3), rnorm(n / 3, mean = 50, sd = 3)),
  Category = factor(rep(categories, each = n / 3))
)

# I need to split the data into training and test sets
set.seed(100)  # Another seed to ensure my training/testing split is consistent
train_indices <- sample(1:n, size = 0.8 * n)  # Using 80% of the data for training
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]

# Applying QDA to the training data
qda_model <- qda(Category ~ Distance + Weight, data = train_data)

# Predicting on a grid to plot decision boundaries
x_grid <- seq(min(data$Distance) - 5, max(data$Distance) + 5, length.out = 200)
y_grid <- seq(min(data$Weight) - 5, max(data$Weight) + 5, length.out = 200)
grid <- expand.grid(Distance = x_grid, Weight = y_grid)

# Generating predictions for each grid point
grid$Prediction <- predict(qda_model, newdata = grid)$class

# Plotting the data and decision boundaries
# I decided to use ggplot2 here because it gives me better control over aesthetics
ggplot(data, aes(x = Distance, y = Weight, color = Category)) +
  geom_point(size = 2, alpha = 0.7) +  # Plotting actual data points
  geom_contour(
    data = grid,
    aes(x = Distance, y = Weight, z = as.numeric(Prediction)),
    breaks = c(1.5, 2.5),  # Contours where the predicted class index changes, i.e., the boundaries
    color = "cyan",
    inherit.aes = FALSE    # The grid has no Category column, so don't inherit the color mapping
  ) +
  scale_color_manual(values = colors) +  # Matching the colors of the points
  labs(
    title = "QDA Decision Boundaries",
    x = "Shipping Distance (km)",
    y = "Package Weight (kg)"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")  # Moving the legend below the plot

# Interpretation of the results
# I noticed that QDA accurately captures the nonlinear separation between the categories.
# For instance, the cyan boundaries show that the algorithm accounts for curved separation,
# which aligns with the differences in the covariance structures of the categories.
# The data points cluster tightly around their respective classes, suggesting
# the simulated data aligns well with the QDA assumptions of Gaussian distributions.
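One gap in the script above: the 20% test split is created but never used. The short, assumed follow-up below evaluates the fitted model out of sample, which is a more honest check than the plot alone.

# Evaluating the QDA model on the held-out 20% of shipments
test_pred <- predict(qda_model, newdata = test_data)$class
table(Predicted = test_pred, Actual = test_data$Category)  # Confusion matrix
mean(test_pred == test_data$Category)                      # Overall test accuracy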