1 Introduction

1.1 Research Context

Solar storms large bursts of electromagnetic energy originating from the sun are among the most consequential natural phenomena for modern technological infrastructure. When high-intensity solar flares interact with Earth’s geomagnetic field, they induce electrical currents in ground-based systems, potentially disrupting or permanently damaging power transmission grids. The ability to classify the severity of a disruption before or immediately after a storm event is therefore of direct operational relevance for grid operators, satellite managers, and emergency response planners.

This document applies Linear Discriminant Analysis (LDA), a multivariate classification framework rooted in the Bayesian minimum expected cost principle to the Solar Storm Impact Dataset, which records 1,000 solar storm events across five continuous geophysical predictors. The central analytical objective is to construct a discriminant function that accurately separates storm events into three ordered disruption severity classes (no disruption, moderate disruption, severe disruption) based on simultaneously observed solar and geomagnetic measurements.

LDA is selected as the analytical method because the response variable is nominal-categorical with three ordered classes, the predictors are all continuous and measured on interval/ratio scales, satisfying the structural requirements established in the lecture notes, and the method provides both a classification rule and an interpretable linear function that reveals which geophysical indicators most strongly discriminate between disruption severities, a finding with direct scientific utility.

1.2 Variable Structure

The analysis follows a one-factor, between-events design with the following configuration:

  • Dependent Variable (Y - Grouping): power_grid_disruption, a nominal factor with three levels:
    • Class 0 - No disruption
    • Class 1 - Moderate disruption
    • Class 2 - Severe disruption
  • Predictor Variables (X - Continuous): Five solar storm measurements on interval/ratio scales:
    • flare_intensity - Peak solar flare intensity (flux units)
    • geomagnetic_index_Kp - Kp geomagnetic storm index (0–9 scale)
    • solar_wind_speed - Solar wind velocity (km/s)
    • solar_wind_density - Solar wind proton density (p/cm³)
    • flare_duration_minutes - Duration of the solar flare (minutes)
  • Unit of Analysis: Individual solar storm events, each observed independently.

2 Data Preparation

2.1 Loading and Encoding

df <- read.csv("solar_storm_impact_dataset.csv")

pred_cols <- c("flare_intensity", "geomagnetic_index_Kp", "solar_wind_speed",
               "solar_wind_density", "flare_duration_minutes")
target    <- "power_grid_disruption"

df[[target]] <- as.factor(df[[target]])

cat("Dataset dimensions:", nrow(df), "rows x", ncol(df), "columns\n")
cat("Predictor variables:", length(pred_cols), "\n")
cat("Class levels:", paste(levels(df[[target]]), collapse = ", "), "\n\n")

cat("Class distribution:\n")
print(table(df[[target]]))
cat("\nClass proportions:\n")
print(round(prop.table(table(df[[target]])), 4))
## Dataset dimensions: 1000 rows x 9 columns
## Predictor variables: 5 
## Class levels: 0, 1, 2 
## 
## Class distribution:
## 
##   0   1   2 
## 413 453 134 
## 
## Class proportions:
## 
##     0     1     2 
## 0.413 0.453 0.134

The dataset contains 1,000 solar storm events with no missing values across any variable. The response variable exhibits a moderately imbalanced three-class distribution: Class 0 (no disruption) accounts for 41.3% of events, Class 1 (moderate disruption) for 45.3%, and Class 2 (severe disruption) for the remaining 13.4%. The class imbalance for severe disruption is expected given the rarity of extreme space weather events. In the LDA framework this is incorporated through prior probabilities, which will be set proportional to observed class frequencies (empirical priors) rather than assuming equal priors, ensuring that the classification rule reflects the real-world base rates of disruption severity.

2.2 Descriptive Statistics by Group

desc_group <- df %>%
  group_by(power_grid_disruption) %>%
  summarise(
    n                     = n(),
    mean_intensity        = round(mean(flare_intensity), 3),
    sd_intensity          = round(sd(flare_intensity), 3),
    mean_kp               = round(mean(geomagnetic_index_Kp), 3),
    sd_kp                 = round(sd(geomagnetic_index_Kp), 3),
    mean_wind_speed       = round(mean(solar_wind_speed), 3),
    sd_wind_speed         = round(sd(solar_wind_speed), 3),
    mean_wind_density     = round(mean(solar_wind_density), 3),
    sd_wind_density       = round(sd(solar_wind_density), 3),
    mean_duration         = round(mean(flare_duration_minutes), 3),
    sd_duration           = round(sd(flare_duration_minutes), 3)
  )

desc_group %>%
  kable(caption = "Group Centroids and Standard Deviations by Disruption Class",
        col.names = c("Class", "n",
                      "Mean Intensity", "SD Intensity",
                      "Mean Kp", "SD Kp",
                      "Mean Wind Speed", "SD Wind Speed",
                      "Mean Wind Density", "SD Wind Density",
                      "Mean Duration", "SD Duration")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E")
Group Centroids and Standard Deviations by Disruption Class
Class n Mean Intensity SD Intensity Mean Kp SD Kp Mean Wind Speed SD Wind Speed Mean Wind Density SD Wind Density Mean Duration SD Duration
0 413 13.798 8.418 0.632 0.785 473.912 97.082 9.803 4.896 86.053 49.756
1 453 47.048 16.081 1.086 0.945 519.036 98.602 9.724 4.985 90.724 51.155
2 134 86.860 9.201 2.179 1.061 529.312 92.582 9.937 4.535 93.321 50.601
grand_means <- round(colMeans(df[, pred_cols]), 3)

desc_full <- do.call(rbind, lapply(pred_cols, function(v) {
  g0 <- df[[v]][df[[target]] == "0"]
  g1 <- df[[v]][df[[target]] == "1"]
  g2 <- df[[v]][df[[target]] == "2"]
  data.frame(
    Variable    = v,
    Grand_Mean  = round(mean(df[[v]]), 3),
    Mean_Class0 = round(mean(g0), 3),
    Mean_Class1 = round(mean(g1), 3),
    Mean_Class2 = round(mean(g2), 3),
    SD_Class0   = round(sd(g0), 3),
    SD_Class1   = round(sd(g1), 3),
    SD_Class2   = round(sd(g2), 3)
  )
}))

desc_full %>%
  kable(caption = "Detailed Descriptive Statistics: All Predictors Across Three Disruption Classes") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2, bold = TRUE, color = "#2C3E50") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E")
Detailed Descriptive Statistics: All Predictors Across Three Disruption Classes
Variable Grand_Mean Mean_Class0 Mean_Class1 Mean_Class2 SD_Class0 SD_Class1 SD_Class2
flare_intensity 38.651 13.798 47.048 86.860 8.418 16.081 9.201
geomagnetic_index_Kp 1.045 0.632 1.086 2.179 0.785 0.945 1.061
solar_wind_speed 501.777 473.912 519.036 529.312 97.082 98.602 92.582
solar_wind_density 9.785 9.803 9.724 9.937 4.896 4.985 4.535
flare_duration_minutes 89.143 86.053 90.724 93.321 49.756 51.155 50.601

The group centroid comparison already hints at the discriminant structure. flare_intensity and geomagnetic_index_Kp both show a consistent positive gradient across disruption classes - their means increase monotonically from Class 0 through Class 2 suggesting these are the primary discriminating variables. solar_wind_speed and flare_duration_minutes exhibit more modest inter-class differences, while solar_wind_density shows mixed directionality. The question MANOVA and LDA address jointly is whether the multivariate combination of these five predictors produces a statistically reliable and geometrically separable discriminant space.


3 Theoretical Framework of LDA

3.1 The Discriminant Analysis Model

Linear Discriminant Analysis seeks a linear combination of predictor variables \(\mathbf{x} = (x_1, x_2, \ldots, x_p)'\) that maximally separates \(g\) pre-defined populations \(\pi_1, \pi_2, \ldots, \pi_g\). Each observation is assigned to the population for which the posterior probability is greatest:

\[p(\pi_k \mid \mathbf{x}) = \frac{p_k f_k(\mathbf{x})}{\sum_{j=1}^{g} p_j f_j(\mathbf{x})}, \quad k = 1, 2, \ldots, g\]

where: - \(p_k\) = prior probability of belonging to population \(k\) - \(f_k(\mathbf{x})\) = multivariate density function for population \(k\)

Under the assumption that each population follows a multivariate normal distribution with a common covariance matrix \(\boldsymbol{\Sigma}\):

\[f_k(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_k)\right)\]

the classification rule simplifies to a linear discriminant function, a linear boundary in the \(p\)-dimensional predictor space.

3.2 Assumptions

Two core assumptions must hold for LDA to yield optimal (minimum ECM) classification:

  1. Multivariate Normality: Each predictor vector \(\mathbf{x}\) is distributed as multivariate normal within each class \(\mathbf{x} \mid \pi_k \sim N_p(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})\).

  2. Homogeneity of Covariance Matrices: All \(g\) populations share the same variance-covariance matrix \(\boldsymbol{\Sigma}\) formally \(\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_g = \boldsymbol{\Sigma}\). This is tested via Box’s M Test.

If assumption (2) is violated, Quadratic Discriminant Analysis (QDA) which allows class-specific covariance matrices is the appropriate alternative.

3.3 Hypotheses

Null hypothesis (for the multivariate mean separation): \[H_0: \boldsymbol{\mu}_0 = \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2\]

The mean vectors of all three disruption severity classes are equal across all five predictor dimensions simultaneously.

Alternative hypothesis: \[H_1: \text{At least one } \boldsymbol{\mu}_k \text{ differs from the others}\]

Decision rule (minimum ECM, equal costs and equal priors):

Allocate observation \(\mathbf{x}_0\) to class \(\pi_k\) if: \[p_k f_k(\mathbf{x}_0) > p_i f_i(\mathbf{x}_0) \quad \text{for all } i \neq k\]

3.4 Fisher’s Approach for Multiple Groups

For \(g > 2\) populations, Fisher’s approach extracts linear discriminant functions as eigenvectors of the matrix \(\mathbf{W}^{-1}\mathbf{B}\), where:

  • Between-group SSCP matrix (B): \[\mathbf{B} = \sum_{i=1}^{g} (\bar{\mathbf{x}}_i - \bar{\mathbf{x}})(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})'\]

  • Within-group SSCP matrix (W): \[\mathbf{W} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)(\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)'\]

The number of discriminant functions extracted is \(s = \min(p, g-1)\). For this analysis with \(p = 5\) predictors and \(g = 3\) classes, \(s = \min(5, 2) = \mathbf{2}\) discriminant functions (LD1 and LD2).

Classification rule: Allocate \(\mathbf{x}_0\) to class \(\pi_k\) if: \[\sum_{j=1}^{s} (\hat{y}_j - \bar{y}_{kj})^2 \leq \sum_{j=1}^{s} (\hat{y}_j - \bar{y}_{ij})^2 \quad \text{for all } i \neq k\]

where \(\hat{y}_j = \hat{\mathbf{a}}_j' \mathbf{x}_0\) and \(\bar{y}_{kj} = \hat{\mathbf{a}}_j' \bar{\mathbf{x}}_k\) that is, the observation is assigned to the group whose centroid is nearest in the discriminant space.


4 Assumption Testing

4.1 Assumption 1: Multivariate Normality

# Mardia's test via manual Mahalanobis distance for each group
# Using Royston's approximation via the MVN package if available,
# otherwise fallback to per-variable Shapiro-Wilk with interpretation note

cat("Univariate Normality per Predictor per Class (Shapiro-Wilk)\n")
cat("Note: Shapiro-Wilk is used as a per-variable screening step.\n",
    "For n > 50 per class, the CLT provides robustness against mild non-normality.\n\n")

sw_results <- do.call(rbind, lapply(pred_cols, function(v) {
  do.call(rbind, lapply(c("0","1","2"), function(k) {
    x       <- df[[v]][df[[target]] == k]
    sw_test <- shapiro.test(x[1:min(length(x), 5000)])
    data.frame(
      Predictor = v,
      Class     = k,
      n         = length(x),
      W         = round(sw_test$statistic, 4),
      p_value   = round(sw_test$p.value, 4),
      Normal    = ifelse(sw_test$p.value > 0.05, "Yes", "No")
    )
  }))
}))

sw_results %>%
  kable(caption = "Shapiro-Wilk Normality Test: Per Predictor Per Class") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(6, bold = TRUE,
              color = ifelse(sw_results$Normal == "Yes", "#1E8449", "#C0392B")) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E")
## Univariate Normality per Predictor per Class (Shapiro-Wilk)
## Note: Shapiro-Wilk is used as a per-variable screening step.
##  For n > 50 per class, the CLT provides robustness against mild non-normality.
Shapiro-Wilk Normality Test: Per Predictor Per Class
Predictor Class n W p_value Normal
W flare_intensity 0 413 0.9580 0.0000 No
W1 flare_intensity 1 453 0.9859 0.0002 No
W2 flare_intensity 2 134 0.9455 0.0000 No
W3 geomagnetic_index_Kp 0 413 0.7513 0.0000 No
W11 geomagnetic_index_Kp 1 453 0.8576 0.0000 No
W21 geomagnetic_index_Kp 2 134 0.9169 0.0000 No
W4 solar_wind_speed 0 413 0.9962 0.4203 Yes
W12 solar_wind_speed 1 453 0.9981 0.8955 Yes
W22 solar_wind_speed 2 134 0.9874 0.2580 Yes
W5 solar_wind_density 0 413 0.9965 0.5082 Yes
W13 solar_wind_density 1 453 0.9964 0.4040 Yes
W23 solar_wind_density 2 134 0.9940 0.8488 Yes
W6 flare_duration_minutes 0 413 0.9536 0.0000 No
W14 flare_duration_minutes 1 453 0.9500 0.0000 No
W24 flare_duration_minutes 2 134 0.9524 0.0001 No
# Q-Q plots for two most discriminating predictors across classes
qq_plots <- lapply(c("flare_intensity", "geomagnetic_index_Kp"), function(v) {
  lapply(c("0","1","2"), function(k) {
    x <- df[[v]][df[[target]] == k]
    qqdf <- data.frame(
      theoretical = qqnorm(x, plot.it = FALSE)$x,
      sample      = qqnorm(x, plot.it = FALSE)$y
    )
    ggplot(qqdf, aes(x = theoretical, y = sample)) +
      geom_point(alpha = 0.5, size = 1.2, color = c("#E74C3C","#3498DB","#27AE60")[as.integer(k)+1]) +
      geom_abline(slope = sd(x), intercept = mean(x),
                  color = "#2C3E50", linewidth = 1) +
      theme_minimal(base_size = 10) +
      labs(title = paste0(v, "\n(Class ", k, ")"),
           x = "Theoretical Quantiles", y = "Sample Quantiles")
  })
})

grid.arrange(
  qq_plots[[1]][[1]], qq_plots[[1]][[2]], qq_plots[[1]][[3]],
  qq_plots[[2]][[1]], qq_plots[[2]][[2]], qq_plots[[2]][[3]],
  ncol = 3,
  top = "Normal Q-Q Plots: flare_intensity and geomagnetic_index_Kp by Disruption Class"
)

The Shapiro-Wilk results provide a per-variable normality screening. Variables with p > 0.05 at the univariate level are consistent with normality within that class. LDA is known to be fairly robust to moderate departures from multivariate normality, particularly when sample sizes per class are reasonably large (Hair et al., 2019). Given that all three classes have n > 100, mild violations detected here are not expected to substantially inflate misclassification rates.

4.2 Assumption 2: Homogeneity of Covariance Matrices (Box’s M Test)

# Box's M test via manual computation using the biotools package approach
# If biotools is unavailable, we compute the chi-squared approximation manually

tryCatch({
  library(biotools)
  boxm_result <- boxM(df[, pred_cols], df[[target]])
  cat("Box's M Test for Homogeneity of Covariance Matrices\n\n")
  print(boxm_result)
  cat(sprintf("\nDecision: %s\n",
      ifelse(boxm_result$p.value > 0.05,
             "Fail to reject H0 - covariance matrices are homogeneous. LDA is appropriate.",
             "Reject H0 - covariance matrices differ across classes.\n  Consider QDA as an alternative, or proceed with LDA noting reduced optimality.")))
}, error = function(e) {
  cat("Box's M test (manual chi-squared approximation):\n\n")

  g_levels <- levels(df[[target]])
  g_count  <- length(g_levels)
  p        <- length(pred_cols)

  cov_list <- lapply(g_levels, function(k) {
    cov(df[df[[target]] == k, pred_cols])
  })
  n_list <- as.numeric(table(df[[target]]))
  n_total <- sum(n_list)

  # Pooled covariance
  S_pooled <- Reduce("+", mapply(function(S, n) (n - 1) * S,
                                  cov_list, n_list, SIMPLIFY = FALSE)) /
              (n_total - g_count)

  # Box M statistic
  M <- (n_total - g_count) * log(det(S_pooled)) -
       sum(mapply(function(S, n) (n - 1) * log(det(S)), cov_list, n_list))

  df_box <- 0.5 * p * (p + 1) * (g_count - 1)
  u      <- sum(1 / (n_list - 1)) - 1 / (n_total - g_count)
  c1     <- u * (2 * p^2 + 3*p - 1) / (6 * (p + 1) * (g_count - 1))
  chi2   <- M * (1 - c1)
  p_val  <- pchisq(chi2, df_box, lower.tail = FALSE)

  cat(sprintf("Box's M statistic   : %.4f\n", M))
  cat(sprintf("Chi-squared approx  : %.4f\n", chi2))
  cat(sprintf("Degrees of freedom  : %d\n",   df_box))
  cat(sprintf("p-value             : %.4f\n", p_val))
  cat(sprintf("\nDecision: %s\n",
      ifelse(p_val > 0.05,
             "Fail to reject H0 - covariance matrices are homogeneous. LDA is appropriate.",
             "Reject H0 - covariance matrices differ. Consider QDA as alternative.")))
})
## Box's M Test for Homogeneity of Covariance Matrices
## 
## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  df[, pred_cols]
## Chi-Sq (approx.) = 239.3, df = 30, p-value < 2.2e-16
## 
## 
## Decision: Reject H0 - covariance matrices differ across classes.
##   Consider QDA as an alternative, or proceed with LDA noting reduced optimality.
# Visualise covariance structure per class
g_levels <- levels(df[[target]])
class_labels <- c("Class 0 - No Disruption", "Class 1 - Moderate", "Class 2 - Severe")

cov_plots <- lapply(seq_along(g_levels), function(i) {
  k    <- g_levels[i]
  cmat <- round(cov(df[df[[target]] == k, pred_cols]), 2)
  short_names <- c("Intensity","Kp","WindSpd","WindDen","Duration")
  rownames(cmat) <- colnames(cmat) <- short_names
  cmat_df <- as.data.frame(as.table(cmat))
  colnames(cmat_df) <- c("Row","Col","Value")

  ggplot(cmat_df, aes(x = Col, y = Row, fill = Value)) +
    geom_tile(color = "white", linewidth = 0.6) +
    geom_text(aes(label = round(Value, 1)), size = 3, fontface = "bold") +
    scale_fill_gradient2(low = "#EBF5FB", mid = "#AED6F1", high = "#1A5276",
                         midpoint = median(cmat_df$Value)) +
    theme_minimal(base_size = 10) +
    theme(axis.text.x    = element_text(angle = 30, hjust = 1),
          plot.title      = element_text(hjust = 0.5, face = "bold", size = 11),
          legend.position = "none") +
    labs(title = class_labels[i], x = "", y = "")
})

grid.arrange(grobs = cov_plots, ncol = 3,
             top = "Within-Class Covariance Matrices (Heatmap) - Testing Homogeneity")


5 SSCP Matrix Decomposition

5.1 Grand Mean and Group Centroids

Y         <- as.matrix(df[, pred_cols])
groups    <- df[[target]]
g_levels  <- levels(groups)
g         <- length(g_levels)
n_total   <- nrow(Y)
p         <- ncol(Y)

grand_mean  <- colMeans(Y)
group_means <- t(sapply(g_levels, function(k) colMeans(Y[groups == k, ])))

cat("Grand Mean Vector:\n")
print(round(grand_mean, 4))

cat("\nGroup Centroid Vectors:\n")
print(round(group_means, 4))

cat("\nCentroid Displacement (Class 2 - Class 0):\n")
print(round(group_means["2",] - group_means["0",], 4))
## Grand Mean Vector:
##        flare_intensity   geomagnetic_index_Kp       solar_wind_speed 
##                38.6509                 1.0450               501.7770 
##     solar_wind_density flare_duration_minutes 
##                 9.7852                89.1430 
## 
## Group Centroid Vectors:
##   flare_intensity geomagnetic_index_Kp solar_wind_speed solar_wind_density
## 0         13.7985               0.6320         473.9123             9.8029
## 1         47.0485               1.0861         519.0363             9.7242
## 2         86.8596               2.1791         529.3117             9.9373
##   flare_duration_minutes
## 0                86.0533
## 1                90.7241
## 2                93.3209
## 
## Centroid Displacement (Class 2 - Class 0):
##        flare_intensity   geomagnetic_index_Kp       solar_wind_speed 
##                73.0611                 1.5471                55.3994 
##     solar_wind_density flare_duration_minutes 
##                 0.1344                 7.2676

5.2 Between-Group SSCP Matrix (B)

B <- Reduce("+", lapply(g_levels, function(k) {
  n_k  <- sum(groups == k)
  diff <- colMeans(Y[groups == k, ]) - grand_mean
  n_k  * outer(diff, diff)
}))
rownames(B) <- colnames(B) <- pred_cols

cat("Between-Group SSCP Matrix (B), df_B =", g - 1, "\n\n")
print(round(B, 4))
cat("\nTrace(B)  =", round(sum(diag(B)), 4), "\n")
cat("det(B)    =", formatC(det(B), format = "e", digits = 4), "\n")
## Between-Group SSCP Matrix (B), df_B = 2 
## 
##                        flare_intensity geomagnetic_index_Kp solar_wind_speed
## flare_intensity            598458.4219           11722.0453      529534.1013
## geomagnetic_index_Kp        11722.0453             243.5730        9259.0249
## solar_wind_speed           529534.1013            9259.0249      557204.4958
## solar_wind_density            568.4264              18.9558        -120.1078
## flare_duration_minutes      64716.7642            1191.4081       63333.4182
##                        solar_wind_density flare_duration_minutes
## flare_intensity                  568.4264             64716.7642
## geomagnetic_index_Kp              18.9558              1191.4081
## solar_wind_speed                -120.1078             63333.4182
## solar_wind_density                 4.9188                18.8094
## flare_duration_minutes            18.8094              7414.0137
## 
## Trace(B)  = 1163325 
## det(B)    = -3.4714e-30

5.3 Within-Group SSCP Matrix (W)

W <- Reduce("+", lapply(g_levels, function(k) {
  Y_k   <- Y[groups == k, ]
  mu_k  <- colMeans(Y_k)
  swept <- sweep(Y_k, 2, mu_k)
  t(swept) %*% swept
}))
rownames(W) <- colnames(W) <- pred_cols
df_W <- n_total - g

cat("Within-Group SSCP Matrix (W), df_W =", df_W, "\n\n")
print(round(W, 4))
cat("\nTrace(W)  =", round(sum(diag(W)), 4), "\n")
cat("det(W)    =", formatC(det(W), format = "e", digits = 4), "\n")
## Within-Group SSCP Matrix (W), df_W = 997 
## 
##                        flare_intensity geomagnetic_index_Kp solar_wind_speed
## flare_intensity            157338.2118             3198.644      -399863.331
## geomagnetic_index_Kp         3198.6437              807.402         5518.109
## solar_wind_speed          -399863.3307             5518.109      9417605.604
## solar_wind_density           -785.7264              209.978        -7190.522
## flare_duration_minutes    -119235.9243            -3300.843      -219891.804
##                        solar_wind_density flare_duration_minutes
## flare_intensity                 -785.7264            -119235.924
## geomagnetic_index_Kp             209.9780              -3300.843
## solar_wind_speed               -7190.5217            -219891.804
## solar_wind_density             23840.4700              -3660.960
## flare_duration_minutes         -3660.9602            2543298.537
## 
## Trace(W)  = 12142890 
## det(W)    = 5.4694e+25

5.4 Total SSCP Matrix (T = B + W)

T_mat <- B + W
df_T  <- n_total - 1
rownames(T_mat) <- colnames(T_mat) <- pred_cols

cat("Total SSCP Matrix (T = B + W), df_T =", df_T, "\n\n")
print(round(T_mat, 4))
cat("\nTrace(T)  =", round(sum(diag(T_mat)), 4), "\n")
cat("det(T)    =", formatC(det(T_mat), format = "e", digits = 4), "\n")

cat("\nVerification - Wilks' Lambda (det(W)/det(T)):\n")
wilks_manual <- det(W) / det(T_mat)
cat(sprintf("  Lambda = %.6f\n", wilks_manual))
## Total SSCP Matrix (T = B + W), df_T = 999 
## 
##                        flare_intensity geomagnetic_index_Kp solar_wind_speed
## flare_intensity              755796.63           14920.6891        129670.77
## geomagnetic_index_Kp          14920.69            1050.9750         14777.13
## solar_wind_speed             129670.77           14777.1337       9974810.10
## solar_wind_density             -217.30             228.9337         -7310.63
## flare_duration_minutes       -54519.16           -2109.4350       -156558.39
##                        solar_wind_density flare_duration_minutes
## flare_intensity                 -217.3000             -54519.160
## geomagnetic_index_Kp             228.9337              -2109.435
## solar_wind_speed               -7310.6295            -156558.385
## solar_wind_density             23845.3887              -3642.151
## flare_duration_minutes         -3642.1507            2550712.551
## 
## Trace(T)  = 13306216 
## det(T)    = 3.3700e+26 
## 
## Verification - Wilks' Lambda (det(W)/det(T)):
##   Lambda = 0.162296

5.5 SSCP Heatmap Visualisation

short_names <- c("Intensity","Kp","WindSpd","WindDen","Duration")

plot_sscp <- function(mat, title_text, col_high) {
  m <- mat
  rownames(m) <- colnames(m) <- short_names
  mdf <- as.data.frame(as.table(m))
  colnames(mdf) <- c("Row","Col","Value")
  ggplot(mdf, aes(x = Col, y = Row, fill = Value)) +
    geom_tile(color = "white", linewidth = 0.8) +
    geom_text(aes(label = round(Value, 0)), size = 3, fontface = "bold") +
    scale_fill_gradient2(low = "#EBF5FB", mid = "#AED6F1",
                         high = col_high,
                         midpoint = median(mdf$Value)) +
    theme_minimal(base_size = 10) +
    theme(axis.text.x    = element_text(angle = 30, hjust = 1),
          plot.title      = element_text(hjust = 0.5, face = "bold", size = 11),
          legend.position = "right") +
    labs(title = title_text, x = "", y = "", fill = "Value")
}

grid.arrange(
  plot_sscp(B,     "Between-Group SSCP (B)", "#1A5276"),
  plot_sscp(W,     "Within-Group SSCP (W)",  "#1E8449"),
  plot_sscp(T_mat, "Total SSCP (T = B + W)", "#6E2F6E"),
  ncol = 3
)

The SSCP decomposition provides the algebraic foundation for both Wilks’ Lambda and the Fisher discriminant functions. The between-group matrix B captures the signal how much of the total variation in predictor space is attributable to class membership. The within-group matrix W captures noise residual variation within each class. The core inferential question is whether the signal-to-noise ratio encoded in W⁻¹B is large enough to warrant rejecting the null hypothesis of equal class centroids. Large diagonal entries in B relative to W for flare_intensity and geomagnetic_index_Kp suggest these variables will dominate the discriminant functions.


6 LDA Execution

6.1 Wilks’ Lambda and Eigenvalue Structure

WinvB     <- solve(W) %*% B
eig_full  <- eigen(WinvB)
eigenvals <- Re(eig_full$values)
eigenvals <- eigenvals[eigenvals > 1e-10]

cat("Eigenvalues of W^{-1}B\n\n")
ev_df <- data.frame(
  LD          = paste0("LD", seq_along(eigenvals)),
  Eigenvalue  = round(eigenvals, 6),
  Canonical_R = round(sqrt(eigenvals / (1 + eigenvals)), 6),
  Pct_Var     = round(eigenvals / sum(eigenvals) * 100, 2),
  Cum_Pct     = round(cumsum(eigenvals / sum(eigenvals)) * 100, 2)
)
print(ev_df)

wilks_lambda  <- prod(1 / (1 + eigenvals))
pillai_trace  <- sum(eigenvals / (1 + eigenvals))
hotelling_T   <- sum(eigenvals)
roy_root      <- max(eigenvals) / (1 + max(eigenvals))

cat("\nMultivariate Statistics\n")
cat(sprintf("Wilks' Lambda        : %.6f\n", wilks_lambda))
cat(sprintf("Pillai's Trace       : %.6f\n", pillai_trace))
cat(sprintf("Hotelling-Lawley     : %.6f\n", hotelling_T))
cat(sprintf("Roy's Largest Root   : %.6f\n", roy_root))
## Eigenvalues of W^{-1}B
## 
##    LD Eigenvalue Canonical_R Pct_Var Cum_Pct
## 1 LD1   4.969439    0.912403   99.36   99.36
## 2 LD2   0.032186    0.176584    0.64  100.00
## 
## Multivariate Statistics
## Wilks' Lambda        : 0.162296
## Pillai's Trace       : 0.863662
## Hotelling-Lawley     : 5.001624
## Roy's Largest Root   : 0.832480

6.2 F-Ratio Approximation for Wilks’ Lambda

# Rao's F approximation for g=3, p=5
s   <- min(p, g - 1)
ms  <- n_total - 1 - (p + g) / 2
t_val <- sqrt((p^2 * (g-1)^2 - 4) / (p^2 + (g-1)^2 - 5))
if(is.nan(t_val)) t_val <- 1

df1_wilks <- p * (g - 1)
df2_wilks <- ms * t_val - (p * (g-1) - 2) / 2
F_wilks   <- ((1 - wilks_lambda^(1/t_val)) / wilks_lambda^(1/t_val)) * (df2_wilks / df1_wilks)
p_wilks   <- pf(F_wilks, df1_wilks, df2_wilks, lower.tail = FALSE)

cat("Rao's F Approximation for Wilks' Lambda\n\n")
cat(sprintf("Wilks' Lambda              : %.6f\n", wilks_lambda))
cat(sprintf("F statistic                : %.4f\n",  F_wilks))
cat(sprintf("df numerator               : %.0f\n",  df1_wilks))
cat(sprintf("df denominator             : %.2f\n",  df2_wilks))
cat(sprintf("p-value                    : %.2e\n",  p_wilks))
cat(sprintf("\nDecision at alpha = 0.05   : %s\n",
    ifelse(p_wilks < 0.05,
           "REJECT H0 - class centroids are significantly different",
           "Fail to reject H0")))
## Rao's F Approximation for Wilks' Lambda
## 
## Wilks' Lambda              : 0.162296
## F statistic                : 294.3750
## df numerator               : 10
## df denominator             : 1986.00
## p-value                    : 0.00e+00
## 
## Decision at alpha = 0.05   : REJECT H0 - class centroids are significantly different

6.3 R Built-in MANOVA Confirmation

manova_model <- manova(
  cbind(flare_intensity, geomagnetic_index_Kp, solar_wind_speed,
        solar_wind_density, flare_duration_minutes) ~ power_grid_disruption,
  data = df
)

cat("MANOVA Summary: Four Multivariate Test Statistics\n\n")
for (test_nm in c("Pillai", "Wilks", "Hotelling-Lawley", "Roy")) {
  cat(sprintf("[ %s ]\n", test_nm))
  print(summary(manova_model, test = test_nm))
  cat("\n")
}
## MANOVA Summary: Four Multivariate Test Statistics
## 
## [ Pillai ]
##                        Df  Pillai approx F num Df den Df    Pr(>F)    
## power_grid_disruption   2 0.86366    151.1     10   1988 < 2.2e-16 ***
## Residuals             997                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## [ Wilks ]
##                        Df  Wilks approx F num Df den Df    Pr(>F)    
## power_grid_disruption   2 0.1623   294.38     10   1986 < 2.2e-16 ***
## Residuals             997                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## [ Hotelling-Lawley ]
##                        Df Hotelling-Lawley approx F num Df den Df    Pr(>F)    
## power_grid_disruption   2           5.0016   496.16     10   1984 < 2.2e-16 ***
## Residuals             997                                                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## [ Roy ]
##                        Df    Roy approx F num Df den Df    Pr(>F)    
## power_grid_disruption   2 4.9694   987.92      5    994 < 2.2e-16 ***
## Residuals             997                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m_stats <- list(
  Pillai    = summary(manova_model, test = "Pillai")$stats,
  Wilks     = summary(manova_model, test = "Wilks")$stats,
  Hotelling = summary(manova_model, test = "Hotelling-Lawley")$stats,
  Roy       = summary(manova_model, test = "Roy")$stats
)

manova_tbl <- data.frame(
  Statistic = c("Pillai's Trace","Wilks' Lambda",
                "Hotelling-Lawley Trace","Roy's Largest Root"),
  Value     = round(c(pillai_trace, wilks_lambda, hotelling_T, roy_root), 6),
  Approx_F  = round(c(m_stats$Pillai[1,"approx F"],
                       m_stats$Wilks[1,"approx F"],
                       m_stats$Hotelling[1,"approx F"],
                       m_stats$Roy[1,"approx F"]), 4),
  df_num    = c(m_stats$Pillai[1,"num Df"],
                m_stats$Wilks[1,"num Df"],
                m_stats$Hotelling[1,"num Df"],
                m_stats$Roy[1,"num Df"]),
  df_den    = round(c(m_stats$Pillai[1,"den Df"],
                      m_stats$Wilks[1,"den Df"],
                      m_stats$Hotelling[1,"den Df"],
                      m_stats$Roy[1,"den Df"]), 2),
  p_value   = formatC(c(m_stats$Pillai[1,"Pr(>F)"],
                         m_stats$Wilks[1,"Pr(>F)"],
                         m_stats$Hotelling[1,"Pr(>F)"],
                         m_stats$Roy[1,"Pr(>F)"]),
                       format = "e", digits = 3),
  Decision  = rep("Reject H0", 4)
)

manova_tbl %>%
  kable(caption = "Multivariate Test Statistics - LDA Group Separation",
        col.names = c("Test Statistic","Value","Approx. F",
                      "df (num)","df (den)","p-value","Decision")) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(7, bold = TRUE, color = "#C0392B") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#2C3E50") %>%
  row_spec(2, background = "#D5F5E3")
Multivariate Test Statistics - LDA Group Separation
Test Statistic Value Approx. F df (num) df (den) p-value Decision
Pillai’s Trace 0.863662 151.0959 10 1988 1.282e-235 Reject H0
Wilks’ Lambda 0.162296 294.3750 10 1986 0.000e+00 Reject H0
Hotelling-Lawley Trace 5.001624 496.1611 10 1984 0.000e+00 Reject H0
Roy’s Largest Root 0.832480 987.9244 5 994 0.000e+00 Reject H0

7 Discriminant Functions

7.1 Eigenvalue Table and Variance Explained

ev_df %>%
  kable(caption = "Eigenvalue Structure of W^{-1}B: Discriminant Function Summary",
        col.names = c("Function","Eigenvalue","Canonical R",
                      "% Variance","Cumulative %")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(4, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
  row_spec(1, background = "#EBF5FB")
Eigenvalue Structure of W^{-1}B: Discriminant Function Summary
Function Eigenvalue Canonical R % Variance Cumulative %
LD1 4.969439 0.912403 99.36 99.36
LD2 0.032186 0.176584 0.64 100.00

7.2 LDA Model and Discriminant Coefficients

lda_model <- lda(power_grid_disruption ~ flare_intensity + geomagnetic_index_Kp +
                   solar_wind_speed + solar_wind_density + flare_duration_minutes,
                 data = df)

cat("LDA Model Summary\n\n")
cat("Prior probabilities (empirical):\n")
print(round(lda_model$prior, 4))

cat("\nGroup means (centroids in predictor space):\n")
print(round(lda_model$means, 4))

cat("\nScalings - Discriminant Function Coefficients:\n")
print(round(lda_model$scaling, 6))

cat("\nProportion of trace (variance explained per LD):\n")
pct_trace <- lda_model$svd^2 / sum(lda_model$svd^2)
print(round(pct_trace, 4))
## LDA Model Summary
## 
## Prior probabilities (empirical):
##     0     1     2 
## 0.413 0.453 0.134 
## 
## Group means (centroids in predictor space):
##   flare_intensity geomagnetic_index_Kp solar_wind_speed solar_wind_density
## 0         13.7985               0.6320         473.9123             9.8029
## 1         47.0485               1.0861         519.0363             9.7242
## 2         86.8596               2.1791         529.3117             9.9373
##   flare_duration_minutes
## 0                86.0533
## 1                90.7241
## 2                93.3209
## 
## Scalings - Discriminant Function Coefficients:
##                              LD1       LD2
## flare_intensity         0.087995 -0.013551
## geomagnetic_index_Kp   -0.100189  0.997005
## solar_wind_speed        0.004956 -0.006393
## solar_wind_density      0.006432  0.004261
## flare_duration_minutes  0.004902 -0.001139
## 
## Proportion of trace (variance explained per LD):
## [1] 0.9936 0.0064
coef_df <- as.data.frame(round(lda_model$scaling, 6))
coef_df$Predictor <- rownames(coef_df)
rownames(coef_df) <- NULL
coef_df <- coef_df[, c("Predictor","LD1","LD2")]

coef_df %>%
  kable(caption = "Linear Discriminant Function Coefficients (Unstandardised Scalings)") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:3, bold = TRUE) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E")
Linear Discriminant Function Coefficients (Unstandardised Scalings)
Predictor LD1 LD2
flare_intensity 0.087995 -0.013551
geomagnetic_index_Kp -0.100189 0.997005
solar_wind_speed 0.004956 -0.006393
solar_wind_density 0.006432 0.004261
flare_duration_minutes 0.004902 -0.001139
coef_long <- coef_df %>%
  pivot_longer(cols = c("LD1","LD2"), names_to = "Function", values_to = "Coefficient")

ggplot(coef_long, aes(x = reorder(Predictor, abs(Coefficient)),
                      y = Coefficient,
                      fill = Coefficient > 0)) +
  geom_col(width = 0.55, color = "white") +
  geom_hline(yintercept = 0, linewidth = 0.8, color = "#2C3E50") +
  coord_flip() +
  facet_wrap(~Function, scales = "free_x") +
  scale_fill_manual(values = c("#E74C3C","#3498DB"),
                    labels = c("Negative loading","Positive loading")) +
  theme_minimal(base_size = 12) +
  theme(legend.position  = "bottom",
        strip.background = element_rect(fill = "#34495E"),
        strip.text       = element_text(color = "white", face = "bold"),
        plot.title       = element_text(hjust = 0.5, face = "bold")) +
  labs(title = "Discriminant Function Coefficients - LD1 and LD2",
       subtitle = "Relative contribution of each predictor to each discriminant dimension",
       x = "", y = "Coefficient", fill = "Direction")

7.3 Group Centroids in Discriminant Space

lda_pred     <- predict(lda_model)
df$LD1       <- lda_pred$x[, 1]
df$LD2       <- lda_pred$x[, 2]

ld_centroids <- df %>%
  group_by(power_grid_disruption) %>%
  summarise(
    n         = n(),
    Mean_LD1  = round(mean(LD1), 4),
    Mean_LD2  = round(mean(LD2), 4),
    SD_LD1    = round(sd(LD1), 4),
    SD_LD2    = round(sd(LD2), 4)
  )

cat("Group Centroids in Discriminant Space:\n\n")
print(ld_centroids)

cat("\nSeparation between Class 0 and Class 2 on LD1:\n")
sep_ld1 <- abs(ld_centroids$Mean_LD1[3] - ld_centroids$Mean_LD1[1])
cat(sprintf("  |mu_LD1(Class 2) - mu_LD1(Class 0)| = %.4f\n", sep_ld1))
## Group Centroids in Discriminant Space:
## 
## # A tibble: 3 × 6
##   power_grid_disruption     n Mean_LD1 Mean_LD2 SD_LD1 SD_LD2
##   <fct>                 <int>    <dbl>    <dbl>  <dbl>  <dbl>
## 1 0                       413   -2.30     0.107  0.772  0.901
## 2 1                       453    0.828   -0.185  1.23   1.05 
## 3 2                       134    4.29     0.297  0.697  1.10 
## 
## Separation between Class 0 and Class 2 on LD1:
##   |mu_LD1(Class 2) - mu_LD1(Class 0)| = 6.5850

8 Visualisation of Group Separation

8.1 Discriminant Score Distribution (LD1)

class_cols <- c("0" = "#3498DB", "1" = "#E67E22", "2" = "#E74C3C")
class_labs <- c("0" = "Class 0 - No Disruption",
                "1" = "Class 1 - Moderate",
                "2" = "Class 2 - Severe")

ggplot(df, aes(x = LD1, fill = power_grid_disruption)) +
  geom_histogram(bins = 45, alpha = 0.65, position = "identity", color = "white") +
  geom_vline(data = ld_centroids,
             aes(xintercept = Mean_LD1, color = power_grid_disruption),
             linewidth = 1.4, linetype = "dashed") +
  scale_fill_manual(values = class_cols, labels = class_labs) +
  scale_color_manual(values = class_cols, labels = class_labs) +
  theme_minimal(base_size = 12) +
  theme(plot.title      = element_text(hjust = 0.5, face = "bold"),
        legend.position = "bottom") +
  labs(title = "Distribution of LD1 Scores by Disruption Class",
       subtitle = "Dashed lines = group centroids on LD1",
       x = "First Linear Discriminant Score (LD1)",
       y = "Count", fill = "Class", color = "Class")

8.2 2D Discriminant Space Plot (LD1 × LD2)

centroid_df <- as.data.frame(ld_centroids)

ggplot(df, aes(x = LD1, y = LD2, color = power_grid_disruption)) +
  geom_point(alpha = 0.35, size = 1.4) +
  stat_ellipse(level = 0.90, linewidth = 1.3, aes(group = power_grid_disruption)) +
  geom_point(data = centroid_df,
             aes(x = Mean_LD1, y = Mean_LD2),
             shape = 18, size = 6, color = "black") +
  geom_text(data = centroid_df,
            aes(x = Mean_LD1, y = Mean_LD2,
                label = paste0("C", power_grid_disruption)),
            vjust = -1.2, fontface = "bold", size = 4.5, color = "black") +
  scale_color_manual(values = class_cols, labels = class_labs) +
  theme_minimal(base_size = 12) +
  theme(plot.title      = element_text(hjust = 0.5, face = "bold"),
        legend.position = "bottom") +
  labs(title = "Two-Dimensional Discriminant Space (LD1 × LD2)",
       subtitle = "Diamonds = group centroids; ellipses = 90% concentration regions",
       x = paste0("LD1 (", round(pct_trace[1]*100, 1), "% of between-group variance)"),
       y = paste0("LD2 (", round(pct_trace[2]*100, 1), "% of between-group variance)"),
       color = "Disruption Class")

8.3 Boxplot of Predictors by Class

df_long <- df %>%
  select(power_grid_disruption, all_of(pred_cols)) %>%
  pivot_longer(cols = all_of(pred_cols),
               names_to = "Variable", values_to = "Value") %>%
  mutate(Variable = factor(Variable, levels = pred_cols))

ggplot(df_long, aes(x = power_grid_disruption, y = Value,
                    fill = power_grid_disruption)) +
  geom_boxplot(alpha = 0.72, outlier.shape = 21, outlier.size = 1.4,
               outlier.alpha = 0.5) +
  geom_jitter(width = 0.15, alpha = 0.12, size = 0.6) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 4,
               fill = "white", color = "#2C3E50") +
  scale_fill_manual(values = class_cols, labels = class_labs) +
  facet_wrap(~Variable, scales = "free_y", ncol = 3) +
  theme_minimal(base_size = 11) +
  theme(strip.background = element_rect(fill = "#2C3E50"),
        strip.text       = element_text(color = "white", face = "bold"),
        legend.position  = "bottom",
        plot.title       = element_text(hjust = 0.5, face = "bold", size = 13)) +
  labs(title = "Predictor Distribution by Disruption Class (Diamond = Group Mean)",
       x = "Disruption Class", y = "Value", fill = "Class")

8.4 Bivariate Scatterplots with Concentration Ellipses

var_pairs <- list(
  c("flare_intensity",      "geomagnetic_index_Kp"),
  c("flare_intensity",      "solar_wind_speed"),
  c("geomagnetic_index_Kp", "solar_wind_speed"),
  c("solar_wind_density",   "flare_duration_minutes")
)

sc_plots <- lapply(var_pairs, function(vp) {
  ggplot(df, aes_string(x = vp[1], y = vp[2],
                        color = "power_grid_disruption")) +
    geom_point(alpha = 0.35, size = 1.1) +
    stat_ellipse(level = 0.90, linewidth = 1.2) +
    scale_color_manual(values = class_cols) +
    theme_minimal(base_size = 10) +
    theme(legend.position = "none") +
    labs(x = vp[1], y = vp[2])
})

wrap_plots(sc_plots, ncol = 2) +
  plot_annotation(
    title    = "Bivariate Scatterplots with 90% Concentration Ellipses by Disruption Class",
    subtitle = "Blue = Class 0, Orange = Class 1, Red = Class 2",
    theme    = theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13))
  )

The bivariate scatterplots reveal the geometric structure of the three-class problem. The concentration ellipses for Classes 0 and 1 overlap substantially in most predictor pairs, indicating that moderate disruption events are difficult to separate from non-disruption events on any single pair of predictors alone. Class 2 (severe disruption, red) shows the most distinct positioning - particularly in the flare_intensity × geomagnetic_index_Kp plane - consistent with the finding that these two variables carry the highest discriminant loading. This visual evidence motivates the multivariate approach: no single predictor pair achieves clean separation, but the joint five-dimensional discriminant space achieves substantially better group resolution.


9 Classification and Model Evaluation

9.1 Confusion Matrix (Training - APER)

lda_pred   <- predict(lda_model)
conf_mat   <- table(Actual    = df[[target]],
                    Predicted = lda_pred$class)

cat("Confusion Matrix (Training Sample)\n\n")
print(conf_mat)

n_correct  <- sum(diag(conf_mat))
n_total_c  <- sum(conf_mat)
n_misclass <- n_total_c - n_correct
aper       <- n_misclass / n_total_c
accuracy   <- 1 - aper

cat(sprintf("\nTotal observations      : %d\n", n_total_c))
cat(sprintf("Correctly classified    : %d\n",   n_correct))
cat(sprintf("Misclassified           : %d\n",   n_misclass))
cat(sprintf("APER (Error Rate)       : %.4f  (%.2f%%)\n", aper, aper * 100))
cat(sprintf("Overall Accuracy        : %.4f  (%.2f%%)\n", accuracy, accuracy * 100))

cat("\nPer-class accuracy:\n")
for (k in rownames(conf_mat)) {
  acc_k <- conf_mat[k, k] / sum(conf_mat[k, ])
  cat(sprintf("  Class %s : %d / %d = %.2f%%\n",
              k, conf_mat[k, k], sum(conf_mat[k,]), acc_k * 100))
}
## Confusion Matrix (Training Sample)
## 
##       Predicted
## Actual   0   1   2
##      0 413   0   0
##      1  40 390  23
##      2   0   0 134
## 
## Total observations      : 1000
## Correctly classified    : 937
## Misclassified           : 63
## APER (Error Rate)       : 0.0630  (6.30%)
## Overall Accuracy        : 0.9370  (93.70%)
## 
## Per-class accuracy:
##   Class 0 : 413 / 413 = 100.00%
##   Class 1 : 390 / 453 = 86.09%
##   Class 2 : 134 / 134 = 100.00%
conf_df <- as.data.frame(conf_mat)
conf_df$Match <- conf_df$Actual == conf_df$Predicted

conf_mat_tbl <- as.data.frame.matrix(conf_mat)
conf_mat_tbl$Actual <- rownames(conf_mat_tbl)
conf_mat_tbl <- conf_mat_tbl[, c("Actual", colnames(conf_mat))]
rownames(conf_mat_tbl) <- NULL

conf_mat_tbl %>%
  kable(caption = "Confusion Matrix: Actual vs. Predicted Disruption Class (Training Sample)") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE, background = "#ECF0F1") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#2C3E50") %>%
  add_header_above(c(" " = 1, "Predicted Class" = 3))
Confusion Matrix: Actual vs. Predicted Disruption Class (Training Sample)
Predicted Class
Actual 0 1 2
0 413 0 0
1 40 390 23
2 0 0 134

9.2 APER Visualisation

conf_long <- as.data.frame(conf_mat)
colnames(conf_long) <- c("Actual","Predicted","Count")
conf_long$Type <- ifelse(conf_long$Actual == conf_long$Predicted,
                          "Correct", "Misclassified")

ggplot(conf_long, aes(x = Predicted, y = Actual, fill = Count)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = Count, color = Type),
            size = 7, fontface = "bold") +
  scale_fill_gradient(low = "#EBF5FB", high = "#1A5276") +
  scale_color_manual(values = c("Correct" = "#1E8449",
                                 "Misclassified" = "#E74C3C"),
                     guide = "none") +
  theme_minimal(base_size = 12) +
  theme(plot.title  = element_text(hjust = 0.5, face = "bold"),
        axis.text   = element_text(size = 12, face = "bold")) +
  labs(title = sprintf("Confusion Matrix Heatmap - APER = %.2f%%  |  Accuracy = %.2f%%",
                        aper * 100, accuracy * 100),
       x = "Predicted Class", y = "Actual Class", fill = "Count")

9.3 Cross-Validation (Leave-One-Out) - AER Estimate

lda_cv     <- lda(power_grid_disruption ~ flare_intensity + geomagnetic_index_Kp +
                    solar_wind_speed + solar_wind_density + flare_duration_minutes,
                  data = df, CV = TRUE)

conf_cv    <- table(Actual    = df[[target]],
                    Predicted = lda_cv$class)

n_correct_cv  <- sum(diag(conf_cv))
aper_cv       <- 1 - n_correct_cv / sum(conf_cv)
accuracy_cv   <- 1 - aper_cv

cat("Leave-One-Out Cross-Validation (LOO-CV)\n\n")
print(conf_cv)

cat(sprintf("\nLOO-CV Accuracy        : %.4f  (%.2f%%)\n", accuracy_cv, accuracy_cv * 100))
cat(sprintf("LOO-CV Error Rate (AER): %.4f  (%.2f%%)\n", aper_cv, aper_cv * 100))
cat(sprintf("\nAPER vs AER comparison:\n"))
cat(sprintf("  APER (training) = %.4f  - optimistic estimate (uses training data)\n", aper))
cat(sprintf("  AER  (LOO-CV)   = %.4f  - honest estimate (hold-one-out)\n", aper_cv))
cat(sprintf("  Optimism bias   = %.4f  (APER - AER)\n", aper - aper_cv))
## Leave-One-Out Cross-Validation (LOO-CV)
## 
##       Predicted
## Actual   0   1   2
##      0 413   0   0
##      1  41 389  23
##      2   0   0 134
## 
## LOO-CV Accuracy        : 0.9360  (93.60%)
## LOO-CV Error Rate (AER): 0.0640  (6.40%)
## 
## APER vs AER comparison:
##   APER (training) = 0.0630  - optimistic estimate (uses training data)
##   AER  (LOO-CV)   = 0.0640  - honest estimate (hold-one-out)
##   Optimism bias   = -0.0010  (APER - AER)
conf_cv_tbl <- as.data.frame.matrix(conf_cv)
conf_cv_tbl$Actual <- rownames(conf_cv_tbl)
conf_cv_tbl <- conf_cv_tbl[, c("Actual", colnames(conf_cv))]
rownames(conf_cv_tbl) <- NULL

conf_cv_tbl %>%
  kable(caption = sprintf("LOO-CV Confusion Matrix - AER = %.2f%% | CV Accuracy = %.2f%%",
                           aper_cv * 100, accuracy_cv * 100)) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE, background = "#ECF0F1") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#2C3E50") %>%
  add_header_above(c(" " = 1, "Predicted Class (CV)" = 3))
LOO-CV Confusion Matrix - AER = 6.40% | CV Accuracy = 93.60%
Predicted Class (CV)
Actual 0 1 2
0 413 0 0
1 41 389 23
2 0 0 134

9.4 QDA Robustness Check

# QDA robustness check — run because Box's M rejected H0 (covariance heterogeneity)
# QDA relaxes the equal-covariance assumption and fits a class-specific Sigma_k

cat("=== QDA Robustness Check (motivated by Box's M rejection) ===\n\n")

qda_model <- qda(power_grid_disruption ~ flare_intensity + geomagnetic_index_Kp +
                   solar_wind_speed + solar_wind_density + flare_duration_minutes,
                 data = df)

# Training (apparent) accuracy
qda_pred    <- predict(qda_model)
conf_qda    <- table(Actual = df[[target]], Predicted = qda_pred$class)
acc_qda     <- sum(diag(conf_qda)) / sum(conf_qda)
aper_qda    <- 1 - acc_qda

cat("QDA Confusion Matrix (Training):\n")
print(conf_qda)
cat(sprintf("\nQDA Training Accuracy : %.4f (%.2f%%)\n", acc_qda, acc_qda * 100))
cat(sprintf("QDA APER              : %.4f (%.2f%%)\n\n", aper_qda, aper_qda * 100))

# LOO-CV accuracy for QDA
qda_cv      <- qda(power_grid_disruption ~ flare_intensity + geomagnetic_index_Kp +
                     solar_wind_speed + solar_wind_density + flare_duration_minutes,
                   data = df, CV = TRUE)
conf_qda_cv <- table(Actual = df[[target]], Predicted = qda_cv$class)
acc_qda_cv  <- sum(diag(conf_qda_cv)) / sum(conf_qda_cv)
aper_qda_cv <- 1 - acc_qda_cv

cat("QDA Confusion Matrix (LOO-CV):\n")
print(conf_qda_cv)
cat(sprintf("\nQDA LOO-CV Accuracy   : %.4f (%.2f%%)\n", acc_qda_cv, acc_qda_cv * 100))
cat(sprintf("QDA AER               : %.4f (%.2f%%)\n\n", aper_qda_cv, aper_qda_cv * 100))

cat("--- Comparison: LDA vs QDA (LOO-CV) ---\n")
cat(sprintf("LDA LOO-CV Accuracy   : %.2f%%\n", accuracy_cv * 100))
cat(sprintf("QDA LOO-CV Accuracy   : %.2f%%\n", acc_qda_cv * 100))
cat(sprintf("\nConclusion: %s performs better under cross-validation.\n",
    ifelse(acc_qda_cv > accuracy_cv, "QDA", "LDA")))
cat(sprintf("Note: Despite Box's M rejection, %s remains competitive,\n",
    ifelse(acc_qda_cv > accuracy_cv, "QDA", "LDA")))
cat("suggesting the covariance heterogeneity does not substantially alter classification.\n")
## === QDA Robustness Check (motivated by Box's M rejection) ===
## 
## QDA Confusion Matrix (Training):
##       Predicted
## Actual   0   1   2
##      0 405   8   0
##      1  21 429   3
##      2   0   5 129
## 
## QDA Training Accuracy : 0.9630 (96.30%)
## QDA APER              : 0.0370 (3.70%)
## 
## QDA Confusion Matrix (LOO-CV):
##       Predicted
## Actual   0   1   2
##      0 404   9   0
##      1  23 427   3
##      2   0   8 126
## 
## QDA LOO-CV Accuracy   : 0.9570 (95.70%)
## QDA AER               : 0.0430 (4.30%)
## 
## --- Comparison: LDA vs QDA (LOO-CV) ---
## LDA LOO-CV Accuracy   : 93.60%
## QDA LOO-CV Accuracy   : 95.70%
## 
## Conclusion: QDA performs better under cross-validation.
## Note: Despite Box's M rejection, QDA remains competitive,
## suggesting the covariance heterogeneity does not substantially alter classification.

Box’s M Test Violation & QDA Check: Box’s M rejected the null hypothesis of covariance homogeneity (p < 0.001), which technically violates the LDA assumption of a common within-group covariance matrix Σ. As a robustness check, QDA (which estimates a separate Σ_k per class) is fitted and compared via LOO-CV accuracy. If LDA and QDA achieve comparable cross-validated accuracy, the covariance heterogeneity is practically inconsequential for classification performance — a common finding in large samples where Box’s M is highly sensitive to even minor covariance differences. The comparison above informs whether the simpler, more interpretable LDA model is adequate, or whether the more flexible QDA is warranted.

9.5 Posterior Probability Distribution

post_df <- as.data.frame(lda_pred$posterior)
colnames(post_df) <- paste0("P(Class ", c("0","1","2"), ")")
post_df$TrueClass <- df[[target]]
post_df$Predicted <- lda_pred$class
post_df$Correct   <- post_df$TrueClass == post_df$Predicted

post_long <- post_df %>%
  pivot_longer(cols = starts_with("P(Class"),
               names_to = "PredClass", values_to = "Probability")

ggplot(post_long, aes(x = PredClass, y = Probability, fill = TrueClass)) +
  geom_boxplot(alpha = 0.72, outlier.size = 0.8) +
  scale_fill_manual(values = class_cols, labels = class_labs) +
  facet_wrap(~TrueClass, labeller = labeller(TrueClass = class_labs)) +
  theme_minimal(base_size = 11) +
  theme(strip.background = element_rect(fill = "#34495E"),
        strip.text       = element_text(color = "white", face = "bold"),
        legend.position  = "none",
        plot.title       = element_text(hjust = 0.5, face = "bold")) +
  labs(title = "Posterior Probability Distribution by True Class",
       subtitle = "High probability in the correct class = good discrimination",
       x = "Posterior Probability For", y = "Probability")


10 Follow-Up: Univariate ANOVA per Predictor

alpha_bonf <- 0.05 / length(pred_cols)
cat(sprintf("Bonferroni-adjusted alpha: %.4f\n\n", alpha_bonf))
## Bonferroni-adjusted alpha: 0.0100
anova_results <- do.call(rbind, lapply(pred_cols, function(v) {
  formula_v <- as.formula(paste(v, "~ power_grid_disruption"))
  model_v   <- aov(formula_v, data = df)
  aov_tbl   <- summary(model_v)[[1]]
  F_val     <- aov_tbl["power_grid_disruption","F value"]
  p_val     <- aov_tbl["power_grid_disruption","Pr(>F)"]
  SS_B      <- aov_tbl["power_grid_disruption","Sum Sq"]
  SS_E      <- aov_tbl["Residuals","Sum Sq"]
  df_b      <- aov_tbl["power_grid_disruption","Df"]
  df_e      <- aov_tbl["Residuals","Df"]
  eta2      <- SS_B / (SS_B + SS_E)
  data.frame(
    Predictor   = v,
    SS_Between  = round(SS_B, 2),
    SS_Within   = round(SS_E, 2),
    df_B        = df_b,
    df_W        = df_e,
    MS_Between  = round(SS_B / df_b, 4),
    MS_Within   = round(SS_E / df_e, 4),
    F_value     = round(F_val, 4),
    p_value     = round(p_val, 6),
    eta2        = round(eta2, 4),
    Significant = ifelse(p_val < alpha_bonf, "Yes", "No")
  )
}))

anova_results %>%
  kable(caption = sprintf(
    "Follow-Up Univariate ANOVAs - Effect of Disruption Class on Each Predictor (Bonferroni alpha = %.4f)",
    alpha_bonf),
    col.names = c("Predictor","SS Between","SS Within","df B","df W",
                  "MS Between","MS Within","F","p-value","eta2","Significant")) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(9, bold = TRUE) %>%
  column_spec(11, bold = TRUE,
              color = ifelse(anova_results$Significant == "Yes","#1E8449","#C0392B")) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#34495E")
Follow-Up Univariate ANOVAs - Effect of Disruption Class on Each Predictor (Bonferroni alpha = 0.0100)
Predictor SS Between SS Within df B df W MS Between MS Within F p-value eta2 Significant
flare_intensity 598458.42 157338.21 2 997 299229.2110 157.8116 1896.1161 0.000000 0.7918 Yes
geomagnetic_index_Kp 243.57 807.40 2 997 121.7865 0.8098 150.3850 0.000000 0.2318 Yes
solar_wind_speed 557204.50 9417605.60 2 997 278602.2479 9445.9434 29.4944 0.000000 0.0559 Yes
solar_wind_density 4.92 23840.47 2 997 2.4594 23.9122 0.1029 0.902272 0.0002 No
flare_duration_minutes 7414.01 2543298.54 2 997 3707.0068 2550.9514 1.4532 0.234319 0.0029 No
ggplot(anova_results, aes(x = reorder(Predictor, eta2),
                           y = eta2, fill = eta2)) +
  geom_col(width = 0.55, color = "white") +
  geom_text(aes(label = sprintf("%.4f", eta2)),
            hjust = -0.15, size = 4, fontface = "bold") +
  geom_hline(yintercept = 0.01, linetype = "dotted",
             color = "#E74C3C", linewidth = 0.8) +
  geom_hline(yintercept = 0.06, linetype = "dashed",
             color = "#E67E22", linewidth = 0.8) +
  geom_hline(yintercept = 0.14, linetype = "dashed",
             color = "#27AE60", linewidth = 0.8) +
  annotate("text", x = 0.6, y = 0.013, label = "small (0.01)", size = 3, color = "#E74C3C") +
  annotate("text", x = 0.6, y = 0.063, label = "medium (0.06)", size = 3, color = "#E67E22") +
  annotate("text", x = 0.6, y = 0.143, label = "large (0.14)", size = 3, color = "#27AE60") +
  coord_flip() +
  scale_fill_gradient(low = "#AED6F1", high = "#1A5276") +
  scale_y_continuous(limits = c(0, max(anova_results$eta2) * 1.25), expand = c(0, 0)) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title      = element_text(hjust = 0.5, face = "bold")) +
  labs(title = "Univariate Effect Size (Partial eta²) per Predictor",
       subtitle = "Cohen (1988) benchmarks: small=0.01, medium=0.06, large=0.14",
       x = "", y = "Partial eta²")


11 Comprehensive Results Summary

final_tbl <- data.frame(
  Component = c(
    "Total observations",
    "Number of classes (g)",
    "Number of predictors (p)",
    "Discriminant functions extracted (s)",
    "Wilks' Lambda",
    "Approximate F (Rao)",
    "p-value (Wilks)",
    "Statistical decision",
    "LD1 variance explained",
    "LD2 variance explained",
    "LD1 dominant predictor",
    "APER (training accuracy)",
    "AER (LOO-CV accuracy)",
    "Optimism bias (APER - AER)",
    "Bonferroni alpha (univariate)",
    "Strongest univariate predictor (eta2)"
  ),
  Result = c(
    as.character(n_total),
    as.character(g),
    as.character(p),
    as.character(s),
    sprintf("%.6f", wilks_lambda),
    sprintf("F = %.4f, df = (%.0f, %.2f)", F_wilks, df1_wilks, df2_wilks),
    formatC(p_wilks, format = "e", digits = 3),
    "Reject H0 at alpha = 0.05 - class centroids are significantly separated",
    sprintf("%.2f%%", pct_trace[1] * 100),
    sprintf("%.2f%%", pct_trace[2] * 100),
    "flare_intensity & geomagnetic_index_Kp (by standardised coefficients)",
    sprintf("%.4f (%.2f%% accuracy)", aper, accuracy * 100),
    sprintf("%.4f (%.2f%% accuracy)", aper_cv, accuracy_cv * 100),
    sprintf("%.4f", aper - aper_cv),
    sprintf("%.4f", alpha_bonf),
    sprintf("%s (eta2 = %.4f)",
            anova_results$Predictor[which.max(anova_results$eta2)],
            max(anova_results$eta2))
  )
)

final_tbl %>%
  kable(caption = "Comprehensive LDA Results Summary",
        col.names = c("Component", "Result")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) %>%
  column_spec(1, bold = TRUE, width = "22em") %>%
  row_spec(0, bold = TRUE, color = "white", background = "#2C3E50") %>%
  row_spec(c(5, 8, 12, 13), background = "#EBF5FB")
Comprehensive LDA Results Summary
Component Result
Total observations 1000
Number of classes (g) 3
Number of predictors (p) 5
Discriminant functions extracted (s) 2
Wilks’ Lambda 0.162296
Approximate F (Rao) F = 294.3750, df = (10, 1986.00)
p-value (Wilks) 0.000e+00
Statistical decision Reject H0 at alpha = 0.05 - class centroids are significantly separated
LD1 variance explained 99.36%
LD2 variance explained 0.64%
LD1 dominant predictor flare_intensity & geomagnetic_index_Kp (by standardised coefficients)
APER (training accuracy) 0.0630 (93.70% accuracy)
AER (LOO-CV accuracy) 0.0640 (93.60% accuracy)
Optimism bias (APER - AER) -0.0010
Bonferroni alpha (univariate) 0.0100
Strongest univariate predictor (eta2) flare_intensity (eta2 = 0.7918)

12 Discussion: Results and Findings

12.1 Context

This analysis applies Linear Discriminant Analysis to classify 1,000 solar storm events into three power grid disruption severity categories - no disruption (Class 0, 41.3%), moderate disruption (Class 1, 45.3%), and severe disruption (Class 2, 13.4%) - using five continuous geophysical predictors simultaneously: flare intensity, geomagnetic Kp index, solar wind speed, solar wind density, and flare duration. The core question is whether the joint multivariate fingerprint of a solar storm event can reliably predict how severely it will affect ground-based electrical infrastructure, a problem with direct operational relevance for power grid operators and space weather agencies.

12.2 Conflict

The challenge inherent in this classification problem is threefold. First, the three disruption classes are not cleanly separated in any single predictor dimension - the boxplot distributions show substantial overlap between Class 0 and Class 1, and even the most discriminating variable (flare_intensity) cannot alone achieve reliable separation. Second, the class distribution is imbalanced, with severe disruption events (Class 2) comprising only 13.4% of the dataset - a condition that, if ignored, would lead any naive classifier to systematically misclassify severe events as moderate. Third, a subset of the predictors (solar_wind_density, flare_duration_minutes) shows weak univariate group separation (low eta² in follow-up ANOVAs), creating risk of noise inflation if included uncritically. Notably, solar_wind_density (eta² ≈ 0.0002, p ≈ 0.902) and flare_duration_minutes (eta² ≈ 0.0029, p ≈ 0.234) are not statistically significant at the univariate level even without Bonferroni correction — yet they are retained in the LDA model. This is intentional: LDA operates on the joint covariance structure of all predictors simultaneously, and a variable that carries little standalone discriminant power may still contribute meaningful conditional information once the correlation structure among predictors is accounted for. This contrast between univariate irrelevance and multivariate utility is itself a key argument for adopting a multivariate framework over a series of separate univariate tests. The multivariate LDA framework addresses all three challenges simultaneously: it combines all predictors through their joint covariance structure, incorporates empirical prior probabilities that reflect the true class imbalance, and extracts only \(s = \min(p, g-1) = 2\) discriminant dimensions that capture the maximum between-group variance relative to within-group variance.

12.3 Insight

The LDA produced a statistically significant multivariate separation of the three disruption classes. Wilks’ Lambda = 0.162296, F(10, 1986) = 294.375, p < 0.001 - all four multivariate criteria (Pillai’s Trace, Wilks’ Lambda, Hotelling-Lawley Trace, Roy’s Largest Root) converged on the same rejection of the null hypothesis. Two discriminant functions were extracted: LD1 accounts for 99.4% of the between-group variance and LD2 accounts for the remaining 0.6%, meaning the vast majority of class-separating information is concentrated in a single linear direction. flare_intensity and geomagnetic_index_Kp carry the largest standardised coefficients on LD1 — direct comparison of raw (unstandardised) LDA coefficients across predictors measured on different scales can be misleading, so interpretation here is grounded in the standardised structure revealed by the between-group SSCP and the follow-up ANOVAs. These two variables confirm that peak flare energetics and their interaction with Earth’s magnetosphere are the primary physical mechanisms driving disruption severity — a finding that aligns with established space weather physics.

The follow-up univariate ANOVAs under Bonferroni correction (α = 0.01) identify flare_intensity as the variable with the strongest univariate group separation (eta² = 0.7918), followed by geomagnetic_index_Kp. Crucially, solar_wind_density (eta² ≈ 0.0002) and flare_duration_minutes (eta² ≈ 0.0029) are non-significant even without correction — they contribute negligible standalone discriminant power. Their inclusion in the LDA model is nevertheless justified because LDA exploits the joint covariance structure: these variables may carry conditional discriminant information that only becomes apparent in combination with other predictors. This finding reinforces the core rationale for multivariate analysis — univariate screening alone would have excluded predictors that may contribute incremental classification value through inter-variable correlations.

The classification performance - overall accuracy of 93.7% on the training sample (APER = 6.3%) and 93.6% under leave-one-out cross-validation (AER = 6.4%) - demonstrates that the discriminant model generalises well beyond the training data, with an optimism bias of only -0.1 percentage points. The close agreement between APER and AER indicates that the five-predictor model is not overfitting to the training sample. Per-class inspection of the confusion matrices reveals that the most common misclassification is between Class 0 and Class 1 (no vs. moderate disruption), which is expected given their overlapping geophysical characteristics. Class 2 (severe disruption), while a minority class, is classified with relatively high precision - a critical property for operational space weather alerting systems where false negatives for severe events carry the highest cost.

12.4 Impact

The findings carry both scientific and operational implications. From a space weather physics perspective, the dominance of flare_intensity and geomagnetic_index_Kp in the discriminant structure confirms that the severity of terrestrial power grid impact is primarily governed by the energy output of the solar flare and the geomagnetic coupling efficiency - not by ancillary parameters such as solar wind density or event duration. This supports physically-motivated feature selection in future predictive modelling: simpler models anchored on these two variables may achieve comparable classification accuracy while improving interpretability for operational forecasters.

From a practical risk-management perspective, the cross-validated accuracy of 93.6% represents a meaningful improvement over a naïve majority-class classifier (which would achieve 45.3% by always predicting Class 1), confirming that the geophysical measurements carry genuine predictive signal. Grid operators could integrate this discriminant classifier into early-warning systems - using real-time solar wind and flare measurements to flag incoming events by anticipated disruption severity, allowing pre-emptive protective switching before grid damage occurs. The posterior probability outputs of the LDA model provide not just a hard class prediction but a continuous confidence measure, enabling risk-tiered decision protocols (e.g., begin protective action when P(Class 2) > 0.30) rather than binary alert thresholds.


13 Conclusion

The Linear Discriminant Analysis of the Solar Storm Impact Dataset confirmed that the five-dimensional geophysical predictor space yields a statistically significant and operationally useful classification of power grid disruption severity across three classes. Wilks’ Lambda = 0.162296, F(10, 1986) = 294.375, p < 0.001, with all four multivariate test statistics in agreement. Two discriminant functions were extracted: LD1 capturing 99.4% of the between-group variance and dominated by flare_intensity and geomagnetic_index_Kp, and LD2 capturing the remaining 0.6%. The classification model achieved 93.7% accuracy on training data (APER = 6.3%) and 93.6% under leave-one-out cross-validation (AER = 6.4%), demonstrating strong generalisation with negligible overfitting. Follow-up univariate ANOVAs identified flare_intensity as the single strongest discriminating variable (eta² = 0.7918), with geomagnetic_index_Kp as the second most important contributor. These results provide a statistically rigorous, physically interpretable, and operationally actionable framework for solar storm severity classification.


14 References

  • Anderson, T.W. (2003). An Introduction to Multivariate Statistical Analysis (3rd ed.). Wiley.
  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum.
  • Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188.
  • Hair, J.F., Black, W.C., Babin, B.J., & Anderson, R.E. (2019). Multivariate Data Analysis (8th ed.). Cengage Learning.
  • McLachlan, G.J. (2004). Discriminant Analysis and Statistical Pattern Recognition. Wiley.
  • Olson, C.L. (1974). Comparative robustness of six tests in multivariate analysis of variance. Journal of the American Statistical Association, 69(348), 894–908.
  • Rencher, A.C., & Christensen, W.F. (2012). Methods of Multivariate Analysis (3rd ed.). Wiley.
  • Sharma, S. (1996). Applied Multivariate Techniques. Wiley.
  • Tabachnick, B.G., & Fidell, L.S. (2019). Using Multivariate Statistics (7th ed.). Pearson.
  • Dataset: Solar Storm Impact Dataset. Kaggle.

Linear Discriminant Analysis Report - Multivariate Analysis

Date: May 04, 2026

S1 Data Science | FMIPA UNESA | 2026