Solar storms large bursts of electromagnetic energy originating from the sun are among the most consequential natural phenomena for modern technological infrastructure. When high-intensity solar flares interact with Earth’s geomagnetic field, they induce electrical currents in ground-based systems, potentially disrupting or permanently damaging power transmission grids. The ability to classify the severity of a disruption before or immediately after a storm event is therefore of direct operational relevance for grid operators, satellite managers, and emergency response planners.
This document applies Linear Discriminant Analysis (LDA), a multivariate classification framework rooted in the Bayesian minimum expected cost principle to the Solar Storm Impact Dataset, which records 1,000 solar storm events across five continuous geophysical predictors. The central analytical objective is to construct a discriminant function that accurately separates storm events into three ordered disruption severity classes (no disruption, moderate disruption, severe disruption) based on simultaneously observed solar and geomagnetic measurements.
LDA is selected as the analytical method because the response variable is nominal-categorical with three ordered classes, the predictors are all continuous and measured on interval/ratio scales, satisfying the structural requirements established in the lecture notes, and the method provides both a classification rule and an interpretable linear function that reveals which geophysical indicators most strongly discriminate between disruption severities, a finding with direct scientific utility.
The analysis follows a one-factor, between-events design with the following configuration:
power_grid_disruption, a nominal factor with three levels:
flare_intensity - Peak solar flare intensity (flux
units)geomagnetic_index_Kp - Kp geomagnetic storm index (0–9
scale)solar_wind_speed - Solar wind velocity (km/s)solar_wind_density - Solar wind proton density
(p/cm³)flare_duration_minutes - Duration of the solar flare
(minutes)df <- read.csv("solar_storm_impact_dataset.csv")
pred_cols <- c("flare_intensity", "geomagnetic_index_Kp", "solar_wind_speed",
"solar_wind_density", "flare_duration_minutes")
target <- "power_grid_disruption"
df[[target]] <- as.factor(df[[target]])
cat("Dataset dimensions:", nrow(df), "rows x", ncol(df), "columns\n")
cat("Predictor variables:", length(pred_cols), "\n")
cat("Class levels:", paste(levels(df[[target]]), collapse = ", "), "\n\n")
cat("Class distribution:\n")
print(table(df[[target]]))
cat("\nClass proportions:\n")
print(round(prop.table(table(df[[target]])), 4))## Dataset dimensions: 1000 rows x 9 columns
## Predictor variables: 5
## Class levels: 0, 1, 2
##
## Class distribution:
##
## 0 1 2
## 413 453 134
##
## Class proportions:
##
## 0 1 2
## 0.413 0.453 0.134
The dataset contains 1,000 solar storm events with no missing values across any variable. The response variable exhibits a moderately imbalanced three-class distribution: Class 0 (no disruption) accounts for 41.3% of events, Class 1 (moderate disruption) for 45.3%, and Class 2 (severe disruption) for the remaining 13.4%. The class imbalance for severe disruption is expected given the rarity of extreme space weather events. In the LDA framework this is incorporated through prior probabilities, which will be set proportional to observed class frequencies (empirical priors) rather than assuming equal priors, ensuring that the classification rule reflects the real-world base rates of disruption severity.
desc_group <- df %>%
group_by(power_grid_disruption) %>%
summarise(
n = n(),
mean_intensity = round(mean(flare_intensity), 3),
sd_intensity = round(sd(flare_intensity), 3),
mean_kp = round(mean(geomagnetic_index_Kp), 3),
sd_kp = round(sd(geomagnetic_index_Kp), 3),
mean_wind_speed = round(mean(solar_wind_speed), 3),
sd_wind_speed = round(sd(solar_wind_speed), 3),
mean_wind_density = round(mean(solar_wind_density), 3),
sd_wind_density = round(sd(solar_wind_density), 3),
mean_duration = round(mean(flare_duration_minutes), 3),
sd_duration = round(sd(flare_duration_minutes), 3)
)
desc_group %>%
kable(caption = "Group Centroids and Standard Deviations by Disruption Class",
col.names = c("Class", "n",
"Mean Intensity", "SD Intensity",
"Mean Kp", "SD Kp",
"Mean Wind Speed", "SD Wind Speed",
"Mean Wind Density", "SD Wind Density",
"Mean Duration", "SD Duration")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE) %>%
column_spec(1, bold = TRUE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E")| Class | n | Mean Intensity | SD Intensity | Mean Kp | SD Kp | Mean Wind Speed | SD Wind Speed | Mean Wind Density | SD Wind Density | Mean Duration | SD Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 413 | 13.798 | 8.418 | 0.632 | 0.785 | 473.912 | 97.082 | 9.803 | 4.896 | 86.053 | 49.756 |
| 1 | 453 | 47.048 | 16.081 | 1.086 | 0.945 | 519.036 | 98.602 | 9.724 | 4.985 | 90.724 | 51.155 |
| 2 | 134 | 86.860 | 9.201 | 2.179 | 1.061 | 529.312 | 92.582 | 9.937 | 4.535 | 93.321 | 50.601 |
grand_means <- round(colMeans(df[, pred_cols]), 3)
desc_full <- do.call(rbind, lapply(pred_cols, function(v) {
g0 <- df[[v]][df[[target]] == "0"]
g1 <- df[[v]][df[[target]] == "1"]
g2 <- df[[v]][df[[target]] == "2"]
data.frame(
Variable = v,
Grand_Mean = round(mean(df[[v]]), 3),
Mean_Class0 = round(mean(g0), 3),
Mean_Class1 = round(mean(g1), 3),
Mean_Class2 = round(mean(g2), 3),
SD_Class0 = round(sd(g0), 3),
SD_Class1 = round(sd(g1), 3),
SD_Class2 = round(sd(g2), 3)
)
}))
desc_full %>%
kable(caption = "Detailed Descriptive Statistics: All Predictors Across Three Disruption Classes") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE) %>%
column_spec(1, bold = TRUE) %>%
column_spec(2, bold = TRUE, color = "#2C3E50") %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E")| Variable | Grand_Mean | Mean_Class0 | Mean_Class1 | Mean_Class2 | SD_Class0 | SD_Class1 | SD_Class2 |
|---|---|---|---|---|---|---|---|
| flare_intensity | 38.651 | 13.798 | 47.048 | 86.860 | 8.418 | 16.081 | 9.201 |
| geomagnetic_index_Kp | 1.045 | 0.632 | 1.086 | 2.179 | 0.785 | 0.945 | 1.061 |
| solar_wind_speed | 501.777 | 473.912 | 519.036 | 529.312 | 97.082 | 98.602 | 92.582 |
| solar_wind_density | 9.785 | 9.803 | 9.724 | 9.937 | 4.896 | 4.985 | 4.535 |
| flare_duration_minutes | 89.143 | 86.053 | 90.724 | 93.321 | 49.756 | 51.155 | 50.601 |
The group centroid comparison already hints at the discriminant
structure. flare_intensity and
geomagnetic_index_Kp both show a consistent positive
gradient across disruption classes - their means increase monotonically
from Class 0 through Class 2 suggesting these are the primary
discriminating variables. solar_wind_speed and
flare_duration_minutes exhibit more modest inter-class
differences, while solar_wind_density shows mixed
directionality. The question MANOVA and LDA address jointly is whether
the multivariate combination of these five predictors produces
a statistically reliable and geometrically separable discriminant
space.
Linear Discriminant Analysis seeks a linear combination of predictor variables \(\mathbf{x} = (x_1, x_2, \ldots, x_p)'\) that maximally separates \(g\) pre-defined populations \(\pi_1, \pi_2, \ldots, \pi_g\). Each observation is assigned to the population for which the posterior probability is greatest:
\[p(\pi_k \mid \mathbf{x}) = \frac{p_k f_k(\mathbf{x})}{\sum_{j=1}^{g} p_j f_j(\mathbf{x})}, \quad k = 1, 2, \ldots, g\]
where: - \(p_k\) = prior probability of belonging to population \(k\) - \(f_k(\mathbf{x})\) = multivariate density function for population \(k\)
Under the assumption that each population follows a multivariate normal distribution with a common covariance matrix \(\boldsymbol{\Sigma}\):
\[f_k(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_k)\right)\]
the classification rule simplifies to a linear discriminant function, a linear boundary in the \(p\)-dimensional predictor space.
Two core assumptions must hold for LDA to yield optimal (minimum ECM) classification:
Multivariate Normality: Each predictor vector \(\mathbf{x}\) is distributed as multivariate normal within each class \(\mathbf{x} \mid \pi_k \sim N_p(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})\).
Homogeneity of Covariance Matrices: All \(g\) populations share the same variance-covariance matrix \(\boldsymbol{\Sigma}\) formally \(\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_g = \boldsymbol{\Sigma}\). This is tested via Box’s M Test.
If assumption (2) is violated, Quadratic Discriminant Analysis (QDA) which allows class-specific covariance matrices is the appropriate alternative.
Null hypothesis (for the multivariate mean separation): \[H_0: \boldsymbol{\mu}_0 = \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2\]
The mean vectors of all three disruption severity classes are equal across all five predictor dimensions simultaneously.
Alternative hypothesis: \[H_1: \text{At least one } \boldsymbol{\mu}_k \text{ differs from the others}\]
Decision rule (minimum ECM, equal costs and equal priors):
Allocate observation \(\mathbf{x}_0\) to class \(\pi_k\) if: \[p_k f_k(\mathbf{x}_0) > p_i f_i(\mathbf{x}_0) \quad \text{for all } i \neq k\]
For \(g > 2\) populations, Fisher’s approach extracts linear discriminant functions as eigenvectors of the matrix \(\mathbf{W}^{-1}\mathbf{B}\), where:
Between-group SSCP matrix (B): \[\mathbf{B} = \sum_{i=1}^{g} (\bar{\mathbf{x}}_i - \bar{\mathbf{x}})(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})'\]
Within-group SSCP matrix (W): \[\mathbf{W} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)(\mathbf{x}_{ij} - \bar{\mathbf{x}}_i)'\]
The number of discriminant functions extracted is \(s = \min(p, g-1)\). For this analysis with \(p = 5\) predictors and \(g = 3\) classes, \(s = \min(5, 2) = \mathbf{2}\) discriminant functions (LD1 and LD2).
Classification rule: Allocate \(\mathbf{x}_0\) to class \(\pi_k\) if: \[\sum_{j=1}^{s} (\hat{y}_j - \bar{y}_{kj})^2 \leq \sum_{j=1}^{s} (\hat{y}_j - \bar{y}_{ij})^2 \quad \text{for all } i \neq k\]
where \(\hat{y}_j = \hat{\mathbf{a}}_j' \mathbf{x}_0\) and \(\bar{y}_{kj} = \hat{\mathbf{a}}_j' \bar{\mathbf{x}}_k\) that is, the observation is assigned to the group whose centroid is nearest in the discriminant space.
# Mardia's test via manual Mahalanobis distance for each group
# Using Royston's approximation via the MVN package if available,
# otherwise fallback to per-variable Shapiro-Wilk with interpretation note
cat("Univariate Normality per Predictor per Class (Shapiro-Wilk)\n")
cat("Note: Shapiro-Wilk is used as a per-variable screening step.\n",
"For n > 50 per class, the CLT provides robustness against mild non-normality.\n\n")
sw_results <- do.call(rbind, lapply(pred_cols, function(v) {
do.call(rbind, lapply(c("0","1","2"), function(k) {
x <- df[[v]][df[[target]] == k]
sw_test <- shapiro.test(x[1:min(length(x), 5000)])
data.frame(
Predictor = v,
Class = k,
n = length(x),
W = round(sw_test$statistic, 4),
p_value = round(sw_test$p.value, 4),
Normal = ifelse(sw_test$p.value > 0.05, "Yes", "No")
)
}))
}))
sw_results %>%
kable(caption = "Shapiro-Wilk Normality Test: Per Predictor Per Class") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
column_spec(1, bold = TRUE) %>%
column_spec(6, bold = TRUE,
color = ifelse(sw_results$Normal == "Yes", "#1E8449", "#C0392B")) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E")## Univariate Normality per Predictor per Class (Shapiro-Wilk)
## Note: Shapiro-Wilk is used as a per-variable screening step.
## For n > 50 per class, the CLT provides robustness against mild non-normality.
| Predictor | Class | n | W | p_value | Normal | |
|---|---|---|---|---|---|---|
| W | flare_intensity | 0 | 413 | 0.9580 | 0.0000 | No |
| W1 | flare_intensity | 1 | 453 | 0.9859 | 0.0002 | No |
| W2 | flare_intensity | 2 | 134 | 0.9455 | 0.0000 | No |
| W3 | geomagnetic_index_Kp | 0 | 413 | 0.7513 | 0.0000 | No |
| W11 | geomagnetic_index_Kp | 1 | 453 | 0.8576 | 0.0000 | No |
| W21 | geomagnetic_index_Kp | 2 | 134 | 0.9169 | 0.0000 | No |
| W4 | solar_wind_speed | 0 | 413 | 0.9962 | 0.4203 | Yes |
| W12 | solar_wind_speed | 1 | 453 | 0.9981 | 0.8955 | Yes |
| W22 | solar_wind_speed | 2 | 134 | 0.9874 | 0.2580 | Yes |
| W5 | solar_wind_density | 0 | 413 | 0.9965 | 0.5082 | Yes |
| W13 | solar_wind_density | 1 | 453 | 0.9964 | 0.4040 | Yes |
| W23 | solar_wind_density | 2 | 134 | 0.9940 | 0.8488 | Yes |
| W6 | flare_duration_minutes | 0 | 413 | 0.9536 | 0.0000 | No |
| W14 | flare_duration_minutes | 1 | 453 | 0.9500 | 0.0000 | No |
| W24 | flare_duration_minutes | 2 | 134 | 0.9524 | 0.0001 | No |
# Q-Q plots for two most discriminating predictors across classes
qq_plots <- lapply(c("flare_intensity", "geomagnetic_index_Kp"), function(v) {
lapply(c("0","1","2"), function(k) {
x <- df[[v]][df[[target]] == k]
qqdf <- data.frame(
theoretical = qqnorm(x, plot.it = FALSE)$x,
sample = qqnorm(x, plot.it = FALSE)$y
)
ggplot(qqdf, aes(x = theoretical, y = sample)) +
geom_point(alpha = 0.5, size = 1.2, color = c("#E74C3C","#3498DB","#27AE60")[as.integer(k)+1]) +
geom_abline(slope = sd(x), intercept = mean(x),
color = "#2C3E50", linewidth = 1) +
theme_minimal(base_size = 10) +
labs(title = paste0(v, "\n(Class ", k, ")"),
x = "Theoretical Quantiles", y = "Sample Quantiles")
})
})
grid.arrange(
qq_plots[[1]][[1]], qq_plots[[1]][[2]], qq_plots[[1]][[3]],
qq_plots[[2]][[1]], qq_plots[[2]][[2]], qq_plots[[2]][[3]],
ncol = 3,
top = "Normal Q-Q Plots: flare_intensity and geomagnetic_index_Kp by Disruption Class"
)The Shapiro-Wilk results provide a per-variable normality screening. Variables with p > 0.05 at the univariate level are consistent with normality within that class. LDA is known to be fairly robust to moderate departures from multivariate normality, particularly when sample sizes per class are reasonably large (Hair et al., 2019). Given that all three classes have n > 100, mild violations detected here are not expected to substantially inflate misclassification rates.
# Box's M test via manual computation using the biotools package approach
# If biotools is unavailable, we compute the chi-squared approximation manually
tryCatch({
library(biotools)
boxm_result <- boxM(df[, pred_cols], df[[target]])
cat("Box's M Test for Homogeneity of Covariance Matrices\n\n")
print(boxm_result)
cat(sprintf("\nDecision: %s\n",
ifelse(boxm_result$p.value > 0.05,
"Fail to reject H0 - covariance matrices are homogeneous. LDA is appropriate.",
"Reject H0 - covariance matrices differ across classes.\n Consider QDA as an alternative, or proceed with LDA noting reduced optimality.")))
}, error = function(e) {
cat("Box's M test (manual chi-squared approximation):\n\n")
g_levels <- levels(df[[target]])
g_count <- length(g_levels)
p <- length(pred_cols)
cov_list <- lapply(g_levels, function(k) {
cov(df[df[[target]] == k, pred_cols])
})
n_list <- as.numeric(table(df[[target]]))
n_total <- sum(n_list)
# Pooled covariance
S_pooled <- Reduce("+", mapply(function(S, n) (n - 1) * S,
cov_list, n_list, SIMPLIFY = FALSE)) /
(n_total - g_count)
# Box M statistic
M <- (n_total - g_count) * log(det(S_pooled)) -
sum(mapply(function(S, n) (n - 1) * log(det(S)), cov_list, n_list))
df_box <- 0.5 * p * (p + 1) * (g_count - 1)
u <- sum(1 / (n_list - 1)) - 1 / (n_total - g_count)
c1 <- u * (2 * p^2 + 3*p - 1) / (6 * (p + 1) * (g_count - 1))
chi2 <- M * (1 - c1)
p_val <- pchisq(chi2, df_box, lower.tail = FALSE)
cat(sprintf("Box's M statistic : %.4f\n", M))
cat(sprintf("Chi-squared approx : %.4f\n", chi2))
cat(sprintf("Degrees of freedom : %d\n", df_box))
cat(sprintf("p-value : %.4f\n", p_val))
cat(sprintf("\nDecision: %s\n",
ifelse(p_val > 0.05,
"Fail to reject H0 - covariance matrices are homogeneous. LDA is appropriate.",
"Reject H0 - covariance matrices differ. Consider QDA as alternative.")))
})## Box's M Test for Homogeneity of Covariance Matrices
##
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: df[, pred_cols]
## Chi-Sq (approx.) = 239.3, df = 30, p-value < 2.2e-16
##
##
## Decision: Reject H0 - covariance matrices differ across classes.
## Consider QDA as an alternative, or proceed with LDA noting reduced optimality.
# Visualise covariance structure per class
g_levels <- levels(df[[target]])
class_labels <- c("Class 0 - No Disruption", "Class 1 - Moderate", "Class 2 - Severe")
cov_plots <- lapply(seq_along(g_levels), function(i) {
k <- g_levels[i]
cmat <- round(cov(df[df[[target]] == k, pred_cols]), 2)
short_names <- c("Intensity","Kp","WindSpd","WindDen","Duration")
rownames(cmat) <- colnames(cmat) <- short_names
cmat_df <- as.data.frame(as.table(cmat))
colnames(cmat_df) <- c("Row","Col","Value")
ggplot(cmat_df, aes(x = Col, y = Row, fill = Value)) +
geom_tile(color = "white", linewidth = 0.6) +
geom_text(aes(label = round(Value, 1)), size = 3, fontface = "bold") +
scale_fill_gradient2(low = "#EBF5FB", mid = "#AED6F1", high = "#1A5276",
midpoint = median(cmat_df$Value)) +
theme_minimal(base_size = 10) +
theme(axis.text.x = element_text(angle = 30, hjust = 1),
plot.title = element_text(hjust = 0.5, face = "bold", size = 11),
legend.position = "none") +
labs(title = class_labels[i], x = "", y = "")
})
grid.arrange(grobs = cov_plots, ncol = 3,
top = "Within-Class Covariance Matrices (Heatmap) - Testing Homogeneity")Y <- as.matrix(df[, pred_cols])
groups <- df[[target]]
g_levels <- levels(groups)
g <- length(g_levels)
n_total <- nrow(Y)
p <- ncol(Y)
grand_mean <- colMeans(Y)
group_means <- t(sapply(g_levels, function(k) colMeans(Y[groups == k, ])))
cat("Grand Mean Vector:\n")
print(round(grand_mean, 4))
cat("\nGroup Centroid Vectors:\n")
print(round(group_means, 4))
cat("\nCentroid Displacement (Class 2 - Class 0):\n")
print(round(group_means["2",] - group_means["0",], 4))## Grand Mean Vector:
## flare_intensity geomagnetic_index_Kp solar_wind_speed
## 38.6509 1.0450 501.7770
## solar_wind_density flare_duration_minutes
## 9.7852 89.1430
##
## Group Centroid Vectors:
## flare_intensity geomagnetic_index_Kp solar_wind_speed solar_wind_density
## 0 13.7985 0.6320 473.9123 9.8029
## 1 47.0485 1.0861 519.0363 9.7242
## 2 86.8596 2.1791 529.3117 9.9373
## flare_duration_minutes
## 0 86.0533
## 1 90.7241
## 2 93.3209
##
## Centroid Displacement (Class 2 - Class 0):
## flare_intensity geomagnetic_index_Kp solar_wind_speed
## 73.0611 1.5471 55.3994
## solar_wind_density flare_duration_minutes
## 0.1344 7.2676
B <- Reduce("+", lapply(g_levels, function(k) {
n_k <- sum(groups == k)
diff <- colMeans(Y[groups == k, ]) - grand_mean
n_k * outer(diff, diff)
}))
rownames(B) <- colnames(B) <- pred_cols
cat("Between-Group SSCP Matrix (B), df_B =", g - 1, "\n\n")
print(round(B, 4))
cat("\nTrace(B) =", round(sum(diag(B)), 4), "\n")
cat("det(B) =", formatC(det(B), format = "e", digits = 4), "\n")## Between-Group SSCP Matrix (B), df_B = 2
##
## flare_intensity geomagnetic_index_Kp solar_wind_speed
## flare_intensity 598458.4219 11722.0453 529534.1013
## geomagnetic_index_Kp 11722.0453 243.5730 9259.0249
## solar_wind_speed 529534.1013 9259.0249 557204.4958
## solar_wind_density 568.4264 18.9558 -120.1078
## flare_duration_minutes 64716.7642 1191.4081 63333.4182
## solar_wind_density flare_duration_minutes
## flare_intensity 568.4264 64716.7642
## geomagnetic_index_Kp 18.9558 1191.4081
## solar_wind_speed -120.1078 63333.4182
## solar_wind_density 4.9188 18.8094
## flare_duration_minutes 18.8094 7414.0137
##
## Trace(B) = 1163325
## det(B) = -3.4714e-30
W <- Reduce("+", lapply(g_levels, function(k) {
Y_k <- Y[groups == k, ]
mu_k <- colMeans(Y_k)
swept <- sweep(Y_k, 2, mu_k)
t(swept) %*% swept
}))
rownames(W) <- colnames(W) <- pred_cols
df_W <- n_total - g
cat("Within-Group SSCP Matrix (W), df_W =", df_W, "\n\n")
print(round(W, 4))
cat("\nTrace(W) =", round(sum(diag(W)), 4), "\n")
cat("det(W) =", formatC(det(W), format = "e", digits = 4), "\n")## Within-Group SSCP Matrix (W), df_W = 997
##
## flare_intensity geomagnetic_index_Kp solar_wind_speed
## flare_intensity 157338.2118 3198.644 -399863.331
## geomagnetic_index_Kp 3198.6437 807.402 5518.109
## solar_wind_speed -399863.3307 5518.109 9417605.604
## solar_wind_density -785.7264 209.978 -7190.522
## flare_duration_minutes -119235.9243 -3300.843 -219891.804
## solar_wind_density flare_duration_minutes
## flare_intensity -785.7264 -119235.924
## geomagnetic_index_Kp 209.9780 -3300.843
## solar_wind_speed -7190.5217 -219891.804
## solar_wind_density 23840.4700 -3660.960
## flare_duration_minutes -3660.9602 2543298.537
##
## Trace(W) = 12142890
## det(W) = 5.4694e+25
T_mat <- B + W
df_T <- n_total - 1
rownames(T_mat) <- colnames(T_mat) <- pred_cols
cat("Total SSCP Matrix (T = B + W), df_T =", df_T, "\n\n")
print(round(T_mat, 4))
cat("\nTrace(T) =", round(sum(diag(T_mat)), 4), "\n")
cat("det(T) =", formatC(det(T_mat), format = "e", digits = 4), "\n")
cat("\nVerification - Wilks' Lambda (det(W)/det(T)):\n")
wilks_manual <- det(W) / det(T_mat)
cat(sprintf(" Lambda = %.6f\n", wilks_manual))## Total SSCP Matrix (T = B + W), df_T = 999
##
## flare_intensity geomagnetic_index_Kp solar_wind_speed
## flare_intensity 755796.63 14920.6891 129670.77
## geomagnetic_index_Kp 14920.69 1050.9750 14777.13
## solar_wind_speed 129670.77 14777.1337 9974810.10
## solar_wind_density -217.30 228.9337 -7310.63
## flare_duration_minutes -54519.16 -2109.4350 -156558.39
## solar_wind_density flare_duration_minutes
## flare_intensity -217.3000 -54519.160
## geomagnetic_index_Kp 228.9337 -2109.435
## solar_wind_speed -7310.6295 -156558.385
## solar_wind_density 23845.3887 -3642.151
## flare_duration_minutes -3642.1507 2550712.551
##
## Trace(T) = 13306216
## det(T) = 3.3700e+26
##
## Verification - Wilks' Lambda (det(W)/det(T)):
## Lambda = 0.162296
short_names <- c("Intensity","Kp","WindSpd","WindDen","Duration")
plot_sscp <- function(mat, title_text, col_high) {
m <- mat
rownames(m) <- colnames(m) <- short_names
mdf <- as.data.frame(as.table(m))
colnames(mdf) <- c("Row","Col","Value")
ggplot(mdf, aes(x = Col, y = Row, fill = Value)) +
geom_tile(color = "white", linewidth = 0.8) +
geom_text(aes(label = round(Value, 0)), size = 3, fontface = "bold") +
scale_fill_gradient2(low = "#EBF5FB", mid = "#AED6F1",
high = col_high,
midpoint = median(mdf$Value)) +
theme_minimal(base_size = 10) +
theme(axis.text.x = element_text(angle = 30, hjust = 1),
plot.title = element_text(hjust = 0.5, face = "bold", size = 11),
legend.position = "right") +
labs(title = title_text, x = "", y = "", fill = "Value")
}
grid.arrange(
plot_sscp(B, "Between-Group SSCP (B)", "#1A5276"),
plot_sscp(W, "Within-Group SSCP (W)", "#1E8449"),
plot_sscp(T_mat, "Total SSCP (T = B + W)", "#6E2F6E"),
ncol = 3
)The SSCP decomposition provides the algebraic foundation for both
Wilks’ Lambda and the Fisher discriminant functions. The between-group
matrix B captures the signal how much of the total
variation in predictor space is attributable to class membership. The
within-group matrix W captures noise residual variation
within each class. The core inferential question is whether the
signal-to-noise ratio encoded in W⁻¹B is large enough
to warrant rejecting the null hypothesis of equal class centroids. Large
diagonal entries in B relative to W
for flare_intensity and geomagnetic_index_Kp
suggest these variables will dominate the discriminant functions.
WinvB <- solve(W) %*% B
eig_full <- eigen(WinvB)
eigenvals <- Re(eig_full$values)
eigenvals <- eigenvals[eigenvals > 1e-10]
cat("Eigenvalues of W^{-1}B\n\n")
ev_df <- data.frame(
LD = paste0("LD", seq_along(eigenvals)),
Eigenvalue = round(eigenvals, 6),
Canonical_R = round(sqrt(eigenvals / (1 + eigenvals)), 6),
Pct_Var = round(eigenvals / sum(eigenvals) * 100, 2),
Cum_Pct = round(cumsum(eigenvals / sum(eigenvals)) * 100, 2)
)
print(ev_df)
wilks_lambda <- prod(1 / (1 + eigenvals))
pillai_trace <- sum(eigenvals / (1 + eigenvals))
hotelling_T <- sum(eigenvals)
roy_root <- max(eigenvals) / (1 + max(eigenvals))
cat("\nMultivariate Statistics\n")
cat(sprintf("Wilks' Lambda : %.6f\n", wilks_lambda))
cat(sprintf("Pillai's Trace : %.6f\n", pillai_trace))
cat(sprintf("Hotelling-Lawley : %.6f\n", hotelling_T))
cat(sprintf("Roy's Largest Root : %.6f\n", roy_root))## Eigenvalues of W^{-1}B
##
## LD Eigenvalue Canonical_R Pct_Var Cum_Pct
## 1 LD1 4.969439 0.912403 99.36 99.36
## 2 LD2 0.032186 0.176584 0.64 100.00
##
## Multivariate Statistics
## Wilks' Lambda : 0.162296
## Pillai's Trace : 0.863662
## Hotelling-Lawley : 5.001624
## Roy's Largest Root : 0.832480
# Rao's F approximation for g=3, p=5
s <- min(p, g - 1)
ms <- n_total - 1 - (p + g) / 2
t_val <- sqrt((p^2 * (g-1)^2 - 4) / (p^2 + (g-1)^2 - 5))
if(is.nan(t_val)) t_val <- 1
df1_wilks <- p * (g - 1)
df2_wilks <- ms * t_val - (p * (g-1) - 2) / 2
F_wilks <- ((1 - wilks_lambda^(1/t_val)) / wilks_lambda^(1/t_val)) * (df2_wilks / df1_wilks)
p_wilks <- pf(F_wilks, df1_wilks, df2_wilks, lower.tail = FALSE)
cat("Rao's F Approximation for Wilks' Lambda\n\n")
cat(sprintf("Wilks' Lambda : %.6f\n", wilks_lambda))
cat(sprintf("F statistic : %.4f\n", F_wilks))
cat(sprintf("df numerator : %.0f\n", df1_wilks))
cat(sprintf("df denominator : %.2f\n", df2_wilks))
cat(sprintf("p-value : %.2e\n", p_wilks))
cat(sprintf("\nDecision at alpha = 0.05 : %s\n",
ifelse(p_wilks < 0.05,
"REJECT H0 - class centroids are significantly different",
"Fail to reject H0")))## Rao's F Approximation for Wilks' Lambda
##
## Wilks' Lambda : 0.162296
## F statistic : 294.3750
## df numerator : 10
## df denominator : 1986.00
## p-value : 0.00e+00
##
## Decision at alpha = 0.05 : REJECT H0 - class centroids are significantly different
manova_model <- manova(
cbind(flare_intensity, geomagnetic_index_Kp, solar_wind_speed,
solar_wind_density, flare_duration_minutes) ~ power_grid_disruption,
data = df
)
cat("MANOVA Summary: Four Multivariate Test Statistics\n\n")
for (test_nm in c("Pillai", "Wilks", "Hotelling-Lawley", "Roy")) {
cat(sprintf("[ %s ]\n", test_nm))
print(summary(manova_model, test = test_nm))
cat("\n")
}## MANOVA Summary: Four Multivariate Test Statistics
##
## [ Pillai ]
## Df Pillai approx F num Df den Df Pr(>F)
## power_grid_disruption 2 0.86366 151.1 10 1988 < 2.2e-16 ***
## Residuals 997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## [ Wilks ]
## Df Wilks approx F num Df den Df Pr(>F)
## power_grid_disruption 2 0.1623 294.38 10 1986 < 2.2e-16 ***
## Residuals 997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## [ Hotelling-Lawley ]
## Df Hotelling-Lawley approx F num Df den Df Pr(>F)
## power_grid_disruption 2 5.0016 496.16 10 1984 < 2.2e-16 ***
## Residuals 997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## [ Roy ]
## Df Roy approx F num Df den Df Pr(>F)
## power_grid_disruption 2 4.9694 987.92 5 994 < 2.2e-16 ***
## Residuals 997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m_stats <- list(
Pillai = summary(manova_model, test = "Pillai")$stats,
Wilks = summary(manova_model, test = "Wilks")$stats,
Hotelling = summary(manova_model, test = "Hotelling-Lawley")$stats,
Roy = summary(manova_model, test = "Roy")$stats
)
manova_tbl <- data.frame(
Statistic = c("Pillai's Trace","Wilks' Lambda",
"Hotelling-Lawley Trace","Roy's Largest Root"),
Value = round(c(pillai_trace, wilks_lambda, hotelling_T, roy_root), 6),
Approx_F = round(c(m_stats$Pillai[1,"approx F"],
m_stats$Wilks[1,"approx F"],
m_stats$Hotelling[1,"approx F"],
m_stats$Roy[1,"approx F"]), 4),
df_num = c(m_stats$Pillai[1,"num Df"],
m_stats$Wilks[1,"num Df"],
m_stats$Hotelling[1,"num Df"],
m_stats$Roy[1,"num Df"]),
df_den = round(c(m_stats$Pillai[1,"den Df"],
m_stats$Wilks[1,"den Df"],
m_stats$Hotelling[1,"den Df"],
m_stats$Roy[1,"den Df"]), 2),
p_value = formatC(c(m_stats$Pillai[1,"Pr(>F)"],
m_stats$Wilks[1,"Pr(>F)"],
m_stats$Hotelling[1,"Pr(>F)"],
m_stats$Roy[1,"Pr(>F)"]),
format = "e", digits = 3),
Decision = rep("Reject H0", 4)
)
manova_tbl %>%
kable(caption = "Multivariate Test Statistics - LDA Group Separation",
col.names = c("Test Statistic","Value","Approx. F",
"df (num)","df (den)","p-value","Decision")) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
column_spec(1, bold = TRUE) %>%
column_spec(7, bold = TRUE, color = "#C0392B") %>%
row_spec(0, bold = TRUE, color = "white", background = "#2C3E50") %>%
row_spec(2, background = "#D5F5E3")| Test Statistic | Value | Approx. F | df (num) | df (den) | p-value | Decision |
|---|---|---|---|---|---|---|
| Pillai’s Trace | 0.863662 | 151.0959 | 10 | 1988 | 1.282e-235 | Reject H0 |
| Wilks’ Lambda | 0.162296 | 294.3750 | 10 | 1986 | 0.000e+00 | Reject H0 |
| Hotelling-Lawley Trace | 5.001624 | 496.1611 | 10 | 1984 | 0.000e+00 | Reject H0 |
| Roy’s Largest Root | 0.832480 | 987.9244 | 5 | 994 | 0.000e+00 | Reject H0 |
ev_df %>%
kable(caption = "Eigenvalue Structure of W^{-1}B: Discriminant Function Summary",
col.names = c("Function","Eigenvalue","Canonical R",
"% Variance","Cumulative %")) %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
column_spec(1, bold = TRUE) %>%
column_spec(4, bold = TRUE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E") %>%
row_spec(1, background = "#EBF5FB")| Function | Eigenvalue | Canonical R | % Variance | Cumulative % |
|---|---|---|---|---|
| LD1 | 4.969439 | 0.912403 | 99.36 | 99.36 |
| LD2 | 0.032186 | 0.176584 | 0.64 | 100.00 |
lda_model <- lda(power_grid_disruption ~ flare_intensity + geomagnetic_index_Kp +
solar_wind_speed + solar_wind_density + flare_duration_minutes,
data = df)
cat("LDA Model Summary\n\n")
cat("Prior probabilities (empirical):\n")
print(round(lda_model$prior, 4))
cat("\nGroup means (centroids in predictor space):\n")
print(round(lda_model$means, 4))
cat("\nScalings - Discriminant Function Coefficients:\n")
print(round(lda_model$scaling, 6))
cat("\nProportion of trace (variance explained per LD):\n")
pct_trace <- lda_model$svd^2 / sum(lda_model$svd^2)
print(round(pct_trace, 4))## LDA Model Summary
##
## Prior probabilities (empirical):
## 0 1 2
## 0.413 0.453 0.134
##
## Group means (centroids in predictor space):
## flare_intensity geomagnetic_index_Kp solar_wind_speed solar_wind_density
## 0 13.7985 0.6320 473.9123 9.8029
## 1 47.0485 1.0861 519.0363 9.7242
## 2 86.8596 2.1791 529.3117 9.9373
## flare_duration_minutes
## 0 86.0533
## 1 90.7241
## 2 93.3209
##
## Scalings - Discriminant Function Coefficients:
## LD1 LD2
## flare_intensity 0.087995 -0.013551
## geomagnetic_index_Kp -0.100189 0.997005
## solar_wind_speed 0.004956 -0.006393
## solar_wind_density 0.006432 0.004261
## flare_duration_minutes 0.004902 -0.001139
##
## Proportion of trace (variance explained per LD):
## [1] 0.9936 0.0064
coef_df <- as.data.frame(round(lda_model$scaling, 6))
coef_df$Predictor <- rownames(coef_df)
rownames(coef_df) <- NULL
coef_df <- coef_df[, c("Predictor","LD1","LD2")]
coef_df %>%
kable(caption = "Linear Discriminant Function Coefficients (Unstandardised Scalings)") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
column_spec(1, bold = TRUE) %>%
column_spec(2:3, bold = TRUE) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E")| Predictor | LD1 | LD2 |
|---|---|---|
| flare_intensity | 0.087995 | -0.013551 |
| geomagnetic_index_Kp | -0.100189 | 0.997005 |
| solar_wind_speed | 0.004956 | -0.006393 |
| solar_wind_density | 0.006432 | 0.004261 |
| flare_duration_minutes | 0.004902 | -0.001139 |
coef_long <- coef_df %>%
pivot_longer(cols = c("LD1","LD2"), names_to = "Function", values_to = "Coefficient")
ggplot(coef_long, aes(x = reorder(Predictor, abs(Coefficient)),
y = Coefficient,
fill = Coefficient > 0)) +
geom_col(width = 0.55, color = "white") +
geom_hline(yintercept = 0, linewidth = 0.8, color = "#2C3E50") +
coord_flip() +
facet_wrap(~Function, scales = "free_x") +
scale_fill_manual(values = c("#E74C3C","#3498DB"),
labels = c("Negative loading","Positive loading")) +
theme_minimal(base_size = 12) +
theme(legend.position = "bottom",
strip.background = element_rect(fill = "#34495E"),
strip.text = element_text(color = "white", face = "bold"),
plot.title = element_text(hjust = 0.5, face = "bold")) +
labs(title = "Discriminant Function Coefficients - LD1 and LD2",
subtitle = "Relative contribution of each predictor to each discriminant dimension",
x = "", y = "Coefficient", fill = "Direction")lda_pred <- predict(lda_model)
df$LD1 <- lda_pred$x[, 1]
df$LD2 <- lda_pred$x[, 2]
ld_centroids <- df %>%
group_by(power_grid_disruption) %>%
summarise(
n = n(),
Mean_LD1 = round(mean(LD1), 4),
Mean_LD2 = round(mean(LD2), 4),
SD_LD1 = round(sd(LD1), 4),
SD_LD2 = round(sd(LD2), 4)
)
cat("Group Centroids in Discriminant Space:\n\n")
print(ld_centroids)
cat("\nSeparation between Class 0 and Class 2 on LD1:\n")
sep_ld1 <- abs(ld_centroids$Mean_LD1[3] - ld_centroids$Mean_LD1[1])
cat(sprintf(" |mu_LD1(Class 2) - mu_LD1(Class 0)| = %.4f\n", sep_ld1))## Group Centroids in Discriminant Space:
##
## # A tibble: 3 × 6
## power_grid_disruption n Mean_LD1 Mean_LD2 SD_LD1 SD_LD2
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 413 -2.30 0.107 0.772 0.901
## 2 1 453 0.828 -0.185 1.23 1.05
## 3 2 134 4.29 0.297 0.697 1.10
##
## Separation between Class 0 and Class 2 on LD1:
## |mu_LD1(Class 2) - mu_LD1(Class 0)| = 6.5850
class_cols <- c("0" = "#3498DB", "1" = "#E67E22", "2" = "#E74C3C")
class_labs <- c("0" = "Class 0 - No Disruption",
"1" = "Class 1 - Moderate",
"2" = "Class 2 - Severe")
ggplot(df, aes(x = LD1, fill = power_grid_disruption)) +
geom_histogram(bins = 45, alpha = 0.65, position = "identity", color = "white") +
geom_vline(data = ld_centroids,
aes(xintercept = Mean_LD1, color = power_grid_disruption),
linewidth = 1.4, linetype = "dashed") +
scale_fill_manual(values = class_cols, labels = class_labs) +
scale_color_manual(values = class_cols, labels = class_labs) +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "bottom") +
labs(title = "Distribution of LD1 Scores by Disruption Class",
subtitle = "Dashed lines = group centroids on LD1",
x = "First Linear Discriminant Score (LD1)",
y = "Count", fill = "Class", color = "Class")centroid_df <- as.data.frame(ld_centroids)
ggplot(df, aes(x = LD1, y = LD2, color = power_grid_disruption)) +
geom_point(alpha = 0.35, size = 1.4) +
stat_ellipse(level = 0.90, linewidth = 1.3, aes(group = power_grid_disruption)) +
geom_point(data = centroid_df,
aes(x = Mean_LD1, y = Mean_LD2),
shape = 18, size = 6, color = "black") +
geom_text(data = centroid_df,
aes(x = Mean_LD1, y = Mean_LD2,
label = paste0("C", power_grid_disruption)),
vjust = -1.2, fontface = "bold", size = 4.5, color = "black") +
scale_color_manual(values = class_cols, labels = class_labs) +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "bottom") +
labs(title = "Two-Dimensional Discriminant Space (LD1 × LD2)",
subtitle = "Diamonds = group centroids; ellipses = 90% concentration regions",
x = paste0("LD1 (", round(pct_trace[1]*100, 1), "% of between-group variance)"),
y = paste0("LD2 (", round(pct_trace[2]*100, 1), "% of between-group variance)"),
color = "Disruption Class")df_long <- df %>%
select(power_grid_disruption, all_of(pred_cols)) %>%
pivot_longer(cols = all_of(pred_cols),
names_to = "Variable", values_to = "Value") %>%
mutate(Variable = factor(Variable, levels = pred_cols))
ggplot(df_long, aes(x = power_grid_disruption, y = Value,
fill = power_grid_disruption)) +
geom_boxplot(alpha = 0.72, outlier.shape = 21, outlier.size = 1.4,
outlier.alpha = 0.5) +
geom_jitter(width = 0.15, alpha = 0.12, size = 0.6) +
stat_summary(fun = mean, geom = "point", shape = 23, size = 4,
fill = "white", color = "#2C3E50") +
scale_fill_manual(values = class_cols, labels = class_labs) +
facet_wrap(~Variable, scales = "free_y", ncol = 3) +
theme_minimal(base_size = 11) +
theme(strip.background = element_rect(fill = "#2C3E50"),
strip.text = element_text(color = "white", face = "bold"),
legend.position = "bottom",
plot.title = element_text(hjust = 0.5, face = "bold", size = 13)) +
labs(title = "Predictor Distribution by Disruption Class (Diamond = Group Mean)",
x = "Disruption Class", y = "Value", fill = "Class")var_pairs <- list(
c("flare_intensity", "geomagnetic_index_Kp"),
c("flare_intensity", "solar_wind_speed"),
c("geomagnetic_index_Kp", "solar_wind_speed"),
c("solar_wind_density", "flare_duration_minutes")
)
sc_plots <- lapply(var_pairs, function(vp) {
ggplot(df, aes_string(x = vp[1], y = vp[2],
color = "power_grid_disruption")) +
geom_point(alpha = 0.35, size = 1.1) +
stat_ellipse(level = 0.90, linewidth = 1.2) +
scale_color_manual(values = class_cols) +
theme_minimal(base_size = 10) +
theme(legend.position = "none") +
labs(x = vp[1], y = vp[2])
})
wrap_plots(sc_plots, ncol = 2) +
plot_annotation(
title = "Bivariate Scatterplots with 90% Concentration Ellipses by Disruption Class",
subtitle = "Blue = Class 0, Orange = Class 1, Red = Class 2",
theme = theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 13))
)The bivariate scatterplots reveal the geometric structure of the
three-class problem. The concentration ellipses for Classes 0 and 1
overlap substantially in most predictor pairs, indicating that moderate
disruption events are difficult to separate from non-disruption events
on any single pair of predictors alone. Class 2 (severe disruption, red)
shows the most distinct positioning - particularly in the
flare_intensity × geomagnetic_index_Kp plane - consistent
with the finding that these two variables carry the highest discriminant
loading. This visual evidence motivates the multivariate approach: no
single predictor pair achieves clean separation, but the joint
five-dimensional discriminant space achieves substantially better group
resolution.
lda_pred <- predict(lda_model)
conf_mat <- table(Actual = df[[target]],
Predicted = lda_pred$class)
cat("Confusion Matrix (Training Sample)\n\n")
print(conf_mat)
n_correct <- sum(diag(conf_mat))
n_total_c <- sum(conf_mat)
n_misclass <- n_total_c - n_correct
aper <- n_misclass / n_total_c
accuracy <- 1 - aper
cat(sprintf("\nTotal observations : %d\n", n_total_c))
cat(sprintf("Correctly classified : %d\n", n_correct))
cat(sprintf("Misclassified : %d\n", n_misclass))
cat(sprintf("APER (Error Rate) : %.4f (%.2f%%)\n", aper, aper * 100))
cat(sprintf("Overall Accuracy : %.4f (%.2f%%)\n", accuracy, accuracy * 100))
cat("\nPer-class accuracy:\n")
for (k in rownames(conf_mat)) {
acc_k <- conf_mat[k, k] / sum(conf_mat[k, ])
cat(sprintf(" Class %s : %d / %d = %.2f%%\n",
k, conf_mat[k, k], sum(conf_mat[k,]), acc_k * 100))
}## Confusion Matrix (Training Sample)
##
## Predicted
## Actual 0 1 2
## 0 413 0 0
## 1 40 390 23
## 2 0 0 134
##
## Total observations : 1000
## Correctly classified : 937
## Misclassified : 63
## APER (Error Rate) : 0.0630 (6.30%)
## Overall Accuracy : 0.9370 (93.70%)
##
## Per-class accuracy:
## Class 0 : 413 / 413 = 100.00%
## Class 1 : 390 / 453 = 86.09%
## Class 2 : 134 / 134 = 100.00%
conf_df <- as.data.frame(conf_mat)
conf_df$Match <- conf_df$Actual == conf_df$Predicted
conf_mat_tbl <- as.data.frame.matrix(conf_mat)
conf_mat_tbl$Actual <- rownames(conf_mat_tbl)
conf_mat_tbl <- conf_mat_tbl[, c("Actual", colnames(conf_mat))]
rownames(conf_mat_tbl) <- NULL
conf_mat_tbl %>%
kable(caption = "Confusion Matrix: Actual vs. Predicted Disruption Class (Training Sample)") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
column_spec(1, bold = TRUE, background = "#ECF0F1") %>%
row_spec(0, bold = TRUE, color = "white", background = "#2C3E50") %>%
add_header_above(c(" " = 1, "Predicted Class" = 3))| Actual | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 413 | 0 | 0 |
| 1 | 40 | 390 | 23 |
| 2 | 0 | 0 | 134 |
conf_long <- as.data.frame(conf_mat)
colnames(conf_long) <- c("Actual","Predicted","Count")
conf_long$Type <- ifelse(conf_long$Actual == conf_long$Predicted,
"Correct", "Misclassified")
ggplot(conf_long, aes(x = Predicted, y = Actual, fill = Count)) +
geom_tile(color = "white", linewidth = 1) +
geom_text(aes(label = Count, color = Type),
size = 7, fontface = "bold") +
scale_fill_gradient(low = "#EBF5FB", high = "#1A5276") +
scale_color_manual(values = c("Correct" = "#1E8449",
"Misclassified" = "#E74C3C"),
guide = "none") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text = element_text(size = 12, face = "bold")) +
labs(title = sprintf("Confusion Matrix Heatmap - APER = %.2f%% | Accuracy = %.2f%%",
aper * 100, accuracy * 100),
x = "Predicted Class", y = "Actual Class", fill = "Count")lda_cv <- lda(power_grid_disruption ~ flare_intensity + geomagnetic_index_Kp +
solar_wind_speed + solar_wind_density + flare_duration_minutes,
data = df, CV = TRUE)
conf_cv <- table(Actual = df[[target]],
Predicted = lda_cv$class)
n_correct_cv <- sum(diag(conf_cv))
aper_cv <- 1 - n_correct_cv / sum(conf_cv)
accuracy_cv <- 1 - aper_cv
cat("Leave-One-Out Cross-Validation (LOO-CV)\n\n")
print(conf_cv)
cat(sprintf("\nLOO-CV Accuracy : %.4f (%.2f%%)\n", accuracy_cv, accuracy_cv * 100))
cat(sprintf("LOO-CV Error Rate (AER): %.4f (%.2f%%)\n", aper_cv, aper_cv * 100))
cat(sprintf("\nAPER vs AER comparison:\n"))
cat(sprintf(" APER (training) = %.4f - optimistic estimate (uses training data)\n", aper))
cat(sprintf(" AER (LOO-CV) = %.4f - honest estimate (hold-one-out)\n", aper_cv))
cat(sprintf(" Optimism bias = %.4f (APER - AER)\n", aper - aper_cv))## Leave-One-Out Cross-Validation (LOO-CV)
##
## Predicted
## Actual 0 1 2
## 0 413 0 0
## 1 41 389 23
## 2 0 0 134
##
## LOO-CV Accuracy : 0.9360 (93.60%)
## LOO-CV Error Rate (AER): 0.0640 (6.40%)
##
## APER vs AER comparison:
## APER (training) = 0.0630 - optimistic estimate (uses training data)
## AER (LOO-CV) = 0.0640 - honest estimate (hold-one-out)
## Optimism bias = -0.0010 (APER - AER)
conf_cv_tbl <- as.data.frame.matrix(conf_cv)
conf_cv_tbl$Actual <- rownames(conf_cv_tbl)
conf_cv_tbl <- conf_cv_tbl[, c("Actual", colnames(conf_cv))]
rownames(conf_cv_tbl) <- NULL
conf_cv_tbl %>%
kable(caption = sprintf("LOO-CV Confusion Matrix - AER = %.2f%% | CV Accuracy = %.2f%%",
aper_cv * 100, accuracy_cv * 100)) %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
column_spec(1, bold = TRUE, background = "#ECF0F1") %>%
row_spec(0, bold = TRUE, color = "white", background = "#2C3E50") %>%
add_header_above(c(" " = 1, "Predicted Class (CV)" = 3))| Actual | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 413 | 0 | 0 |
| 1 | 41 | 389 | 23 |
| 2 | 0 | 0 | 134 |
# QDA robustness check — run because Box's M rejected H0 (covariance heterogeneity)
# QDA relaxes the equal-covariance assumption and fits a class-specific Sigma_k
cat("=== QDA Robustness Check (motivated by Box's M rejection) ===\n\n")
qda_model <- qda(power_grid_disruption ~ flare_intensity + geomagnetic_index_Kp +
solar_wind_speed + solar_wind_density + flare_duration_minutes,
data = df)
# Training (apparent) accuracy
qda_pred <- predict(qda_model)
conf_qda <- table(Actual = df[[target]], Predicted = qda_pred$class)
acc_qda <- sum(diag(conf_qda)) / sum(conf_qda)
aper_qda <- 1 - acc_qda
cat("QDA Confusion Matrix (Training):\n")
print(conf_qda)
cat(sprintf("\nQDA Training Accuracy : %.4f (%.2f%%)\n", acc_qda, acc_qda * 100))
cat(sprintf("QDA APER : %.4f (%.2f%%)\n\n", aper_qda, aper_qda * 100))
# LOO-CV accuracy for QDA
qda_cv <- qda(power_grid_disruption ~ flare_intensity + geomagnetic_index_Kp +
solar_wind_speed + solar_wind_density + flare_duration_minutes,
data = df, CV = TRUE)
conf_qda_cv <- table(Actual = df[[target]], Predicted = qda_cv$class)
acc_qda_cv <- sum(diag(conf_qda_cv)) / sum(conf_qda_cv)
aper_qda_cv <- 1 - acc_qda_cv
cat("QDA Confusion Matrix (LOO-CV):\n")
print(conf_qda_cv)
cat(sprintf("\nQDA LOO-CV Accuracy : %.4f (%.2f%%)\n", acc_qda_cv, acc_qda_cv * 100))
cat(sprintf("QDA AER : %.4f (%.2f%%)\n\n", aper_qda_cv, aper_qda_cv * 100))
cat("--- Comparison: LDA vs QDA (LOO-CV) ---\n")
cat(sprintf("LDA LOO-CV Accuracy : %.2f%%\n", accuracy_cv * 100))
cat(sprintf("QDA LOO-CV Accuracy : %.2f%%\n", acc_qda_cv * 100))
cat(sprintf("\nConclusion: %s performs better under cross-validation.\n",
ifelse(acc_qda_cv > accuracy_cv, "QDA", "LDA")))
cat(sprintf("Note: Despite Box's M rejection, %s remains competitive,\n",
ifelse(acc_qda_cv > accuracy_cv, "QDA", "LDA")))
cat("suggesting the covariance heterogeneity does not substantially alter classification.\n")## === QDA Robustness Check (motivated by Box's M rejection) ===
##
## QDA Confusion Matrix (Training):
## Predicted
## Actual 0 1 2
## 0 405 8 0
## 1 21 429 3
## 2 0 5 129
##
## QDA Training Accuracy : 0.9630 (96.30%)
## QDA APER : 0.0370 (3.70%)
##
## QDA Confusion Matrix (LOO-CV):
## Predicted
## Actual 0 1 2
## 0 404 9 0
## 1 23 427 3
## 2 0 8 126
##
## QDA LOO-CV Accuracy : 0.9570 (95.70%)
## QDA AER : 0.0430 (4.30%)
##
## --- Comparison: LDA vs QDA (LOO-CV) ---
## LDA LOO-CV Accuracy : 93.60%
## QDA LOO-CV Accuracy : 95.70%
##
## Conclusion: QDA performs better under cross-validation.
## Note: Despite Box's M rejection, QDA remains competitive,
## suggesting the covariance heterogeneity does not substantially alter classification.
Box’s M Test Violation & QDA Check: Box’s M rejected the null hypothesis of covariance homogeneity (p < 0.001), which technically violates the LDA assumption of a common within-group covariance matrix Σ. As a robustness check, QDA (which estimates a separate Σ_k per class) is fitted and compared via LOO-CV accuracy. If LDA and QDA achieve comparable cross-validated accuracy, the covariance heterogeneity is practically inconsequential for classification performance — a common finding in large samples where Box’s M is highly sensitive to even minor covariance differences. The comparison above informs whether the simpler, more interpretable LDA model is adequate, or whether the more flexible QDA is warranted.
post_df <- as.data.frame(lda_pred$posterior)
colnames(post_df) <- paste0("P(Class ", c("0","1","2"), ")")
post_df$TrueClass <- df[[target]]
post_df$Predicted <- lda_pred$class
post_df$Correct <- post_df$TrueClass == post_df$Predicted
post_long <- post_df %>%
pivot_longer(cols = starts_with("P(Class"),
names_to = "PredClass", values_to = "Probability")
ggplot(post_long, aes(x = PredClass, y = Probability, fill = TrueClass)) +
geom_boxplot(alpha = 0.72, outlier.size = 0.8) +
scale_fill_manual(values = class_cols, labels = class_labs) +
facet_wrap(~TrueClass, labeller = labeller(TrueClass = class_labs)) +
theme_minimal(base_size = 11) +
theme(strip.background = element_rect(fill = "#34495E"),
strip.text = element_text(color = "white", face = "bold"),
legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold")) +
labs(title = "Posterior Probability Distribution by True Class",
subtitle = "High probability in the correct class = good discrimination",
x = "Posterior Probability For", y = "Probability")alpha_bonf <- 0.05 / length(pred_cols)
cat(sprintf("Bonferroni-adjusted alpha: %.4f\n\n", alpha_bonf))## Bonferroni-adjusted alpha: 0.0100
anova_results <- do.call(rbind, lapply(pred_cols, function(v) {
formula_v <- as.formula(paste(v, "~ power_grid_disruption"))
model_v <- aov(formula_v, data = df)
aov_tbl <- summary(model_v)[[1]]
F_val <- aov_tbl["power_grid_disruption","F value"]
p_val <- aov_tbl["power_grid_disruption","Pr(>F)"]
SS_B <- aov_tbl["power_grid_disruption","Sum Sq"]
SS_E <- aov_tbl["Residuals","Sum Sq"]
df_b <- aov_tbl["power_grid_disruption","Df"]
df_e <- aov_tbl["Residuals","Df"]
eta2 <- SS_B / (SS_B + SS_E)
data.frame(
Predictor = v,
SS_Between = round(SS_B, 2),
SS_Within = round(SS_E, 2),
df_B = df_b,
df_W = df_e,
MS_Between = round(SS_B / df_b, 4),
MS_Within = round(SS_E / df_e, 4),
F_value = round(F_val, 4),
p_value = round(p_val, 6),
eta2 = round(eta2, 4),
Significant = ifelse(p_val < alpha_bonf, "Yes", "No")
)
}))
anova_results %>%
kable(caption = sprintf(
"Follow-Up Univariate ANOVAs - Effect of Disruption Class on Each Predictor (Bonferroni alpha = %.4f)",
alpha_bonf),
col.names = c("Predictor","SS Between","SS Within","df B","df W",
"MS Between","MS Within","F","p-value","eta2","Significant")) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
column_spec(1, bold = TRUE) %>%
column_spec(9, bold = TRUE) %>%
column_spec(11, bold = TRUE,
color = ifelse(anova_results$Significant == "Yes","#1E8449","#C0392B")) %>%
row_spec(0, bold = TRUE, color = "white", background = "#34495E")| Predictor | SS Between | SS Within | df B | df W | MS Between | MS Within | F | p-value | eta2 | Significant |
|---|---|---|---|---|---|---|---|---|---|---|
| flare_intensity | 598458.42 | 157338.21 | 2 | 997 | 299229.2110 | 157.8116 | 1896.1161 | 0.000000 | 0.7918 | Yes |
| geomagnetic_index_Kp | 243.57 | 807.40 | 2 | 997 | 121.7865 | 0.8098 | 150.3850 | 0.000000 | 0.2318 | Yes |
| solar_wind_speed | 557204.50 | 9417605.60 | 2 | 997 | 278602.2479 | 9445.9434 | 29.4944 | 0.000000 | 0.0559 | Yes |
| solar_wind_density | 4.92 | 23840.47 | 2 | 997 | 2.4594 | 23.9122 | 0.1029 | 0.902272 | 0.0002 | No |
| flare_duration_minutes | 7414.01 | 2543298.54 | 2 | 997 | 3707.0068 | 2550.9514 | 1.4532 | 0.234319 | 0.0029 | No |
ggplot(anova_results, aes(x = reorder(Predictor, eta2),
y = eta2, fill = eta2)) +
geom_col(width = 0.55, color = "white") +
geom_text(aes(label = sprintf("%.4f", eta2)),
hjust = -0.15, size = 4, fontface = "bold") +
geom_hline(yintercept = 0.01, linetype = "dotted",
color = "#E74C3C", linewidth = 0.8) +
geom_hline(yintercept = 0.06, linetype = "dashed",
color = "#E67E22", linewidth = 0.8) +
geom_hline(yintercept = 0.14, linetype = "dashed",
color = "#27AE60", linewidth = 0.8) +
annotate("text", x = 0.6, y = 0.013, label = "small (0.01)", size = 3, color = "#E74C3C") +
annotate("text", x = 0.6, y = 0.063, label = "medium (0.06)", size = 3, color = "#E67E22") +
annotate("text", x = 0.6, y = 0.143, label = "large (0.14)", size = 3, color = "#27AE60") +
coord_flip() +
scale_fill_gradient(low = "#AED6F1", high = "#1A5276") +
scale_y_continuous(limits = c(0, max(anova_results$eta2) * 1.25), expand = c(0, 0)) +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold")) +
labs(title = "Univariate Effect Size (Partial eta²) per Predictor",
subtitle = "Cohen (1988) benchmarks: small=0.01, medium=0.06, large=0.14",
x = "", y = "Partial eta²")final_tbl <- data.frame(
Component = c(
"Total observations",
"Number of classes (g)",
"Number of predictors (p)",
"Discriminant functions extracted (s)",
"Wilks' Lambda",
"Approximate F (Rao)",
"p-value (Wilks)",
"Statistical decision",
"LD1 variance explained",
"LD2 variance explained",
"LD1 dominant predictor",
"APER (training accuracy)",
"AER (LOO-CV accuracy)",
"Optimism bias (APER - AER)",
"Bonferroni alpha (univariate)",
"Strongest univariate predictor (eta2)"
),
Result = c(
as.character(n_total),
as.character(g),
as.character(p),
as.character(s),
sprintf("%.6f", wilks_lambda),
sprintf("F = %.4f, df = (%.0f, %.2f)", F_wilks, df1_wilks, df2_wilks),
formatC(p_wilks, format = "e", digits = 3),
"Reject H0 at alpha = 0.05 - class centroids are significantly separated",
sprintf("%.2f%%", pct_trace[1] * 100),
sprintf("%.2f%%", pct_trace[2] * 100),
"flare_intensity & geomagnetic_index_Kp (by standardised coefficients)",
sprintf("%.4f (%.2f%% accuracy)", aper, accuracy * 100),
sprintf("%.4f (%.2f%% accuracy)", aper_cv, accuracy_cv * 100),
sprintf("%.4f", aper - aper_cv),
sprintf("%.4f", alpha_bonf),
sprintf("%s (eta2 = %.4f)",
anova_results$Predictor[which.max(anova_results$eta2)],
max(anova_results$eta2))
)
)
final_tbl %>%
kable(caption = "Comprehensive LDA Results Summary",
col.names = c("Component", "Result")) %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) %>%
column_spec(1, bold = TRUE, width = "22em") %>%
row_spec(0, bold = TRUE, color = "white", background = "#2C3E50") %>%
row_spec(c(5, 8, 12, 13), background = "#EBF5FB")| Component | Result |
|---|---|
| Total observations | 1000 |
| Number of classes (g) | 3 |
| Number of predictors (p) | 5 |
| Discriminant functions extracted (s) | 2 |
| Wilks’ Lambda | 0.162296 |
| Approximate F (Rao) | F = 294.3750, df = (10, 1986.00) |
| p-value (Wilks) | 0.000e+00 |
| Statistical decision | Reject H0 at alpha = 0.05 - class centroids are significantly separated |
| LD1 variance explained | 99.36% |
| LD2 variance explained | 0.64% |
| LD1 dominant predictor | flare_intensity & geomagnetic_index_Kp (by standardised coefficients) |
| APER (training accuracy) | 0.0630 (93.70% accuracy) |
| AER (LOO-CV accuracy) | 0.0640 (93.60% accuracy) |
| Optimism bias (APER - AER) | -0.0010 |
| Bonferroni alpha (univariate) | 0.0100 |
| Strongest univariate predictor (eta2) | flare_intensity (eta2 = 0.7918) |
This analysis applies Linear Discriminant Analysis to classify 1,000 solar storm events into three power grid disruption severity categories - no disruption (Class 0, 41.3%), moderate disruption (Class 1, 45.3%), and severe disruption (Class 2, 13.4%) - using five continuous geophysical predictors simultaneously: flare intensity, geomagnetic Kp index, solar wind speed, solar wind density, and flare duration. The core question is whether the joint multivariate fingerprint of a solar storm event can reliably predict how severely it will affect ground-based electrical infrastructure, a problem with direct operational relevance for power grid operators and space weather agencies.
The challenge inherent in this classification problem is threefold.
First, the three disruption classes are not cleanly separated in any
single predictor dimension - the boxplot distributions show substantial
overlap between Class 0 and Class 1, and even the most discriminating
variable (flare_intensity) cannot alone achieve reliable
separation. Second, the class distribution is imbalanced, with severe
disruption events (Class 2) comprising only 13.4% of the dataset - a
condition that, if ignored, would lead any naive classifier to
systematically misclassify severe events as moderate. Third, a subset of
the predictors (solar_wind_density,
flare_duration_minutes) shows weak univariate group
separation (low eta² in follow-up ANOVAs), creating risk of noise
inflation if included uncritically. Notably,
solar_wind_density (eta² ≈ 0.0002, p ≈ 0.902) and
flare_duration_minutes (eta² ≈ 0.0029, p ≈ 0.234) are not
statistically significant at the univariate level even without
Bonferroni correction — yet they are retained in the LDA model. This is
intentional: LDA operates on the joint covariance structure of
all predictors simultaneously, and a variable that carries little
standalone discriminant power may still contribute meaningful
conditional information once the correlation structure among predictors
is accounted for. This contrast between univariate irrelevance and
multivariate utility is itself a key argument for adopting a
multivariate framework over a series of separate univariate tests. The
multivariate LDA framework addresses all three challenges
simultaneously: it combines all predictors through their joint
covariance structure, incorporates empirical prior probabilities that
reflect the true class imbalance, and extracts only \(s = \min(p, g-1) = 2\) discriminant
dimensions that capture the maximum between-group variance relative to
within-group variance.
The LDA produced a statistically significant multivariate separation
of the three disruption classes. Wilks’ Lambda = 0.162296, F(10, 1986) =
294.375, p < 0.001 - all four multivariate criteria (Pillai’s Trace,
Wilks’ Lambda, Hotelling-Lawley Trace, Roy’s Largest Root) converged on
the same rejection of the null hypothesis. Two discriminant functions
were extracted: LD1 accounts for 99.4% of the between-group variance and
LD2 accounts for the remaining 0.6%, meaning the vast majority of
class-separating information is concentrated in a single linear
direction. flare_intensity and
geomagnetic_index_Kp carry the largest standardised
coefficients on LD1 — direct comparison of raw (unstandardised) LDA
coefficients across predictors measured on different scales can be
misleading, so interpretation here is grounded in the standardised
structure revealed by the between-group SSCP and the follow-up ANOVAs.
These two variables confirm that peak flare energetics and their
interaction with Earth’s magnetosphere are the primary physical
mechanisms driving disruption severity — a finding that aligns with
established space weather physics.
The follow-up univariate ANOVAs under Bonferroni correction (α =
0.01) identify flare_intensity as the variable with the strongest
univariate group separation (eta² = 0.7918), followed by
geomagnetic_index_Kp. Crucially,
solar_wind_density (eta² ≈ 0.0002) and
flare_duration_minutes (eta² ≈ 0.0029) are non-significant
even without correction — they contribute negligible standalone
discriminant power. Their inclusion in the LDA model is nevertheless
justified because LDA exploits the joint covariance structure:
these variables may carry conditional discriminant information that only
becomes apparent in combination with other predictors. This finding
reinforces the core rationale for multivariate analysis — univariate
screening alone would have excluded predictors that may contribute
incremental classification value through inter-variable
correlations.
The classification performance - overall accuracy of 93.7% on the training sample (APER = 6.3%) and 93.6% under leave-one-out cross-validation (AER = 6.4%) - demonstrates that the discriminant model generalises well beyond the training data, with an optimism bias of only -0.1 percentage points. The close agreement between APER and AER indicates that the five-predictor model is not overfitting to the training sample. Per-class inspection of the confusion matrices reveals that the most common misclassification is between Class 0 and Class 1 (no vs. moderate disruption), which is expected given their overlapping geophysical characteristics. Class 2 (severe disruption), while a minority class, is classified with relatively high precision - a critical property for operational space weather alerting systems where false negatives for severe events carry the highest cost.
The findings carry both scientific and operational implications. From
a space weather physics perspective, the dominance of
flare_intensity and geomagnetic_index_Kp in
the discriminant structure confirms that the severity of terrestrial
power grid impact is primarily governed by the energy output of the
solar flare and the geomagnetic coupling efficiency - not by ancillary
parameters such as solar wind density or event duration. This supports
physically-motivated feature selection in future predictive modelling:
simpler models anchored on these two variables may achieve comparable
classification accuracy while improving interpretability for operational
forecasters.
From a practical risk-management perspective, the cross-validated accuracy of 93.6% represents a meaningful improvement over a naïve majority-class classifier (which would achieve 45.3% by always predicting Class 1), confirming that the geophysical measurements carry genuine predictive signal. Grid operators could integrate this discriminant classifier into early-warning systems - using real-time solar wind and flare measurements to flag incoming events by anticipated disruption severity, allowing pre-emptive protective switching before grid damage occurs. The posterior probability outputs of the LDA model provide not just a hard class prediction but a continuous confidence measure, enabling risk-tiered decision protocols (e.g., begin protective action when P(Class 2) > 0.30) rather than binary alert thresholds.
The Linear Discriminant Analysis of the Solar Storm Impact Dataset
confirmed that the five-dimensional geophysical predictor space yields a
statistically significant and operationally useful classification of
power grid disruption severity across three classes. Wilks’ Lambda =
0.162296, F(10, 1986) = 294.375, p < 0.001, with all four
multivariate test statistics in agreement. Two discriminant functions
were extracted: LD1 capturing 99.4% of the between-group variance and
dominated by flare_intensity and
geomagnetic_index_Kp, and LD2 capturing the remaining 0.6%.
The classification model achieved 93.7% accuracy on training data (APER
= 6.3%) and 93.6% under leave-one-out cross-validation (AER = 6.4%),
demonstrating strong generalisation with negligible overfitting.
Follow-up univariate ANOVAs identified flare_intensity as the single
strongest discriminating variable (eta² = 0.7918), with
geomagnetic_index_Kp as the second most important
contributor. These results provide a statistically rigorous, physically
interpretable, and operationally actionable framework for solar storm
severity classification.
Date: May 04, 2026
S1 Data Science | FMIPA UNESA | 2026