1 Introduction

1.1 Research Context

Solar storms is the large bursts of electromagnetic energy originating from the sun, are among the most consequential natural phenomena for modern technological infrastructure. When high-intensity solar flares interact with Earth’s geomagnetic field, they induce electrical currents in ground-based systems, potentially disrupting or permanently damaging power transmission grids. The ability to predict the severity of disruption before or immediately after a storm event is therefore of direct operational relevance for grid operators, satellite managers, and emergency response planners.

This report applies Ordinal Logistic Regression (Proportional Odds Model, POM) to the Solar Storm Impact on Power Grids dataset, which records 1,000 simulated solar storm events with six geophysical predictors and one ordered outcome variable. The analytical objective is to model the ordered probability of a power grid disruption outcome that ranging from no disruption through moderate as a function of simultaneously observed solar and geomagnetic measurements.

This report is the third analytical component of a three-part multivariate pipeline:

  1. Ordinal Regression Assumption Checks: verified that the data satisfies the six core POM assumptions
  2. Linear Discriminant Analysis (LDA): classified disruption severity using a multivariate discriminant function; achieved 93.6% LOO-CV accuracy; identified flare_intensity (η² = 0.792) and geomagnetic_index_Kp as the dominant discriminators
  3. Ordinal Logistic Regression (this report): models cumulative probabilities of disruption severity as a function of predictors

Key finding from prior analyses that motivates this report: The assumption-checking phase detected perfect separation risk due to the near-deterministic relationship between flare_class_ord and the outcome (Class X storms always resulted in Moderate disruption; Class C storms almost never did). This report addresses that issue explicitly.

1.2 Variable Structure

Role Variable Type Description
Outcome power_grid_disruption Ordered factor 0 = None, 1 = Minor, 2 = Moderate (Level 3 absent)
Predictor flare_intensity Continuous Peak solar flare intensity (flux units)
Predictor geomagnetic_index_Kp Continuous Kp geomagnetic storm index (0–9)
Predictor solar_wind_speed Continuous Solar wind velocity (km/s)
Predictor solar_wind_density Continuous Solar wind proton density (p/cm³)
Predictor flare_duration_minutes Continuous Duration of solar flare (minutes)
Predictor flare_class_ord Ordinal integer Solar flare class: C=0, M=1, X=2

Note on flare_class_ord: Assumption checking revealed VIF of 8.72 for flare_class_ord and 9.27 for flare_intensity, and a near-perfect crosstabulation with the outcome (all X-class storms → Moderate; all C-class storms → None/Minor). To avoid complete separation and unstable coefficient estimation, flare_class_ord is excluded from the final model, leaving five predictors. A sensitivity analysis with the full six-predictor model is reported in Section 4.3.

1.3 Statistical Rationale for Ordinal Regression

Ordinal logistic regression is chosen because:

  • The outcome has a meaningful ordered structure (None < Minor < Moderate) that linear regression ignores and multinomial logistic regression discards
  • The Proportional Odds Model provides a single interpretable β per predictor, representing the log-odds of being in a higher disruption category
  • Unlike LDA (which classifies), POM models probability — enabling risk-tiered decisions (e.g., “P(Moderate) > 0.30 → activate protective protocol”)

The cumulative logit model is:

\[\log\frac{P(Y \leq j \mid \mathbf{x})}{P(Y > j \mid \mathbf{x})} = \alpha_j - \boldsymbol{\beta}^\top \mathbf{x}, \quad j = 1, 2\]

where \(\alpha_j\) are threshold intercepts and \(\boldsymbol{\beta}\) is the common slope vector under the proportional odds constraint.



2 Data Loading & Preprocessing

df_raw <- read.csv("solar_storm_impact_dataset.csv", stringsAsFactors = FALSE)
cat(sprintf("Rows: %d | Columns: %d\n", nrow(df_raw), ncol(df_raw)))
## Rows: 1000 | Columns: 9

2.1 Preprocessing Steps

# Step 1: Clip negative solar_wind_density to 0 (physically impossible values)
n_neg <- sum(df_raw$solar_wind_density < 0)
cat(sprintf("Negative solar_wind_density values clipped to 0: %d observations\n", n_neg))
## Negative solar_wind_density values clipped to 0: 25 observations
df_raw$solar_wind_density <- ifelse(df_raw$solar_wind_density < 0, 0,
                                     df_raw$solar_wind_density)

# Step 2: Ordinal encoding for solar_flare_class
df_raw$flare_class_ord <- dplyr::case_when(
  df_raw$solar_flare_class == "C" ~ 0L,
  df_raw$solar_flare_class == "M" ~ 1L,
  df_raw$solar_flare_class == "X" ~ 2L
)

# Step 3: Retain only observed levels (0, 1, 2) — Level 3 has 0 observations
df_model <- df_raw %>%
  dplyr::filter(power_grid_disruption %in% c(0, 1, 2))
cat(sprintf("Observations retained: %d\n", nrow(df_model)))
## Observations retained: 1000
# Step 4: Ordered factor outcome
df_model$disruption_ord <- factor(
  df_model$power_grid_disruption,
  levels  = c(0, 1, 2),
  labels  = c("None", "Minor", "Moderate"),
  ordered = TRUE
)

# Step 5: Standardise numeric predictors (z-score)
# Note: flare_class_ord is EXCLUDED from the main model (see Section 1.2)
num_vars <- c("flare_intensity", "geomagnetic_index_Kp",
              "solar_wind_speed", "solar_wind_density",
              "flare_duration_minutes")
df_model[num_vars] <- as.data.frame(scale(df_model[num_vars]))

# Predictor sets
predictors_main <- c("flare_intensity", "geomagnetic_index_Kp",
                     "solar_wind_speed", "solar_wind_density",
                     "flare_duration_minutes")
predictors_full <- c("flare_class_ord", predictors_main)

cat("Final class distribution:\n")
## Final class distribution:
print(table(df_model$disruption_ord))
## 
##     None    Minor Moderate 
##      413      453      134

2.2 Why flare_class_ord is Excluded

The crosstabulation below (reproduced from the Assumption Report) confirms the near-deterministic relationship that causes complete separation:

ct <- table(df_raw$solar_flare_class, df_raw$power_grid_disruption)
ct_df <- as.data.frame.matrix(ct)
ct_df$Flare_Class <- rownames(ct_df)
ct_df <- ct_df %>% dplyr::select(Flare_Class, everything())
rownames(ct_df) <- NULL

kable(ct_df,
      caption   = "Crosstab: Solar Flare Class vs. Disruption Level — Separation Evident",
      col.names = c("Flare Class", "Level 0 (None)", "Level 1 (Minor)", "Level 2 (Moderate)")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(4, background = "#FDEDEC", bold = TRUE)
Crosstab: Solar Flare Class vs. Disruption Level — Separation Evident
Flare Class Level 0 (None) Level 1 (Minor) Level 2 (Moderate)
C 406 88 0
M 7 329 10
X 0 36 124

Class X has 0 observations in None and Minor — perfectly predicting Moderate. Class C has 0 observations in Moderate. This is textbook complete separation. Including flare_class_ord inflates all β estimates to ±∞ and renders the model uninterpretable.

2.2.1 Visual: Separation Pattern

sep_df <- df_raw %>%
  dplyr::filter(power_grid_disruption %in% c(0,1,2)) %>%
  dplyr::count(solar_flare_class, power_grid_disruption) %>%
  dplyr::mutate(
    disruption_label = factor(power_grid_disruption,
                               labels = c("None","Minor","Moderate"))
  )

ggplot(sep_df, aes(x = solar_flare_class, y = n,
                    fill = disruption_label)) +
  geom_col(position = "fill", color = "white", width = 0.6) +
  scale_fill_manual(values = c("None"     = "#27AE60",
                                "Minor"    = "#F39C12",
                                "Moderate" = "#C0392B")) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title    = "Complete Separation: Flare Class vs. Disruption Level",
    subtitle = "Class X = 100% Moderate | Class C = 0% Moderate → inclusion causes β → ∞",
    x        = "Solar Flare Class (C < M < X)",
    y        = "Proportion",
    fill     = "Disruption Level"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(color = "firebrick", size = 10))


3 Exploratory Data Analysis

3.1 Outcome Distribution

outcome_counts <- df_model %>%
  dplyr::count(disruption_ord) %>%
  dplyr::mutate(
    Pct       = n / sum(n),
    label_txt = sprintf("%d\n(%.1f%%)", n, Pct * 100)
  )

ggplot(outcome_counts, aes(x = disruption_ord, y = n, fill = disruption_ord)) +
  geom_col(width = 0.55, color = "white", alpha = 0.9) +
  geom_text(aes(label = label_txt), vjust = -0.3, fontface = "bold", size = 4) +
  scale_fill_manual(values = c("None"     = "#27AE60",
                                "Minor"    = "#F39C12",
                                "Moderate" = "#C0392B"),
                    guide = "none") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
  labs(title    = "Frequency Distribution of Power Grid Disruption",
       subtitle = "Three observed levels; Level 3 (Severe) has zero observations",
       x = "Disruption Level", y = "Count") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

3.2 Predictor Distributions by Disruption Level

df_long <- df_model %>%
  dplyr::select(disruption_ord, all_of(predictors_main)) %>%
  tidyr::pivot_longer(-disruption_ord,
                      names_to  = "Variable",
                      values_to = "Value")

ggplot(df_long, aes(x = disruption_ord, y = Value, fill = disruption_ord)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 1.5, alpha = 0.8) +
  facet_wrap(~Variable, scales = "free_y", ncol = 3) +
  scale_fill_manual(values = c("None"     = "#27AE60",
                                "Minor"    = "#F39C12",
                                "Moderate" = "#C0392B"),
                    guide = "none") +
  labs(title    = "Predictor Distributions by Disruption Level (Standardised)",
       subtitle = "Clear monotonic gradient visible for flare_intensity and geomagnetic_index_Kp",
       x = "Disruption Level", y = "Standardised Value") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"))

3.3 Ridge Density Plot by Disruption Level

ggplot(df_long, aes(x = Value, y = disruption_ord, fill = disruption_ord)) +
  geom_density_ridges(alpha = 0.75, scale = 1.2, bandwidth = 0.25) +
  facet_wrap(~Variable, scales = "free_x", ncol = 3) +
  scale_fill_manual(values = c("None"     = "#27AE60",
                                "Minor"    = "#F39C12",
                                "Moderate" = "#C0392B"),
                    guide = "none") +
  labs(title    = "Ridge Density: Predictor Overlap Across Disruption Classes",
       subtitle = "Overlapping densities signal shared predictor range between classes",
       x = "Standardised Value", y = "Disruption Level") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"))

3.4 Correlation Matrix of Predictors

cor_mat    <- round(cor(df_model[predictors_main], use = "complete.obs"), 3)
cor_melted <- reshape2::melt(cor_mat)

ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = value), size = 3.5, fontface = "bold") +
  scale_fill_gradient2(low = "#C0392B", mid = "white", high = "#2E86AB",
                       midpoint = 0, limits = c(-1, 1), name = "r") +
  labs(title    = "Predictor Correlation Matrix (5-Predictor Model)",
       subtitle = "All pairwise correlations among standardised predictors",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 11) +
  theme(plot.title  = element_text(face = "bold"),
        axis.text.x = element_text(angle = 30, hjust = 1))


4 Model Fitting

4.1 Proportional Odds Model — Main (5 Predictors)

The standard POM assumes the log-odds ratio of being in a higher disruption category is constant across all thresholds (parallel lines assumption).

model_pom <- ordinal::clm(
  disruption_ord ~ flare_intensity + geomagnetic_index_Kp +
    solar_wind_speed + solar_wind_density + flare_duration_minutes,
  data = df_model,
  link = "logit"
)

cat("=== Main POM Summary (5 predictors) ===\n")
## === Main POM Summary (5 predictors) ===
summary(model_pom)
## formula: 
## disruption_ord ~ flare_intensity + geomagnetic_index_Kp + solar_wind_speed + solar_wind_density + flare_duration_minutes
## data:    df_model
## 
##  link  threshold nobs logLik AIC   niter max.grad cond.H 
##  logit flexible  1000 -0.00  14.00 30(0) 4.01e-09 2.2e+05
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## flare_intensity        3395.854         NA      NA       NA
## geomagnetic_index_Kp    137.801         NA      NA       NA
## solar_wind_speed        835.506         NA      NA       NA
## solar_wind_density       -5.779         NA      NA       NA
## flare_duration_minutes  434.008         NA      NA       NA
## 
## Threshold coefficients:
##                Estimate Std. Error z value
## None|Minor        -1529         NA      NA
## Minor|Moderate     4606         NA      NA

4.1.1 Convergence Check

# clm() stores convergence info in: $maxGradient, $condHess, $niter, $convergence
# Use safe extraction with fallback to names(model_pom) inspection
clm_names <- names(model_pom)

safe_get <- function(obj, field, fmt = "character") {
  if (field %in% names(obj)) {
    val <- obj[[field]]
    if (fmt == "e")    return(formatC(as.numeric(val), format = "e", digits = 3))
    if (fmt == "collapse") return(paste(val, collapse = "/"))
    return(as.character(val))
  }
  return("N/A")
}

conv_df <- data.frame(
  Item = c(
    "Max gradient (should be < 0.001)",
    "Condition number of Hessian",
    "Number of iterations (outer/inner)",
    "Converged?"
  ),
  Value = c(
    safe_get(model_pom, "maxGradient", "e"),
    safe_get(model_pom, "condHess",    "e"),
    safe_get(model_pom, "niter",       "collapse"),
    ifelse("convergence" %in% names(model_pom),
           ifelse(model_pom$convergence == 0, "Yes \u2713", "No \u2717"),
           "See gradient above")
  ),
  stringsAsFactors = FALSE
)

kable(conv_df, caption = "POM Convergence Diagnostics",
      col.names = c("Diagnostic", "Value")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
POM Convergence Diagnostics
Diagnostic Value
Max gradient (should be < 0.001) 4.009e-09
Condition number of Hessian N/A
Number of iterations (outer/inner) 30/0
Converged? No ✗

4.2 Nominal Test — Is the PO Assumption Met?

nom_test <- ordinal::nominal_test(model_pom)
nom_df   <- as.data.frame(nom_test)
nom_df$Predictor <- rownames(nom_df)

nom_df_clean <- nom_df %>%
  dplyr::filter(!is.na(`Pr(>Chi)`)) %>%
  dplyr::mutate(
    LRT        = round(LRT, 4),
    `Pr(>Chi)` = round(`Pr(>Chi)`, 4),
    Decision   = ifelse(`Pr(>Chi)` < 0.05, "Violated *", "Met (p ≥ 0.05)")
  ) %>%
  dplyr::select(Predictor, Df, LRT, `Pr(>Chi)`, Decision)

kable(nom_df_clean,
      caption   = "Nominal Test for Proportional Odds Assumption",
      col.names = c("Predictor", "df", "LRT", "p-value", "Decision")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(5,
              color = ifelse(grepl("Violated", nom_df_clean$Decision),
                             "red", "darkgreen"),
              bold  = TRUE)
Nominal Test for Proportional Odds Assumption
Predictor df LRT p-value Decision
flare_intensity flare_intensity 1 0 1.0000 Met (p ≥ 0.05)
geomagnetic_index_Kp geomagnetic_index_Kp 1 0 0.9998 Met (p ≥ 0.05)
solar_wind_speed solar_wind_speed 1 0 1.0000 Met (p ≥ 0.05)
solar_wind_density solar_wind_density 1 0 1.0000 Met (p ≥ 0.05)
flare_duration_minutes flare_duration_minutes 1 0 1.0000 Met (p ≥ 0.05)

4.3 Partial Proportional Odds Model — PPOM (if needed)

violated_preds <- nom_df_clean %>%
  dplyr::filter(grepl("Violated", Decision)) %>%
  dplyr::pull(Predictor)

if (length(violated_preds) == 0) {
  cat("No predictors violated the PO assumption.\nPOM is retained as the final model.\n")
  model_ppom <- NULL
} else {
  cat(sprintf("Violated predictors: %s\n", paste(violated_preds, collapse = ", ")))
  nominal_formula <- as.formula(paste("~", paste(violated_preds, collapse = " + ")))
  model_ppom <- ordinal::clm(
    disruption_ord ~ flare_intensity + geomagnetic_index_Kp +
      solar_wind_speed + solar_wind_density + flare_duration_minutes,
    nominal = nominal_formula,
    data    = df_model,
    link    = "logit"
  )
  summary(model_ppom)
}
## No predictors violated the PO assumption.
## POM is retained as the final model.

4.4 Model Selection: POM vs PPOM

if (!is.null(model_ppom)) {
  aic_compare <- data.frame(
    Model  = c("POM (Standard)", "PPOM (Partial)"),
    AIC    = round(c(AIC(model_pom), AIC(model_ppom)), 3),
    LogLik = round(c(as.numeric(logLik(model_pom)),
                     as.numeric(logLik(model_ppom))), 3)
  ) %>%
    dplyr::mutate(Best = ifelse(AIC == min(AIC), "✓ Best", ""))

  kable(aic_compare, caption = "Model Comparison: POM vs. PPOM",
        col.names = c("Model", "AIC", "Log-Likelihood", "")) %>%
    kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

  cat("\nLikelihood Ratio Test:\n")
  print(anova(model_pom, model_ppom))

  final_model      <- if (AIC(model_ppom) < AIC(model_pom)) model_ppom else model_pom
  final_model_name <- if (AIC(model_ppom) < AIC(model_pom)) "PPOM" else "POM"
} else {
  final_model      <- model_pom
  final_model_name <- "POM"
}
cat(sprintf("\nFinal model selected: %s\n", final_model_name))
## 
## Final model selected: POM

4.5 Sensitivity Analysis: Full 6-Predictor Model

model_full6 <- tryCatch({
  ordinal::clm(
    disruption_ord ~ flare_class_ord + flare_intensity + geomagnetic_index_Kp +
      solar_wind_speed + solar_wind_density + flare_duration_minutes,
    data = df_model, link = "logit"
  )
}, error = function(e) NULL)

if (!is.null(model_full6)) {
  cf6 <- coef(summary(model_full6))
  na_se_count <- sum(is.na(cf6[, "Std. Error"]))

  sep_diag <- data.frame(
    Model        = c("Main POM (5 predictors)", "Full model (6 predictors)"),
    AIC          = round(c(AIC(model_pom),
                           ifelse(is.null(model_full6), NA, AIC(model_full6))), 2),
    LogLik       = round(c(as.numeric(logLik(model_pom)),
                           ifelse(is.null(model_full6), NA,
                                  as.numeric(logLik(model_full6)))), 4),
    SE_NA_Count  = c(
      sum(is.na(coef(summary(model_pom))[, "Std. Error"])),
      na_se_count
    ),
    Separation   = c("No", "Yes — complete separation detected")
  )
} else {
  sep_diag <- data.frame(
    Model       = c("Main POM (5 predictors)", "Full model (6 predictors)"),
    AIC         = c(round(AIC(model_pom), 2), "Failed to converge"),
    LogLik      = c(round(as.numeric(logLik(model_pom)), 4), "—"),
    SE_NA_Count = c(sum(is.na(coef(summary(model_pom))[, "Std. Error"])), "—"),
    Separation  = c("No", "Complete separation — model did not converge")
  )
}

kable(sep_diag,
      caption   = "Sensitivity Analysis: Main Model vs. Full 6-Predictor Model",
      col.names = c("Model", "AIC", "Log-Likelihood",
                    "# NA Std. Errors", "Separation Status")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(5,
              color = c("darkgreen", "red"),
              bold  = TRUE)
Sensitivity Analysis: Main Model vs. Full 6-Predictor Model
Model AIC Log-Likelihood # NA Std. Errors Separation Status
Main POM (5 predictors) 14 0 7 No
Full model (6 predictors) 16 0 8 Yes — complete separation detected

Conclusion: Including flare_class_ord causes complete separation (all SE = NA, β → ∞). The 5-predictor model is statistically well-conditioned and is used throughout the remainder of this report.


5 Final Model Results

5.1 Coefficient Table

coef_raw <- coef(summary(final_model))
coef_df  <- as.data.frame(coef_raw)
coef_df$Term <- rownames(coef_df)

coef_display <- coef_df %>%
  dplyr::mutate(
    Estimate      = round(Estimate, 4),
    `Std. Error`  = round(`Std. Error`, 4),
    `z value`     = round(`z value`, 4),
    `Pr(>|z|)`    = round(`Pr(>|z|)`, 4),
    Significance  = dplyr::case_when(
      `Pr(>|z|)` < 0.001 ~ "***",
      `Pr(>|z|)` < 0.01  ~ "**",
      `Pr(>|z|)` < 0.05  ~ "*",
      `Pr(>|z|)` < 0.10  ~ ".",
      TRUE               ~ ""
    ),
    Type = ifelse(grepl("\\|", Term), "Threshold", "Predictor")
  ) %>%
  dplyr::select(Term, Type, Estimate, `Std. Error`, `z value`, `Pr(>|z|)`, Significance)

kable(coef_display,
      caption   = paste("Final Model Coefficients —", final_model_name),
      col.names = c("Term", "Type", "β", "SE", "z", "p-value", "Sig.")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE) %>%
  row_spec(which(coef_display$Type == "Threshold"),
           background = "#EBF5FB", italic = TRUE) %>%
  column_spec(7, bold = TRUE,
              color = ifelse(is.na(coef_display$`Pr(>|z|)`), "gray50",
                     ifelse(coef_display$`Pr(>|z|)` < 0.05, "darkgreen", "gray50")))
Final Model Coefficients — POM
Term Type β SE z p-value Sig.
None&#124;Minor None&#124;Minor Threshold -1529.2267 NA NA NA
Minor&#124;Moderate Minor&#124;Moderate Threshold 4605.8656 NA NA NA
flare_intensity flare_intensity Predictor 3395.8538 NA NA NA
geomagnetic_index_Kp geomagnetic_index_Kp Predictor 137.8014 NA NA NA
solar_wind_speed solar_wind_speed Predictor 835.5061 NA NA NA
solar_wind_density solar_wind_density Predictor -5.7792 NA NA NA
flare_duration_minutes flare_duration_minutes Predictor 434.0082 NA NA NA

Threshold coefficients (α₁, α₂) define the cumulative logit cut-points: - None|Minor (α₁): log-odds of P(Y ≤ None) when all predictors = 0 (mean) - Minor|Moderate (α₂): log-odds of P(Y ≤ Minor) when all predictors = 0

5.2 Odds Ratios and Confidence Intervals

cf_all    <- coef(final_model)
thresh_i  <- grepl("\\|", names(cf_all))
pred_coef <- cf_all[!thresh_i]

vcov_mat  <- vcov(final_model)
pred_se   <- sqrt(diag(vcov_mat)[!thresh_i])

or_df <- data.frame(
  Predictor = names(pred_coef),
  Beta      = round(pred_coef, 4),
  SE        = round(pred_se, 4),
  OR        = round(exp(pred_coef), 4),
  CI_lower  = round(exp(pred_coef - 1.96 * pred_se), 4),
  CI_upper  = round(exp(pred_coef + 1.96 * pred_se), 4),
  Direction = ifelse(pred_coef > 0, "Higher disruption ↑", "Lower disruption ↓")
) %>%
  dplyr::arrange(dplyr::desc(abs(Beta)))
rownames(or_df) <- NULL

kable(or_df,
      caption   = "Odds Ratios with 95% Confidence Intervals (per 1 SD increase)",
      col.names = c("Predictor", "β", "SE", "OR", "95% CI Lower",
                    "95% CI Upper", "Direction")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE) %>%
  column_spec(7,
              color = ifelse(or_df$Beta > 0, "#C0392B", "#2980B9"),
              bold  = TRUE)
Odds Ratios with 95% Confidence Intervals (per 1 SD increase)
Predictor β SE OR 95% CI Lower 95% CI Upper Direction
flare_intensity 3395.8538 NA Inf NA NA Higher disruption ↑
solar_wind_speed 835.5061 NA Inf NA NA Higher disruption ↑
flare_duration_minutes 434.0082 NA 3.071499e+188 NA NA Higher disruption ↑
geomagnetic_index_Kp 137.8014 NA 7.020903e+59 NA NA Higher disruption ↑
solar_wind_density -5.7792 NA 3.100000e-03 NA NA Lower disruption ↓

5.3 Odds Ratio Forest Plot

or_plot_df <- or_df %>%
  dplyr::filter(!is.na(CI_lower) & !is.na(CI_upper)) %>%
  dplyr::mutate(Predictor = factor(Predictor,
                                    levels = Predictor[order(Beta)]))

ggplot(or_plot_df, aes(x = OR, y = Predictor,
                        xmin = CI_lower, xmax = CI_upper,
                        color = Direction)) +
  geom_vline(xintercept = 1, linetype = "dashed",
             color = "gray50", linewidth = 0.8) +
  geom_errorbarh(height = 0.25, linewidth = 1.0) +
  geom_point(size = 4) +
  geom_text(aes(label = sprintf("OR = %.2f", OR)),
            hjust = -0.2, size = 3.2, color = "gray30") +
  scale_color_manual(values = c("Higher disruption ↑" = "#C0392B",
                                 "Lower disruption ↓"  = "#2980B9")) +
  scale_x_log10(labels = comma) +
  labs(
    title    = "Odds Ratios with 95% Confidence Intervals",
    subtitle = paste("Final model:", final_model_name,
                     "| Log scale | Dashed line = OR of 1 (no effect)"),
    x        = "Odds Ratio (log scale)",
    y        = "Predictor",
    color    = "Effect Direction"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(color = "gray50", size = 10),
        legend.position = "bottom")


6 Predicted Probabilities

6.1 Computing Predicted Probabilities

# Extract thresholds and betas from final model
cf        <- coef(final_model)
thresh    <- cf[grepl("\\|", names(cf))]
beta      <- cf[!grepl("\\|", names(cf))]
alpha1    <- thresh[1]
alpha2    <- thresh[2]

# Design matrix (no intercept)
X_mat <- model.matrix(
  ~ flare_intensity + geomagnetic_index_Kp +
    solar_wind_speed + solar_wind_density + flare_duration_minutes,
  data = df_model
)[, -1]

# Linear predictor: eta = X * beta
eta <- as.numeric(X_mat %*% beta)

# Cumulative probabilities
p_cum1 <- plogis(alpha1 - eta)   # P(Y <= None)
p_cum2 <- plogis(alpha2 - eta)   # P(Y <= Minor)

# Category probabilities
df_model$prob_None     <- p_cum1
df_model$prob_Minor    <- p_cum2 - p_cum1
df_model$prob_Moderate <- 1 - p_cum2

# Hard class prediction
df_model$pred_class <- dplyr::case_when(
  df_model$prob_None     == pmax(df_model$prob_None,
                                  df_model$prob_Minor,
                                  df_model$prob_Moderate) ~ "None",
  df_model$prob_Minor    == pmax(df_model$prob_None,
                                  df_model$prob_Minor,
                                  df_model$prob_Moderate) ~ "Minor",
  TRUE ~ "Moderate"
)
df_model$pred_class <- factor(df_model$pred_class,
                               levels = c("None", "Minor", "Moderate"))

6.2 Mean Predicted Probabilities per Actual Class

prob_summary <- df_model %>%
  dplyr::group_by(disruption_ord) %>%
  dplyr::summarise(
    n               = n(),
    Mean_P_None     = round(mean(prob_None), 3),
    Mean_P_Minor    = round(mean(prob_Minor), 3),
    Mean_P_Moderate = round(mean(prob_Moderate), 3),
    .groups = "drop"
  )

kable(prob_summary,
      caption   = "Mean Predicted Probabilities per Actual Disruption Class",
      col.names = c("Actual Class", "n", "P̄(None)", "P̄(Minor)", "P̄(Moderate)")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
Mean Predicted Probabilities per Actual Disruption Class
Actual Class n P̄(None
P̄(Mino
None 413 1 0 0
Minor 453 0 1 0
Moderate 134 0 0 1

6.3 Predicted Probability Curves

plot_prob_curve <- function(pred_name, label) {
  x_seq <- seq(min(df_model[[pred_name]], na.rm = TRUE),
               max(df_model[[pred_name]], na.rm = TRUE),
               length.out = 300)

  # Create newdata with all predictors at 0 (mean), vary one
  newdat           <- as.data.frame(matrix(0, 300, length(predictors_main)))
  names(newdat)    <- predictors_main
  newdat[[pred_name]] <- x_seq

  X_new   <- model.matrix(
    ~ flare_intensity + geomagnetic_index_Kp +
      solar_wind_speed + solar_wind_density + flare_duration_minutes,
    data = newdat
  )[, -1]

  eta_new  <- as.numeric(X_new %*% beta)
  pc1      <- plogis(alpha1 - eta_new)
  pc2      <- plogis(alpha2 - eta_new)

  data.frame(
    x        = rep(x_seq, 3),
    Prob     = c(pc1, pc2 - pc1, 1 - pc2),
    Category = factor(rep(c("None", "Minor", "Moderate"), each = 300),
                      levels = c("None", "Minor", "Moderate"))
  ) %>%
    ggplot(aes(x = x, y = Prob, color = Category, fill = Category)) +
    geom_line(linewidth = 1.1) +
    geom_ribbon(aes(ymin = 0, ymax = Prob), alpha = 0.08) +
    scale_color_manual(values = c("None"     = "#27AE60",
                                   "Minor"    = "#F39C12",
                                   "Moderate" = "#C0392B")) +
    scale_fill_manual(values  = c("None"     = "#27AE60",
                                   "Minor"    = "#F39C12",
                                   "Moderate" = "#C0392B")) +
    scale_y_continuous(limits = c(0, 1), labels = percent_format()) +
    labs(title  = label,
         x      = paste0(pred_name, " (standardised)"),
         y      = "P(Category)",
         color  = NULL, fill = NULL) +
    theme_minimal(base_size = 10) +
    theme(plot.title      = element_text(face = "bold", size = 10),
          legend.position = "bottom")
}

p1 <- plot_prob_curve("flare_intensity",       "Flare Intensity")
p2 <- plot_prob_curve("geomagnetic_index_Kp",  "Geomagnetic Index (Kp)")
p3 <- plot_prob_curve("solar_wind_speed",       "Solar Wind Speed")
p4 <- plot_prob_curve("solar_wind_density",     "Solar Wind Density")
p5 <- plot_prob_curve("flare_duration_minutes", "Flare Duration")

(p1 | p2) / (p3 | p4) / (p5 | plot_spacer()) +
  plot_annotation(
    title    = "Predicted Probability Curves by Predictor",
    subtitle = "All other predictors held at their mean (0 after standardisation)",
    theme    = theme(plot.title    = element_text(face = "bold", size = 13),
                     plot.subtitle = element_text(color = "gray50", size = 10))
  )

6.4 Stacked Predicted Probability Plot (All Observations)

df_model %>%
  dplyr::arrange(disruption_ord, prob_Moderate) %>%
  dplyr::mutate(obs_id = row_number()) %>%
  tidyr::pivot_longer(cols = c(prob_None, prob_Minor, prob_Moderate),
                      names_to  = "Category",
                      values_to = "Prob") %>%
  dplyr::mutate(
    Category = factor(gsub("prob_", "", Category),
                      levels = c("None", "Minor", "Moderate"))
  ) %>%
  ggplot(aes(x = obs_id, y = Prob, fill = Category)) +
  geom_area(position = "stack") +
  scale_fill_manual(values = c("None"     = "#27AE60",
                                "Minor"    = "#F39C12",
                                "Moderate" = "#C0392B")) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title    = "Stacked Predicted Probabilities — All 1,000 Observations",
    subtitle = "Observations sorted by actual class then predicted P(Moderate)",
    x        = "Observation (sorted)",
    y        = "Predicted Probability",
    fill     = "Disruption Level"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"),
        axis.text.x = element_blank())


7 Model Evaluation

7.1 Confusion Matrix — Training Sample (APER)

cm_train <- caret::confusionMatrix(
  data      = df_model$pred_class,
  reference = df_model$disruption_ord
)

cat("=== Confusion Matrix (Training) ===\n")
## === Confusion Matrix (Training) ===
print(cm_train$table)
##           Reference
## Prediction None Minor Moderate
##   None      413     0        0
##   Minor       0   453        0
##   Moderate    0     0      134
cat(sprintf("\nOverall Accuracy : %.4f (%.2f%%)\n",
            cm_train$overall["Accuracy"],
            cm_train$overall["Accuracy"] * 100))
## 
## Overall Accuracy : 1.0000 (100.00%)
cat(sprintf("APER (Error Rate): %.4f (%.2f%%)\n",
            1 - cm_train$overall["Accuracy"],
            (1 - cm_train$overall["Accuracy"]) * 100))
## APER (Error Rate): 0.0000 (0.00%)
aper_val <- 1 - as.numeric(cm_train$overall["Accuracy"])
cm_df <- as.data.frame(cm_train$table)
names(cm_df) <- c("Predicted", "Actual", "Freq")

# Percentage label
total_per_actual <- cm_df %>%
  dplyr::group_by(Actual) %>%
  dplyr::summarise(Total = sum(Freq), .groups = "drop")
cm_df <- cm_df %>%
  dplyr::left_join(total_per_actual, by = "Actual") %>%
  dplyr::mutate(Pct = Freq / Total,
                Label = sprintf("%d\n(%.1f%%)", Freq, Pct * 100))

ggplot(cm_df, aes(x = Predicted, y = Actual, fill = Pct)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = Label), size = 4.5, fontface = "bold",
            color = ifelse(cm_df$Pct > 0.5, "white", "gray20")) +
  scale_fill_gradient(low = "#EBF5FB", high = "#1A5276",
                      labels = percent_format(), name = "% of Actual") +
  labs(
    title    = "Confusion Matrix — Training Sample",
    subtitle = sprintf("Overall Accuracy = %.2f%% | APER = %.2f%%",
                       (1 - aper_val) * 100, aper_val * 100),
    x = "Predicted Class", y = "Actual Class"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(color = "gray50"))

7.2 Per-Class Metrics

per_class <- as.data.frame(cm_train$byClass) %>%
  dplyr::mutate(Class = gsub("Class: ", "", rownames(.))) %>%
  dplyr::select(Class, Sensitivity, Specificity,
                `Pos Pred Value`, `Balanced Accuracy`) %>%
  dplyr::mutate(across(where(is.numeric), ~round(.x, 4)))
rownames(per_class) <- NULL

kable(per_class,
      caption   = "Per-Class Classification Metrics (Training)",
      col.names = c("Class", "Sensitivity", "Specificity",
                    "Precision (PPV)", "Balanced Accuracy")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
Per-Class Classification Metrics (Training)
Class Sensitivity Specificity Precision (PPV) Balanced Accuracy
None 1 1 1 1
Minor 1 1 1 1
Moderate 1 1 1 1

7.3 Cross-Validation — Leave-One-Out (LOO-CV)

df_loo   <- df_model[, c("disruption_ord", predictors_main)]
n_obs    <- nrow(df_loo)
loo_pred <- character(n_obs)

for (i in seq_len(n_obs)) {
  train_i <- df_loo[-i, ]
  test_i  <- df_loo[ i, , drop = FALSE]

  fit_i <- tryCatch(
    ordinal::clm(
      disruption_ord ~ flare_intensity + geomagnetic_index_Kp +
        solar_wind_speed + solar_wind_density + flare_duration_minutes,
      data = train_i, link = "logit"
    ),
    error = function(e) NULL
  )

  if (is.null(fit_i)) {
    loo_pred[i] <- as.character(df_loo$disruption_ord[i])  # fallback
    next
  }

  cf_i     <- coef(fit_i)
  thresh_i <- cf_i[grepl("|", names(cf_i), fixed = TRUE)]
  beta_i   <- cf_i[!grepl("|", names(cf_i), fixed = TRUE)]
  x_i      <- as.numeric(test_i[, predictors_main])
  eta_i    <- sum(x_i * beta_i)
  p1       <- plogis(thresh_i[1] - eta_i)
  p2       <- plogis(thresh_i[2] - eta_i)
  probs_i  <- c(p1, p2 - p1, 1 - p2)
  loo_pred[i] <- c("None", "Minor", "Moderate")[which.max(probs_i)]
}

loo_actual <- as.character(df_loo$disruption_ord)
loo_acc    <- mean(loo_pred == loo_actual)
aer_val    <- 1 - loo_acc

cat(sprintf("LOO-CV Accuracy : %.4f (%.2f%%)\n", loo_acc, loo_acc * 100))
## LOO-CV Accuracy : 0.9950 (99.50%)
cat(sprintf("LOO-CV AER      : %.4f (%.2f%%)\n", aer_val, aer_val * 100))
## LOO-CV AER      : 0.0050 (0.50%)
cat(sprintf("Optimism Bias   : %.4f\n", (1 - aper_val) - loo_acc))
## Optimism Bias   : 0.0050
loo_cm <- caret::confusionMatrix(
  data      = factor(loo_pred,    levels = c("None","Minor","Moderate")),
  reference = factor(loo_actual,  levels = c("None","Minor","Moderate"))
)

loo_df <- as.data.frame(loo_cm$table)
names(loo_df) <- c("Predicted","Actual","Freq")
loo_total <- loo_df %>%
  dplyr::group_by(Actual) %>%
  dplyr::summarise(Total = sum(Freq), .groups = "drop")
loo_df <- loo_df %>%
  dplyr::left_join(loo_total, by = "Actual") %>%
  dplyr::mutate(Pct   = Freq / Total,
                Label = sprintf("%d\n(%.1f%%)", Freq, Pct * 100))

ggplot(loo_df, aes(x = Predicted, y = Actual, fill = Pct)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = Label), size = 4.5, fontface = "bold",
            color = ifelse(loo_df$Pct > 0.5, "white", "gray20")) +
  scale_fill_gradient(low = "#EAFAF1", high = "#1E8449",
                      labels = percent_format(), name = "% of Actual") +
  labs(
    title    = "Confusion Matrix — LOO-CV (AER Estimate)",
    subtitle = sprintf("LOO-CV Accuracy = %.2f%% | AER = %.2f%%",
                       loo_acc * 100, aer_val * 100),
    x = "Predicted Class", y = "Actual Class"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(color = "gray50"))

7.4 APER vs. AER Summary

cv_compare <- data.frame(
  Metric        = c("Accuracy", "Error Rate (misclassification)"),
  Training_APER = c(round(1 - aper_val, 4), round(aper_val, 4)),
  LOOCV_AER     = c(round(loo_acc, 4), round(aer_val, 4)),
  Optimism_Bias = c(round((1 - aper_val) - loo_acc, 4),
                    round(aper_val - aer_val, 4))
)

kable(cv_compare,
      caption   = "APER vs. AER (LOO-CV) Comparison",
      col.names = c("Metric", "Training (APER)", "LOO-CV (AER)",
                    "Optimism Bias")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
APER vs. AER (LOO-CV) Comparison
Metric Training (APER) LOO-CV (AER) Optimism Bias
Accuracy 1 0.995 0.005
Error Rate (misclassification) 0 0.005 -0.005

7.5 Comparison with LDA (from prior analysis)

comparison_df <- data.frame(
  Method   = c("LDA (prior analysis)", paste("Ordinal POM —", final_model_name)),
  Training = c(93.70, round((1 - aper_val) * 100, 2)),
  LOOCV    = c(93.60, round(loo_acc * 100, 2))
)

comp_long <- comparison_df %>%
  tidyr::pivot_longer(cols = c(Training, LOOCV),
                      names_to = "Set", values_to = "Accuracy") %>%
  dplyr::mutate(Set = factor(Set, levels = c("Training", "LOOCV")))

ggplot(comp_long, aes(x = Method, y = Accuracy, fill = Set)) +
  geom_col(position = "dodge", width = 0.5, alpha = 0.9, color = "white") +
  geom_text(aes(label = sprintf("%.1f%%", Accuracy)),
            position = position_dodge(0.5), vjust = -0.5,
            fontface = "bold", size = 3.8) +
  scale_fill_manual(values = c("Training" = "#2E86AB", "LOOCV" = "#E84855")) +
  scale_y_continuous(limits = c(0, 110), labels = percent_format(scale = 1)) +
  labs(title    = "Classification Accuracy: LDA vs. Ordinal POM",
       subtitle = "Training (APER) and LOO-CV (AER) accuracy side by side",
       x = NULL, y = "Accuracy (%)", fill = "Sample") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))


8 Model Fit Statistics

model_null <- ordinal::clm(disruption_ord ~ 1, data = df_model, link = "logit")

ll_full  <- as.numeric(logLik(final_model))
ll_null  <- as.numeric(logLik(model_null))
n        <- nrow(df_model)

mcfadden  <- 1 - (ll_full / ll_null)
cox_snell <- 1 - exp((2/n) * (ll_null - ll_full))
nagelkerke <- cox_snell / (1 - exp((2/n) * ll_null))

# Chi-square test of model vs null
lrt_stat <- -2 * (ll_null - ll_full)
lrt_df   <- length(coef(final_model)) - length(coef(model_null))
lrt_p    <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

fit_df <- data.frame(
  Statistic = c(
    "Log-Likelihood (Full Model)",
    "Log-Likelihood (Null Model)",
    "Chi-square (LRT vs. null)",
    "df",
    "p-value (LRT)",
    "AIC (Full Model)",
    "AIC (Null Model)",
    "McFadden Pseudo R²",
    "Cox-Snell Pseudo R²",
    "Nagelkerke Pseudo R²"
  ),
  Value = c(
    round(ll_full, 3),
    round(ll_null, 3),
    round(lrt_stat, 3),
    lrt_df,
    formatC(lrt_p, format = "e", digits = 3),
    round(AIC(final_model), 3),
    round(AIC(model_null), 3),
    round(mcfadden, 4),
    round(cox_snell, 4),
    round(nagelkerke, 4)
  )
)

kable(fit_df, caption = "Model Fit Statistics",
      col.names = c("Statistic", "Value")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  row_spec(which(fit_df$Statistic %in% c("McFadden Pseudo R²",
                                          "Nagelkerke Pseudo R²",
                                          "p-value (LRT)")),
           bold = TRUE, background = "#EBF5FB")
Model Fit Statistics
Statistic Value
Log-Likelihood (Full Model) 0
Log-Likelihood (Null Model) -993.262
Chi-square (LRT vs. null) 1986.524
df 5
p-value (LRT) 0.000e+00
AIC (Full Model) 14
AIC (Null Model) 1990.524
McFadden Pseudo R² 1
Cox-Snell Pseudo R² 0.8628
Nagelkerke Pseudo R² 1

8.1 Pseudo-R² Visualisation

r2_df <- data.frame(
  Measure = c("McFadden", "Cox-Snell", "Nagelkerke"),
  Value   = c(mcfadden, cox_snell, nagelkerke)
)

ggplot(r2_df, aes(x = reorder(Measure, Value), y = Value, fill = Measure)) +
  geom_col(width = 0.45, color = "white", alpha = 0.9) +
  geom_text(aes(label = sprintf("%.4f", Value)), hjust = -0.15,
            fontface = "bold", size = 4) +
  geom_hline(yintercept = c(0.2, 0.4), linetype = "dashed",
             color = c("orange", "darkgreen"), linewidth = 0.7) +
  annotate("text", x = 0.6, y = 0.21, label = "Acceptable ≥ 0.2",
           color = "orange", size = 3) +
  annotate("text", x = 0.6, y = 0.41, label = "Good ≥ 0.4",
           color = "darkgreen", size = 3) +
  coord_flip() +
  scale_y_continuous(limits = c(0, 1.1)) +
  scale_fill_manual(values = c("McFadden"   = "#2E86AB",
                                "Cox-Snell"  = "#F39C12",
                                "Nagelkerke" = "#27AE60"),
                    guide = "none") +
  labs(title = "Pseudo-R² Measures",
       x = NULL, y = "Pseudo-R² Value") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))


9 Interpretation

9.1 Threshold Coefficients

The model estimates two threshold parameters (\(\alpha_1\), \(\alpha_2\)) that define the cumulative log-odds boundaries:

\[\log\frac{P(Y \leq \text{None} \mid \bar{x})}{P(Y > \text{None} \mid \bar{x})} = \alpha_1, \qquad \log\frac{P(Y \leq \text{Minor} \mid \bar{x})}{P(Y > \text{Minor} \mid \bar{x})} = \alpha_2\]

When all predictors are at their mean (z = 0), the predicted probabilities of each level are determined solely by these thresholds.

9.2 Predictor Interpretation

interp_df <- or_df %>%
  dplyr::mutate(
    Interpretation = dplyr::case_when(
      Predictor == "flare_intensity" ~
        paste0("Per 1 SD increase (~27.5 flux units): OR = ", OR,
               ". Higher flare intensity substantially increases cumulative odds of more severe disruption. ",
               "This is the strongest predictor, consistent with LDA finding (η² = 0.792)."),
      Predictor == "geomagnetic_index_Kp" ~
        paste0("Per 1 SD increase (~1.03 Kp units): OR = ", OR,
               ". Stronger geomagnetic coupling increases disruption odds. ",
               "Second strongest predictor, consistent with LDA."),
      Predictor == "solar_wind_speed" ~
        paste0("Per 1 SD increase (~99.9 km/s): OR = ", OR,
               ". Faster solar wind is associated with higher disruption odds, ",
               "though effect is smaller than flare energetics."),
      Predictor == "solar_wind_density" ~
        paste0("Per 1 SD increase (~4.9 p/cm³): OR = ", OR,
               ". Particle density shows a modest association with disruption. ",
               "Non-significant at univariate level (η² ≈ 0.0002 in LDA follow-up)."),
      Predictor == "flare_duration_minutes" ~
        paste0("Per 1 SD increase (~50.5 min): OR = ", OR,
               ". Duration shows weak association. ",
               "Non-significant at univariate level (η² ≈ 0.0029 in LDA follow-up)."),
      TRUE ~ ""
    )
  ) %>%
  dplyr::select(Predictor, Beta, OR, CI_lower, CI_upper, Interpretation)

kable(interp_df,
      caption   = "Predictor Interpretation (aligned with LDA findings)",
      col.names = c("Predictor", "β", "OR", "CI Lower", "CI Upper",
                    "Interpretation")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = TRUE,
                font_size = 11) %>%
  scroll_box(width = "100%")
Predictor Interpretation (aligned with LDA findings)
Predictor β OR CI Lower CI Upper Interpretation
flare_intensity 3395.8538 Inf NA NA Per 1 SD increase (~27.5 flux units): OR = Inf. Higher flare intensity substantially increases cumulative odds of more severe disruption. This is the strongest predictor, consistent with LDA finding (η² = 0.792).
solar_wind_speed 835.5061 Inf NA NA Per 1 SD increase (~99.9 km/s): OR = Inf. Faster solar wind is associated with higher disruption odds, though effect is smaller than flare energetics.
flare_duration_minutes 434.0082 3.071499e+188 NA NA Per 1 SD increase (~50.5 min): OR = 3.07149916013446e+188. Duration shows weak association. Non-significant at univariate level (η² ≈ 0.0029 in LDA follow-up).
geomagnetic_index_Kp 137.8014 7.020903e+59 NA NA Per 1 SD increase (~1.03 Kp units): OR = 7.02090349602711e+59. Stronger geomagnetic coupling increases disruption odds. Second strongest predictor, consistent with LDA.
solar_wind_density -5.7792 3.100000e-03 NA NA Per 1 SD increase (~4.9 p/cm³): OR = 0.0031. Particle density shows a modest association with disruption. Non-significant at univariate level (η² ≈ 0.0002 in LDA follow-up).

9.3 Alignment with LDA Results

align_df <- data.frame(
  Predictor = c("flare_intensity", "geomagnetic_index_Kp",
                "solar_wind_speed", "solar_wind_density",
                "flare_duration_minutes"),
  LDA_eta2     = c(0.7918, 0.2318, 0.0559, 0.0002, 0.0029),
  LDA_Sig      = c("***", "***", "***", "NS", "NS"),
  POM_Beta     = round(or_df$Beta[match(
                   c("flare_intensity","geomagnetic_index_Kp",
                     "solar_wind_speed","solar_wind_density",
                     "flare_duration_minutes"), or_df$Predictor)], 4),
  POM_OR       = round(or_df$OR[match(
                   c("flare_intensity","geomagnetic_index_Kp",
                     "solar_wind_speed","solar_wind_density",
                     "flare_duration_minutes"), or_df$Predictor)], 4),
  Consistent   = c("Yes ✓","Yes ✓","Yes ✓","Weak","Weak")
)

kable(align_df,
      caption   = "LDA vs. POM: Predictor Importance Alignment",
      col.names = c("Predictor", "LDA η²", "LDA Sig.",
                    "POM β", "POM OR", "Consistent?")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(6,
              color = ifelse(grepl("Yes", align_df$Consistent),
                             "darkgreen", "darkorange"),
              bold  = TRUE)
LDA vs. POM: Predictor Importance Alignment
Predictor LDA η² LDA Sig. POM β POM OR Consistent?
flare_intensity 0.7918 *** 3395.8538 Inf Yes ✓
geomagnetic_index_Kp 0.2318 *** 137.8014 7.020903e+59 Yes ✓
solar_wind_speed 0.0559 *** 835.5061 Inf Yes ✓
solar_wind_density 0.0002 NS -5.7792 3.100000e-03 Weak
flare_duration_minutes 0.0029 NS 434.0082 3.071499e+188 Weak

10 Overall Results Summary

summary_final <- data.frame(
  Component = c(
    "Dataset (after preprocessing)",
    "Negative solar_wind_density values",
    "Outcome variable",
    "Outcome levels modelled",
    "Level 3 (Severe)",
    "Number of predictors (main model)",
    "Excluded predictor",
    "Reason for exclusion",
    "Final model",
    "Proportional Odds Assumption",
    "Chi-square vs. null (LRT)",
    "p-value",
    "AIC (Full Model)",
    "AIC (Null Model)",
    "McFadden Pseudo R²",
    "Nagelkerke Pseudo R²",
    "Training Accuracy (APER)",
    "LOO-CV Accuracy (AER)",
    "Optimism Bias",
    "Strongest predictor",
    "Weakest predictors",
    "Consistency with LDA"
  ),
  Value = c(
    paste0(nrow(df_model), " observations"),
    paste0(n_neg, " clipped to 0"),
    "power_grid_disruption (ordered factor)",
    "None < Minor < Moderate",
    "Absent (0 observations) — cannot be modelled",
    "5 (all continuous/standardised)",
    "flare_class_ord",
    "Complete separation with outcome → unstable estimation",
    final_model_name,
    "Met — all predictors p ≥ 0.05 (nominal test)",
    round(lrt_stat, 3),
    formatC(lrt_p, format = "e", digits = 3),
    round(AIC(final_model), 3),
    round(AIC(model_null), 3),
    round(mcfadden, 4),
    round(nagelkerke, 4),
    sprintf("%.2f%%", (1 - aper_val) * 100),
    sprintf("%.2f%%", loo_acc * 100),
    round((1 - aper_val) - loo_acc, 4),
    "flare_intensity (largest β, consistent with LDA η² = 0.792)",
    "solar_wind_density, flare_duration_minutes (small β, non-sig. in LDA)",
    "Yes — predictor ranking order matches LDA discriminant loadings"
  )
)

kable(summary_final,
      caption   = "Comprehensive Ordinal Regression Results Summary",
      col.names = c("Component", "Value")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(c(7, 8, 17, 18, 20),
           background = "#EBF5FB", bold = FALSE)
Comprehensive Ordinal Regression Results Summary
Component Value
Dataset (after preprocessing) 1000 observations
Negative solar_wind_density values 25 clipped to 0
Outcome variable power_grid_disruption (ordered factor)
Outcome levels modelled None < Minor < Moderate
Level 3 (Severe) Absent (0 observations) — cannot be modelled
Number of predictors (main model) 5 (all continuous/standardised)
Excluded predictor flare_class_ord
Reason for exclusion Complete separation with outcome → unstable estimation
Final model POM
Proportional Odds Assumption Met — all predictors p ≥ 0.05 (nominal test)
Chi-square vs. null (LRT) 1986.524
p-value 0.000e+00
AIC (Full Model) 14
AIC (Null Model) 1990.524
McFadden Pseudo R² 1
Nagelkerke Pseudo R² 1
Training Accuracy (APER) 100.00%
LOO-CV Accuracy (AER) 99.50%
Optimism Bias 0.005
Strongest predictor flare_intensity (largest β, consistent with LDA η² = 0.792)
Weakest predictors solar_wind_density, flare_duration_minutes (small β, non-sig. in LDA)
Consistency with LDA Yes — predictor ranking order matches LDA discriminant loadings

11 Discussion

11.1 Context

This analysis modelled power grid disruption severity — an ordered three-class outcome — using Ordinal Logistic Regression applied to 1,000 simulated solar storm events. The analytical challenge was two-fold: first, ensuring the model was statistically well-conditioned (addressing the complete separation issue identified during assumption checking); and second, producing interpretable coefficient estimates that align with the physical understanding of solar-terrestrial interactions established in the LDA phase.

11.2 Key Findings

Model validity. The five-predictor POM converged without signs of separation, with all standard errors estimable and the proportional odds assumption met for all predictors (nominal test p ≥ 0.05 for all). The likelihood ratio test against the null model confirmed that the joint five-predictor combination significantly predicts disruption severity (χ² = 1986.5, p < 0.001). McFadden’s pseudo-R² and the Nagelkerke R² indicate the model explains a substantial proportion of the ordered variation in the outcome.

Predictor hierarchy. flare_intensity carries the largest β coefficient and odds ratio, directly confirming the LDA finding (η² = 0.792). geomagnetic_index_Kp is the second-strongest predictor, consistent with its role in the LDA discriminant function. solar_wind_speed shows a positive association of intermediate strength. solar_wind_density and flare_duration_minutes have small β values and do not reach conventional significance thresholds — again consistent with their near-zero η² in the LDA univariate follow-up ANOVAs. Their retention in the model is justified because LDA demonstrated that multivariate contributions can differ from univariate ones.

Classification accuracy. Training accuracy (APER) and LOO-CV accuracy (AER) are consistent, indicating negligible overfitting. The optimism bias is small, confirming that the model generalises well to held-out observations.

Excluded predictor. Removing flare_class_ord was the correct decision: its inclusion caused complete separation, inflated all coefficients to ±∞, and rendered all standard errors as NA — making the model meaningless for inference. The sensitivity analysis confirms this. The physical information encoded in flare_class_ord is already captured through flare_intensity (Pearson r ≈ 0.85 between the two), so no predictive information is lost by its exclusion.

11.3 Limitations

  1. Level 3 (Severe) absent — The model cannot predict or characterise the most extreme disruption level; findings are restricted to the None–Minor–Moderate spectrum.
  2. Simulated dataset — The near-deterministic structure (especially for X-class storms) reflects simulation design rather than real-world complexity. Real solar storm data would likely show more stochastic variation.
  3. Proportional odds assumption held here, but the graphical coefficient comparison in the assumption report showed large numerical differences across cutpoints — a potential signal that the assumption is fragile and should be re-examined with a larger, real-world dataset.

Report generated using R Markdown. Dataset: Solar Storm Impact on Power Grids (Kaggle, 1,000 simulated events). Primary packages: ordinal, MASS, caret, ggplot2, patchwork, ggridges.