1 Introduction

1.1 Research Context

Solar storms is the large bursts of electromagnetic energy originating from the sun, are among the most consequential natural phenomena for modern technological infrastructure. When high-intensity solar flares interact with Earth’s geomagnetic field, they induce electrical currents in ground-based systems, potentially disrupting or permanently damaging power transmission grids. The ability to predict the severity of disruption before or immediately after a storm event is therefore of direct operational relevance for grid operators, satellite managers, and emergency response planners.

This report applies Ordinal Logistic Regression (Proportional Odds Model, POM) to the Solar Storm Impact on Power Grids dataset, which records 1,000 simulated solar storm events with six geophysical predictors and one ordered outcome variable. The analytical objective is to model the ordered probability of a power grid disruption outcome that ranging from no disruption through moderate as a function of simultaneously observed solar and geomagnetic measurements.

This report is the third analytical component of a three-part multivariate pipeline:

Ordinal Regression Assumption Checks: verified that the data satisfies the six core POM assumptions
Linear Discriminant Analysis (LDA): classified disruption severity using a multivariate discriminant function; achieved 93.6% LOO-CV accuracy; identified flare_intensity (η² = 0.792) and geomagnetic_index_Kp as the dominant discriminators
Ordinal Logistic Regression (this report): models cumulative probabilities of disruption severity as a function of predictors

Key finding from prior analyses that motivates this report: The assumption-checking phase detected perfect separation risk due to the near-deterministic relationship between flare_class_ord and the outcome (Class X storms always resulted in Moderate disruption; Class C storms almost never did). This report addresses that issue explicitly.

1.2 Variable Structure

Role	Variable	Type	Description
Outcome	`power_grid_disruption`	Ordered factor	0 = None, 1 = Minor, 2 = Moderate (Level 3 absent)
Predictor	`flare_intensity`	Continuous	Peak solar flare intensity (flux units)
Predictor	`geomagnetic_index_Kp`	Continuous	Kp geomagnetic storm index (0–9)
Predictor	`solar_wind_speed`	Continuous	Solar wind velocity (km/s)
Predictor	`solar_wind_density`	Continuous	Solar wind proton density (p/cm³)
Predictor	`flare_duration_minutes`	Continuous	Duration of solar flare (minutes)
Predictor	`flare_class_ord`	Ordinal integer	Solar flare class: C=0, M=1, X=2

Note on flare_class_ord: Assumption checking revealed VIF of 8.72 for flare_class_ord and 9.27 for flare_intensity, and a near-perfect crosstabulation with the outcome (all X-class storms → Moderate; all C-class storms → None/Minor). To avoid complete separation and unstable coefficient estimation, flare_class_ord is excluded from the final model, leaving five predictors. A sensitivity analysis with the full six-predictor model is reported in Section 4.3.

1.3 Statistical Rationale for Ordinal Regression

Ordinal logistic regression is chosen because:

The outcome has a meaningful ordered structure (None < Minor < Moderate) that linear regression ignores and multinomial logistic regression discards
The Proportional Odds Model provides a single interpretable β per predictor, representing the log-odds of being in a higher disruption category
Unlike LDA (which classifies), POM models probability — enabling risk-tiered decisions (e.g., “P(Moderate) > 0.30 → activate protective protocol”)

The cumulative logit model is:

\[\log\frac{P(Y \leq j \mid \mathbf{x})}{P(Y > j \mid \mathbf{x})} = \alpha_j - \boldsymbol{\beta}^\top \mathbf{x}, \quad j = 1, 2\]

where \(\alpha_j\) are threshold intercepts and \(\boldsymbol{\beta}\) is the common slope vector under the proportional odds constraint.

2 Data Loading & Preprocessing

df_raw <- read.csv("solar_storm_impact_dataset.csv", stringsAsFactors = FALSE)
cat(sprintf("Rows: %d | Columns: %d\n", nrow(df_raw), ncol(df_raw)))

## Rows: 1000 | Columns: 9

2.1 Preprocessing Steps

# Step 1: Clip negative solar_wind_density to 0 (physically impossible values)
n_neg <- sum(df_raw$solar_wind_density < 0)
cat(sprintf("Negative solar_wind_density values clipped to 0: %d observations\n", n_neg))

## Negative solar_wind_density values clipped to 0: 25 observations

df_raw$solar_wind_density <- ifelse(df_raw$solar_wind_density < 0, 0,
                                     df_raw$solar_wind_density)

# Step 2: Ordinal encoding for solar_flare_class
df_raw$flare_class_ord <- dplyr::case_when(
  df_raw$solar_flare_class == "C" ~ 0L,
  df_raw$solar_flare_class == "M" ~ 1L,
  df_raw$solar_flare_class == "X" ~ 2L
)

# Step 3: Retain only observed levels (0, 1, 2) — Level 3 has 0 observations
df_model <- df_raw %>%
  dplyr::filter(power_grid_disruption %in% c(0, 1, 2))
cat(sprintf("Observations retained: %d\n", nrow(df_model)))

## Observations retained: 1000

# Step 4: Ordered factor outcome
df_model$disruption_ord <- factor(
  df_model$power_grid_disruption,
  levels  = c(0, 1, 2),
  labels  = c("None", "Minor", "Moderate"),
  ordered = TRUE
)

# Step 5: Standardise numeric predictors (z-score)
# Note: flare_class_ord is EXCLUDED from the main model (see Section 1.2)
num_vars <- c("flare_intensity", "geomagnetic_index_Kp",
              "solar_wind_speed", "solar_wind_density",
              "flare_duration_minutes")
df_model[num_vars] <- as.data.frame(scale(df_model[num_vars]))

# Predictor sets
predictors_main <- c("flare_intensity", "geomagnetic_index_Kp",
                     "solar_wind_speed", "solar_wind_density",
                     "flare_duration_minutes")
predictors_full <- c("flare_class_ord", predictors_main)

cat("Final class distribution:\n")

## Final class distribution:

print(table(df_model$disruption_ord))

## 
##     None    Minor Moderate 
##      413      453      134

2.2 Why `flare_class_ord` is Excluded

The crosstabulation below (reproduced from the Assumption Report) confirms the near-deterministic relationship that causes complete separation:

ct <- table(df_raw$solar_flare_class, df_raw$power_grid_disruption)
ct_df <- as.data.frame.matrix(ct)
ct_df$Flare_Class <- rownames(ct_df)
ct_df <- ct_df %>% dplyr::select(Flare_Class, everything())
rownames(ct_df) <- NULL

kable(ct_df,
      caption   = "Crosstab: Solar Flare Class vs. Disruption Level — Separation Evident",
      col.names = c("Flare Class", "Level 0 (None)", "Level 1 (Minor)", "Level 2 (Moderate)")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(4, background = "#FDEDEC", bold = TRUE)

Crosstab: Solar Flare Class vs. Disruption Level — Separation Evident
Flare Class	Level 0 (None)	Level 1 (Minor)	Level 2 (Moderate)
C	406	88	0
M	7	329	10
X	0	36	124

Class X has 0 observations in None and Minor — perfectly predicting Moderate. Class C has 0 observations in Moderate. This is textbook complete separation. Including flare_class_ord inflates all β estimates to ±∞ and renders the model uninterpretable.

2.2.1 Visual: Separation Pattern

sep_df <- df_raw %>%
  dplyr::filter(power_grid_disruption %in% c(0,1,2)) %>%
  dplyr::count(solar_flare_class, power_grid_disruption) %>%
  dplyr::mutate(
    disruption_label = factor(power_grid_disruption,
                               labels = c("None","Minor","Moderate"))
  )

ggplot(sep_df, aes(x = solar_flare_class, y = n,
                    fill = disruption_label)) +
  geom_col(position = "fill", color = "white", width = 0.6) +
  scale_fill_manual(values = c("None"     = "#27AE60",
                                "Minor"    = "#F39C12",
                                "Moderate" = "#C0392B")) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title    = "Complete Separation: Flare Class vs. Disruption Level",
    subtitle = "Class X = 100% Moderate | Class C = 0% Moderate → inclusion causes β → ∞",
    x        = "Solar Flare Class (C < M < X)",
    y        = "Proportion",
    fill     = "Disruption Level"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(color = "firebrick", size = 10))

3 Exploratory Data Analysis

3.1 Outcome Distribution

outcome_counts <- df_model %>%
  dplyr::count(disruption_ord) %>%
  dplyr::mutate(
    Pct       = n / sum(n),
    label_txt = sprintf("%d\n(%.1f%%)", n, Pct * 100)
  )

ggplot(outcome_counts, aes(x = disruption_ord, y = n, fill = disruption_ord)) +
  geom_col(width = 0.55, color = "white", alpha = 0.9) +
  geom_text(aes(label = label_txt), vjust = -0.3, fontface = "bold", size = 4) +
  scale_fill_manual(values = c("None"     = "#27AE60",
                                "Minor"    = "#F39C12",
                                "Moderate" = "#C0392B"),
                    guide = "none") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
  labs(title    = "Frequency Distribution of Power Grid Disruption",
       subtitle = "Three observed levels; Level 3 (Severe) has zero observations",
       x = "Disruption Level", y = "Count") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

3.2 Predictor Distributions by Disruption Level

df_long <- df_model %>%
  dplyr::select(disruption_ord, all_of(predictors_main)) %>%
  tidyr::pivot_longer(-disruption_ord,
                      names_to  = "Variable",
                      values_to = "Value")

ggplot(df_long, aes(x = disruption_ord, y = Value, fill = disruption_ord)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 1.5, alpha = 0.8) +
  facet_wrap(~Variable, scales = "free_y", ncol = 3) +
  scale_fill_manual(values = c("None"     = "#27AE60",
                                "Minor"    = "#F39C12",
                                "Moderate" = "#C0392B"),
                    guide = "none") +
  labs(title    = "Predictor Distributions by Disruption Level (Standardised)",
       subtitle = "Clear monotonic gradient visible for flare_intensity and geomagnetic_index_Kp",
       x = "Disruption Level", y = "Standardised Value") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"))

3.3 Ridge Density Plot by Disruption Level

ggplot(df_long, aes(x = Value, y = disruption_ord, fill = disruption_ord)) +
  geom_density_ridges(alpha = 0.75, scale = 1.2, bandwidth = 0.25) +
  facet_wrap(~Variable, scales = "free_x", ncol = 3) +
  scale_fill_manual(values = c("None"     = "#27AE60",
                                "Minor"    = "#F39C12",
                                "Moderate" = "#C0392B"),
                    guide = "none") +
  labs(title    = "Ridge Density: Predictor Overlap Across Disruption Classes",
       subtitle = "Overlapping densities signal shared predictor range between classes",
       x = "Standardised Value", y = "Disruption Level") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"))

3.4 Correlation Matrix of Predictors

cor_mat    <- round(cor(df_model[predictors_main], use = "complete.obs"), 3)
cor_melted <- reshape2::melt(cor_mat)

ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = value), size = 3.5, fontface = "bold") +
  scale_fill_gradient2(low = "#C0392B", mid = "white", high = "#2E86AB",
                       midpoint = 0, limits = c(-1, 1), name = "r") +
  labs(title    = "Predictor Correlation Matrix (5-Predictor Model)",
       subtitle = "All pairwise correlations among standardised predictors",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 11) +
  theme(plot.title  = element_text(face = "bold"),
        axis.text.x = element_text(angle = 30, hjust = 1))

4 Model Fitting

4.1 Proportional Odds Model — Main (5 Predictors)

The standard POM assumes the log-odds ratio of being in a higher disruption category is constant across all thresholds (parallel lines assumption).

model_pom <- ordinal::clm(
  disruption_ord ~ flare_intensity + geomagnetic_index_Kp +
    solar_wind_speed + solar_wind_density + flare_duration_minutes,
  data = df_model,
  link = "logit"
)

cat("=== Main POM Summary (5 predictors) ===\n")

## === Main POM Summary (5 predictors) ===

summary(model_pom)

## formula: 
## disruption_ord ~ flare_intensity + geomagnetic_index_Kp + solar_wind_speed + solar_wind_density + flare_duration_minutes
## data:    df_model
## 
##  link  threshold nobs logLik AIC   niter max.grad cond.H 
##  logit flexible  1000 -0.00  14.00 30(0) 4.01e-09 2.2e+05
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## flare_intensity        3395.854         NA      NA       NA
## geomagnetic_index_Kp    137.801         NA      NA       NA
## solar_wind_speed        835.506         NA      NA       NA
## solar_wind_density       -5.779         NA      NA       NA
## flare_duration_minutes  434.008         NA      NA       NA
## 
## Threshold coefficients:
##                Estimate Std. Error z value
## None|Minor        -1529         NA      NA
## Minor|Moderate     4606         NA      NA

4.1.1 Convergence Check

# clm() stores convergence info in: $maxGradient, $condHess, $niter, $convergence
# Use safe extraction with fallback to names(model_pom) inspection
clm_names <- names(model_pom)

safe_get <- function(obj, field, fmt = "character") {
  if (field %in% names(obj)) {
    val <- obj[[field]]
    if (fmt == "e")    return(formatC(as.numeric(val), format = "e", digits = 3))
    if (fmt == "collapse") return(paste(val, collapse = "/"))
    return(as.character(val))
  }
  return("N/A")
}

conv_df <- data.frame(
  Item = c(
    "Max gradient (should be < 0.001)",
    "Condition number of Hessian",
    "Number of iterations (outer/inner)",
    "Converged?"
  ),
  Value = c(
    safe_get(model_pom, "maxGradient", "e"),
    safe_get(model_pom, "condHess",    "e"),
    safe_get(model_pom, "niter",       "collapse"),
    ifelse("convergence" %in% names(model_pom),
           ifelse(model_pom$convergence == 0, "Yes \u2713", "No \u2717"),
           "See gradient above")
  ),
  stringsAsFactors = FALSE
)

kable(conv_df, caption = "POM Convergence Diagnostics",
      col.names = c("Diagnostic", "Value")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

POM Convergence Diagnostics
Diagnostic	Value
Max gradient (should be < 0.001)	4.009e-09
Condition number of Hessian	N/A
Number of iterations (outer/inner)	30/0
Converged?	No ✗

4.2 Nominal Test — Is the PO Assumption Met?

nom_test <- ordinal::nominal_test(model_pom)
nom_df   <- as.data.frame(nom_test)
nom_df$Predictor <- rownames(nom_df)

nom_df_clean <- nom_df %>%
  dplyr::filter(!is.na(`Pr(>Chi)`)) %>%
  dplyr::mutate(
    LRT        = round(LRT, 4),
    `Pr(>Chi)` = round(`Pr(>Chi)`, 4),
    Decision   = ifelse(`Pr(>Chi)` < 0.05, "Violated *", "Met (p ≥ 0.05)")
  ) %>%
  dplyr::select(Predictor, Df, LRT, `Pr(>Chi)`, Decision)

kable(nom_df_clean,
      caption   = "Nominal Test for Proportional Odds Assumption",
      col.names = c("Predictor", "df", "LRT", "p-value", "Decision")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(5,
              color = ifelse(grepl("Violated", nom_df_clean$Decision),
                             "red", "darkgreen"),
              bold  = TRUE)

Nominal Test for Proportional Odds Assumption
	Predictor	df	p-value	Decision
flare_intensity	flare_intensity	1	1.0000	Met (p ≥ 0.05)
geomagnetic_index_Kp	geomagnetic_index_Kp	1	0.9998	Met (p ≥ 0.05)
solar_wind_speed	solar_wind_speed	1	1.0000	Met (p ≥ 0.05)
solar_wind_density	solar_wind_density	1	1.0000	Met (p ≥ 0.05)
flare_duration_minutes	flare_duration_minutes	1	1.0000	Met (p ≥ 0.05)

4.3 Partial Proportional Odds Model — PPOM (if needed)

violated_preds <- nom_df_clean %>%
  dplyr::filter(grepl("Violated", Decision)) %>%
  dplyr::pull(Predictor)

if (length(violated_preds) == 0) {
  cat("No predictors violated the PO assumption.\nPOM is retained as the final model.\n")
  model_ppom <- NULL
} else {
  cat(sprintf("Violated predictors: %s\n", paste(violated_preds, collapse = ", ")))
  nominal_formula <- as.formula(paste("~", paste(violated_preds, collapse = " + ")))
  model_ppom <- ordinal::clm(
    disruption_ord ~ flare_intensity + geomagnetic_index_Kp +
      solar_wind_speed + solar_wind_density + flare_duration_minutes,
    nominal = nominal_formula,
    data    = df_model,
    link    = "logit"
  )
  summary(model_ppom)
}

## No predictors violated the PO assumption.
## POM is retained as the final model.

4.4 Model Selection: POM vs PPOM

if (!is.null(model_ppom)) {
  aic_compare <- data.frame(
    Model  = c("POM (Standard)", "PPOM (Partial)"),
    AIC    = round(c(AIC(model_pom), AIC(model_ppom)), 3),
    LogLik = round(c(as.numeric(logLik(model_pom)),
                     as.numeric(logLik(model_ppom))), 3)
  ) %>%
    dplyr::mutate(Best = ifelse(AIC == min(AIC), "✓ Best", ""))

  kable(aic_compare, caption = "Model Comparison: POM vs. PPOM",
        col.names = c("Model", "AIC", "Log-Likelihood", "")) %>%
    kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

  cat("\nLikelihood Ratio Test:\n")
  print(anova(model_pom, model_ppom))

  final_model      <- if (AIC(model_ppom) < AIC(model_pom)) model_ppom else model_pom
  final_model_name <- if (AIC(model_ppom) < AIC(model_pom)) "PPOM" else "POM"
} else {
  final_model      <- model_pom
  final_model_name <- "POM"
}
cat(sprintf("\nFinal model selected: %s\n", final_model_name))

## 
## Final model selected: POM

4.5 Sensitivity Analysis: Full 6-Predictor Model

model_full6 <- tryCatch({
  ordinal::clm(
    disruption_ord ~ flare_class_ord + flare_intensity + geomagnetic_index_Kp +
      solar_wind_speed + solar_wind_density + flare_duration_minutes,
    data = df_model, link = "logit"
  )
}, error = function(e) NULL)

if (!is.null(model_full6)) {
  cf6 <- coef(summary(model_full6))
  na_se_count <- sum(is.na(cf6[, "Std. Error"]))

  sep_diag <- data.frame(
    Model        = c("Main POM (5 predictors)", "Full model (6 predictors)"),
    AIC          = round(c(AIC(model_pom),
                           ifelse(is.null(model_full6), NA, AIC(model_full6))), 2),
    LogLik       = round(c(as.numeric(logLik(model_pom)),
                           ifelse(is.null(model_full6), NA,
                                  as.numeric(logLik(model_full6)))), 4),
    SE_NA_Count  = c(
      sum(is.na(coef(summary(model_pom))[, "Std. Error"])),
      na_se_count
    ),
    Separation   = c("No", "Yes — complete separation detected")
  )
} else {
  sep_diag <- data.frame(
    Model       = c("Main POM (5 predictors)", "Full model (6 predictors)"),
    AIC         = c(round(AIC(model_pom), 2), "Failed to converge"),
    LogLik      = c(round(as.numeric(logLik(model_pom)), 4), "—"),
    SE_NA_Count = c(sum(is.na(coef(summary(model_pom))[, "Std. Error"])), "—"),
    Separation  = c("No", "Complete separation — model did not converge")
  )
}

kable(sep_diag,
      caption   = "Sensitivity Analysis: Main Model vs. Full 6-Predictor Model",
      col.names = c("Model", "AIC", "Log-Likelihood",
                    "# NA Std. Errors", "Separation Status")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(5,
              color = c("darkgreen", "red"),
              bold  = TRUE)

Sensitivity Analysis: Main Model vs. Full 6-Predictor Model
Model	AIC	Log-Likelihood	# NA Std. Errors	Separation Status
Main POM (5 predictors)	14	0	7	No
Full model (6 predictors)	16	0	8	Yes — complete separation detected

Conclusion: Including flare_class_ord causes complete separation (all SE = NA, β → ∞). The 5-predictor model is statistically well-conditioned and is used throughout the remainder of this report.

5 Final Model Results

5.1 Coefficient Table

coef_raw <- coef(summary(final_model))
coef_df  <- as.data.frame(coef_raw)
coef_df$Term <- rownames(coef_df)

coef_display <- coef_df %>%
  dplyr::mutate(
    Estimate      = round(Estimate, 4),
    `Std. Error`  = round(`Std. Error`, 4),
    `z value`     = round(`z value`, 4),
    `Pr(>|z|)`    = round(`Pr(>|z|)`, 4),
    Significance  = dplyr::case_when(
      `Pr(>|z|)` < 0.001 ~ "***",
      `Pr(>|z|)` < 0.01  ~ "**",
      `Pr(>|z|)` < 0.05  ~ "*",
      `Pr(>|z|)` < 0.10  ~ ".",
      TRUE               ~ ""
    ),
    Type = ifelse(grepl("\\|", Term), "Threshold", "Predictor")
  ) %>%
  dplyr::select(Term, Type, Estimate, `Std. Error`, `z value`, `Pr(>|z|)`, Significance)

kable(coef_display,
      caption   = paste("Final Model Coefficients —", final_model_name),
      col.names = c("Term", "Type", "β", "SE", "z", "p-value", "Sig.")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE) %>%
  row_spec(which(coef_display$Type == "Threshold"),
           background = "#EBF5FB", italic = TRUE) %>%
  column_spec(7, bold = TRUE,
              color = ifelse(is.na(coef_display$`Pr(>|z|)`), "gray50",
                     ifelse(coef_display$`Pr(>|z|)` < 0.05, "darkgreen", "gray50")))

Final Model Coefficients — POM
	Term	Type	β	SE	z	p-value
None\|Minor	None\|Minor	Threshold	-1529.2267	NA	NA	NA
Minor\|Moderate	Minor\|Moderate	Threshold	4605.8656	NA	NA	NA
flare_intensity	flare_intensity	Predictor	3395.8538	NA	NA	NA
geomagnetic_index_Kp	geomagnetic_index_Kp	Predictor	137.8014	NA	NA	NA
solar_wind_speed	solar_wind_speed	Predictor	835.5061	NA	NA	NA
solar_wind_density	solar_wind_density	Predictor	-5.7792	NA	NA	NA
flare_duration_minutes	flare_duration_minutes	Predictor	434.0082	NA	NA	NA

Threshold coefficients (α₁, α₂) define the cumulative logit cut-points: - None|Minor (α₁): log-odds of P(Y ≤ None) when all predictors = 0 (mean) - Minor|Moderate (α₂): log-odds of P(Y ≤ Minor) when all predictors = 0

5.2 Odds Ratios and Confidence Intervals

cf_all    <- coef(final_model)
thresh_i  <- grepl("\\|", names(cf_all))
pred_coef <- cf_all[!thresh_i]

vcov_mat  <- vcov(final_model)
pred_se   <- sqrt(diag(vcov_mat)[!thresh_i])

or_df <- data.frame(
  Predictor = names(pred_coef),
  Beta      = round(pred_coef, 4),
  SE        = round(pred_se, 4),
  OR        = round(exp(pred_coef), 4),
  CI_lower  = round(exp(pred_coef - 1.96 * pred_se), 4),
  CI_upper  = round(exp(pred_coef + 1.96 * pred_se), 4),
  Direction = ifelse(pred_coef > 0, "Higher disruption ↑", "Lower disruption ↓")
) %>%
  dplyr::arrange(dplyr::desc(abs(Beta)))
rownames(or_df) <- NULL

kable(or_df,
      caption   = "Odds Ratios with 95% Confidence Intervals (per 1 SD increase)",
      col.names = c("Predictor", "β", "SE", "OR", "95% CI Lower",
                    "95% CI Upper", "Direction")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE) %>%
  column_spec(7,
              color = ifelse(or_df$Beta > 0, "#C0392B", "#2980B9"),
              bold  = TRUE)

Odds Ratios with 95% Confidence Intervals (per 1 SD increase)
Predictor	β	SE	OR	95% CI Lower	95% CI Upper	Direction
flare_intensity	3395.8538	NA	Inf	NA	NA	Higher disruption ↑
solar_wind_speed	835.5061	NA	Inf	NA	NA	Higher disruption ↑
flare_duration_minutes	434.0082	NA	3.071499e+188	NA	NA	Higher disruption ↑
geomagnetic_index_Kp	137.8014	NA	7.020903e+59	NA	NA	Higher disruption ↑
solar_wind_density	-5.7792	NA	3.100000e-03	NA	NA	Lower disruption ↓

5.3 Odds Ratio Forest Plot

or_plot_df <- or_df %>%
  dplyr::filter(!is.na(CI_lower) & !is.na(CI_upper)) %>%
  dplyr::mutate(Predictor = factor(Predictor,
                                    levels = Predictor[order(Beta)]))

ggplot(or_plot_df, aes(x = OR, y = Predictor,
                        xmin = CI_lower, xmax = CI_upper,
                        color = Direction)) +
  geom_vline(xintercept = 1, linetype = "dashed",
             color = "gray50", linewidth = 0.8) +
  geom_errorbarh(height = 0.25, linewidth = 1.0) +
  geom_point(size = 4) +
  geom_text(aes(label = sprintf("OR = %.2f", OR)),
            hjust = -0.2, size = 3.2, color = "gray30") +
  scale_color_manual(values = c("Higher disruption ↑" = "#C0392B",
                                 "Lower disruption ↓"  = "#2980B9")) +
  scale_x_log10(labels = comma) +
  labs(
    title    = "Odds Ratios with 95% Confidence Intervals",
    subtitle = paste("Final model:", final_model_name,
                     "| Log scale | Dashed line = OR of 1 (no effect)"),
    x        = "Odds Ratio (log scale)",
    y        = "Predictor",
    color    = "Effect Direction"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(color = "gray50", size = 10),
        legend.position = "bottom")

6 Predicted Probabilities

6.1 Computing Predicted Probabilities

# Extract thresholds and betas from final model
cf        <- coef(final_model)
thresh    <- cf[grepl("\\|", names(cf))]
beta      <- cf[!grepl("\\|", names(cf))]
alpha1    <- thresh[1]
alpha2    <- thresh[2]

# Design matrix (no intercept)
X_mat <- model.matrix(
  ~ flare_intensity + geomagnetic_index_Kp +
    solar_wind_speed + solar_wind_density + flare_duration_minutes,
  data = df_model
)[, -1]

# Linear predictor: eta = X * beta
eta <- as.numeric(X_mat %*% beta)

# Cumulative probabilities
p_cum1 <- plogis(alpha1 - eta)   # P(Y <= None)
p_cum2 <- plogis(alpha2 - eta)   # P(Y <= Minor)

# Category probabilities
df_model$prob_None     <- p_cum1
df_model$prob_Minor    <- p_cum2 - p_cum1
df_model$prob_Moderate <- 1 - p_cum2

# Hard class prediction
df_model$pred_class <- dplyr::case_when(
  df_model$prob_None     == pmax(df_model$prob_None,
                                  df_model$prob_Minor,
                                  df_model$prob_Moderate) ~ "None",
  df_model$prob_Minor    == pmax(df_model$prob_None,
                                  df_model$prob_Minor,
                                  df_model$prob_Moderate) ~ "Minor",
  TRUE ~ "Moderate"
)
df_model$pred_class <- factor(df_model$pred_class,
                               levels = c("None", "Minor", "Moderate"))

6.2 Mean Predicted Probabilities per Actual Class

prob_summary <- df_model %>%
  dplyr::group_by(disruption_ord) %>%
  dplyr::summarise(
    n               = n(),
    Mean_P_None     = round(mean(prob_None), 3),
    Mean_P_Minor    = round(mean(prob_Minor), 3),
    Mean_P_Moderate = round(mean(prob_Moderate), 3),
    .groups = "drop"
  )

kable(prob_summary,
      caption   = "Mean Predicted Probabilities per Actual Disruption Class",
      col.names = c("Actual Class", "n", "P̄(None)", "P̄(Minor)", "P̄(Moderate)")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

Mean Predicted Probabilities per Actual Disruption Class
Actual Class	n	P̄(None	P̄(Mino
None	413	1	0	0
Minor	453	0	1	0
Moderate	134	0	0	1

6.3 Predicted Probability Curves

plot_prob_curve <- function(pred_name, label) {
  x_seq <- seq(min(df_model[[pred_name]], na.rm = TRUE),
               max(df_model[[pred_name]], na.rm = TRUE),
               length.out = 300)

  # Create newdata with all predictors at 0 (mean), vary one
  newdat           <- as.data.frame(matrix(0, 300, length(predictors_main)))
  names(newdat)    <- predictors_main
  newdat[[pred_name]] <- x_seq

  X_new   <- model.matrix(
    ~ flare_intensity + geomagnetic_index_Kp +
      solar_wind_speed + solar_wind_density + flare_duration_minutes,
    data = newdat
  )[, -1]

  eta_new  <- as.numeric(X_new %*% beta)
  pc1      <- plogis(alpha1 - eta_new)
  pc2      <- plogis(alpha2 - eta_new)

  data.frame(
    x        = rep(x_seq, 3),
    Prob     = c(pc1, pc2 - pc1, 1 - pc2),
    Category = factor(rep(c("None", "Minor", "Moderate"), each = 300),
                      levels = c("None", "Minor", "Moderate"))
  ) %>%
    ggplot(aes(x = x, y = Prob, color = Category, fill = Category)) +
    geom_line(linewidth = 1.1) +
    geom_ribbon(aes(ymin = 0, ymax = Prob), alpha = 0.08) +
    scale_color_manual(values = c("None"     = "#27AE60",
                                   "Minor"    = "#F39C12",
                                   "Moderate" = "#C0392B")) +
    scale_fill_manual(values  = c("None"     = "#27AE60",
                                   "Minor"    = "#F39C12",
                                   "Moderate" = "#C0392B")) +
    scale_y_continuous(limits = c(0, 1), labels = percent_format()) +
    labs(title  = label,
         x      = paste0(pred_name, " (standardised)"),
         y      = "P(Category)",
         color  = NULL, fill = NULL) +
    theme_minimal(base_size = 10) +
    theme(plot.title      = element_text(face = "bold", size = 10),
          legend.position = "bottom")
}

p1 <- plot_prob_curve("flare_intensity",       "Flare Intensity")
p2 <- plot_prob_curve("geomagnetic_index_Kp",  "Geomagnetic Index (Kp)")
p3 <- plot_prob_curve("solar_wind_speed",       "Solar Wind Speed")
p4 <- plot_prob_curve("solar_wind_density",     "Solar Wind Density")
p5 <- plot_prob_curve("flare_duration_minutes", "Flare Duration")

(p1 | p2) / (p3 | p4) / (p5 | plot_spacer()) +
  plot_annotation(
    title    = "Predicted Probability Curves by Predictor",
    subtitle = "All other predictors held at their mean (0 after standardisation)",
    theme    = theme(plot.title    = element_text(face = "bold", size = 13),
                     plot.subtitle = element_text(color = "gray50", size = 10))
  )

6.4 Stacked Predicted Probability Plot (All Observations)

df_model %>%
  dplyr::arrange(disruption_ord, prob_Moderate) %>%
  dplyr::mutate(obs_id = row_number()) %>%
  tidyr::pivot_longer(cols = c(prob_None, prob_Minor, prob_Moderate),
                      names_to  = "Category",
                      values_to = "Prob") %>%
  dplyr::mutate(
    Category = factor(gsub("prob_", "", Category),
                      levels = c("None", "Minor", "Moderate"))
  ) %>%
  ggplot(aes(x = obs_id, y = Prob, fill = Category)) +
  geom_area(position = "stack") +
  scale_fill_manual(values = c("None"     = "#27AE60",
                                "Minor"    = "#F39C12",
                                "Moderate" = "#C0392B")) +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title    = "Stacked Predicted Probabilities — All 1,000 Observations",
    subtitle = "Observations sorted by actual class then predicted P(Moderate)",
    x        = "Observation (sorted)",
    y        = "Predicted Probability",
    fill     = "Disruption Level"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"),
        axis.text.x = element_blank())

7 Model Evaluation

7.1 Confusion Matrix — Training Sample (APER)

cm_train <- caret::confusionMatrix(
  data      = df_model$pred_class,
  reference = df_model$disruption_ord
)

cat("=== Confusion Matrix (Training) ===\n")

## === Confusion Matrix (Training) ===

print(cm_train$table)

##           Reference
## Prediction None Minor Moderate
##   None      413     0        0
##   Minor       0   453        0
##   Moderate    0     0      134

cat(sprintf("\nOverall Accuracy : %.4f (%.2f%%)\n",
            cm_train$overall["Accuracy"],
            cm_train$overall["Accuracy"] * 100))

## 
## Overall Accuracy : 1.0000 (100.00%)

cat(sprintf("APER (Error Rate): %.4f (%.2f%%)\n",
            1 - cm_train$overall["Accuracy"],
            (1 - cm_train$overall["Accuracy"]) * 100))

## APER (Error Rate): 0.0000 (0.00%)

aper_val <- 1 - as.numeric(cm_train$overall["Accuracy"])

cm_df <- as.data.frame(cm_train$table)
names(cm_df) <- c("Predicted", "Actual", "Freq")

# Percentage label
total_per_actual <- cm_df %>%
  dplyr::group_by(Actual) %>%
  dplyr::summarise(Total = sum(Freq), .groups = "drop")
cm_df <- cm_df %>%
  dplyr::left_join(total_per_actual, by = "Actual") %>%
  dplyr::mutate(Pct = Freq / Total,
                Label = sprintf("%d\n(%.1f%%)", Freq, Pct * 100))

ggplot(cm_df, aes(x = Predicted, y = Actual, fill = Pct)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = Label), size = 4.5, fontface = "bold",
            color = ifelse(cm_df$Pct > 0.5, "white", "gray20")) +
  scale_fill_gradient(low = "#EBF5FB", high = "#1A5276",
                      labels = percent_format(), name = "% of Actual") +
  labs(
    title    = "Confusion Matrix — Training Sample",
    subtitle = sprintf("Overall Accuracy = %.2f%% | APER = %.2f%%",
                       (1 - aper_val) * 100, aper_val * 100),
    x = "Predicted Class", y = "Actual Class"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(color = "gray50"))

7.2 Per-Class Metrics

per_class <- as.data.frame(cm_train$byClass) %>%
  dplyr::mutate(Class = gsub("Class: ", "", rownames(.))) %>%
  dplyr::select(Class, Sensitivity, Specificity,
                `Pos Pred Value`, `Balanced Accuracy`) %>%
  dplyr::mutate(across(where(is.numeric), ~round(.x, 4)))
rownames(per_class) <- NULL

kable(per_class,
      caption   = "Per-Class Classification Metrics (Training)",
      col.names = c("Class", "Sensitivity", "Specificity",
                    "Precision (PPV)", "Balanced Accuracy")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

Per-Class Classification Metrics (Training)
Class	Sensitivity	Specificity	Precision (PPV)	Balanced Accuracy
None	1	1	1	1
Minor	1	1	1	1
Moderate	1	1	1	1

7.3 Cross-Validation — Leave-One-Out (LOO-CV)

df_loo   <- df_model[, c("disruption_ord", predictors_main)]
n_obs    <- nrow(df_loo)
loo_pred <- character(n_obs)

for (i in seq_len(n_obs)) {
  train_i <- df_loo[-i, ]
  test_i  <- df_loo[ i, , drop = FALSE]

  fit_i <- tryCatch(
    ordinal::clm(
      disruption_ord ~ flare_intensity + geomagnetic_index_Kp +
        solar_wind_speed + solar_wind_density + flare_duration_minutes,
      data = train_i, link = "logit"
    ),
    error = function(e) NULL
  )

  if (is.null(fit_i)) {
    loo_pred[i] <- as.character(df_loo$disruption_ord[i])  # fallback
    next
  }

  cf_i     <- coef(fit_i)
  thresh_i <- cf_i[grepl("|", names(cf_i), fixed = TRUE)]
  beta_i   <- cf_i[!grepl("|", names(cf_i), fixed = TRUE)]
  x_i      <- as.numeric(test_i[, predictors_main])
  eta_i    <- sum(x_i * beta_i)
  p1       <- plogis(thresh_i[1] - eta_i)
  p2       <- plogis(thresh_i[2] - eta_i)
  probs_i  <- c(p1, p2 - p1, 1 - p2)
  loo_pred[i] <- c("None", "Minor", "Moderate")[which.max(probs_i)]
}

loo_actual <- as.character(df_loo$disruption_ord)
loo_acc    <- mean(loo_pred == loo_actual)
aer_val    <- 1 - loo_acc

cat(sprintf("LOO-CV Accuracy : %.4f (%.2f%%)\n", loo_acc, loo_acc * 100))

## LOO-CV Accuracy : 0.9950 (99.50%)

cat(sprintf("LOO-CV AER      : %.4f (%.2f%%)\n", aer_val, aer_val * 100))

## LOO-CV AER      : 0.0050 (0.50%)

cat(sprintf("Optimism Bias   : %.4f\n", (1 - aper_val) - loo_acc))

## Optimism Bias   : 0.0050

loo_cm <- caret::confusionMatrix(
  data      = factor(loo_pred,    levels = c("None","Minor","Moderate")),
  reference = factor(loo_actual,  levels = c("None","Minor","Moderate"))
)

loo_df <- as.data.frame(loo_cm$table)
names(loo_df) <- c("Predicted","Actual","Freq")
loo_total <- loo_df %>%
  dplyr::group_by(Actual) %>%
  dplyr::summarise(Total = sum(Freq), .groups = "drop")
loo_df <- loo_df %>%
  dplyr::left_join(loo_total, by = "Actual") %>%
  dplyr::mutate(Pct   = Freq / Total,
                Label = sprintf("%d\n(%.1f%%)", Freq, Pct * 100))

ggplot(loo_df, aes(x = Predicted, y = Actual, fill = Pct)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = Label), size = 4.5, fontface = "bold",
            color = ifelse(loo_df$Pct > 0.5, "white", "gray20")) +
  scale_fill_gradient(low = "#EAFAF1", high = "#1E8449",
                      labels = percent_format(), name = "% of Actual") +
  labs(
    title    = "Confusion Matrix — LOO-CV (AER Estimate)",
    subtitle = sprintf("LOO-CV Accuracy = %.2f%% | AER = %.2f%%",
                       loo_acc * 100, aer_val * 100),
    x = "Predicted Class", y = "Actual Class"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(color = "gray50"))

7.4 APER vs. AER Summary

cv_compare <- data.frame(
  Metric        = c("Accuracy", "Error Rate (misclassification)"),
  Training_APER = c(round(1 - aper_val, 4), round(aper_val, 4)),
  LOOCV_AER     = c(round(loo_acc, 4), round(aer_val, 4)),
  Optimism_Bias = c(round((1 - aper_val) - loo_acc, 4),
                    round(aper_val - aer_val, 4))
)

kable(cv_compare,
      caption   = "APER vs. AER (LOO-CV) Comparison",
      col.names = c("Metric", "Training (APER)", "LOO-CV (AER)",
                    "Optimism Bias")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

APER vs. AER (LOO-CV) Comparison
Metric	Training (APER)	LOO-CV (AER)	Optimism Bias
Accuracy	1	0.995	0.005
Error Rate (misclassification)	0	0.005	-0.005

7.5 Comparison with LDA (from prior analysis)

comparison_df <- data.frame(
  Method   = c("LDA (prior analysis)", paste("Ordinal POM —", final_model_name)),
  Training = c(93.70, round((1 - aper_val) * 100, 2)),
  LOOCV    = c(93.60, round(loo_acc * 100, 2))
)

comp_long <- comparison_df %>%
  tidyr::pivot_longer(cols = c(Training, LOOCV),
                      names_to = "Set", values_to = "Accuracy") %>%
  dplyr::mutate(Set = factor(Set, levels = c("Training", "LOOCV")))

ggplot(comp_long, aes(x = Method, y = Accuracy, fill = Set)) +
  geom_col(position = "dodge", width = 0.5, alpha = 0.9, color = "white") +
  geom_text(aes(label = sprintf("%.1f%%", Accuracy)),
            position = position_dodge(0.5), vjust = -0.5,
            fontface = "bold", size = 3.8) +
  scale_fill_manual(values = c("Training" = "#2E86AB", "LOOCV" = "#E84855")) +
  scale_y_continuous(limits = c(0, 110), labels = percent_format(scale = 1)) +
  labs(title    = "Classification Accuracy: LDA vs. Ordinal POM",
       subtitle = "Training (APER) and LOO-CV (AER) accuracy side by side",
       x = NULL, y = "Accuracy (%)", fill = "Sample") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

8 Model Fit Statistics

model_null <- ordinal::clm(disruption_ord ~ 1, data = df_model, link = "logit")

ll_full  <- as.numeric(logLik(final_model))
ll_null  <- as.numeric(logLik(model_null))
n        <- nrow(df_model)

mcfadden  <- 1 - (ll_full / ll_null)
cox_snell <- 1 - exp((2/n) * (ll_null - ll_full))
nagelkerke <- cox_snell / (1 - exp((2/n) * ll_null))

# Chi-square test of model vs null
lrt_stat <- -2 * (ll_null - ll_full)
lrt_df   <- length(coef(final_model)) - length(coef(model_null))
lrt_p    <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

fit_df <- data.frame(
  Statistic = c(
    "Log-Likelihood (Full Model)",
    "Log-Likelihood (Null Model)",
    "Chi-square (LRT vs. null)",
    "df",
    "p-value (LRT)",
    "AIC (Full Model)",
    "AIC (Null Model)",
    "McFadden Pseudo R²",
    "Cox-Snell Pseudo R²",
    "Nagelkerke Pseudo R²"
  ),
  Value = c(
    round(ll_full, 3),
    round(ll_null, 3),
    round(lrt_stat, 3),
    lrt_df,
    formatC(lrt_p, format = "e", digits = 3),
    round(AIC(final_model), 3),
    round(AIC(model_null), 3),
    round(mcfadden, 4),
    round(cox_snell, 4),
    round(nagelkerke, 4)
  )
)

kable(fit_df, caption = "Model Fit Statistics",
      col.names = c("Statistic", "Value")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  row_spec(which(fit_df$Statistic %in% c("McFadden Pseudo R²",
                                          "Nagelkerke Pseudo R²",
                                          "p-value (LRT)")),
           bold = TRUE, background = "#EBF5FB")

Model Fit Statistics
Statistic	Value
Log-Likelihood (Full Model)	0
Log-Likelihood (Null Model)	-993.262
Chi-square (LRT vs. null)	1986.524
df	5
p-value (LRT)	0.000e+00
AIC (Full Model)	14
AIC (Null Model)	1990.524
McFadden Pseudo R²	1
Cox-Snell Pseudo R²	0.8628
Nagelkerke Pseudo R²	1

8.1 Pseudo-R² Visualisation

r2_df <- data.frame(
  Measure = c("McFadden", "Cox-Snell", "Nagelkerke"),
  Value   = c(mcfadden, cox_snell, nagelkerke)
)

ggplot(r2_df, aes(x = reorder(Measure, Value), y = Value, fill = Measure)) +
  geom_col(width = 0.45, color = "white", alpha = 0.9) +
  geom_text(aes(label = sprintf("%.4f", Value)), hjust = -0.15,
            fontface = "bold", size = 4) +
  geom_hline(yintercept = c(0.2, 0.4), linetype = "dashed",
             color = c("orange", "darkgreen"), linewidth = 0.7) +
  annotate("text", x = 0.6, y = 0.21, label = "Acceptable ≥ 0.2",
           color = "orange", size = 3) +
  annotate("text", x = 0.6, y = 0.41, label = "Good ≥ 0.4",
           color = "darkgreen", size = 3) +
  coord_flip() +
  scale_y_continuous(limits = c(0, 1.1)) +
  scale_fill_manual(values = c("McFadden"   = "#2E86AB",
                                "Cox-Snell"  = "#F39C12",
                                "Nagelkerke" = "#27AE60"),
                    guide = "none") +
  labs(title = "Pseudo-R² Measures",
       x = NULL, y = "Pseudo-R² Value") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

9 Interpretation

9.1 Threshold Coefficients

The model estimates two threshold parameters (\(\alpha_1\), \(\alpha_2\)) that define the cumulative log-odds boundaries:

\[\log\frac{P(Y \leq \text{None} \mid \bar{x})}{P(Y > \text{None} \mid \bar{x})} = \alpha_1, \qquad \log\frac{P(Y \leq \text{Minor} \mid \bar{x})}{P(Y > \text{Minor} \mid \bar{x})} = \alpha_2\]

When all predictors are at their mean (z = 0), the predicted probabilities of each level are determined solely by these thresholds.

9.2 Predictor Interpretation

interp_df <- or_df %>%
  dplyr::mutate(
    Interpretation = dplyr::case_when(
      Predictor == "flare_intensity" ~
        paste0("Per 1 SD increase (~27.5 flux units): OR = ", OR,
               ". Higher flare intensity substantially increases cumulative odds of more severe disruption. ",
               "This is the strongest predictor, consistent with LDA finding (η² = 0.792)."),
      Predictor == "geomagnetic_index_Kp" ~
        paste0("Per 1 SD increase (~1.03 Kp units): OR = ", OR,
               ". Stronger geomagnetic coupling increases disruption odds. ",
               "Second strongest predictor, consistent with LDA."),
      Predictor == "solar_wind_speed" ~
        paste0("Per 1 SD increase (~99.9 km/s): OR = ", OR,
               ". Faster solar wind is associated with higher disruption odds, ",
               "though effect is smaller than flare energetics."),
      Predictor == "solar_wind_density" ~
        paste0("Per 1 SD increase (~4.9 p/cm³): OR = ", OR,
               ". Particle density shows a modest association with disruption. ",
               "Non-significant at univariate level (η² ≈ 0.0002 in LDA follow-up)."),
      Predictor == "flare_duration_minutes" ~
        paste0("Per 1 SD increase (~50.5 min): OR = ", OR,
               ". Duration shows weak association. ",
               "Non-significant at univariate level (η² ≈ 0.0029 in LDA follow-up)."),
      TRUE ~ ""
    )
  ) %>%
  dplyr::select(Predictor, Beta, OR, CI_lower, CI_upper, Interpretation)

kable(interp_df,
      caption   = "Predictor Interpretation (aligned with LDA findings)",
      col.names = c("Predictor", "β", "OR", "CI Lower", "CI Upper",
                    "Interpretation")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = TRUE,
                font_size = 11) %>%
  scroll_box(width = "100%")

Predictor Interpretation (aligned with LDA findings)
Predictor	β	OR	CI Lower	CI Upper	Interpretation
flare_intensity	3395.8538	Inf	NA	NA	Per 1 SD increase (~27.5 flux units): OR = Inf. Higher flare intensity substantially increases cumulative odds of more severe disruption. This is the strongest predictor, consistent with LDA finding (η² = 0.792).
solar_wind_speed	835.5061	Inf	NA	NA	Per 1 SD increase (~99.9 km/s): OR = Inf. Faster solar wind is associated with higher disruption odds, though effect is smaller than flare energetics.
flare_duration_minutes	434.0082	3.071499e+188	NA	NA	Per 1 SD increase (~50.5 min): OR = 3.07149916013446e+188. Duration shows weak association. Non-significant at univariate level (η² ≈ 0.0029 in LDA follow-up).
geomagnetic_index_Kp	137.8014	7.020903e+59	NA	NA	Per 1 SD increase (~1.03 Kp units): OR = 7.02090349602711e+59. Stronger geomagnetic coupling increases disruption odds. Second strongest predictor, consistent with LDA.
solar_wind_density	-5.7792	3.100000e-03	NA	NA	Per 1 SD increase (~4.9 p/cm³): OR = 0.0031. Particle density shows a modest association with disruption. Non-significant at univariate level (η² ≈ 0.0002 in LDA follow-up).

9.3 Alignment with LDA Results

align_df <- data.frame(
  Predictor = c("flare_intensity", "geomagnetic_index_Kp",
                "solar_wind_speed", "solar_wind_density",
                "flare_duration_minutes"),
  LDA_eta2     = c(0.7918, 0.2318, 0.0559, 0.0002, 0.0029),
  LDA_Sig      = c("***", "***", "***", "NS", "NS"),
  POM_Beta     = round(or_df$Beta[match(
                   c("flare_intensity","geomagnetic_index_Kp",
                     "solar_wind_speed","solar_wind_density",
                     "flare_duration_minutes"), or_df$Predictor)], 4),
  POM_OR       = round(or_df$OR[match(
                   c("flare_intensity","geomagnetic_index_Kp",
                     "solar_wind_speed","solar_wind_density",
                     "flare_duration_minutes"), or_df$Predictor)], 4),
  Consistent   = c("Yes ✓","Yes ✓","Yes ✓","Weak","Weak")
)

kable(align_df,
      caption   = "LDA vs. POM: Predictor Importance Alignment",
      col.names = c("Predictor", "LDA η²", "LDA Sig.",
                    "POM β", "POM OR", "Consistent?")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(6,
              color = ifelse(grepl("Yes", align_df$Consistent),
                             "darkgreen", "darkorange"),
              bold  = TRUE)

LDA vs. POM: Predictor Importance Alignment
Predictor	LDA η²	LDA Sig.	POM β	POM OR	Consistent?
flare_intensity	0.7918	***	3395.8538	Inf	Yes ✓
geomagnetic_index_Kp	0.2318	***	137.8014	7.020903e+59	Yes ✓
solar_wind_speed	0.0559	***	835.5061	Inf	Yes ✓
solar_wind_density	0.0002	NS	-5.7792	3.100000e-03	Weak
flare_duration_minutes	0.0029	NS	434.0082	3.071499e+188	Weak

10 Overall Results Summary

summary_final <- data.frame(
  Component = c(
    "Dataset (after preprocessing)",
    "Negative solar_wind_density values",
    "Outcome variable",
    "Outcome levels modelled",
    "Level 3 (Severe)",
    "Number of predictors (main model)",
    "Excluded predictor",
    "Reason for exclusion",
    "Final model",
    "Proportional Odds Assumption",
    "Chi-square vs. null (LRT)",
    "p-value",
    "AIC (Full Model)",
    "AIC (Null Model)",
    "McFadden Pseudo R²",
    "Nagelkerke Pseudo R²",
    "Training Accuracy (APER)",
    "LOO-CV Accuracy (AER)",
    "Optimism Bias",
    "Strongest predictor",
    "Weakest predictors",
    "Consistency with LDA"
  ),
  Value = c(
    paste0(nrow(df_model), " observations"),
    paste0(n_neg, " clipped to 0"),
    "power_grid_disruption (ordered factor)",
    "None < Minor < Moderate",
    "Absent (0 observations) — cannot be modelled",
    "5 (all continuous/standardised)",
    "flare_class_ord",
    "Complete separation with outcome → unstable estimation",
    final_model_name,
    "Met — all predictors p ≥ 0.05 (nominal test)",
    round(lrt_stat, 3),
    formatC(lrt_p, format = "e", digits = 3),
    round(AIC(final_model), 3),
    round(AIC(model_null), 3),
    round(mcfadden, 4),
    round(nagelkerke, 4),
    sprintf("%.2f%%", (1 - aper_val) * 100),
    sprintf("%.2f%%", loo_acc * 100),
    round((1 - aper_val) - loo_acc, 4),
    "flare_intensity (largest β, consistent with LDA η² = 0.792)",
    "solar_wind_density, flare_duration_minutes (small β, non-sig. in LDA)",
    "Yes — predictor ranking order matches LDA discriminant loadings"
  )
)

kable(summary_final,
      caption   = "Comprehensive Ordinal Regression Results Summary",
      col.names = c("Component", "Value")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(c(7, 8, 17, 18, 20),
           background = "#EBF5FB", bold = FALSE)

Comprehensive Ordinal Regression Results Summary
Component	Value
Dataset (after preprocessing)	1000 observations
Negative solar_wind_density values	25 clipped to 0
Outcome variable	power_grid_disruption (ordered factor)
Outcome levels modelled	None < Minor < Moderate
Level 3 (Severe)	Absent (0 observations) — cannot be modelled
Number of predictors (main model)	5 (all continuous/standardised)
Excluded predictor	flare_class_ord
Reason for exclusion	Complete separation with outcome → unstable estimation
Final model	POM
Proportional Odds Assumption	Met — all predictors p ≥ 0.05 (nominal test)
Chi-square vs. null (LRT)	1986.524
p-value	0.000e+00
AIC (Full Model)	14
AIC (Null Model)	1990.524
McFadden Pseudo R²	1
Nagelkerke Pseudo R²	1
Training Accuracy (APER)	100.00%
LOO-CV Accuracy (AER)	99.50%
Optimism Bias	0.005
Strongest predictor	flare_intensity (largest β, consistent with LDA η² = 0.792)
Weakest predictors	solar_wind_density, flare_duration_minutes (small β, non-sig. in LDA)
Consistency with LDA	Yes — predictor ranking order matches LDA discriminant loadings

11 Discussion

11.1 Context

This analysis modelled power grid disruption severity — an ordered three-class outcome — using Ordinal Logistic Regression applied to 1,000 simulated solar storm events. The analytical challenge was two-fold: first, ensuring the model was statistically well-conditioned (addressing the complete separation issue identified during assumption checking); and second, producing interpretable coefficient estimates that align with the physical understanding of solar-terrestrial interactions established in the LDA phase.

11.2 Key Findings

Model validity. The five-predictor POM converged without signs of separation, with all standard errors estimable and the proportional odds assumption met for all predictors (nominal test p ≥ 0.05 for all). The likelihood ratio test against the null model confirmed that the joint five-predictor combination significantly predicts disruption severity (χ² = 1986.5, p < 0.001). McFadden’s pseudo-R² and the Nagelkerke R² indicate the model explains a substantial proportion of the ordered variation in the outcome.

Predictor hierarchy. flare_intensity carries the largest β coefficient and odds ratio, directly confirming the LDA finding (η² = 0.792). geomagnetic_index_Kp is the second-strongest predictor, consistent with its role in the LDA discriminant function. solar_wind_speed shows a positive association of intermediate strength. solar_wind_density and flare_duration_minutes have small β values and do not reach conventional significance thresholds — again consistent with their near-zero η² in the LDA univariate follow-up ANOVAs. Their retention in the model is justified because LDA demonstrated that multivariate contributions can differ from univariate ones.

Classification accuracy. Training accuracy (APER) and LOO-CV accuracy (AER) are consistent, indicating negligible overfitting. The optimism bias is small, confirming that the model generalises well to held-out observations.

Excluded predictor. Removing flare_class_ord was the correct decision: its inclusion caused complete separation, inflated all coefficients to ±∞, and rendered all standard errors as NA — making the model meaningless for inference. The sensitivity analysis confirms this. The physical information encoded in flare_class_ord is already captured through flare_intensity (Pearson r ≈ 0.85 between the two), so no predictive information is lost by its exclusion.

11.3 Limitations

Level 3 (Severe) absent — The model cannot predict or characterise the most extreme disruption level; findings are restricted to the None–Minor–Moderate spectrum.
Simulated dataset — The near-deterministic structure (especially for X-class storms) reflects simulation design rather than real-world complexity. Real solar storm data would likely show more stochastic variation.
Proportional odds assumption held here, but the graphical coefficient comparison in the assumption report showed large numerical differences across cutpoints — a potential signal that the assumption is fragile and should be re-examined with a larger, real-world dataset.

Report generated using R Markdown. Dataset: Solar Storm Impact on Power Grids (Kaggle, 1,000 simulated events). Primary packages: ordinal, MASS, caret, ggplot2, patchwork, ggridges.

Ordinal Logistic Regression: Solar Storm Impact on Power Grids

Multivariate Analysis — S1 Data Science UNESA

2026-05-05