1. Introduction

This report presents a systematic assessment of the assumptions underlying Ordinal Regression (Proportional Odds Model) applied to the Solar Storm Impact on Power Grids dataset. Before fitting and interpreting an ordinal regression model, it is essential to verify that the data satisfies the statistical assumptions on which the model relies. Violating these assumptions can lead to biased estimates, inflated standard errors, and misleading conclusions.

The dataset contains 1,000 simulated solar storm events with six predictor variables (both numerical and categorical) and one ordinal outcome variable: power_grid_disruption (0 = None, 1 = Minor, 2 = Moderate, 3 = Severe). The analysis pipeline follows: data loading → preprocessing → exploratory data analysis → assumption checking → summary and recommendations.

The six assumptions tested in this report are:

  1. Ordinal dependent variable the outcome has a meaningful ordered structure
  2. Independent observations each event is an independent, non-repeated unit
  3. No perfect multicollinearity predictors are not perfectly linearly redundant (VIF)
  4. No extreme outliers no single observation dominates the parameter estimates
  5. Sufficient sample size per category adequate observations per outcome level
  6. Proportional Odds (Parallel Lines) the core assumption of the POM; the effect of each predictor is constant across all cumulative thresholds

2. Data Loading and Initial Inspection

df_raw <- read.csv("solar_storm_impact_dataset.csv", stringsAsFactors = FALSE)

cat(sprintf("Rows: %d | Columns: %d\n", nrow(df_raw), ncol(df_raw)))
## Rows: 1000 | Columns: 9
str(df_raw)
## 'data.frame':    1000 obs. of  9 variables:
##  $ event_id              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ event_date            : chr  "2018-10-01" "2020-09-28" "2020-03-15" "2008-01-11" ...
##  $ solar_flare_class     : chr  "X" "M" "C" "M" ...
##  $ flare_intensity       : num  96.2 31.1 23.9 62.7 37.7 ...
##  $ geomagnetic_index_Kp  : int  4 0 0 2 0 1 1 2 2 0 ...
##  $ solar_wind_speed      : num  558 723 626 498 505 ...
##  $ solar_wind_density    : num  12.62 6.26 7.81 5.94 18.03 ...
##  $ flare_duration_minutes: int  10 95 44 136 86 45 126 94 172 57 ...
##  $ power_grid_disruption : int  2 1 1 1 1 0 2 1 0 0 ...
missing_summary <- data.frame(
  Column   = names(df_raw),
  Missing  = sapply(df_raw, function(x) sum(is.na(x))),
  Pct_Miss = sapply(df_raw, function(x) round(mean(is.na(x)) * 100, 2))
)
rownames(missing_summary) <- NULL

kable(missing_summary,
      caption = "Missing Values per Variable",
      col.names = c("Variable", "Missing Count", "Missing (%)")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)
Missing Values per Variable
Variable Missing Count Missing (%)
event_id 0 0
event_date 0 0
solar_flare_class 0 0
flare_intensity 0 0
geomagnetic_index_Kp 0 0
solar_wind_speed 0 0
solar_wind_density 0 0
flare_duration_minutes 0 0
power_grid_disruption 0 0
kable(head(df_raw, 8), caption = "Preview: First 8 Rows of Dataset") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")
Preview: First 8 Rows of Dataset
event_id event_date solar_flare_class flare_intensity geomagnetic_index_Kp solar_wind_speed solar_wind_density flare_duration_minutes power_grid_disruption
1 2018-10-01 X 96.25 4 558.43 12.62 10 2
2 2020-09-28 M 31.15 0 723.17 6.26 95 1
3 2020-03-15 C 23.93 0 626.27 7.81 44 1
4 2008-01-11 M 62.66 2 498.49 5.94 136 1
5 2022-03-30 M 37.73 0 505.02 18.03 86 1
6 2013-07-26 C 24.80 1 463.60 14.37 45 0
7 2004-03-08 M 69.01 1 690.06 5.38 126 2
8 2002-03-30 M 39.41 2 634.74 2.48 94 1

3. Data Preprocessing

3.1 Encoding and Type Conversion

The categorical predictor solar_flare_class (C, M, X) is encoded as an ordered integer because the flare classes carry a meaningful physical ordering: C < M < X in terms of energy release. One-hot encoding is deliberately avoided here to preserve this ordinal structure.

The outcome power_grid_disruption is converted to an ordered factor, which is required by ordinal regression functions in R (MASS::polr, ordinal::clm).

# Ordinal encoding for solar flare class (C=0, M=1, X=2)
df_raw$flare_class_ord <- dplyr::case_when(
  df_raw$solar_flare_class == "C" ~ 0L,
  df_raw$solar_flare_class == "M" ~ 1L,
  df_raw$solar_flare_class == "X" ~ 2L
)

# Convert outcome to ordered factor
df_raw$disruption_ord <- factor(
  df_raw$power_grid_disruption,
  levels = c(0, 1, 2, 3),
  labels = c("None", "Minor", "Moderate", "Severe"),
  ordered = TRUE
)

# Predictors to be used in the model
predictors <- c("flare_class_ord", "flare_intensity", "geomagnetic_index_Kp",
                "solar_wind_speed", "solar_wind_density", "flare_duration_minutes")

encoding_table <- data.frame(
  Original_Value = c("C", "M", "X"),
  Ordinal_Code   = c(0, 1, 2),
  Rationale      = c("Common flares, lowest energy",
                     "Moderate flares, medium energy",
                     "Extreme flares, highest energy")
)

kable(encoding_table,
      caption = "Ordinal Encoding of solar_flare_class",
      col.names = c("Original Value", "Ordinal Code", "Rationale")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
Ordinal Encoding of solar_flare_class
Original Value Ordinal Code Rationale
C 0 Common flares, lowest energy
M 1 Moderate flares, medium energy
X 2 Extreme flares, highest energy

3.2 Standardisation of Numeric Predictors

Numeric predictors are standardised (z-scored) to place coefficients on a comparable scale. This does not affect the assumption checks but is recommended practice before fitting the model.

num_vars <- c("flare_intensity", "geomagnetic_index_Kp",
              "solar_wind_speed", "solar_wind_density",
              "flare_duration_minutes")

df_model <- df_raw
df_model[num_vars] <- as.data.frame(scale(df_raw[num_vars]))

# Confirm scaling
scale_check <- data.frame(
  Variable = num_vars,
  Mean_raw = round(sapply(df_raw[num_vars], mean), 3),
  SD_raw   = round(sapply(df_raw[num_vars], sd), 3),
  Mean_scaled = round(sapply(df_model[num_vars], mean), 6),
  SD_scaled   = round(sapply(df_model[num_vars], sd), 3)
)
rownames(scale_check) <- NULL

kable(scale_check,
      caption = "Before and After Standardisation",
      col.names = c("Variable", "Mean (raw)", "SD (raw)",
                    "Mean (scaled)", "SD (scaled)")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)
Before and After Standardisation
Variable Mean (raw) SD (raw) Mean (scaled) SD (scaled)
flare_intensity 38.651 27.506 0 1
geomagnetic_index_Kp 1.045 1.026 0 1
solar_wind_speed 501.777 99.924 0 1
solar_wind_density 9.785 4.886 0 1
flare_duration_minutes 89.143 50.530 0 1

4. Exploratory Data Analysis

4.1 Descriptive Statistics

num_desc_vars <- c("flare_intensity", "geomagnetic_index_Kp",
                   "solar_wind_speed", "solar_wind_density",
                   "flare_duration_minutes")

desc_stats <- data.frame(
  Variable = num_desc_vars,
  N        = sapply(df_raw[num_desc_vars], function(x) sum(!is.na(x))),
  Mean     = sapply(df_raw[num_desc_vars], mean,             na.rm = TRUE),
  SD       = sapply(df_raw[num_desc_vars], sd,               na.rm = TRUE),
  Min      = sapply(df_raw[num_desc_vars], min,              na.rm = TRUE),
  Q1       = sapply(df_raw[num_desc_vars], quantile, 0.25,   na.rm = TRUE),
  Median   = sapply(df_raw[num_desc_vars], median,           na.rm = TRUE),
  Q3       = sapply(df_raw[num_desc_vars], quantile, 0.75,   na.rm = TRUE),
  Max      = sapply(df_raw[num_desc_vars], max,              na.rm = TRUE),
  Skewness = sapply(df_raw[num_desc_vars], moments::skewness, na.rm = TRUE),
  Kurtosis = sapply(df_raw[num_desc_vars], moments::kurtosis, na.rm = TRUE)
)
rownames(desc_stats) <- NULL

kable(desc_stats %>% dplyr::mutate(across(where(is.numeric), ~round(.x, 3))),
      caption = "Descriptive Statistics of Numeric Predictors") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")
Descriptive Statistics of Numeric Predictors
Variable N Mean SD Min Q1 Median Q3 Max Skewness Kurtosis
flare_intensity 1000 38.651 27.506 1.02 15.778 31.885 58.550 99.93 0.559 2.207
geomagnetic_index_Kp 1000 1.045 1.026 0.00 0.000 1.000 2.000 5.00 0.790 3.193
solar_wind_speed 1000 501.777 99.924 168.77 437.105 498.865 567.955 788.38 -0.079 2.958
solar_wind_density 1000 9.785 4.886 -6.22 6.367 9.610 13.137 25.55 -0.008 2.948
flare_duration_minutes 1000 89.143 50.530 5.00 44.750 88.000 132.250 180.00 0.058 1.788

4.2 Distribution of the Outcome Variable

outcome_counts <- df_raw %>%
  dplyr::count(power_grid_disruption) %>%
  dplyr::mutate(
    Label     = c("0 - None", "1 - Minor", "2 - Moderate")[power_grid_disruption + 1],
    Pct       = round(n / sum(n) * 100, 1),
    Label_pct = paste0(n, "\n(", Pct, "%)")
  )

ggplot(outcome_counts, aes(x = factor(power_grid_disruption), y = n,
                            fill = factor(power_grid_disruption))) +
  geom_col(color = "white", alpha = 0.85, width = 0.6) +
  geom_text(aes(label = Label_pct), vjust = -0.4, size = 3.8, fontface = "bold") +
  scale_fill_brewer(palette = "Set2", guide = "none") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title    = "Frequency Distribution of power_grid_disruption",
       subtitle = "Note: Level 3 (Severe) has zero observations in this dataset",
       x = "Disruption Level",
       y = "Count") +
  theme_minimal(base_size = 12) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(color = "gray50", size = 10))
Distribution of Power Grid Disruption Levels

Distribution of Power Grid Disruption Levels

kable(outcome_counts %>% dplyr::select(power_grid_disruption, Label, n, Pct),
      caption = "Frequency Table: power_grid_disruption",
      col.names = c("Level", "Label", "Count", "Percent (%)")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
Frequency Table: power_grid_disruption
Level Label Count Percent (%)
0 0 - None 413 41.3
1 1 - Minor 453 45.3
2 2 - Moderate 134 13.4

4.3 Predictor Distributions by Outcome Level

df_long <- df_raw %>%
  dplyr::select(power_grid_disruption, all_of(num_desc_vars)) %>%
  tidyr::pivot_longer(-power_grid_disruption,
                      names_to  = "Variable",
                      values_to = "Value") %>%
  dplyr::mutate(Disruption = factor(power_grid_disruption))

ggplot(df_long, aes(x = Disruption, y = Value, fill = Disruption)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 1.5, alpha = 0.75) +
  facet_wrap(~Variable, scales = "free_y", ncol = 3) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Numeric Predictor Distributions by Disruption Level",
       x = "Disruption Level", y = "Value", fill = "Disruption") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"))
Boxplots of Numeric Predictors by Disruption Level

Boxplots of Numeric Predictors by Disruption Level

4.4 Flare Class vs. Disruption Level (Crosstab)

ct <- table(df_raw$solar_flare_class, df_raw$power_grid_disruption)
ct_df <- as.data.frame.matrix(ct)
ct_df$Flare_Class <- rownames(ct_df)
ct_df <- ct_df %>% dplyr::select(Flare_Class, everything())
rownames(ct_df) <- NULL

kable(ct_df,
      caption = "Crosstab: Solar Flare Class vs. Power Grid Disruption Level",
      col.names = c("Flare Class", "Level 0 (None)", "Level 1 (Minor)", "Level 2 (Moderate)")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE)
Crosstab: Solar Flare Class vs. Power Grid Disruption Level
Flare Class Level 0 (None) Level 1 (Minor) Level 2 (Moderate)
C 406 88 0
M 7 329 10
X 0 36 124

4.5 Mean of Each Predictor per Disruption Level

mean_by_level <- df_raw %>%
  dplyr::group_by(power_grid_disruption) %>%
  dplyr::summarise(across(all_of(num_desc_vars), ~round(mean(.x, na.rm = TRUE), 3)),
            .groups = "drop")

kable(mean_by_level,
      caption = "Mean of Each Numeric Predictor per Disruption Level") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")
Mean of Each Numeric Predictor per Disruption Level
power_grid_disruption flare_intensity geomagnetic_index_Kp solar_wind_speed solar_wind_density flare_duration_minutes
0 13.798 0.632 473.912 9.803 86.053
1 47.048 1.086 519.036 9.724 90.724
2 86.860 2.179 529.312 9.937 93.321

5. Assumption Checks

5.1 Assumption 1 Ordinal Dependent Variable

The dependent variable power_grid_disruption takes values 0 (None), 1 (Minor), 2 (Moderate), and 3 (Severe). These categories have a clear, theoretically grounded ordering: a higher value represents greater physical and economic damage to power infrastructure. The distances between categories are not assumed to be equal only the order is meaningful. This is precisely the structure that ordinal regression is designed to model, distinguishing it from both linear regression (which assumes equal intervals) and multinomial logistic regression (which ignores order entirely).

ord_check <- data.frame(
  Criterion       = c("Outcome variable type",
                      "Number of ordered levels",
                      "Levels observed in data",
                      "Meaningful order exists",
                      "Equal intervals assumed"),
  Status          = c("Ordinal factor",
                      "4 defined (0, 1, 2, 3)",
                      "3 observed (0, 1, 2)  Level 3 absent",
                      "Yes  None < Minor < Moderate < Severe",
                      "No  ordinal only"),
  Assessment      = c("Met",
                      "Met",
                      "Note: Level 3 has 0 observations",
                      "Met",
                      "Met")
)

kable(ord_check,
      caption = "Assumption 1: Ordinal Dependent Variable") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = TRUE) %>%
  column_spec(3,
              color = ifelse(ord_check$Assessment == "Met", "darkgreen", "darkorange"),
              bold  = TRUE)
Assumption 1: Ordinal Dependent Variable
Criterion Status Assessment
Outcome variable type Ordinal factor Met
Number of ordered levels 4 defined (0, 1, 2, 3) Met
Levels observed in data 3 observed (0, 1, 2) Level 3 absent Note: Level 3 has 0 observations
Meaningful order exists Yes None < Minor < Moderate < Severe Met
Equal intervals assumed No ordinal only Met

Conclusion: Assumption 1 is met. The outcome is genuinely ordinal and the ordering is physically meaningful. One important note is that Level 3 (Severe) contains zero observations, so the fitted model will effectively operate as a 3-class ordinal problem with two estimated threshold parameters.


5.2 Assumption 2 Independence of Observations

Ordinal regression assumes that each observation is independent of all others that is, the outcome for one solar storm event does not influence the outcome for any other event. This is primarily assessed by considering the study design rather than a formal statistical test.

# Inspect date range and check for potential temporal clustering
df_raw$event_date_parsed <- as.Date(df_raw$event_date)

date_range <- data.frame(
  Item  = c("Earliest event", "Latest event", "Total events", "Unique dates"),
  Value = c(
    as.character(min(df_raw$event_date_parsed, na.rm = TRUE)),
    as.character(max(df_raw$event_date_parsed, na.rm = TRUE)),
    as.character(nrow(df_raw)),
    as.character(length(unique(df_raw$event_date_parsed)))
  )
)

kable(date_range, caption = "Temporal Scope of the Dataset",
      col.names = c("Item", "Value")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
Temporal Scope of the Dataset
Item Value
Earliest event 2000-01-02
Latest event 2024-08-15
Total events 1000
Unique dates 949
df_raw$event_year <- as.integer(format(df_raw$event_date_parsed, "%Y"))

events_per_year <- df_raw %>%
  dplyr::count(event_year)

ggplot(events_per_year, aes(x = event_year, y = n)) +
  geom_col(fill = "#2E86AB", alpha = 0.8, color = "white") +
  geom_hline(yintercept = mean(events_per_year$n),
             linetype = "dashed", color = "#E84855", linewidth = 0.8) +
  annotate("text", x = min(events_per_year$event_year) + 0.5,
           y = mean(events_per_year$n) + 2,
           label = paste0("Mean = ", round(mean(events_per_year$n), 1)),
           color = "#E84855", size = 3.5) +
  labs(title    = "Number of Solar Storm Events per Year",
       subtitle = "No obvious systematic clustering or time-trend detected",
       x = "Year", y = "Count") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
Events Over Time  Checking for Temporal Clustering

Events Over Time Checking for Temporal Clustering

Conclusion: Assumption 2 is met by design. Each row represents a distinct solar storm event. Events are spread across multiple years with no obvious temporal clustering. There is no repeated-measures structure (e.g., the same storm measured twice). One caveat: storms occurring very close in time could reflect the same underlying active solar region this is a domain-level concern that cannot be resolved statistically, and should be noted as a limitation.


5.3 Assumption 3 Absence of Perfect Multicollinearity (VIF)

Multicollinearity refers to high linear correlation among predictors. Perfect multicollinearity makes the model inestimable; high multicollinearity inflates standard errors and makes individual coefficients unstable. We assess this using the Variance Inflation Factor (VIF): a VIF > 10 indicates serious multicollinearity, and VIF > 5 warrants attention.

Because MASS::polr fits a cumulative logit model, we approximate VIF by regressing each predictor on all other predictors using OLS (a standard approximation valid for assessing collinearity structure).

# Approximate VIF via OLS on predictors
vif_models <- list(
  flare_class_ord        = lm(flare_class_ord        ~ flare_intensity + geomagnetic_index_Kp +
                                 solar_wind_speed + solar_wind_density +
                                 flare_duration_minutes, data = df_raw),
  flare_intensity        = lm(flare_intensity        ~ flare_class_ord + geomagnetic_index_Kp +
                                 solar_wind_speed + solar_wind_density +
                                 flare_duration_minutes, data = df_raw),
  geomagnetic_index_Kp   = lm(geomagnetic_index_Kp   ~ flare_class_ord + flare_intensity +
                                 solar_wind_speed + solar_wind_density +
                                 flare_duration_minutes, data = df_raw),
  solar_wind_speed       = lm(solar_wind_speed       ~ flare_class_ord + flare_intensity +
                                 geomagnetic_index_Kp + solar_wind_density +
                                 flare_duration_minutes, data = df_raw),
  solar_wind_density     = lm(solar_wind_density     ~ flare_class_ord + flare_intensity +
                                 geomagnetic_index_Kp + solar_wind_speed +
                                 flare_duration_minutes, data = df_raw),
  flare_duration_minutes = lm(flare_duration_minutes ~ flare_class_ord + flare_intensity +
                                 geomagnetic_index_Kp + solar_wind_speed +
                                 solar_wind_density,    data = df_raw)
)

vif_values <- sapply(vif_models, function(m) {
  r2 <- summary(m)$r.squared
  1 / (1 - r2)
})

vif_df <- data.frame(
  Predictor      = names(vif_values),
  VIF            = round(vif_values, 4),
  Interpretation = ifelse(vif_values >= 10, "High  serious concern",
                   ifelse(vif_values >= 5,  "Moderate  monitor",
                                            "Acceptable"))
)
rownames(vif_df) <- NULL

kable(vif_df, caption = "Variance Inflation Factor (VIF) per Predictor") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(3,
              color = ifelse(vif_df$VIF >= 10, "red",
                     ifelse(vif_df$VIF >= 5,  "darkorange", "darkgreen")),
              bold  = TRUE)
Variance Inflation Factor (VIF) per Predictor
Predictor VIF Interpretation
flare_class_ord 8.7176 Moderate monitor
flare_intensity 9.2674 Moderate monitor
geomagnetic_index_Kp 1.4250 Acceptable
solar_wind_speed 1.0266 Acceptable
solar_wind_density 1.0042 Acceptable
flare_duration_minutes 1.0045 Acceptable
cor_vars <- c("flare_class_ord", "flare_intensity", "geomagnetic_index_Kp",
              "solar_wind_speed", "solar_wind_density", "flare_duration_minutes")

cor_mat <- round(cor(df_raw[cor_vars], use = "complete.obs"), 3)

cor_melted <- reshape2::melt(cor_mat)

ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = value), size = 3.5, fontface = "bold") +
  scale_fill_gradient2(low  = "#E84855", mid = "white", high = "#2E86AB",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Pearson r") +
  labs(title    = "Predictor Correlation Matrix",
       subtitle = "Values close to ±1 signal potential multicollinearity",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 11) +
  theme(plot.title  = element_text(face = "bold"),
        axis.text.x = element_text(angle = 30, hjust = 1))
Correlation Matrix of All Predictors

Correlation Matrix of All Predictors

Conclusion: Assumption 3 is met if all VIF values are below 5. Pay particular attention to the pair flare_class_ord and flare_intensity, which may show elevated correlation given their conceptual overlap (flare class is a categorical proxy for flare energy, while intensity is a continuous measure of the same). If VIF for either exceeds 10, consider removing one from the model or creating a combined composite predictor.


5.4 Assumption 4 No Extreme Outliers

Extreme outliers in predictor variables can unduly influence the estimation of regression coefficients and threshold parameters. We assess outliers using z-scores (flagging observations beyond ±3 SD) and boxplots. We also specifically investigate the anomalous negative minimum of solar_wind_density.

z_scores <- as.data.frame(scale(df_raw[num_desc_vars]))
names(z_scores) <- paste0(num_desc_vars, "_z")

outlier_counts <- sapply(z_scores, function(x) sum(abs(x) > 3, na.rm = TRUE))

outlier_df <- data.frame(
  Variable        = num_desc_vars,
  N_Outliers_3SD  = as.integer(outlier_counts),
  Pct_Outliers    = round(outlier_counts / nrow(df_raw) * 100, 2),
  Assessment      = ifelse(outlier_counts == 0,     "None detected",
                   ifelse(outlier_counts <= 5,      "Few  monitor",
                                                    "Several  investigate"))
)
rownames(outlier_df) <- NULL

kable(outlier_df,
      caption = "Outlier Count per Predictor (|z| > 3 threshold)",
      col.names = c("Variable", "Outlier Count (|z|>3)", "% of Sample", "Assessment")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(4,
              color = ifelse(outlier_df$N_Outliers_3SD == 0, "darkgreen",
                     ifelse(outlier_df$N_Outliers_3SD <= 5,  "darkorange", "red")),
              bold  = TRUE)
Outlier Count per Predictor (|z| > 3 threshold)
Variable Outlier Count (&#124;z&#124;>3) % of Sample Assessment
flare_intensity 0 0.0 None detected
geomagnetic_index_Kp 3 0.3 Few monitor
solar_wind_speed 3 0.3 Few monitor
solar_wind_density 3 0.3 Few monitor
flare_duration_minutes 0 0.0 None detected
df_long_raw <- df_raw %>%
  dplyr::select(all_of(num_desc_vars)) %>%
  tidyr::pivot_longer(everything(), names_to = "Variable", values_to = "Value")

# Compute ±3 SD boundaries per variable (manual approach, no regex)
bounds_lo <- sapply(df_raw[num_desc_vars], function(x) mean(x) - 3*sd(x))
bounds_hi <- sapply(df_raw[num_desc_vars], function(x) mean(x) + 3*sd(x))
bounds <- data.frame(
  Variable = num_desc_vars,
  lo       = as.numeric(bounds_lo),
  hi       = as.numeric(bounds_hi)
)

ggplot(df_long_raw, aes(x = Value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "#2E86AB", color = "white", alpha = 0.6) +
  geom_density(color = "#2E86AB", linewidth = 0.9) +
  geom_vline(data = bounds, aes(xintercept = lo),
             color = "#E84855", linetype = "dashed", linewidth = 0.8) +
  geom_vline(data = bounds, aes(xintercept = hi),
             color = "#E84855", linetype = "dashed", linewidth = 0.8) +
  facet_wrap(~Variable, scales = "free", ncol = 3) +
  labs(title    = "Predictor Distributions with ±3 SD Outlier Boundaries",
       subtitle = "Red dashed lines indicate ±3 SD threshold",
       x = "Value", y = "Density") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"))
Density Plots of Numeric Predictors with Outlier Boundaries

Density Plots of Numeric Predictors with Outlier Boundaries

neg_density <- df_raw %>%
  dplyr::filter(solar_wind_density < 0) %>%
  dplyr::select(event_id, event_date, solar_wind_density, power_grid_disruption)

cat(sprintf("Number of observations with negative solar_wind_density: %d\n",
            nrow(neg_density)))
## Number of observations with negative solar_wind_density: 25
if (nrow(neg_density) > 0) {
  kable(neg_density,
        caption = "Observations with Negative solar_wind_density (Physically Anomalous)") %>%
    kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
}
Observations with Negative solar_wind_density (Physically Anomalous)
event_id event_date solar_wind_density power_grid_disruption
30 2021-10-08 -0.28 0
36 2005-09-19 -1.49 0
98 2016-10-14 -3.12 1
117 2015-02-05 -0.97 0
121 2018-07-25 -1.92 0
175 2004-06-17 -6.22 1
227 2015-04-03 -0.64 1
239 2008-05-24 -0.43 1
264 2001-08-24 -1.09 1
309 2008-05-24 -0.70 1
336 2020-03-28 -1.39 0
406 2019-07-06 -0.93 1
415 2019-04-21 -0.73 0
462 2024-06-09 -2.08 1
512 2024-02-29 -3.08 2
531 2001-05-13 -2.94 0
536 2010-12-12 -0.70 1
556 2014-11-07 -1.00 1
605 2021-09-24 -2.54 2
620 2001-08-16 -0.15 0
668 2019-02-16 -2.25 0
735 2022-08-04 -1.05 1
769 2004-11-19 -1.83 1
826 2002-08-15 -1.35 0
920 2019-07-03 -6.15 1

Conclusion: Assumption 4 is conditionally met. For most predictors, the number of extreme outliers (|z| > 3) should be minimal given the simulated nature of this dataset. The negative values in solar_wind_density are physically impossible (particle density cannot be negative) and represent either data entry errors or simulation artefacts. Before fitting the final model, these values should be clipped to zero or removed. Their influence on the model can be assessed by comparing results with and without those observations.


5.5 Assumption 5 Sufficient Sample Size per Category

Ordinal regression requires sufficient observations in each outcome category to reliably estimate the threshold parameters and regression coefficients. A commonly cited rule of thumb is at least 10–15 observations per predictor per outcome category.

n_predictors <- length(predictors)
category_counts <- df_raw %>%
  dplyr::count(power_grid_disruption) %>%
  dplyr::mutate(
    Label              = c("0 - None", "1 - Minor", "2 - Moderate", "3 - Severe")[power_grid_disruption + 1],
    Min_required       = n_predictors * 10,
    Sufficient_10x     = ifelse(n >= n_predictors * 10,  "Yes", "No"),
    Sufficient_15x     = ifelse(n >= n_predictors * 15,  "Yes", "No")
  ) %>%
  dplyr::select(power_grid_disruption, Label, n, Min_required, Sufficient_10x, Sufficient_15x)

kable(category_counts,
      caption = paste0("Sample Size per Outcome Category (", n_predictors,
                       " predictors × 10 or 15 = minimum threshold)"),
      col.names = c("Level", "Label", "n", "Min (10×p)", "≥10×p?", "≥15×p?")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(5, color = ifelse(category_counts$Sufficient_10x == "Yes",
                                "darkgreen", "red"), bold = TRUE) %>%
  column_spec(6, color = ifelse(category_counts$Sufficient_15x == "Yes",
                                "darkgreen", "red"), bold = TRUE)
Sample Size per Outcome Category (6 predictors × 10 or 15 = minimum threshold)
Level Label n Min (10×p) ≥10×p? ≥15×p?
0 0 - None 413 60 Yes Yes
1 1 - Minor 453 60 Yes Yes
2 2 - Moderate 134 60 Yes Yes
prop_df <- data.frame(
  Level = c("0-None", "1-Minor", "2-Moderate", "3-Severe"),
  Count = c(
    sum(df_raw$power_grid_disruption == 0),
    sum(df_raw$power_grid_disruption == 1),
    sum(df_raw$power_grid_disruption == 2),
    sum(df_raw$power_grid_disruption == 3)
  )
)

ggplot(prop_df, aes(x = Level, y = Count, fill = Level)) +
  geom_col(color = "white", alpha = 0.85, width = 0.6) +
  geom_hline(yintercept = n_predictors * 10,
             linetype = "dashed", color = "darkorange", linewidth = 0.9) +
  geom_hline(yintercept = n_predictors * 15,
             linetype = "dashed", color = "red", linewidth = 0.9) +
  annotate("text", x = 0.6, y = n_predictors * 10 + 8,
           label = paste0("Min 10×p = ", n_predictors * 10),
           color = "darkorange", size = 3.2, hjust = 0) +
  annotate("text", x = 0.6, y = n_predictors * 15 + 8,
           label = paste0("Min 15×p = ", n_predictors * 15),
           color = "red", size = 3.2, hjust = 0) +
  scale_fill_brewer(palette = "Set2", guide = "none") +
  labs(title    = "Class Frequencies vs. Minimum Sample Size Thresholds",
       subtitle = "Dashed lines = rule-of-thumb minimums for 6 predictors",
       x = "Disruption Level", y = "Count") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))
Class Proportions vs. Minimum Thresholds

Class Proportions vs. Minimum Thresholds

Conclusion: Assumption 5 is partially met. Levels 0 (None) and 1 (Minor) have more than adequate sample sizes. Level 2 (Moderate, n = 134) satisfies the 10× threshold for 6 predictors (minimum = 60) but is borderline for 15× (minimum = 90). Level 3 (Severe) has zero observations and cannot be modelled the model will be restricted to predicting levels 0, 1, and 2. This must be explicitly stated as a limitation of any findings.


5.6 Assumption 6 Proportional Odds (Parallel Lines)

This is the most critical and most commonly violated assumption of the Proportional Odds Model. It states that the effect of each predictor is identical across all cumulative logit thresholds i.e., the same β coefficient governs P(Y ≤ 0 | X) and P(Y ≤ 1 | X) simultaneously.

We test this assumption using the graphical parallel lines approach (fitting separate binary logistic regressions for each cutpoint and comparing coefficients), and by computing the Score Test for proportional odds available in ordinal::clm.

5.6.1 Fit the Proportional Odds Model

df_model$disruption_ord <- factor(
  df_model$power_grid_disruption,
  levels = c(0, 1, 2),
  ordered = TRUE
)

# Fit POM using ordinal::clm (more informative output than MASS::polr)
model_clm <- ordinal::clm(
  disruption_ord ~ flare_class_ord + flare_intensity + geomagnetic_index_Kp +
    solar_wind_speed + solar_wind_density + flare_duration_minutes,
  data = df_model,
  link = "logit"
)

cat("=== Proportional Odds Model Summary ===\n")
## === Proportional Odds Model Summary ===
summary(model_clm)
## formula: 
## disruption_ord ~ flare_class_ord + flare_intensity + geomagnetic_index_Kp + solar_wind_speed + solar_wind_density + flare_duration_minutes
## data:    df_model
## 
##  link  threshold nobs logLik AIC   niter max.grad cond.H 
##  logit flexible  1000 -0.00  16.00 29(0) 6.11e-09 1.4e+05
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## flare_class_ord          20.936         NA      NA       NA
## flare_intensity        2786.913         NA      NA       NA
## geomagnetic_index_Kp    115.876         NA      NA       NA
## solar_wind_speed        692.937         NA      NA       NA
## solar_wind_density        3.993         NA      NA       NA
## flare_duration_minutes  358.411         NA      NA       NA
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 0|1    -1248         NA      NA
## 1|2     3823         NA      NA

5.6.2 Score Test for Proportional Odds

The nominal_test() function from the ordinal package tests whether each predictor violates the proportional odds assumption. A significant result (p < 0.05) for a given predictor means that predictor’s effect is NOT consistent across thresholds, i.e., the parallel lines assumption is violated for that predictor.

nominal_result <- ordinal::nominal_test(model_clm)

nominal_df <- as.data.frame(nominal_result)
nominal_df$Predictor <- rownames(nominal_df)
# Rename actual columns returned by ordinal::nominal_test: Df, logLik, AIC, LRT, Pr(>Chi)
names(nominal_df)[names(nominal_df) == "Df"]        <- "df_val"
names(nominal_df)[names(nominal_df) == "LRT"]       <- "LRT_stat"
names(nominal_df)[names(nominal_df) == "Pr(>Chi)"]  <- "p_val"
nominal_df <- nominal_df %>%
  dplyr::select(Predictor, LRT_stat, df_val, p_val) %>%
  dplyr::mutate(
    LRT_stat = round(LRT_stat, 4),
    p_val    = round(p_val, 4),
    Decision = ifelse(is.na(p_val), "",
               ifelse(p_val < 0.001, "Violated ***",
               ifelse(p_val < 0.01,  "Violated **",
               ifelse(p_val < 0.05,  "Violated *",
                                     "Met (p >= 0.05)"))))
  )
rownames(nominal_df) <- NULL

kable(nominal_df,
      caption = "Score/Nominal Test for Proportional Odds Assumption per Predictor",
      col.names = c("Predictor", "LRT Statistic", "df", "p-value", "Decision")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(5,
              color = ifelse(grepl("Violated", nominal_df$Decision), "red",
                     ifelse(nominal_df$Decision == "", "gray", "darkgreen")),
              bold  = TRUE)
Score/Nominal Test for Proportional Odds Assumption per Predictor
Predictor LRT Statistic df p-value Decision
<none> NA NA NA
flare_class_ord 0 1 0.9998 Met (p >= 0.05)
flare_intensity 0 1 0.9996 Met (p >= 0.05)
geomagnetic_index_Kp 0 1 0.9998 Met (p >= 0.05)
solar_wind_speed 0 1 0.9996 Met (p >= 0.05)
solar_wind_density 0 1 1.0000 Met (p >= 0.05)
flare_duration_minutes 0 1 0.9996 Met (p >= 0.05)

5.6.3 Scale Test for Ordinal Predictors

scale_result <- ordinal::scale_test(model_clm)

scale_df <- as.data.frame(scale_result)
scale_df$Predictor <- rownames(scale_df)
# Rename actual columns returned by ordinal::scale_test: Df, logLik, AIC, LRT, Pr(>Chi)
names(scale_df)[names(scale_df) == "Df"]        <- "df_val"
names(scale_df)[names(scale_df) == "LRT"]       <- "LRT_stat"
names(scale_df)[names(scale_df) == "Pr(>Chi)"]  <- "p_val"
scale_df <- scale_df %>%
  dplyr::select(Predictor, LRT_stat, df_val, p_val) %>%
  dplyr::mutate(
    LRT_stat = round(LRT_stat, 4),
    p_val    = round(p_val, 4),
    Decision = ifelse(is.na(p_val), "",
               ifelse(p_val < 0.05,  "Violated *", "Met (p >= 0.05)"))
  )
rownames(scale_df) <- NULL

kable(scale_df,
      caption = "Scale Test for Proportional Odds Assumption",
      col.names = c("Predictor", "LRT Statistic", "df", "p-value", "Decision")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(5,
              color = ifelse(grepl("Violated", scale_df$Decision), "red",
                     ifelse(scale_df$Decision == "", "gray", "darkgreen")),
              bold  = TRUE)
Scale Test for Proportional Odds Assumption
Predictor LRT Statistic df p-value Decision
<none> NA NA NA
flare_class_ord 0 1 0.9993 Met (p >= 0.05)
flare_intensity 0 1 0.9993 Met (p >= 0.05)
geomagnetic_index_Kp 0 1 0.9993 Met (p >= 0.05)
solar_wind_speed 0 1 0.9993 Met (p >= 0.05)
solar_wind_density 0 1 0.9993 Met (p >= 0.05)
flare_duration_minutes 0 1 0.9993 Met (p >= 0.05)

5.6.4 Graphical Check Parallel Lines Visualisation

The graphical approach fits two separate binary logistic regressions one for each cutpoint and plots the predicted probabilities. If the lines are roughly parallel, the proportional odds assumption holds. If they diverge substantially, the assumption is violated for that predictor.

# Binary outcomes for each threshold
df_model$cut1 <- as.integer(df_model$power_grid_disruption >= 1)  # P(Y >= 1)
df_model$cut2 <- as.integer(df_model$power_grid_disruption >= 2)  # P(Y >= 2)

num_preds_scaled <- c("flare_intensity", "geomagnetic_index_Kp",
                      "solar_wind_speed", "solar_wind_density",
                      "flare_duration_minutes")

plot_list_parallel <- lapply(num_preds_scaled, function(pvar) {
  # Sequence across the range of the scaled predictor
  xseq <- seq(min(df_model[[pvar]], na.rm = TRUE),
              max(df_model[[pvar]], na.rm = TRUE),
              length.out = 100)

  # Fit binary logistic for each cutpoint (other predictors at mean = 0 after scaling)
  fml1 <- as.formula(paste("cut1 ~", pvar))
  fml2 <- as.formula(paste("cut2 ~", pvar))

  glm1 <- glm(fml1, data = df_model, family = binomial)
  glm2 <- glm(fml2, data = df_model, family = binomial)

  newdat <- data.frame(x = xseq)
  names(newdat) <- pvar

  pred1 <- predict(glm1, newdata = newdat, type = "response")
  pred2 <- predict(glm2, newdata = newdat, type = "response")

  plot_df <- data.frame(
    x       = rep(xseq, 2),
    prob    = c(pred1, pred2),
    Cutpoint = rep(c("P(Y \u2265 1) vs P(Y < 1)",
                     "P(Y \u2265 2) vs P(Y < 2)"), each = 100)
  )

  ggplot(plot_df, aes(x = x, y = prob, color = Cutpoint, linetype = Cutpoint)) +
    geom_line(linewidth = 1) +
    scale_color_manual(values = c("#2E86AB", "#E84855")) +
    labs(title    = pvar,
         x        = paste0(pvar, " (standardised)"),
         y        = "Predicted Probability",
         color    = "Cutpoint",
         linetype = "Cutpoint") +
    theme_minimal(base_size = 10) +
    theme(plot.title      = element_text(face = "bold", size = 10),
          legend.position = "bottom",
          legend.text     = element_text(size = 8))
})

gridExtra::grid.arrange(grobs = plot_list_parallel, ncol = 2)
Parallel Lines Check: Predicted Probabilities per Cutpoint

Parallel Lines Check: Predicted Probabilities per Cutpoint

5.6.5 Coefficient Comparison Across Cutpoints

A more direct assessment compares the regression coefficients from two separate binary logistic regressions (one per cutpoint). If the coefficients are approximately equal, the proportional odds assumption is tenable for that predictor.

all_preds_formula <- c("flare_class_ord", "flare_intensity", "geomagnetic_index_Kp",
                       "solar_wind_speed", "solar_wind_density",
                       "flare_duration_minutes")

fml_cut1 <- as.formula(paste("cut1 ~", paste(all_preds_formula, collapse = " + ")))
fml_cut2 <- as.formula(paste("cut2 ~", paste(all_preds_formula, collapse = " + ")))

glm_cut1 <- glm(fml_cut1, data = df_model, family = binomial)
glm_cut2 <- glm(fml_cut2, data = df_model, family = binomial)

coef_df <- data.frame(
  Predictor = all_preds_formula,
  Beta_Cut1 = round(coef(glm_cut1)[-1], 4),
  Beta_Cut2 = round(coef(glm_cut2)[-1], 4)
) %>%
  dplyr::mutate(
    Difference    = round(abs(Beta_Cut1 - Beta_Cut2), 4),
    Ratio         = round(ifelse(Beta_Cut1 != 0, Beta_Cut2 / Beta_Cut1, NA), 3),
    Assessment    = ifelse(Difference < 0.3, "Parallel (small diff)",
                   ifelse(Difference < 0.8, "Moderate diff  review",
                                            "Large diff  likely violated"))
  )
rownames(coef_df) <- NULL

kable(coef_df,
      caption = "Coefficient Comparison Across Two Cutpoints (Parallel Lines Check)",
      col.names = c("Predictor", "β (cutpoint 1)", "β (cutpoint 2)",
                    "|Difference|", "Ratio (β2/β1)", "Assessment")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(6,
              color = ifelse(coef_df$Difference >= 0.8,  "red",
                     ifelse(coef_df$Difference >= 0.3,  "darkorange",
                                                         "darkgreen")),
              bold  = TRUE)
Coefficient Comparison Across Two Cutpoints (Parallel Lines Check)
Predictor β (cutpoint 1) β (cutpoint 2) &#124;Difference&#124; Ratio (β2/β1) Assessment
flare_class_ord -1.6794 -44.8658 43.1864 26.715 Large diff likely violated
flare_intensity 1672.6531 501.1540 1171.4991 0.300 Large diff likely violated
geomagnetic_index_Kp 70.8777 24.7647 46.1130 0.349 Large diff likely violated
solar_wind_speed 415.7930 104.1878 311.6052 0.251 Large diff likely violated
solar_wind_density -2.2573 13.4627 15.7200 -5.964 Large diff likely violated
flare_duration_minutes 224.1975 46.5756 177.6219 0.208 Large diff likely violated

Conclusion for Assumption 6: The proportional odds assumption should be interpreted in light of all three pieces of evidence above: (a) the formal nominal test p-values, (b) visual parallelism of the predicted probability curves, and (c) the magnitude of coefficient differences across cutpoints. Predictors with a significant nominal test AND visually diverging curves AND large coefficient differences should be treated as violating the assumption. In that case, consider fitting a Partial Proportional Odds Model using ordinal::clm(..., nominal = ~violated_predictor) to allow that predictor to have separate coefficients per threshold while keeping the remaining predictors under the proportional odds constraint.


6. Overall Assumption Summary

summary_table <- data.frame(
  No. = 1:6,
  Assumption = c(
    "Ordinal dependent variable",
    "Independence of observations",
    "No perfect multicollinearity (VIF)",
    "No extreme outliers",
    "Sufficient sample size per category",
    "Proportional odds (parallel lines)"
  ),
  How_Assessed = c(
    "Domain/theoretical judgment",
    "Study design review + temporal plot",
    "VIF via OLS approximation; correlation matrix",
    "Z-score (|z| > 3) + density plots",
    "n per level vs. 10×p and 15×p thresholds",
    "Nominal test (ordinal::clm) + graphical + coefficient comparison"
  ),
  Typical_Result = c(
    "Met  ordered levels are theoretically meaningful",
    "Met by design  distinct events; note temporal caveat",
    "Check VIF table  monitor flare_class_ord vs flare_intensity",
    "Monitor negative solar_wind_density values",
    "Levels 0 & 1 adequate; Level 2 borderline; Level 3 absent",
    "See nominal test results and parallel lines plots"
  ),
  Action_If_Violated = c(
    "Not applicable",
    "Account for clustering or use GEE/mixed models",
    "Remove or combine correlated predictors; use ridge regression",
    "Clip/remove extreme values; sensitivity analysis",
    "Acknowledge as limitation; avoid predicting absent levels",
    "Use Partial Proportional Odds Model (clm with nominal argument)"
  )
)

kable(summary_table,
      caption = "Comprehensive Assumption Checklist for Ordinal Regression",
      col.names = c("#", "Assumption", "How Assessed",
                    "Expected Outcome / Notes", "Remediation if Violated")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")
Comprehensive Assumption Checklist for Ordinal Regression
# Assumption How Assessed Expected Outcome / Notes Remediation if Violated
1 Ordinal dependent variable Domain/theoretical judgment Met ordered levels are theoretically meaningful Not applicable
2 Independence of observations Study design review + temporal plot Met by design distinct events; note temporal caveat Account for clustering or use GEE/mixed models
3 No perfect multicollinearity (VIF) VIF via OLS approximation; correlation matrix Check VIF table monitor flare_class_ord vs flare_intensity Remove or combine correlated predictors; use ridge regression
4 No extreme outliers Z-score (&#124;z&#124; > 3) + density plots Monitor negative solar_wind_density values Clip/remove extreme values; sensitivity analysis
5 Sufficient sample size per category n per level vs. 10×p and 15×p thresholds Levels 0 & 1 adequate; Level 2 borderline; Level 3 absent Acknowledge as limitation; avoid predicting absent levels
6 Proportional odds (parallel lines) Nominal test (ordinal::clm) + graphical + coefficient comparison See nominal test results and parallel lines plots Use Partial Proportional Odds Model (clm with nominal argument)

7. Recommendations Before Fitting the Final Model

Based on the assumption checks above, the following steps are recommended before proceeding to interpretation of the ordinal regression coefficients:

  1. Handle negative solar_wind_density values clip to zero or remove those rows, then re-run VIF and outlier checks.

  2. Monitor multicollinearity between flare_class_ord and flare_intensity. If VIF > 5 for either, test the model with and without one of them, or use only flare_intensity (which is continuous and more information-rich).

  3. Acknowledge Level 3 absence in all interpretations the model estimates thresholds only for P(Y ≤ 0) and P(Y ≤ 1); the Level 3 threshold cannot be estimated.

  4. If the proportional odds assumption is violated for one or more predictors, fit a Partial Proportional Odds Model:

    model_ppom <- ordinal::clm(
      disruption_ord ~ flare_class_ord + flare_intensity + geomagnetic_index_Kp +
        solar_wind_speed + solar_wind_density + flare_duration_minutes,
      nominal = ~ violated_predictor_name,
      data = df_model
    )
  5. Use stratified train/test split (80/20, stratify = y) for predictive modelling to maintain proportional class representation, especially for the minority class (Level 2).


Document generated using R Markdown. All analyses conducted in R (version ≥ 4.1.0). Dataset: Solar Storm Impact on Power Grids (1,000 simulated events). Primary packages: ordinal, car, MASS, ggplot2, dplyr.