1. Introduction

This report presents a systematic assessment of the assumptions underlying Ordinal Regression (Proportional Odds Model) applied to the Solar Storm Impact on Power Grids dataset. Before fitting and interpreting an ordinal regression model, it is essential to verify that the data satisfies the statistical assumptions on which the model relies. Violating these assumptions can lead to biased estimates, inflated standard errors, and misleading conclusions.

The dataset contains 1,000 simulated solar storm events with six predictor variables (both numerical and categorical) and one ordinal outcome variable: power_grid_disruption (0 = None, 1 = Minor, 2 = Moderate, 3 = Severe). The analysis pipeline follows: data loading → preprocessing → exploratory data analysis → assumption checking → summary and recommendations.

The six assumptions tested in this report are:

Ordinal dependent variable the outcome has a meaningful ordered structure
Independent observations each event is an independent, non-repeated unit
No perfect multicollinearity predictors are not perfectly linearly redundant (VIF)
No extreme outliers no single observation dominates the parameter estimates
Sufficient sample size per category adequate observations per outcome level
Proportional Odds (Parallel Lines) the core assumption of the POM; the effect of each predictor is constant across all cumulative thresholds

2. Data Loading and Initial Inspection

df_raw <- read.csv("solar_storm_impact_dataset.csv", stringsAsFactors = FALSE)

cat(sprintf("Rows: %d | Columns: %d\n", nrow(df_raw), ncol(df_raw)))

## Rows: 1000 | Columns: 9

str(df_raw)

## 'data.frame':    1000 obs. of  9 variables:
##  $ event_id              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ event_date            : chr  "2018-10-01" "2020-09-28" "2020-03-15" "2008-01-11" ...
##  $ solar_flare_class     : chr  "X" "M" "C" "M" ...
##  $ flare_intensity       : num  96.2 31.1 23.9 62.7 37.7 ...
##  $ geomagnetic_index_Kp  : int  4 0 0 2 0 1 1 2 2 0 ...
##  $ solar_wind_speed      : num  558 723 626 498 505 ...
##  $ solar_wind_density    : num  12.62 6.26 7.81 5.94 18.03 ...
##  $ flare_duration_minutes: int  10 95 44 136 86 45 126 94 172 57 ...
##  $ power_grid_disruption : int  2 1 1 1 1 0 2 1 0 0 ...

missing_summary <- data.frame(
  Column   = names(df_raw),
  Missing  = sapply(df_raw, function(x) sum(is.na(x))),
  Pct_Miss = sapply(df_raw, function(x) round(mean(is.na(x)) * 100, 2))
)
rownames(missing_summary) <- NULL

kable(missing_summary,
      caption = "Missing Values per Variable",
      col.names = c("Variable", "Missing Count", "Missing (%)")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)

Missing Values per Variable
Variable	Missing Count	Missing (%)
event_id	0	0
event_date	0	0
solar_flare_class	0	0
flare_intensity	0	0
geomagnetic_index_Kp	0	0
solar_wind_speed	0	0
solar_wind_density	0	0
flare_duration_minutes	0	0
power_grid_disruption	0	0

kable(head(df_raw, 8), caption = "Preview: First 8 Rows of Dataset") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")

Preview: First 8 Rows of Dataset
event_id	event_date	solar_flare_class	flare_intensity	geomagnetic_index_Kp	solar_wind_speed	solar_wind_density	flare_duration_minutes	power_grid_disruption
1	2018-10-01	X	96.25	4	558.43	12.62	10	2
2	2020-09-28	M	31.15	0	723.17	6.26	95	1
3	2020-03-15	C	23.93	0	626.27	7.81	44	1
4	2008-01-11	M	62.66	2	498.49	5.94	136	1
5	2022-03-30	M	37.73	0	505.02	18.03	86	1
6	2013-07-26	C	24.80	1	463.60	14.37	45	0
7	2004-03-08	M	69.01	1	690.06	5.38	126	2
8	2002-03-30	M	39.41	2	634.74	2.48	94	1

3. Data Preprocessing

3.1 Encoding and Type Conversion

The categorical predictor solar_flare_class (C, M, X) is encoded as an ordered integer because the flare classes carry a meaningful physical ordering: C < M < X in terms of energy release. One-hot encoding is deliberately avoided here to preserve this ordinal structure.

The outcome power_grid_disruption is converted to an ordered factor, which is required by ordinal regression functions in R (MASS::polr, ordinal::clm).

# Ordinal encoding for solar flare class (C=0, M=1, X=2)
df_raw$flare_class_ord <- dplyr::case_when(
  df_raw$solar_flare_class == "C" ~ 0L,
  df_raw$solar_flare_class == "M" ~ 1L,
  df_raw$solar_flare_class == "X" ~ 2L
)

# Convert outcome to ordered factor
df_raw$disruption_ord <- factor(
  df_raw$power_grid_disruption,
  levels = c(0, 1, 2, 3),
  labels = c("None", "Minor", "Moderate", "Severe"),
  ordered = TRUE
)

# Predictors to be used in the model
predictors <- c("flare_class_ord", "flare_intensity", "geomagnetic_index_Kp",
                "solar_wind_speed", "solar_wind_density", "flare_duration_minutes")

encoding_table <- data.frame(
  Original_Value = c("C", "M", "X"),
  Ordinal_Code   = c(0, 1, 2),
  Rationale      = c("Common flares, lowest energy",
                     "Moderate flares, medium energy",
                     "Extreme flares, highest energy")
)

kable(encoding_table,
      caption = "Ordinal Encoding of solar_flare_class",
      col.names = c("Original Value", "Ordinal Code", "Rationale")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

Ordinal Encoding of solar_flare_class
Original Value	Ordinal Code	Rationale
C	0	Common flares, lowest energy
M	1	Moderate flares, medium energy
X	2	Extreme flares, highest energy

3.2 Standardisation of Numeric Predictors

Numeric predictors are standardised (z-scored) to place coefficients on a comparable scale. This does not affect the assumption checks but is recommended practice before fitting the model.

num_vars <- c("flare_intensity", "geomagnetic_index_Kp",
              "solar_wind_speed", "solar_wind_density",
              "flare_duration_minutes")

df_model <- df_raw
df_model[num_vars] <- as.data.frame(scale(df_raw[num_vars]))

# Confirm scaling
scale_check <- data.frame(
  Variable = num_vars,
  Mean_raw = round(sapply(df_raw[num_vars], mean), 3),
  SD_raw   = round(sapply(df_raw[num_vars], sd), 3),
  Mean_scaled = round(sapply(df_model[num_vars], mean), 6),
  SD_scaled   = round(sapply(df_model[num_vars], sd), 3)
)
rownames(scale_check) <- NULL

kable(scale_check,
      caption = "Before and After Standardisation",
      col.names = c("Variable", "Mean (raw)", "SD (raw)",
                    "Mean (scaled)", "SD (scaled)")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)

Before and After Standardisation
Variable	Mean (raw)	SD (raw)	SD (scaled)
flare_intensity	38.651	27.506	1
geomagnetic_index_Kp	1.045	1.026	1
solar_wind_speed	501.777	99.924	1
solar_wind_density	9.785	4.886	1
flare_duration_minutes	89.143	50.530	1

4. Exploratory Data Analysis

4.1 Descriptive Statistics

num_desc_vars <- c("flare_intensity", "geomagnetic_index_Kp",
                   "solar_wind_speed", "solar_wind_density",
                   "flare_duration_minutes")

desc_stats <- data.frame(
  Variable = num_desc_vars,
  N        = sapply(df_raw[num_desc_vars], function(x) sum(!is.na(x))),
  Mean     = sapply(df_raw[num_desc_vars], mean,             na.rm = TRUE),
  SD       = sapply(df_raw[num_desc_vars], sd,               na.rm = TRUE),
  Min      = sapply(df_raw[num_desc_vars], min,              na.rm = TRUE),
  Q1       = sapply(df_raw[num_desc_vars], quantile, 0.25,   na.rm = TRUE),
  Median   = sapply(df_raw[num_desc_vars], median,           na.rm = TRUE),
  Q3       = sapply(df_raw[num_desc_vars], quantile, 0.75,   na.rm = TRUE),
  Max      = sapply(df_raw[num_desc_vars], max,              na.rm = TRUE),
  Skewness = sapply(df_raw[num_desc_vars], moments::skewness, na.rm = TRUE),
  Kurtosis = sapply(df_raw[num_desc_vars], moments::kurtosis, na.rm = TRUE)
)
rownames(desc_stats) <- NULL

kable(desc_stats %>% dplyr::mutate(across(where(is.numeric), ~round(.x, 3))),
      caption = "Descriptive Statistics of Numeric Predictors") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")

Descriptive Statistics of Numeric Predictors
Variable	N	Mean	SD	Min	Q1	Median	Q3	Max	Skewness	Kurtosis
flare_intensity	1000	38.651	27.506	1.02	15.778	31.885	58.550	99.93	0.559	2.207
geomagnetic_index_Kp	1000	1.045	1.026	0.00	0.000	1.000	2.000	5.00	0.790	3.193
solar_wind_speed	1000	501.777	99.924	168.77	437.105	498.865	567.955	788.38	-0.079	2.958
solar_wind_density	1000	9.785	4.886	-6.22	6.367	9.610	13.137	25.55	-0.008	2.948
flare_duration_minutes	1000	89.143	50.530	5.00	44.750	88.000	132.250	180.00	0.058	1.788

4.2 Distribution of the Outcome Variable

outcome_counts <- df_raw %>%
  dplyr::count(power_grid_disruption) %>%
  dplyr::mutate(
    Label     = c("0 - None", "1 - Minor", "2 - Moderate")[power_grid_disruption + 1],
    Pct       = round(n / sum(n) * 100, 1),
    Label_pct = paste0(n, "\n(", Pct, "%)")
  )

ggplot(outcome_counts, aes(x = factor(power_grid_disruption), y = n,
                            fill = factor(power_grid_disruption))) +
  geom_col(color = "white", alpha = 0.85, width = 0.6) +
  geom_text(aes(label = Label_pct), vjust = -0.4, size = 3.8, fontface = "bold") +
  scale_fill_brewer(palette = "Set2", guide = "none") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title    = "Frequency Distribution of power_grid_disruption",
       subtitle = "Note: Level 3 (Severe) has zero observations in this dataset",
       x = "Disruption Level",
       y = "Count") +
  theme_minimal(base_size = 12) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(color = "gray50", size = 10))

Distribution of Power Grid Disruption Levels

kable(outcome_counts %>% dplyr::select(power_grid_disruption, Label, n, Pct),
      caption = "Frequency Table: power_grid_disruption",
      col.names = c("Level", "Label", "Count", "Percent (%)")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

Frequency Table: power_grid_disruption
Level	Label	Count	Percent (%)
0	0 - None	413	41.3
1	1 - Minor	453	45.3
2	2 - Moderate	134	13.4

4.3 Predictor Distributions by Outcome Level

df_long <- df_raw %>%
  dplyr::select(power_grid_disruption, all_of(num_desc_vars)) %>%
  tidyr::pivot_longer(-power_grid_disruption,
                      names_to  = "Variable",
                      values_to = "Value") %>%
  dplyr::mutate(Disruption = factor(power_grid_disruption))

ggplot(df_long, aes(x = Disruption, y = Value, fill = Disruption)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 1.5, alpha = 0.75) +
  facet_wrap(~Variable, scales = "free_y", ncol = 3) +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Numeric Predictor Distributions by Disruption Level",
       x = "Disruption Level", y = "Value", fill = "Disruption") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"))

Boxplots of Numeric Predictors by Disruption Level

4.4 Flare Class vs. Disruption Level (Crosstab)

ct <- table(df_raw$solar_flare_class, df_raw$power_grid_disruption)
ct_df <- as.data.frame.matrix(ct)
ct_df$Flare_Class <- rownames(ct_df)
ct_df <- ct_df %>% dplyr::select(Flare_Class, everything())
rownames(ct_df) <- NULL

kable(ct_df,
      caption = "Crosstab: Solar Flare Class vs. Power Grid Disruption Level",
      col.names = c("Flare Class", "Level 0 (None)", "Level 1 (Minor)", "Level 2 (Moderate)")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE)

Crosstab: Solar Flare Class vs. Power Grid Disruption Level
Flare Class	Level 0 (None)	Level 1 (Minor)	Level 2 (Moderate)
C	406	88	0
M	7	329	10
X	0	36	124

4.5 Mean of Each Predictor per Disruption Level

mean_by_level <- df_raw %>%
  dplyr::group_by(power_grid_disruption) %>%
  dplyr::summarise(across(all_of(num_desc_vars), ~round(mean(.x, na.rm = TRUE), 3)),
            .groups = "drop")

kable(mean_by_level,
      caption = "Mean of Each Numeric Predictor per Disruption Level") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")

Mean of Each Numeric Predictor per Disruption Level
power_grid_disruption	flare_intensity	geomagnetic_index_Kp	solar_wind_speed	solar_wind_density	flare_duration_minutes
0	13.798	0.632	473.912	9.803	86.053
1	47.048	1.086	519.036	9.724	90.724
2	86.860	2.179	529.312	9.937	93.321

5. Assumption Checks

5.1 Assumption 1 Ordinal Dependent Variable

The dependent variable power_grid_disruption takes values 0 (None), 1 (Minor), 2 (Moderate), and 3 (Severe). These categories have a clear, theoretically grounded ordering: a higher value represents greater physical and economic damage to power infrastructure. The distances between categories are not assumed to be equal only the order is meaningful. This is precisely the structure that ordinal regression is designed to model, distinguishing it from both linear regression (which assumes equal intervals) and multinomial logistic regression (which ignores order entirely).

ord_check <- data.frame(
  Criterion       = c("Outcome variable type",
                      "Number of ordered levels",
                      "Levels observed in data",
                      "Meaningful order exists",
                      "Equal intervals assumed"),
  Status          = c("Ordinal factor",
                      "4 defined (0, 1, 2, 3)",
                      "3 observed (0, 1, 2)  Level 3 absent",
                      "Yes  None < Minor < Moderate < Severe",
                      "No  ordinal only"),
  Assessment      = c("Met",
                      "Met",
                      "Note: Level 3 has 0 observations",
                      "Met",
                      "Met")
)

kable(ord_check,
      caption = "Assumption 1: Ordinal Dependent Variable") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = TRUE) %>%
  column_spec(3,
              color = ifelse(ord_check$Assessment == "Met", "darkgreen", "darkorange"),
              bold  = TRUE)

Assumption 1: Ordinal Dependent Variable
Criterion	Status	Assessment
Outcome variable type	Ordinal factor	Met
Number of ordered levels	4 defined (0, 1, 2, 3)	Met
Levels observed in data	3 observed (0, 1, 2) Level 3 absent	Note: Level 3 has 0 observations
Meaningful order exists	Yes None < Minor < Moderate < Severe	Met
Equal intervals assumed	No ordinal only	Met

Conclusion: Assumption 1 is met. The outcome is genuinely ordinal and the ordering is physically meaningful. One important note is that Level 3 (Severe) contains zero observations, so the fitted model will effectively operate as a 3-class ordinal problem with two estimated threshold parameters.

5.2 Assumption 2 Independence of Observations

Ordinal regression assumes that each observation is independent of all others that is, the outcome for one solar storm event does not influence the outcome for any other event. This is primarily assessed by considering the study design rather than a formal statistical test.

# Inspect date range and check for potential temporal clustering
df_raw$event_date_parsed <- as.Date(df_raw$event_date)

date_range <- data.frame(
  Item  = c("Earliest event", "Latest event", "Total events", "Unique dates"),
  Value = c(
    as.character(min(df_raw$event_date_parsed, na.rm = TRUE)),
    as.character(max(df_raw$event_date_parsed, na.rm = TRUE)),
    as.character(nrow(df_raw)),
    as.character(length(unique(df_raw$event_date_parsed)))
  )
)

kable(date_range, caption = "Temporal Scope of the Dataset",
      col.names = c("Item", "Value")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE)

Temporal Scope of the Dataset
Item	Value
Earliest event	2000-01-02
Latest event	2024-08-15
Total events	1000
Unique dates	949

df_raw$event_year <- as.integer(format(df_raw$event_date_parsed, "%Y"))

events_per_year <- df_raw %>%
  dplyr::count(event_year)

ggplot(events_per_year, aes(x = event_year, y = n)) +
  geom_col(fill = "#2E86AB", alpha = 0.8, color = "white") +
  geom_hline(yintercept = mean(events_per_year$n),
             linetype = "dashed", color = "#E84855", linewidth = 0.8) +
  annotate("text", x = min(events_per_year$event_year) + 0.5,
           y = mean(events_per_year$n) + 2,
           label = paste0("Mean = ", round(mean(events_per_year$n), 1)),
           color = "#E84855", size = 3.5) +
  labs(title    = "Number of Solar Storm Events per Year",
       subtitle = "No obvious systematic clustering or time-trend detected",
       x = "Year", y = "Count") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

Events Over Time Checking for Temporal Clustering

Conclusion: Assumption 2 is met by design. Each row represents a distinct solar storm event. Events are spread across multiple years with no obvious temporal clustering. There is no repeated-measures structure (e.g., the same storm measured twice). One caveat: storms occurring very close in time could reflect the same underlying active solar region this is a domain-level concern that cannot be resolved statistically, and should be noted as a limitation.

5.3 Assumption 3 Absence of Perfect Multicollinearity (VIF)

Multicollinearity refers to high linear correlation among predictors. Perfect multicollinearity makes the model inestimable; high multicollinearity inflates standard errors and makes individual coefficients unstable. We assess this using the Variance Inflation Factor (VIF): a VIF > 10 indicates serious multicollinearity, and VIF > 5 warrants attention.

Because MASS::polr fits a cumulative logit model, we approximate VIF by regressing each predictor on all other predictors using OLS (a standard approximation valid for assessing collinearity structure).

# Approximate VIF via OLS on predictors
vif_models <- list(
  flare_class_ord        = lm(flare_class_ord        ~ flare_intensity + geomagnetic_index_Kp +
                                 solar_wind_speed + solar_wind_density +
                                 flare_duration_minutes, data = df_raw),
  flare_intensity        = lm(flare_intensity        ~ flare_class_ord + geomagnetic_index_Kp +
                                 solar_wind_speed + solar_wind_density +
                                 flare_duration_minutes, data = df_raw),
  geomagnetic_index_Kp   = lm(geomagnetic_index_Kp   ~ flare_class_ord + flare_intensity +
                                 solar_wind_speed + solar_wind_density +
                                 flare_duration_minutes, data = df_raw),
  solar_wind_speed       = lm(solar_wind_speed       ~ flare_class_ord + flare_intensity +
                                 geomagnetic_index_Kp + solar_wind_density +
                                 flare_duration_minutes, data = df_raw),
  solar_wind_density     = lm(solar_wind_density     ~ flare_class_ord + flare_intensity +
                                 geomagnetic_index_Kp + solar_wind_speed +
                                 flare_duration_minutes, data = df_raw),
  flare_duration_minutes = lm(flare_duration_minutes ~ flare_class_ord + flare_intensity +
                                 geomagnetic_index_Kp + solar_wind_speed +
                                 solar_wind_density,    data = df_raw)
)

vif_values <- sapply(vif_models, function(m) {
  r2 <- summary(m)$r.squared
  1 / (1 - r2)
})

vif_df <- data.frame(
  Predictor      = names(vif_values),
  VIF            = round(vif_values, 4),
  Interpretation = ifelse(vif_values >= 10, "High  serious concern",
                   ifelse(vif_values >= 5,  "Moderate  monitor",
                                            "Acceptable"))
)
rownames(vif_df) <- NULL

kable(vif_df, caption = "Variance Inflation Factor (VIF) per Predictor") %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(3,
              color = ifelse(vif_df$VIF >= 10, "red",
                     ifelse(vif_df$VIF >= 5,  "darkorange", "darkgreen")),
              bold  = TRUE)

Variance Inflation Factor (VIF) per Predictor
Predictor	VIF	Interpretation
flare_class_ord	8.7176	Moderate monitor
flare_intensity	9.2674	Moderate monitor
geomagnetic_index_Kp	1.4250	Acceptable
solar_wind_speed	1.0266	Acceptable
solar_wind_density	1.0042	Acceptable
flare_duration_minutes	1.0045	Acceptable

cor_vars <- c("flare_class_ord", "flare_intensity", "geomagnetic_index_Kp",
              "solar_wind_speed", "solar_wind_density", "flare_duration_minutes")

cor_mat <- round(cor(df_raw[cor_vars], use = "complete.obs"), 3)

cor_melted <- reshape2::melt(cor_mat)

ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = value), size = 3.5, fontface = "bold") +
  scale_fill_gradient2(low  = "#E84855", mid = "white", high = "#2E86AB",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Pearson r") +
  labs(title    = "Predictor Correlation Matrix",
       subtitle = "Values close to ±1 signal potential multicollinearity",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 11) +
  theme(plot.title  = element_text(face = "bold"),
        axis.text.x = element_text(angle = 30, hjust = 1))

Correlation Matrix of All Predictors

Conclusion: Assumption 3 is met if all VIF values are below 5. Pay particular attention to the pair flare_class_ord and flare_intensity, which may show elevated correlation given their conceptual overlap (flare class is a categorical proxy for flare energy, while intensity is a continuous measure of the same). If VIF for either exceeds 10, consider removing one from the model or creating a combined composite predictor.

5.4 Assumption 4 No Extreme Outliers

Extreme outliers in predictor variables can unduly influence the estimation of regression coefficients and threshold parameters. We assess outliers using z-scores (flagging observations beyond ±3 SD) and boxplots. We also specifically investigate the anomalous negative minimum of solar_wind_density.

z_scores <- as.data.frame(scale(df_raw[num_desc_vars]))
names(z_scores) <- paste0(num_desc_vars, "_z")

outlier_counts <- sapply(z_scores, function(x) sum(abs(x) > 3, na.rm = TRUE))

outlier_df <- data.frame(
  Variable        = num_desc_vars,
  N_Outliers_3SD  = as.integer(outlier_counts),
  Pct_Outliers    = round(outlier_counts / nrow(df_raw) * 100, 2),
  Assessment      = ifelse(outlier_counts == 0,     "None detected",
                   ifelse(outlier_counts <= 5,      "Few  monitor",
                                                    "Several  investigate"))
)
rownames(outlier_df) <- NULL

kable(outlier_df,
      caption = "Outlier Count per Predictor (|z| > 3 threshold)",
      col.names = c("Variable", "Outlier Count (|z|>3)", "% of Sample", "Assessment")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(4,
              color = ifelse(outlier_df$N_Outliers_3SD == 0, "darkgreen",
                     ifelse(outlier_df$N_Outliers_3SD <= 5,  "darkorange", "red")),
              bold  = TRUE)

Outlier Count per Predictor (|z| > 3 threshold)
Variable	Outlier Count (\|z\|>3)	% of Sample	Assessment
flare_intensity	0	0.0	None detected
geomagnetic_index_Kp	3	0.3	Few monitor
solar_wind_speed	3	0.3	Few monitor
solar_wind_density	3	0.3	Few monitor
flare_duration_minutes	0	0.0	None detected

df_long_raw <- df_raw %>%
  dplyr::select(all_of(num_desc_vars)) %>%
  tidyr::pivot_longer(everything(), names_to = "Variable", values_to = "Value")

# Compute ±3 SD boundaries per variable (manual approach, no regex)
bounds_lo <- sapply(df_raw[num_desc_vars], function(x) mean(x) - 3*sd(x))
bounds_hi <- sapply(df_raw[num_desc_vars], function(x) mean(x) + 3*sd(x))
bounds <- data.frame(
  Variable = num_desc_vars,
  lo       = as.numeric(bounds_lo),
  hi       = as.numeric(bounds_hi)
)

ggplot(df_long_raw, aes(x = Value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "#2E86AB", color = "white", alpha = 0.6) +
  geom_density(color = "#2E86AB", linewidth = 0.9) +
  geom_vline(data = bounds, aes(xintercept = lo),
             color = "#E84855", linetype = "dashed", linewidth = 0.8) +
  geom_vline(data = bounds, aes(xintercept = hi),
             color = "#E84855", linetype = "dashed", linewidth = 0.8) +
  facet_wrap(~Variable, scales = "free", ncol = 3) +
  labs(title    = "Predictor Distributions with ±3 SD Outlier Boundaries",
       subtitle = "Red dashed lines indicate ±3 SD threshold",
       x = "Value", y = "Density") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"))

Density Plots of Numeric Predictors with Outlier Boundaries

neg_density <- df_raw %>%
  dplyr::filter(solar_wind_density < 0) %>%
  dplyr::select(event_id, event_date, solar_wind_density, power_grid_disruption)

cat(sprintf("Number of observations with negative solar_wind_density: %d\n",
            nrow(neg_density)))

## Number of observations with negative solar_wind_density: 25

if (nrow(neg_density) > 0) {
  kable(neg_density,
        caption = "Observations with Negative solar_wind_density (Physically Anomalous)") %>%
    kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
}

Observations with Negative solar_wind_density (Physically Anomalous)
event_id	event_date	solar_wind_density	power_grid_disruption
30	2021-10-08	-0.28	0
36	2005-09-19	-1.49	0
98	2016-10-14	-3.12	1
117	2015-02-05	-0.97	0
121	2018-07-25	-1.92	0
175	2004-06-17	-6.22	1
227	2015-04-03	-0.64	1
239	2008-05-24	-0.43	1
264	2001-08-24	-1.09	1
309	2008-05-24	-0.70	1
336	2020-03-28	-1.39	0
406	2019-07-06	-0.93	1
415	2019-04-21	-0.73	0
462	2024-06-09	-2.08	1
512	2024-02-29	-3.08	2
531	2001-05-13	-2.94	0
536	2010-12-12	-0.70	1
556	2014-11-07	-1.00	1
605	2021-09-24	-2.54	2
620	2001-08-16	-0.15	0
668	2019-02-16	-2.25	0
735	2022-08-04	-1.05	1
769	2004-11-19	-1.83	1
826	2002-08-15	-1.35	0
920	2019-07-03	-6.15	1

Conclusion: Assumption 4 is conditionally met. For most predictors, the number of extreme outliers (|z| > 3) should be minimal given the simulated nature of this dataset. The negative values in solar_wind_density are physically impossible (particle density cannot be negative) and represent either data entry errors or simulation artefacts. Before fitting the final model, these values should be clipped to zero or removed. Their influence on the model can be assessed by comparing results with and without those observations.

5.5 Assumption 5 Sufficient Sample Size per Category

Ordinal regression requires sufficient observations in each outcome category to reliably estimate the threshold parameters and regression coefficients. A commonly cited rule of thumb is at least 10–15 observations per predictor per outcome category.

n_predictors <- length(predictors)
category_counts <- df_raw %>%
  dplyr::count(power_grid_disruption) %>%
  dplyr::mutate(
    Label              = c("0 - None", "1 - Minor", "2 - Moderate", "3 - Severe")[power_grid_disruption + 1],
    Min_required       = n_predictors * 10,
    Sufficient_10x     = ifelse(n >= n_predictors * 10,  "Yes", "No"),
    Sufficient_15x     = ifelse(n >= n_predictors * 15,  "Yes", "No")
  ) %>%
  dplyr::select(power_grid_disruption, Label, n, Min_required, Sufficient_10x, Sufficient_15x)

kable(category_counts,
      caption = paste0("Sample Size per Outcome Category (", n_predictors,
                       " predictors × 10 or 15 = minimum threshold)"),
      col.names = c("Level", "Label", "n", "Min (10×p)", "≥10×p?", "≥15×p?")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(5, color = ifelse(category_counts$Sufficient_10x == "Yes",
                                "darkgreen", "red"), bold = TRUE) %>%
  column_spec(6, color = ifelse(category_counts$Sufficient_15x == "Yes",
                                "darkgreen", "red"), bold = TRUE)

Sample Size per Outcome Category (6 predictors × 10 or 15 = minimum threshold)
Level	Label	n	Min (10×p)	≥10×p?	≥15×p?
0	0 - None	413	60	Yes	Yes
1	1 - Minor	453	60	Yes	Yes
2	2 - Moderate	134	60	Yes	Yes

prop_df <- data.frame(
  Level = c("0-None", "1-Minor", "2-Moderate", "3-Severe"),
  Count = c(
    sum(df_raw$power_grid_disruption == 0),
    sum(df_raw$power_grid_disruption == 1),
    sum(df_raw$power_grid_disruption == 2),
    sum(df_raw$power_grid_disruption == 3)
  )
)

ggplot(prop_df, aes(x = Level, y = Count, fill = Level)) +
  geom_col(color = "white", alpha = 0.85, width = 0.6) +
  geom_hline(yintercept = n_predictors * 10,
             linetype = "dashed", color = "darkorange", linewidth = 0.9) +
  geom_hline(yintercept = n_predictors * 15,
             linetype = "dashed", color = "red", linewidth = 0.9) +
  annotate("text", x = 0.6, y = n_predictors * 10 + 8,
           label = paste0("Min 10×p = ", n_predictors * 10),
           color = "darkorange", size = 3.2, hjust = 0) +
  annotate("text", x = 0.6, y = n_predictors * 15 + 8,
           label = paste0("Min 15×p = ", n_predictors * 15),
           color = "red", size = 3.2, hjust = 0) +
  scale_fill_brewer(palette = "Set2", guide = "none") +
  labs(title    = "Class Frequencies vs. Minimum Sample Size Thresholds",
       subtitle = "Dashed lines = rule-of-thumb minimums for 6 predictors",
       x = "Disruption Level", y = "Count") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

Class Proportions vs. Minimum Thresholds

Conclusion: Assumption 5 is partially met. Levels 0 (None) and 1 (Minor) have more than adequate sample sizes. Level 2 (Moderate, n = 134) satisfies the 10× threshold for 6 predictors (minimum = 60) but is borderline for 15× (minimum = 90). Level 3 (Severe) has zero observations and cannot be modelled the model will be restricted to predicting levels 0, 1, and 2. This must be explicitly stated as a limitation of any findings.

5.6 Assumption 6 Proportional Odds (Parallel Lines)

This is the most critical and most commonly violated assumption of the Proportional Odds Model. It states that the effect of each predictor is identical across all cumulative logit thresholds i.e., the same β coefficient governs P(Y ≤ 0 | X) and P(Y ≤ 1 | X) simultaneously.

We test this assumption using the graphical parallel lines approach (fitting separate binary logistic regressions for each cutpoint and comparing coefficients), and by computing the Score Test for proportional odds available in ordinal::clm.

5.6.1 Fit the Proportional Odds Model

df_model$disruption_ord <- factor(
  df_model$power_grid_disruption,
  levels = c(0, 1, 2),
  ordered = TRUE
)

# Fit POM using ordinal::clm (more informative output than MASS::polr)
model_clm <- ordinal::clm(
  disruption_ord ~ flare_class_ord + flare_intensity + geomagnetic_index_Kp +
    solar_wind_speed + solar_wind_density + flare_duration_minutes,
  data = df_model,
  link = "logit"
)

cat("=== Proportional Odds Model Summary ===\n")

## === Proportional Odds Model Summary ===

summary(model_clm)

## formula: 
## disruption_ord ~ flare_class_ord + flare_intensity + geomagnetic_index_Kp + solar_wind_speed + solar_wind_density + flare_duration_minutes
## data:    df_model
## 
##  link  threshold nobs logLik AIC   niter max.grad cond.H 
##  logit flexible  1000 -0.00  16.00 29(0) 6.11e-09 1.4e+05
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## flare_class_ord          20.936         NA      NA       NA
## flare_intensity        2786.913         NA      NA       NA
## geomagnetic_index_Kp    115.876         NA      NA       NA
## solar_wind_speed        692.937         NA      NA       NA
## solar_wind_density        3.993         NA      NA       NA
## flare_duration_minutes  358.411         NA      NA       NA
## 
## Threshold coefficients:
##     Estimate Std. Error z value
## 0|1    -1248         NA      NA
## 1|2     3823         NA      NA

5.6.2 Score Test for Proportional Odds

The nominal_test() function from the ordinal package tests whether each predictor violates the proportional odds assumption. A significant result (p < 0.05) for a given predictor means that predictor’s effect is NOT consistent across thresholds, i.e., the parallel lines assumption is violated for that predictor.

nominal_result <- ordinal::nominal_test(model_clm)

nominal_df <- as.data.frame(nominal_result)
nominal_df$Predictor <- rownames(nominal_df)
# Rename actual columns returned by ordinal::nominal_test: Df, logLik, AIC, LRT, Pr(>Chi)
names(nominal_df)[names(nominal_df) == "Df"]        <- "df_val"
names(nominal_df)[names(nominal_df) == "LRT"]       <- "LRT_stat"
names(nominal_df)[names(nominal_df) == "Pr(>Chi)"]  <- "p_val"
nominal_df <- nominal_df %>%
  dplyr::select(Predictor, LRT_stat, df_val, p_val) %>%
  dplyr::mutate(
    LRT_stat = round(LRT_stat, 4),
    p_val    = round(p_val, 4),
    Decision = ifelse(is.na(p_val), "",
               ifelse(p_val < 0.001, "Violated ***",
               ifelse(p_val < 0.01,  "Violated **",
               ifelse(p_val < 0.05,  "Violated *",
                                     "Met (p >= 0.05)"))))
  )
rownames(nominal_df) <- NULL

kable(nominal_df,
      caption = "Score/Nominal Test for Proportional Odds Assumption per Predictor",
      col.names = c("Predictor", "LRT Statistic", "df", "p-value", "Decision")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(5,
              color = ifelse(grepl("Violated", nominal_df$Decision), "red",
                     ifelse(nominal_df$Decision == "", "gray", "darkgreen")),
              bold  = TRUE)

Score/Nominal Test for Proportional Odds Assumption per Predictor
Predictor	LRT Statistic	df	p-value	Decision
<none>	NA	NA	NA
flare_class_ord	0	1	0.9998	Met (p >= 0.05)
flare_intensity	0	1	0.9996	Met (p >= 0.05)
geomagnetic_index_Kp	0	1	0.9998	Met (p >= 0.05)
solar_wind_speed	0	1	0.9996	Met (p >= 0.05)
solar_wind_density	0	1	1.0000	Met (p >= 0.05)
flare_duration_minutes	0	1	0.9996	Met (p >= 0.05)

5.6.3 Scale Test for Ordinal Predictors

scale_result <- ordinal::scale_test(model_clm)

scale_df <- as.data.frame(scale_result)
scale_df$Predictor <- rownames(scale_df)
# Rename actual columns returned by ordinal::scale_test: Df, logLik, AIC, LRT, Pr(>Chi)
names(scale_df)[names(scale_df) == "Df"]        <- "df_val"
names(scale_df)[names(scale_df) == "LRT"]       <- "LRT_stat"
names(scale_df)[names(scale_df) == "Pr(>Chi)"]  <- "p_val"
scale_df <- scale_df %>%
  dplyr::select(Predictor, LRT_stat, df_val, p_val) %>%
  dplyr::mutate(
    LRT_stat = round(LRT_stat, 4),
    p_val    = round(p_val, 4),
    Decision = ifelse(is.na(p_val), "",
               ifelse(p_val < 0.05,  "Violated *", "Met (p >= 0.05)"))
  )
rownames(scale_df) <- NULL

kable(scale_df,
      caption = "Scale Test for Proportional Odds Assumption",
      col.names = c("Predictor", "LRT Statistic", "df", "p-value", "Decision")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(5,
              color = ifelse(grepl("Violated", scale_df$Decision), "red",
                     ifelse(scale_df$Decision == "", "gray", "darkgreen")),
              bold  = TRUE)

Scale Test for Proportional Odds Assumption
Predictor	LRT Statistic	df	p-value	Decision
<none>	NA	NA	NA
flare_class_ord	0	1	0.9993	Met (p >= 0.05)
flare_intensity	0	1	0.9993	Met (p >= 0.05)
geomagnetic_index_Kp	0	1	0.9993	Met (p >= 0.05)
solar_wind_speed	0	1	0.9993	Met (p >= 0.05)
solar_wind_density	0	1	0.9993	Met (p >= 0.05)
flare_duration_minutes	0	1	0.9993	Met (p >= 0.05)

5.6.4 Graphical Check Parallel Lines Visualisation

The graphical approach fits two separate binary logistic regressions one for each cutpoint and plots the predicted probabilities. If the lines are roughly parallel, the proportional odds assumption holds. If they diverge substantially, the assumption is violated for that predictor.

# Binary outcomes for each threshold
df_model$cut1 <- as.integer(df_model$power_grid_disruption >= 1)  # P(Y >= 1)
df_model$cut2 <- as.integer(df_model$power_grid_disruption >= 2)  # P(Y >= 2)

num_preds_scaled <- c("flare_intensity", "geomagnetic_index_Kp",
                      "solar_wind_speed", "solar_wind_density",
                      "flare_duration_minutes")

plot_list_parallel <- lapply(num_preds_scaled, function(pvar) {
  # Sequence across the range of the scaled predictor
  xseq <- seq(min(df_model[[pvar]], na.rm = TRUE),
              max(df_model[[pvar]], na.rm = TRUE),
              length.out = 100)

  # Fit binary logistic for each cutpoint (other predictors at mean = 0 after scaling)
  fml1 <- as.formula(paste("cut1 ~", pvar))
  fml2 <- as.formula(paste("cut2 ~", pvar))

  glm1 <- glm(fml1, data = df_model, family = binomial)
  glm2 <- glm(fml2, data = df_model, family = binomial)

  newdat <- data.frame(x = xseq)
  names(newdat) <- pvar

  pred1 <- predict(glm1, newdata = newdat, type = "response")
  pred2 <- predict(glm2, newdata = newdat, type = "response")

  plot_df <- data.frame(
    x       = rep(xseq, 2),
    prob    = c(pred1, pred2),
    Cutpoint = rep(c("P(Y \u2265 1) vs P(Y < 1)",
                     "P(Y \u2265 2) vs P(Y < 2)"), each = 100)
  )

  ggplot(plot_df, aes(x = x, y = prob, color = Cutpoint, linetype = Cutpoint)) +
    geom_line(linewidth = 1) +
    scale_color_manual(values = c("#2E86AB", "#E84855")) +
    labs(title    = pvar,
         x        = paste0(pvar, " (standardised)"),
         y        = "Predicted Probability",
         color    = "Cutpoint",
         linetype = "Cutpoint") +
    theme_minimal(base_size = 10) +
    theme(plot.title      = element_text(face = "bold", size = 10),
          legend.position = "bottom",
          legend.text     = element_text(size = 8))
})

gridExtra::grid.arrange(grobs = plot_list_parallel, ncol = 2)

Parallel Lines Check: Predicted Probabilities per Cutpoint

5.6.5 Coefficient Comparison Across Cutpoints

A more direct assessment compares the regression coefficients from two separate binary logistic regressions (one per cutpoint). If the coefficients are approximately equal, the proportional odds assumption is tenable for that predictor.

all_preds_formula <- c("flare_class_ord", "flare_intensity", "geomagnetic_index_Kp",
                       "solar_wind_speed", "solar_wind_density",
                       "flare_duration_minutes")

fml_cut1 <- as.formula(paste("cut1 ~", paste(all_preds_formula, collapse = " + ")))
fml_cut2 <- as.formula(paste("cut2 ~", paste(all_preds_formula, collapse = " + ")))

glm_cut1 <- glm(fml_cut1, data = df_model, family = binomial)
glm_cut2 <- glm(fml_cut2, data = df_model, family = binomial)

coef_df <- data.frame(
  Predictor = all_preds_formula,
  Beta_Cut1 = round(coef(glm_cut1)[-1], 4),
  Beta_Cut2 = round(coef(glm_cut2)[-1], 4)
) %>%
  dplyr::mutate(
    Difference    = round(abs(Beta_Cut1 - Beta_Cut2), 4),
    Ratio         = round(ifelse(Beta_Cut1 != 0, Beta_Cut2 / Beta_Cut1, NA), 3),
    Assessment    = ifelse(Difference < 0.3, "Parallel (small diff)",
                   ifelse(Difference < 0.8, "Moderate diff  review",
                                            "Large diff  likely violated"))
  )
rownames(coef_df) <- NULL

kable(coef_df,
      caption = "Coefficient Comparison Across Two Cutpoints (Parallel Lines Check)",
      col.names = c("Predictor", "β (cutpoint 1)", "β (cutpoint 2)",
                    "|Difference|", "Ratio (β2/β1)", "Assessment")) %>%
  kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
  column_spec(6,
              color = ifelse(coef_df$Difference >= 0.8,  "red",
                     ifelse(coef_df$Difference >= 0.3,  "darkorange",
                                                         "darkgreen")),
              bold  = TRUE)

Coefficient Comparison Across Two Cutpoints (Parallel Lines Check)
Predictor	β (cutpoint 1)	β (cutpoint 2)	\|Difference\|	Ratio (β2/β1)	Assessment
flare_class_ord	-1.6794	-44.8658	43.1864	26.715	Large diff likely violated
flare_intensity	1672.6531	501.1540	1171.4991	0.300	Large diff likely violated
geomagnetic_index_Kp	70.8777	24.7647	46.1130	0.349	Large diff likely violated
solar_wind_speed	415.7930	104.1878	311.6052	0.251	Large diff likely violated
solar_wind_density	-2.2573	13.4627	15.7200	-5.964	Large diff likely violated
flare_duration_minutes	224.1975	46.5756	177.6219	0.208	Large diff likely violated

Conclusion for Assumption 6: The proportional odds assumption should be interpreted in light of all three pieces of evidence above: (a) the formal nominal test p-values, (b) visual parallelism of the predicted probability curves, and (c) the magnitude of coefficient differences across cutpoints. Predictors with a significant nominal test AND visually diverging curves AND large coefficient differences should be treated as violating the assumption. In that case, consider fitting a Partial Proportional Odds Model using ordinal::clm(..., nominal = ~violated_predictor) to allow that predictor to have separate coefficients per threshold while keeping the remaining predictors under the proportional odds constraint.

6. Overall Assumption Summary

summary_table <- data.frame(
  No. = 1:6,
  Assumption = c(
    "Ordinal dependent variable",
    "Independence of observations",
    "No perfect multicollinearity (VIF)",
    "No extreme outliers",
    "Sufficient sample size per category",
    "Proportional odds (parallel lines)"
  ),
  How_Assessed = c(
    "Domain/theoretical judgment",
    "Study design review + temporal plot",
    "VIF via OLS approximation; correlation matrix",
    "Z-score (|z| > 3) + density plots",
    "n per level vs. 10×p and 15×p thresholds",
    "Nominal test (ordinal::clm) + graphical + coefficient comparison"
  ),
  Typical_Result = c(
    "Met  ordered levels are theoretically meaningful",
    "Met by design  distinct events; note temporal caveat",
    "Check VIF table  monitor flare_class_ord vs flare_intensity",
    "Monitor negative solar_wind_density values",
    "Levels 0 & 1 adequate; Level 2 borderline; Level 3 absent",
    "See nominal test results and parallel lines plots"
  ),
  Action_If_Violated = c(
    "Not applicable",
    "Account for clustering or use GEE/mixed models",
    "Remove or combine correlated predictors; use ridge regression",
    "Clip/remove extreme values; sensitivity analysis",
    "Acknowledge as limitation; avoid predicting absent levels",
    "Use Partial Proportional Odds Model (clm with nominal argument)"
  )
)

kable(summary_table,
      caption = "Comprehensive Assumption Checklist for Ordinal Regression",
      col.names = c("#", "Assumption", "How Assessed",
                    "Expected Outcome / Notes", "Remediation if Violated")) %>%
  kable_styling(bootstrap_options = c("hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  scroll_box(width = "100%")

Comprehensive Assumption Checklist for Ordinal Regression
#	Assumption	How Assessed	Expected Outcome / Notes	Remediation if Violated
1	Ordinal dependent variable	Domain/theoretical judgment	Met ordered levels are theoretically meaningful	Not applicable
2	Independence of observations	Study design review + temporal plot	Met by design distinct events; note temporal caveat	Account for clustering or use GEE/mixed models
3	No perfect multicollinearity (VIF)	VIF via OLS approximation; correlation matrix	Check VIF table monitor flare_class_ord vs flare_intensity	Remove or combine correlated predictors; use ridge regression
4	No extreme outliers	Z-score (\|z\| > 3) + density plots	Monitor negative solar_wind_density values	Clip/remove extreme values; sensitivity analysis
5	Sufficient sample size per category	n per level vs. 10×p and 15×p thresholds	Levels 0 & 1 adequate; Level 2 borderline; Level 3 absent	Acknowledge as limitation; avoid predicting absent levels
6	Proportional odds (parallel lines)	Nominal test (ordinal::clm) + graphical + coefficient comparison	See nominal test results and parallel lines plots	Use Partial Proportional Odds Model (clm with nominal argument)

7. Recommendations Before Fitting the Final Model

Based on the assumption checks above, the following steps are recommended before proceeding to interpretation of the ordinal regression coefficients:

Handle negative solar_wind_density values clip to zero or remove those rows, then re-run VIF and outlier checks.
Monitor multicollinearity between flare_class_ord and flare_intensity. If VIF > 5 for either, test the model with and without one of them, or use only flare_intensity (which is continuous and more information-rich).
Acknowledge Level 3 absence in all interpretations the model estimates thresholds only for P(Y ≤ 0) and P(Y ≤ 1); the Level 3 threshold cannot be estimated.

If the proportional odds assumption is violated for one or more predictors, fit a Partial Proportional Odds Model:

model_ppom <- ordinal::clm(
  disruption_ord ~ flare_class_ord + flare_intensity + geomagnetic_index_Kp +
    solar_wind_speed + solar_wind_density + flare_duration_minutes,
  nominal = ~ violated_predictor_name,
  data = df_model
)

Use stratified train/test split (80/20, stratify = y) for predictive modelling to maintain proportional class representation, especially for the minority class (Level 2).

Document generated using R Markdown. All analyses conducted in R (version ≥ 4.1.0). Dataset: Solar Storm Impact on Power Grids (1,000 simulated events). Primary packages: ordinal, car, MASS, ggplot2, dplyr.

Ordinal Regression Assumption Checks: Solar Storm Impact on Power Grids

Nazril Ravi Pratama and Dimas Rafi Izzulhaq and Naura Kanaya Putri Masruri

2026-04-25