This report presents a systematic assessment of the assumptions underlying Ordinal Regression (Proportional Odds Model) applied to the Solar Storm Impact on Power Grids dataset. Before fitting and interpreting an ordinal regression model, it is essential to verify that the data satisfies the statistical assumptions on which the model relies. Violating these assumptions can lead to biased estimates, inflated standard errors, and misleading conclusions.
The dataset contains 1,000 simulated solar storm events with six
predictor variables (both numerical and categorical) and one ordinal
outcome variable: power_grid_disruption (0 = None, 1 =
Minor, 2 = Moderate, 3 = Severe). The analysis pipeline follows: data
loading → preprocessing → exploratory data analysis → assumption
checking → summary and recommendations.
The six assumptions tested in this report are:
df_raw <- read.csv("solar_storm_impact_dataset.csv", stringsAsFactors = FALSE)
cat(sprintf("Rows: %d | Columns: %d\n", nrow(df_raw), ncol(df_raw)))## Rows: 1000 | Columns: 9
## 'data.frame': 1000 obs. of 9 variables:
## $ event_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ event_date : chr "2018-10-01" "2020-09-28" "2020-03-15" "2008-01-11" ...
## $ solar_flare_class : chr "X" "M" "C" "M" ...
## $ flare_intensity : num 96.2 31.1 23.9 62.7 37.7 ...
## $ geomagnetic_index_Kp : int 4 0 0 2 0 1 1 2 2 0 ...
## $ solar_wind_speed : num 558 723 626 498 505 ...
## $ solar_wind_density : num 12.62 6.26 7.81 5.94 18.03 ...
## $ flare_duration_minutes: int 10 95 44 136 86 45 126 94 172 57 ...
## $ power_grid_disruption : int 2 1 1 1 1 0 2 1 0 0 ...
missing_summary <- data.frame(
Column = names(df_raw),
Missing = sapply(df_raw, function(x) sum(is.na(x))),
Pct_Miss = sapply(df_raw, function(x) round(mean(is.na(x)) * 100, 2))
)
rownames(missing_summary) <- NULL
kable(missing_summary,
caption = "Missing Values per Variable",
col.names = c("Variable", "Missing Count", "Missing (%)")) %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)| Variable | Missing Count | Missing (%) |
|---|---|---|
| event_id | 0 | 0 |
| event_date | 0 | 0 |
| solar_flare_class | 0 | 0 |
| flare_intensity | 0 | 0 |
| geomagnetic_index_Kp | 0 | 0 |
| solar_wind_speed | 0 | 0 |
| solar_wind_density | 0 | 0 |
| flare_duration_minutes | 0 | 0 |
| power_grid_disruption | 0 | 0 |
kable(head(df_raw, 8), caption = "Preview: First 8 Rows of Dataset") %>%
kable_styling(bootstrap_options = c("hover", "condensed"),
full_width = TRUE, font_size = 11) %>%
scroll_box(width = "100%")| event_id | event_date | solar_flare_class | flare_intensity | geomagnetic_index_Kp | solar_wind_speed | solar_wind_density | flare_duration_minutes | power_grid_disruption |
|---|---|---|---|---|---|---|---|---|
| 1 | 2018-10-01 | X | 96.25 | 4 | 558.43 | 12.62 | 10 | 2 |
| 2 | 2020-09-28 | M | 31.15 | 0 | 723.17 | 6.26 | 95 | 1 |
| 3 | 2020-03-15 | C | 23.93 | 0 | 626.27 | 7.81 | 44 | 1 |
| 4 | 2008-01-11 | M | 62.66 | 2 | 498.49 | 5.94 | 136 | 1 |
| 5 | 2022-03-30 | M | 37.73 | 0 | 505.02 | 18.03 | 86 | 1 |
| 6 | 2013-07-26 | C | 24.80 | 1 | 463.60 | 14.37 | 45 | 0 |
| 7 | 2004-03-08 | M | 69.01 | 1 | 690.06 | 5.38 | 126 | 2 |
| 8 | 2002-03-30 | M | 39.41 | 2 | 634.74 | 2.48 | 94 | 1 |
The categorical predictor solar_flare_class (C, M, X) is
encoded as an ordered integer because the flare classes carry a
meaningful physical ordering: C < M < X in terms of energy
release. One-hot encoding is deliberately avoided here to preserve this
ordinal structure.
The outcome power_grid_disruption is converted to an
ordered factor, which is required by ordinal regression functions in R
(MASS::polr, ordinal::clm).
# Ordinal encoding for solar flare class (C=0, M=1, X=2)
df_raw$flare_class_ord <- dplyr::case_when(
df_raw$solar_flare_class == "C" ~ 0L,
df_raw$solar_flare_class == "M" ~ 1L,
df_raw$solar_flare_class == "X" ~ 2L
)
# Convert outcome to ordered factor
df_raw$disruption_ord <- factor(
df_raw$power_grid_disruption,
levels = c(0, 1, 2, 3),
labels = c("None", "Minor", "Moderate", "Severe"),
ordered = TRUE
)
# Predictors to be used in the model
predictors <- c("flare_class_ord", "flare_intensity", "geomagnetic_index_Kp",
"solar_wind_speed", "solar_wind_density", "flare_duration_minutes")
encoding_table <- data.frame(
Original_Value = c("C", "M", "X"),
Ordinal_Code = c(0, 1, 2),
Rationale = c("Common flares, lowest energy",
"Moderate flares, medium energy",
"Extreme flares, highest energy")
)
kable(encoding_table,
caption = "Ordinal Encoding of solar_flare_class",
col.names = c("Original Value", "Ordinal Code", "Rationale")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| Original Value | Ordinal Code | Rationale |
|---|---|---|
| C | 0 | Common flares, lowest energy |
| M | 1 | Moderate flares, medium energy |
| X | 2 | Extreme flares, highest energy |
Numeric predictors are standardised (z-scored) to place coefficients on a comparable scale. This does not affect the assumption checks but is recommended practice before fitting the model.
num_vars <- c("flare_intensity", "geomagnetic_index_Kp",
"solar_wind_speed", "solar_wind_density",
"flare_duration_minutes")
df_model <- df_raw
df_model[num_vars] <- as.data.frame(scale(df_raw[num_vars]))
# Confirm scaling
scale_check <- data.frame(
Variable = num_vars,
Mean_raw = round(sapply(df_raw[num_vars], mean), 3),
SD_raw = round(sapply(df_raw[num_vars], sd), 3),
Mean_scaled = round(sapply(df_model[num_vars], mean), 6),
SD_scaled = round(sapply(df_model[num_vars], sd), 3)
)
rownames(scale_check) <- NULL
kable(scale_check,
caption = "Before and After Standardisation",
col.names = c("Variable", "Mean (raw)", "SD (raw)",
"Mean (scaled)", "SD (scaled)")) %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = FALSE)| Variable | Mean (raw) | SD (raw) | Mean (scaled) | SD (scaled) |
|---|---|---|---|---|
| flare_intensity | 38.651 | 27.506 | 0 | 1 |
| geomagnetic_index_Kp | 1.045 | 1.026 | 0 | 1 |
| solar_wind_speed | 501.777 | 99.924 | 0 | 1 |
| solar_wind_density | 9.785 | 4.886 | 0 | 1 |
| flare_duration_minutes | 89.143 | 50.530 | 0 | 1 |
num_desc_vars <- c("flare_intensity", "geomagnetic_index_Kp",
"solar_wind_speed", "solar_wind_density",
"flare_duration_minutes")
desc_stats <- data.frame(
Variable = num_desc_vars,
N = sapply(df_raw[num_desc_vars], function(x) sum(!is.na(x))),
Mean = sapply(df_raw[num_desc_vars], mean, na.rm = TRUE),
SD = sapply(df_raw[num_desc_vars], sd, na.rm = TRUE),
Min = sapply(df_raw[num_desc_vars], min, na.rm = TRUE),
Q1 = sapply(df_raw[num_desc_vars], quantile, 0.25, na.rm = TRUE),
Median = sapply(df_raw[num_desc_vars], median, na.rm = TRUE),
Q3 = sapply(df_raw[num_desc_vars], quantile, 0.75, na.rm = TRUE),
Max = sapply(df_raw[num_desc_vars], max, na.rm = TRUE),
Skewness = sapply(df_raw[num_desc_vars], moments::skewness, na.rm = TRUE),
Kurtosis = sapply(df_raw[num_desc_vars], moments::kurtosis, na.rm = TRUE)
)
rownames(desc_stats) <- NULL
kable(desc_stats %>% dplyr::mutate(across(where(is.numeric), ~round(.x, 3))),
caption = "Descriptive Statistics of Numeric Predictors") %>%
kable_styling(bootstrap_options = c("hover", "condensed"),
full_width = TRUE, font_size = 11) %>%
scroll_box(width = "100%")| Variable | N | Mean | SD | Min | Q1 | Median | Q3 | Max | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|
| flare_intensity | 1000 | 38.651 | 27.506 | 1.02 | 15.778 | 31.885 | 58.550 | 99.93 | 0.559 | 2.207 |
| geomagnetic_index_Kp | 1000 | 1.045 | 1.026 | 0.00 | 0.000 | 1.000 | 2.000 | 5.00 | 0.790 | 3.193 |
| solar_wind_speed | 1000 | 501.777 | 99.924 | 168.77 | 437.105 | 498.865 | 567.955 | 788.38 | -0.079 | 2.958 |
| solar_wind_density | 1000 | 9.785 | 4.886 | -6.22 | 6.367 | 9.610 | 13.137 | 25.55 | -0.008 | 2.948 |
| flare_duration_minutes | 1000 | 89.143 | 50.530 | 5.00 | 44.750 | 88.000 | 132.250 | 180.00 | 0.058 | 1.788 |
outcome_counts <- df_raw %>%
dplyr::count(power_grid_disruption) %>%
dplyr::mutate(
Label = c("0 - None", "1 - Minor", "2 - Moderate")[power_grid_disruption + 1],
Pct = round(n / sum(n) * 100, 1),
Label_pct = paste0(n, "\n(", Pct, "%)")
)
ggplot(outcome_counts, aes(x = factor(power_grid_disruption), y = n,
fill = factor(power_grid_disruption))) +
geom_col(color = "white", alpha = 0.85, width = 0.6) +
geom_text(aes(label = Label_pct), vjust = -0.4, size = 3.8, fontface = "bold") +
scale_fill_brewer(palette = "Set2", guide = "none") +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(title = "Frequency Distribution of power_grid_disruption",
subtitle = "Note: Level 3 (Severe) has zero observations in this dataset",
x = "Disruption Level",
y = "Count") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold"),
plot.subtitle = element_text(color = "gray50", size = 10))Distribution of Power Grid Disruption Levels
kable(outcome_counts %>% dplyr::select(power_grid_disruption, Label, n, Pct),
caption = "Frequency Table: power_grid_disruption",
col.names = c("Level", "Label", "Count", "Percent (%)")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| Level | Label | Count | Percent (%) |
|---|---|---|---|
| 0 | 0 - None | 413 | 41.3 |
| 1 | 1 - Minor | 453 | 45.3 |
| 2 | 2 - Moderate | 134 | 13.4 |
df_long <- df_raw %>%
dplyr::select(power_grid_disruption, all_of(num_desc_vars)) %>%
tidyr::pivot_longer(-power_grid_disruption,
names_to = "Variable",
values_to = "Value") %>%
dplyr::mutate(Disruption = factor(power_grid_disruption))
ggplot(df_long, aes(x = Disruption, y = Value, fill = Disruption)) +
geom_boxplot(outlier.shape = 21, outlier.size = 1.5, alpha = 0.75) +
facet_wrap(~Variable, scales = "free_y", ncol = 3) +
scale_fill_brewer(palette = "Set1") +
labs(title = "Numeric Predictor Distributions by Disruption Level",
x = "Disruption Level", y = "Value", fill = "Disruption") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"),
strip.text = element_text(face = "bold"))Boxplots of Numeric Predictors by Disruption Level
ct <- table(df_raw$solar_flare_class, df_raw$power_grid_disruption)
ct_df <- as.data.frame.matrix(ct)
ct_df$Flare_Class <- rownames(ct_df)
ct_df <- ct_df %>% dplyr::select(Flare_Class, everything())
rownames(ct_df) <- NULL
kable(ct_df,
caption = "Crosstab: Solar Flare Class vs. Power Grid Disruption Level",
col.names = c("Flare Class", "Level 0 (None)", "Level 1 (Minor)", "Level 2 (Moderate)")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
column_spec(1, bold = TRUE)| Flare Class | Level 0 (None) | Level 1 (Minor) | Level 2 (Moderate) |
|---|---|---|---|
| C | 406 | 88 | 0 |
| M | 7 | 329 | 10 |
| X | 0 | 36 | 124 |
mean_by_level <- df_raw %>%
dplyr::group_by(power_grid_disruption) %>%
dplyr::summarise(across(all_of(num_desc_vars), ~round(mean(.x, na.rm = TRUE), 3)),
.groups = "drop")
kable(mean_by_level,
caption = "Mean of Each Numeric Predictor per Disruption Level") %>%
kable_styling(bootstrap_options = c("hover", "condensed"),
full_width = TRUE, font_size = 11) %>%
scroll_box(width = "100%")| power_grid_disruption | flare_intensity | geomagnetic_index_Kp | solar_wind_speed | solar_wind_density | flare_duration_minutes |
|---|---|---|---|---|---|
| 0 | 13.798 | 0.632 | 473.912 | 9.803 | 86.053 |
| 1 | 47.048 | 1.086 | 519.036 | 9.724 | 90.724 |
| 2 | 86.860 | 2.179 | 529.312 | 9.937 | 93.321 |
The dependent variable power_grid_disruption takes
values 0 (None), 1 (Minor), 2 (Moderate), and 3 (Severe). These
categories have a clear, theoretically grounded ordering: a higher value
represents greater physical and economic damage to power infrastructure.
The distances between categories are not assumed to be equal only the
order is meaningful. This is precisely the structure that ordinal
regression is designed to model, distinguishing it from both linear
regression (which assumes equal intervals) and multinomial logistic
regression (which ignores order entirely).
ord_check <- data.frame(
Criterion = c("Outcome variable type",
"Number of ordered levels",
"Levels observed in data",
"Meaningful order exists",
"Equal intervals assumed"),
Status = c("Ordinal factor",
"4 defined (0, 1, 2, 3)",
"3 observed (0, 1, 2) Level 3 absent",
"Yes None < Minor < Moderate < Severe",
"No ordinal only"),
Assessment = c("Met",
"Met",
"Note: Level 3 has 0 observations",
"Met",
"Met")
)
kable(ord_check,
caption = "Assumption 1: Ordinal Dependent Variable") %>%
kable_styling(bootstrap_options = c("hover"), full_width = TRUE) %>%
column_spec(3,
color = ifelse(ord_check$Assessment == "Met", "darkgreen", "darkorange"),
bold = TRUE)| Criterion | Status | Assessment |
|---|---|---|
| Outcome variable type | Ordinal factor | Met |
| Number of ordered levels | 4 defined (0, 1, 2, 3) | Met |
| Levels observed in data | 3 observed (0, 1, 2) Level 3 absent | Note: Level 3 has 0 observations |
| Meaningful order exists | Yes None < Minor < Moderate < Severe | Met |
| Equal intervals assumed | No ordinal only | Met |
Conclusion: Assumption 1 is met. The outcome is genuinely ordinal and the ordering is physically meaningful. One important note is that Level 3 (Severe) contains zero observations, so the fitted model will effectively operate as a 3-class ordinal problem with two estimated threshold parameters.
Ordinal regression assumes that each observation is independent of all others that is, the outcome for one solar storm event does not influence the outcome for any other event. This is primarily assessed by considering the study design rather than a formal statistical test.
# Inspect date range and check for potential temporal clustering
df_raw$event_date_parsed <- as.Date(df_raw$event_date)
date_range <- data.frame(
Item = c("Earliest event", "Latest event", "Total events", "Unique dates"),
Value = c(
as.character(min(df_raw$event_date_parsed, na.rm = TRUE)),
as.character(max(df_raw$event_date_parsed, na.rm = TRUE)),
as.character(nrow(df_raw)),
as.character(length(unique(df_raw$event_date_parsed)))
)
)
kable(date_range, caption = "Temporal Scope of the Dataset",
col.names = c("Item", "Value")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)| Item | Value |
|---|---|
| Earliest event | 2000-01-02 |
| Latest event | 2024-08-15 |
| Total events | 1000 |
| Unique dates | 949 |
df_raw$event_year <- as.integer(format(df_raw$event_date_parsed, "%Y"))
events_per_year <- df_raw %>%
dplyr::count(event_year)
ggplot(events_per_year, aes(x = event_year, y = n)) +
geom_col(fill = "#2E86AB", alpha = 0.8, color = "white") +
geom_hline(yintercept = mean(events_per_year$n),
linetype = "dashed", color = "#E84855", linewidth = 0.8) +
annotate("text", x = min(events_per_year$event_year) + 0.5,
y = mean(events_per_year$n) + 2,
label = paste0("Mean = ", round(mean(events_per_year$n), 1)),
color = "#E84855", size = 3.5) +
labs(title = "Number of Solar Storm Events per Year",
subtitle = "No obvious systematic clustering or time-trend detected",
x = "Year", y = "Count") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))Events Over Time Checking for Temporal Clustering
Conclusion: Assumption 2 is met by design. Each row represents a distinct solar storm event. Events are spread across multiple years with no obvious temporal clustering. There is no repeated-measures structure (e.g., the same storm measured twice). One caveat: storms occurring very close in time could reflect the same underlying active solar region this is a domain-level concern that cannot be resolved statistically, and should be noted as a limitation.
Multicollinearity refers to high linear correlation among predictors. Perfect multicollinearity makes the model inestimable; high multicollinearity inflates standard errors and makes individual coefficients unstable. We assess this using the Variance Inflation Factor (VIF): a VIF > 10 indicates serious multicollinearity, and VIF > 5 warrants attention.
Because MASS::polr fits a cumulative logit model, we
approximate VIF by regressing each predictor on all other predictors
using OLS (a standard approximation valid for assessing collinearity
structure).
# Approximate VIF via OLS on predictors
vif_models <- list(
flare_class_ord = lm(flare_class_ord ~ flare_intensity + geomagnetic_index_Kp +
solar_wind_speed + solar_wind_density +
flare_duration_minutes, data = df_raw),
flare_intensity = lm(flare_intensity ~ flare_class_ord + geomagnetic_index_Kp +
solar_wind_speed + solar_wind_density +
flare_duration_minutes, data = df_raw),
geomagnetic_index_Kp = lm(geomagnetic_index_Kp ~ flare_class_ord + flare_intensity +
solar_wind_speed + solar_wind_density +
flare_duration_minutes, data = df_raw),
solar_wind_speed = lm(solar_wind_speed ~ flare_class_ord + flare_intensity +
geomagnetic_index_Kp + solar_wind_density +
flare_duration_minutes, data = df_raw),
solar_wind_density = lm(solar_wind_density ~ flare_class_ord + flare_intensity +
geomagnetic_index_Kp + solar_wind_speed +
flare_duration_minutes, data = df_raw),
flare_duration_minutes = lm(flare_duration_minutes ~ flare_class_ord + flare_intensity +
geomagnetic_index_Kp + solar_wind_speed +
solar_wind_density, data = df_raw)
)
vif_values <- sapply(vif_models, function(m) {
r2 <- summary(m)$r.squared
1 / (1 - r2)
})
vif_df <- data.frame(
Predictor = names(vif_values),
VIF = round(vif_values, 4),
Interpretation = ifelse(vif_values >= 10, "High serious concern",
ifelse(vif_values >= 5, "Moderate monitor",
"Acceptable"))
)
rownames(vif_df) <- NULL
kable(vif_df, caption = "Variance Inflation Factor (VIF) per Predictor") %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
column_spec(3,
color = ifelse(vif_df$VIF >= 10, "red",
ifelse(vif_df$VIF >= 5, "darkorange", "darkgreen")),
bold = TRUE)| Predictor | VIF | Interpretation |
|---|---|---|
| flare_class_ord | 8.7176 | Moderate monitor |
| flare_intensity | 9.2674 | Moderate monitor |
| geomagnetic_index_Kp | 1.4250 | Acceptable |
| solar_wind_speed | 1.0266 | Acceptable |
| solar_wind_density | 1.0042 | Acceptable |
| flare_duration_minutes | 1.0045 | Acceptable |
cor_vars <- c("flare_class_ord", "flare_intensity", "geomagnetic_index_Kp",
"solar_wind_speed", "solar_wind_density", "flare_duration_minutes")
cor_mat <- round(cor(df_raw[cor_vars], use = "complete.obs"), 3)
cor_melted <- reshape2::melt(cor_mat)
ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white", linewidth = 0.5) +
geom_text(aes(label = value), size = 3.5, fontface = "bold") +
scale_fill_gradient2(low = "#E84855", mid = "white", high = "#2E86AB",
midpoint = 0, limits = c(-1, 1),
name = "Pearson r") +
labs(title = "Predictor Correlation Matrix",
subtitle = "Values close to ±1 signal potential multicollinearity",
x = NULL, y = NULL) +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 30, hjust = 1))Correlation Matrix of All Predictors
Conclusion: Assumption 3 is met if
all VIF values are below 5. Pay particular attention to the pair
flare_class_ord and flare_intensity, which may
show elevated correlation given their conceptual overlap (flare class is
a categorical proxy for flare energy, while intensity is a continuous
measure of the same). If VIF for either exceeds 10, consider removing
one from the model or creating a combined composite predictor.
Extreme outliers in predictor variables can unduly influence the
estimation of regression coefficients and threshold parameters. We
assess outliers using z-scores (flagging observations beyond ±3 SD) and
boxplots. We also specifically investigate the anomalous negative
minimum of solar_wind_density.
z_scores <- as.data.frame(scale(df_raw[num_desc_vars]))
names(z_scores) <- paste0(num_desc_vars, "_z")
outlier_counts <- sapply(z_scores, function(x) sum(abs(x) > 3, na.rm = TRUE))
outlier_df <- data.frame(
Variable = num_desc_vars,
N_Outliers_3SD = as.integer(outlier_counts),
Pct_Outliers = round(outlier_counts / nrow(df_raw) * 100, 2),
Assessment = ifelse(outlier_counts == 0, "None detected",
ifelse(outlier_counts <= 5, "Few monitor",
"Several investigate"))
)
rownames(outlier_df) <- NULL
kable(outlier_df,
caption = "Outlier Count per Predictor (|z| > 3 threshold)",
col.names = c("Variable", "Outlier Count (|z|>3)", "% of Sample", "Assessment")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
column_spec(4,
color = ifelse(outlier_df$N_Outliers_3SD == 0, "darkgreen",
ifelse(outlier_df$N_Outliers_3SD <= 5, "darkorange", "red")),
bold = TRUE)| Variable | Outlier Count (|z|>3) | % of Sample | Assessment |
|---|---|---|---|
| flare_intensity | 0 | 0.0 | None detected |
| geomagnetic_index_Kp | 3 | 0.3 | Few monitor |
| solar_wind_speed | 3 | 0.3 | Few monitor |
| solar_wind_density | 3 | 0.3 | Few monitor |
| flare_duration_minutes | 0 | 0.0 | None detected |
df_long_raw <- df_raw %>%
dplyr::select(all_of(num_desc_vars)) %>%
tidyr::pivot_longer(everything(), names_to = "Variable", values_to = "Value")
# Compute ±3 SD boundaries per variable (manual approach, no regex)
bounds_lo <- sapply(df_raw[num_desc_vars], function(x) mean(x) - 3*sd(x))
bounds_hi <- sapply(df_raw[num_desc_vars], function(x) mean(x) + 3*sd(x))
bounds <- data.frame(
Variable = num_desc_vars,
lo = as.numeric(bounds_lo),
hi = as.numeric(bounds_hi)
)
ggplot(df_long_raw, aes(x = Value)) +
geom_histogram(aes(y = after_stat(density)), bins = 30,
fill = "#2E86AB", color = "white", alpha = 0.6) +
geom_density(color = "#2E86AB", linewidth = 0.9) +
geom_vline(data = bounds, aes(xintercept = lo),
color = "#E84855", linetype = "dashed", linewidth = 0.8) +
geom_vline(data = bounds, aes(xintercept = hi),
color = "#E84855", linetype = "dashed", linewidth = 0.8) +
facet_wrap(~Variable, scales = "free", ncol = 3) +
labs(title = "Predictor Distributions with ±3 SD Outlier Boundaries",
subtitle = "Red dashed lines indicate ±3 SD threshold",
x = "Value", y = "Density") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"),
strip.text = element_text(face = "bold"))Density Plots of Numeric Predictors with Outlier Boundaries
neg_density <- df_raw %>%
dplyr::filter(solar_wind_density < 0) %>%
dplyr::select(event_id, event_date, solar_wind_density, power_grid_disruption)
cat(sprintf("Number of observations with negative solar_wind_density: %d\n",
nrow(neg_density)))## Number of observations with negative solar_wind_density: 25
if (nrow(neg_density) > 0) {
kable(neg_density,
caption = "Observations with Negative solar_wind_density (Physically Anomalous)") %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE)
}| event_id | event_date | solar_wind_density | power_grid_disruption |
|---|---|---|---|
| 30 | 2021-10-08 | -0.28 | 0 |
| 36 | 2005-09-19 | -1.49 | 0 |
| 98 | 2016-10-14 | -3.12 | 1 |
| 117 | 2015-02-05 | -0.97 | 0 |
| 121 | 2018-07-25 | -1.92 | 0 |
| 175 | 2004-06-17 | -6.22 | 1 |
| 227 | 2015-04-03 | -0.64 | 1 |
| 239 | 2008-05-24 | -0.43 | 1 |
| 264 | 2001-08-24 | -1.09 | 1 |
| 309 | 2008-05-24 | -0.70 | 1 |
| 336 | 2020-03-28 | -1.39 | 0 |
| 406 | 2019-07-06 | -0.93 | 1 |
| 415 | 2019-04-21 | -0.73 | 0 |
| 462 | 2024-06-09 | -2.08 | 1 |
| 512 | 2024-02-29 | -3.08 | 2 |
| 531 | 2001-05-13 | -2.94 | 0 |
| 536 | 2010-12-12 | -0.70 | 1 |
| 556 | 2014-11-07 | -1.00 | 1 |
| 605 | 2021-09-24 | -2.54 | 2 |
| 620 | 2001-08-16 | -0.15 | 0 |
| 668 | 2019-02-16 | -2.25 | 0 |
| 735 | 2022-08-04 | -1.05 | 1 |
| 769 | 2004-11-19 | -1.83 | 1 |
| 826 | 2002-08-15 | -1.35 | 0 |
| 920 | 2019-07-03 | -6.15 | 1 |
Conclusion: Assumption 4 is conditionally
met. For most predictors, the number of extreme outliers (|z|
> 3) should be minimal given the simulated nature of this dataset.
The negative values in solar_wind_density are physically
impossible (particle density cannot be negative) and represent either
data entry errors or simulation artefacts. Before fitting the final
model, these values should be clipped to zero or removed. Their
influence on the model can be assessed by comparing results with and
without those observations.
Ordinal regression requires sufficient observations in each outcome category to reliably estimate the threshold parameters and regression coefficients. A commonly cited rule of thumb is at least 10–15 observations per predictor per outcome category.
n_predictors <- length(predictors)
category_counts <- df_raw %>%
dplyr::count(power_grid_disruption) %>%
dplyr::mutate(
Label = c("0 - None", "1 - Minor", "2 - Moderate", "3 - Severe")[power_grid_disruption + 1],
Min_required = n_predictors * 10,
Sufficient_10x = ifelse(n >= n_predictors * 10, "Yes", "No"),
Sufficient_15x = ifelse(n >= n_predictors * 15, "Yes", "No")
) %>%
dplyr::select(power_grid_disruption, Label, n, Min_required, Sufficient_10x, Sufficient_15x)
kable(category_counts,
caption = paste0("Sample Size per Outcome Category (", n_predictors,
" predictors × 10 or 15 = minimum threshold)"),
col.names = c("Level", "Label", "n", "Min (10×p)", "≥10×p?", "≥15×p?")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
column_spec(5, color = ifelse(category_counts$Sufficient_10x == "Yes",
"darkgreen", "red"), bold = TRUE) %>%
column_spec(6, color = ifelse(category_counts$Sufficient_15x == "Yes",
"darkgreen", "red"), bold = TRUE)| Level | Label | n | Min (10×p) | ≥10×p? | ≥15×p? |
|---|---|---|---|---|---|
| 0 | 0 - None | 413 | 60 | Yes | Yes |
| 1 | 1 - Minor | 453 | 60 | Yes | Yes |
| 2 | 2 - Moderate | 134 | 60 | Yes | Yes |
prop_df <- data.frame(
Level = c("0-None", "1-Minor", "2-Moderate", "3-Severe"),
Count = c(
sum(df_raw$power_grid_disruption == 0),
sum(df_raw$power_grid_disruption == 1),
sum(df_raw$power_grid_disruption == 2),
sum(df_raw$power_grid_disruption == 3)
)
)
ggplot(prop_df, aes(x = Level, y = Count, fill = Level)) +
geom_col(color = "white", alpha = 0.85, width = 0.6) +
geom_hline(yintercept = n_predictors * 10,
linetype = "dashed", color = "darkorange", linewidth = 0.9) +
geom_hline(yintercept = n_predictors * 15,
linetype = "dashed", color = "red", linewidth = 0.9) +
annotate("text", x = 0.6, y = n_predictors * 10 + 8,
label = paste0("Min 10×p = ", n_predictors * 10),
color = "darkorange", size = 3.2, hjust = 0) +
annotate("text", x = 0.6, y = n_predictors * 15 + 8,
label = paste0("Min 15×p = ", n_predictors * 15),
color = "red", size = 3.2, hjust = 0) +
scale_fill_brewer(palette = "Set2", guide = "none") +
labs(title = "Class Frequencies vs. Minimum Sample Size Thresholds",
subtitle = "Dashed lines = rule-of-thumb minimums for 6 predictors",
x = "Disruption Level", y = "Count") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))Class Proportions vs. Minimum Thresholds
Conclusion: Assumption 5 is partially met. Levels 0 (None) and 1 (Minor) have more than adequate sample sizes. Level 2 (Moderate, n = 134) satisfies the 10× threshold for 6 predictors (minimum = 60) but is borderline for 15× (minimum = 90). Level 3 (Severe) has zero observations and cannot be modelled the model will be restricted to predicting levels 0, 1, and 2. This must be explicitly stated as a limitation of any findings.
This is the most critical and most commonly violated assumption of the Proportional Odds Model. It states that the effect of each predictor is identical across all cumulative logit thresholds i.e., the same β coefficient governs P(Y ≤ 0 | X) and P(Y ≤ 1 | X) simultaneously.
We test this assumption using the graphical parallel lines
approach (fitting separate binary logistic regressions for each
cutpoint and comparing coefficients), and by computing the Score
Test for proportional odds available in
ordinal::clm.
df_model$disruption_ord <- factor(
df_model$power_grid_disruption,
levels = c(0, 1, 2),
ordered = TRUE
)
# Fit POM using ordinal::clm (more informative output than MASS::polr)
model_clm <- ordinal::clm(
disruption_ord ~ flare_class_ord + flare_intensity + geomagnetic_index_Kp +
solar_wind_speed + solar_wind_density + flare_duration_minutes,
data = df_model,
link = "logit"
)
cat("=== Proportional Odds Model Summary ===\n")## === Proportional Odds Model Summary ===
## formula:
## disruption_ord ~ flare_class_ord + flare_intensity + geomagnetic_index_Kp + solar_wind_speed + solar_wind_density + flare_duration_minutes
## data: df_model
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 1000 -0.00 16.00 29(0) 6.11e-09 1.4e+05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## flare_class_ord 20.936 NA NA NA
## flare_intensity 2786.913 NA NA NA
## geomagnetic_index_Kp 115.876 NA NA NA
## solar_wind_speed 692.937 NA NA NA
## solar_wind_density 3.993 NA NA NA
## flare_duration_minutes 358.411 NA NA NA
##
## Threshold coefficients:
## Estimate Std. Error z value
## 0|1 -1248 NA NA
## 1|2 3823 NA NA
The nominal_test() function from the
ordinal package tests whether each predictor violates the
proportional odds assumption. A significant result (p < 0.05) for a
given predictor means that predictor’s effect is NOT consistent across
thresholds, i.e., the parallel lines assumption is violated for that
predictor.
nominal_result <- ordinal::nominal_test(model_clm)
nominal_df <- as.data.frame(nominal_result)
nominal_df$Predictor <- rownames(nominal_df)
# Rename actual columns returned by ordinal::nominal_test: Df, logLik, AIC, LRT, Pr(>Chi)
names(nominal_df)[names(nominal_df) == "Df"] <- "df_val"
names(nominal_df)[names(nominal_df) == "LRT"] <- "LRT_stat"
names(nominal_df)[names(nominal_df) == "Pr(>Chi)"] <- "p_val"
nominal_df <- nominal_df %>%
dplyr::select(Predictor, LRT_stat, df_val, p_val) %>%
dplyr::mutate(
LRT_stat = round(LRT_stat, 4),
p_val = round(p_val, 4),
Decision = ifelse(is.na(p_val), "",
ifelse(p_val < 0.001, "Violated ***",
ifelse(p_val < 0.01, "Violated **",
ifelse(p_val < 0.05, "Violated *",
"Met (p >= 0.05)"))))
)
rownames(nominal_df) <- NULL
kable(nominal_df,
caption = "Score/Nominal Test for Proportional Odds Assumption per Predictor",
col.names = c("Predictor", "LRT Statistic", "df", "p-value", "Decision")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
column_spec(5,
color = ifelse(grepl("Violated", nominal_df$Decision), "red",
ifelse(nominal_df$Decision == "", "gray", "darkgreen")),
bold = TRUE)| Predictor | LRT Statistic | df | p-value | Decision |
|---|---|---|---|---|
| <none> | NA | NA | NA | |
| flare_class_ord | 0 | 1 | 0.9998 | Met (p >= 0.05) |
| flare_intensity | 0 | 1 | 0.9996 | Met (p >= 0.05) |
| geomagnetic_index_Kp | 0 | 1 | 0.9998 | Met (p >= 0.05) |
| solar_wind_speed | 0 | 1 | 0.9996 | Met (p >= 0.05) |
| solar_wind_density | 0 | 1 | 1.0000 | Met (p >= 0.05) |
| flare_duration_minutes | 0 | 1 | 0.9996 | Met (p >= 0.05) |
scale_result <- ordinal::scale_test(model_clm)
scale_df <- as.data.frame(scale_result)
scale_df$Predictor <- rownames(scale_df)
# Rename actual columns returned by ordinal::scale_test: Df, logLik, AIC, LRT, Pr(>Chi)
names(scale_df)[names(scale_df) == "Df"] <- "df_val"
names(scale_df)[names(scale_df) == "LRT"] <- "LRT_stat"
names(scale_df)[names(scale_df) == "Pr(>Chi)"] <- "p_val"
scale_df <- scale_df %>%
dplyr::select(Predictor, LRT_stat, df_val, p_val) %>%
dplyr::mutate(
LRT_stat = round(LRT_stat, 4),
p_val = round(p_val, 4),
Decision = ifelse(is.na(p_val), "",
ifelse(p_val < 0.05, "Violated *", "Met (p >= 0.05)"))
)
rownames(scale_df) <- NULL
kable(scale_df,
caption = "Scale Test for Proportional Odds Assumption",
col.names = c("Predictor", "LRT Statistic", "df", "p-value", "Decision")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
column_spec(5,
color = ifelse(grepl("Violated", scale_df$Decision), "red",
ifelse(scale_df$Decision == "", "gray", "darkgreen")),
bold = TRUE)| Predictor | LRT Statistic | df | p-value | Decision |
|---|---|---|---|---|
| <none> | NA | NA | NA | |
| flare_class_ord | 0 | 1 | 0.9993 | Met (p >= 0.05) |
| flare_intensity | 0 | 1 | 0.9993 | Met (p >= 0.05) |
| geomagnetic_index_Kp | 0 | 1 | 0.9993 | Met (p >= 0.05) |
| solar_wind_speed | 0 | 1 | 0.9993 | Met (p >= 0.05) |
| solar_wind_density | 0 | 1 | 0.9993 | Met (p >= 0.05) |
| flare_duration_minutes | 0 | 1 | 0.9993 | Met (p >= 0.05) |
The graphical approach fits two separate binary logistic regressions one for each cutpoint and plots the predicted probabilities. If the lines are roughly parallel, the proportional odds assumption holds. If they diverge substantially, the assumption is violated for that predictor.
# Binary outcomes for each threshold
df_model$cut1 <- as.integer(df_model$power_grid_disruption >= 1) # P(Y >= 1)
df_model$cut2 <- as.integer(df_model$power_grid_disruption >= 2) # P(Y >= 2)
num_preds_scaled <- c("flare_intensity", "geomagnetic_index_Kp",
"solar_wind_speed", "solar_wind_density",
"flare_duration_minutes")
plot_list_parallel <- lapply(num_preds_scaled, function(pvar) {
# Sequence across the range of the scaled predictor
xseq <- seq(min(df_model[[pvar]], na.rm = TRUE),
max(df_model[[pvar]], na.rm = TRUE),
length.out = 100)
# Fit binary logistic for each cutpoint (other predictors at mean = 0 after scaling)
fml1 <- as.formula(paste("cut1 ~", pvar))
fml2 <- as.formula(paste("cut2 ~", pvar))
glm1 <- glm(fml1, data = df_model, family = binomial)
glm2 <- glm(fml2, data = df_model, family = binomial)
newdat <- data.frame(x = xseq)
names(newdat) <- pvar
pred1 <- predict(glm1, newdata = newdat, type = "response")
pred2 <- predict(glm2, newdata = newdat, type = "response")
plot_df <- data.frame(
x = rep(xseq, 2),
prob = c(pred1, pred2),
Cutpoint = rep(c("P(Y \u2265 1) vs P(Y < 1)",
"P(Y \u2265 2) vs P(Y < 2)"), each = 100)
)
ggplot(plot_df, aes(x = x, y = prob, color = Cutpoint, linetype = Cutpoint)) +
geom_line(linewidth = 1) +
scale_color_manual(values = c("#2E86AB", "#E84855")) +
labs(title = pvar,
x = paste0(pvar, " (standardised)"),
y = "Predicted Probability",
color = "Cutpoint",
linetype = "Cutpoint") +
theme_minimal(base_size = 10) +
theme(plot.title = element_text(face = "bold", size = 10),
legend.position = "bottom",
legend.text = element_text(size = 8))
})
gridExtra::grid.arrange(grobs = plot_list_parallel, ncol = 2)Parallel Lines Check: Predicted Probabilities per Cutpoint
A more direct assessment compares the regression coefficients from two separate binary logistic regressions (one per cutpoint). If the coefficients are approximately equal, the proportional odds assumption is tenable for that predictor.
all_preds_formula <- c("flare_class_ord", "flare_intensity", "geomagnetic_index_Kp",
"solar_wind_speed", "solar_wind_density",
"flare_duration_minutes")
fml_cut1 <- as.formula(paste("cut1 ~", paste(all_preds_formula, collapse = " + ")))
fml_cut2 <- as.formula(paste("cut2 ~", paste(all_preds_formula, collapse = " + ")))
glm_cut1 <- glm(fml_cut1, data = df_model, family = binomial)
glm_cut2 <- glm(fml_cut2, data = df_model, family = binomial)
coef_df <- data.frame(
Predictor = all_preds_formula,
Beta_Cut1 = round(coef(glm_cut1)[-1], 4),
Beta_Cut2 = round(coef(glm_cut2)[-1], 4)
) %>%
dplyr::mutate(
Difference = round(abs(Beta_Cut1 - Beta_Cut2), 4),
Ratio = round(ifelse(Beta_Cut1 != 0, Beta_Cut2 / Beta_Cut1, NA), 3),
Assessment = ifelse(Difference < 0.3, "Parallel (small diff)",
ifelse(Difference < 0.8, "Moderate diff review",
"Large diff likely violated"))
)
rownames(coef_df) <- NULL
kable(coef_df,
caption = "Coefficient Comparison Across Two Cutpoints (Parallel Lines Check)",
col.names = c("Predictor", "β (cutpoint 1)", "β (cutpoint 2)",
"|Difference|", "Ratio (β2/β1)", "Assessment")) %>%
kable_styling(bootstrap_options = c("hover"), full_width = FALSE) %>%
column_spec(6,
color = ifelse(coef_df$Difference >= 0.8, "red",
ifelse(coef_df$Difference >= 0.3, "darkorange",
"darkgreen")),
bold = TRUE)| Predictor | β (cutpoint 1) | β (cutpoint 2) | |Difference| | Ratio (β2/β1) | Assessment |
|---|---|---|---|---|---|
| flare_class_ord | -1.6794 | -44.8658 | 43.1864 | 26.715 | Large diff likely violated |
| flare_intensity | 1672.6531 | 501.1540 | 1171.4991 | 0.300 | Large diff likely violated |
| geomagnetic_index_Kp | 70.8777 | 24.7647 | 46.1130 | 0.349 | Large diff likely violated |
| solar_wind_speed | 415.7930 | 104.1878 | 311.6052 | 0.251 | Large diff likely violated |
| solar_wind_density | -2.2573 | 13.4627 | 15.7200 | -5.964 | Large diff likely violated |
| flare_duration_minutes | 224.1975 | 46.5756 | 177.6219 | 0.208 | Large diff likely violated |
Conclusion for Assumption 6: The proportional odds
assumption should be interpreted in light of all three pieces of
evidence above: (a) the formal nominal test p-values, (b) visual
parallelism of the predicted probability curves, and (c) the magnitude
of coefficient differences across cutpoints. Predictors with a
significant nominal test AND visually diverging curves AND large
coefficient differences should be treated as violating the assumption.
In that case, consider fitting a Partial Proportional Odds
Model using
ordinal::clm(..., nominal = ~violated_predictor) to allow
that predictor to have separate coefficients per threshold while keeping
the remaining predictors under the proportional odds constraint.
summary_table <- data.frame(
No. = 1:6,
Assumption = c(
"Ordinal dependent variable",
"Independence of observations",
"No perfect multicollinearity (VIF)",
"No extreme outliers",
"Sufficient sample size per category",
"Proportional odds (parallel lines)"
),
How_Assessed = c(
"Domain/theoretical judgment",
"Study design review + temporal plot",
"VIF via OLS approximation; correlation matrix",
"Z-score (|z| > 3) + density plots",
"n per level vs. 10×p and 15×p thresholds",
"Nominal test (ordinal::clm) + graphical + coefficient comparison"
),
Typical_Result = c(
"Met ordered levels are theoretically meaningful",
"Met by design distinct events; note temporal caveat",
"Check VIF table monitor flare_class_ord vs flare_intensity",
"Monitor negative solar_wind_density values",
"Levels 0 & 1 adequate; Level 2 borderline; Level 3 absent",
"See nominal test results and parallel lines plots"
),
Action_If_Violated = c(
"Not applicable",
"Account for clustering or use GEE/mixed models",
"Remove or combine correlated predictors; use ridge regression",
"Clip/remove extreme values; sensitivity analysis",
"Acknowledge as limitation; avoid predicting absent levels",
"Use Partial Proportional Odds Model (clm with nominal argument)"
)
)
kable(summary_table,
caption = "Comprehensive Assumption Checklist for Ordinal Regression",
col.names = c("#", "Assumption", "How Assessed",
"Expected Outcome / Notes", "Remediation if Violated")) %>%
kable_styling(bootstrap_options = c("hover", "condensed"),
full_width = TRUE, font_size = 11) %>%
scroll_box(width = "100%")| # | Assumption | How Assessed | Expected Outcome / Notes | Remediation if Violated |
|---|---|---|---|---|
| 1 | Ordinal dependent variable | Domain/theoretical judgment | Met ordered levels are theoretically meaningful | Not applicable |
| 2 | Independence of observations | Study design review + temporal plot | Met by design distinct events; note temporal caveat | Account for clustering or use GEE/mixed models |
| 3 | No perfect multicollinearity (VIF) | VIF via OLS approximation; correlation matrix | Check VIF table monitor flare_class_ord vs flare_intensity | Remove or combine correlated predictors; use ridge regression |
| 4 | No extreme outliers | Z-score (|z| > 3) + density plots | Monitor negative solar_wind_density values | Clip/remove extreme values; sensitivity analysis |
| 5 | Sufficient sample size per category | n per level vs. 10×p and 15×p thresholds | Levels 0 & 1 adequate; Level 2 borderline; Level 3 absent | Acknowledge as limitation; avoid predicting absent levels |
| 6 | Proportional odds (parallel lines) | Nominal test (ordinal::clm) + graphical + coefficient comparison | See nominal test results and parallel lines plots | Use Partial Proportional Odds Model (clm with nominal argument) |
Based on the assumption checks above, the following steps are recommended before proceeding to interpretation of the ordinal regression coefficients:
Handle negative solar_wind_density
values clip to zero or remove those rows, then re-run VIF and outlier
checks.
Monitor multicollinearity between
flare_class_ord and flare_intensity. If VIF
> 5 for either, test the model with and without one of them, or use
only flare_intensity (which is continuous and more
information-rich).
Acknowledge Level 3 absence in all interpretations the model estimates thresholds only for P(Y ≤ 0) and P(Y ≤ 1); the Level 3 threshold cannot be estimated.
If the proportional odds assumption is violated for one or more predictors, fit a Partial Proportional Odds Model:
Use stratified train/test split (80/20,
stratify = y) for predictive modelling to maintain
proportional class representation, especially for the minority class
(Level 2).
Document generated using R Markdown. All analyses conducted in R
(version ≥ 4.1.0). Dataset: Solar Storm Impact on Power Grids (1,000
simulated events). Primary packages: ordinal,
car, MASS, ggplot2,
dplyr.