This secondary data analysis examines U.S. retail gasoline prices from 2010 to 2024, using data from the U.S. Energy Information Administration (EIA) and the Federal Reserve Economic Data (FRED) database.
Research Question: How do political and economic events affect U.S. gas prices, and can we identify patterns and clusters in price behavior over time?
Hypotheses tested:
# Install any missing packages automatically
required_packages <- c("ggplot2", "dplyr", "tidyr", "cluster",
"factoextra", "scales", "lubridate", "ggrepel")
for (pkg in required_packages) {
if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
library(ggplot2)
library(dplyr)
library(tidyr)
library(cluster)
library(factoextra)
library(scales)
library(lubridate)
library(ggrepel)
The dataset below contains monthly average U.S. retail gasoline prices (regular grade, $/gallon) from January 2010 through December 2024, sourced from EIA Monthly Energy Review Table 9.4. Crude oil (WTI, $/barrel) annual averages are included for regression.
# ── Monthly Gas Price Data (EIA, Regular Grade, All Formulations) ──────────────
gas_monthly <- data.frame(
date = seq(as.Date("2010-01-01"), as.Date("2024-12-01"), by = "month"),
gas_price = c(
# 2010
2.73,2.69,2.78,2.86,2.87,2.75,2.74,2.74,2.76,2.84,2.87,3.02,
# 2011
3.10,3.19,3.56,3.81,3.96,3.70,3.65,3.64,3.59,3.47,3.41,3.28,
# 2012
3.38,3.58,3.83,3.92,3.73,3.54,3.45,3.72,3.85,3.71,3.44,3.30,
# 2013
3.30,3.64,3.70,3.52,3.62,3.58,3.52,3.55,3.54,3.34,3.19,3.25,
# 2014
3.31,3.36,3.53,3.67,3.67,3.68,3.60,3.51,3.39,3.17,2.91,2.59,
# 2015
2.12,2.12,2.44,2.45,2.72,2.78,2.77,2.63,2.35,2.33,2.21,2.03,
# 2016
1.88,1.74,1.83,2.07,2.27,2.38,2.24,2.18,2.22,2.24,2.15,2.24,
# 2017
2.35,2.30,2.31,2.41,2.40,2.38,2.30,2.35,2.67,2.45,2.55,2.48,
# 2018
2.55,2.58,2.65,2.81,2.97,2.98,2.88,2.85,2.88,2.90,2.65,2.38,
# 2019
2.25,2.32,2.54,2.87,2.87,2.69,2.72,2.62,2.58,2.63,2.63,2.58,
# 2020
2.57,2.44,2.34,1.83,1.87,2.07,2.18,2.18,2.18,2.19,2.11,2.25,
# 2021
2.39,2.53,2.87,2.88,3.04,3.09,3.16,3.18,3.18,3.29,3.40,3.31,
# 2022
3.32,3.54,4.24,4.10,4.46,4.99,4.65,3.99,3.68,3.76,3.68,3.18,
# 2023
3.27,3.46,3.53,3.66,3.57,3.59,3.53,3.84,3.84,3.62,3.35,3.12,
# 2024
3.23,3.32,3.43,3.65,3.61,3.47,3.25,3.28,3.22,3.18,3.07,3.02
)
)
# Add time features
gas_monthly <- gas_monthly %>%
mutate(
year = year(date),
month = month(date),
month_num = as.numeric(format(date, "%Y")) +
(as.numeric(format(date, "%m")) - 1) / 12,
# Flag major political/economic events
event = case_when(
date >= as.Date("2022-02-01") & date <= as.Date("2022-08-01") ~
"Russia-Ukraine War",
date >= as.Date("2020-03-01") & date <= as.Date("2020-06-01") ~
"COVID-19 Crash",
date >= as.Date("2014-11-01") & date <= as.Date("2016-03-01") ~
"OPEC Supply Glut",
TRUE ~ "Normal"
)
)
# ── Annual Summary ──────────────────────────────────────────────────────────────
gas_annual <- gas_monthly %>%
group_by(year) %>%
summarise(
avg_price = mean(gas_price),
max_price = max(gas_price),
min_price = min(gas_price),
.groups = "drop"
)
# ── WTI Crude Oil Annual Averages (EIA, $/barrel) ──────────────────────────────
crude_annual <- data.frame(
year = 2010:2024,
crude_wti = c(79.4, 95.0, 94.0, 97.9, 93.0, 48.7,
43.3, 50.8, 64.9, 57.0, 41.5, 68.1,
95.0, 77.6, 77.3)
)
# Merge annual datasets
annual_df <- left_join(gas_annual, crude_annual, by = "year")
cat("Monthly observations:", nrow(gas_monthly), "\n")
## Monthly observations: 180
cat("Date range:", as.character(min(gas_monthly$date)),
"to", as.character(max(gas_monthly$date)), "\n")
## Date range: 2010-01-01 to 2024-12-01
cat("Price range: $", round(min(gas_monthly$gas_price), 2),
"to $", round(max(gas_monthly$gas_price), 2), "per gallon\n")
## Price range: $ 1.74 to $ 4.99 per gallon
# Key event labels for annotation
event_labels <- data.frame(
date = as.Date(c("2014-06-01","2016-02-01","2020-04-01","2022-06-01")),
# June 2014 ($3.68) is the last high before the slide; OPEC left output unchanged in Nov 2014
price = c(3.68, 1.74, 1.83, 4.99),
label = c("Pre-glut peak\n(2014)", "Supply glut low\n(2016)",
"COVID crash\n(2020)", "Russia-Ukraine\npeak $4.99 (2022)")
)
ggplot(gas_monthly, aes(x = date, y = gas_price)) +
# Shaded event periods
annotate("rect", xmin = as.Date("2014-11-01"), xmax = as.Date("2016-03-01"),
ymin = -Inf, ymax = Inf, fill = "#3A86FF", alpha = 0.08) +
annotate("rect", xmin = as.Date("2020-03-01"), xmax = as.Date("2020-06-01"),
ymin = -Inf, ymax = Inf, fill = "#FF006E", alpha = 0.1) +
annotate("rect", xmin = as.Date("2022-02-01"), xmax = as.Date("2022-08-01"),
ymin = -Inf, ymax = Inf, fill = "#FFBE0B", alpha = 0.1) +
# Price line
geom_line(color = "#F4B942", linewidth = 1.2) +
geom_point(data = filter(gas_monthly,
date %in% as.Date(c("2022-06-01","2020-04-01","2016-02-01"))),
aes(x = date, y = gas_price),
color = "white", size = 3) +
# Annotations
geom_label_repel(data = event_labels,
aes(x = date, y = price, label = label),
fill = "#1a1a1a", color = "white", size = 3,
box.padding = 0.4, segment.color = "gray50") +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
scale_y_continuous(labels = dollar_format(prefix = "$"), limits = c(1.5, 5.4)) +
labs(
title = "U.S. Monthly Retail Gasoline Prices (2010–2024)",
subtitle = "Regular Grade, All Formulations — Source: EIA Monthly Energy Review",
x = NULL, y = "Price ($/gallon)",
caption = "Shaded regions: OPEC Glut (blue) | COVID crash (red) | Russia-Ukraine war (gold)"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#111111", color = NA),
panel.background = element_rect(fill = "#111111", color = NA),
panel.grid.major = element_line(color = "#2a2a2a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "white", face = "bold", size = 14),
plot.subtitle = element_text(color = "#aaaaaa", size = 10),
plot.caption = element_text(color = "#666666", size = 8),
axis.text = element_text(color = "#aaaaaa"),
axis.title = element_text(color = "#aaaaaa"),
axis.text.x = element_text(angle = 45, hjust = 1)
)
Analysis: The line chart shows three distinct price shocks over the 15-year period. During the 2014–2016 supply glut, prices fell from about $3.68 (June 2014) to a low of $1.74/gallon (February 2016). The COVID-19 demand collapse pushed prices down to $1.83 in April 2020. Most dramatically, after Russia's invasion of Ukraine in February 2022, prices climbed to the series high of $4.99/gallon in June 2022. The timing is consistent with Hypothesis 1, that geopolitical events produce measurable price spikes, although the chart alone does not establish causation.
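To put the three shocks on a common scale, the swings can be expressed as percent changes. The sketch below is illustrative only: the helper `pct_change()` and the choice of start months (June 2014, January 2020, January 2022) are assumptions made here, while the prices are taken from the monthly series above.

```r
# Percent change between two price levels (negative = decline).
# Hypothetical helper, not part of the original analysis.
pct_change <- function(from, to) (to / from - 1) * 100

# Start and end prices copied from the monthly series;
# the start months are an assumption chosen for illustration.
shocks <- data.frame(
  episode = c("Supply glut (Jun 2014 to Feb 2016)",
              "COVID-19 (Jan 2020 to Apr 2020)",
              "Russia-Ukraine (Jan 2022 to Jun 2022)"),
  from = c(3.68, 2.57, 3.32),
  to   = c(1.74, 1.83, 4.99)
)
shocks$pct <- round(pct_change(shocks$from, shocks$to), 1)
print(shocks)
```

Under these assumed windows, the glut roughly halved the price, the COVID-19 drop was a bit under 30%, and the 2022 run-up was about 50%.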
ggplot(gas_annual, aes(x = year)) +
geom_ribbon(aes(ymin = min_price, ymax = max_price),
fill = "#F4B942", alpha = 0.2) +
geom_line(aes(y = avg_price), color = "#F4B942", linewidth = 1.5) +
geom_point(aes(y = avg_price), color = "white", size = 2.5) +
geom_text(aes(y = avg_price, label = paste0("$", round(avg_price, 2))),
vjust = -1, color = "white", size = 3) +
scale_x_continuous(breaks = 2010:2024) +
scale_y_continuous(labels = dollar_format(prefix = "$"),
limits = c(1.2, 5.5)) +
labs(
title = "Annual Average U.S. Gas Price with Min/Max Range (2010–2024)",
subtitle = "Gold band shows full-year price range per year — Source: EIA",
x = "Year", y = "Price ($/gallon)"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#111111", color = NA),
panel.background = element_rect(fill = "#111111", color = NA),
panel.grid.major = element_line(color = "#2a2a2a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "white", face = "bold", size = 13),
plot.subtitle = element_text(color = "#aaaaaa", size = 10),
axis.text = element_text(color = "#aaaaaa"),
axis.title = element_text(color = "#aaaaaa"),
axis.text.x = element_text(angle = 45, hjust = 1)
)
Analysis: The gold ribbon shows the within-year price range (highest minus lowest monthly average). Wide ribbons mark unstable years: 2022 has the widest range ($3.18–$4.99, a $1.81 spread), followed by 2014 ($1.09) and 2021 ($1.01). The narrowest ribbons belong to 2010 ($0.33) and 2017 ($0.37). Despite its low average, 2020 still spans $0.74 because of the spring COVID-19 drop.
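The width of the ribbon can also be read off numerically. The following is a minimal sketch that ranks years by within-year spread; the minimum and maximum values are transcribed from the monthly series above, and the data frame name `annual_range` is introduced here only for illustration.

```r
# Yearly minimum and maximum monthly prices, transcribed from the monthly data
annual_range <- data.frame(
  year = 2010:2024,
  min_price = c(2.69, 3.10, 3.30, 3.19, 2.59, 2.03, 1.74, 2.30,
                2.38, 2.25, 1.83, 2.39, 3.18, 3.12, 3.02),
  max_price = c(3.02, 3.96, 3.92, 3.70, 3.68, 2.78, 2.38, 2.67,
                2.98, 2.87, 2.57, 3.40, 4.99, 3.84, 3.65)
)

# Spread = ribbon width for that year
annual_range$spread <- annual_range$max_price - annual_range$min_price

# Sort from widest to narrowest ribbon
ranked <- annual_range[order(-annual_range$spread), ]
print(head(ranked, 3))  # most volatile years
print(tail(ranked, 3))  # most stable years
```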
seasonal <- gas_monthly %>%
group_by(month) %>%
summarise(
avg_price = mean(gas_price),
se = sd(gas_price) / sqrt(n()),
.groups = "drop"
) %>%
mutate(month_name = month.abb[month])
ggplot(seasonal, aes(x = factor(month_name, levels = month.abb),
y = avg_price)) +
geom_col(fill = "#F4B942", alpha = 0.85, width = 0.7) +
geom_errorbar(aes(ymin = avg_price - se, ymax = avg_price + se),
width = 0.3, color = "white", linewidth = 0.7) +
geom_text(aes(label = paste0("$", round(avg_price, 2))),
vjust = -0.5, color = "white", size = 3.2) +
scale_y_continuous(labels = dollar_format(prefix = "$"),
limits = c(0, 4)) +
labs(
title = "Average Gas Price by Month (2010–2024)",
subtitle = "Seasonal patterns — error bars show standard error — Source: EIA",
x = "Month", y = "Avg Price ($/gallon)"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#111111", color = NA),
panel.background = element_rect(fill = "#111111", color = NA),
panel.grid.major = element_line(color = "#2a2a2a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "white", face = "bold", size = 13),
plot.subtitle = element_text(color = "#aaaaaa", size = 10),
axis.text = element_text(color = "#aaaaaa"),
axis.title = element_text(color = "#aaaaaa")
)
Analysis: Averaged across years, gas prices follow a seasonal pattern: they rise from about $2.78 in January to a peak of about $3.18 in May and June (the summer driving season and summer fuel-blend requirements), stay elevated through August, and then decline in fall and winter. This is consistent with EIA seasonal demand data. The error bars show the standard error across the 15 years; they are large because the overall price level varies so much from year to year (e.g., $4.99 in June 2022 vs. $2.07 in June 2020).
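One way to separate the seasonal shape from the overall price level is a simple seasonal index (monthly mean divided by the overall mean). The sketch below uses base R only and re-enters the monthly prices so that it runs on its own; the matrix layout and the name `seasonal_index` are assumptions of this example, not part of the original analysis.

```r
# Monthly prices 2010-2024, one row of 12 values per year (same values as above)
gas_price <- c(
  2.73,2.69,2.78,2.86,2.87,2.75,2.74,2.74,2.76,2.84,2.87,3.02,
  3.10,3.19,3.56,3.81,3.96,3.70,3.65,3.64,3.59,3.47,3.41,3.28,
  3.38,3.58,3.83,3.92,3.73,3.54,3.45,3.72,3.85,3.71,3.44,3.30,
  3.30,3.64,3.70,3.52,3.62,3.58,3.52,3.55,3.54,3.34,3.19,3.25,
  3.31,3.36,3.53,3.67,3.67,3.68,3.60,3.51,3.39,3.17,2.91,2.59,
  2.12,2.12,2.44,2.45,2.72,2.78,2.77,2.63,2.35,2.33,2.21,2.03,
  1.88,1.74,1.83,2.07,2.27,2.38,2.24,2.18,2.22,2.24,2.15,2.24,
  2.35,2.30,2.31,2.41,2.40,2.38,2.30,2.35,2.67,2.45,2.55,2.48,
  2.55,2.58,2.65,2.81,2.97,2.98,2.88,2.85,2.88,2.90,2.65,2.38,
  2.25,2.32,2.54,2.87,2.87,2.69,2.72,2.62,2.58,2.63,2.63,2.58,
  2.57,2.44,2.34,1.83,1.87,2.07,2.18,2.18,2.18,2.19,2.11,2.25,
  2.39,2.53,2.87,2.88,3.04,3.09,3.16,3.18,3.18,3.29,3.40,3.31,
  3.32,3.54,4.24,4.10,4.46,4.99,4.65,3.99,3.68,3.76,3.68,3.18,
  3.27,3.46,3.53,3.66,3.57,3.59,3.53,3.84,3.84,3.62,3.35,3.12,
  3.23,3.32,3.43,3.65,3.61,3.47,3.25,3.28,3.22,3.18,3.07,3.02
)

# Rows = months (Jan..Dec), columns = years (2010..2024)
m <- matrix(gas_price, nrow = 12, dimnames = list(month.abb, 2010:2024))

month_mean <- rowMeans(m)                      # average price per calendar month
month_se   <- apply(m, 1, sd) / sqrt(ncol(m))  # standard error across years
seasonal_index <- month_mean / mean(m)         # > 1 means above the overall average

print(round(cbind(month_mean, month_se, seasonal_index), 3))
```

An index above 1 marks months that are, on average, more expensive than the overall mean; comparing `month_se` with the gap between the highest and lowest `month_mean` gives a rough sense of how strong the seasonal signal is relative to year-to-year noise.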
# Simple linear regression: crude WTI → gas price
lm_simple <- lm(avg_price ~ crude_wti, data = annual_df)
summary(lm_simple)
##
## Call:
## lm(formula = avg_price ~ crude_wti, data = annual_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39593 -0.10123 -0.00327 0.08059 0.37966
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.07858 0.20739 5.201 0.000171 ***
## crude_wti 0.02672 0.00277 9.645 2.73e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.211 on 13 degrees of freedom
## Multiple R-squared: 0.8774, Adjusted R-squared: 0.868
## F-statistic: 93.03 on 1 and 13 DF, p-value: 2.731e-07
# Prediction values for smooth line
pred_df <- data.frame(crude_wti = seq(min(annual_df$crude_wti),
max(annual_df$crude_wti), length.out = 100))
pred_df$predicted <- predict(lm_simple, newdata = pred_df)
ggplot(annual_df, aes(x = crude_wti, y = avg_price)) +
geom_point(color = "#F4B942", size = 4) +
geom_label_repel(aes(label = year), color = "white", fill = "#1a1a1a",
size = 3, box.padding = 0.3) +
geom_line(data = pred_df, aes(x = crude_wti, y = predicted),
color = "white", linewidth = 1, linetype = "dashed") +
scale_x_continuous(labels = dollar_format(prefix = "$")) +
scale_y_continuous(labels = dollar_format(prefix = "$")) +
labs(
title = "Simple Linear Regression: WTI Crude Oil vs. U.S. Gas Price",
subtitle = paste0("R² = ", round(summary(lm_simple)$r.squared, 3),
" | Each point = annual average | Source: EIA"),
x = "WTI Crude Oil Price ($/barrel)",
y = "Avg Retail Gas Price ($/gallon)"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#111111", color = NA),
panel.background = element_rect(fill = "#111111", color = NA),
panel.grid.major = element_line(color = "#2a2a2a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "white", face = "bold", size = 13),
plot.subtitle = element_text(color = "#aaaaaa", size = 10),
axis.text = element_text(color = "#aaaaaa"),
axis.title = element_text(color = "#aaaaaa")
)
Analysis: The simple linear regression shows a strong positive relationship between annual WTI crude oil prices and retail gasoline prices. With R² = 0.877, crude oil price alone accounts for about 88% of the year-to-year variation in the annual average gas price (n = 15 annual observations). The fitted equation is:
Gas Price = 1.079 + 0.027 × (WTI Crude)
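This means that each $1 increase in crude oil per barrel is associated with an increase of approximately 2.7 cents per gallon at the pump. Because a barrel holds 42 gallons, $1 per barrel is about 2.4 cents per gallon of crude, so the slope implies close to full pass-through of crude costs. The intercept of about $1.08 represents the part of the pump price that does not move with crude (refining, distribution, and taxes), broadly in line with EIA estimates that crude oil accounts for ~50–60% of the pump price.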
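The unit conversion behind this interpretation can be checked with a few lines of arithmetic. The sketch below hard-codes the fitted coefficients from the regression output above; `predict_gas()` is a hypothetical convenience function, and the crude prices passed to it are arbitrary illustrative inputs.

```r
# Coefficients copied from the simple regression output
intercept     <- 1.07858   # $/gallon
slope_per_bbl <- 0.02672   # $/gallon per $1/barrel of WTI

# A barrel is 42 U.S. gallons, so a $1/barrel move is this many $ per gallon of crude
gallons_per_barrel <- 42
crude_cost_per_gal <- 1 / gallons_per_barrel

# Ratio > 1 means pump prices move slightly more than the raw crude cost
pass_through <- slope_per_bbl / crude_cost_per_gal

# Hypothetical helper: fitted annual pump price for a given WTI level
predict_gas <- function(wti) intercept + slope_per_bbl * wti

cat("Crude cost per gallon for $1/bbl:", round(crude_cost_per_gal, 4), "\n")
cat("Implied pass-through ratio:", round(pass_through, 2), "\n")
print(round(predict_gas(c(60, 80, 100)), 2))
```

These predictions apply to annual averages only and should not be extrapolated far outside the observed crude range of roughly $41 to $98 per barrel.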
# Multiple regression adding time trend
lm_multi <- lm(avg_price ~ crude_wti + year, data = annual_df)
summary(lm_multi)
##
## Call:
## lm(formula = avg_price ~ crude_wti + year, data = annual_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.13476 -0.09793 -0.02065 0.09635 0.17731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -79.294783 14.935826 -5.309 0.000186 ***
## crude_wti 0.029114 0.001623 17.940 4.94e-10 ***
## year 0.039762 0.007389 5.381 0.000165 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1189 on 12 degrees of freedom
## Multiple R-squared: 0.9641, Adjusted R-squared: 0.9581
## F-statistic: 161 on 2 and 12 DF, p-value: 2.148e-09
# Model comparison
cat("\n── Model Comparison ──\n")
##
## ── Model Comparison ──
cat("Simple regression R²: ", round(summary(lm_simple)$r.squared, 4), "\n")
## Simple regression R²: 0.8774
cat("Multiple regression R²: ", round(summary(lm_multi)$r.squared, 4), "\n")
## Multiple regression R²: 0.9641
cat("Improvement: +",
round((summary(lm_multi)$r.squared - summary(lm_simple)$r.squared) * 100, 2),
"percentage points\n")
## Improvement: + 8.67 percentage points
Analysis: Adding a linear year trend improves model fit: adjusted R² rises from 0.868 to 0.958, and the residual standard error falls from 0.211 to 0.119. Holding crude price constant, the average pump price rises by about 4 cents per gallon per year (coefficient 0.0398). This trend plausibly reflects inflation, infrastructure costs, and evolving environmental regulations, although the model itself cannot separate these causes. Both crude oil price and the year trend are statistically significant predictors of retail gas price (p < 0.001).
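Because R² can never decrease when a predictor is added, a nested-model comparison is a useful complement to the raw R² difference. The sketch below is one way to do this with base R's `anova()` and `AIC()`; the annual averages are recomputed from the monthly series and rounded to four decimals, and the object names `fit_simple` and `fit_multi` are chosen here to avoid clashing with the models above.

```r
# Annual averages recomputed from the monthly data (rounded), plus WTI from above
annual_df <- data.frame(
  year = 2010:2024,
  avg_price = c(2.8042, 3.5300, 3.6208, 3.4792, 3.3658, 2.4125, 2.1200, 2.4125,
                2.7567, 2.6083, 2.1842, 3.0267, 3.9658, 3.5317, 3.3108),
  crude_wti = c(79.4, 95.0, 94.0, 97.9, 93.0, 48.7, 43.3, 50.8,
                64.9, 57.0, 41.5, 68.1, 95.0, 77.6, 77.3)
)

fit_simple <- lm(avg_price ~ crude_wti, data = annual_df)
fit_multi  <- lm(avg_price ~ crude_wti + year, data = annual_df)

# F-test for the nested models: does adding `year` reduce residual variance
# by more than chance would explain?
cmp <- anova(fit_simple, fit_multi)
print(cmp)

# Adjusted R-squared and AIC both penalize the extra parameter
cat("Adjusted R2 (simple):", round(summary(fit_simple)$adj.r.squared, 4), "\n")
cat("Adjusted R2 (multi): ", round(summary(fit_multi)$adj.r.squared, 4), "\n")
print(AIC(fit_simple, fit_multi))
```

With a single added term, the F-test here is equivalent to the t-test on the `year` coefficient, so it should lead to the same conclusion as the summary table.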
par(mfrow = c(1, 2), bg = "#111111", col.axis = "white",
col.lab = "white", col.main = "white", fg = "white")
plot(lm_simple, which = 1, col = "#F4B942", pch = 19,
main = "Residuals vs Fitted")
plot(lm_simple, which = 2, col = "#F4B942", pch = 19,
main = "Normal Q-Q Plot")
par(mfrow = c(1, 1))
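Analysis: The residuals vs. fitted plot shows no strong curvature, but the largest residuals belong to 2010 (about −$0.40) and 2023 (about +$0.38), with 2022 close behind (about +$0.35). The early years (2010–2014) mostly fall below the fitted line and the recent years (2021–2024) all fall above it, which is the pattern the year trend in the multiple regression absorbs. The Q-Q plot is approximately linear, so the residuals look roughly normal. With only 15 observations this check is indicative rather than conclusive, and normality is only one of the assumptions behind valid regression inference.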
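The diagnostic plots can be supplemented by listing the residuals by year. The following sketch refits the simple model on annual averages recomputed from the monthly series (rounded to four decimals) and adds a Shapiro-Wilk test as one possible numeric normality check; the names `fit` and `res` are introduced here for illustration.

```r
# Annual averages recomputed from the monthly data (rounded), plus WTI from above
annual_df <- data.frame(
  year = 2010:2024,
  avg_price = c(2.8042, 3.5300, 3.6208, 3.4792, 3.3658, 2.4125, 2.1200, 2.4125,
                2.7567, 2.6083, 2.1842, 3.0267, 3.9658, 3.5317, 3.3108),
  crude_wti = c(79.4, 95.0, 94.0, 97.9, 93.0, 48.7, 43.3, 50.8,
                64.9, 57.0, 41.5, 68.1, 95.0, 77.6, 77.3)
)

fit <- lm(avg_price ~ crude_wti, data = annual_df)

# Residual per year, sorted from most negative to most positive
res <- data.frame(year = annual_df$year, resid = round(residuals(fit), 3))
print(res[order(res$resid), ], row.names = FALSE)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals.
# With n = 15 the test has little power, so treat it as a rough check only.
sw <- shapiro.test(residuals(fit))
print(sw)
```

Reading the sorted table by year makes the time pattern in the residuals easier to see than the fitted-value axis of the diagnostic plot.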
We cluster years by their average gas price, minimum price, maximum price, and price volatility (range) to identify distinct market regimes.
# Build feature matrix for clustering
cluster_df <- gas_annual %>%
mutate(volatility = max_price - min_price) %>%
select(year, avg_price, min_price, max_price, volatility)
# Standardize features so no single variable dominates the Euclidean distance
cluster_scaled <- scale(cluster_df[, -1])
rownames(cluster_scaled) <- cluster_df$year
# Determine optimal K using elbow method
set.seed(42)
wss <- sapply(1:8, function(k) {
kmeans(cluster_scaled, centers = k, nstart = 25)$tot.withinss
})
elbow_df <- data.frame(k = 1:8, wss = wss)
ggplot(elbow_df, aes(x = k, y = wss)) +
geom_line(color = "#F4B942", linewidth = 1.2) +
geom_point(color = "white", size = 3) +
geom_vline(xintercept = 3, color = "#F4B942", linetype = "dashed", alpha = 0.6) +
annotate("text", x = 3.2, y = max(wss) * 0.85,
label = "Optimal K = 3", color = "#F4B942", size = 4, hjust = 0) +
scale_x_continuous(breaks = 1:8) +
labs(
title = "Elbow Method – Optimal Number of Clusters",
subtitle = "Within-cluster sum of squares drops sharply at K = 3",
x = "Number of Clusters (K)", y = "Total Within-Cluster SS"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#111111", color = NA),
panel.background = element_rect(fill = "#111111", color = NA),
panel.grid.major = element_line(color = "#2a2a2a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "white", face = "bold", size = 13),
plot.subtitle = element_text(color = "#aaaaaa", size = 10),
axis.text = element_text(color = "#aaaaaa"),
axis.title = element_text(color = "#aaaaaa")
)
Analysis: The elbow plot suggests K = 3. The within-cluster sum of squares drops sharply from K = 1 to K = 3 and flattens afterward, so adding clusters beyond 3 yields diminishing reductions in within-cluster variance. With only 15 yearly observations, the elbow is a heuristic guide rather than a formal test.
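Since the elbow is judged by eye, an average silhouette width offers a second, independent heuristic for the number of clusters. The sketch below is not part of the original analysis; it uses `silhouette()` from the `cluster` package already loaded above, rebuilds the four features from values transcribed from the data (annual averages rounded to four decimals), and the names `feat` and `avg_sil` are assumptions of this example.

```r
library(cluster)  # provides silhouette()

# Features per year (2010-2024), transcribed from the annual summary
feat <- data.frame(
  avg_price = c(2.8042, 3.5300, 3.6208, 3.4792, 3.3658, 2.4125, 2.1200, 2.4125,
                2.7567, 2.6083, 2.1842, 3.0267, 3.9658, 3.5317, 3.3108),
  min_price = c(2.69, 3.10, 3.30, 3.19, 2.59, 2.03, 1.74, 2.30,
                2.38, 2.25, 1.83, 2.39, 3.18, 3.12, 3.02),
  max_price = c(3.02, 3.96, 3.92, 3.70, 3.68, 2.78, 2.38, 2.67,
                2.98, 2.87, 2.57, 3.40, 4.99, 3.84, 3.65)
)
feat$volatility <- feat$max_price - feat$min_price
x <- scale(feat)  # standardize, as in the K-means step

set.seed(42)
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  # Mean silhouette width: closer to 1 means tighter, better-separated clusters
  mean(silhouette(km$cluster, dist(x))[, "sil_width"])
})

print(data.frame(k = ks, avg_silhouette = round(avg_sil, 3)))
```

If the silhouette width peaks at a different K than the elbow suggests, that is a sign the cluster structure is weak and the choice of K should be reported as a judgment call.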
set.seed(42)
km_result <- kmeans(cluster_scaled, centers = 3, nstart = 25)
cluster_df$cluster <- factor(km_result$cluster)
# Label each cluster meaningfully
cluster_labels <- cluster_df %>%
group_by(cluster) %>%
summarise(avg = mean(avg_price), .groups = "drop") %>%
arrange(avg) %>%
mutate(label = c("Low-Price Era", "Mid-Price Era", "High-Price Era"))
cluster_df <- cluster_df %>%
left_join(cluster_labels %>% select(cluster, label), by = "cluster")
# Plot: avg price by year, colored by cluster
ggplot(cluster_df, aes(x = year, y = avg_price, color = label, fill = label)) +
geom_col(alpha = 0.8, width = 0.7) +
geom_text(aes(label = paste0("$", round(avg_price, 2))),
vjust = -0.4, color = "white", size = 3) +
scale_color_manual(values = c("Low-Price Era" = "#3A86FF",
"Mid-Price Era" = "#F4B942",
"High-Price Era" = "#FF006E")) +
scale_fill_manual(values = c("Low-Price Era" = "#3A86FF",
"Mid-Price Era" = "#F4B942",
"High-Price Era" = "#FF006E")) +
scale_x_continuous(breaks = 2010:2024) +
scale_y_continuous(labels = dollar_format(prefix = "$"), limits = c(0, 5)) +
labs(
title = "K-Means Clustering: Gas Price Eras (2010–2024)",
subtitle = "3 clusters identified — Low (blue), Mid (gold), High (red) price regimes",
x = "Year", y = "Avg Price ($/gallon)", color = "Cluster", fill = "Cluster"
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#111111", color = NA),
panel.background = element_rect(fill = "#111111", color = NA),
panel.grid.major = element_line(color = "#2a2a2a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "white", face = "bold", size = 13),
plot.subtitle = element_text(color = "#aaaaaa", size = 10),
axis.text = element_text(color = "#aaaaaa"),
axis.title = element_text(color = "#aaaaaa"),
legend.background = element_rect(fill = "#111111"),
legend.text = element_text(color = "white"),
legend.title = element_text(color = "white"),
axis.text.x = element_text(angle = 45, hjust = 1)
)
Analysis: K-means clustering reveals three distinct gas price regimes:
# Hierarchical clustering with complete linkage
hc_result <- hclust(dist(cluster_scaled), method = "complete")
# Custom dendrogram plot (dark background so the white titles stay visible)
par(bg = "#111111", fg = "white")
plot(hc_result,
labels = cluster_df$year,
main = "Hierarchical Clustering Dendrogram – Gas Price Eras",
sub = "Source: EIA | Features: avg, min, max price, volatility",
xlab = "Year",
ylab = "Distance",
col.main = "white",
col.lab = "white",
col.sub = "white",
col.axis = "white",
hang = -1,
cex = 0.9)
# Draw K=3 cut line
rect.hclust(hc_result, k = 3, border = c("#3A86FF", "#F4B942", "#FF006E"))
Analysis: The hierarchical dendrogram is consistent with the K-means result: cutting the tree at three clusters marks the same price-era groupings. Notably, 2022 sits alone in the high-price cluster. Its price dynamics ($3.18–$4.99 range, $3.97 average) are extreme enough that it forms its own branch, which underlines how unusual the price shock following Russia's invasion of Ukraine was within this 15-year sample.
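Agreement between the two clustering methods can be checked with a cross-tabulation instead of by eye. The sketch below is a minimal, self-contained version using feature values transcribed from the data (annual averages rounded to four decimals); the names `agreement` and `same_partition` are introduced here and are not part of the original analysis.

```r
# Features per year (2010-2024), transcribed from the annual summary
feat <- data.frame(
  avg_price = c(2.8042, 3.5300, 3.6208, 3.4792, 3.3658, 2.4125, 2.1200, 2.4125,
                2.7567, 2.6083, 2.1842, 3.0267, 3.9658, 3.5317, 3.3108),
  min_price = c(2.69, 3.10, 3.30, 3.19, 2.59, 2.03, 1.74, 2.30,
                2.38, 2.25, 1.83, 2.39, 3.18, 3.12, 3.02),
  max_price = c(3.02, 3.96, 3.92, 3.70, 3.68, 2.78, 2.38, 2.67,
                2.98, 2.87, 2.57, 3.40, 4.99, 3.84, 3.65)
)
feat$volatility <- feat$max_price - feat$min_price
x <- scale(feat)
rownames(x) <- 2010:2024

set.seed(42)
km <- kmeans(x, centers = 3, nstart = 25)
hc <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 3)  # cut the tree into 3 groups

# Cluster numbers are arbitrary labels, so compare via a cross-tabulation
agreement <- table(kmeans = km$cluster, hclust = hc_groups)
print(agreement)

# The partitions are identical if every row and column has exactly one nonzero cell
same_partition <- all(rowSums(agreement > 0) == 1) &&
  all(colSums(agreement > 0) == 1)
cat("Identical partitions:", same_partition, "\n")

# Years grouped by hierarchical cluster
print(split(names(hc_groups), hc_groups))
```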
yoy_df <- gas_annual %>%
arrange(year) %>%
mutate(
yoy_change = avg_price - lag(avg_price),
yoy_pct = (avg_price / lag(avg_price) - 1) * 100,
direction = ifelse(yoy_change >= 0, "Increase", "Decrease")
) %>%
filter(!is.na(yoy_change))
ggplot(yoy_df, aes(x = year, y = yoy_pct, fill = direction)) +
geom_col(width = 0.7, alpha = 0.9) +
geom_text(aes(label = paste0(ifelse(yoy_pct > 0, "+", ""),
round(yoy_pct, 1), "%")),
vjust = ifelse(yoy_df$yoy_pct >= 0, -0.4, 1.2),
color = "white", size = 3) +
geom_hline(yintercept = 0, color = "white", linewidth = 0.5) +
scale_fill_manual(values = c("Increase" = "#F4B942", "Decrease" = "#3A86FF")) +
scale_x_continuous(breaks = 2011:2024) +
labs(
title = "Year-Over-Year Change in U.S. Average Gas Prices",
subtitle = "% change from prior year — Source: EIA",
x = "Year", y = "YoY Change (%)", fill = NULL
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#111111", color = NA),
panel.background = element_rect(fill = "#111111", color = NA),
panel.grid.major = element_line(color = "#2a2a2a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "white", face = "bold", size = 13),
plot.subtitle = element_text(color = "#aaaaaa", size = 10),
axis.text = element_text(color = "#aaaaaa"),
axis.title = element_text(color = "#aaaaaa"),
legend.background = element_rect(fill = "#111111"),
legend.text = element_text(color = "white"),
axis.text.x = element_text(angle = 45, hjust = 1)
)
Analysis: The year-over-year chart isolates the magnitude of each price shock. In percentage terms, the largest increase was 2021 (about +38.6%), as prices rebounded from the COVID-19 low with economic reopening. 2022 added a further +31.0% following Russia's invasion of Ukraine, which was the largest increase in dollar terms (about +$0.94 per gallon). The largest single-year decline was in 2015 (−28.3%), reflecting the supply glut. 2020 shows a more moderate decline (about −16.3%) from COVID-19 demand loss.
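Percentage and dollar rankings of yearly changes need not coincide, because a percentage change depends on the prior year's base. The sketch below computes both in base R from annual averages recomputed from the monthly series (rounded to four decimals); the data frame name `yoy` is an assumption of this example.

```r
# Annual averages recomputed from the monthly data (rounded to 4 decimals)
annual <- data.frame(
  year = 2010:2024,
  avg_price = c(2.8042, 3.5300, 3.6208, 3.4792, 3.3658, 2.4125, 2.1200, 2.4125,
                2.7567, 2.6083, 2.1842, 3.0267, 3.9658, 3.5317, 3.3108)
)

# Change versus the previous year, in dollars and in percent
yoy <- data.frame(
  year       = annual$year[-1],
  yoy_change = diff(annual$avg_price),
  yoy_pct    = diff(annual$avg_price) / head(annual$avg_price, -1) * 100
)
print(round(yoy, 2))

cat("Largest % increase:  ", yoy$year[which.max(yoy$yoy_pct)], "\n")
cat("Largest $ increase:  ", yoy$year[which.max(yoy$yoy_change)], "\n")
cat("Largest % decrease:  ", yoy$year[which.min(yoy$yoy_pct)], "\n")
```

The 2021 rebound starts from the depressed 2020 base, which is why it ranks first in percent while 2022 ranks first in dollars.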
ggplot(gas_monthly, aes(x = reorder(event, gas_price, median),
y = gas_price, fill = event)) +
geom_boxplot(alpha = 0.8, outlier.color = "white", outlier.size = 2) +
geom_jitter(width = 0.15, alpha = 0.3, color = "white", size = 1) +
scale_fill_manual(values = c(
"Normal" = "#F4B942",
"OPEC Supply Glut" = "#3A86FF",
"COVID-19 Crash" = "#FF006E",
"Russia-Ukraine War" = "#FF6B35"
)) +
scale_y_continuous(labels = dollar_format(prefix = "$")) +
labs(
title = "Gas Price Distribution by Event Period",
subtitle = "Each dot = one monthly observation — Source: EIA",
x = NULL, y = "Gas Price ($/gallon)", fill = NULL
) +
theme_minimal(base_size = 12) +
theme(
plot.background = element_rect(fill = "#111111", color = NA),
panel.background = element_rect(fill = "#111111", color = NA),
panel.grid.major = element_line(color = "#2a2a2a"),
panel.grid.minor = element_blank(),
plot.title = element_text(color = "white", face = "bold", size = 13),
plot.subtitle = element_text(color = "#aaaaaa", size = 10),
axis.text = element_text(color = "#aaaaaa"),
axis.title = element_text(color = "#aaaaaa"),
legend.position = "none"
)
Analysis: The boxplot compares the price distribution across event periods. The Russia-Ukraine War period (7 months) has the highest median ($4.24) and the widest range of the three event periods ($3.54–$4.99), although Normal months span a wider range overall because they cover most of the 15 years. The COVID-19 crash period (4 months) has the lowest median ($1.97) and a compressed range, reflecting the brief but sharp demand collapse; the series minimum of $1.74 falls in the OPEC supply glut period instead. Prices during the glut (median $2.35) sit well below those in Normal months, suggesting that the period of abundant supply benefited U.S. consumers at the pump.
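The summary statistics behind the boxes can be tabulated directly. The sketch below covers only the three event windows, with the monthly prices transcribed from the series above using the same date ranges as the `event` flag; Normal months are omitted to keep the example short, and the list name `event_prices` is introduced here for illustration.

```r
# Monthly prices inside each event window, transcribed from the monthly data
event_prices <- list(
  "OPEC Supply Glut"   = c(2.91, 2.59,                        # Nov-Dec 2014
                           2.12, 2.12, 2.44, 2.45, 2.72, 2.78,
                           2.77, 2.63, 2.35, 2.33, 2.21, 2.03, # 2015
                           1.88, 1.74, 1.83),                  # Jan-Mar 2016
  "COVID-19 Crash"     = c(2.34, 1.83, 1.87, 2.07),            # Mar-Jun 2020
  "Russia-Ukraine War" = c(3.54, 4.24, 4.10, 4.46, 4.99, 4.65, 3.99)  # Feb-Aug 2022
)

# One row per event: sample size, median, and min-max range
event_summary <- data.frame(
  event    = names(event_prices),
  n_months = sapply(event_prices, length),
  median   = sapply(event_prices, median),
  min      = sapply(event_prices, min),
  max      = sapply(event_prices, max),
  row.names = NULL
)
event_summary$range <- event_summary$max - event_summary$min
print(event_summary)
```

The unequal sample sizes (17, 4, and 7 months) are worth keeping in mind when comparing the boxes, since a short window will tend to show a narrower range.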
# Summary statistics by cluster era
cluster_summary <- cluster_df %>%
group_by(label) %>%
summarise(
Years = paste(sort(year), collapse = ", "),
`Avg Price` = paste0("$", round(mean(avg_price), 2)),
`Min Recorded` = paste0("$", round(min(min_price), 2)),
`Max Recorded` = paste0("$", round(max(max_price), 2)),
`Avg Volatility` = paste0("$", round(mean(volatility), 2)),
.groups = "drop"
) %>%
rename(`Price Era` = label)
knitr::kable(cluster_summary,
caption = "Summary Statistics by K-Means Cluster (EIA Data, 2010–2024)",
align = "llllll")
| Price Era | Years | Avg Price | Min Recorded | Max Recorded | Avg Volatility |
|---|---|---|---|---|---|
| High-Price Era | 2022 | $3.97 | $3.18 | $4.99 | $1.81 |
| Low-Price Era | 2010, 2015, 2016, 2017, 2018, 2019, 2020 | $2.47 | $1.74 | $3.02 | $0.58 |
| Mid-Price Era | 2011, 2012, 2013, 2014, 2021, 2023, 2024 | $3.41 | $2.39 | $3.96 | $0.78 |
Regression Findings:
Cluster Findings:
Hypotheses Assessment:
Data Sources: U.S. Energy Information Administration (EIA) Monthly Energy Review Table 9.4; FRED Series GASREGW; EIA Gasoline and Diesel Fuel Update.