This analysis explores 100 years of global fuel pricing data (1924-2024), examining gasoline, diesel, and LPG prices across 25 countries and 9 regions. The dataset includes inflation-adjusted prices, crude oil benchmarks, tax rates, subsidy indicators, and macroeconomic variables. Rather than simply describing the data, each question builds on the last to uncover why prices differ, when they change, and what drives those changes.
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(corrplot)
library(scales)
fuel_prices <- read.csv("fuel_prices/all_countries_combined.csv")
glimpse(fuel_prices)
## Rows: 2,525
## Columns: 27
## $ year <int> 1924, 1925, 1926, 1927, 1928, 1929, 1930, 193…
## $ country <chr> "Argentina", "Argentina", "Argentina", "Argen…
## $ iso3 <chr> "ARG", "ARG", "ARG", "ARG", "ARG", "ARG", "AR…
## $ currency_code <chr> "ARS", "ARS", "ARS", "ARS", "ARS", "ARS", "AR…
## $ region <chr> "Latin America", "Latin America", "Latin Amer…
## $ decade <chr> "1920s", "1920s", "1920s", "1920s", "1920s", …
## $ gasoline_usd_per_liter <dbl> 0.0124, 0.0121, 0.0125, 0.0126, 0.0131, 0.012…
## $ gasoline_usd_per_gallon <dbl> 0.0469, 0.0458, 0.0473, 0.0477, 0.0496, 0.048…
## $ gasoline_real_2024usd <dbl> 0.2277, 0.2171, 0.2218, 0.2274, 0.2405, 0.235…
## $ diesel_usd_per_liter <dbl> 0.0101, 0.0098, 0.0102, 0.0103, 0.0108, 0.010…
## $ diesel_real_2024usd <dbl> 0.1855, 0.1758, 0.1809, 0.1859, 0.1983, 0.189…
## $ lpg_usd_per_liter <dbl> 0.0059, 0.0056, 0.0062, 0.0062, 0.0064, 0.006…
## $ gasoline_local_currency <dbl> 11.16, 10.89, 11.25, 11.34, 11.79, 11.52, 11.…
## $ gasoline_yoy_pct_change <dbl> NA, -2.42, 3.31, 0.80, 3.97, -2.29, 1.56, 5.3…
## $ diesel_yoy_pct_change <dbl> NA, -2.97, 4.08, 0.98, 4.85, -4.63, 4.85, 0.0…
## $ price_tier <chr> "Very Low (<$0.30)", "Very Low (<$0.30)", "Ve…
## $ tax_pct_of_pump_price <dbl> 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 1…
## $ pretax_price_usd_ltr <dbl> 0.0107, 0.0104, 0.0108, 0.0108, 0.0113, 0.011…
## $ subsidy_flag <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ subsidy_regime <chr> "heavy", "heavy", "heavy", "heavy", "heavy", …
## $ is_oil_producer <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ crude_oil_usd_per_barrel <dbl> 1.43, 1.68, 1.88, 1.30, 1.17, 1.27, 1.19, 0.6…
## $ crude_oil_usd_per_liter <dbl> 0.0090, 0.0106, 0.0118, 0.0082, 0.0074, 0.008…
## $ refinery_margin_usd_ltr <dbl> 0.0034, 0.0015, 0.0007, 0.0044, 0.0057, 0.004…
## $ oil_price_shock_idx <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ us_cpi <dbl> 17.1, 17.5, 17.7, 17.4, 17.1, 17.1, 16.7, 15.…
## $ inflation_deflator_2024 <dbl> 18.3626, 17.9429, 17.7401, 18.0460, 18.3626, …
Why this question? Before any analysis, we must confirm what data we have, its temporal and geographic coverage, and whether variable types are appropriate for the analysis we plan to do.
str(fuel_prices)
## 'data.frame': 2525 obs. of 27 variables:
## $ year : int 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 ...
## $ country : chr "Argentina" "Argentina" "Argentina" "Argentina" ...
## $ iso3 : chr "ARG" "ARG" "ARG" "ARG" ...
## $ currency_code : chr "ARS" "ARS" "ARS" "ARS" ...
## $ region : chr "Latin America" "Latin America" "Latin America" "Latin America" ...
## $ decade : chr "1920s" "1920s" "1920s" "1920s" ...
## $ gasoline_usd_per_liter : num 0.0124 0.0121 0.0125 0.0126 0.0131 0.0128 0.013 0.0137 0.0137 0.0145 ...
## $ gasoline_usd_per_gallon : num 0.0469 0.0458 0.0473 0.0477 0.0496 0.0485 0.0492 0.0519 0.0519 0.0549 ...
## $ gasoline_real_2024usd : num 0.228 0.217 0.222 0.227 0.24 ...
## $ diesel_usd_per_liter : num 0.0101 0.0098 0.0102 0.0103 0.0108 0.0103 0.0108 0.0108 0.011 0.0116 ...
## $ diesel_real_2024usd : num 0.185 0.176 0.181 0.186 0.198 ...
## $ lpg_usd_per_liter : num 0.0059 0.0056 0.0062 0.0062 0.0064 0.0061 0.0061 0.0068 0.0066 0.007 ...
## $ gasoline_local_currency : num 11.2 10.9 11.2 11.3 11.8 ...
## $ gasoline_yoy_pct_change : num NA -2.42 3.31 0.8 3.97 -2.29 1.56 5.38 0 5.84 ...
## $ diesel_yoy_pct_change : num NA -2.97 4.08 0.98 4.85 -4.63 4.85 0 1.85 5.45 ...
## $ price_tier : chr "Very Low (<$0.30)" "Very Low (<$0.30)" "Very Low (<$0.30)" "Very Low (<$0.30)" ...
## $ tax_pct_of_pump_price : num 14 14 14 14 14 14 14 14 14 14 ...
## $ pretax_price_usd_ltr : num 0.0107 0.0104 0.0108 0.0108 0.0113 0.011 0.0112 0.0118 0.0118 0.0125 ...
## $ subsidy_flag : int 0 0 0 0 0 0 0 0 0 0 ...
## $ subsidy_regime : chr "heavy" "heavy" "heavy" "heavy" ...
## $ is_oil_producer : int 1 1 1 1 1 1 1 1 1 1 ...
## $ crude_oil_usd_per_barrel: num 1.43 1.68 1.88 1.3 1.17 1.27 1.19 0.65 0.87 0.67 ...
## $ crude_oil_usd_per_liter : num 0.009 0.0106 0.0118 0.0082 0.0074 0.008 0.0075 0.0041 0.0055 0.0042 ...
## $ refinery_margin_usd_ltr : num 0.0034 0.0015 0.0007 0.0044 0.0057 0.0048 0.0055 0.0096 0.0082 0.0103 ...
## $ oil_price_shock_idx : num 1 1 1 1 1 1 1 1 1 1 ...
## $ us_cpi : num 17.1 17.5 17.7 17.4 17.1 17.1 16.7 15.2 13.7 13 ...
## $ inflation_deflator_2024 : num 18.4 17.9 17.7 18 18.4 ...
cat("Dataset Dimensions:", nrow(fuel_prices), "rows x",
ncol(fuel_prices), "columns\n")
## Dataset Dimensions: 2525 rows x 27 columns
cat("Time Period:", min(fuel_prices$year),
"to", max(fuel_prices$year), "\n")
## Time Period: 1924 to 2024
cat("Countries Covered:", n_distinct(fuel_prices$country), "\n")
## Countries Covered: 25
cat("Regions Covered:", n_distinct(fuel_prices$region), "\n")
## Regions Covered: 9
Key findings:
Implications:
Why this question? Missing data can bias results or limit certain analyses. Understanding gaps helps determine appropriate analytical methods.
missing_summary <- colSums(is.na(fuel_prices))
missing_df <- data.frame(
Variable = names(missing_summary),
Missing_Count = missing_summary,
Percent_Missing = round((missing_summary / nrow(fuel_prices)) * 100, 2)
) %>%
filter(Missing_Count > 0) %>%
arrange(desc(Missing_Count))
print(missing_df)
## Variable Missing_Count Percent_Missing
## gasoline_yoy_pct_change gasoline_yoy_pct_change 25 0.99
## diesel_yoy_pct_change diesel_yoy_pct_change 25 0.99
Key findings:
gasoline_yoy_pct_change and
diesel_yoy_pct_change (0.99% of data)Implications:
Why this question? Mean alone is not enough. When mean and median are very different, the data is skewed — meaning the average is being pulled by extreme values and does not represent a typical observation. Standard deviation tells us how spread out prices are across the full 100-year range.
fuel_prices %>%
summarise(
Mean_Nominal = round(mean(gasoline_usd_per_liter), 4),
Median_Nominal = round(median(gasoline_usd_per_liter), 4),
SD_Nominal = round(sd(gasoline_usd_per_liter), 4),
Mean_Real = round(mean(gasoline_real_2024usd), 4),
Median_Real = round(median(gasoline_real_2024usd), 4),
SD_Real = round(sd(gasoline_real_2024usd), 4),
Min_Real = round(min(gasoline_real_2024usd), 4),
Max_Real = round(max(gasoline_real_2024usd), 4)
)
## Mean_Nominal Median_Nominal SD_Nominal Mean_Real Median_Real SD_Real Min_Real
## 1 0.3001 0.0551 0.4582 0.6725 0.4528 0.5686 0.0123
## Max_Real
## 1 2.6502
Key findings:
Implications:
Why this question? Price tier is a categorical variable. The mode (most frequent category) tells us what “normal” looks like globally across the full dataset. This is especially useful when the numeric mean is misleading due to skewness.
tier_counts <- fuel_prices %>%
count(price_tier, sort = TRUE) %>%
mutate(Percentage = round(n / sum(n) * 100, 1))
print(tier_counts)
## price_tier n Percentage
## 1 Very Low (<$0.30) 1790 70.9
## 2 Low ($0.30–$0.70) 305 12.1
## 3 Medium ($0.70–$1.10) 207 8.2
## 4 High ($1.10–$1.60) 162 6.4
## 5 Very High (>$1.60) 61 2.4
cat("\nMode (most frequent price tier):", tier_counts$price_tier[1], "\n")
##
## Mode (most frequent price tier): Very Low (<$0.30)
cat("Appearing in", tier_counts$n[1], "observations (",
tier_counts$Percentage[1], "% of all data)\n")
## Appearing in 1790 observations ( 70.9 % of all data)
Key findings:
Implications:
Why this question? Crude oil is the primary input cost for gasoline and diesel. Understanding its volatility contextualises all retail price movements throughout the century.
oil_summary <- fuel_prices %>%
summarise(
Min_Oil = min(crude_oil_usd_per_barrel),
Max_Oil = max(crude_oil_usd_per_barrel),
Mean_Oil = round(mean(crude_oil_usd_per_barrel), 2),
Median_Oil = round(median(crude_oil_usd_per_barrel), 2),
SD_Oil = round(sd(crude_oil_usd_per_barrel), 2)
)
print(oil_summary)
## Min_Oil Max_Oil Mean_Oil Median_Oil SD_Oil
## 1 0.65 112.01 23 9.35 29.98
# Identify years of extreme crude prices
fuel_prices %>%
filter(crude_oil_usd_per_barrel == min(crude_oil_usd_per_barrel) |
crude_oil_usd_per_barrel == max(crude_oil_usd_per_barrel)) %>%
select(year, country, crude_oil_usd_per_barrel) %>%
distinct(year, .keep_all = TRUE)
## year country crude_oil_usd_per_barrel
## 1 1931 Argentina 0.65
## 2 2012 Argentina 112.01
Key findings:
Implications:
Why this question? Before building predictive models, we need to understand how our numerical variables interact with one another. A pair plot visually displays the distribution shape of each variable while simultaneously mapping the scatterplots and correlation coefficients for every variable pairing, instantly revealing multi-collinearity and non-linear trends..
cor_vars <- fuel_prices %>%
select(gasoline_real_2024usd, crude_oil_usd_per_barrel,
tax_pct_of_pump_price, us_cpi) %>%
na.omit()
ggpairs(cor_vars,
title = "Pair Plot of Key Economic Indicators",
lower = list(continuous = wrap("points", alpha = 0.3, size = 0.5, color = "steelblue")),
diag = list(continuous = wrap("densityDiag", fill = "orange", alpha = 0.5))) +
theme_minimal()
Key findings:
crude_oil_usd_per_barrel and
gasoline_real_2024usd show a positive relationship, but the
wide spread indicates other factors (like taxes) heavily influence the
final price.Implications:
Why this question? This tests a logical hypothesis — countries that extract oil should face lower domestic prices due to direct supply access and subsidies. The data either confirms or challenges this assumption.
producer_stats <- fuel_prices %>%
mutate(Producer = ifelse(is_oil_producer == 1,
"Oil Producer", "Non-Producer")) %>%
group_by(Producer) %>%
summarise(
Mean_Real_Price = round(mean(gasoline_real_2024usd), 3),
Median_Real_Price = round(median(gasoline_real_2024usd), 3),
SD_Real_Price = round(sd(gasoline_real_2024usd), 3),
Mean_Tax_Rate = round(mean(tax_pct_of_pump_price), 1),
Subsidy_Pct = round(mean(subsidy_flag) * 100, 1),
Count = n()
)
print(producer_stats)
## # A tibble: 2 × 7
## Producer Mean_Real_Price Median_Real_Price SD_Real_Price Mean_Tax_Rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Non-Producer 1.07 0.856 0.601 43.7
## 2 Oil Producer 0.517 0.35 0.471 20.7
## # ℹ 2 more variables: Subsidy_Pct <dbl>, Count <int>
fuel_prices %>%
mutate(Producer = ifelse(is_oil_producer == 1,
"Oil Producer", "Non-Producer")) %>%
ggplot(aes(x = Producer, y = gasoline_real_2024usd, fill = Producer)) +
geom_boxplot(outlier.alpha = 0.3, show.legend = FALSE) +
labs(title = "Real Gasoline Price: Oil Producers vs Non-Producers",
subtitle = "Inflation-adjusted to 2024 USD",
x = "",
y = "Real Price (USD/Liter)") +
scale_fill_manual(values = c("Oil Producer" = "darkgreen",
"Non-Producer" = "steelblue")) +
theme_minimal()
Key findings:
Implications:
Why this question? Identifying consistently expensive markets reveals the long-run impact of taxation, geography, and energy policy — not just short-term price spikes.
top_expensive <- fuel_prices %>%
group_by(country) %>%
summarise(
Avg_Real_Gasoline = round(mean(gasoline_real_2024usd), 3),
Avg_Tax_Percent = round(mean(tax_pct_of_pump_price), 1),
Is_Oil_Producer = max(is_oil_producer)
) %>%
arrange(desc(Avg_Real_Gasoline)) %>%
head(10)
print(top_expensive)
## # A tibble: 10 × 4
## country Avg_Real_Gasoline Avg_Tax_Percent Is_Oil_Producer
## <chr> <dbl> <dbl> <int>
## 1 Norway 1.42 47.2 1
## 2 Italy 1.29 50.3 0
## 3 France 1.24 49.5 0
## 4 Germany 1.23 48 0
## 5 Singapore 1.09 36.5 0
## 6 United Kingdom 1.06 47.2 1
## 7 South Korea 0.978 39.6 0
## 8 Japan 0.894 36.5 0
## 9 Turkey 0.798 45.7 0
## 10 Australia 0.763 28.9 1
Key findings:
Implications:
Why this question? Volatility (measured by Coefficient of Variation) measures economic risk and instability. High volatility means consumers face unpredictable costs, which disrupts household budgets and business planning far more than a consistently high-but-stable price.
regional_volatility <- fuel_prices %>%
group_by(region) %>%
summarise(
Mean_Price = round(mean(gasoline_real_2024usd), 3),
Median_Price = round(median(gasoline_real_2024usd), 3),
SD_Price = round(sd(gasoline_real_2024usd), 3),
CV = round(sd(gasoline_real_2024usd) /
mean(gasoline_real_2024usd) * 100, 1),
Min_Price = round(min(gasoline_real_2024usd), 3),
Max_Price = round(max(gasoline_real_2024usd), 3)
) %>%
arrange(desc(CV))
print(regional_volatility)
## # A tibble: 9 × 7
## region Mean_Price Median_Price SD_Price CV Min_Price Max_Price
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Middle East 0.172 0.14 0.163 94.7 0.029 0.92
## 2 Asia 0.681 0.39 0.557 81.7 0.172 2.51
## 3 Latin America 0.496 0.374 0.403 81.3 0.012 1.78
## 4 Africa 0.46 0.319 0.342 74.2 0.145 1.74
## 5 Eurasia 0.315 0.212 0.223 70.9 0.075 0.912
## 6 Europe/Asia 0.798 0.53 0.54 67.6 0.271 2.08
## 7 North America 0.493 0.391 0.304 61.8 0.202 1.72
## 8 Asia Pacific 0.829 0.629 0.409 49.3 0.321 1.73
## 9 Europe 1.25 0.971 0.609 48.9 0.402 2.65
ggplot(regional_volatility,
aes(x = reorder(region, CV), y = CV, fill = CV)) +
geom_col(show.legend = FALSE) +
coord_flip() +
scale_fill_gradient(low = "lightblue", high = "darkred") +
labs(title = "Fuel Price Volatility by Region",
subtitle = "Coefficient of Variation (SD / Mean x 100) — higher = more volatile",
x = "Region",
y = "Coefficient of Variation (%)") +
theme_minimal()
Key findings:
Implications:
Why this question? Subsidies artificially lower prices but create fiscal burden. Identifying subsidy-dependent economies reveals long-run policy vulnerabilities.
subsidy_reliance <- fuel_prices %>%
group_by(country) %>%
summarise(
Percent_Years_Subsidized = round(mean(subsidy_flag) * 100, 1),
Avg_Real_Price = round(mean(gasoline_real_2024usd), 3),
Subsidy_Regime = names(sort(table(subsidy_regime),
decreasing = TRUE))[1]
) %>%
filter(Percent_Years_Subsidized > 0) %>%
arrange(desc(Percent_Years_Subsidized)) %>%
head(10)
print(subsidy_reliance)
## # A tibble: 10 × 4
## country Percent_Years_Subsidized Avg_Real_Price Subsidy_Regime
## <chr> <dbl> <dbl> <chr>
## 1 Argentina 74.3 0.577 heavy
## 2 Indonesia 74.3 0.332 heavy
## 3 Iran 74.3 0.092 heavy
## 4 Nigeria 74.3 0.276 heavy
## 5 Saudi Arabia 74.3 0.168 heavy
## 6 UAE 74.3 0.257 heavy
## 7 Venezuela 74.3 0.123 heavy
## 8 Brazil 55.4 0.743 partial
## 9 China 55.4 0.477 partial
## 10 India 55.4 0.529 partial
Key findings:
Implications:
Why this question? Diesel powers commercial vehicles, freight, and agriculture. If diesel becomes relatively more expensive, it raises the cost of food and goods transportation — an economic knock-on effect that pure fuel price analysis would miss.
fuel_gap <- fuel_prices %>%
mutate(Price_Gap = gasoline_real_2024usd - diesel_real_2024usd) %>%
group_by(year) %>%
summarise(
Mean_Gasoline = round(mean(gasoline_real_2024usd), 4),
Mean_Diesel = round(mean(diesel_real_2024usd), 4),
Mean_Gap = round(mean(Price_Gap), 4)
)
cat("Gasoline-Diesel Gap Statistics (Real USD):\n")
## Gasoline-Diesel Gap Statistics (Real USD):
cat("Overall Mean Gap: ", round(mean(fuel_gap$Mean_Gap), 4), "\n")
## Overall Mean Gap: 0.0339
cat("Max Gap: ", round(max(fuel_gap$Mean_Gap), 4), "\n")
## Max Gap: 0.0851
cat("Min Gap: ", round(min(fuel_gap$Mean_Gap), 4), "\n")
## Min Gap: 0.0166
ggplot(fuel_gap, aes(x = year)) +
geom_line(aes(y = Mean_Gasoline, color = "Gasoline"), linewidth = 1) +
geom_line(aes(y = Mean_Diesel, color = "Diesel"), linewidth = 1) +
geom_ribbon(aes(ymin = Mean_Diesel, ymax = Mean_Gasoline),
alpha = 0.2, fill = "orange") +
labs(title = "Gasoline vs Diesel Real Prices Over Time (1924-2024)",
subtitle = "Shaded area = price gap between the two fuels",
x = "Year",
y = "Real Price (2024 USD/Liter)",
color = "Fuel Type") +
scale_color_manual(values = c("Gasoline" = "darkorange",
"Diesel" = "firebrick")) +
theme_minimal()
Key findings:
Implications:
Why this question? While our earlier exploratory analysis showed that average prices differ by region, we must statistically prove that these differences are not just random variations in our dataset. ANOVA (Analysis of Variance) mathematically confirms if geographical regions create distinct, provable pricing tiers.
# Run the ANOVA model
regional_anova <- aov(gasoline_real_2024usd ~ region, data = fuel_prices)
cat("ANOVA Results:\n")
## ANOVA Results:
summary(regional_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## region 8 289.3 36.17 172.8 <2e-16 ***
## Residuals 2516 526.8 0.21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Perform Tukey's HSD to see which specific regions differ from each other
tukey_results <- TukeyHSD(regional_anova)
# Convert to data frame to easily filter and view the most significant differences
tukey_df <- as.data.frame(tukey_results$region) %>%
arrange(`p adj`)
cat("\nTop 5 Most Significant Regional Price Differences:\n")
##
## Top 5 Most Significant Regional Price Differences:
head(tukey_df, 5)
## diff lwr upr p adj
## Asia Pacific-Africa 0.3681193 0.2267770 0.5094616 0
## Europe-Africa 0.7851388 0.6668833 0.9033943 0
## Middle East-Africa -0.2881135 -0.4171408 -0.1590862 0
## Eurasia-Asia -0.3668046 -0.5216373 -0.2119718 0
## Europe-Asia 0.5642244 0.4748316 0.6536171 0
Key findings:
Implications:
Why this question? Decadal averages smooth short-term volatility and reveal structural shifts in energy economics — the moments when the global price floor permanently changed.
decade_trends <- fuel_prices %>%
group_by(decade) %>%
summarise(
Mean_Real_Gasoline = round(mean(gasoline_real_2024usd), 3),
Median_Real_Gasoline = round(median(gasoline_real_2024usd), 3),
SD_Real_Gasoline = round(sd(gasoline_real_2024usd), 3),
Mean_Real_Diesel = round(mean(diesel_real_2024usd), 3),
Mean_Crude_Oil = round(mean(crude_oil_usd_per_barrel), 2),
Mean_Tax_Rate = round(mean(tax_pct_of_pump_price), 1),
n_observations = n()
) %>%
arrange(decade)
print(decade_trends)
## # A tibble: 11 × 8
## decade Mean_Real_Gasoline Median_Real_Gasoline SD_Real_Gasoline
## <chr> <dbl> <dbl> <dbl>
## 1 1920s 0.288 0.276 0.132
## 2 1930s 0.402 0.374 0.178
## 3 1940s 0.401 0.365 0.18
## 4 1950s 0.38 0.336 0.195
## 5 1960s 0.44 0.399 0.232
## 6 1970s 0.554 0.449 0.386
## 7 1980s 0.669 0.509 0.553
## 8 1990s 0.811 0.771 0.601
## 9 2000s 1.10 1.03 0.72
## 10 2010s 1.21 1.22 0.672
## 11 2020s 1.31 1.41 0.697
## # ℹ 4 more variables: Mean_Real_Diesel <dbl>, Mean_Crude_Oil <dbl>,
## # Mean_Tax_Rate <dbl>, n_observations <int>
ggplot(decade_trends, aes(x = decade, y = Mean_Real_Gasoline, group = 1)) +
geom_line(color = "darkorange", linewidth = 1.2) +
geom_point(size = 3) +
labs(title = "Average Real Gasoline Price by Decade (1920s-2020s)",
subtitle = "Adjusted to 2024 USD",
x = "Decade",
y = "Average Real Price (USD/Liter)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Key findings:
Implications:
Why this question? A Compound Annual Growth Rate (CAGR) summarises 100 years of change in a single comparable number, and lets us assess whether fuel became more expensive faster or slower than general inflation.
yearly_global <- fuel_prices %>%
group_by(year) %>%
summarise(
Avg_Real_Gas = mean(gasoline_real_2024usd),
Avg_CPI = mean(us_cpi)
)
start_price <- yearly_global$Avg_Real_Gas[yearly_global$year == 1924]
end_price <- yearly_global$Avg_Real_Gas[yearly_global$year == 2024]
n_years <- 100
cagr_real <- (end_price / start_price)^(1 / n_years) - 1
cat("Real gasoline price 1924:", round(start_price, 3), "USD/L\n")
## Real gasoline price 1924: 0.283 USD/L
cat("Real gasoline price 2024:", round(end_price, 3), "USD/L\n")
## Real gasoline price 2024: 1.166 USD/L
cat("100-year CAGR (real): ", round(cagr_real * 100, 2), "% per year\n")
## 100-year CAGR (real): 1.43 % per year
ggplot(yearly_global, aes(x = year, y = Avg_Real_Gas)) +
geom_line(color = "darkorange", linewidth = 1.1) +
geom_smooth(method = "loess", span = 0.3,
color = "darkred", linetype = "dashed", se = FALSE) +
annotate("rect", xmin = 1973, xmax = 1980,
ymin = -Inf, ymax = Inf,
alpha = 0.15, fill = "red") +
annotate("text", x = 1976, y = 1.1,
label = "Oil\nCrises", size = 3, color = "red") +
annotate("rect", xmin = 2007, xmax = 2009,
ymin = -Inf, ymax = Inf,
alpha = 0.15, fill = "blue") +
annotate("text", x = 2008, y = 1.1,
label = "2008\nCrash", size = 3, color = "blue") +
labs(title = "Global Average Real Gasoline Price (1924-2024)",
subtitle = "Dashed = LOESS trend | Shaded = major shock periods",
x = "Year",
y = "Real Price (2024 USD/Liter)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Key findings:
Implications:
Why this question? Sudden price changes disrupt economies, trigger inflation, and generate political crises. Identifying the worst shock years connects the data to real-world events.
yoy_summary <- fuel_prices %>%
filter(!is.na(gasoline_yoy_pct_change)) %>%
group_by(year) %>%
summarise(
Mean_YoY = round(mean(gasoline_yoy_pct_change), 2),
Median_YoY = round(median(gasoline_yoy_pct_change), 2),
SD_YoY = round(sd(gasoline_yoy_pct_change), 2),
Max_YoY = round(max(gasoline_yoy_pct_change), 2),
Min_YoY = round(min(gasoline_yoy_pct_change), 2)
)
cat("Top 5 years with LARGEST price INCREASES:\n")
## Top 5 years with LARGEST price INCREASES:
yoy_summary %>% arrange(desc(Mean_YoY)) %>% head(5) %>% print()
## # A tibble: 5 × 6
## year Mean_YoY Median_YoY SD_YoY Max_YoY Min_YoY
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1977 11.4 11.3 8.48 23.0 -10.6
## 2 1979 11.2 11.8 9.36 25.1 -16.2
## 3 1980 11.0 11.2 8.41 24.5 -12.6
## 4 1978 10.8 10.6 8.31 23.2 -13.0
## 5 1974 10.8 10.9 4.35 18.6 2.78
cat("\nTop 5 years with LARGEST price DROPS:\n")
##
## Top 5 years with LARGEST price DROPS:
yoy_summary %>% arrange(Mean_YoY) %>% head(5) %>% print()
## # A tibble: 5 × 6
## year Mean_YoY Median_YoY SD_YoY Max_YoY Min_YoY
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2023 -5.88 -10.3 10.4 24.2 -19.2
## 2 2024 -1.82 -2.04 2.47 3.82 -6.03
## 3 2012 -1.25 -1.66 5.03 9.07 -13.1
## 4 2015 -1.18 -1.92 5.24 11.8 -13.9
## 5 2014 -0.4 -0.36 4.95 9.83 -12.9
ggplot(yoy_summary, aes(x = year, y = Mean_YoY)) +
geom_col(aes(fill = Mean_YoY > 0), show.legend = FALSE) +
geom_hline(yintercept = 0, color = "black", linewidth = 0.5) +
scale_fill_manual(values = c("TRUE" = "tomato", "FALSE" = "steelblue")) +
labs(title = "Year-over-Year Gasoline Price Change (%)",
subtitle = "Red = price increase | Blue = price decrease",
x = "Year",
y = "Mean YoY Change (%)") +
theme_minimal()
Key findings:
Implications:
Why this question? Tax is a direct policy lever. Ranking by tax rate separates policy-driven expensive markets from geographically or supply-driven ones.
tax_ranking <- fuel_prices %>%
group_by(country) %>%
summarise(
Avg_Tax_Rate = round(mean(tax_pct_of_pump_price), 1),
Avg_Real_Price = round(mean(gasoline_real_2024usd), 3),
Tax_Per_Liter_USD = round(mean(tax_pct_of_pump_price) / 100 *
mean(gasoline_real_2024usd), 3)
) %>%
arrange(desc(Avg_Tax_Rate)) %>%
head(15)
print(tax_ranking)
## # A tibble: 15 × 4
## country Avg_Tax_Rate Avg_Real_Price Tax_Per_Liter_USD
## <chr> <dbl> <dbl> <dbl>
## 1 Italy 50.3 1.29 0.647
## 2 France 49.5 1.24 0.613
## 3 Germany 48 1.23 0.588
## 4 Norway 47.2 1.42 0.668
## 5 United Kingdom 47.2 1.06 0.5
## 6 Turkey 45.7 0.798 0.365
## 7 India 41.9 0.529 0.222
## 8 South Korea 39.6 0.978 0.387
## 9 Japan 36.5 0.894 0.327
## 10 Singapore 36.5 1.09 0.398
## 11 Brazil 32 0.743 0.238
## 12 China 32 0.477 0.152
## 13 South Africa 30.4 0.644 0.196
## 14 Australia 28.9 0.763 0.221
## 15 Argentina 26.6 0.577 0.154
ggplot(tax_ranking,
aes(x = reorder(country, Avg_Tax_Rate), y = Avg_Tax_Rate)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 15 Countries by Average Fuel Tax Rate",
subtitle = "Tax as Percentage of Pump Price",
x = "Country",
y = "Average Tax Rate (%)") +
theme_minimal()
Key findings:
Implications:
Covered in Question 3.3 above with full analysis and visualisation.
Why this question? The refinery margin (difference between crude cost and refined product wholesale price) shows how much of your pump price goes to the refining industry rather than to oil producers or governments. Rising margins indicate industry profitability or capacity constraints.
margin_trends <- fuel_prices %>%
group_by(decade) %>%
summarise(
Mean_Margin = round(mean(refinery_margin_usd_ltr), 4),
Median_Margin = round(median(refinery_margin_usd_ltr), 4),
SD_Margin = round(sd(refinery_margin_usd_ltr), 4),
Min_Margin = round(min(refinery_margin_usd_ltr), 4),
Max_Margin = round(max(refinery_margin_usd_ltr), 4)
) %>%
arrange(decade)
print(margin_trends)
## # A tibble: 11 × 6
## decade Mean_Margin Median_Margin SD_Margin Min_Margin Max_Margin
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1920s 0.0067 0.0058 0.0075 -0.0067 0.0242
## 2 1930s 0.0119 0.0107 0.0079 -0.0018 0.0314
## 3 1940s 0.0138 0.0119 0.0108 -0.007 0.0412
## 4 1950s 0.0155 0.0116 0.017 -0.008 0.062
## 5 1960s 0.0266 0.0217 0.0241 -0.0067 0.0925
## 6 1970s 0.0285 0.0234 0.0748 -0.185 0.251
## 7 1980s 0.0619 -0.0054 0.201 -0.220 0.576
## 8 1990s 0.273 0.242 0.296 -0.137 0.882
## 9 2000s 0.361 0.347 0.447 -0.602 1.42
## 10 2010s 0.413 0.452 0.540 -0.689 1.47
## 11 2020s 0.723 0.814 0.633 -0.618 1.82
Key findings:
Implications:
Why this question? In most markets diesel is cheaper due to lower commercial vehicle taxes. But when diesel exceeds gasoline in price, it signals a policy shift with direct consequences for logistics, agriculture, and food costs.
fuel_prices <- fuel_prices %>%
mutate(
Diesel_to_Gasoline_Ratio = diesel_usd_per_liter / gasoline_usd_per_liter,
Diesel_Premium = ifelse(Diesel_to_Gasoline_Ratio > 1,
"Diesel More Expensive",
"Gasoline More Expensive")
)
diesel_premium_summary <- fuel_prices %>%
group_by(region, Diesel_Premium) %>%
summarise(n = n(), .groups = "drop") %>%
group_by(region) %>%
mutate(Percent = round(n / sum(n) * 100, 1))
print(diesel_premium_summary)
## # A tibble: 11 × 4
## # Groups: region [9]
## region Diesel_Premium n Percent
## <chr> <chr> <int> <dbl>
## 1 Africa Gasoline More Expensive 202 100
## 2 Asia Gasoline More Expensive 505 100
## 3 Asia Pacific Gasoline More Expensive 202 100
## 4 Eurasia Gasoline More Expensive 101 100
## 5 Europe Diesel More Expensive 502 99.4
## 6 Europe Gasoline More Expensive 3 0.6
## 7 Europe/Asia Diesel More Expensive 46 45.5
## 8 Europe/Asia Gasoline More Expensive 55 54.5
## 9 Latin America Gasoline More Expensive 404 100
## 10 Middle East Gasoline More Expensive 303 100
## 11 North America Gasoline More Expensive 202 100
Key findings:
Implications:
Why this question? Subsidies are policy choices. Quantifying the price gap shows the real scale of market distortion and the fiscal cost governments bear.
subsidy_comparison <- fuel_prices %>%
mutate(Subsidy_Status = ifelse(subsidy_flag == 1,
"Subsidised",
"Not Subsidised")) %>%
group_by(Subsidy_Status) %>%
summarise(
Mean_Real_Price = round(mean(gasoline_real_2024usd), 3),
Median_Real_Price = round(median(gasoline_real_2024usd), 3),
SD_Real_Price = round(sd(gasoline_real_2024usd), 3),
Mean_Tax_Rate = round(mean(tax_pct_of_pump_price), 1),
Count = n()
)
print(subsidy_comparison)
## # A tibble: 2 × 6
## Subsidy_Status Mean_Real_Price Median_Real_Price SD_Real_Price Mean_Tax_Rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Not Subsidised 0.813 0.592 0.602 32.5
## 2 Subsidised 0.373 0.244 0.33 15.6
## # ℹ 1 more variable: Count <int>
price_diff <- subsidy_comparison$Mean_Real_Price[
subsidy_comparison$Subsidy_Status == "Not Subsidised"] -
subsidy_comparison$Mean_Real_Price[
subsidy_comparison$Subsidy_Status == "Subsidised"]
pct_diff <- price_diff / subsidy_comparison$Mean_Real_Price[
subsidy_comparison$Subsidy_Status == "Subsidised"] * 100
cat("\nMean price difference (Non-Subsidised minus Subsidised):",
round(price_diff, 3), "USD/Liter\n")
##
## Mean price difference (Non-Subsidised minus Subsidised): 0.44 USD/Liter
cat("Non-subsidised prices are", round(pct_diff, 1),
"% higher on average\n")
## Non-subsidised prices are 118 % higher on average
fuel_prices %>%
mutate(Subsidy_Status = ifelse(subsidy_flag == 1,
"Subsidised",
"Not Subsidised")) %>%
ggplot(aes(x = Subsidy_Status, y = gasoline_real_2024usd,
fill = Subsidy_Status)) +
geom_boxplot(outlier.alpha = 0.3, show.legend = FALSE) +
labs(title = "Real Gasoline Price: Subsidised vs Non-Subsidised Markets",
x = "",
y = "Real Price (2024 USD/Liter)") +
scale_fill_manual(values = c("Subsidised" = "steelblue",
"Not Subsidised" = "tomato")) +
theme_minimal()
Key findings:
Implications:
Why this question? If crude oil perfectly drives retail prices, then energy policy is almost powerless. If the correlation has weakened over time, it means taxes and subsidies are increasingly dominating crude oil as the price determinant.
# Overall correlation
overall_cor <- cor(fuel_prices$crude_oil_usd_per_barrel,
fuel_prices$gasoline_real_2024usd)
cat("Overall correlation (crude vs real gasoline):",
round(overall_cor, 3), "\n\n")
## Overall correlation (crude vs real gasoline): 0.525
# Correlation by decade
cor_by_decade <- fuel_prices %>%
group_by(decade) %>%
summarise(
Correlation = round(cor(crude_oil_usd_per_barrel,
gasoline_real_2024usd), 3),
Mean_Crude = round(mean(crude_oil_usd_per_barrel), 2),
Mean_Retail = round(mean(gasoline_real_2024usd), 3),
Passthrough_Pct = round(Correlation * 100, 1)
) %>%
arrange(decade)
print(cor_by_decade)
## # A tibble: 11 × 5
## decade Correlation Mean_Crude Mean_Retail Passthrough_Pct
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1920s -0.047 1.46 0.288 -4.7
## 2 1930s 0.01 0.98 0.402 1
## 3 1940s -0.174 1.55 0.401 -17.4
## 4 1950s 0.08 2.76 0.38 8
## 5 1960s 0.004 2.91 0.44 0.4
## 6 1970s 0.187 10.6 0.554 18.7
## 7 1980s -0.014 25.6 0.669 -1.4
## 8 1990s -0.055 18.5 0.811 -5.5
## 9 2000s 0.167 51.1 1.10 16.7
## 10 2010s -0.004 79.7 1.21 -0.4
## 11 2020s 0.002 75.3 1.31 0.2
ggplot(fuel_prices, aes(x = crude_oil_usd_per_barrel,
y = gasoline_real_2024usd)) +
geom_point(alpha = 0.2, color = "gray40") +
geom_smooth(method = "lm", color = "red", se = TRUE) +
facet_wrap(~decade, scales = "free") +
labs(title = "Crude Oil Price vs Retail Gasoline Price by Decade",
x = "Crude Oil (USD/Barrel)",
y = "Real Retail Price (2024 USD/Liter)") +
theme_minimal(base_size = 9)
## `geom_smooth()` using formula = 'y ~ x'
What we learn:
Implications:
Why this question? To quantify the exact monetary impact of input costs and government policy. A linear regression model establishes a baseline mathematical equation, allowing us to predict how much the pump price will increase for every single dollar crude oil goes up, holding taxes constant.
# Build the linear regression model
lm_model <- lm(gasoline_real_2024usd ~ crude_oil_usd_per_barrel + tax_pct_of_pump_price,
data = fuel_prices)
summary(lm_model)
##
## Call:
## lm(formula = gasoline_real_2024usd ~ crude_oil_usd_per_barrel +
## tax_pct_of_pump_price, data = fuel_prices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00029 -0.14169 0.02219 0.16676 1.15736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0064799 0.0091696 0.707 0.48
## crude_oil_usd_per_barrel 0.0054078 0.0001934 27.960 <2e-16 ***
## tax_pct_of_pump_price 0.0199626 0.0002751 72.552 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2756 on 2522 degrees of freedom
## Multiple R-squared: 0.7653, Adjusted R-squared: 0.7651
## F-statistic: 4111 on 2 and 2522 DF, p-value: < 2.2e-16
# Extract and interpret coefficients
cat("\nModel Interpretation:\n")
##
## Model Interpretation:
cat("Base price (Intercept): $", round(coef(lm_model)[1], 3), "\n")
## Base price (Intercept): $ 0.006
cat("Impact of $1 crude increase: $", round(coef(lm_model)[2], 4), "per liter\n")
## Impact of $1 crude increase: $ 0.0054 per liter
cat("Impact of 1% tax increase: $", round(coef(lm_model)[3], 4), "per liter\n")
## Impact of 1% tax increase: $ 0.02 per liter
Key findings:
crude_oil_usd_per_barrel and
tax_pct_of_pump_price have significant positive
coefficients (p < 0.05).Implications:
Why this question? Historical data spanning 100 years rarely follows a perfect straight line. Extreme events like the 1970s oil embargoes or the 2008 financial crash cause market panic, meaning prices might rise exponentially rather than linearly during a shock. Polynomial regression allows our predictive line to curve, and we can test if this curve is statistically more accurate than the straight line.
# Build a polynomial regression model (degree 2 for crude oil)
poly_model <- lm(gasoline_real_2024usd ~ poly(crude_oil_usd_per_barrel, 2) + tax_pct_of_pump_price,
data = fuel_prices)
summary(poly_model)
##
## Call:
## lm(formula = gasoline_real_2024usd ~ poly(crude_oil_usd_per_barrel,
## 2) + tax_pct_of_pump_price, data = fuel_prices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00947 -0.14378 0.02368 0.16882 1.14646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.133642 0.009458 14.131 <2e-16 ***
## poly(crude_oil_usd_per_barrel, 2)1 8.180046 0.292247 27.990 <2e-16 ***
## poly(crude_oil_usd_per_barrel, 2)2 -0.411901 0.284472 -1.448 0.148
## tax_pct_of_pump_price 0.019860 0.000284 69.927 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2755 on 2521 degrees of freedom
## Multiple R-squared: 0.7655, Adjusted R-squared: 0.7652
## F-statistic: 2743 on 3 and 2521 DF, p-value: < 2.2e-16
# Compare the linear and polynomial models using ANOVA
cat("\nModel Comparison (Linear vs Polynomial):\n")
##
## Model Comparison (Linear vs Polynomial):
# H0 - The additional squared term does not provide a better fit.
# H1 - The squared term improves the model fit.
anova(lm_model, poly_model)
## Analysis of Variance Table
##
## Model 1: gasoline_real_2024usd ~ crude_oil_usd_per_barrel + tax_pct_of_pump_price
## Model 2: gasoline_real_2024usd ~ poly(crude_oil_usd_per_barrel, 2) + tax_pct_of_pump_price
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2522 191.55
## 2 2521 191.39 1 0.15917 2.0965 0.1478
# Visualizing the Linear vs. Polynomial Fit
ggplot(fuel_prices, aes(x = crude_oil_usd_per_barrel, y = gasoline_real_2024usd)) +
# Add the raw data points with high transparency due to data density
geom_point(alpha = 0.15, color = "gray50") +
# Add the Linear regression line (Straight)
geom_smooth(method = "lm", formula = y ~ x,
aes(color = "Linear (Degree 1)"),
se = FALSE, linewidth = 1.2) +
# Add the Polynomial regression line (Curved)
geom_smooth(method = "lm", formula = y ~ poly(x, 2),
aes(color = "Polynomial (Degree 2)"),
se = FALSE, linewidth = 1.2) +
# Formatting and labels
labs(title = "Predicting Retail Gasoline Prices from Crude Oil",
subtitle = "Linear vs. Polynomial Regression Fit (1924-2024)",
x = "Crude Oil (USD/Barrel)",
y = "Real Retail Price (2024 USD/Liter)",
color = "Model Type") +
scale_color_manual(values = c("Linear (Degree 1)" = "steelblue",
"Polynomial (Degree 2)" = "darkorange")) +
theme_minimal() +
theme(legend.position = "bottom")
Key findings:
($p = 0.148$). This indicates that adding a curve to the
model does not provide a meaningful improvement over a straight
line.Implications:
yearly_avg <- fuel_prices %>%
group_by(year) %>%
summarise(
Nominal_Gasoline = mean(gasoline_usd_per_liter),
Real_Gasoline = mean(gasoline_real_2024usd),
Real_Diesel = mean(diesel_real_2024usd)
)
ggplot(yearly_avg, aes(x = year)) +
geom_line(aes(y = Nominal_Gasoline,
color = "Nominal Gasoline"),
linewidth = 0.8, linetype = "dashed") +
geom_line(aes(y = Real_Gasoline,
color = "Real Gasoline (2024 USD)"),
linewidth = 1.2) +
geom_line(aes(y = Real_Diesel,
color = "Real Diesel (2024 USD)"),
linewidth = 1.2) +
labs(title = "Global Fuel Prices: 100-Year Trend (1924-2024)",
subtitle = "Nominal vs Inflation-Adjusted (Real 2024 USD)",
x = "Year",
y = "Price (USD per Liter)",
color = "Fuel Type") +
theme_minimal() +
scale_color_manual(
values = c("Nominal Gasoline" = "gray60",
"Real Gasoline (2024 USD)" = "darkorange",
"Real Diesel (2024 USD)" = "firebrick")) +
theme(legend.position = "bottom")
Interpretation:
regional_2020s <- fuel_prices %>%
filter(decade == "2020s") %>%
group_by(region) %>%
summarise(
Gasoline = mean(gasoline_usd_per_liter),
Diesel = mean(diesel_usd_per_liter)
) %>%
pivot_longer(cols = c(Gasoline, Diesel),
names_to = "Fuel",
values_to = "Price")
ggplot(regional_2020s,
aes(x = reorder(region, -Price), y = Price, fill = Fuel)) +
geom_col(position = "dodge", width = 0.7) +
labs(title = "Average Nominal Fuel Prices by Region (2020s)",
x = "Region",
y = "Nominal Price (USD/Liter)",
fill = "Fuel Type") +
theme_minimal() +
scale_fill_manual(values = c("Gasoline" = "darkorange",
"Diesel" = "darkblue")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation:
ggplot(fuel_prices,
aes(x = reorder(price_tier,
gasoline_real_2024usd,
FUN = median),
y = gasoline_real_2024usd,
fill = price_tier)) +
geom_boxplot(outlier.alpha = 0.3, show.legend = FALSE) +
labs(title = "Distribution of Real Gasoline Prices Across Price Tiers",
subtitle = "Inflation-adjusted to 2024 USD",
x = "Price Tier",
y = "Real Gasoline Price (2024 USD/Liter)") +
theme_minimal() +
scale_fill_brewer(palette = "Spectral")
Interpretation:
heatmap_data <- fuel_prices %>%
group_by(decade, region) %>%
summarise(Avg_Price = mean(gasoline_real_2024usd),
.groups = "drop")
ggplot(heatmap_data,
aes(x = decade, y = region, fill = Avg_Price)) +
geom_tile(color = "white", linewidth = 0.5) +
scale_fill_gradient(low = "lightyellow",
high = "darkred",
name = "Avg Real\nPrice\n(USD/L)") +
labs(title = "Average Real Gasoline Price Heatmap",
subtitle = "By Decade and Region — darker = more expensive",
x = "Decade",
y = "Region") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation:
cor_vars <- fuel_prices %>%
select(gasoline_real_2024usd,
diesel_real_2024usd,
crude_oil_usd_per_barrel,
tax_pct_of_pump_price,
refinery_margin_usd_ltr,
us_cpi) %>%
na.omit()
cor_matrix <- cor(cor_vars)
corrplot(cor_matrix,
method = "color",
type = "upper",
addCoef.col = "black",
number.cex = 0.7,
tl.col = "black",
tl.srt = 45,
title = "Correlation: Fuel Price Components",
mar = c(0, 0, 2, 0))
Interpretation:
yoy_data <- fuel_prices %>%
group_by(year) %>%
summarise(Avg_YoY = mean(gasoline_yoy_pct_change, na.rm = TRUE),
.groups = 'drop')
ggplot(yoy_data, aes(x = year, y = Avg_YoY)) +
geom_line(color = "purple", size = 1) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Year-over-Year Gasoline Price Changes (1925-2024)",
subtitle = "Identifying Price Shock Periods",
x = "Year",
y = "YoY Change (%)") +
theme_minimal()
Interpretation:
1. The mean is misleading — skewness tells the real story: The mean real gasoline price ($0.67/L) is 50% above the median ($0.45/L). A handful of modern high-tax European markets distort the global average upward. The median is the honest “typical” price.
2. Oil producers do not automatically have cheap fuel: Being an oil producer lowers prices on average — but Norway proves that policy choice (taxation) can override geological advantage entirely.
3. The 1973 oil crisis permanently reset global prices: Decade averages show prices roughly doubled in real terms from the 1960s to the 1970s and never returned. This was a structural break, not a temporary spike.
4. Taxes create stability; subsidies create volatility: Europe’s tax-heavy model produces high but stable prices (low CV). The Middle East’s subsidy model produces low but volatile prices (high CV). When subsidies are removed, consumers absorb the entire market price overnight.
5. Crude oil is important but not dominant: The correlation matrix shows tax rate and CPI correlate with retail prices almost as strongly as crude oil. Domestic policy matters as much as the global commodity market for what consumers actually pay.
6. The gasoline-diesel gap is a leading economic indicator: When diesel approaches or exceeds gasoline in price, freight and agriculture costs follow — an economic effect that pure price analysis would miss entirely.
7. Price mechanics are non-linear and mathematically predictable: Inferential statistics (ANOVA) confirm that regional price disparities are not random; geography dictates the structural floor of fuel prices. Furthermore, regression modeling proves that the relationship between crude oil inputs and retail outcomes is non-linear. During extreme market shocks, the pass-through effect bends, highlighting that simple linear projections are insufficient for 100-year historical energy economic