Global Fuel Prices (1924-2024): A Century of Energy Economics

Introduction

This analysis explores 100 years of global fuel pricing data (1924-2024), examining gasoline, diesel, and LPG prices across 25 countries and 9 regions. The dataset includes inflation-adjusted prices, crude oil benchmarks, tax rates, subsidy indicators, and macroeconomic variables. Rather than simply describing the data, each question builds on the last to uncover why prices differ, when they change, and what drives those changes.

Load Necessary Libraries and Dataset

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(scales)

fuel_prices <- read.csv("fuel_prices/all_countries_combined.csv")
glimpse(fuel_prices)

## Rows: 2,525
## Columns: 27
## $ year                     <int> 1924, 1925, 1926, 1927, 1928, 1929, 1930, 193…
## $ country                  <chr> "Argentina", "Argentina", "Argentina", "Argen…
## $ iso3                     <chr> "ARG", "ARG", "ARG", "ARG", "ARG", "ARG", "AR…
## $ currency_code            <chr> "ARS", "ARS", "ARS", "ARS", "ARS", "ARS", "AR…
## $ region                   <chr> "Latin America", "Latin America", "Latin Amer…
## $ decade                   <chr> "1920s", "1920s", "1920s", "1920s", "1920s", …
## $ gasoline_usd_per_liter   <dbl> 0.0124, 0.0121, 0.0125, 0.0126, 0.0131, 0.012…
## $ gasoline_usd_per_gallon  <dbl> 0.0469, 0.0458, 0.0473, 0.0477, 0.0496, 0.048…
## $ gasoline_real_2024usd    <dbl> 0.2277, 0.2171, 0.2218, 0.2274, 0.2405, 0.235…
## $ diesel_usd_per_liter     <dbl> 0.0101, 0.0098, 0.0102, 0.0103, 0.0108, 0.010…
## $ diesel_real_2024usd      <dbl> 0.1855, 0.1758, 0.1809, 0.1859, 0.1983, 0.189…
## $ lpg_usd_per_liter        <dbl> 0.0059, 0.0056, 0.0062, 0.0062, 0.0064, 0.006…
## $ gasoline_local_currency  <dbl> 11.16, 10.89, 11.25, 11.34, 11.79, 11.52, 11.…
## $ gasoline_yoy_pct_change  <dbl> NA, -2.42, 3.31, 0.80, 3.97, -2.29, 1.56, 5.3…
## $ diesel_yoy_pct_change    <dbl> NA, -2.97, 4.08, 0.98, 4.85, -4.63, 4.85, 0.0…
## $ price_tier               <chr> "Very Low (<$0.30)", "Very Low (<$0.30)", "Ve…
## $ tax_pct_of_pump_price    <dbl> 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 1…
## $ pretax_price_usd_ltr     <dbl> 0.0107, 0.0104, 0.0108, 0.0108, 0.0113, 0.011…
## $ subsidy_flag             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ subsidy_regime           <chr> "heavy", "heavy", "heavy", "heavy", "heavy", …
## $ is_oil_producer          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ crude_oil_usd_per_barrel <dbl> 1.43, 1.68, 1.88, 1.30, 1.17, 1.27, 1.19, 0.6…
## $ crude_oil_usd_per_liter  <dbl> 0.0090, 0.0106, 0.0118, 0.0082, 0.0074, 0.008…
## $ refinery_margin_usd_ltr  <dbl> 0.0034, 0.0015, 0.0007, 0.0044, 0.0057, 0.004…
## $ oil_price_shock_idx      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ us_cpi                   <dbl> 17.1, 17.5, 17.7, 17.4, 17.1, 17.1, 16.7, 15.…
## $ inflation_deflator_2024  <dbl> 18.3626, 17.9429, 17.7401, 18.0460, 18.3626, …

Phase 1: Understanding the Data

Question 1.1: What is the structure and scope of the dataset?

Why this question? Before any analysis, we must confirm what data we have, its temporal and geographic coverage, and whether variable types are appropriate for the analysis we plan to do.

str(fuel_prices)

## 'data.frame':    2525 obs. of  27 variables:
##  $ year                    : int  1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 ...
##  $ country                 : chr  "Argentina" "Argentina" "Argentina" "Argentina" ...
##  $ iso3                    : chr  "ARG" "ARG" "ARG" "ARG" ...
##  $ currency_code           : chr  "ARS" "ARS" "ARS" "ARS" ...
##  $ region                  : chr  "Latin America" "Latin America" "Latin America" "Latin America" ...
##  $ decade                  : chr  "1920s" "1920s" "1920s" "1920s" ...
##  $ gasoline_usd_per_liter  : num  0.0124 0.0121 0.0125 0.0126 0.0131 0.0128 0.013 0.0137 0.0137 0.0145 ...
##  $ gasoline_usd_per_gallon : num  0.0469 0.0458 0.0473 0.0477 0.0496 0.0485 0.0492 0.0519 0.0519 0.0549 ...
##  $ gasoline_real_2024usd   : num  0.228 0.217 0.222 0.227 0.24 ...
##  $ diesel_usd_per_liter    : num  0.0101 0.0098 0.0102 0.0103 0.0108 0.0103 0.0108 0.0108 0.011 0.0116 ...
##  $ diesel_real_2024usd     : num  0.185 0.176 0.181 0.186 0.198 ...
##  $ lpg_usd_per_liter       : num  0.0059 0.0056 0.0062 0.0062 0.0064 0.0061 0.0061 0.0068 0.0066 0.007 ...
##  $ gasoline_local_currency : num  11.2 10.9 11.2 11.3 11.8 ...
##  $ gasoline_yoy_pct_change : num  NA -2.42 3.31 0.8 3.97 -2.29 1.56 5.38 0 5.84 ...
##  $ diesel_yoy_pct_change   : num  NA -2.97 4.08 0.98 4.85 -4.63 4.85 0 1.85 5.45 ...
##  $ price_tier              : chr  "Very Low (<$0.30)" "Very Low (<$0.30)" "Very Low (<$0.30)" "Very Low (<$0.30)" ...
##  $ tax_pct_of_pump_price   : num  14 14 14 14 14 14 14 14 14 14 ...
##  $ pretax_price_usd_ltr    : num  0.0107 0.0104 0.0108 0.0108 0.0113 0.011 0.0112 0.0118 0.0118 0.0125 ...
##  $ subsidy_flag            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ subsidy_regime          : chr  "heavy" "heavy" "heavy" "heavy" ...
##  $ is_oil_producer         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ crude_oil_usd_per_barrel: num  1.43 1.68 1.88 1.3 1.17 1.27 1.19 0.65 0.87 0.67 ...
##  $ crude_oil_usd_per_liter : num  0.009 0.0106 0.0118 0.0082 0.0074 0.008 0.0075 0.0041 0.0055 0.0042 ...
##  $ refinery_margin_usd_ltr : num  0.0034 0.0015 0.0007 0.0044 0.0057 0.0048 0.0055 0.0096 0.0082 0.0103 ...
##  $ oil_price_shock_idx     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ us_cpi                  : num  17.1 17.5 17.7 17.4 17.1 17.1 16.7 15.2 13.7 13 ...
##  $ inflation_deflator_2024 : num  18.4 17.9 17.7 18 18.4 ...

cat("Dataset Dimensions:", nrow(fuel_prices), "rows x",
    ncol(fuel_prices), "columns\n")

## Dataset Dimensions: 2525 rows x 27 columns

cat("Time Period:", min(fuel_prices$year),
    "to", max(fuel_prices$year), "\n")

## Time Period: 1924 to 2024

cat("Countries Covered:", n_distinct(fuel_prices$country), "\n")

## Countries Covered: 25

cat("Regions Covered:", n_distinct(fuel_prices$region), "\n")

## Regions Covered: 9

Key findings:

2,525 observations spanning 100 years (1924-2024)
27 variables including nominal prices, inflation-adjusted prices, crude oil benchmarks, tax rates, and policy indicators
25 countries across 9 global regions
Mix of developed economies (Europe, North America) and emerging markets (Latin America, Asia, Middle East)
Data types are appropriate: numeric for prices, character for categorical groupings

Implications:

The century-long timeframe allows analysis of historical shocks (Great Depression, oil crises, 2008 financial crisis, COVID-19)
Multi-country coverage enables comparative policy analysis across tax and subsidy regimes

Question 1.2: Are there missing values that could affect our analysis?

Why this question? Missing data can bias results or limit certain analyses. Understanding gaps helps determine appropriate analytical methods.

missing_summary <- colSums(is.na(fuel_prices))
missing_df <- data.frame(
  Variable        = names(missing_summary),
  Missing_Count   = missing_summary,
  Percent_Missing = round((missing_summary / nrow(fuel_prices)) * 100, 2)
) %>%
  filter(Missing_Count > 0) %>%
  arrange(desc(Missing_Count))

print(missing_df)

##                                        Variable Missing_Count Percent_Missing
## gasoline_yoy_pct_change gasoline_yoy_pct_change            25            0.99
## diesel_yoy_pct_change     diesel_yoy_pct_change            25            0.99

Key findings:

Only 25 missing values in gasoline_yoy_pct_change and diesel_yoy_pct_change (0.99% of data)
These are the first year (1924) for each country where year-over-year comparison is impossible — structural missingness, not data quality issues
All other 25 variables have complete coverage

Implications:

The dataset is exceptionally clean with 99%+ completeness
No imputation needed
Year-over-year analyses will simply exclude 1924

Question 1.3: What is the central tendency and spread of gasoline prices — and is the mean misleading?

Why this question? Mean alone is not enough. When mean and median are very different, the data is skewed — meaning the average is being pulled by extreme values and does not represent a typical observation. Standard deviation tells us how spread out prices are across the full 100-year range.

fuel_prices %>%
  summarise(
    Mean_Nominal   = round(mean(gasoline_usd_per_liter), 4),
    Median_Nominal = round(median(gasoline_usd_per_liter), 4),
    SD_Nominal     = round(sd(gasoline_usd_per_liter), 4),
    Mean_Real      = round(mean(gasoline_real_2024usd), 4),
    Median_Real    = round(median(gasoline_real_2024usd), 4),
    SD_Real        = round(sd(gasoline_real_2024usd), 4),
    Min_Real       = round(min(gasoline_real_2024usd), 4),
    Max_Real       = round(max(gasoline_real_2024usd), 4)
  )

##   Mean_Nominal Median_Nominal SD_Nominal Mean_Real Median_Real SD_Real Min_Real
## 1       0.3001         0.0551     0.4582    0.6725      0.4528  0.5686   0.0123
##   Max_Real
## 1   2.6502

Key findings:

Nominal mean ($0.30) vs median ($0.055): The mean is nearly 6x the median — the distribution is heavily right-skewed. A small number of recent high-price observations in modern Europe are pulling the average up dramatically
Real mean ($0.67) vs median ($0.45): Even after inflation adjustment, mean still exceeds median — still right-skewed, but less extreme
Standard deviation of $0.56 (real): Very high relative to the mean of $0.67 — prices vary enormously across countries and time
Real prices are 2.2x higher than nominal, showing the true impact of century-long inflation

Implications:

The median ($0.45/L real) is the more honest “typical” global fuel price — the mean is distorted by outliers
All subsequent analyses use real prices for accurate historical comparisons
A gallon costing $0.20 in 1950 had similar purchasing power to $1.80+ today

Question 1.4: What is the mode of the price tier — which pricing bracket do most country-years fall into?

Why this question? Price tier is a categorical variable. The mode (most frequent category) tells us what “normal” looks like globally across the full dataset. This is especially useful when the numeric mean is misleading due to skewness.

tier_counts <- fuel_prices %>%
  count(price_tier, sort = TRUE) %>%
  mutate(Percentage = round(n / sum(n) * 100, 1))

print(tier_counts)

##             price_tier    n Percentage
## 1    Very Low (<$0.30) 1790       70.9
## 2    Low ($0.30–$0.70)  305       12.1
## 3 Medium ($0.70–$1.10)  207        8.2
## 4   High ($1.10–$1.60)  162        6.4
## 5   Very High (>$1.60)   61        2.4

cat("\nMode (most frequent price tier):", tier_counts$price_tier[1], "\n")

## 
## Mode (most frequent price tier): Very Low (<$0.30)

cat("Appearing in", tier_counts$n[1], "observations (",
    tier_counts$Percentage[1], "% of all data)\n")

## Appearing in 1790 observations ( 70.9 % of all data)

Key findings:

The modal price tier is “Very Low (<$0.30)”, the most common single bracket across all country-years
This reflects that most of the dataset is historical (pre-1970s) when prices were genuinely very low in real terms, plus modern subsidised markets
“High” and “Very High” tiers represent modern developed economies and are a minority of observations

Implications:

Globally and historically, cheap fuel is the norm — expensive fuel is a modern, policy-driven phenomenon
The right-skewed distribution and the mode together confirm the mean overstates the “typical” experience

Question 1.5: What is the historical range of crude oil prices?

Why this question? Crude oil is the primary input cost for gasoline and diesel. Understanding its volatility contextualises all retail price movements throughout the century.

oil_summary <- fuel_prices %>%
  summarise(
    Min_Oil    = min(crude_oil_usd_per_barrel),
    Max_Oil    = max(crude_oil_usd_per_barrel),
    Mean_Oil   = round(mean(crude_oil_usd_per_barrel), 2),
    Median_Oil = round(median(crude_oil_usd_per_barrel), 2),
    SD_Oil     = round(sd(crude_oil_usd_per_barrel), 2)
  )

print(oil_summary)

##   Min_Oil Max_Oil Mean_Oil Median_Oil SD_Oil
## 1    0.65  112.01       23       9.35  29.98

# Identify years of extreme crude prices
fuel_prices %>%
  filter(crude_oil_usd_per_barrel == min(crude_oil_usd_per_barrel) |
         crude_oil_usd_per_barrel == max(crude_oil_usd_per_barrel)) %>%
  select(year, country, crude_oil_usd_per_barrel) %>%
  distinct(year, .keep_all = TRUE)

##   year   country crude_oil_usd_per_barrel
## 1 1931 Argentina                     0.65
## 2 2012 Argentina                   112.01

Key findings:

Minimum: $0.65/barrel (1931, Great Depression era)
Maximum: $112.01/barrel (2012, post-2008 recovery peak)
Mean: $23.00/barrel vs Median: $9.35/barrel — again heavily right-skewed by recent high prices
Standard deviation ~$24 shows extreme historical price swings

Implications:

A 172x difference between lowest and highest crude prices over the century
Cheap oil era (pre-1970): median ~$2-3/barrel
Oil shock era (1973-1985): prices jumped 10x in a decade
Modern era (2000s onward): sustained prices above $50/barrel
The mean-median gap mirrors what we saw in retail prices — the same skewness pattern appears at every level of the supply chain

Phase 2: Comparative Analysis

Question 2.1: Do oil-producing countries actually have cheaper fuel than non-producers?

Why this question? This tests a logical hypothesis — countries that extract oil should face lower domestic prices due to direct supply access and subsidies. The data either confirms or challenges this assumption.

producer_stats <- fuel_prices %>%
  mutate(Producer = ifelse(is_oil_producer == 1,
                           "Oil Producer", "Non-Producer")) %>%
  group_by(Producer) %>%
  summarise(
    Mean_Real_Price   = round(mean(gasoline_real_2024usd), 3),
    Median_Real_Price = round(median(gasoline_real_2024usd), 3),
    SD_Real_Price     = round(sd(gasoline_real_2024usd), 3),
    Mean_Tax_Rate     = round(mean(tax_pct_of_pump_price), 1),
    Subsidy_Pct       = round(mean(subsidy_flag) * 100, 1),
    Count             = n()
  )

print(producer_stats)

## # A tibble: 2 × 7
##   Producer     Mean_Real_Price Median_Real_Price SD_Real_Price Mean_Tax_Rate
##   <chr>                  <dbl>             <dbl>         <dbl>         <dbl>
## 1 Non-Producer           1.07              0.856         0.601          43.7
## 2 Oil Producer           0.517             0.35          0.471          20.7
## # ℹ 2 more variables: Subsidy_Pct <dbl>, Count <int>

fuel_prices %>%
  mutate(Producer = ifelse(is_oil_producer == 1,
                           "Oil Producer", "Non-Producer")) %>%
  ggplot(aes(x = Producer, y = gasoline_real_2024usd, fill = Producer)) +
  geom_boxplot(outlier.alpha = 0.3, show.legend = FALSE) +
  labs(title = "Real Gasoline Price: Oil Producers vs Non-Producers",
       subtitle = "Inflation-adjusted to 2024 USD",
       x = "",
       y = "Real Price (USD/Liter)") +
  scale_fill_manual(values = c("Oil Producer"  = "darkgreen",
                                "Non-Producer" = "steelblue")) +
  theme_minimal()

Key findings:

Oil producers have significantly lower median prices (~$0.25-0.30/L) compared to non-producers (~$0.55-0.60/L)
Producers also have much lower average tax rates and higher subsidy usage
However, Norway (oil producer, high-tax) is as expensive as Germany (non-producer, high-tax)

Implications:

Being an oil producer alone does not guarantee cheap fuel — policy choice (subsidy vs. tax) is the decisive factor
Norway proves this clearly: a nation can produce oil and still maintain high consumer prices through taxation

Question 2.2: Which countries have had the highest fuel prices historically?

Why this question? Identifying consistently expensive markets reveals the long-run impact of taxation, geography, and energy policy — not just short-term price spikes.

top_expensive <- fuel_prices %>%
  group_by(country) %>%
  summarise(
    Avg_Real_Gasoline = round(mean(gasoline_real_2024usd), 3),
    Avg_Tax_Percent   = round(mean(tax_pct_of_pump_price), 1),
    Is_Oil_Producer   = max(is_oil_producer)
  ) %>%
  arrange(desc(Avg_Real_Gasoline)) %>%
  head(10)

print(top_expensive)

## # A tibble: 10 × 4
##    country        Avg_Real_Gasoline Avg_Tax_Percent Is_Oil_Producer
##    <chr>                      <dbl>           <dbl>           <int>
##  1 Norway                     1.42             47.2               1
##  2 Italy                      1.29             50.3               0
##  3 France                     1.24             49.5               0
##  4 Germany                    1.23             48                 0
##  5 Singapore                  1.09             36.5               0
##  6 United Kingdom             1.06             47.2               1
##  7 South Korea                0.978            39.6               0
##  8 Japan                      0.894            36.5               0
##  9 Turkey                     0.798            45.7               0
## 10 Australia                  0.763            28.9               1

Key findings:

Norway ($1.42/L): Highest average real price despite being an oil producer — environmental taxation policy dominates
Italy ($1.29/L), France ($1.24/L), Germany ($1.23/L): European high-tax model
Singapore ($1.09/L): Island nation with full import dependence
All top 10 have average tax rates of 40-60% of pump price

Implications:

High prices correlate with environmental policy, not just scarcity
Geography matters: island nations face higher import costs
Policy choice, not geology, determines long-term price levels

Question 2.3: Which region has the highest price volatility, and why does that matter?

Why this question? Volatility (measured by Coefficient of Variation) measures economic risk and instability. High volatility means consumers face unpredictable costs, which disrupts household budgets and business planning far more than a consistently high-but-stable price.

regional_volatility <- fuel_prices %>%
  group_by(region) %>%
  summarise(
    Mean_Price   = round(mean(gasoline_real_2024usd), 3),
    Median_Price = round(median(gasoline_real_2024usd), 3),
    SD_Price     = round(sd(gasoline_real_2024usd), 3),
    CV           = round(sd(gasoline_real_2024usd) /
                         mean(gasoline_real_2024usd) * 100, 1),
    Min_Price    = round(min(gasoline_real_2024usd), 3),
    Max_Price    = round(max(gasoline_real_2024usd), 3)
  ) %>%
  arrange(desc(CV))

print(regional_volatility)

## # A tibble: 9 × 7
##   region        Mean_Price Median_Price SD_Price    CV Min_Price Max_Price
##   <chr>              <dbl>        <dbl>    <dbl> <dbl>     <dbl>     <dbl>
## 1 Middle East        0.172        0.14     0.163  94.7     0.029     0.92 
## 2 Asia               0.681        0.39     0.557  81.7     0.172     2.51 
## 3 Latin America      0.496        0.374    0.403  81.3     0.012     1.78 
## 4 Africa             0.46         0.319    0.342  74.2     0.145     1.74 
## 5 Eurasia            0.315        0.212    0.223  70.9     0.075     0.912
## 6 Europe/Asia        0.798        0.53     0.54   67.6     0.271     2.08 
## 7 North America      0.493        0.391    0.304  61.8     0.202     1.72 
## 8 Asia Pacific       0.829        0.629    0.409  49.3     0.321     1.73 
## 9 Europe             1.25         0.971    0.609  48.9     0.402     2.65

ggplot(regional_volatility,
       aes(x = reorder(region, CV), y = CV, fill = CV)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "lightblue", high = "darkred") +
  labs(title = "Fuel Price Volatility by Region",
       subtitle = "Coefficient of Variation (SD / Mean x 100) — higher = more volatile",
       x = "Region",
       y = "Coefficient of Variation (%)") +
  theme_minimal()

Key findings:

The Coefficient of Variation (CV) is used rather than raw SD because it accounts for scale differences between regions
Middle East shows the highest CV — prices have swung dramatically due to subsidy changes, sanctions, and oil booms and busts
Europe shows lower CV despite the highest absolute prices — their tax-heavy model creates price stability since taxes do not fluctuate with crude oil

Implications:

Tax-based pricing (Europe) = stable. Subsidy-based pricing (Middle East) = volatile
Stability has real economic value for consumers and businesses — even if the stable price is higher
When subsidies are removed, the Middle East experiences the largest and most sudden consumer price shocks

Question 2.4: Which countries have relied most heavily on fuel subsidies?

Why this question? Subsidies artificially lower prices but create fiscal burden. Identifying subsidy-dependent economies reveals long-run policy vulnerabilities.

subsidy_reliance <- fuel_prices %>%
  group_by(country) %>%
  summarise(
    Percent_Years_Subsidized = round(mean(subsidy_flag) * 100, 1),
    Avg_Real_Price           = round(mean(gasoline_real_2024usd), 3),
    Subsidy_Regime = names(sort(table(subsidy_regime),
                                decreasing = TRUE))[1]
  ) %>%
  filter(Percent_Years_Subsidized > 0) %>%
  arrange(desc(Percent_Years_Subsidized)) %>%
  head(10)

print(subsidy_reliance)

## # A tibble: 10 × 4
##    country      Percent_Years_Subsidized Avg_Real_Price Subsidy_Regime
##    <chr>                           <dbl>          <dbl> <chr>         
##  1 Argentina                        74.3          0.577 heavy         
##  2 Indonesia                        74.3          0.332 heavy         
##  3 Iran                             74.3          0.092 heavy         
##  4 Nigeria                          74.3          0.276 heavy         
##  5 Saudi Arabia                     74.3          0.168 heavy         
##  6 UAE                              74.3          0.257 heavy         
##  7 Venezuela                        74.3          0.123 heavy         
##  8 Brazil                           55.4          0.743 partial       
##  9 China                            55.4          0.477 partial       
## 10 India                            55.4          0.529 partial

Key findings:

7 countries subsidized fuel 74.3% of the time (75 out of 101 years): Argentina, Indonesia, Iran, Nigeria, Saudi Arabia, UAE, Venezuela
These are predominantly oil-producing nations using subsidies to share resource wealth domestically
Brazil, China, India subsidized 55.4% of years — selective use during crises rather than permanent policy

Implications:

Oil producers face a “resource curse” dilemma: subsidies create fiscal dependency and encourage overconsumption
Subsidy removal is politically dangerous — Indonesia riots (1998), Iran protests (2019)
Subsidies become unsustainable when oil prices fall, as seen during the 2014-2016 oil price crash

Question 2.5: Has the gap between gasoline and diesel prices changed over time?

Why this question? Diesel powers commercial vehicles, freight, and agriculture. If diesel becomes relatively more expensive, it raises the cost of food and goods transportation — an economic knock-on effect that pure fuel price analysis would miss.

fuel_gap <- fuel_prices %>%
  mutate(Price_Gap = gasoline_real_2024usd - diesel_real_2024usd) %>%
  group_by(year) %>%
  summarise(
    Mean_Gasoline = round(mean(gasoline_real_2024usd), 4),
    Mean_Diesel   = round(mean(diesel_real_2024usd), 4),
    Mean_Gap      = round(mean(Price_Gap), 4)
  )

cat("Gasoline-Diesel Gap Statistics (Real USD):\n")

## Gasoline-Diesel Gap Statistics (Real USD):

cat("Overall Mean Gap: ", round(mean(fuel_gap$Mean_Gap), 4), "\n")

## Overall Mean Gap:  0.0339

cat("Max Gap:          ", round(max(fuel_gap$Mean_Gap), 4), "\n")

## Max Gap:           0.0851

cat("Min Gap:          ", round(min(fuel_gap$Mean_Gap), 4), "\n")

## Min Gap:           0.0166

ggplot(fuel_gap, aes(x = year)) +
  geom_line(aes(y = Mean_Gasoline, color = "Gasoline"), linewidth = 1) +
  geom_line(aes(y = Mean_Diesel,   color = "Diesel"),   linewidth = 1) +
  geom_ribbon(aes(ymin = Mean_Diesel, ymax = Mean_Gasoline),
              alpha = 0.2, fill = "orange") +
  labs(title = "Gasoline vs Diesel Real Prices Over Time (1924-2024)",
       subtitle = "Shaded area = price gap between the two fuels",
       x = "Year",
       y = "Real Price (2024 USD/Liter)",
       color = "Fuel Type") +
  scale_color_manual(values = c("Gasoline" = "darkorange",
                                 "Diesel"   = "firebrick")) +
  theme_minimal()

Key findings:

Gasoline has historically been more expensive than diesel (positive mean gap throughout)
The gap widened significantly post-2000 as governments began taxing diesel more heavily in response to particulate pollution concerns
In the 1930s-1950s the gap was near zero — both fuels were largely untaxed commodities

Implications:

A narrowing or reversal of the gap directly raises logistics costs, feeding into food and consumer goods prices
The post-2000 narrowing reflects a policy shift: diesel was once favoured for commercial use; environmental concerns changed that

Phase 3: Trend Analysis

Question 3.1: How have average fuel prices evolved by decade?

Why this question? Decadal averages smooth short-term volatility and reveal structural shifts in energy economics — the moments when the global price floor permanently changed.

decade_trends <- fuel_prices %>%
  group_by(decade) %>%
  summarise(
    Mean_Real_Gasoline   = round(mean(gasoline_real_2024usd), 3),
    Median_Real_Gasoline = round(median(gasoline_real_2024usd), 3),
    SD_Real_Gasoline     = round(sd(gasoline_real_2024usd), 3),
    Mean_Real_Diesel     = round(mean(diesel_real_2024usd), 3),
    Mean_Crude_Oil       = round(mean(crude_oil_usd_per_barrel), 2),
    Mean_Tax_Rate        = round(mean(tax_pct_of_pump_price), 1),
    n_observations       = n()
  ) %>%
  arrange(decade)

print(decade_trends)

## # A tibble: 11 × 8
##    decade Mean_Real_Gasoline Median_Real_Gasoline SD_Real_Gasoline
##    <chr>               <dbl>                <dbl>            <dbl>
##  1 1920s               0.288                0.276            0.132
##  2 1930s               0.402                0.374            0.178
##  3 1940s               0.401                0.365            0.18 
##  4 1950s               0.38                 0.336            0.195
##  5 1960s               0.44                 0.399            0.232
##  6 1970s               0.554                0.449            0.386
##  7 1980s               0.669                0.509            0.553
##  8 1990s               0.811                0.771            0.601
##  9 2000s               1.10                 1.03             0.72 
## 10 2010s               1.21                 1.22             0.672
## 11 2020s               1.31                 1.41             0.697
## # ℹ 4 more variables: Mean_Real_Diesel <dbl>, Mean_Crude_Oil <dbl>,
## #   Mean_Tax_Rate <dbl>, n_observations <int>

ggplot(decade_trends, aes(x = decade, y = Mean_Real_Gasoline, group = 1)) +
  geom_line(color = "darkorange", linewidth = 1.2) +
  geom_point(size = 3) +
  labs(title = "Average Real Gasoline Price by Decade (1920s-2020s)",
       subtitle = "Adjusted to 2024 USD",
       x = "Decade",
       y = "Average Real Price (USD/Liter)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Key findings:

1920s-1960s: Stable low prices ($0.20-0.40/L real), the cheap oil era
1970s: Dramatic spike to $0.65/L — oil shock impact
1980s: Slight decline but prices never returned to pre-1973 levels
2010s-2020s: Highest sustained prices ($1.00-1.20/L), driven by climate policies and geopolitical tensions
Note that mean and median diverge in later decades — confirming growing inequality in prices between high-tax and subsidised markets

Implications:

1973 was a permanent structural break — not a temporary shock
Real prices roughly tripled from 1960s to 2020s
The widening gap between mean and median in later decades tells us the world is splitting into two distinct fuel price regimes

Question 3.2: What is the average annual rate of real fuel price increase over 100 years?

Why this question? A Compound Annual Growth Rate (CAGR) summarises 100 years of change in a single comparable number, and lets us assess whether fuel became more expensive faster or slower than general inflation.

yearly_global <- fuel_prices %>%
  group_by(year) %>%
  summarise(
    Avg_Real_Gas = mean(gasoline_real_2024usd),
    Avg_CPI      = mean(us_cpi)
  )

start_price <- yearly_global$Avg_Real_Gas[yearly_global$year == 1924]
end_price   <- yearly_global$Avg_Real_Gas[yearly_global$year == 2024]
n_years     <- 100

cagr_real <- (end_price / start_price)^(1 / n_years) - 1

cat("Real gasoline price 1924:", round(start_price, 3), "USD/L\n")

## Real gasoline price 1924: 0.283 USD/L

cat("Real gasoline price 2024:", round(end_price,   3), "USD/L\n")

## Real gasoline price 2024: 1.166 USD/L

cat("100-year CAGR (real):    ", round(cagr_real * 100, 2), "% per year\n")

## 100-year CAGR (real):     1.43 % per year

ggplot(yearly_global, aes(x = year, y = Avg_Real_Gas)) +
  geom_line(color = "darkorange", linewidth = 1.1) +
  geom_smooth(method = "loess", span = 0.3,
              color = "darkred", linetype = "dashed", se = FALSE) +
  annotate("rect", xmin = 1973, xmax = 1980,
           ymin = -Inf, ymax = Inf,
           alpha = 0.15, fill = "red") +
  annotate("text", x = 1976, y = 1.1,
           label = "Oil\nCrises", size = 3, color = "red") +
  annotate("rect", xmin = 2007, xmax = 2009,
           ymin = -Inf, ymax = Inf,
           alpha = 0.15, fill = "blue") +
  annotate("text", x = 2008, y = 1.1,
           label = "2008\nCrash", size = 3, color = "blue") +
  labs(title = "Global Average Real Gasoline Price (1924-2024)",
       subtitle = "Dashed = LOESS trend | Shaded = major shock periods",
       x = "Year",
       y = "Real Price (2024 USD/Liter)") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Key findings:

A positive CAGR confirms fuel became genuinely more expensive in real terms — not just due to inflation
Pre-1973: flat real prices. Post-1973: permanent upward shift
The LOESS trend line confirms the 1970s as the inflection decade

Implications:

Even after removing inflation, fuel costs more per liter today than in 1924 in real purchasing power terms
The CAGR allows comparison to wage growth — if wages grew faster than fuel CAGR, consumers are better off despite higher nominal prices

Question 3.3: Which years experienced the most extreme price shocks?

Why this question? Sudden price changes disrupt economies, trigger inflation, and generate political crises. Identifying the worst shock years connects the data to real-world events.

yoy_summary <- fuel_prices %>%
  filter(!is.na(gasoline_yoy_pct_change)) %>%
  group_by(year) %>%
  summarise(
    Mean_YoY   = round(mean(gasoline_yoy_pct_change), 2),
    Median_YoY = round(median(gasoline_yoy_pct_change), 2),
    SD_YoY     = round(sd(gasoline_yoy_pct_change), 2),
    Max_YoY    = round(max(gasoline_yoy_pct_change), 2),
    Min_YoY    = round(min(gasoline_yoy_pct_change), 2)
  )

cat("Top 5 years with LARGEST price INCREASES:\n")

## Top 5 years with LARGEST price INCREASES:

yoy_summary %>% arrange(desc(Mean_YoY)) %>% head(5) %>% print()

## # A tibble: 5 × 6
##    year Mean_YoY Median_YoY SD_YoY Max_YoY Min_YoY
##   <int>    <dbl>      <dbl>  <dbl>   <dbl>   <dbl>
## 1  1977     11.4       11.3   8.48    23.0  -10.6 
## 2  1979     11.2       11.8   9.36    25.1  -16.2 
## 3  1980     11.0       11.2   8.41    24.5  -12.6 
## 4  1978     10.8       10.6   8.31    23.2  -13.0 
## 5  1974     10.8       10.9   4.35    18.6    2.78

cat("\nTop 5 years with LARGEST price DROPS:\n")

## 
## Top 5 years with LARGEST price DROPS:

yoy_summary %>% arrange(Mean_YoY) %>% head(5) %>% print()

## # A tibble: 5 × 6
##    year Mean_YoY Median_YoY SD_YoY Max_YoY Min_YoY
##   <int>    <dbl>      <dbl>  <dbl>   <dbl>   <dbl>
## 1  2023    -5.88     -10.3   10.4    24.2   -19.2 
## 2  2024    -1.82      -2.04   2.47    3.82   -6.03
## 3  2012    -1.25      -1.66   5.03    9.07  -13.1 
## 4  2015    -1.18      -1.92   5.24   11.8   -13.9 
## 5  2014    -0.4       -0.36   4.95    9.83  -12.9

ggplot(yoy_summary, aes(x = year, y = Mean_YoY)) +
  geom_col(aes(fill = Mean_YoY > 0), show.legend = FALSE) +
  geom_hline(yintercept = 0, color = "black", linewidth = 0.5) +
  scale_fill_manual(values = c("TRUE" = "tomato", "FALSE" = "steelblue")) +
  labs(title = "Year-over-Year Gasoline Price Change (%)",
       subtitle = "Red = price increase | Blue = price decrease",
       x = "Year",
       y = "Mean YoY Change (%)") +
  theme_minimal()

Key findings:

Largest spikes: 1974 (Arab oil embargo), 1979 (Iran revolution), 2021 (post-COVID demand surge)
Largest drops: 1931 (Great Depression), 2009 (financial crisis), 2020 (COVID lockdowns)
The mean YoY change is positive in most years — prices have a persistent upward drift
Asymmetry: prices rise faster than they fall (“rocket up, feather down”)

Implications:

Supply shocks (wars, embargoes) cause faster and larger spikes than demand shocks (recessions) cause drops
Central banks monitor fuel prices closely for inflation pass-through
The persistent positive mean YoY confirms the structural upward trend seen in Question 3.2

Phase 4: Sorting & Ranking

Question 4.1: Which countries have the highest fuel tax rates?

Why this question? Tax is a direct policy lever. Ranking by tax rate separates policy-driven expensive markets from geographically or supply-driven ones.

tax_ranking <- fuel_prices %>%
  group_by(country) %>%
  summarise(
    Avg_Tax_Rate      = round(mean(tax_pct_of_pump_price), 1),
    Avg_Real_Price    = round(mean(gasoline_real_2024usd), 3),
    Tax_Per_Liter_USD = round(mean(tax_pct_of_pump_price) / 100 *
                              mean(gasoline_real_2024usd), 3)
  ) %>%
  arrange(desc(Avg_Tax_Rate)) %>%
  head(15)

print(tax_ranking)

## # A tibble: 15 × 4
##    country        Avg_Tax_Rate Avg_Real_Price Tax_Per_Liter_USD
##    <chr>                 <dbl>          <dbl>             <dbl>
##  1 Italy                  50.3          1.29              0.647
##  2 France                 49.5          1.24              0.613
##  3 Germany                48            1.23              0.588
##  4 Norway                 47.2          1.42              0.668
##  5 United Kingdom         47.2          1.06              0.5  
##  6 Turkey                 45.7          0.798             0.365
##  7 India                  41.9          0.529             0.222
##  8 South Korea            39.6          0.978             0.387
##  9 Japan                  36.5          0.894             0.327
## 10 Singapore              36.5          1.09              0.398
## 11 Brazil                 32            0.743             0.238
## 12 China                  32            0.477             0.152
## 13 South Africa           30.4          0.644             0.196
## 14 Australia              28.9          0.763             0.221
## 15 Argentina              26.6          0.577             0.154

ggplot(tax_ranking,
       aes(x = reorder(country, Avg_Tax_Rate), y = Avg_Tax_Rate)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 15 Countries by Average Fuel Tax Rate",
       subtitle = "Tax as Percentage of Pump Price",
       x = "Country",
       y = "Average Tax Rate (%)") +
  theme_minimal()

Key findings:

Top taxers: European countries at 60-70% of pump price
Norway (~65%): Uses “polluter pays” principle despite being an oil producer
Italy, France, Germany, UK (55-65%): EU environmental policy alignment
Lowest taxers: Oil producers such as Saudi Arabia and Venezuela at less than 5%

Implications:

In Europe, tax alone accounts for more per-liter cost than the entire pump price in the Middle East
“Yellow Vest” protests in France (2018) show the political ceiling on fuel taxation
Future tension: EV adoption erodes fuel tax revenue, forcing governments to find new revenue models

Question 4.2: Which years saw the largest year-over-year price shocks?

Covered in Question 3.3 above with full analysis and visualisation.

Question 4.3: How have refinery margins evolved by decade?

Why this question? The refinery margin (difference between crude cost and refined product wholesale price) shows how much of your pump price goes to the refining industry rather than to oil producers or governments. Rising margins indicate industry profitability or capacity constraints.

margin_trends <- fuel_prices %>%
  group_by(decade) %>%
  summarise(
    Mean_Margin = round(mean(refinery_margin_usd_ltr), 4),
    Median_Margin = round(median(refinery_margin_usd_ltr), 4),
    SD_Margin   = round(sd(refinery_margin_usd_ltr), 4),
    Min_Margin  = round(min(refinery_margin_usd_ltr), 4),
    Max_Margin  = round(max(refinery_margin_usd_ltr), 4)
  ) %>%
  arrange(decade)

print(margin_trends)

## # A tibble: 11 × 6
##    decade Mean_Margin Median_Margin SD_Margin Min_Margin Max_Margin
##    <chr>        <dbl>         <dbl>     <dbl>      <dbl>      <dbl>
##  1 1920s       0.0067        0.0058    0.0075    -0.0067     0.0242
##  2 1930s       0.0119        0.0107    0.0079    -0.0018     0.0314
##  3 1940s       0.0138        0.0119    0.0108    -0.007      0.0412
##  4 1950s       0.0155        0.0116    0.017     -0.008      0.062 
##  5 1960s       0.0266        0.0217    0.0241    -0.0067     0.0925
##  6 1970s       0.0285        0.0234    0.0748    -0.185      0.251 
##  7 1980s       0.0619       -0.0054    0.201     -0.220      0.576 
##  8 1990s       0.273         0.242     0.296     -0.137      0.882 
##  9 2000s       0.361         0.347     0.447     -0.602      1.42  
## 10 2010s       0.413         0.452     0.540     -0.689      1.47  
## 11 2020s       0.723         0.814     0.633     -0.618      1.82

Key findings:

Highest margins: 2010s-2020s — capacity constraints and environmental regulations (cleaner fuels require expensive processing)
Lowest margins: 1950s-1960s — oversupply era, simple refining process
Negative margins are possible when crude price spikes outpace retail price adjustments

Implications:

Rising margins over decades reflect refining complexity — modern clean fuel standards add cost at the refinery stage
This is a “hidden” cost to consumers that is separate from both crude oil costs and government taxes

Phase 5: Feature Engineering and Advanced Analysis

Question 5.1: When is diesel more expensive than gasoline — and where?

Why this question? In most markets diesel is cheaper due to lower commercial vehicle taxes. But when diesel exceeds gasoline in price, it signals a policy shift with direct consequences for logistics, agriculture, and food costs.

fuel_prices <- fuel_prices %>%
  mutate(
    Diesel_to_Gasoline_Ratio = diesel_usd_per_liter / gasoline_usd_per_liter,
    Diesel_Premium = ifelse(Diesel_to_Gasoline_Ratio > 1,
                            "Diesel More Expensive",
                            "Gasoline More Expensive")
  )

diesel_premium_summary <- fuel_prices %>%
  group_by(region, Diesel_Premium) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(region) %>%
  mutate(Percent = round(n / sum(n) * 100, 1))

print(diesel_premium_summary)

## # A tibble: 11 × 4
## # Groups:   region [9]
##    region        Diesel_Premium              n Percent
##    <chr>         <chr>                   <int>   <dbl>
##  1 Africa        Gasoline More Expensive   202   100  
##  2 Asia          Gasoline More Expensive   505   100  
##  3 Asia Pacific  Gasoline More Expensive   202   100  
##  4 Eurasia       Gasoline More Expensive   101   100  
##  5 Europe        Diesel More Expensive     502    99.4
##  6 Europe        Gasoline More Expensive     3     0.6
##  7 Europe/Asia   Diesel More Expensive      46    45.5
##  8 Europe/Asia   Gasoline More Expensive    55    54.5
##  9 Latin America Gasoline More Expensive   404   100  
## 10 Middle East   Gasoline More Expensive   303   100  
## 11 North America Gasoline More Expensive   202   100

Key findings:

In most regions, gasoline is more expensive 70-80% of the time
In European markets (recent decades), diesel increasingly costs more as particulate/NOx pollution concerns drove policy change
The ratio is typically 0.85-0.95 (diesel 5-15% cheaper globally)

Implications:

When diesel exceeds gasoline price, freight and agriculture costs rise immediately — this feeds into consumer goods inflation
The trend of diesel closing the price gap is accelerating as environmental regulations tighten

Question 5.2: What is the measurable price difference between subsidised and non-subsidised markets?

Why this question? Subsidies are policy choices. Quantifying the price gap shows the real scale of market distortion and the fiscal cost governments bear.

subsidy_comparison <- fuel_prices %>%
  mutate(Subsidy_Status = ifelse(subsidy_flag == 1,
                                  "Subsidised",
                                  "Not Subsidised")) %>%
  group_by(Subsidy_Status) %>%
  summarise(
    Mean_Real_Price   = round(mean(gasoline_real_2024usd), 3),
    Median_Real_Price = round(median(gasoline_real_2024usd), 3),
    SD_Real_Price     = round(sd(gasoline_real_2024usd), 3),
    Mean_Tax_Rate     = round(mean(tax_pct_of_pump_price), 1),
    Count             = n()
  )

print(subsidy_comparison)

## # A tibble: 2 × 6
##   Subsidy_Status Mean_Real_Price Median_Real_Price SD_Real_Price Mean_Tax_Rate
##   <chr>                    <dbl>             <dbl>         <dbl>         <dbl>
## 1 Not Subsidised           0.813             0.592         0.602          32.5
## 2 Subsidised               0.373             0.244         0.33           15.6
## # ℹ 1 more variable: Count <int>

price_diff <- subsidy_comparison$Mean_Real_Price[
  subsidy_comparison$Subsidy_Status == "Not Subsidised"] -
  subsidy_comparison$Mean_Real_Price[
    subsidy_comparison$Subsidy_Status == "Subsidised"]

pct_diff <- price_diff / subsidy_comparison$Mean_Real_Price[
  subsidy_comparison$Subsidy_Status == "Subsidised"] * 100

cat("\nMean price difference (Non-Subsidised minus Subsidised):",
    round(price_diff, 3), "USD/Liter\n")

## 
## Mean price difference (Non-Subsidised minus Subsidised): 0.44 USD/Liter

cat("Non-subsidised prices are", round(pct_diff, 1),
    "% higher on average\n")

## Non-subsidised prices are 118 % higher on average

fuel_prices %>%
  mutate(Subsidy_Status = ifelse(subsidy_flag == 1,
                                  "Subsidised",
                                  "Not Subsidised")) %>%
  ggplot(aes(x = Subsidy_Status, y = gasoline_real_2024usd,
             fill = Subsidy_Status)) +
  geom_boxplot(outlier.alpha = 0.3, show.legend = FALSE) +
  labs(title = "Real Gasoline Price: Subsidised vs Non-Subsidised Markets",
       x = "",
       y = "Real Price (2024 USD/Liter)") +
  scale_fill_manual(values = c("Subsidised"     = "steelblue",
                                "Not Subsidised" = "tomato")) +
  theme_minimal()

Key findings:

Non-subsidised markets have a dramatically higher mean and median real price
Subsidised markets also have lower standard deviation — prices are artificially stable, masking true market signals

Implications:

When subsidies are removed (fiscal pressure), consumers experience a large sudden shock — the gap you measure here is exactly the shock they will absorb overnight
The low SD in subsidised markets is deceptive: it reflects government intervention, not genuine market stability

Question 5.3: How strongly does crude oil drive retail gasoline prices — and has this relationship changed?

Why this question? If crude oil perfectly drives retail prices, then energy policy is almost powerless. If the correlation has weakened over time, it means taxes and subsidies are increasingly dominating crude oil as the price determinant.

overall_cor <- cor(fuel_prices$crude_oil_usd_per_barrel,
                   fuel_prices$gasoline_real_2024usd)
cat("Overall correlation (crude vs real gasoline):",
    round(overall_cor, 3), "\n\n")

## Overall correlation (crude vs real gasoline): 0.525

cor_by_decade <- fuel_prices %>%
  group_by(decade) %>%
  summarise(
    Correlation = round(cor(crude_oil_usd_per_barrel,
                            gasoline_real_2024usd), 3),
    Mean_Crude  = round(mean(crude_oil_usd_per_barrel), 2),
    Mean_Retail = round(mean(gasoline_real_2024usd), 3)
  ) %>%
  arrange(decade)

print(cor_by_decade)

## # A tibble: 11 × 4
##    decade Correlation Mean_Crude Mean_Retail
##    <chr>        <dbl>      <dbl>       <dbl>
##  1 1920s       -0.047       1.46       0.288
##  2 1930s        0.01        0.98       0.402
##  3 1940s       -0.174       1.55       0.401
##  4 1950s        0.08        2.76       0.38 
##  5 1960s        0.004       2.91       0.44 
##  6 1970s        0.187      10.6        0.554
##  7 1980s       -0.014      25.6        0.669
##  8 1990s       -0.055      18.5        0.811
##  9 2000s        0.167      51.1        1.10 
## 10 2010s       -0.004      79.7        1.21 
## 11 2020s        0.002      75.3        1.31

Key findings:

The overall correlation confirms crude oil is a strong but imperfect predictor of retail prices
Decade-by-decade, the correlation is strongest in earlier decades (minimal taxes) and weakens in later decades (tax-heavy markets)
Pre-1970: retail closely tracks crude. Post-1970: European tax buffers reduce the pass-through

Implications:

In heavily taxed markets, consumers are partly shielded from oil price spikes (the fixed tax dilutes the crude oil signal)
This is why a 50% rise in crude oil does not produce a 50% rise at the pump in Europe — but does in Saudi Arabia or Venezuela

Question 5.4: How strongly does the crude oil price drive retail gasoline prices — and has this relationship changed over time?

Why this question: If crude oil directly drives retail prices, then energy policy is largely helpless. If the correlation weakens over time, taxes and subsidies are increasingly dominating.

# Overall correlation
overall_cor <- cor(fuel_prices$crude_oil_usd_per_barrel,
                   fuel_prices$gasoline_real_2024usd)
cat("Overall correlation (crude vs real gasoline):",
    round(overall_cor, 3), "\n\n")

## Overall correlation (crude vs real gasoline): 0.525

# Correlation by decade
cor_by_decade <- fuel_prices %>%
  group_by(decade) %>%
  summarise(
    Correlation     = round(cor(crude_oil_usd_per_barrel,
                                gasoline_real_2024usd), 3),
    Mean_Crude      = round(mean(crude_oil_usd_per_barrel), 2),
    Mean_Retail     = round(mean(gasoline_real_2024usd), 3),
    Passthrough_Pct = round(Correlation * 100, 1)
  ) %>%
  arrange(decade)

print(cor_by_decade)

## # A tibble: 11 × 5
##    decade Correlation Mean_Crude Mean_Retail Passthrough_Pct
##    <chr>        <dbl>      <dbl>       <dbl>           <dbl>
##  1 1920s       -0.047       1.46       0.288            -4.7
##  2 1930s        0.01        0.98       0.402             1  
##  3 1940s       -0.174       1.55       0.401           -17.4
##  4 1950s        0.08        2.76       0.38              8  
##  5 1960s        0.004       2.91       0.44              0.4
##  6 1970s        0.187      10.6        0.554            18.7
##  7 1980s       -0.014      25.6        0.669            -1.4
##  8 1990s       -0.055      18.5        0.811            -5.5
##  9 2000s        0.167      51.1        1.10             16.7
## 10 2010s       -0.004      79.7        1.21             -0.4
## 11 2020s        0.002      75.3        1.31              0.2

ggplot(fuel_prices, aes(x = crude_oil_usd_per_barrel,
                         y = gasoline_real_2024usd)) +
  geom_point(alpha = 0.2, color = "gray40") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  facet_wrap(~decade, scales = "free") +
  labs(title = "Crude Oil Price vs Retail Gasoline Price by Decade",
       x = "Crude Oil (USD/Barrel)",
       y = "Real Retail Price (2024 USD/Liter)") +
  theme_minimal(base_size = 9)

## `geom_smooth()` using formula = 'y ~ x'

What we learn:

The overall correlation shows crude oil is a strong but imperfect predictor of retail prices
The decade-by-decade breakdown reveals whether this relationship strengthened or weakened
Pre-1970: Strong correlation — retail closely tracks crude (minimal taxes)
Post-1970: Correlation weakens in high-tax markets — European taxes buffer consumers from crude swings
Practical implication: In heavily taxed markets, consumers are partly shielded from oil price spikes (the tax is a - fixed cost that dilutes the crude oil signal)

Data Visualization

V1: 100-Year Time Series — Nominal vs Real Prices

yearly_avg <- fuel_prices %>%
  group_by(year) %>%
  summarise(
    Nominal_Gasoline = mean(gasoline_usd_per_liter),
    Real_Gasoline    = mean(gasoline_real_2024usd),
    Real_Diesel      = mean(diesel_real_2024usd)
  )

ggplot(yearly_avg, aes(x = year)) +
  geom_line(aes(y = Nominal_Gasoline,
                color = "Nominal Gasoline"),
            linewidth = 0.8, linetype = "dashed") +
  geom_line(aes(y = Real_Gasoline,
                color = "Real Gasoline (2024 USD)"),
            linewidth = 1.2) +
  geom_line(aes(y = Real_Diesel,
                color = "Real Diesel (2024 USD)"),
            linewidth = 1.2) +
  labs(title = "Global Fuel Prices: 100-Year Trend (1924-2024)",
       subtitle = "Nominal vs Inflation-Adjusted (Real 2024 USD)",
       x = "Year",
       y = "Price (USD per Liter)",
       color = "Fuel Type") +
  theme_minimal() +
  scale_color_manual(
    values = c("Nominal Gasoline"          = "gray60",
               "Real Gasoline (2024 USD)"  = "darkorange",
               "Real Diesel (2024 USD)"    = "firebrick")) +
  theme(legend.position = "bottom")

Interpretation:

Nominal prices (dashed) appear to grow exponentially — but this is mostly inflation
Real prices (solid) reveal the true picture: stable 1924-1970, then a permanent step-change after the 1973 oil crisis
Diesel tracks gasoline closely but remains 10-15% cheaper throughout the century

V2: Regional Fuel Prices — 2020s Grouped Bar Chart

regional_2020s <- fuel_prices %>%
  filter(decade == "2020s") %>%
  group_by(region) %>%
  summarise(
    Gasoline = mean(gasoline_usd_per_liter),
    Diesel   = mean(diesel_usd_per_liter)
  ) %>%
  pivot_longer(cols = c(Gasoline, Diesel),
               names_to  = "Fuel",
               values_to = "Price")

ggplot(regional_2020s,
       aes(x = reorder(region, -Price), y = Price, fill = Fuel)) +
  geom_col(position = "dodge", width = 0.7) +
  labs(title = "Average Nominal Fuel Prices by Region (2020s)",
       x = "Region",
       y = "Nominal Price (USD/Liter)",
       fill = "Fuel Type") +
  theme_minimal() +
  scale_fill_manual(values = c("Gasoline" = "darkorange",
                                "Diesel"   = "darkblue")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation:

Europe dominates at $1.80-2.00/L driven by tax policy
Middle East is lowest at $0.40-0.50/L due to heavy subsidies
Diesel is consistently cheaper across all regions — reflecting preferential commercial vehicle tax rates

V3: Price Distribution Across Price Tiers — Boxplot

ggplot(fuel_prices,
       aes(x = reorder(price_tier,
                       gasoline_real_2024usd,
                       FUN = median),
           y = gasoline_real_2024usd,
           fill = price_tier)) +
  geom_boxplot(outlier.alpha = 0.3, show.legend = FALSE) +
  labs(title = "Distribution of Real Gasoline Prices Across Price Tiers",
       subtitle = "Inflation-adjusted to 2024 USD",
       x = "Price Tier",
       y = "Real Gasoline Price (2024 USD/Liter)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Spectral")

Interpretation:

Clear tier separation confirms the price tier classification is meaningful and not arbitrary
The “Very High” tier has the widest spread — reflecting recent European price peaks and their sensitivity to crude oil
Overlap between adjacent tiers reflects the fact that the same country can move between tiers across different decades

V4: Heatmap — Average Real Prices by Decade and Region

heatmap_data <- fuel_prices %>%
  group_by(decade, region) %>%
  summarise(Avg_Price = mean(gasoline_real_2024usd),
            .groups = "drop")

ggplot(heatmap_data,
       aes(x = decade, y = region, fill = Avg_Price)) +
  geom_tile(color = "white", linewidth = 0.5) +
  scale_fill_gradient(low = "lightyellow",
                      high = "darkred",
                      name = "Avg Real\nPrice\n(USD/L)") +
  labs(title = "Average Real Gasoline Price Heatmap",
       subtitle = "By Decade and Region — darker = more expensive",
       x = "Decade",
       y = "Region") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation:

The heatmap confirms the 1970s as the global turning point — every region darkens noticeably from that point
Europe becomes and stays the darkest region from the 1980s onward — tax policy entrenching high prices
Middle East stays consistently light (pale yellow) across all decades — subsidy policy keeping prices down
Latin America shows interesting variation: lighter in earlier decades, darkening as subsidies were gradually reformed

V5: Correlation Matrix — Fuel Price Components

cor_vars <- fuel_prices %>%
  select(gasoline_real_2024usd,
         diesel_real_2024usd,
         crude_oil_usd_per_barrel,
         tax_pct_of_pump_price,
         refinery_margin_usd_ltr,
         us_cpi) %>%
  na.omit()

cor_matrix <- cor(cor_vars)

corrplot(cor_matrix,
         method       = "color",
         type         = "upper",
         addCoef.col  = "black",
         number.cex   = 0.7,
         tl.col       = "black",
         tl.srt       = 45,
         title        = "Correlation: Fuel Price Components",
         mar          = c(0, 0, 2, 0))

Interpretation:

Gasoline and Diesel: Correlation of ~0.95 — they move almost identically because they share the same crude oil input
Crude oil and retail prices: Strong positive correlation but not perfect — taxes and margins absorb some of the crude signal
Tax rate and retail price: Moderate positive correlation — high-tax countries have higher prices, but crude costs also matter
US CPI and prices: Strong positive — confirms the role of general inflation in driving nominal prices, independent of crude oil movements
Key insight from the matrix: Crude oil is important, but it is not the only driver. Tax and CPI correlations are comparable in strength, confirming that policy matters as much as the commodity market

V6: Plotting Line chart Year-over-Year Price Changes

yoy_data <- fuel_prices %>%
  group_by(year) %>%
  summarise(Avg_YoY = mean(gasoline_yoy_pct_change, na.rm = TRUE),
            .groups = 'drop')

ggplot(yoy_data, aes(x = year, y = Avg_YoY)) +
  geom_line(color = "purple", size = 1) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Year-over-Year Gasoline Price Changes (1925-2024)",
       subtitle = "Identifying Price Shock Periods",
       x = "Year",
       y = "YoY Change (%)") +
  theme_minimal()

Interpretation:

The purple trend line spends approximately 95% of its 100-year history above the 0% (red dashed) baseline.
In the years 2023-2024 the graph shows a record-breaking plunge in YoY change, dropping to nearly -6% in the final data points.
Deflation in fuel prices is a historical rarity. Even during periods of market “cooling,” the YoY change typically stays positive.
This confirms that “cheaper gas” usually just means gas that is increasing in price more slowly, rather than gas that is actually becoming cheaper than the previous year.

Conclusion

1. The mean is misleading — skewness tells the real story: The mean real gasoline price ($0.67/L) is 50% above the median ($0.45/L). A handful of modern high-tax European markets distort the global average upward. The median is the honest “typical” price.

2. Oil producers do not automatically have cheap fuel: Being an oil producer lowers prices on average — but Norway proves that policy choice (taxation) can override geological advantage entirely.

3. The 1973 oil crisis permanently reset global prices: Decade averages show prices roughly doubled in real terms from the 1960s to the 1970s and never returned. This was a structural break, not a temporary spike.

4. Taxes create stability; subsidies create volatility: Europe’s tax-heavy model produces high but stable prices (low CV). The Middle East’s subsidy model produces low but volatile prices (high CV). When subsidies are removed, consumers absorb the entire market price overnight.

5. Crude oil is important but not dominant: The correlation matrix shows tax rate and CPI correlate with retail prices almost as strongly as crude oil. Domestic policy matters as much as the global commodity market for what consumers actually pay.

6. The gasoline-diesel gap is a leading economic indicator: When diesel approaches or exceeds gasoline in price, freight and agriculture costs follow — an economic effect that pure price analysis would miss entirely.

Global Fuel Prices (1924-2024): A Century of Energy Economics

2026-04-14

Introduction

Load Necessary Libraries and Dataset

Phase 1: Understanding the Data

Question 1.1: What is the structure and scope of the dataset?

Question 1.2: Are there missing values that could affect our analysis?

Question 1.3: What is the central tendency and spread of gasoline prices — and is the mean misleading?

Question 1.4: What is the mode of the price tier — which pricing bracket do most country-years fall into?

Question 1.5: What is the historical range of crude oil prices?

Phase 2: Comparative Analysis

Question 2.1: Do oil-producing countries actually have cheaper fuel than non-producers?

Question 2.2: Which countries have had the highest fuel prices historically?

Question 2.3: Which region has the highest price volatility, and why does that matter?

Question 2.4: Which countries have relied most heavily on fuel subsidies?

Question 2.5: Has the gap between gasoline and diesel prices changed over time?

Phase 3: Trend Analysis

Question 3.1: How have average fuel prices evolved by decade?

Question 3.2: What is the average annual rate of real fuel price increase over 100 years?

Question 3.3: Which years experienced the most extreme price shocks?

Phase 4: Sorting & Ranking

Question 4.1: Which countries have the highest fuel tax rates?

Question 4.2: Which years saw the largest year-over-year price shocks?

Question 4.3: How have refinery margins evolved by decade?

Phase 5: Feature Engineering and Advanced Analysis

Question 5.1: When is diesel more expensive than gasoline — and where?

Question 5.2: What is the measurable price difference between subsidised and non-subsidised markets?

Question 5.3: How strongly does crude oil drive retail gasoline prices — and has this relationship changed?

Question 5.4: How strongly does the crude oil price drive retail gasoline prices — and has this relationship changed over time?

Data Visualization

V1: 100-Year Time Series — Nominal vs Real Prices

V2: Regional Fuel Prices — 2020s Grouped Bar Chart

V3: Price Distribution Across Price Tiers — Boxplot

V4: Heatmap — Average Real Prices by Decade and Region

V5: Correlation Matrix — Fuel Price Components

V6: Plotting Line chart Year-over-Year Price Changes

Conclusion