This analysis explores 100 years of global fuel pricing data (1924-2024), examining gasoline, diesel, and LPG prices across 25 countries and 9 regions. The dataset includes inflation-adjusted prices, crude oil benchmarks, tax rates, subsidy indicators, and macroeconomic variables. Rather than simply describing the data, each question builds on the last to uncover why prices differ, when they change, and what drives those changes.
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(scales)
fuel_prices <- read.csv("fuel_prices/all_countries_combined.csv")
glimpse(fuel_prices)
## Rows: 2,525
## Columns: 27
## $ year <int> 1924, 1925, 1926, 1927, 1928, 1929, 1930, 193…
## $ country <chr> "Argentina", "Argentina", "Argentina", "Argen…
## $ iso3 <chr> "ARG", "ARG", "ARG", "ARG", "ARG", "ARG", "AR…
## $ currency_code <chr> "ARS", "ARS", "ARS", "ARS", "ARS", "ARS", "AR…
## $ region <chr> "Latin America", "Latin America", "Latin Amer…
## $ decade <chr> "1920s", "1920s", "1920s", "1920s", "1920s", …
## $ gasoline_usd_per_liter <dbl> 0.0124, 0.0121, 0.0125, 0.0126, 0.0131, 0.012…
## $ gasoline_usd_per_gallon <dbl> 0.0469, 0.0458, 0.0473, 0.0477, 0.0496, 0.048…
## $ gasoline_real_2024usd <dbl> 0.2277, 0.2171, 0.2218, 0.2274, 0.2405, 0.235…
## $ diesel_usd_per_liter <dbl> 0.0101, 0.0098, 0.0102, 0.0103, 0.0108, 0.010…
## $ diesel_real_2024usd <dbl> 0.1855, 0.1758, 0.1809, 0.1859, 0.1983, 0.189…
## $ lpg_usd_per_liter <dbl> 0.0059, 0.0056, 0.0062, 0.0062, 0.0064, 0.006…
## $ gasoline_local_currency <dbl> 11.16, 10.89, 11.25, 11.34, 11.79, 11.52, 11.…
## $ gasoline_yoy_pct_change <dbl> NA, -2.42, 3.31, 0.80, 3.97, -2.29, 1.56, 5.3…
## $ diesel_yoy_pct_change <dbl> NA, -2.97, 4.08, 0.98, 4.85, -4.63, 4.85, 0.0…
## $ price_tier <chr> "Very Low (<$0.30)", "Very Low (<$0.30)", "Ve…
## $ tax_pct_of_pump_price <dbl> 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 1…
## $ pretax_price_usd_ltr <dbl> 0.0107, 0.0104, 0.0108, 0.0108, 0.0113, 0.011…
## $ subsidy_flag <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ subsidy_regime <chr> "heavy", "heavy", "heavy", "heavy", "heavy", …
## $ is_oil_producer <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ crude_oil_usd_per_barrel <dbl> 1.43, 1.68, 1.88, 1.30, 1.17, 1.27, 1.19, 0.6…
## $ crude_oil_usd_per_liter <dbl> 0.0090, 0.0106, 0.0118, 0.0082, 0.0074, 0.008…
## $ refinery_margin_usd_ltr <dbl> 0.0034, 0.0015, 0.0007, 0.0044, 0.0057, 0.004…
## $ oil_price_shock_idx <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ us_cpi <dbl> 17.1, 17.5, 17.7, 17.4, 17.1, 17.1, 16.7, 15.…
## $ inflation_deflator_2024 <dbl> 18.3626, 17.9429, 17.7401, 18.0460, 18.3626, …
Why this question? Before any analysis, we must confirm what data we have, its temporal and geographic coverage, and whether variable types are appropriate for the analysis we plan to do.
str(fuel_prices)
## 'data.frame': 2525 obs. of 27 variables:
## $ year : int 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 ...
## $ country : chr "Argentina" "Argentina" "Argentina" "Argentina" ...
## $ iso3 : chr "ARG" "ARG" "ARG" "ARG" ...
## $ currency_code : chr "ARS" "ARS" "ARS" "ARS" ...
## $ region : chr "Latin America" "Latin America" "Latin America" "Latin America" ...
## $ decade : chr "1920s" "1920s" "1920s" "1920s" ...
## $ gasoline_usd_per_liter : num 0.0124 0.0121 0.0125 0.0126 0.0131 0.0128 0.013 0.0137 0.0137 0.0145 ...
## $ gasoline_usd_per_gallon : num 0.0469 0.0458 0.0473 0.0477 0.0496 0.0485 0.0492 0.0519 0.0519 0.0549 ...
## $ gasoline_real_2024usd : num 0.228 0.217 0.222 0.227 0.24 ...
## $ diesel_usd_per_liter : num 0.0101 0.0098 0.0102 0.0103 0.0108 0.0103 0.0108 0.0108 0.011 0.0116 ...
## $ diesel_real_2024usd : num 0.185 0.176 0.181 0.186 0.198 ...
## $ lpg_usd_per_liter : num 0.0059 0.0056 0.0062 0.0062 0.0064 0.0061 0.0061 0.0068 0.0066 0.007 ...
## $ gasoline_local_currency : num 11.2 10.9 11.2 11.3 11.8 ...
## $ gasoline_yoy_pct_change : num NA -2.42 3.31 0.8 3.97 -2.29 1.56 5.38 0 5.84 ...
## $ diesel_yoy_pct_change : num NA -2.97 4.08 0.98 4.85 -4.63 4.85 0 1.85 5.45 ...
## $ price_tier : chr "Very Low (<$0.30)" "Very Low (<$0.30)" "Very Low (<$0.30)" "Very Low (<$0.30)" ...
## $ tax_pct_of_pump_price : num 14 14 14 14 14 14 14 14 14 14 ...
## $ pretax_price_usd_ltr : num 0.0107 0.0104 0.0108 0.0108 0.0113 0.011 0.0112 0.0118 0.0118 0.0125 ...
## $ subsidy_flag : int 0 0 0 0 0 0 0 0 0 0 ...
## $ subsidy_regime : chr "heavy" "heavy" "heavy" "heavy" ...
## $ is_oil_producer : int 1 1 1 1 1 1 1 1 1 1 ...
## $ crude_oil_usd_per_barrel: num 1.43 1.68 1.88 1.3 1.17 1.27 1.19 0.65 0.87 0.67 ...
## $ crude_oil_usd_per_liter : num 0.009 0.0106 0.0118 0.0082 0.0074 0.008 0.0075 0.0041 0.0055 0.0042 ...
## $ refinery_margin_usd_ltr : num 0.0034 0.0015 0.0007 0.0044 0.0057 0.0048 0.0055 0.0096 0.0082 0.0103 ...
## $ oil_price_shock_idx : num 1 1 1 1 1 1 1 1 1 1 ...
## $ us_cpi : num 17.1 17.5 17.7 17.4 17.1 17.1 16.7 15.2 13.7 13 ...
## $ inflation_deflator_2024 : num 18.4 17.9 17.7 18 18.4 ...
cat("Dataset Dimensions:", nrow(fuel_prices), "rows x",
ncol(fuel_prices), "columns\n")
## Dataset Dimensions: 2525 rows x 27 columns
cat("Time Period:", min(fuel_prices$year),
"to", max(fuel_prices$year), "\n")
## Time Period: 1924 to 2024
cat("Countries Covered:", n_distinct(fuel_prices$country), "\n")
## Countries Covered: 25
cat("Regions Covered:", n_distinct(fuel_prices$region), "\n")
## Regions Covered: 9
Key findings:
Implications:
Why this question? Missing data can bias results or limit certain analyses. Understanding gaps helps determine appropriate analytical methods.
missing_summary <- colSums(is.na(fuel_prices))
missing_df <- data.frame(
Variable = names(missing_summary),
Missing_Count = missing_summary,
Percent_Missing = round((missing_summary / nrow(fuel_prices)) * 100, 2)
) %>%
filter(Missing_Count > 0) %>%
arrange(desc(Missing_Count))
print(missing_df)
## Variable Missing_Count Percent_Missing
## gasoline_yoy_pct_change gasoline_yoy_pct_change 25 0.99
## diesel_yoy_pct_change diesel_yoy_pct_change 25 0.99
Key findings:
gasoline_yoy_pct_change and
diesel_yoy_pct_change (0.99% of data)Implications:
Why this question? Mean alone is not enough. When mean and median are very different, the data is skewed — meaning the average is being pulled by extreme values and does not represent a typical observation. Standard deviation tells us how spread out prices are across the full 100-year range.
fuel_prices %>%
summarise(
Mean_Nominal = round(mean(gasoline_usd_per_liter), 4),
Median_Nominal = round(median(gasoline_usd_per_liter), 4),
SD_Nominal = round(sd(gasoline_usd_per_liter), 4),
Mean_Real = round(mean(gasoline_real_2024usd), 4),
Median_Real = round(median(gasoline_real_2024usd), 4),
SD_Real = round(sd(gasoline_real_2024usd), 4),
Min_Real = round(min(gasoline_real_2024usd), 4),
Max_Real = round(max(gasoline_real_2024usd), 4)
)
## Mean_Nominal Median_Nominal SD_Nominal Mean_Real Median_Real SD_Real Min_Real
## 1 0.3001 0.0551 0.4582 0.6725 0.4528 0.5686 0.0123
## Max_Real
## 1 2.6502
Key findings:
Implications:
Why this question? Price tier is a categorical variable. The mode (most frequent category) tells us what “normal” looks like globally across the full dataset. This is especially useful when the numeric mean is misleading due to skewness.
tier_counts <- fuel_prices %>%
count(price_tier, sort = TRUE) %>%
mutate(Percentage = round(n / sum(n) * 100, 1))
print(tier_counts)
## price_tier n Percentage
## 1 Very Low (<$0.30) 1790 70.9
## 2 Low ($0.30–$0.70) 305 12.1
## 3 Medium ($0.70–$1.10) 207 8.2
## 4 High ($1.10–$1.60) 162 6.4
## 5 Very High (>$1.60) 61 2.4
cat("\nMode (most frequent price tier):", tier_counts$price_tier[1], "\n")
##
## Mode (most frequent price tier): Very Low (<$0.30)
cat("Appearing in", tier_counts$n[1], "observations (",
tier_counts$Percentage[1], "% of all data)\n")
## Appearing in 1790 observations ( 70.9 % of all data)
Key findings:
Implications:
Why this question? Crude oil is the primary input cost for gasoline and diesel. Understanding its volatility contextualises all retail price movements throughout the century.
oil_summary <- fuel_prices %>%
summarise(
Min_Oil = min(crude_oil_usd_per_barrel),
Max_Oil = max(crude_oil_usd_per_barrel),
Mean_Oil = round(mean(crude_oil_usd_per_barrel), 2),
Median_Oil = round(median(crude_oil_usd_per_barrel), 2),
SD_Oil = round(sd(crude_oil_usd_per_barrel), 2)
)
print(oil_summary)
## Min_Oil Max_Oil Mean_Oil Median_Oil SD_Oil
## 1 0.65 112.01 23 9.35 29.98
# Identify years of extreme crude prices
fuel_prices %>%
filter(crude_oil_usd_per_barrel == min(crude_oil_usd_per_barrel) |
crude_oil_usd_per_barrel == max(crude_oil_usd_per_barrel)) %>%
select(year, country, crude_oil_usd_per_barrel) %>%
distinct(year, .keep_all = TRUE)
## year country crude_oil_usd_per_barrel
## 1 1931 Argentina 0.65
## 2 2012 Argentina 112.01
Key findings:
Implications:
Why this question? This tests a logical hypothesis — countries that extract oil should face lower domestic prices due to direct supply access and subsidies. The data either confirms or challenges this assumption.
producer_stats <- fuel_prices %>%
mutate(Producer = ifelse(is_oil_producer == 1,
"Oil Producer", "Non-Producer")) %>%
group_by(Producer) %>%
summarise(
Mean_Real_Price = round(mean(gasoline_real_2024usd), 3),
Median_Real_Price = round(median(gasoline_real_2024usd), 3),
SD_Real_Price = round(sd(gasoline_real_2024usd), 3),
Mean_Tax_Rate = round(mean(tax_pct_of_pump_price), 1),
Subsidy_Pct = round(mean(subsidy_flag) * 100, 1),
Count = n()
)
print(producer_stats)
## # A tibble: 2 × 7
## Producer Mean_Real_Price Median_Real_Price SD_Real_Price Mean_Tax_Rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Non-Producer 1.07 0.856 0.601 43.7
## 2 Oil Producer 0.517 0.35 0.471 20.7
## # ℹ 2 more variables: Subsidy_Pct <dbl>, Count <int>
fuel_prices %>%
mutate(Producer = ifelse(is_oil_producer == 1,
"Oil Producer", "Non-Producer")) %>%
ggplot(aes(x = Producer, y = gasoline_real_2024usd, fill = Producer)) +
geom_boxplot(outlier.alpha = 0.3, show.legend = FALSE) +
labs(title = "Real Gasoline Price: Oil Producers vs Non-Producers",
subtitle = "Inflation-adjusted to 2024 USD",
x = "",
y = "Real Price (USD/Liter)") +
scale_fill_manual(values = c("Oil Producer" = "darkgreen",
"Non-Producer" = "steelblue")) +
theme_minimal()
Key findings:
Implications:
Why this question? Identifying consistently expensive markets reveals the long-run impact of taxation, geography, and energy policy — not just short-term price spikes.
top_expensive <- fuel_prices %>%
group_by(country) %>%
summarise(
Avg_Real_Gasoline = round(mean(gasoline_real_2024usd), 3),
Avg_Tax_Percent = round(mean(tax_pct_of_pump_price), 1),
Is_Oil_Producer = max(is_oil_producer)
) %>%
arrange(desc(Avg_Real_Gasoline)) %>%
head(10)
print(top_expensive)
## # A tibble: 10 × 4
## country Avg_Real_Gasoline Avg_Tax_Percent Is_Oil_Producer
## <chr> <dbl> <dbl> <int>
## 1 Norway 1.42 47.2 1
## 2 Italy 1.29 50.3 0
## 3 France 1.24 49.5 0
## 4 Germany 1.23 48 0
## 5 Singapore 1.09 36.5 0
## 6 United Kingdom 1.06 47.2 1
## 7 South Korea 0.978 39.6 0
## 8 Japan 0.894 36.5 0
## 9 Turkey 0.798 45.7 0
## 10 Australia 0.763 28.9 1
Key findings:
Implications:
Why this question? Volatility (measured by Coefficient of Variation) measures economic risk and instability. High volatility means consumers face unpredictable costs, which disrupts household budgets and business planning far more than a consistently high-but-stable price.
regional_volatility <- fuel_prices %>%
group_by(region) %>%
summarise(
Mean_Price = round(mean(gasoline_real_2024usd), 3),
Median_Price = round(median(gasoline_real_2024usd), 3),
SD_Price = round(sd(gasoline_real_2024usd), 3),
CV = round(sd(gasoline_real_2024usd) /
mean(gasoline_real_2024usd) * 100, 1),
Min_Price = round(min(gasoline_real_2024usd), 3),
Max_Price = round(max(gasoline_real_2024usd), 3)
) %>%
arrange(desc(CV))
print(regional_volatility)
## # A tibble: 9 × 7
## region Mean_Price Median_Price SD_Price CV Min_Price Max_Price
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Middle East 0.172 0.14 0.163 94.7 0.029 0.92
## 2 Asia 0.681 0.39 0.557 81.7 0.172 2.51
## 3 Latin America 0.496 0.374 0.403 81.3 0.012 1.78
## 4 Africa 0.46 0.319 0.342 74.2 0.145 1.74
## 5 Eurasia 0.315 0.212 0.223 70.9 0.075 0.912
## 6 Europe/Asia 0.798 0.53 0.54 67.6 0.271 2.08
## 7 North America 0.493 0.391 0.304 61.8 0.202 1.72
## 8 Asia Pacific 0.829 0.629 0.409 49.3 0.321 1.73
## 9 Europe 1.25 0.971 0.609 48.9 0.402 2.65
ggplot(regional_volatility,
aes(x = reorder(region, CV), y = CV, fill = CV)) +
geom_col(show.legend = FALSE) +
coord_flip() +
scale_fill_gradient(low = "lightblue", high = "darkred") +
labs(title = "Fuel Price Volatility by Region",
subtitle = "Coefficient of Variation (SD / Mean x 100) — higher = more volatile",
x = "Region",
y = "Coefficient of Variation (%)") +
theme_minimal()
Key findings:
Implications:
Why this question? Subsidies artificially lower prices but create fiscal burden. Identifying subsidy-dependent economies reveals long-run policy vulnerabilities.
subsidy_reliance <- fuel_prices %>%
group_by(country) %>%
summarise(
Percent_Years_Subsidized = round(mean(subsidy_flag) * 100, 1),
Avg_Real_Price = round(mean(gasoline_real_2024usd), 3),
Subsidy_Regime = names(sort(table(subsidy_regime),
decreasing = TRUE))[1]
) %>%
filter(Percent_Years_Subsidized > 0) %>%
arrange(desc(Percent_Years_Subsidized)) %>%
head(10)
print(subsidy_reliance)
## # A tibble: 10 × 4
## country Percent_Years_Subsidized Avg_Real_Price Subsidy_Regime
## <chr> <dbl> <dbl> <chr>
## 1 Argentina 74.3 0.577 heavy
## 2 Indonesia 74.3 0.332 heavy
## 3 Iran 74.3 0.092 heavy
## 4 Nigeria 74.3 0.276 heavy
## 5 Saudi Arabia 74.3 0.168 heavy
## 6 UAE 74.3 0.257 heavy
## 7 Venezuela 74.3 0.123 heavy
## 8 Brazil 55.4 0.743 partial
## 9 China 55.4 0.477 partial
## 10 India 55.4 0.529 partial
Key findings:
Implications:
Why this question? Diesel powers commercial vehicles, freight, and agriculture. If diesel becomes relatively more expensive, it raises the cost of food and goods transportation — an economic knock-on effect that pure fuel price analysis would miss.
fuel_gap <- fuel_prices %>%
mutate(Price_Gap = gasoline_real_2024usd - diesel_real_2024usd) %>%
group_by(year) %>%
summarise(
Mean_Gasoline = round(mean(gasoline_real_2024usd), 4),
Mean_Diesel = round(mean(diesel_real_2024usd), 4),
Mean_Gap = round(mean(Price_Gap), 4)
)
cat("Gasoline-Diesel Gap Statistics (Real USD):\n")
## Gasoline-Diesel Gap Statistics (Real USD):
cat("Overall Mean Gap: ", round(mean(fuel_gap$Mean_Gap), 4), "\n")
## Overall Mean Gap: 0.0339
cat("Max Gap: ", round(max(fuel_gap$Mean_Gap), 4), "\n")
## Max Gap: 0.0851
cat("Min Gap: ", round(min(fuel_gap$Mean_Gap), 4), "\n")
## Min Gap: 0.0166
ggplot(fuel_gap, aes(x = year)) +
geom_line(aes(y = Mean_Gasoline, color = "Gasoline"), linewidth = 1) +
geom_line(aes(y = Mean_Diesel, color = "Diesel"), linewidth = 1) +
geom_ribbon(aes(ymin = Mean_Diesel, ymax = Mean_Gasoline),
alpha = 0.2, fill = "orange") +
labs(title = "Gasoline vs Diesel Real Prices Over Time (1924-2024)",
subtitle = "Shaded area = price gap between the two fuels",
x = "Year",
y = "Real Price (2024 USD/Liter)",
color = "Fuel Type") +
scale_color_manual(values = c("Gasoline" = "darkorange",
"Diesel" = "firebrick")) +
theme_minimal()
Key findings:
Implications:
Why this question? Decadal averages smooth short-term volatility and reveal structural shifts in energy economics — the moments when the global price floor permanently changed.
decade_trends <- fuel_prices %>%
group_by(decade) %>%
summarise(
Mean_Real_Gasoline = round(mean(gasoline_real_2024usd), 3),
Median_Real_Gasoline = round(median(gasoline_real_2024usd), 3),
SD_Real_Gasoline = round(sd(gasoline_real_2024usd), 3),
Mean_Real_Diesel = round(mean(diesel_real_2024usd), 3),
Mean_Crude_Oil = round(mean(crude_oil_usd_per_barrel), 2),
Mean_Tax_Rate = round(mean(tax_pct_of_pump_price), 1),
n_observations = n()
) %>%
arrange(decade)
print(decade_trends)
## # A tibble: 11 × 8
## decade Mean_Real_Gasoline Median_Real_Gasoline SD_Real_Gasoline
## <chr> <dbl> <dbl> <dbl>
## 1 1920s 0.288 0.276 0.132
## 2 1930s 0.402 0.374 0.178
## 3 1940s 0.401 0.365 0.18
## 4 1950s 0.38 0.336 0.195
## 5 1960s 0.44 0.399 0.232
## 6 1970s 0.554 0.449 0.386
## 7 1980s 0.669 0.509 0.553
## 8 1990s 0.811 0.771 0.601
## 9 2000s 1.10 1.03 0.72
## 10 2010s 1.21 1.22 0.672
## 11 2020s 1.31 1.41 0.697
## # ℹ 4 more variables: Mean_Real_Diesel <dbl>, Mean_Crude_Oil <dbl>,
## # Mean_Tax_Rate <dbl>, n_observations <int>
ggplot(decade_trends, aes(x = decade, y = Mean_Real_Gasoline, group = 1)) +
geom_line(color = "darkorange", linewidth = 1.2) +
geom_point(size = 3) +
labs(title = "Average Real Gasoline Price by Decade (1920s-2020s)",
subtitle = "Adjusted to 2024 USD",
x = "Decade",
y = "Average Real Price (USD/Liter)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Key findings:
Implications:
Why this question? A Compound Annual Growth Rate (CAGR) summarises 100 years of change in a single comparable number, and lets us assess whether fuel became more expensive faster or slower than general inflation.
yearly_global <- fuel_prices %>%
group_by(year) %>%
summarise(
Avg_Real_Gas = mean(gasoline_real_2024usd),
Avg_CPI = mean(us_cpi)
)
start_price <- yearly_global$Avg_Real_Gas[yearly_global$year == 1924]
end_price <- yearly_global$Avg_Real_Gas[yearly_global$year == 2024]
n_years <- 100
cagr_real <- (end_price / start_price)^(1 / n_years) - 1
cat("Real gasoline price 1924:", round(start_price, 3), "USD/L\n")
## Real gasoline price 1924: 0.283 USD/L
cat("Real gasoline price 2024:", round(end_price, 3), "USD/L\n")
## Real gasoline price 2024: 1.166 USD/L
cat("100-year CAGR (real): ", round(cagr_real * 100, 2), "% per year\n")
## 100-year CAGR (real): 1.43 % per year
ggplot(yearly_global, aes(x = year, y = Avg_Real_Gas)) +
geom_line(color = "darkorange", linewidth = 1.1) +
geom_smooth(method = "loess", span = 0.3,
color = "darkred", linetype = "dashed", se = FALSE) +
annotate("rect", xmin = 1973, xmax = 1980,
ymin = -Inf, ymax = Inf,
alpha = 0.15, fill = "red") +
annotate("text", x = 1976, y = 1.1,
label = "Oil\nCrises", size = 3, color = "red") +
annotate("rect", xmin = 2007, xmax = 2009,
ymin = -Inf, ymax = Inf,
alpha = 0.15, fill = "blue") +
annotate("text", x = 2008, y = 1.1,
label = "2008\nCrash", size = 3, color = "blue") +
labs(title = "Global Average Real Gasoline Price (1924-2024)",
subtitle = "Dashed = LOESS trend | Shaded = major shock periods",
x = "Year",
y = "Real Price (2024 USD/Liter)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Key findings:
Implications:
Why this question? Sudden price changes disrupt economies, trigger inflation, and generate political crises. Identifying the worst shock years connects the data to real-world events.
yoy_summary <- fuel_prices %>%
filter(!is.na(gasoline_yoy_pct_change)) %>%
group_by(year) %>%
summarise(
Mean_YoY = round(mean(gasoline_yoy_pct_change), 2),
Median_YoY = round(median(gasoline_yoy_pct_change), 2),
SD_YoY = round(sd(gasoline_yoy_pct_change), 2),
Max_YoY = round(max(gasoline_yoy_pct_change), 2),
Min_YoY = round(min(gasoline_yoy_pct_change), 2)
)
cat("Top 5 years with LARGEST price INCREASES:\n")
## Top 5 years with LARGEST price INCREASES:
yoy_summary %>% arrange(desc(Mean_YoY)) %>% head(5) %>% print()
## # A tibble: 5 × 6
## year Mean_YoY Median_YoY SD_YoY Max_YoY Min_YoY
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1977 11.4 11.3 8.48 23.0 -10.6
## 2 1979 11.2 11.8 9.36 25.1 -16.2
## 3 1980 11.0 11.2 8.41 24.5 -12.6
## 4 1978 10.8 10.6 8.31 23.2 -13.0
## 5 1974 10.8 10.9 4.35 18.6 2.78
cat("\nTop 5 years with LARGEST price DROPS:\n")
##
## Top 5 years with LARGEST price DROPS:
yoy_summary %>% arrange(Mean_YoY) %>% head(5) %>% print()
## # A tibble: 5 × 6
## year Mean_YoY Median_YoY SD_YoY Max_YoY Min_YoY
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2023 -5.88 -10.3 10.4 24.2 -19.2
## 2 2024 -1.82 -2.04 2.47 3.82 -6.03
## 3 2012 -1.25 -1.66 5.03 9.07 -13.1
## 4 2015 -1.18 -1.92 5.24 11.8 -13.9
## 5 2014 -0.4 -0.36 4.95 9.83 -12.9
ggplot(yoy_summary, aes(x = year, y = Mean_YoY)) +
geom_col(aes(fill = Mean_YoY > 0), show.legend = FALSE) +
geom_hline(yintercept = 0, color = "black", linewidth = 0.5) +
scale_fill_manual(values = c("TRUE" = "tomato", "FALSE" = "steelblue")) +
labs(title = "Year-over-Year Gasoline Price Change (%)",
subtitle = "Red = price increase | Blue = price decrease",
x = "Year",
y = "Mean YoY Change (%)") +
theme_minimal()
Key findings:
Implications:
Why this question? Tax is a direct policy lever. Ranking by tax rate separates policy-driven expensive markets from geographically or supply-driven ones.
tax_ranking <- fuel_prices %>%
group_by(country) %>%
summarise(
Avg_Tax_Rate = round(mean(tax_pct_of_pump_price), 1),
Avg_Real_Price = round(mean(gasoline_real_2024usd), 3),
Tax_Per_Liter_USD = round(mean(tax_pct_of_pump_price) / 100 *
mean(gasoline_real_2024usd), 3)
) %>%
arrange(desc(Avg_Tax_Rate)) %>%
head(15)
print(tax_ranking)
## # A tibble: 15 × 4
## country Avg_Tax_Rate Avg_Real_Price Tax_Per_Liter_USD
## <chr> <dbl> <dbl> <dbl>
## 1 Italy 50.3 1.29 0.647
## 2 France 49.5 1.24 0.613
## 3 Germany 48 1.23 0.588
## 4 Norway 47.2 1.42 0.668
## 5 United Kingdom 47.2 1.06 0.5
## 6 Turkey 45.7 0.798 0.365
## 7 India 41.9 0.529 0.222
## 8 South Korea 39.6 0.978 0.387
## 9 Japan 36.5 0.894 0.327
## 10 Singapore 36.5 1.09 0.398
## 11 Brazil 32 0.743 0.238
## 12 China 32 0.477 0.152
## 13 South Africa 30.4 0.644 0.196
## 14 Australia 28.9 0.763 0.221
## 15 Argentina 26.6 0.577 0.154
ggplot(tax_ranking,
aes(x = reorder(country, Avg_Tax_Rate), y = Avg_Tax_Rate)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 15 Countries by Average Fuel Tax Rate",
subtitle = "Tax as Percentage of Pump Price",
x = "Country",
y = "Average Tax Rate (%)") +
theme_minimal()
Key findings:
Implications:
Covered in Question 3.3 above with full analysis and visualisation.
Why this question? The refinery margin (difference between crude cost and refined product wholesale price) shows how much of your pump price goes to the refining industry rather than to oil producers or governments. Rising margins indicate industry profitability or capacity constraints.
margin_trends <- fuel_prices %>%
group_by(decade) %>%
summarise(
Mean_Margin = round(mean(refinery_margin_usd_ltr), 4),
Median_Margin = round(median(refinery_margin_usd_ltr), 4),
SD_Margin = round(sd(refinery_margin_usd_ltr), 4),
Min_Margin = round(min(refinery_margin_usd_ltr), 4),
Max_Margin = round(max(refinery_margin_usd_ltr), 4)
) %>%
arrange(decade)
print(margin_trends)
## # A tibble: 11 × 6
## decade Mean_Margin Median_Margin SD_Margin Min_Margin Max_Margin
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1920s 0.0067 0.0058 0.0075 -0.0067 0.0242
## 2 1930s 0.0119 0.0107 0.0079 -0.0018 0.0314
## 3 1940s 0.0138 0.0119 0.0108 -0.007 0.0412
## 4 1950s 0.0155 0.0116 0.017 -0.008 0.062
## 5 1960s 0.0266 0.0217 0.0241 -0.0067 0.0925
## 6 1970s 0.0285 0.0234 0.0748 -0.185 0.251
## 7 1980s 0.0619 -0.0054 0.201 -0.220 0.576
## 8 1990s 0.273 0.242 0.296 -0.137 0.882
## 9 2000s 0.361 0.347 0.447 -0.602 1.42
## 10 2010s 0.413 0.452 0.540 -0.689 1.47
## 11 2020s 0.723 0.814 0.633 -0.618 1.82
Key findings:
Implications:
Why this question? In most markets diesel is cheaper due to lower commercial vehicle taxes. But when diesel exceeds gasoline in price, it signals a policy shift with direct consequences for logistics, agriculture, and food costs.
fuel_prices <- fuel_prices %>%
mutate(
Diesel_to_Gasoline_Ratio = diesel_usd_per_liter / gasoline_usd_per_liter,
Diesel_Premium = ifelse(Diesel_to_Gasoline_Ratio > 1,
"Diesel More Expensive",
"Gasoline More Expensive")
)
diesel_premium_summary <- fuel_prices %>%
group_by(region, Diesel_Premium) %>%
summarise(n = n(), .groups = "drop") %>%
group_by(region) %>%
mutate(Percent = round(n / sum(n) * 100, 1))
print(diesel_premium_summary)
## # A tibble: 11 × 4
## # Groups: region [9]
## region Diesel_Premium n Percent
## <chr> <chr> <int> <dbl>
## 1 Africa Gasoline More Expensive 202 100
## 2 Asia Gasoline More Expensive 505 100
## 3 Asia Pacific Gasoline More Expensive 202 100
## 4 Eurasia Gasoline More Expensive 101 100
## 5 Europe Diesel More Expensive 502 99.4
## 6 Europe Gasoline More Expensive 3 0.6
## 7 Europe/Asia Diesel More Expensive 46 45.5
## 8 Europe/Asia Gasoline More Expensive 55 54.5
## 9 Latin America Gasoline More Expensive 404 100
## 10 Middle East Gasoline More Expensive 303 100
## 11 North America Gasoline More Expensive 202 100
Key findings:
Implications:
Why this question? Subsidies are policy choices. Quantifying the price gap shows the real scale of market distortion and the fiscal cost governments bear.
subsidy_comparison <- fuel_prices %>%
mutate(Subsidy_Status = ifelse(subsidy_flag == 1,
"Subsidised",
"Not Subsidised")) %>%
group_by(Subsidy_Status) %>%
summarise(
Mean_Real_Price = round(mean(gasoline_real_2024usd), 3),
Median_Real_Price = round(median(gasoline_real_2024usd), 3),
SD_Real_Price = round(sd(gasoline_real_2024usd), 3),
Mean_Tax_Rate = round(mean(tax_pct_of_pump_price), 1),
Count = n()
)
print(subsidy_comparison)
## # A tibble: 2 × 6
## Subsidy_Status Mean_Real_Price Median_Real_Price SD_Real_Price Mean_Tax_Rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Not Subsidised 0.813 0.592 0.602 32.5
## 2 Subsidised 0.373 0.244 0.33 15.6
## # ℹ 1 more variable: Count <int>
price_diff <- subsidy_comparison$Mean_Real_Price[
subsidy_comparison$Subsidy_Status == "Not Subsidised"] -
subsidy_comparison$Mean_Real_Price[
subsidy_comparison$Subsidy_Status == "Subsidised"]
pct_diff <- price_diff / subsidy_comparison$Mean_Real_Price[
subsidy_comparison$Subsidy_Status == "Subsidised"] * 100
cat("\nMean price difference (Non-Subsidised minus Subsidised):",
round(price_diff, 3), "USD/Liter\n")
##
## Mean price difference (Non-Subsidised minus Subsidised): 0.44 USD/Liter
cat("Non-subsidised prices are", round(pct_diff, 1),
"% higher on average\n")
## Non-subsidised prices are 118 % higher on average
fuel_prices %>%
mutate(Subsidy_Status = ifelse(subsidy_flag == 1,
"Subsidised",
"Not Subsidised")) %>%
ggplot(aes(x = Subsidy_Status, y = gasoline_real_2024usd,
fill = Subsidy_Status)) +
geom_boxplot(outlier.alpha = 0.3, show.legend = FALSE) +
labs(title = "Real Gasoline Price: Subsidised vs Non-Subsidised Markets",
x = "",
y = "Real Price (2024 USD/Liter)") +
scale_fill_manual(values = c("Subsidised" = "steelblue",
"Not Subsidised" = "tomato")) +
theme_minimal()
Key findings:
Implications:
Why this question? If crude oil perfectly drives retail prices, then energy policy is almost powerless. If the correlation has weakened over time, it means taxes and subsidies are increasingly dominating crude oil as the price determinant.
overall_cor <- cor(fuel_prices$crude_oil_usd_per_barrel,
fuel_prices$gasoline_real_2024usd)
cat("Overall correlation (crude vs real gasoline):",
round(overall_cor, 3), "\n\n")
## Overall correlation (crude vs real gasoline): 0.525
cor_by_decade <- fuel_prices %>%
group_by(decade) %>%
summarise(
Correlation = round(cor(crude_oil_usd_per_barrel,
gasoline_real_2024usd), 3),
Mean_Crude = round(mean(crude_oil_usd_per_barrel), 2),
Mean_Retail = round(mean(gasoline_real_2024usd), 3)
) %>%
arrange(decade)
print(cor_by_decade)
## # A tibble: 11 × 4
## decade Correlation Mean_Crude Mean_Retail
## <chr> <dbl> <dbl> <dbl>
## 1 1920s -0.047 1.46 0.288
## 2 1930s 0.01 0.98 0.402
## 3 1940s -0.174 1.55 0.401
## 4 1950s 0.08 2.76 0.38
## 5 1960s 0.004 2.91 0.44
## 6 1970s 0.187 10.6 0.554
## 7 1980s -0.014 25.6 0.669
## 8 1990s -0.055 18.5 0.811
## 9 2000s 0.167 51.1 1.10
## 10 2010s -0.004 79.7 1.21
## 11 2020s 0.002 75.3 1.31
Key findings:
Implications:
Why this question: If crude oil directly drives retail prices, then energy policy is largely helpless. If the correlation weakens over time, taxes and subsidies are increasingly dominating.
# Overall correlation
overall_cor <- cor(fuel_prices$crude_oil_usd_per_barrel,
fuel_prices$gasoline_real_2024usd)
cat("Overall correlation (crude vs real gasoline):",
round(overall_cor, 3), "\n\n")
## Overall correlation (crude vs real gasoline): 0.525
# Correlation by decade
cor_by_decade <- fuel_prices %>%
group_by(decade) %>%
summarise(
Correlation = round(cor(crude_oil_usd_per_barrel,
gasoline_real_2024usd), 3),
Mean_Crude = round(mean(crude_oil_usd_per_barrel), 2),
Mean_Retail = round(mean(gasoline_real_2024usd), 3),
Passthrough_Pct = round(Correlation * 100, 1)
) %>%
arrange(decade)
print(cor_by_decade)
## # A tibble: 11 × 5
## decade Correlation Mean_Crude Mean_Retail Passthrough_Pct
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1920s -0.047 1.46 0.288 -4.7
## 2 1930s 0.01 0.98 0.402 1
## 3 1940s -0.174 1.55 0.401 -17.4
## 4 1950s 0.08 2.76 0.38 8
## 5 1960s 0.004 2.91 0.44 0.4
## 6 1970s 0.187 10.6 0.554 18.7
## 7 1980s -0.014 25.6 0.669 -1.4
## 8 1990s -0.055 18.5 0.811 -5.5
## 9 2000s 0.167 51.1 1.10 16.7
## 10 2010s -0.004 79.7 1.21 -0.4
## 11 2020s 0.002 75.3 1.31 0.2
ggplot(fuel_prices, aes(x = crude_oil_usd_per_barrel,
y = gasoline_real_2024usd)) +
geom_point(alpha = 0.2, color = "gray40") +
geom_smooth(method = "lm", color = "red", se = TRUE) +
facet_wrap(~decade, scales = "free") +
labs(title = "Crude Oil Price vs Retail Gasoline Price by Decade",
x = "Crude Oil (USD/Barrel)",
y = "Real Retail Price (2024 USD/Liter)") +
theme_minimal(base_size = 9)
## `geom_smooth()` using formula = 'y ~ x'
What we learn:
yearly_avg <- fuel_prices %>%
group_by(year) %>%
summarise(
Nominal_Gasoline = mean(gasoline_usd_per_liter),
Real_Gasoline = mean(gasoline_real_2024usd),
Real_Diesel = mean(diesel_real_2024usd)
)
ggplot(yearly_avg, aes(x = year)) +
geom_line(aes(y = Nominal_Gasoline,
color = "Nominal Gasoline"),
linewidth = 0.8, linetype = "dashed") +
geom_line(aes(y = Real_Gasoline,
color = "Real Gasoline (2024 USD)"),
linewidth = 1.2) +
geom_line(aes(y = Real_Diesel,
color = "Real Diesel (2024 USD)"),
linewidth = 1.2) +
labs(title = "Global Fuel Prices: 100-Year Trend (1924-2024)",
subtitle = "Nominal vs Inflation-Adjusted (Real 2024 USD)",
x = "Year",
y = "Price (USD per Liter)",
color = "Fuel Type") +
theme_minimal() +
scale_color_manual(
values = c("Nominal Gasoline" = "gray60",
"Real Gasoline (2024 USD)" = "darkorange",
"Real Diesel (2024 USD)" = "firebrick")) +
theme(legend.position = "bottom")
Interpretation:
regional_2020s <- fuel_prices %>%
filter(decade == "2020s") %>%
group_by(region) %>%
summarise(
Gasoline = mean(gasoline_usd_per_liter),
Diesel = mean(diesel_usd_per_liter)
) %>%
pivot_longer(cols = c(Gasoline, Diesel),
names_to = "Fuel",
values_to = "Price")
ggplot(regional_2020s,
aes(x = reorder(region, -Price), y = Price, fill = Fuel)) +
geom_col(position = "dodge", width = 0.7) +
labs(title = "Average Nominal Fuel Prices by Region (2020s)",
x = "Region",
y = "Nominal Price (USD/Liter)",
fill = "Fuel Type") +
theme_minimal() +
scale_fill_manual(values = c("Gasoline" = "darkorange",
"Diesel" = "darkblue")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation:
ggplot(fuel_prices,
aes(x = reorder(price_tier,
gasoline_real_2024usd,
FUN = median),
y = gasoline_real_2024usd,
fill = price_tier)) +
geom_boxplot(outlier.alpha = 0.3, show.legend = FALSE) +
labs(title = "Distribution of Real Gasoline Prices Across Price Tiers",
subtitle = "Inflation-adjusted to 2024 USD",
x = "Price Tier",
y = "Real Gasoline Price (2024 USD/Liter)") +
theme_minimal() +
scale_fill_brewer(palette = "Spectral")
Interpretation:
heatmap_data <- fuel_prices %>%
group_by(decade, region) %>%
summarise(Avg_Price = mean(gasoline_real_2024usd),
.groups = "drop")
ggplot(heatmap_data,
aes(x = decade, y = region, fill = Avg_Price)) +
geom_tile(color = "white", linewidth = 0.5) +
scale_fill_gradient(low = "lightyellow",
high = "darkred",
name = "Avg Real\nPrice\n(USD/L)") +
labs(title = "Average Real Gasoline Price Heatmap",
subtitle = "By Decade and Region — darker = more expensive",
x = "Decade",
y = "Region") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation:
cor_vars <- fuel_prices %>%
select(gasoline_real_2024usd,
diesel_real_2024usd,
crude_oil_usd_per_barrel,
tax_pct_of_pump_price,
refinery_margin_usd_ltr,
us_cpi) %>%
na.omit()
cor_matrix <- cor(cor_vars)
corrplot(cor_matrix,
method = "color",
type = "upper",
addCoef.col = "black",
number.cex = 0.7,
tl.col = "black",
tl.srt = 45,
title = "Correlation: Fuel Price Components",
mar = c(0, 0, 2, 0))
Interpretation:
yoy_data <- fuel_prices %>%
group_by(year) %>%
summarise(Avg_YoY = mean(gasoline_yoy_pct_change, na.rm = TRUE),
.groups = 'drop')
ggplot(yoy_data, aes(x = year, y = Avg_YoY)) +
geom_line(color = "purple", size = 1) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Year-over-Year Gasoline Price Changes (1925-2024)",
subtitle = "Identifying Price Shock Periods",
x = "Year",
y = "YoY Change (%)") +
theme_minimal()
Interpretation:
1. The mean is misleading — skewness tells the real story: The mean real gasoline price ($0.67/L) is 50% above the median ($0.45/L). A handful of modern high-tax European markets distort the global average upward. The median is the honest “typical” price.
2. Oil producers do not automatically have cheap fuel: Being an oil producer lowers prices on average — but Norway proves that policy choice (taxation) can override geological advantage entirely.
3. The 1973 oil crisis permanently reset global prices: Decade averages show prices roughly doubled in real terms from the 1960s to the 1970s and never returned. This was a structural break, not a temporary spike.
4. Taxes create stability; subsidies create volatility: Europe’s tax-heavy model produces high but stable prices (low CV). The Middle East’s subsidy model produces low but volatile prices (high CV). When subsidies are removed, consumers absorb the entire market price overnight.
5. Crude oil is important but not dominant: The correlation matrix shows tax rate and CPI correlate with retail prices almost as strongly as crude oil. Domestic policy matters as much as the global commodity market for what consumers actually pay.
6. The gasoline-diesel gap is a leading economic indicator: When diesel approaches or exceeds gasoline in price, freight and agriculture costs follow — an economic effect that pure price analysis would miss entirely.