data.table & ggplot2 AnalysisThis report analyses weekly retail prices of motor fuels in the United States from 1990 to 2025. The data come from the TidyTuesday project (2025-07-01 release), which in turn is sourced from the U.S. Energy Information Administration (EIA). Every Monday the EIA surveys roughly 1,000 gasoline outlets and 590 diesel stations, recording the average self-serve cash pump price (including all taxes) for several fuel grades and formulations.
The raw dataset is a single long, tidy table with five columns:
| Column | Type | Description |
|---|---|---|
date |
Date | Monday of the survey week |
fuel |
character | gasoline or diesel |
grade |
character | e.g. regular, midgrade, premium, all |
formulation |
character | conventional, reformulated, or all |
price |
numeric | Average pump price in US dollars per gallon |
This project investigates the following questions, all using
data.table for wrangling and
ggplot2 for visualisation:
To earn the extra credit we also merge an external metadata table, apply a theme, set every axis/title, use a ColorBrewer palette, and layer multiple geoms on a single plot.
library(data.table) # fast filtering & aggregation
library(ggplot2) # visualisation
library(RColorBrewer) # colorbrewer2.org palettes
We read the CSV directly from GitHub. fread() from
data.table is used so that the object is a
data.table from the very first line.
url <- paste0(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/",
"main/data/2025/2025-07-01/weekly_gas_prices.csv"
)
gas <- fread(url)
# Ensure the date column is a proper Date and confirm the class
gas[, date := as.Date(date)]
class(gas)
## [1] "data.table" "data.frame"
str(gas)
## Classes 'data.table' and 'data.frame': 22360 obs. of 5 variables:
## $ date : Date, format: "1990-08-20" "1990-08-20" ...
## $ fuel : chr "gasoline" "gasoline" "gasoline" "gasoline" ...
## $ grade : chr "regular" "regular" "regular" "regular" ...
## $ formulation: chr "all" "conventional" "all" "conventional" ...
## $ price : num 1.19 1.19 1.25 1.25 1.24 ...
## - attr(*, ".internal.selfref")=<pointer: 0x5e6c4b87d1c0>
# First few rows and the overall dimensions
head(gas)
cat("Rows:", nrow(gas), " Columns:", ncol(gas), "\n")
## Rows: 22360 Columns: 5
# What categories exist in each key column?
gas[, .(unique_fuels = paste(unique(fuel), collapse = ", "))]
gas[, .(unique_grades = paste(unique(grade), collapse = ", "))]
gas[, .(unique_formulations = paste(unique(formulation), collapse = ", "))]
Comment. The dataset is long and tidy: each row is
one fuel/grade/formulation combination for one week.
gasoline carries several grades (regular, midgrade,
premium, plus an all aggregate), while diesel
is reported as a single grade. We will exploit this structure
throughout.
data.tableWe add a year and month for aggregation,
and a decade label for grouping.
gas[, year := as.integer(format(date, "%Y"))]
gas[, month := as.integer(format(date, "%m"))]
gas[, decade := paste0(floor(year / 10) * 10, "s")]
head(gas)
data.table (required
item)data.table’s dt[i, j, by] syntax filters in
the i position. Below we keep only the four core
gasoline series (regular, midgrade, premium) measured
on a comparable basis, dropping the pre-aggregated all rows
for the grade analysis.
# Filter: gasoline only, real grades only, valid prices
gas_grades <- gas[
fuel == "gasoline" &
grade %in% c("regular", "midgrade", "premium") &
!is.na(price)
]
cat("Filtered rows:", nrow(gas_grades),
"of", nrow(gas), "total\n")
## Filtered rows: 14798 of 22360 total
head(gas_grades)
# A second filter example: only the most recent full decade
gas_2020s <- gas[year >= 2020 & !is.na(price)]
cat("Observations from 2020 onward:", nrow(gas_2020s), "\n")
## Observations from 2020 onward: 4004
Comment. The first filter narrows ~22k rows to just the comparable gasoline grades we need for the premium analysis; the second isolates the recent, high-volatility period for later plots.
data.table (required
item)Aggregation happens in the j position with a
by clause. First, yearly average price per fuel type:
yearly <- gas[
!is.na(price),
.(avg_price = round(mean(price), 3),
min_price = min(price),
max_price = max(price),
n_obs = .N),
by = .(year, fuel)
][order(year, fuel)]
head(yearly, 10)
# Average price by gasoline grade across the whole period
grade_summary <- gas_grades[
,
.(avg_price = round(mean(price), 3),
sd_price = round(sd(price), 3),
n = .N),
by = grade
][order(avg_price)]
grade_summary
# Average gasoline price by calendar month (seasonality)
monthly <- gas[
fuel == "gasoline" & !is.na(price),
.(avg_price = round(mean(price), 3)),
by = month
][order(month)]
monthly
Comment. Premium gasoline is on average the most expensive grade and regular the cheapest, exactly as expected. The monthly table hints at mild seasonality, with spring/summer months tending to run higher than winter.
To enrich the analysis we merge a small external metadata table that we construct here. It records, for each fuel type, an approximate energy content (BTU per gallon) and a typical primary use. In a real project this might come from a separate CSV; the merge mechanics are identical.
fuel_meta <- data.table(
fuel = c("gasoline", "diesel"),
btu_per_gal = c(120000L, 138000L), # approx. lower heating value
primary_use = c("Passenger vehicles", "Trucks & freight")
)
fuel_meta
# data.table merge: keyed join on the shared 'fuel' column
setkey(gas, fuel)
setkey(fuel_meta, fuel)
gas_merged <- fuel_meta[gas] # right join: every price row gains metadata
# Confirm the merge added the new columns
head(gas_merged[, .(date, fuel, grade, price, btu_per_gal, primary_use)])
cat("Columns after merge:", paste(names(gas_merged), collapse = ", "), "\n")
## Columns after merge: fuel, btu_per_gal, primary_use, date, grade, formulation, price, year, month, decade
# Use the merged column: price per million BTU (energy-normalised cost)
gas_merged[, price_per_mmbtu := price / (btu_per_gal / 1e6)]
energy_cost <- gas_merged[
!is.na(price),
.(avg_dollar_per_gal = round(mean(price), 3),
avg_dollar_per_mmbtu = round(mean(price_per_mmbtu), 3)),
by = fuel
]
energy_cost
Comment. Once we normalise by energy content, diesel’s apparent price disadvantage shrinks: because a gallon of diesel carries more usable energy, its cost per unit of energy is closer to gasoline’s than the raw per-gallon price suggests. The merge made this comparison possible.
We define a reusable ColorBrewer palette and a consistent theme up front.
# ColorBrewer qualitative palette (colorbrewer2.org -> "Set1" / "Dark2")
fuel_cols <- brewer.pal(3, "Set1")[1:2] # gasoline vs diesel
grade_cols <- brewer.pal(3, "Dark2") # three gasoline grades
# A house theme applied to every plot
theme_report <- theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(colour = "grey35"),
legend.position = "bottom",
panel.grid.minor = element_blank()
)
# For a clean one-series-per-fuel chart we take the aggregate 'all' grade.
# Gasoline reports several formulations under grade == "all", so we also
# restrict to formulation == "all"; diesel has a single 'all' series.
ts_data <- gas[
((fuel == "gasoline" & grade == "all" & formulation == "all") |
(fuel == "diesel" & grade == "all")) & !is.na(price),
.(avg_price = mean(price)),
by = .(date, fuel)
]
ggplot(ts_data, aes(x = date, y = avg_price, colour = fuel)) +
geom_line(linewidth = 0.5, alpha = 0.8) + # layer 1: lines
geom_smooth(method = "loess", se = FALSE, # layer 2: trend
linewidth = 1.1, span = 0.2) +
scale_colour_manual(values = fuel_cols, name = "Fuel") +
labs(
title = "U.S. Retail Fuel Prices, 1990-2025",
subtitle = "Weekly pump price with smoothed trend line",
x = "Year",
y = "Price (US$ per gallon)"
) +
theme_report
Comment. Both fuels track each other closely, with dramatic spikes in 2008 (financial-crisis oil shock) and 2022 (post-pandemic / geopolitical shock). The smoothed lines (a second geom layer) make the long-run upward drift easy to read through the weekly noise.
ggplot(gas[fuel == "gasoline" & !is.na(price)],
aes(x = decade, y = price, fill = decade)) +
geom_boxplot(outlier.alpha = 0.15) +
scale_fill_brewer(palette = "Blues", name = "Decade") +
labs(
title = "Distribution of Gasoline Prices by Decade",
subtitle = "Median, spread and outliers per ten-year window",
x = "Decade",
y = "Price (US$ per gallon)"
) +
theme_report
Comment. The median pump price has risen every decade, but so has the spread: the 2020s box is both higher and much taller, confirming that recent years were not only more expensive but markedly more volatile.
ggplot(grade_summary,
aes(x = reorder(grade, avg_price), y = avg_price, fill = grade)) +
geom_col(width = 0.65) +
geom_text(aes(label = sprintf("$%.2f", avg_price)), # multi-layer: labels
vjust = -0.4, size = 3.5) +
scale_fill_manual(values = grade_cols, name = "Grade") +
labs(
title = "Average Price by Gasoline Grade (1990-2025)",
subtitle = "Premium consistently commands the highest price",
x = "Gasoline grade",
y = "Average price (US$ per gallon)"
) +
theme_report
Comment. The grade ladder is clear and consistent: premium > midgrade > regular. The text labels (an extra geom layer) quantify the gaps directly on the bars.
yearly_gas <- yearly[fuel == "gasoline"]
ggplot(yearly_gas, aes(x = year, y = avg_price)) +
geom_line(colour = grade_cols[1], linewidth = 1) + # layer 1
geom_point(colour = grade_cols[1], size = 2) + # layer 2
geom_hline(yintercept = mean(yearly_gas$avg_price), # layer 3
linetype = "dashed", colour = "grey50") +
annotate("text", x = min(yearly_gas$year) + 3,
y = mean(yearly_gas$avg_price) + 0.12,
label = "Long-run average", colour = "grey40", size = 3.5) +
labs(
title = "Average Annual Gasoline Price",
subtitle = "Points = yearly mean; dashed line = overall mean",
x = "Year",
y = "Average price (US$ per gallon)"
) +
theme_report
Comment. Three layers (line, points, reference line) combine to show how each year sits relative to the 30-year average. The early 2000s sat below the long-run mean; everything from ~2011 onward (bar the 2020 dip) sits above it.
monthly[, month_name := factor(month.abb[month], levels = month.abb)]
ggplot(monthly, aes(x = month_name, y = avg_price, fill = avg_price)) +
geom_col() +
scale_fill_distiller(palette = "YlOrRd", direction = 1,
name = "Avg price") +
labs(
title = "Seasonality of Gasoline Prices",
subtitle = "Averaged across all years, 1990-2025",
x = "Month",
y = "Average price (US$ per gallon)"
) +
theme_report
Comment. A gentle but real seasonal pattern emerges: prices climb into the late-spring/summer driving season (May-August) and ease in winter, consistent with demand cycles and the switch to costlier summer-blend fuel.
# Weekly regular gasoline vs diesel, joined on date.
# Restrict gasoline to grade & formulation == "all" so there is exactly
# one gasoline price per week (otherwise the three formulations duplicate dates).
wk_gas <- gas[fuel == "gasoline" & grade == "all" & formulation == "all" &
!is.na(price), .(date, gasoline = price)]
wk_dsl <- gas[fuel == "diesel" & grade == "all" & !is.na(price),
.(date, diesel = price)]
pair <- merge(wk_gas, wk_dsl, by = "date")
pair[, era := fifelse(year(date) >= 2015, "2015-2025", "1990-2014")]
ggplot(pair, aes(x = gasoline, y = diesel, colour = era)) +
geom_point(alpha = 0.4, size = 1.3) +
geom_abline(slope = 1, intercept = 0, # layer 2: parity line
linetype = "dashed", colour = "grey40") +
scale_colour_brewer(palette = "Dark2", name = "Era") +
labs(
title = "Gasoline vs Diesel: Weekly Price Pairs",
subtitle = "Points above the dashed line = diesel more expensive than gasoline",
x = "Gasoline price (US$/gal)",
y = "Diesel price (US$/gal)"
) +
theme_report
Comment. Below the dashed parity line gasoline costs more; above it diesel does. In the earlier era points cluster near the line, but in 2015-2025 diesel increasingly sits above parity, i.e. diesel became the more expensive fuel — a reversal of the historical norm.
This analysis of three decades of weekly U.S. fuel prices produced several clear findings:
Caveats. Following the TidyTuesday guidance, these are descriptive patterns, not causal claims. Many factors (crude oil markets, taxes, refining capacity, seasonality, policy) jointly drive prices and are not all captured here.
| Requirement | Where |
|---|---|
Filter rows with data.table |
§3.2 |
Aggregate with data.table |
§3.3 |
| ≥ 7 plots | §5.1-5.7 (7 plots) |
| ≥ 3 different geoms | line, boxplot, col/bar, point (4 geoms) |
| Merge datasets | §4 |
| Apply a theme | theme_report, all plots |
| Axis & plot titles | every plot |
| ColorBrewer palette | brewer.pal / scale_*_brewer, all
plots |
| Multiple geom layers on one plot | §5.1, 5.3, 5.4, 5.6, 5.7 |
Data: U.S. EIA via the TidyTuesday project (2025-07-01). Analysis
in R with data.table and ggplot2.