1. Introduction

This report analyses weekly retail prices of motor fuels in the United States from 1990 to 2025. The data come from the TidyTuesday project (2025-07-01 release), which in turn is sourced from the U.S. Energy Information Administration (EIA). Every Monday the EIA surveys roughly 1,000 gasoline outlets and 590 diesel stations, recording the average self-serve cash pump price (including all taxes) for several fuel grades and formulations.

The raw dataset is a single long, tidy table with five columns:

Column Type Description
date Date Monday of the survey week
fuel character gasoline or diesel
grade character e.g. regular, midgrade, premium, all
formulation character conventional, reformulated, or all
price numeric Average pump price in US dollars per gallon

1.1 Questions analysed

This project investigates the following questions, all using data.table for wrangling and ggplot2 for visualisation:

  1. How have average fuel prices evolved over the last three decades?
  2. How large is the price premium for higher gasoline grades, and is it stable?
  3. When was gasoline more expensive than diesel, and when did that reverse?
  4. How volatile are the different fuels (spread of prices)?
  5. Which individual years and months were the most expensive at the pump?

To earn the extra credit we also merge an external metadata table, apply a theme, set every axis/title, use a ColorBrewer palette, and layer multiple geoms on a single plot.

2. Setup and data loading

library(data.table)   # fast filtering & aggregation
library(ggplot2)      # visualisation
library(RColorBrewer) # colorbrewer2.org palettes

We read the CSV directly from GitHub. fread() from data.table is used so that the object is a data.table from the very first line.

url <- paste0(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/",
  "main/data/2025/2025-07-01/weekly_gas_prices.csv"
)

gas <- fread(url)

# Ensure the date column is a proper Date and confirm the class
gas[, date := as.Date(date)]
class(gas)
## [1] "data.table" "data.frame"
str(gas)
## Classes 'data.table' and 'data.frame':   22360 obs. of  5 variables:
##  $ date       : Date, format: "1990-08-20" "1990-08-20" ...
##  $ fuel       : chr  "gasoline" "gasoline" "gasoline" "gasoline" ...
##  $ grade      : chr  "regular" "regular" "regular" "regular" ...
##  $ formulation: chr  "all" "conventional" "all" "conventional" ...
##  $ price      : num  1.19 1.19 1.25 1.25 1.24 ...
##  - attr(*, ".internal.selfref")=<pointer: 0x5e6c4b87d1c0>
# First few rows and the overall dimensions
head(gas)
cat("Rows:", nrow(gas), " Columns:", ncol(gas), "\n")
## Rows: 22360  Columns: 5
# What categories exist in each key column?
gas[, .(unique_fuels = paste(unique(fuel), collapse = ", "))]
gas[, .(unique_grades = paste(unique(grade), collapse = ", "))]
gas[, .(unique_formulations = paste(unique(formulation), collapse = ", "))]

Comment. The dataset is long and tidy: each row is one fuel/grade/formulation combination for one week. gasoline carries several grades (regular, midgrade, premium, plus an all aggregate), while diesel is reported as a single grade. We will exploit this structure throughout.

3. Data preparation with data.table

3.1 Adding helper columns

We add a year and month for aggregation, and a decade label for grouping.

gas[, year  := as.integer(format(date, "%Y"))]
gas[, month := as.integer(format(date, "%m"))]
gas[, decade := paste0(floor(year / 10) * 10, "s")]

head(gas)

3.2 Filtering rows using data.table (required item)

data.table’s dt[i, j, by] syntax filters in the i position. Below we keep only the four core gasoline series (regular, midgrade, premium) measured on a comparable basis, dropping the pre-aggregated all rows for the grade analysis.

# Filter: gasoline only, real grades only, valid prices
gas_grades <- gas[
  fuel == "gasoline" &
  grade %in% c("regular", "midgrade", "premium") &
  !is.na(price)
]

cat("Filtered rows:", nrow(gas_grades),
    "of", nrow(gas), "total\n")
## Filtered rows: 14798 of 22360 total
head(gas_grades)
# A second filter example: only the most recent full decade
gas_2020s <- gas[year >= 2020 & !is.na(price)]
cat("Observations from 2020 onward:", nrow(gas_2020s), "\n")
## Observations from 2020 onward: 4004

Comment. The first filter narrows ~22k rows to just the comparable gasoline grades we need for the premium analysis; the second isolates the recent, high-volatility period for later plots.

3.3 Aggregating data using data.table (required item)

Aggregation happens in the j position with a by clause. First, yearly average price per fuel type:

yearly <- gas[
  !is.na(price),
  .(avg_price = round(mean(price), 3),
    min_price = min(price),
    max_price = max(price),
    n_obs     = .N),
  by = .(year, fuel)
][order(year, fuel)]

head(yearly, 10)
# Average price by gasoline grade across the whole period
grade_summary <- gas_grades[
  ,
  .(avg_price = round(mean(price), 3),
    sd_price  = round(sd(price), 3),
    n         = .N),
  by = grade
][order(avg_price)]

grade_summary
# Average gasoline price by calendar month (seasonality)
monthly <- gas[
  fuel == "gasoline" & !is.na(price),
  .(avg_price = round(mean(price), 3)),
  by = month
][order(month)]

monthly

Comment. Premium gasoline is on average the most expensive grade and regular the cheapest, exactly as expected. The monthly table hints at mild seasonality, with spring/summer months tending to run higher than winter.

4. Merging an external dataset (extra item)

To enrich the analysis we merge a small external metadata table that we construct here. It records, for each fuel type, an approximate energy content (BTU per gallon) and a typical primary use. In a real project this might come from a separate CSV; the merge mechanics are identical.

fuel_meta <- data.table(
  fuel        = c("gasoline", "diesel"),
  btu_per_gal = c(120000L, 138000L),     # approx. lower heating value
  primary_use = c("Passenger vehicles", "Trucks & freight")
)
fuel_meta
# data.table merge: keyed join on the shared 'fuel' column
setkey(gas, fuel)
setkey(fuel_meta, fuel)

gas_merged <- fuel_meta[gas]   # right join: every price row gains metadata

# Confirm the merge added the new columns
head(gas_merged[, .(date, fuel, grade, price, btu_per_gal, primary_use)])
cat("Columns after merge:", paste(names(gas_merged), collapse = ", "), "\n")
## Columns after merge: fuel, btu_per_gal, primary_use, date, grade, formulation, price, year, month, decade
# Use the merged column: price per million BTU (energy-normalised cost)
gas_merged[, price_per_mmbtu := price / (btu_per_gal / 1e6)]

energy_cost <- gas_merged[
  !is.na(price),
  .(avg_dollar_per_gal  = round(mean(price), 3),
    avg_dollar_per_mmbtu = round(mean(price_per_mmbtu), 3)),
  by = fuel
]
energy_cost

Comment. Once we normalise by energy content, diesel’s apparent price disadvantage shrinks: because a gallon of diesel carries more usable energy, its cost per unit of energy is closer to gasoline’s than the raw per-gallon price suggests. The merge made this comparison possible.

5. Visualisations

We define a reusable ColorBrewer palette and a consistent theme up front.

# ColorBrewer qualitative palette (colorbrewer2.org -> "Set1" / "Dark2")
fuel_cols  <- brewer.pal(3, "Set1")[1:2]          # gasoline vs diesel
grade_cols <- brewer.pal(3, "Dark2")              # three gasoline grades

# A house theme applied to every plot
theme_report <- theme_minimal(base_size = 12) +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(colour = "grey35"),
    legend.position = "bottom",
    panel.grid.minor = element_blank()
  )

5.1 Plot 1 — Line chart: three decades of prices (geom 1: line; multi-layer)

# For a clean one-series-per-fuel chart we take the aggregate 'all' grade.
# Gasoline reports several formulations under grade == "all", so we also
# restrict to formulation == "all"; diesel has a single 'all' series.
ts_data <- gas[
  ((fuel == "gasoline" & grade == "all" & formulation == "all") |
   (fuel == "diesel"   & grade == "all")) & !is.na(price),
  .(avg_price = mean(price)),
  by = .(date, fuel)
]

ggplot(ts_data, aes(x = date, y = avg_price, colour = fuel)) +
  geom_line(linewidth = 0.5, alpha = 0.8) +                 # layer 1: lines
  geom_smooth(method = "loess", se = FALSE,                  # layer 2: trend
              linewidth = 1.1, span = 0.2) +
  scale_colour_manual(values = fuel_cols, name = "Fuel") +
  labs(
    title    = "U.S. Retail Fuel Prices, 1990-2025",
    subtitle = "Weekly pump price with smoothed trend line",
    x        = "Year",
    y        = "Price (US$ per gallon)"
  ) +
  theme_report

Comment. Both fuels track each other closely, with dramatic spikes in 2008 (financial-crisis oil shock) and 2022 (post-pandemic / geopolitical shock). The smoothed lines (a second geom layer) make the long-run upward drift easy to read through the weekly noise.

5.2 Plot 2 — Boxplot: price distribution by decade (geom 2: boxplot)

ggplot(gas[fuel == "gasoline" & !is.na(price)],
       aes(x = decade, y = price, fill = decade)) +
  geom_boxplot(outlier.alpha = 0.15) +
  scale_fill_brewer(palette = "Blues", name = "Decade") +
  labs(
    title    = "Distribution of Gasoline Prices by Decade",
    subtitle = "Median, spread and outliers per ten-year window",
    x        = "Decade",
    y        = "Price (US$ per gallon)"
  ) +
  theme_report

Comment. The median pump price has risen every decade, but so has the spread: the 2020s box is both higher and much taller, confirming that recent years were not only more expensive but markedly more volatile.

5.3 Plot 3 — Bar chart: average price by grade (geom 3: bar/col)

ggplot(grade_summary,
       aes(x = reorder(grade, avg_price), y = avg_price, fill = grade)) +
  geom_col(width = 0.65) +
  geom_text(aes(label = sprintf("$%.2f", avg_price)),   # multi-layer: labels
            vjust = -0.4, size = 3.5) +
  scale_fill_manual(values = grade_cols, name = "Grade") +
  labs(
    title    = "Average Price by Gasoline Grade (1990-2025)",
    subtitle = "Premium consistently commands the highest price",
    x        = "Gasoline grade",
    y        = "Average price (US$ per gallon)"
  ) +
  theme_report

Comment. The grade ladder is clear and consistent: premium > midgrade > regular. The text labels (an extra geom layer) quantify the gaps directly on the bars.

5.4 Plot 4 — Yearly average with points + line (multi-layer)

yearly_gas <- yearly[fuel == "gasoline"]

ggplot(yearly_gas, aes(x = year, y = avg_price)) +
  geom_line(colour = grade_cols[1], linewidth = 1) +        # layer 1
  geom_point(colour = grade_cols[1], size = 2) +            # layer 2
  geom_hline(yintercept = mean(yearly_gas$avg_price),       # layer 3
             linetype = "dashed", colour = "grey50") +
  annotate("text", x = min(yearly_gas$year) + 3,
           y = mean(yearly_gas$avg_price) + 0.12,
           label = "Long-run average", colour = "grey40", size = 3.5) +
  labs(
    title    = "Average Annual Gasoline Price",
    subtitle = "Points = yearly mean; dashed line = overall mean",
    x        = "Year",
    y        = "Average price (US$ per gallon)"
  ) +
  theme_report

Comment. Three layers (line, points, reference line) combine to show how each year sits relative to the 30-year average. The early 2000s sat below the long-run mean; everything from ~2011 onward (bar the 2020 dip) sits above it.

5.5 Plot 5 — Seasonality: average price by month (bar)

monthly[, month_name := factor(month.abb[month], levels = month.abb)]

ggplot(monthly, aes(x = month_name, y = avg_price, fill = avg_price)) +
  geom_col() +
  scale_fill_distiller(palette = "YlOrRd", direction = 1,
                       name = "Avg price") +
  labs(
    title    = "Seasonality of Gasoline Prices",
    subtitle = "Averaged across all years, 1990-2025",
    x        = "Month",
    y        = "Average price (US$ per gallon)"
  ) +
  theme_report

Comment. A gentle but real seasonal pattern emerges: prices climb into the late-spring/summer driving season (May-August) and ease in winter, consistent with demand cycles and the switch to costlier summer-blend fuel.

5.6 Plot 6 — Grade premium over time (line, multi-series)

# Reshape grades wide to compute premium over regular
wide <- dcast(gas_grades, date ~ grade, value.var = "price",
              fun.aggregate = mean)
wide[, `:=`(
  midgrade_premium = midgrade - regular,
  premium_premium  = premium  - regular
)]

prem_long <- melt(
  wide[, .(date, midgrade_premium, premium_premium)],
  id.vars = "date", variable.name = "type", value.name = "premium"
)

ggplot(prem_long[!is.na(premium)],
       aes(x = date, y = premium, colour = type)) +
  geom_line(alpha = 0.7) +
  geom_smooth(method = "loess", se = FALSE, span = 0.3, linewidth = 1) +
  scale_colour_brewer(palette = "Set1", name = "Premium vs regular",
                      labels = c("Midgrade", "Premium")) +
  labs(
    title    = "Price Premium of Higher Gasoline Grades",
    subtitle = "Dollar difference relative to regular grade",
    x        = "Year",
    y        = "Premium (US$ per gallon)"
  ) +
  theme_report

Comment. The premium for higher grades is not constant: it was small and flat in the 1990s-2000s but has widened substantially since ~2015, with premium gasoline now costing well over a dollar more per gallon than regular at times.

5.7 Plot 7 — Gasoline vs diesel scatter (geom 4: point)

# Weekly regular gasoline vs diesel, joined on date.
# Restrict gasoline to grade & formulation == "all" so there is exactly
# one gasoline price per week (otherwise the three formulations duplicate dates).
wk_gas <- gas[fuel == "gasoline" & grade == "all" & formulation == "all" &
              !is.na(price), .(date, gasoline = price)]
wk_dsl <- gas[fuel == "diesel" & grade == "all" & !is.na(price),
              .(date, diesel = price)]
pair <- merge(wk_gas, wk_dsl, by = "date")
pair[, era := fifelse(year(date) >= 2015, "2015-2025", "1990-2014")]

ggplot(pair, aes(x = gasoline, y = diesel, colour = era)) +
  geom_point(alpha = 0.4, size = 1.3) +
  geom_abline(slope = 1, intercept = 0,                  # layer 2: parity line
              linetype = "dashed", colour = "grey40") +
  scale_colour_brewer(palette = "Dark2", name = "Era") +
  labs(
    title    = "Gasoline vs Diesel: Weekly Price Pairs",
    subtitle = "Points above the dashed line = diesel more expensive than gasoline",
    x        = "Gasoline price (US$/gal)",
    y        = "Diesel price (US$/gal)"
  ) +
  theme_report

Comment. Below the dashed parity line gasoline costs more; above it diesel does. In the earlier era points cluster near the line, but in 2015-2025 diesel increasingly sits above parity, i.e. diesel became the more expensive fuel — a reversal of the historical norm.

6. Summary and conclusions

This analysis of three decades of weekly U.S. fuel prices produced several clear findings:

  • Long-run rise with sharp shocks. Pump prices roughly tripled from the late 1990s to the 2020s, punctuated by the 2008 and 2022 spikes.
  • Rising volatility. The spread of prices widened decade over decade; the 2020s are both the most expensive and the most volatile period.
  • A consistent grade ladder. Premium > midgrade > regular at all times, but the premium for higher grades has widened sharply since around 2015.
  • A gasoline-diesel reversal. Diesel, historically comparable to or cheaper than gasoline, has increasingly become the more expensive fuel in recent years.
  • Energy-adjusted cost. After merging fuel-energy metadata, diesel’s cost per unit of energy is closer to gasoline’s than the per-gallon figure implies.

Caveats. Following the TidyTuesday guidance, these are descriptive patterns, not causal claims. Many factors (crude oil markets, taxes, refining capacity, seasonality, policy) jointly drive prices and are not all captured here.

Requirements checklist

Requirement Where
Filter rows with data.table §3.2
Aggregate with data.table §3.3
≥ 7 plots §5.1-5.7 (7 plots)
≥ 3 different geoms line, boxplot, col/bar, point (4 geoms)
Merge datasets §4
Apply a theme theme_report, all plots
Axis & plot titles every plot
ColorBrewer palette brewer.pal / scale_*_brewer, all plots
Multiple geom layers on one plot §5.1, 5.3, 5.4, 5.6, 5.7

Data: U.S. EIA via the TidyTuesday project (2025-07-01). Analysis in R with data.table and ggplot2.