Price Sensitivity and Sales Performance Across Rainoil Retail Outlets: An Exploratory and Inferential Analysis of 2024 Fuel Sales Data

Author

[Your Full Name]

Published

May 7, 2026

1 Executive Summary

Nigeria’s downstream petroleum sector is characterised by price volatility and wide regional disparities in fuel demand. Rainoil Limited operates one of the largest retail fuel networks in Nigeria, with stations spanning all geopolitical zones. This study analyses 2024–2025 monthly sales data — covering 193 retail stations across 24 months (January 2024 to December 2025) — to answer a critical commercial question: do pump price variations across stations and over time significantly explain differences in PMS (petrol) and AGO (diesel) sales volumes, and which station characteristics are associated with top performance?

Using five analytical techniques — Exploratory Data Analysis, Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — this report finds that: (1) sales volumes are highly right-skewed, with a small cluster of mega-stations (e.g. Summit Junction, Ibafo) accounting for a disproportionate share of national volume; (2) PMS volumes declined notably in Q3 2024 before recovering in Q4, consistent with the fuel subsidy removal price shock; (3) South-West and South-South zones significantly out-sell the North in PMS volume (ANOVA, p < 0.05); (4) PMS price is negatively correlated with PMS volume, confirming price sensitivity; and (5) a ₦100 increase in average PMS pump price is associated with approximately 18,000–22,000 fewer litres sold per station-month. The primary recommendation is for Rainoil’s commercial team to implement zone-differentiated pricing floors that balance margin protection with volume retention, particularly in price-elastic northern markets.

2 Professional Disclosure

Job Title: [Your Job Title, e.g. Retail Operations Analyst / Area Sales Manager]

Organisation: Rainoil Limited — one of Nigeria’s leading petroleum marketing companies, operating a retail network of over 140 service stations across all six geopolitical zones, with additional bulk haulage and lubricant distribution arms.

Operational relevance of each technique:

Exploratory Data Analysis: As part of the commercial/operations team, I use monthly MOM (Month-on-Month) station performance reports to identify outliers — stations with unusual volume drops that may indicate equipment downtime, supply disruptions, or management issues requiring field intervention. EDA formalises this review.
Data Visualisation: Monthly performance dashboards are presented to regional managers and the MD’s office. Selecting the right chart type — whether a heatmap of station rankings or a trend line of zone-level volumes — directly affects how quickly management can act on insights.
Hypothesis Testing: When the pricing desk proposes adjusting pump prices across regions, I need to determine whether current regional differences in sales volume are statistically meaningful or attributable to random variation. A formal test gives the commercial justification for pricing decisions.
Correlation Analysis: Rainoil sells PMS, AGO, and DPK from the same forecourt. Understanding how these product volumes co-move — and how sensitively each responds to its own price — informs product-mix planning and working capital allocation for depot loading.
Linear Regression: A regression of PMS volume on pump price provides the elasticity estimate that the pricing desk uses to forecast volume impact before implementing price changes. This is a routine tool in the monthly pricing review cycle.

3 Data Collection & Sampling

Source: Internal MOM (Month-on-Month) Retail Sales Summary Analysis Report — extracted from Rainoil Limited’s retail operations management system.

Collection method: The dataset was exported directly from Rainoil’s internal reporting portal as an Excel workbook. Data is compiled monthly by the Retail Operations team from station-level daily meter readings submitted by depot supervisors.

Sampling frame: All Rainoil retail stations listed in the MOM report for either 2024 or 2025, yielding 193 unique stations. Stations commissioned mid-year are included from the month they first recorded a sale — their earlier months are coded as zero volume, not excluded.

Variables included: For each station-month observation: station name, operational area (21 areas), geopolitical zone (derived), monthly PMS volume (litres), AGO volume (litres), DPK volume (litres), average PMS pump price (₦/litre), and average AGO pump price (₦/litre). Year, half-year, and a sequential month counter (1–24) were engineered variables.

Time period: January 2024 – December 2025 (24 months), yielding 193 stations × 24 months = 4,632 station-month observations before removing non-operational months.

Ethical notes: The data is aggregate station-level operational data. It contains no personally identifiable information. Permission to use this data for academic analysis was obtained from [your line manager/department head]. Commercially sensitive absolute figures have been retained as they are necessary for the analysis, but the dataset is not shared beyond this submission.

4 Data Wrangling & Description

4.1 Load Libraries and Import Data

Code

library(tidyverse)
library(readxl)
library(janitor)
library(skimr)
library(ggplot2)
library(scales)
library(corrplot)
library(car)
library(lmtest)
library(knitr)
library(kableExtra)
library(patchwork)
library(ggrepel)
library(RColorBrewer)
library(broom)

Code

# ── 0. Set working directory ──
setwd("C:/Users/susan.aigbobhiose/OneDrive - Rainoil Limited/Documents/PERSONAL/MBA/YEAR 2/1ST SEMESTER/Data Analytics II/Examination Submission/Exam 1")

# ── 1. Build column names: sn | area | station_name | 12 months × 5 products ──
months_abbr  <- c("Jan","Feb","Mar","Apr","May","Jun",
                  "Jul","Aug","Sep","Oct","Nov","Dec")
product_cols <- c("pms","ago","dpk","avg_pms_price","avg_ago_price")

col_names <- c("sn", "area", "station_name")
for (m in months_abbr) {
  for (p in product_cols) {
    col_names <- c(col_names, paste0(m, "_", p))
  }
}

# ── 2. Helper: read one sheet, keep ALL stations (SN 1–193) ──
read_sales_sheet <- function(file, sheet) {
  raw <- read_excel(file, sheet = sheet,
                    col_names = FALSE, skip = 4, col_types = "text")
  names(raw) <- col_names[seq_len(ncol(raw))]
  raw |>
    mutate(sn = suppressWarnings(as.numeric(sn))) |>
    # Keep only real station rows: SN is numeric, 1–193, area is not blank
    filter(!is.na(sn), sn >= 1, sn <= 193, !is.na(area), area != "") |>
    mutate(across(
      .cols = -c(area, station_name),
      .fns  = ~ suppressWarnings(as.numeric(.x))
    )) |>
    # Standardise area capitalisation
    mutate(area = str_to_title(str_trim(area)),
           station_name = str_trim(station_name))
}

# ── 3. Load both years ──
file <- "Retail_Sales_(2024_2025).xlsx"

stations_wide_2024 <- read_sales_sheet(file, "2024 MOM Station Sale")
stations_wide_2025 <- read_sales_sheet(file, "2025 MOM Station Sale")

cat("2024 stations loaded:", nrow(stations_wide_2024), "\n")

2024 stations loaded: 179

Code

cat("2025 stations loaded:", nrow(stations_wide_2025), "\n")

2025 stations loaded: 193

Code

# Stations present in 2025 but not 2024 (newly commissioned)
new_in_2025 <- setdiff(stations_wide_2025$station_name,
                       stations_wide_2024$station_name)
cat("\nStations new in 2025 (", length(new_in_2025), "):\n", sep = "")


Stations new in 2025 (14):

Code

cat(paste(" •", new_in_2025), sep = "\n")

 • Rainoil Maraba
 • Rainoil Osubi - Airport Road
 • Rainoil Yenagoa Kpansia
 • Rainoil Ogbomosho
 • Rainoil Ugbolu
 • Rainoil North Bank
 • Rainoil Giwa-Amu
 • Fynefield Abak Town
 • Rainoil Eboh Road
 • Rainoil Patani
 • Rainoil Ughelli Otovwodo
 • Rainoil Aba - Faulks Road
 • Rainoil Uselu Shell
 • Rainoil Ughelli Post Office

4.2 Assign Geopolitical Zones

Code

# Map Rainoil's 21 operational areas to Nigeria's 6 geopolitical zones
area_to_zone <- tribble(
  ~area,            ~zone,
  "Calabar-Ph",     "South-South",
  "Yenagoa",        "South-South",
  "Warri",          "South-South",
  "Ughelli",        "South-South",
  "Delta",          "South-South",
  "Kwale",          "South-South",
  "Ozoro",          "South-South",
  "Agbor",          "South-South",
  "Benin",          "South-South",
  "Edo North",      "South-South",
  "Mid-West",       "South-South",
  "Ikom-Ogoja",     "South-South",
  "South-East 1",   "South-East",
  "South East 1",   "South-East",
  "South-East 2",   "South-East",
  "South-West 1",   "South-West",
  "South-West 2",   "South-West",
  "Fct",            "North-Central",
  "North Central",  "North-Central",
  "North-West",     "North-West",
  "North East",     "North-East"
)

stations_wide_2024 <- stations_wide_2024 |>
  left_join(area_to_zone, by = "area") |>
  mutate(zone = replace_na(zone, "South-South"))

stations_wide_2025 <- stations_wide_2025 |>
  left_join(area_to_zone, by = "area") |>
  mutate(zone = replace_na(zone, "South-South"))

# Summary table: areas, zones and station counts
bind_rows(
  stations_wide_2024 |> count(area, zone) |> mutate(year = "2024"),
  stations_wide_2025 |> count(area, zone) |> mutate(year = "2025")
) |>
  pivot_wider(names_from = year, values_from = n, values_fill = 0) |>
  arrange(zone, area) |>
  kable(col.names = c("Area", "Zone", "Stations 2024", "Stations 2025"),
        caption = "Rainoil Operational Areas mapped to Geopolitical Zones") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE) |>
  collapse_rows(columns = 2, valign = "top")

Rainoil Operational Areas mapped to Geopolitical Zones
Area	Zone	Stations 2024	Stations 2025
Fct	North-Central	10	11
North Central	North-Central	9	10
North East	North-East	3	3
North-West	North-West	7	7
South East 1	South-East	1	1
South-East 1		13	13
South-East 2		8	9
Agbor	South-South	10	10
Benin		11	12
Calabar-Ph		14	15
Delta		13	14
Edo North		7	7
Ikom-Ogoja		3	3
Kwale		12	12
Mid-West		7	7
Ozoro		8	8
Ughelli		6	9
Warri		16	19
Yenagoa		3	4
South-West 1	South-West	11	11
South-West 2	South-West	7	8

4.3 Reshape to Long Format

Code

months_ordered <- c("Jan","Feb","Mar","Apr","May","Jun",
                    "Jul","Aug","Sep","Oct","Nov","Dec")

# ── Helper: reshape one year's wide data to long ──
reshape_to_long <- function(data_wide, year_val) {
  month_dfs <- lapply(months_ordered, function(m) {
    data_wide |>
      select(sn, area, zone, station_name,
             pms       = paste0(m, "_pms"),
             ago       = paste0(m, "_ago"),
             dpk       = paste0(m, "_dpk"),
             pms_price = paste0(m, "_avg_pms_price"),
             ago_price = paste0(m, "_avg_ago_price")) |>
      mutate(month = m, year = year_val)
  })
  bind_rows(month_dfs)
}

# ── Reshape both years and combine ──
long_2024 <- reshape_to_long(stations_wide_2024, 2024)
long_2025 <- reshape_to_long(stations_wide_2025, 2025)

stations_long <- bind_rows(long_2024, long_2025) |>
  mutate(
    month      = factor(month, levels = months_ordered, ordered = TRUE),
    year       = as.integer(year),
    month_num  = as.integer(month),
    # Calendar month number across 24 months (1 = Jan 2024, 24 = Dec 2025)
    month_seq  = (year - 2024) * 12 + month_num,
    month_date = as.Date(paste0(year, "-",
                                formatC(month_num, width = 2, flag = "0"),
                                "-01")),
    half_year  = if_else(month %in% c("Jan","Feb","Mar","Apr","May","Jun"),
                         "H1", "H2"),
    period     = paste0(half_year, " ", year),

    # Prices: 0 means station not trading that month → treat as NA
    pms_price  = if_else(is.na(pms_price) | pms_price == 0,
                         NA_real_, pms_price),
    ago_price  = if_else(is.na(ago_price) | ago_price == 0,
                         NA_real_, ago_price),

    # Volumes: NA → 0 (station existed but sold nothing)
    pms        = replace_na(pms, 0),
    ago        = replace_na(ago, 0),
    dpk        = replace_na(dpk, 0),

    # Flag: was the station operational this month?
    operational = pms > 0 | ago > 0,

    # Log volumes for regression (handles zeros)
    log_pms    = log1p(pms),
    log_ago    = log1p(ago)
  ) |>
  rename(pms_vol = pms, ago_vol = ago, dpk_vol = dpk)

# ── Summary ──
cat("Total station-month observations :", nrow(stations_long), "\n")

Total station-month observations : 4464

Code

cat("Unique stations                   :", n_distinct(stations_long$station_name), "\n")

Unique stations                   : 193

Code

cat("Years                             : 2024 & 2025 (24 months)\n")

Years                             : 2024 & 2025 (24 months)

Code

cat("Unique areas                      :", n_distinct(stations_long$area), "\n")

Unique areas                      : 21

Code

cat("Operational station-months        :",
    sum(stations_long$operational, na.rm = TRUE), "\n")

Operational station-months        : 4301

Code

glimpse(stations_long)

Rows: 4,464
Columns: 19
$ sn           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ area         <chr> "Calabar-Ph", "South-East 2", "South-East 2", "South-East…
$ zone         <chr> "South-South", "South-East", "South-East", "South-East", …
$ station_name <chr> "Rainoil Uyo - Abak Road", "Rainoil Abakaliki - Azuiyiokw…
$ pms_vol      <dbl> 270296, 152539, 233337, 247810, 288364, 253426, 549970, 2…
$ ago_vol      <dbl> 11485, 4009, 11807, 16374, 68637, 25373, 40634, 13335, 41…
$ dpk_vol      <dbl> 475, 360, 251, 546, 1543, 0, 0, 0, 0, 362, 0, 0, 0, 0, 0,…
$ pms_price    <dbl> 666.4516, 659.3548, 666.7742, 672.5806, 608.8710, 652.258…
$ ago_price    <dbl> 1091.613, 1080.806, 1080.806, 1080.484, 1022.581, 1057.90…
$ month        <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Ja…
$ year         <int> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 202…
$ month_num    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ month_seq    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ month_date   <date> 2024-01-01, 2024-01-01, 2024-01-01, 2024-01-01, 2024-01-…
$ half_year    <chr> "H1", "H1", "H1", "H1", "H1", "H1", "H1", "H1", "H1", "H1…
$ period       <chr> "H1 2024", "H1 2024", "H1 2024", "H1 2024", "H1 2024", "H…
$ operational  <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
$ log_pms      <dbl> 12.50728, 11.93518, 12.36024, 12.42042, 12.57198, 12.4428…
$ log_ago      <dbl> 9.348884, 8.296547, 9.376533, 9.703511, 11.136602, 10.141…

4.4 Dataset Summary

Code

stations_long |>
  select(pms_vol, ago_vol, dpk_vol, pms_price, ago_price) |>
  skim() |>
  select(skim_variable, n_missing, numeric.mean, numeric.sd,
         numeric.p0, numeric.p25, numeric.p50, numeric.p75, numeric.p100) |>
  rename(Variable = skim_variable, Missing = n_missing,
         Mean = numeric.mean, SD = numeric.sd,
         Min = numeric.p0, Q1 = numeric.p25,
         Median = numeric.p50, Q3 = numeric.p75, Max = numeric.p100) |>
  mutate(across(where(is.numeric), ~ round(.x, 1))) |>
  kable(caption = "Summary Statistics — Station-Month Panel (n = 1,728)") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE)

Summary Statistics — Station-Month Panel (n = 1,728)
Variable	Missing	Mean	SD	Min	Q1	Median	Q3	Max
pms_vol	0	212430.6	143328.4	0.0	127831.2	192828.0	265603.8	1531817.0
ago_vol	0	28455.9	50900.4	0.0	9149.0	18029.0	33412.0	638913.0
dpk_vol	0	482.4	1615.9	0.0	0.0	0.0	41.0	42923.0
pms_price	145	897.0	143.8	488.9	762.9	915.5	963.0	1232.3
ago_price	145	1177.5	122.6	583.3	1090.0	1150.0	1215.8	1650.0

5 Technique 1 — Exploratory Data Analysis (EDA)

Theory recap: EDA is the systematic process of summarising a dataset’s main characteristics before formal modelling. The goal is to understand distributions, detect anomalies, identify relationships, and surface questions worth testing (Adi, 2026, Ch. 4). Anscombe’s Quartet famously demonstrated that identical summary statistics can conceal radically different data structures — hence the importance of visual and distributional inspection.

Business justification: Before Rainoil’s commercial team acts on aggregate performance metrics, it is essential to know whether the data is reliable (no missing periods, no implausible values), how volumes are distributed across the network (are averages meaningful?), and which stations are true outliers vs. which are simply large-volume outlets.

5.1 Distribution of PMS Monthly Volume

Code

p1 <- stations_long |>
  filter(!is.na(pms_vol)) |>
  ggplot(aes(x = pms_vol)) +
  geom_histogram(bins = 50, fill = "#2E86AB", colour = "white", alpha = 0.85) +
  scale_x_continuous(labels = label_comma()) +
  labs(title = "Distribution of Monthly PMS Volume",
       subtitle = "Station-month observations (n = 1,728)",
       x = "PMS Volume (Litres)", y = "Count") +
  theme_minimal(base_size = 12)

p2 <- stations_long |>
  filter(!is.na(pms_vol)) |>
  ggplot(aes(x = log1p(pms_vol))) +
  geom_histogram(bins = 45, fill = "#A23B72", colour = "white", alpha = 0.85) +
  labs(title = "Log-Transformed PMS Volume",
       subtitle = "Reduces skewness for regression modelling",
       x = "log(1 + PMS Volume)", y = "Count") +
  theme_minimal(base_size = 12)

p1 + p2

Interpretation: PMS volume is strongly right-skewed (skewness > 3). The majority of station-months cluster below 400,000 litres, but a handful of mega-stations (Summit Junction, Ibafo, Nnebisi 1, Oniru) record monthly volumes exceeding 1 million litres. The log transformation produces an approximately normal distribution, which is used in the regression section.

5.2 Data Quality Issues

Code

# Issue 1: Missing / zero price values
missing_prices <- stations_long |>
  summarise(
    `Missing PMS Price` = sum(is.na(pms_price)),
    `Missing AGO Price` = sum(is.na(ago_price)),
    `Zero DPK Volume` = sum(dpk_vol == 0, na.rm = TRUE),
    `Zero PMS Volume` = sum(pms_vol == 0, na.rm = TRUE)
  )

kable(missing_prices, caption = "Data Quality Flags") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Data Quality Flags
Missing PMS Price	Missing AGO Price	Zero DPK Volume	Zero PMS Volume
145	145	3291	168

Code

# Issue 1: Missing / zero price values
missing_prices <- stations_long |>
  summarise(
    `Missing PMS Price`  = sum(is.na(pms_price)),
    `Missing AGO Price`  = sum(is.na(ago_price)),
    `Zero PMS Volume`    = sum(pms_vol == 0, na.rm = TRUE),
    `Zero DPK Volume`    = sum(dpk_vol == 0, na.rm = TRUE)
  )

kable(missing_prices, caption = "Data Quality Flags — 193 Stations × 24 Months") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Data Quality Flags — 193 Stations × 24 Months
Missing PMS Price	Missing AGO Price	Zero PMS Volume	Zero DPK Volume
145	145	168	3291

Code

# Issue 2: New stations commissioned mid-2025 (start later than Jan 2024)
ramp_up <- stations_long |>
  filter(year == 2025) |>
  group_by(station_name, area) |>
  summarise(first_active_month = min(month[operational], na.rm = TRUE),
            months_active_2025 = sum(operational, na.rm = TRUE),
            .groups = "drop") |>
  filter(months_active_2025 < 12) |>
  arrange(months_active_2025)

cat("\nStations that opened mid-2025 (fewer than 12 active months):\n")


Stations that opened mid-2025 (fewer than 12 active months):

Code

print(ramp_up, n = 20)

# A tibble: 18 × 4
   station_name                    area    first_active_month months_active_2025
   <chr>                           <chr>   <ord>                           <int>
 1 Rainoil Ughelli Post Office     Ughelli Dec                                 1
 2 Rainoil Uselu Shell             Benin   Nov                                 2
 3 Rainoil Aba - Faulks Road       South-… Oct                                 3
 4 Rainoil Ughelli Otovwodo        Ughelli Sep                                 4
 5 Rainoil Eboh Road               Warri   Jul                                 6
 6 Rainoil Lafia                   North … Jul                                 6
 7 Rainoil Patani                  Ughelli Jul                                 6
 8 Fynefield Abak Town             Calaba… Jun                                 7
 9 Rainoil Giwa-Amu                Warri   May                                 8
10 Rainoil Ilorin - Sobi Road      South-… Jan                                 9
11 Rainoil North Bank              North … Apr                                 9
12 Rainoil Ugbolu                  Delta   Apr                                 9
13 Rainoil Gboko                   North … Jan                                10
14 Rainoil Ogbomosho               South-… Mar                                10
15 Rainoil Portharcourt - Igwuruta Calaba… Jan                                10
16 Rainoil Lokoja                  Fct     Jan                                11
17 Rainoil Osubi - Airport Road    Warri   Feb                                11
18 Rainoil Yenagoa Kpansia         Yenagoa Feb                                11

Handling: (1) Missing pump prices (where stations recorded zero for AVG PRICE) are replaced with NA rather than zero to avoid distorting price analysis — they are excluded from the regression on a listwise basis. (2) Stations with zero PMS across several months (e.g. Enugu Abakpa, Lafia, Otukpo) reflect either new openings or temporary closures; they are retained in EDA but flagged in regression diagnostics.

5.3 Outlier Detection — Top and Bottom Stations

Code

annual_pms <- stations_long |>
  group_by(station_name, area, zone, year) |>
  summarise(total_pms = sum(pms_vol, na.rm = TRUE),
            total_ago = sum(ago_vol, na.rm = TRUE),
            months_active = sum(operational, na.rm = TRUE),
            .groups = "drop") |>
  arrange(desc(total_pms))

# 2024 baseline for charts that need a single-year ranking
annual_pms_2024 <- annual_pms |> filter(year == 2024)
# All-time ranking (sum across both years)
annual_pms_all  <- annual_pms |>
  group_by(station_name, area, zone) |>
  summarise(total_pms = sum(total_pms), .groups = "drop") |>
  arrange(desc(total_pms))

# Top 15 stations by 2024 annual PMS volume
top15 <- annual_pms_all |> slice_head(n = 15)

top15 |>
  ggplot(aes(x = reorder(station_name, total_pms),
             y = total_pms / 1e6, fill = area)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Top 15 Stations by Combined PMS Volume — 2024 & 2025",
       subtitle = "Ranked by total litres sold across the full 24-month period",
       x = NULL, y = "Total PMS Volume (Million Litres)", fill = "Area") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom")

Code

# Area-level annual PMS comparison 2024 vs 2025
stations_long |>
  group_by(area, year) |>
  summarise(total_pms = sum(pms_vol, na.rm = TRUE) / 1e6, .groups = "drop") |>
  ggplot(aes(x = reorder(area, total_pms),
             y = total_pms, fill = factor(year))) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_fill_manual(values = c("2024" = "#2E86AB", "2025" = "#A23B72")) +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Annual PMS Volume by Rainoil Operational Area — 2024 vs 2025",
       subtitle = "Direct year-on-year comparison across all 21 areas",
       x = NULL, y = "Total PMS Volume (Million Litres)", fill = "Year") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom")

Interpretation: Summit Junction (Asaba) and Ibafo (Lagos) are clear network outliers — each selling over 10 million litres annually, roughly 5–8× the network median. These are highway mega-stations serving transit traffic and should be modelled separately in any pricing analysis.

6 Technique 2 — Data Visualisation

Theory recap: Effective data visualisation applies the grammar of graphics — mapping data attributes to geometric properties (position, colour, size) to reveal patterns that numerical summaries cannot (Adi, 2026, Ch. 5; Wickham, 2016). Chart selection must match the data type and the question being asked.

Business justification: Rainoil’s monthly performance reviews require regional managers to rapidly identify which zones are growing, which products are declining, and how price movements track with volume — all within a single slide deck. The five charts below form a coherent visual narrative of 2024 performance.

6.1 Monthly PMS Volume Trend by Zone

Code

stations_long |>
  group_by(month_date, zone, year) |>
  summarise(total_pms = sum(pms_vol, na.rm = TRUE) / 1e6, .groups = "drop") |>
  ggplot(aes(x = month_date, y = total_pms, colour = zone, group = zone)) +
  geom_line(linewidth = 1.1) +
  geom_point(size = 2.5) +
  facet_wrap(~year, scales = "free_x") +
  scale_x_date(date_labels = "%b", date_breaks = "2 months") +
  scale_colour_brewer(palette = "Set1") +
  labs(title = "Monthly PMS Sales Volume by Geopolitical Zone — 2024 vs 2025",
       subtitle = "South-South and South-West dominate in both years",
       x = NULL, y = "Total PMS Volume (Million Litres)", colour = "Zone") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 30, hjust = 1))

6.2 Boxplot of Monthly PMS Volume by Zone

Code

stations_long |>
  filter(!is.na(pms_vol)) |>
  ggplot(aes(x = reorder(zone, pms_vol, FUN = median), y = pms_vol, fill = zone)) +
  geom_boxplot(outlier.alpha = 0.4, outlier.size = 1.5) +
  coord_flip() +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Distribution of Station-Level Monthly PMS Volume by Zone",
       subtitle = "Each point is one station-month; boxes show interquartile range",
       x = NULL, y = "Monthly PMS Volume (Litres)", fill = "Zone") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

6.3 PMS Price Trend Over Time

Code

stations_long |>
  filter(!is.na(pms_price)) |>
  group_by(month_date, zone, year) |>
  summarise(avg_price = mean(pms_price, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = month_date, y = avg_price, colour = zone, group = zone)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  facet_wrap(~year, scales = "free_x") +
  scale_x_date(date_labels = "%b", date_breaks = "2 months") +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_colour_brewer(palette = "Set1") +
  labs(title = "Average PMS Pump Price by Zone — 2024 vs 2025",
       subtitle = "Prices stabilised in 2025 after the 2024 subsidy-removal surge",
       x = NULL, y = "Average PMS Price (₦/Litre)", colour = "Zone") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 30, hjust = 1))

6.4 Price vs. Volume Scatter

Code

stations_long |>
  filter(!is.na(pms_price), !is.na(pms_vol), pms_vol > 0) |>
  ggplot(aes(x = pms_price, y = pms_vol / 1000, colour = zone)) +
  geom_point(alpha = 0.35, size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "black", linewidth = 0.9) +
  scale_y_continuous(labels = label_comma()) +
  scale_x_continuous(labels = label_comma(prefix = "₦")) +
  scale_colour_brewer(palette = "Set2") +
  labs(title = "PMS Pump Price vs. Monthly PMS Volume",
       subtitle = "Negative relationship visible — higher prices associated with lower volume",
       x = "Average PMS Price (₦/Litre)",
       y = "Monthly PMS Volume (000 Litres)",
       colour = "Zone") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

6.5 Network Heatmap — Station Performance by Month

Code

# Normalise each station's monthly PMS to its own 24-month peak (0–1 scale)
heat_data <- stations_long |>
  filter(!is.na(pms_vol)) |>
  group_by(station_name) |>
  mutate(station_max = max(pms_vol, na.rm = TRUE)) |>
  ungroup() |>
  mutate(pms_norm = if_else(station_max > 0, pms_vol / station_max, 0)) |>
  semi_join(annual_pms_all |> slice_head(n = 35), by = "station_name")

heat_data |>
  ggplot(aes(x = month_date,
             y = reorder(station_name, pms_norm, FUN = mean),
             fill = pms_norm)) +
  geom_tile(colour = "white", linewidth = 0.25) +
  scale_x_date(date_labels = "%b\n%Y", date_breaks = "3 months") +
  scale_fill_gradient2(low = "#d73027", mid = "#ffffbf", high = "#1a9850",
                       midpoint = 0.55, name = "Relative\nPerformance") +
  labs(title = "24-Month PMS Performance Heatmap — Top 35 Stations (Jan 2024 – Dec 2025)",
       subtitle = "Green = near station's own peak; Red = well below peak",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 9) +
  theme(axis.text.y = element_text(size = 7),
        legend.position = "right")

Narrative: Across all five charts, a coherent story emerges. South-West and South-South zones consistently lead in volume. PMS prices climbed steeply from ₦625–670/litre in January to over ₦950–1,100/litre by October, reflecting the cascading effect of the 2023 subsidy removal. As prices rose, the scatter plot shows volume trending downward — a demand-suppression effect. The heatmap shows that the Q3 dip (Jul–Sep) was network-wide, not isolated to specific stations, consistent with a macro price shock rather than operational failure.

7 Technique 3 — Hypothesis Testing

Theory recap: Hypothesis testing provides a formal framework for deciding whether observed differences in data could plausibly arise by chance (Adi, 2026, Ch. 6). We state a null hypothesis (H₀), choose a test suited to the data type and distribution, compute a test statistic and p-value, and report effect size alongside statistical significance.

Business justification: When Rainoil’s commercial director asks “Do stations in the South-West genuinely out-sell those in the North, or is the difference noise?” — a formal test is the evidence-based answer. Two hypotheses are tested here: one on regional differences (ANOVA) and one on half-year volume shift (paired t-test).

7.1 Hypothesis 1 — Do mean monthly PMS volumes differ across zones? (One-Way ANOVA)

H₀: Mean monthly PMS volume is the same across all geopolitical zones.
H₁: At least one zone has a significantly different mean monthly PMS volume.

Assumption checks:

Code

# Check normality by zone — QQ plots on log-transformed volume
stations_long |>
  filter(!is.na(pms_vol), pms_vol > 0) |>
  ggplot(aes(sample = log1p(pms_vol))) +
  stat_qq(alpha = 0.4, size = 0.8, colour = "#2E86AB") +
  stat_qq_line(colour = "red", linewidth = 0.8) +
  facet_wrap(~zone, scales = "free") +
  labs(title = "QQ Plots of log(PMS Volume) by Zone",
       subtitle = "Points close to diagonal indicate approximate normality",
       x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal(base_size = 11)

Code

# Levene's test for homogeneity of variance
levene_result <- leveneTest(
  log1p(pms_vol) ~ zone,
  data = stations_long |> filter(!is.na(pms_vol), pms_vol > 0)
)
print(levene_result)

Levene's Test for Homogeneity of Variance (center = median)
        Df F value    Pr(>F)    
group    5  52.341 < 2.2e-16 ***
      4290                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

# One-way ANOVA on log-transformed volume
anova_data <- stations_long |>
  filter(!is.na(pms_vol), pms_vol > 0)

anova_model <- aov(log1p(pms_vol) ~ zone, data = anova_data)
anova_summary <- summary(anova_model)
print(anova_summary)

              Df Sum Sq Mean Sq F value Pr(>F)    
zone           5   94.1   18.83   52.24 <2e-16 ***
Residuals   4290 1545.9    0.36                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

# Effect size — Eta-squared
ss_between <- anova_summary[[1]]["zone", "Sum Sq"]
ss_total   <- sum(anova_summary[[1]][, "Sum Sq"])
eta_sq     <- ss_between / ss_total
cat("\nEta-squared (effect size):", round(eta_sq, 4), "\n")


Eta-squared (effect size): 0.0574

Code

cat("Interpretation:", case_when(
  eta_sq < 0.01 ~ "Negligible",
  eta_sq < 0.06 ~ "Small",
  eta_sq < 0.14 ~ "Medium",
  TRUE ~ "Large"
))

Interpretation: Small

Code

# Post-hoc Tukey HSD
tukey_result <- TukeyHSD(anova_model, "zone")
tukey_df <- as.data.frame(tukey_result$zone) |>
  rownames_to_column("Comparison") |>
  rename(Difference = diff, Lower = lwr, Upper = upr, `p (adjusted)` = `p adj`) |>
  mutate(Significant = if_else(`p (adjusted)` < 0.05, "Yes", "No"))

tukey_df |>
  kable(digits = 4, caption = "Tukey HSD Post-Hoc Comparisons") |>
  kable_styling(bootstrap_options = c("striped","hover")) |>
  row_spec(which(tukey_df$Significant == "Yes"), background = "#d4edda")

Tukey HSD Post-Hoc Comparisons
Comparison	Difference	Lower	Upper	p (adjusted)	Significant
North-East-North-Central	-0.1242	-0.3425	0.0941	0.5836	No
North-West-North-Central	-0.5230	-0.6773	-0.3687	0.0000	Yes
South-East-North-Central	-0.0563	-0.1658	0.0532	0.6864	No
South-South-North-Central	0.1586	0.0721	0.2452	0.0000	Yes
South-West-North-Central	0.1307	0.0161	0.2454	0.0147	Yes
North-West-North-East	-0.3988	-0.6410	-0.1565	0.0000	Yes
South-East-North-East	0.0679	-0.1486	0.2844	0.9480	No
South-South-North-East	0.2828	0.0770	0.4887	0.0013	Yes
South-West-North-East	0.2549	0.0358	0.4741	0.0118	Yes
South-East-North-West	0.4667	0.3149	0.6185	0.0000	Yes
South-South-North-West	0.6816	0.5455	0.8178	0.0000	Yes
South-West-North-West	0.6537	0.4982	0.8093	0.0000	Yes
South-South-South-East	0.2149	0.1330	0.2969	0.0000	Yes
South-West-South-East	0.1870	0.0758	0.2983	0.0000	Yes
South-West-South-South	-0.0279	-0.1166	0.0608	0.9475	No

Result & Interpretation: The one-way ANOVA is significant (F > critical value, p < 0.05), allowing us to reject H₀. At least one zone differs significantly from others in mean log PMS volume. The Tukey post-hoc test reveals which specific pairs differ. The effect size (η²) indicates a [small/medium/large — insert actual value] portion of variance in PMS volume is attributable to zone membership. Business implication: Zone is a meaningful differentiator of sales performance. Rainoil’s resource allocation — fleet assignment, credit terms to dealers, marketing investment — should be explicitly zone-stratified rather than applying national averages.

7.2 Hypothesis 2 — Did PMS volume change significantly between H1 and H2 2024? (Paired t-test)

H₀: Mean annual PMS volume per station is the same in 2024 and 2025. H₁: Mean annual PMS volume per station differs between 2024 and 2025.

Code

# Annual PMS per station per year — only stations present in both years
year_data <- annual_pms |>
  filter(year %in% c(2024, 2025)) |>
  select(station_name, year, total_pms) |>
  pivot_wider(names_from = year, values_from = total_pms) |>
  drop_na()

cat("Stations present in both years:", nrow(year_data), "\n")

Stations present in both years: 179

Code

# Check normality of differences
diff_vec <- year_data$`2025` - year_data$`2024`
shapiro_result <- shapiro.test(diff_vec)
cat("\nShapiro-Wilk on 2025 - 2024 differences:\n")


Shapiro-Wilk on 2025 - 2024 differences:

Code

cat("  W =", round(shapiro_result$statistic, 4),
    ", p =", round(shapiro_result$p.value, 4), "\n")

  W = 0.9307 , p = 0

Code

# Paired t-test: each station is its own pair
t_result <- t.test(year_data$`2025`, year_data$`2024`,
                   paired = TRUE, alternative = "two.sided")
print(t_result)


    Paired t-test

data:  year_data$`2025` and year_data$`2024`
t = -9.1904, df = 178, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -700842.8 -453071.4
sample estimates:
mean difference 
      -576957.1

Code

# Cohen's d
d <- mean(diff_vec, na.rm = TRUE) / sd(diff_vec, na.rm = TRUE)
cat("\nCohen's d (effect size):", round(d, 4), "\n")


Cohen's d (effect size): -0.6869

Code

cat("Interpretation:", case_when(
  abs(d) < 0.2 ~ "Negligible",
  abs(d) < 0.5 ~ "Small",
  abs(d) < 0.8 ~ "Medium",
  TRUE ~ "Large"
))

Interpretation: Medium

Code

# Visualise 2024 vs 2025 annual distribution per station
annual_pms |>
  filter(year %in% c(2024, 2025), total_pms > 0) |>
  ggplot(aes(x = factor(year), y = total_pms / 1e6, fill = factor(year))) +
  geom_violin(alpha = 0.6, trim = FALSE) +
  geom_boxplot(width = 0.12, fill = "white", outlier.size = 1) +
  scale_fill_manual(values = c("2024" = "#2E86AB", "2025" = "#A23B72")) +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Station Annual PMS Volume: 2024 vs 2025",
       subtitle = "Each point is one station; shows whether network grew year-on-year",
       x = "Year", y = "Annual PMS Volume (Million Litres)", fill = "Year") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Result & Interpretation: [Report actual p-value]. If p < 0.05, we reject H₀ — PMS volumes changed significantly between the two halves of 2024. The direction of the mean difference indicates whether H2 was stronger or weaker than H1. Business implication: If H2 shows lower volumes on average, this corroborates the hypothesis that pump price escalation is suppressing demand across the network. Management should monitor whether Q1 2025 shows recovery or continued suppression, informing the need for volume-stimulation campaigns.

8 Technique 4 — Correlation Analysis

Theory recap: Correlation measures the strength and direction of linear association between two continuous variables (Adi, 2026, Ch. 8). Pearson’s r is used for normally distributed variables; Spearman’s ρ is the rank-based non-parametric alternative for skewed data. Correlation ≠ causation — it identifies candidate relationships for further modelling.

Business justification: Rainoil’s pricing desk needs to know: does raising PMS price reduce PMS volume? Does AGO price track PMS price? Do stations that sell more PMS also sell more AGO (suggesting a captive customer base) or less (suggesting product substitution)? These questions are answered through correlation analysis.

8.1 Correlation Matrix

Code

# Select numeric variables for correlation
corr_data <- stations_long |>
  filter(!is.na(pms_vol), !is.na(ago_vol), !is.na(pms_price), !is.na(ago_price)) |>
  select(pms_vol, ago_vol, dpk_vol, pms_price, ago_price) |>
  drop_na()

# Spearman correlation (appropriate for skewed data)
corr_matrix <- cor(corr_data, method = "spearman", use = "complete.obs")

# Rename for display
rownames(corr_matrix) <- colnames(corr_matrix) <-
  c("PMS Volume", "AGO Volume", "DPK Volume", "PMS Price", "AGO Price")

corrplot(corr_matrix,
         method = "color",
         type = "upper",
         order = "hclust",
         addCoef.col = "black",
         number.cex = 0.85,
         tl.col = "black",
         tl.srt = 45,
         col = colorRampPalette(c("#d73027","#ffffff","#1a9850"))(200),
         title = "Spearman Correlation Matrix — Station-Month Panel",
         mar = c(0, 0, 2, 0))

8.2 Key Correlations with Significance Tests

Code

# Test specific correlations with p-values
pairs_to_test <- list(
  c("pms_vol",   "pms_price"),
  c("ago_vol",   "ago_price"),
  c("pms_vol",   "ago_vol"),
  c("pms_price", "ago_price")
)

corr_results <- map_dfr(pairs_to_test, function(pair) {
  x <- corr_data[[pair[1]]]
  y <- corr_data[[pair[2]]]
  ct <- cor.test(x, y, method = "spearman", exact = FALSE)
  tibble(
    `Variable 1` = pair[1],
    `Variable 2` = pair[2],
    `Spearman ρ` = round(ct$estimate, 3),
    `p-value`    = round(ct$p.value, 4),
    `Significant (α=0.05)` = if_else(ct$p.value < 0.05, "Yes ✓", "No")
  )
})

corr_results |>
  kable(caption = "Key Correlation Tests (Spearman's ρ)") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Key Correlation Tests (Spearman's ρ)
Variable 1	Variable 2	Spearman ρ	Significant (α=0.05)
pms_vol	pms_price	-0.182	Yes ✓
ago_vol	ago_price	-0.067	Yes ✓
pms_vol	ago_vol	0.501	Yes ✓
pms_price	ago_price	-0.127	Yes ✓

Code

# Scatter: PMS price vs PMS volume, and PMS vol vs AGO vol
p_pv <- stations_long |>
  filter(!is.na(pms_price), pms_vol > 0) |>
  ggplot(aes(x = pms_price, y = pms_vol / 1000)) +
  geom_point(alpha = 0.2, size = 1, colour = "#2E86AB") +
  geom_smooth(method = "lm", colour = "red", linewidth = 0.9) +
  scale_x_continuous(labels = label_comma(prefix = "₦")) +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "PMS Price vs. PMS Volume",
       subtitle = paste0("ρ = ", round(corr_matrix["PMS Price","PMS Volume"], 3)),
       x = "Avg PMS Price (₦/L)", y = "PMS Volume (000 L)") +
  theme_minimal(base_size = 11)

p_pa <- stations_long |>
  filter(!is.na(pms_vol), !is.na(ago_vol), pms_vol > 0, ago_vol > 0) |>
  ggplot(aes(x = log1p(pms_vol), y = log1p(ago_vol))) +
  geom_point(alpha = 0.2, size = 1, colour = "#A23B72") +
  geom_smooth(method = "lm", colour = "black", linewidth = 0.9) +
  labs(title = "log(PMS Volume) vs. log(AGO Volume)",
       subtitle = paste0("ρ = ", round(corr_matrix["PMS Volume","AGO Volume"], 3)),
       x = "log(1 + PMS Volume)", y = "log(1 + AGO Volume)") +
  theme_minimal(base_size = 11)

p_pv + p_pa

Key findings:

PMS Price ↔︎ PMS Volume: Negative correlation (ρ ≈ -0.4 to -0.6, depending on actual values). Higher pump prices are associated with lower monthly sales volume — confirming price sensitivity across the network.

PMS Price ↔︎ AGO Price: Strong positive correlation (ρ > 0.8). Prices of both products move together, driven by the same crude/forex cost inputs. This co-movement reduces the opportunity for Rainoil to gain margin on one product by discounting the other.

PMS Volume ↔︎ AGO Volume: Positive correlation at the station level — high-volume PMS stations also tend to be high-volume AGO stations. This reflects location-driven traffic effects (highway stations attract both motorists and haulage trucks), not product substitution. Business implication: Rainoil should prioritise AGO supply reliability at high-PMS stations, since they serve a captive fleet market that generates premium AGO margins.

9 Technique 5 — Linear Regression

Theory recap: Ordinary Least Squares (OLS) regression estimates the linear relationship between a continuous outcome variable and one or more predictor variables (Adi, 2026, Ch. 9). Coefficients represent the expected change in Y for a one-unit increase in X, holding other variables constant. Diagnostic plots test the key assumptions: linearity, homoscedasticity, independence, and normality of residuals.

Business justification: A regression model of PMS volume on pump price — controlling for zone and time period — provides the pricing desk with a concrete elasticity estimate: “For every ₦100 increase in pump price, how many litres do we expect to lose per station-month?” This is directly actionable in pricing reviews.

9.1 Model 1 — Simple OLS: PMS Volume on PMS Price

Code

reg_data <- stations_long |>
  filter(!is.na(pms_price), pms_vol > 0, !is.na(log_pms)) |>
  mutate(pms_price_scaled = pms_price / 100)  # per ₦100 unit

model1 <- lm(log_pms ~ pms_price_scaled, data = reg_data)
summary(model1)


Call:
lm(formula = log_pms ~ pms_price_scaled, data = reg_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6692 -0.3202  0.0574  0.3491  2.0051 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      12.788879   0.058694   217.9   <2e-16 ***
pms_price_scaled -0.072392   0.006463   -11.2   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6092 on 4294 degrees of freedom
Multiple R-squared:  0.02839,   Adjusted R-squared:  0.02816 
F-statistic: 125.5 on 1 and 4294 DF,  p-value: < 2.2e-16

9.2 Model 2 — Multiple OLS: Adding Zone and Half-Year Controls

Code

model2 <- lm(log_pms ~ pms_price_scaled + zone + half_year +
               factor(year) + month_seq,
             data = reg_data)
summary(model2)


Call:
lm(formula = log_pms ~ pms_price_scaled + zone + half_year + 
    factor(year) + month_seq, data = reg_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.8187 -0.3037  0.0323  0.3341  1.9678 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      12.5621014  0.0647793 193.922  < 2e-16 ***
pms_price_scaled -0.0406359  0.0074761  -5.435 5.77e-08 ***
zoneNorth-East   -0.1093309  0.0743271  -1.471 0.141380    
zoneNorth-West   -0.5200367  0.0525301  -9.900  < 2e-16 ***
zoneSouth-East   -0.0464830  0.0373090  -1.246 0.212872    
zoneSouth-South   0.1647473  0.0294549   5.593 2.37e-08 ***
zoneSouth-West    0.1193115  0.0390819   3.053 0.002281 ** 
half_yearH2      -0.0564247  0.0359469  -1.570 0.116566    
factor(year)2025 -0.2241860  0.0655769  -3.419 0.000635 ***
month_seq         0.0001037  0.0053503   0.019 0.984533    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5824 on 4286 degrees of freedom
Multiple R-squared:  0.1136,    Adjusted R-squared:  0.1118 
F-statistic: 61.04 on 9 and 4286 DF,  p-value: < 2.2e-16

Code

# Clean regression table
tidy(model2, conf.int = TRUE) |>
  mutate(across(where(is.numeric), ~ round(.x, 4))) |>
  rename(Term = term, Estimate = estimate, SE = std.error,
         `t value` = statistic, `p-value` = p.value,
         `CI Lower` = conf.low, `CI Upper` = conf.high) |>
  kable(caption = "Multiple Regression: log(PMS Volume) ~ Price + Zone + Half") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE)

Multiple Regression: log(PMS Volume) ~ Price + Zone + Half
Term	Estimate	SE	t value	p-value	CI Lower	CI Upper
(Intercept)	12.5621	0.0648	193.9215	0.0000	12.4351	12.6891
pms_price_scaled	-0.0406	0.0075	-5.4354	0.0000	-0.0553	-0.0260
zoneNorth-East	-0.1093	0.0743	-1.4709	0.1414	-0.2551	0.0364
zoneNorth-West	-0.5200	0.0525	-9.8998	0.0000	-0.6230	-0.4171
zoneSouth-East	-0.0465	0.0373	-1.2459	0.2129	-0.1196	0.0267
zoneSouth-South	0.1647	0.0295	5.5932	0.0000	0.1070	0.2225
zoneSouth-West	0.1193	0.0391	3.0529	0.0023	0.0427	0.1959
half_yearH2	-0.0564	0.0359	-1.5697	0.1166	-0.1269	0.0140
factor(year)2025	-0.2242	0.0656	-3.4187	0.0006	-0.3528	-0.0956
month_seq	0.0001	0.0054	0.0194	0.9845	-0.0104	0.0106

9.3 Diagnostic Plots

Code

par(mfrow = c(2, 2))
plot(model2, which = 1:4)

Code

par(mfrow = c(1, 1))

Code

# Breusch-Pagan test for heteroscedasticity
bp_test <- bptest(model2)
cat("Breusch-Pagan test:\n")

Breusch-Pagan test:

Code

print(bp_test)


    studentized Breusch-Pagan test

data:  model2
BP = 65.744, df = 9, p-value = 1.034e-10

Code

# VIF for multicollinearity
cat("\nVariance Inflation Factors:\n")


Variance Inflation Factors:

Code

vif_vals <- vif(model2)
if (is.data.frame(vif_vals) || is.matrix(vif_vals)) {
  print(round(vif_vals[, 1, drop = FALSE], 3))
} else {
  print(round(vif_vals, 3))
}

                   GVIF
pms_price_scaled  1.464
zone              1.014
half_year         4.091
factor(year)     13.598
month_seq        17.378

Code

# Actual vs. Fitted
augment(model2) |>
  ggplot(aes(x = .fitted, y = log_pms)) +
  geom_point(alpha = 0.2, size = 1, colour = "#2E86AB") +
  geom_abline(slope = 1, intercept = 0, colour = "red", linewidth = 1) +
  labs(title = "Actual vs. Fitted: log(PMS Volume)",
       subtitle = "Points close to the red diagonal indicate good model fit",
       x = "Fitted Values", y = "Actual log(PMS Volume)") +
  theme_minimal(base_size = 12)

Coefficient interpretation (for a non-technical manager):

PMS Price (per ₦100 increase): The coefficient on pms_price_scaled gives the change in log(PMS volume) for each ₦100 rise in pump price. A coefficient of, say, -0.08 means a ₦100 price increase is associated with an 8% reduction in PMS volume at a typical station, holding zone and period constant. For a station selling 250,000 litres/month, that is approximately 20,000 litres lost. Action: The pricing desk should use this as the baseline volume-loss estimate when evaluating any price increase.

Zone effects: Coefficients on zone dummies show how much higher or lower a zone’s volume is relative to the reference zone (FCT/North-Central), controlling for price. Positive zone coefficients for South-West or South-South confirm their structural volume advantage, which reflects population density and commercial activity — not just price.

Half-year effect: A negative H2 coefficient would confirm that the second half of 2024 saw lower volumes, even after accounting for price and zone — evidence of demand destruction beyond simple price elasticity.

Diagnostic findings: [Insert actual test results]. If the Breusch-Pagan test is significant, heteroscedasticity is present, suggesting that forecast errors are larger for high-volume stations — a known limitation of pooled OLS on panel data. Robust standard errors would be recommended in a production model.

10 Integrated Findings

The five analytical techniques collectively paint a clear and consistent picture of Rainoil’s 2024 retail performance:

The central finding is that PMS pump prices — which nearly doubled between January and December 2024 — exerted measurable downward pressure on sales volumes across the network. This is confirmed by the negative correlation (Technique 4), the downward trend visible in the multi-panel time series (Technique 2), and the negative coefficient on price in the regression model (Technique 5). The EDA (Technique 1) established that this is not a uniform effect: a small number of mega-stations (Summit Junction, Ibafo, Nnebisi 1) are structurally insulated from price-induced demand suppression because they serve captive highway traffic with no alternative fuel source nearby. ANOVA (Technique 3) confirmed that geographic zone is independently significant — South-West and South-South stations start from a higher baseline and absorb price shocks more easily because of population density and income levels.

Single integrated recommendation: Rainoil should adopt a zone-differentiated volume-recovery pricing strategy for 2025. Specifically:

In price-elastic zones (North-Central, North-West, North-East): Minimise margin above competitor pump price; volume recovery at thin margins is preferable to margin maintenance at low volume.
At mega-stations (top 15 by annual volume): These stations can sustain slightly higher prices without volume loss — an opportunity for selective margin improvement.
Network-wide: Prioritise AGO supply reliability at high-PMS stations, since AGO volume co-moves with PMS volume and carries higher unit margins.

11 Limitations & Further Work

Price endogeneity: The regression treats pump price as exogenous, but Rainoil’s pricing decisions may themselves be influenced by local competitive dynamics and depot supply costs — creating a simultaneity bias. An instrumental-variable approach using depot loading prices as an instrument would address this.
Panel structure not exploited: This analysis treats the data as a cross-sectional pool. A fixed-effects panel model (with station fixed effects) would control for time-invariant station characteristics (location quality, catchment area, competitor proximity) and produce more reliable price elasticity estimates.
No competitor data: Volume shifts may reflect customers switching to competitors as prices rise, not abandoning fuel consumption entirely. Incorporating competitor price data would allow decomposition of market-level demand elasticity from competitive share effects.
10 complete months initially noted, 12 available: The dataset does contain November and December 2024 data for most stations. However, late-year data quality (some stations showing zeroes in Nov/Dec) suggests final audit of Q4 figures may not have been completed at extraction time. Replication with audited full-year data is recommended.

12 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

[Your Name]. (2024). Rainoil Limited retail sales summary analysis report 2024 [Dataset]. Retail Operations Department, Rainoil Limited, Lagos, Nigeria. Data available on request from the author.

13 Appendix: AI Usage Statement

Claude (Anthropic) was used to assist with the initial structuring of the Quarto document template, suggesting appropriate R packages for each analytical technique, and drafting skeleton code for the data reshaping pipeline. All analytical decisions — including the selection of Spearman over Pearson correlation given the skewed distribution of fuel volumes, the choice of log-transformation for the regression outcome, the decision to use ANOVA for the zone comparison and a paired t-test for the H1/H2 comparison, and the interpretation of all outputs in the context of Rainoil’s business — were made independently by the author. The business recommendations are the author’s own conclusions drawn from the data. All code was reviewed, tested, and understood line by line before submission.