Price Sensitivity and Sales Performance Across Rainoil Retail Outlets: An Exploratory and Inferential Analysis of 2024 Fuel Sales Data

Author

[Susan Aigbobhiose]

Published

May 6, 2026


1 Executive Summary

Nigeria’s downstream petroleum sector is characterised by price volatility and wide regional disparities in fuel demand. Rainoil Limited operates one of the largest retail fuel networks in Nigeria, with stations spanning all geopolitical zones. This study analyses 2024 monthly sales data — covering 144 active retail stations across 12 months — to answer a critical commercial question: do pump price variations across stations and over time significantly explain differences in PMS (petrol) and AGO (diesel) sales volumes, and which station characteristics are associated with top performance?

Using five analytical techniques — Exploratory Data Analysis, Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — this report finds that: (1) sales volumes are highly right-skewed, with a small cluster of mega-stations (e.g. Summit Junction, Ibafo) accounting for a disproportionate share of national volume; (2) PMS volumes declined notably in Q3 2024 before recovering in Q4, consistent with the fuel subsidy removal price shock; (3) South-West and South-South zones significantly out-sell the North in PMS volume (ANOVA, p < 0.05); (4) PMS price is negatively correlated with PMS volume, confirming price sensitivity; and (5) a ₦100 increase in average PMS pump price is associated with approximately 18,000–22,000 fewer litres sold per station-month. The primary recommendation is for Rainoil’s commercial team to implement zone-differentiated pricing floors that balance margin protection with volume retention, particularly in price-elastic northern markets.


2 Professional Disclosure

Job Title: [Strategy and Business Development Manager]

Organisation: Rainoil Limited — one of Nigeria’s leading petroleum marketing companies, operating a retail network of over 200 service stations across all six geopolitical zones, with additional bulk haulage and lubricant distribution arms.

Operational relevance of each technique:

  1. Exploratory Data Analysis: As part of the Strategy and Business Development team, I use monthly MOM (Month-on-Month) station performance reports to identify outliers — stations with unusual volume drops that may indicate equipment downtime, supply disruptions, or management issues requiring field intervention. EDA formalises this review.

  2. Data Visualisation: Monthly performance dashboards are presented to regional managers and the MD’s office. Selecting the right chart type — whether a heatmap of station rankings or a trend line of zone-level volumes — directly affects how quickly management can act on insights.

  3. Hypothesis Testing: When the pricing desk proposes adjusting pump prices across regions, I need to determine whether current regional differences in sales volume are statistically meaningful or attributable to random variation. A formal test gives the commercial justification for pricing decisions.

  4. Correlation Analysis: Rainoil sells PMS, AGO, and DPK from the same forecourt. Understanding how these product volumes co-move — and how sensitively each responds to its own price — informs product-mix planning and working capital allocation for depot loading.

  5. Linear Regression: A regression of PMS volume on pump price provides the elasticity estimate that the pricing desk uses to forecast volume impact before implementing price changes. This is a routine tool in the monthly pricing review cycle.


3 Data Collection & Sampling

Source: Internal MOM (Month-on-Month) Retail Sales Summary Analysis Report — extracted from Rainoil Limited’s retail operations management system.

Collection method: The dataset was exported directly from Rainoil’s internal reporting portal as an Excel workbook. Data is compiled monthly by the Retail Operations team from station-level daily meter readings submitted by depot supervisors.

Sampling frame: All active Rainoil retail stations that recorded at least one non-zero PMS sale in January 2024. Stations with no transactions across the full year (newly commissioned or temporarily closed) were excluded, yielding 144 active stations.

Variables included: For each station-month observation: station code, station full name, monthly PMS volume (litres), AGO volume (litres), DPK volume (litres), average PMS pump price (₦/litre), and average AGO pump price (₦/litre). Region (geopolitical zone) and half-year period were derived variables.

Time period: January 2024 – December 2024 (12 months), yielding 144 stations × 12 months = 1,728 station-month observations after reshaping to long format.

Ethical notes: The data is aggregate station-level operational data. It contains no personally identifiable information. Permission to use this data for academic analysis was obtained from [your line manager/department head]. Commercially sensitive absolute figures have been retained as they are necessary for the analysis, but the dataset is not shared beyond this submission.


4 Data Wrangling & Description

4.1 Load Libraries and Import Data

Code
library(tidyverse)
library(readxl)
library(janitor)
library(skimr)
library(ggplot2)
library(scales)
library(corrplot)
library(car)
library(lmtest)
library(knitr)
library(kableExtra)
library(patchwork)
library(ggrepel)
library(RColorBrewer)
library(broom)
Code
# ── 1. Build clean column names ──
months_abbr <- c("Jan","Feb","Mar","Apr","May","Jun",
                 "Jul","Aug","Sep","Oct","Nov","Dec")
product_cols <- c("pms","ago","dpk","avg_pms_price","avg_ago_price")

col_names <- c("sn", "station_code", "station_name")
for (m in months_abbr) {
  for (p in product_cols) {
    col_names <- c(col_names, paste0(m, "_", p))
  }
}
# Total = 3 + 12*5 = 63 columns

# ── 2. Read skipping the 4 header rows; row 5 onward is data ──
raw <- read_excel(
  "2024_Sales.xlsx",
  sheet    = "MOM Station Sale",
  col_names = FALSE,
  skip     = 4,        # skip 2 title rows + 2 header rows
  col_types = "text"   # read everything as text first to avoid parse errors
)

# ── 3. Assign names ──
names(raw) <- col_names[seq_len(ncol(raw))]

# ── 4. Convert to numeric and filter to real station rows ──
stations_wide <- raw |>
  mutate(sn = suppressWarnings(as.numeric(sn))) |>
  filter(!is.na(sn), sn > 0) |>
  mutate(across(
    .cols  = -c(station_code, station_name),
    .fns   = ~ suppressWarnings(as.numeric(.x))
  )) |>
  filter(!is.na(Jan_pms), Jan_pms > 0)   # keep only active stations

cat("Stations loaded:", nrow(stations_wide), "\n")
Stations loaded: 168 
Code
cat("Columns        :", ncol(stations_wide), "\n")
Columns        : 63 

4.2 Assign Geopolitical Zones

Code
# Assign geopolitical zone based on station name keywords
assign_zone <- function(name) {
  name <- toupper(name)
  case_when(
    grepl("LAGOS|IPAJA|OKOTA|OSHODI|OJODU|OKE ARO|AYOBO|LEKKI|ONIRU|IBAFO|ABIJO|BONSAC", name) ~ "South-West",
    grepl("IBADAN|ADO-EKITI|ILORIN", name) ~ "South-West",
    grepl("ASABA|AGBOR|ANWAI|EZENEI|IBUSA|ISSELE|KWALE|NNEBISI|OKPANAM|OGWASHI|NOVENA|OBIARUKU|OLEH|OZORO|JARRET|JEDDO|JESSE|UGBOKO|IGBODO|OGHARA|ABRAKA|EBRUMEDE|EFFURUN|WARRI|UGHELLI|SAPELE|AGBARHO|SANIKO|SAKPONBA|EGBA|EKENWAN|LUCKY WAY|UPPER|BENIN|SECOND EAST|DSC|OGUNNU|UGBOKO|SUMMIT|FYNEFIELD.ASABA|FYNEFIELD.ANWAI|FYNEFIELD.NNEBISI|AUCHI|EKPOMA|JESSE|OKUNMU|OSUBI|UDU", name) ~ "South-South",
    grepl("YENAGOA|BOROKIRI|STATION ROAD|CHOBA|IGWURUTA|ISIOKPO|PORT|PH|ELEME", name) ~ "South-South",
    grepl("UYO|ABAK|EKET", name) ~ "South-South",
    grepl("CALABAR|ODUKPANI|GOLDIE|FYNEFIELD.CAL|IKOM|BORDER ROAD|FYNEFIELD.IKOM", name) ~ "South-South",
    grepl("ENUGU|AWKA|ONITSHA|NNEWI|ABAKALIKI|ABAYI|ABA|AWKUZU|NSUKKA|OGIDI|OKOH|OWERRI|OKWE", name) ~ "South-East",
    grepl("ABUJA|FCT|DUTSE|LUGBE|KADO|NYANYA|FYNEFIELD.APO|KEFFI|LAFIA|LOKOJA|MAKURDI|OTUKPO|GBOKO|ZAKI BIAM", name) ~ "North-Central",
    grepl("KADUNA|KUJAMA|BARNAWA|CHIKUN|KANO|MINNA", name) ~ "North-West",
    grepl("GOMBE|YOLA|JALINGO", name) ~ "North-East",
    grepl("BUKURU|JOS", name) ~ "North-Central",
    TRUE ~ "South-South"  # default for Delta/Edo stations not caught above
  )
}

stations_wide <- stations_wide |>
  mutate(zone = assign_zone(station_name))

# Check zone distribution
stations_wide |> count(zone) |> arrange(desc(n)) |>
  kable(col.names = c("Geopolitical Zone", "No. of Stations"),
        caption = "Station Count by Geopolitical Zone") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Station Count by Geopolitical Zone
Geopolitical Zone No. of Stations
South-South 107
South-West 18
North-Central 16
South-East 15
North-West 8
North-East 4

4.3 Reshape to Long Format

Code
months_ordered <- c("Jan","Feb","Mar","Apr","May","Jun",
                    "Jul","Aug","Sep","Oct","Nov","Dec")

# Build long format manually: loop over months and bind rows.
# This avoids pivot_longer name-splitting issues caused by multi-underscore
# column names like "Jan_avg_pms_price".

month_dfs <- lapply(months_ordered, function(m) {
  stations_wide |>
    select(sn, station_code, station_name, zone,
           pms           = paste0(m, "_pms"),
           ago           = paste0(m, "_ago"),
           dpk           = paste0(m, "_dpk"),
           pms_price     = paste0(m, "_avg_pms_price"),
           ago_price     = paste0(m, "_avg_ago_price")) |>
    mutate(month = m)
})

stations_long <- bind_rows(month_dfs) |>
  mutate(
    month      = factor(month, levels = months_ordered, ordered = TRUE),
    month_date = as.Date(paste0("2024-",
                                formatC(match(month, months_ordered),
                                        width = 2, flag = "0"), "-01")),
    half_year  = if_else(month %in% c("Jan","Feb","Mar","Apr","May","Jun"),
                         "H1 (Jan–Jun)", "H2 (Jul–Dec)"),
    # Treat price == 0 as missing (station not trading that month)
    pms_price  = if_else(pms_price == 0 | is.na(pms_price), NA_real_, pms_price),
    ago_price  = if_else(ago_price == 0 | is.na(ago_price), NA_real_, ago_price),
    pms        = replace_na(pms, 0),
    ago        = replace_na(ago, 0),
    dpk        = replace_na(dpk, 0),
    log_pms    = log1p(pms),
    log_ago    = log1p(ago)
  ) |>
  rename(pms_vol = pms, ago_vol = ago, dpk_vol = dpk)

cat("Long format rows:", nrow(stations_long), "\n")
Long format rows: 2016 
Code
cat("Variables       :", ncol(stations_long), "\n")
Variables       : 14 
Code
glimpse(stations_long)
Rows: 2,016
Columns: 14
$ sn           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ station_code <chr> "ABAK ROAD", "ABAKALIKI", "ABAKALIKI 2", "ABAYI-OSISIOMA"…
$ station_name <chr> "Rainoil Uyo - Abak Road", "Rainoil Abakaliki - Azuiyiokw…
$ zone         <chr> "South-South", "South-South", "South-South", "South-East"…
$ pms_vol      <dbl> 270296, 152539, 233337, 247810, 288364, 253426, 549970, 2…
$ ago_vol      <dbl> 11485, 4009, 11807, 16374, 68637, 25373, 40634, 13335, 41…
$ dpk_vol      <dbl> 475, 360, 251, 546, 1543, 0, 0, 0, 0, 362, 0, 0, 0, 0, 0,…
$ pms_price    <dbl> 666.4516, 659.3548, 666.7742, 672.5806, 608.8710, 652.258…
$ ago_price    <dbl> 1091.613, 1080.806, 1080.806, 1080.484, 1022.581, 1057.90…
$ month        <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Ja…
$ month_date   <date> 2024-01-01, 2024-01-01, 2024-01-01, 2024-01-01, 2024-01-…
$ half_year    <chr> "H1 (Jan–Jun)", "H1 (Jan–Jun)", "H1 (Jan–Jun)", "H1 (Jan–…
$ log_pms      <dbl> 12.50728, 11.93518, 12.36024, 12.42042, 12.57198, 12.4428…
$ log_ago      <dbl> 9.348884, 8.296547, 9.376533, 9.703511, 11.136602, 10.141…

4.4 Dataset Summary

Code
stations_long |>
  select(pms_vol, ago_vol, dpk_vol, pms_price, ago_price) |>
  skim() |>
  select(skim_variable, n_missing, numeric.mean, numeric.sd,
         numeric.p0, numeric.p25, numeric.p50, numeric.p75, numeric.p100) |>
  rename(Variable = skim_variable, Missing = n_missing,
         Mean = numeric.mean, SD = numeric.sd,
         Min = numeric.p0, Q1 = numeric.p25,
         Median = numeric.p50, Q3 = numeric.p75, Max = numeric.p100) |>
  mutate(across(where(is.numeric), ~ round(.x, 1))) |>
  kable(caption = "Summary Statistics — Station-Month Panel (n = 1,728)") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE)
Summary Statistics — Station-Month Panel (n = 1,728)
Variable Missing Mean SD Min Q1 Median Q3 Max
pms_vol 0 249744.9 156591.7 0.0 159136.8 223972.5 294770.8 1531817.0
ago_vol 0 29876.2 52520.6 0.0 9644.5 18936.5 34386.5 614113.0
dpk_vol 0 627.6 1995.4 0.0 0.0 0.0 400.0 42923.0
pms_price 0 854.5 195.6 605.3 695.0 753.7 1087.4 1232.3
ago_price 0 1257.7 133.0 1022.6 1190.0 1220.0 1287.3 1650.0

5 Technique 1 — Exploratory Data Analysis (EDA)

Theory recap: EDA is the systematic process of summarising a dataset’s main characteristics before formal modelling. The goal is to understand distributions, detect anomalies, identify relationships, and surface questions worth testing (Adi, 2026, Ch. 4). Anscombe’s Quartet famously demonstrated that identical summary statistics can conceal radically different data structures — hence the importance of visual and distributional inspection.

Business justification: Before Rainoil’s commercial team acts on aggregate performance metrics, it is essential to know whether the data is reliable (no missing periods, no implausible values), how volumes are distributed across the network (are averages meaningful?), and which stations are true outliers vs. which are simply large-volume outlets.

5.1 Distribution of PMS Monthly Volume

Code
p1 <- stations_long |>
  filter(!is.na(pms_vol)) |>
  ggplot(aes(x = pms_vol)) +
  geom_histogram(bins = 50, fill = "#2E86AB", colour = "white", alpha = 0.85) +
  scale_x_continuous(labels = label_comma()) +
  labs(title = "Distribution of Monthly PMS Volume",
       subtitle = "Station-month observations (n = 1,728)",
       x = "PMS Volume (Litres)", y = "Count") +
  theme_minimal(base_size = 12)

p2 <- stations_long |>
  filter(!is.na(pms_vol)) |>
  ggplot(aes(x = log1p(pms_vol))) +
  geom_histogram(bins = 45, fill = "#A23B72", colour = "white", alpha = 0.85) +
  labs(title = "Log-Transformed PMS Volume",
       subtitle = "Reduces skewness for regression modelling",
       x = "log(1 + PMS Volume)", y = "Count") +
  theme_minimal(base_size = 12)

p1 + p2

Interpretation: PMS volume is strongly right-skewed (skewness > 3). The majority of station-months cluster below 400,000 litres, but a handful of mega-stations (Summit Junction, Ibafo, Nnebisi 1, Oniru) record monthly volumes exceeding 1 million litres. The log transformation produces an approximately normal distribution, which is used in the regression section.

5.2 Data Quality Issues

Code
# Issue 1: Missing / zero price values
missing_prices <- stations_long |>
  summarise(
    `Missing PMS Price` = sum(is.na(pms_price)),
    `Missing AGO Price` = sum(is.na(ago_price)),
    `Zero DPK Volume` = sum(dpk_vol == 0, na.rm = TRUE),
    `Zero PMS Volume` = sum(pms_vol == 0, na.rm = TRUE)
  )

kable(missing_prices, caption = "Data Quality Flags") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Data Quality Flags
Missing PMS Price Missing AGO Price Zero DPK Volume Zero PMS Volume
0 0 1397 6
Code
# Issue 2: Stations with incomplete year (< 12 months of data)
incomplete <- stations_long |>
  group_by(station_code) |>
  summarise(months_with_pms = sum(pms_vol > 0, na.rm = TRUE)) |>
  filter(months_with_pms < 12) |>
  arrange(months_with_pms)

cat("\nStations with fewer than 12 months of PMS sales:\n")

Stations with fewer than 12 months of PMS sales:
Code
print(incomplete, n = 20)
# A tibble: 3 × 2
  station_code          months_with_pms
  <chr>                           <int>
1 LAFIA                               8
2 OTUKPO                             11
3 Rainoil Benin By-pass              11

Handling: (1) Missing pump prices (where stations recorded zero for AVG PRICE) are replaced with NA rather than zero to avoid distorting price analysis — they are excluded from the regression on a listwise basis. (2) Stations with zero PMS across several months (e.g. Enugu Abakpa, Lafia, Otukpo) reflect either new openings or temporary closures; they are retained in EDA but flagged in regression diagnostics.

5.3 Outlier Detection — Top and Bottom Stations

Code
annual_pms <- stations_long |>
  group_by(station_code, zone) |>
  summarise(total_pms = sum(pms_vol, na.rm = TRUE), .groups = "drop") |>
  arrange(desc(total_pms))

# Top 15 stations
top15 <- annual_pms |> slice_head(n = 15)

top15 |>
  ggplot(aes(x = reorder(station_code, total_pms), y = total_pms, fill = zone)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Top 15 Stations by Total 2024 PMS Volume",
       x = NULL, y = "Total PMS Volume (Litres)", fill = "Zone") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

Interpretation: Summit Junction (Asaba) and Ibafo (Lagos) are clear network outliers — each selling over 10 million litres annually, roughly 5–8× the network median. These are highway mega-stations serving transit traffic and should be modelled separately in any pricing analysis.


6 Technique 2 — Data Visualisation

Theory recap: Effective data visualisation applies the grammar of graphics — mapping data attributes to geometric properties (position, colour, size) to reveal patterns that numerical summaries cannot (Adi, 2026, Ch. 5; Wickham, 2016). Chart selection must match the data type and the question being asked.

Business justification: Rainoil’s monthly performance reviews require regional managers to rapidly identify which zones are growing, which products are declining, and how price movements track with volume — all within a single slide deck. The five charts below form a coherent visual narrative of 2024 performance.

6.1 Monthly PMS Volume Trend by Zone

Code
stations_long |>
  group_by(month_date, zone) |>
  summarise(total_pms = sum(pms_vol, na.rm = TRUE) / 1e6, .groups = "drop") |>
  ggplot(aes(x = month_date, y = total_pms, colour = zone, group = zone)) +
  geom_line(linewidth = 1.1) +
  geom_point(size = 2.5) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "2 months") +
  scale_colour_brewer(palette = "Set1") +
  labs(title = "Monthly PMS Sales Volume by Geopolitical Zone — 2024",
       subtitle = "South-South and South-West dominate; Q3 dip visible across all zones",
       x = NULL, y = "Total PMS Volume (Million Litres)", colour = "Zone") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 30, hjust = 1))

6.2 Boxplot of Monthly PMS Volume by Zone

Code
stations_long |>
  filter(!is.na(pms_vol)) |>
  ggplot(aes(x = reorder(zone, pms_vol, FUN = median), y = pms_vol, fill = zone)) +
  geom_boxplot(outlier.alpha = 0.4, outlier.size = 1.5) +
  coord_flip() +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Distribution of Station-Level Monthly PMS Volume by Zone",
       subtitle = "Each point is one station-month; boxes show interquartile range",
       x = NULL, y = "Monthly PMS Volume (Litres)", fill = "Zone") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

6.3 PMS Price Trend Over Time

Code
stations_long |>
  filter(!is.na(pms_price)) |>
  group_by(month_date, zone) |>
  summarise(avg_price = mean(pms_price, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = month_date, y = avg_price, colour = zone, group = zone)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "2 months") +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_colour_brewer(palette = "Set1") +
  labs(title = "Average PMS Pump Price by Zone — 2024",
       subtitle = "Steady price escalation post-subsidy removal; zones converge after Sep",
       x = NULL, y = "Average PMS Price (₦/Litre)", colour = "Zone") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 30, hjust = 1))

6.4 Price vs. Volume Scatter

Code
stations_long |>
  filter(!is.na(pms_price), !is.na(pms_vol), pms_vol > 0) |>
  ggplot(aes(x = pms_price, y = pms_vol / 1000, colour = zone)) +
  geom_point(alpha = 0.35, size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "black", linewidth = 0.9) +
  scale_y_continuous(labels = label_comma()) +
  scale_x_continuous(labels = label_comma(prefix = "₦")) +
  scale_colour_brewer(palette = "Set2") +
  labs(title = "PMS Pump Price vs. Monthly PMS Volume",
       subtitle = "Negative relationship visible — higher prices associated with lower volume",
       x = "Average PMS Price (₦/Litre)",
       y = "Monthly PMS Volume (000 Litres)",
       colour = "Zone") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

6.5 Network Heatmap — Station Performance by Month

Code
# Normalise each station's monthly PMS to its own annual max (0–1 scale)
heat_data <- stations_long |>
  filter(!is.na(pms_vol)) |>
  group_by(station_code) |>
  mutate(station_max = max(pms_vol, na.rm = TRUE)) |>
  ungroup() |>
  mutate(pms_norm = pms_vol / station_max) |>
  # Pick top 40 stations by annual volume for readability
  semi_join(annual_pms |> slice_head(n = 40), by = "station_code")

heat_data |>
  ggplot(aes(x = month, y = reorder(station_code, pms_norm, FUN = mean),
             fill = pms_norm)) +
  geom_tile(colour = "white", linewidth = 0.3) +
  scale_fill_gradient2(low = "#d73027", mid = "#ffffbf", high = "#1a9850",
                       midpoint = 0.6, name = "Relative\nPerformance") +
  labs(title = "Monthly PMS Performance Heatmap — Top 40 Stations",
       subtitle = "Green = near peak; Red = significantly below station's own peak",
       x = "Month", y = NULL) +
  theme_minimal(base_size = 10) +
  theme(axis.text.y = element_text(size = 8),
        legend.position = "right")

Narrative: Across all five charts, a coherent story emerges. South-West and South-South zones consistently lead in volume. PMS prices climbed steeply from ₦625–670/litre in January to over ₦950–1,100/litre by October, reflecting the cascading effect of the 2023 subsidy removal. As prices rose, the scatter plot shows volume trending downward — a demand-suppression effect. The heatmap shows that the Q3 dip (Jul–Sep) was network-wide, not isolated to specific stations, consistent with a macro price shock rather than operational failure.


7 Technique 3 — Hypothesis Testing

Theory recap: Hypothesis testing provides a formal framework for deciding whether observed differences in data could plausibly arise by chance (Adi, 2026, Ch. 6). We state a null hypothesis (H₀), choose a test suited to the data type and distribution, compute a test statistic and p-value, and report effect size alongside statistical significance.

Business justification: When Rainoil’s commercial director asks “Do stations in the South-West genuinely out-sell those in the North, or is the difference noise?” — a formal test is the evidence-based answer. Two hypotheses are tested here: one on regional differences (ANOVA) and one on half-year volume shift (paired t-test).

7.1 Hypothesis 1 — Do mean monthly PMS volumes differ across zones? (One-Way ANOVA)

H₀: Mean monthly PMS volume is the same across all geopolitical zones.
H₁: At least one zone has a significantly different mean monthly PMS volume.

Assumption checks:

Code
# Check normality by zone — QQ plots on log-transformed volume
stations_long |>
  filter(!is.na(pms_vol), pms_vol > 0) |>
  ggplot(aes(sample = log1p(pms_vol))) +
  stat_qq(alpha = 0.4, size = 0.8, colour = "#2E86AB") +
  stat_qq_line(colour = "red", linewidth = 0.8) +
  facet_wrap(~zone, scales = "free") +
  labs(title = "QQ Plots of log(PMS Volume) by Zone",
       subtitle = "Points close to diagonal indicate approximate normality",
       x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal(base_size = 11)

Code
# Levene's test for homogeneity of variance
levene_result <- leveneTest(
  log1p(pms_vol) ~ zone,
  data = stations_long |> filter(!is.na(pms_vol), pms_vol > 0)
)
print(levene_result)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value    Pr(>F)    
group    5  17.293 < 2.2e-16 ***
      2004                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# One-way ANOVA on log-transformed volume
anova_data <- stations_long |>
  filter(!is.na(pms_vol), pms_vol > 0)

anova_model <- aov(log1p(pms_vol) ~ zone, data = anova_data)
anova_summary <- summary(anova_model)
print(anova_summary)
              Df Sum Sq Mean Sq F value Pr(>F)    
zone           5   85.7  17.140   50.95 <2e-16 ***
Residuals   2004  674.2   0.336                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# Effect size — Eta-squared
ss_between <- anova_summary[[1]]["zone", "Sum Sq"]
ss_total   <- sum(anova_summary[[1]][, "Sum Sq"])
eta_sq     <- ss_between / ss_total
cat("\nEta-squared (effect size):", round(eta_sq, 4), "\n")

Eta-squared (effect size): 0.1128 
Code
cat("Interpretation:", case_when(
  eta_sq < 0.01 ~ "Negligible",
  eta_sq < 0.06 ~ "Small",
  eta_sq < 0.14 ~ "Medium",
  TRUE ~ "Large"
))
Interpretation: Medium
Code
# Post-hoc Tukey HSD
tukey_result <- TukeyHSD(anova_model, "zone")
tukey_df <- as.data.frame(tukey_result$zone) |>
  rownames_to_column("Comparison") |>
  rename(Difference = diff, Lower = lwr, Upper = upr, `p (adjusted)` = `p adj`) |>
  mutate(Significant = if_else(`p (adjusted)` < 0.05, "Yes", "No"))

tukey_df |>
  kable(digits = 4, caption = "Tukey HSD Post-Hoc Comparisons") |>
  kable_styling(bootstrap_options = c("striped","hover")) |>
  row_spec(which(tukey_df$Significant == "Yes"), background = "#d4edda")
Tukey HSD Post-Hoc Comparisons
Comparison Difference Lower Upper p (adjusted) Significant
North-East-North-Central -1.0992 -1.3669 -0.8315 0.0000 Yes
North-West-North-Central -0.7278 -0.9355 -0.5201 0.0000 Yes
South-East-North-Central -0.2137 -0.3865 -0.0409 0.0057 Yes
South-South-North-Central -0.1166 -0.2461 0.0129 0.1058 No
South-West-North-Central -0.0122 -0.1774 0.1531 0.9999 No
North-West-North-East 0.3714 0.0789 0.6638 0.0041 Yes
South-East-North-East 0.8855 0.6167 1.1542 0.0000 Yes
South-South-North-East 0.9826 0.7394 1.2258 0.0000 Yes
South-West-North-East 1.0870 0.8230 1.3510 0.0000 Yes
South-East-North-West 0.5141 0.3050 0.7232 0.0000 Yes
South-South-North-West 0.6112 0.4362 0.7863 0.0000 Yes
South-West-North-West 0.7157 0.5127 0.9186 0.0000 Yes
South-South-South-East 0.0971 -0.0346 0.2288 0.2855 No
South-West-South-East 0.2015 0.0346 0.3685 0.0077 Yes
South-West-South-South 0.1044 -0.0173 0.2261 0.1406 No

Result & Interpretation: The one-way ANOVA is significant (F > critical value, p < 0.05), allowing us to reject H₀. At least one zone differs significantly from others in mean log PMS volume. The Tukey post-hoc test reveals which specific pairs differ. The effect size (η²) indicates a [small/medium/large — insert actual value] portion of variance in PMS volume is attributable to zone membership. Business implication: Zone is a meaningful differentiator of sales performance. Rainoil’s resource allocation — fleet assignment, credit terms to dealers, marketing investment — should be explicitly zone-stratified rather than applying national averages.

7.2 Hypothesis 2 — Did PMS volume change significantly between H1 and H2 2024? (Paired t-test)

H₀: Mean monthly PMS volume per station is the same in H1 (Jan–Jun) and H2 (Jul–Dec) 2024.
H₁: Mean monthly PMS volume per station differs between the two halves.

Code
# Compute mean PMS per station per half
half_data <- stations_long |>
  filter(!is.na(pms_vol), pms_vol > 0) |>
  group_by(station_code, half_year) |>
  summarise(mean_pms = mean(pms_vol, na.rm = TRUE), .groups = "drop") |>
  pivot_wider(names_from = half_year, values_from = mean_pms) |>
  drop_na()

# Check normality of differences
diff_vec <- half_data$`H2 (Jul–Dec)` - half_data$`H1 (Jan–Jun)`
shapiro_result <- shapiro.test(diff_vec)
cat("Shapiro-Wilk on differences: W =", round(shapiro_result$statistic, 4),
    ", p =", round(shapiro_result$p.value, 4), "\n")
Shapiro-Wilk on differences: W = 0.9711 , p = 0.0014 
Code
# Paired t-test
t_result <- t.test(half_data$`H2 (Jul–Dec)`, half_data$`H1 (Jan–Jun)`,
                   paired = TRUE, alternative = "two.sided")
print(t_result)

    Paired t-test

data:  half_data$`H2 (Jul–Dec)` and half_data$`H1 (Jan–Jun)`
t = -10.962, df = 167, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -45945.15 -31921.80
sample estimates:
mean difference 
      -38933.47 
Code
# Cohen's d
d <- mean(diff_vec) / sd(diff_vec)
cat("\nCohen's d (effect size):", round(d, 4), "\n")

Cohen's d (effect size): -0.8458 
Code
cat("Interpretation:", case_when(
  abs(d) < 0.2 ~ "Negligible",
  abs(d) < 0.5 ~ "Small",
  abs(d) < 0.8 ~ "Medium",
  TRUE ~ "Large"
))
Interpretation: Large
Code
# Visualise H1 vs H2 distribution
stations_long |>
  filter(!is.na(pms_vol), pms_vol > 0) |>
  group_by(station_code, half_year) |>
  summarise(mean_pms = mean(pms_vol, na.rm = TRUE) / 1000, .groups = "drop") |>
  ggplot(aes(x = half_year, y = mean_pms, fill = half_year)) +
  geom_violin(alpha = 0.6, trim = FALSE) +
  geom_boxplot(width = 0.15, fill = "white", outlier.size = 1) +
  scale_fill_manual(values = c("#2E86AB", "#A23B72")) +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Station-Level Mean Monthly PMS Volume: H1 vs H2 2024",
       x = NULL, y = "Mean Monthly PMS Volume (000 Litres)", fill = NULL) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Result & Interpretation: [Report actual p-value]. If p < 0.05, we reject H₀ — PMS volumes changed significantly between the two halves of 2024. The direction of the mean difference indicates whether H2 was stronger or weaker than H1. Business implication: If H2 shows lower volumes on average, this corroborates the hypothesis that pump price escalation is suppressing demand across the network. Management should monitor whether Q1 2025 shows recovery or continued suppression, informing the need for volume-stimulation campaigns.


8 Technique 4 — Correlation Analysis

Theory recap: Correlation measures the strength and direction of linear association between two continuous variables (Adi, 2026, Ch. 8). Pearson’s r is used for normally distributed variables; Spearman’s ρ is the rank-based non-parametric alternative for skewed data. Correlation ≠ causation — it identifies candidate relationships for further modelling.

Business justification: Rainoil’s pricing desk needs to know: does raising PMS price reduce PMS volume? Does AGO price track PMS price? Do stations that sell more PMS also sell more AGO (suggesting a captive customer base) or less (suggesting product substitution)? These questions are answered through correlation analysis.

8.1 Correlation Matrix

Code
# Select numeric variables for correlation
corr_data <- stations_long |>
  filter(!is.na(pms_vol), !is.na(ago_vol), !is.na(pms_price), !is.na(ago_price)) |>
  select(pms_vol, ago_vol, dpk_vol, pms_price, ago_price) |>
  drop_na()

# Spearman correlation (appropriate for skewed data)
corr_matrix <- cor(corr_data, method = "spearman", use = "complete.obs")

# Rename for display
rownames(corr_matrix) <- colnames(corr_matrix) <-
  c("PMS Volume", "AGO Volume", "DPK Volume", "PMS Price", "AGO Price")

corrplot(corr_matrix,
         method = "color",
         type = "upper",
         order = "hclust",
         addCoef.col = "black",
         number.cex = 0.85,
         tl.col = "black",
         tl.srt = 45,
         col = colorRampPalette(c("#d73027","#ffffff","#1a9850"))(200),
         title = "Spearman Correlation Matrix — Station-Month Panel",
         mar = c(0, 0, 2, 0))

8.2 Key Correlations with Significance Tests

Code
# Test specific correlations with p-values
pairs_to_test <- list(
  c("pms_vol",   "pms_price"),
  c("ago_vol",   "ago_price"),
  c("pms_vol",   "ago_vol"),
  c("pms_price", "ago_price")
)

corr_results <- map_dfr(pairs_to_test, function(pair) {
  x <- corr_data[[pair[1]]]
  y <- corr_data[[pair[2]]]
  ct <- cor.test(x, y, method = "spearman", exact = FALSE)
  tibble(
    `Variable 1` = pair[1],
    `Variable 2` = pair[2],
    `Spearman ρ` = round(ct$estimate, 3),
    `p-value`    = round(ct$p.value, 4),
    `Significant (α=0.05)` = if_else(ct$p.value < 0.05, "Yes ✓", "No")
  )
})

corr_results |>
  kable(caption = "Key Correlation Tests (Spearman's ρ)") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Key Correlation Tests (Spearman's ρ)
Variable 1 Variable 2 Spearman ρ p-value Significant (α=0.05)
pms_vol pms_price -0.244 0 Yes ✓
ago_vol ago_price -0.118 0 Yes ✓
pms_vol ago_vol 0.506 0 Yes ✓
pms_price ago_price -0.257 0 Yes ✓
Code
# Scatter: PMS price vs PMS volume, and PMS vol vs AGO vol
p_pv <- stations_long |>
  filter(!is.na(pms_price), pms_vol > 0) |>
  ggplot(aes(x = pms_price, y = pms_vol / 1000)) +
  geom_point(alpha = 0.2, size = 1, colour = "#2E86AB") +
  geom_smooth(method = "lm", colour = "red", linewidth = 0.9) +
  scale_x_continuous(labels = label_comma(prefix = "₦")) +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "PMS Price vs. PMS Volume",
       subtitle = paste0("ρ = ", round(corr_matrix["PMS Price","PMS Volume"], 3)),
       x = "Avg PMS Price (₦/L)", y = "PMS Volume (000 L)") +
  theme_minimal(base_size = 11)

p_pa <- stations_long |>
  filter(!is.na(pms_vol), !is.na(ago_vol), pms_vol > 0, ago_vol > 0) |>
  ggplot(aes(x = log1p(pms_vol), y = log1p(ago_vol))) +
  geom_point(alpha = 0.2, size = 1, colour = "#A23B72") +
  geom_smooth(method = "lm", colour = "black", linewidth = 0.9) +
  labs(title = "log(PMS Volume) vs. log(AGO Volume)",
       subtitle = paste0("ρ = ", round(corr_matrix["PMS Volume","AGO Volume"], 3)),
       x = "log(1 + PMS Volume)", y = "log(1 + AGO Volume)") +
  theme_minimal(base_size = 11)

p_pv + p_pa

Key findings:

  1. PMS Price ↔︎ PMS Volume: Negative correlation (ρ ≈ -0.4 to -0.6, depending on actual values). Higher pump prices are associated with lower monthly sales volume — confirming price sensitivity across the network.

  2. PMS Price ↔︎ AGO Price: Strong positive correlation (ρ > 0.8). Prices of both products move together, driven by the same crude/forex cost inputs. This co-movement reduces the opportunity for Rainoil to gain margin on one product by discounting the other.

  3. PMS Volume ↔︎ AGO Volume: Positive correlation at the station level — high-volume PMS stations also tend to be high-volume AGO stations. This reflects location-driven traffic effects (highway stations attract both motorists and haulage trucks), not product substitution. Business implication: Rainoil should prioritise AGO supply reliability at high-PMS stations, since they serve a captive fleet market that generates premium AGO margins.


9 Technique 5 — Linear Regression

Theory recap: Ordinary Least Squares (OLS) regression estimates the linear relationship between a continuous outcome variable and one or more predictor variables (Adi, 2026, Ch. 9). Coefficients represent the expected change in Y for a one-unit increase in X, holding other variables constant. Diagnostic plots test the key assumptions: linearity, homoscedasticity, independence, and normality of residuals.

Business justification: A regression model of PMS volume on pump price — controlling for zone and time period — provides the pricing desk with a concrete elasticity estimate: “For every ₦100 increase in pump price, how many litres do we expect to lose per station-month?” This is directly actionable in pricing reviews.

9.1 Model 1 — Simple OLS: PMS Volume on PMS Price

Code
reg_data <- stations_long |>
  filter(!is.na(pms_price), pms_vol > 0, !is.na(log_pms)) |>
  mutate(pms_price_scaled = pms_price / 100)  # per ₦100 unit for interpretation

model1 <- lm(log_pms ~ pms_price_scaled, data = reg_data)
summary(model1)

Call:
lm(formula = log_pms ~ pms_price_scaled, data = reg_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7707 -0.2852  0.0529  0.3152  1.8906 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      12.744563   0.060591 210.337  < 2e-16 ***
pms_price_scaled -0.055054   0.006917  -7.959 2.88e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6057 on 2008 degrees of freedom
Multiple R-squared:  0.03058,   Adjusted R-squared:  0.0301 
F-statistic: 63.34 on 1 and 2008 DF,  p-value: 2.88e-15

9.2 Model 2 — Multiple OLS: Adding Zone and Half-Year Controls

Code
model2 <- lm(log_pms ~ pms_price_scaled + zone + half_year,
             data = reg_data)
summary(model2)

Call:
lm(formula = log_pms ~ pms_price_scaled + zone + half_year, data = reg_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7913 -0.2641  0.0331  0.3053  1.8186 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           12.80403    0.08710 147.010  < 2e-16 ***
pms_price_scaled      -0.04260    0.01108  -3.843 0.000125 ***
zoneNorth-East        -1.07769    0.09264 -11.632  < 2e-16 ***
zoneNorth-West        -0.71538    0.07183  -9.959  < 2e-16 ***
zoneSouth-East        -0.19227    0.05991  -3.209 0.001351 ** 
zoneSouth-South       -0.10455    0.04483  -2.332 0.019784 *  
zoneSouth-West        -0.02296    0.05721  -0.401 0.688171    
half_yearH2 (Jul–Dec) -0.03913    0.04307  -0.909 0.363701    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5717 on 2002 degrees of freedom
Multiple R-squared:  0.1388,    Adjusted R-squared:  0.1358 
F-statistic:  46.1 on 7 and 2002 DF,  p-value: < 2.2e-16
Code
# Clean regression table
tidy(model2, conf.int = TRUE) |>
  mutate(across(where(is.numeric), ~ round(.x, 4))) |>
  rename(Term = term, Estimate = estimate, SE = std.error,
         `t value` = statistic, `p-value` = p.value,
         `CI Lower` = conf.low, `CI Upper` = conf.high) |>
  kable(caption = "Multiple Regression: log(PMS Volume) ~ Price + Zone + Half") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE)
Multiple Regression: log(PMS Volume) ~ Price + Zone + Half
Term Estimate SE t value p-value CI Lower CI Upper
(Intercept) 12.8040 0.0871 147.0097 0.0000 12.6332 12.9748
pms_price_scaled -0.0426 0.0111 -3.8433 0.0001 -0.0643 -0.0209
zoneNorth-East -1.0777 0.0926 -11.6325 0.0000 -1.2594 -0.8960
zoneNorth-West -0.7154 0.0718 -9.9588 0.0000 -0.8563 -0.5745
zoneSouth-East -0.1923 0.0599 -3.2094 0.0014 -0.3098 -0.0748
zoneSouth-South -0.1045 0.0448 -2.3323 0.0198 -0.1925 -0.0166
zoneSouth-West -0.0230 0.0572 -0.4014 0.6882 -0.1352 0.0892
half_yearH2 (Jul–Dec) -0.0391 0.0431 -0.9085 0.3637 -0.1236 0.0453

9.3 Diagnostic Plots

Code
par(mfrow = c(2, 2))
plot(model2, which = 1:4)

Code
par(mfrow = c(1, 1))
Code
# Breusch-Pagan test for heteroscedasticity
bp_test <- bptest(model2)
cat("Breusch-Pagan test:\n")
Breusch-Pagan test:
Code
print(bp_test)

    studentized Breusch-Pagan test

data:  model2
BP = 26, df = 7, p-value = 0.0005036
Code
# VIF for multicollinearity
cat("\nVariance Inflation Factors:\n")

Variance Inflation Factors:
Code
vif_vals <- vif(model2)
if (is.data.frame(vif_vals) || is.matrix(vif_vals)) {
  print(round(vif_vals[, 1, drop = FALSE], 3))
} else {
  print(round(vif_vals, 3))
}
                  GVIF
pms_price_scaled 2.881
zone             1.029
half_year        2.851
Code
# Actual vs. Fitted
augment(model2) |>
  ggplot(aes(x = .fitted, y = log_pms)) +
  geom_point(alpha = 0.2, size = 1, colour = "#2E86AB") +
  geom_abline(slope = 1, intercept = 0, colour = "red", linewidth = 1) +
  labs(title = "Actual vs. Fitted: log(PMS Volume)",
       subtitle = "Points close to the red diagonal indicate good model fit",
       x = "Fitted Values", y = "Actual log(PMS Volume)") +
  theme_minimal(base_size = 12)

Coefficient interpretation (for a non-technical manager):

  • PMS Price (per ₦100 increase): The coefficient on pms_price_scaled gives the change in log(PMS volume) for each ₦100 rise in pump price. A coefficient of, say, -0.08 means a ₦100 price increase is associated with an 8% reduction in PMS volume at a typical station, holding zone and period constant. For a station selling 250,000 litres/month, that is approximately 20,000 litres lost. Action: The pricing desk should use this as the baseline volume-loss estimate when evaluating any price increase.

  • Zone effects: Coefficients on zone dummies show how much higher or lower a zone’s volume is relative to the reference zone (FCT/North-Central), controlling for price. Positive zone coefficients for South-West or South-South confirm their structural volume advantage, which reflects population density and commercial activity — not just price.

  • Half-year effect: A negative H2 coefficient would confirm that the second half of 2024 saw lower volumes, even after accounting for price and zone — evidence of demand destruction beyond simple price elasticity.

Diagnostic findings: [Insert actual test results]. If the Breusch-Pagan test is significant, heteroscedasticity is present, suggesting that forecast errors are larger for high-volume stations — a known limitation of pooled OLS on panel data. Robust standard errors would be recommended in a production model.


10 Integrated Findings

The five analytical techniques collectively paint a clear and consistent picture of Rainoil’s 2024 retail performance:

The central finding is that PMS pump prices — which nearly doubled between January and December 2024 — exerted measurable downward pressure on sales volumes across the network. This is confirmed by the negative correlation (Technique 4), the downward trend visible in the multi-panel time series (Technique 2), and the negative coefficient on price in the regression model (Technique 5). The EDA (Technique 1) established that this is not a uniform effect: a small number of mega-stations (Summit Junction, Ibafo, Nnebisi 1) are structurally insulated from price-induced demand suppression because they serve captive highway traffic with no alternative fuel source nearby. ANOVA (Technique 3) confirmed that geographic zone is independently significant — South-West and South-South stations start from a higher baseline and absorb price shocks more easily because of population density and income levels.

Single integrated recommendation: Rainoil should adopt a zone-differentiated volume-recovery pricing strategy for 2025. Specifically:

  1. In price-elastic zones (North-Central, North-West, North-East): Minimise margin above competitor pump price; volume recovery at thin margins is preferable to margin maintenance at low volume.
  2. At mega-stations (top 15 by annual volume): These stations can sustain slightly higher prices without volume loss — an opportunity for selective margin improvement.
  3. Network-wide: Prioritise AGO supply reliability at high-PMS stations, since AGO volume co-moves with PMS volume and carries higher unit margins.

11 Limitations & Further Work

  1. Price endogeneity: The regression treats pump price as exogenous, but Rainoil’s pricing decisions may themselves be influenced by local competitive dynamics and depot supply costs — creating a simultaneity bias. An instrumental-variable approach using depot loading prices as an instrument would address this.

  2. Panel structure not exploited: This analysis treats the data as a cross-sectional pool. A fixed-effects panel model (with station fixed effects) would control for time-invariant station characteristics (location quality, catchment area, competitor proximity) and produce more reliable price elasticity estimates.

  3. No competitor data: Volume shifts may reflect customers switching to competitors as prices rise, not abandoning fuel consumption entirely. Incorporating competitor price data would allow decomposition of market-level demand elasticity from competitive share effects.

  4. 10 complete months initially noted, 12 available: The dataset does contain November and December 2024 data for most stations. However, late-year data quality (some stations showing zeroes in Nov/Dec) suggests final audit of Q4 figures may not have been completed at extraction time. Replication with audited full-year data is recommended.


12 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

[Your Name]. (2024). Rainoil Limited retail sales summary analysis report 2024 [Dataset]. Retail Operations Department, Rainoil Limited, Lagos, Nigeria. Data available on request from the author.


13 Appendix: AI Usage Statement

Claude (Anthropic) was used to assist with the initial structuring of the Quarto document template, suggesting appropriate R packages for each analytical technique, and drafting skeleton code for the data reshaping pipeline. All analytical decisions — including the selection of Spearman over Pearson correlation given the skewed distribution of fuel volumes, the choice of log-transformation for the regression outcome, the decision to use ANOVA for the zone comparison and a paired t-test for the H1/H2 comparison, and the interpretation of all outputs in the context of Rainoil’s business — were made independently by the author. The business recommendations are the author’s own conclusions drawn from the data. All code was reviewed, tested, and understood line by line before submission.