Retail Fuel Sales Performance Analytics: Exploratory & Inferential Analysis of Rainoil/Fynefield Station Data (Oct–Dec 2025)

Author

[Your Full Name]

Published

May 10, 2026


1. Executive Summary

Nigeria’s downstream petroleum sector is characterised by high station density, thin margins, and intense competition for throughput. For a retail fuel distributor operating a nationwide network, understanding which stations generate the most revenue, which products drive that revenue, and how performance shifts across months is critical to making sound decisions on logistics, pricing, and capital deployment.

This report analyses station-level retail fuel sales data extracted from the company’s internal ERP system, covering 193 stations in the Rainoil and Fynefield networks across October to December 2025. Three products are examined — Premium Motor Spirit (PMS), Automotive Gas Oil (AGO), and Dual Purpose Kerosene (DPK) — across six Nigerian geopolitical zones.

Five analytical techniques are applied: Exploratory Data Analysis surfaces distributional patterns and data quality issues; Data Visualisation communicates performance trends across stations, products, and months; Hypothesis Testing formally tests whether regional and seasonal performance differences are statistically significant; Correlation Analysis quantifies the relationship between price, volume, and revenue; and Linear Regression models revenue as a function of operational variables.

Key findings show that PMS dominates revenue contribution, December consistently outperforms earlier months, and FCT and Lagos stations lead in throughput. Volume — not pump price — is the primary revenue driver across the network. The central recommendation is a throughput-maximisation strategy: prioritise supply reliability, reduce station downtime, and pre-position stock in high-volume corridors ahead of the December peak.


2. Professional Disclosure

Job Title: Finance Manager, Liquidity & Treasury Management
Organisation Type: Downstream Petroleum Distribution Company (Nigeria)

Operational Context

As Finance Manager for Liquidity and Treasury Management at a downstream petroleum distribution company, my primary responsibility is ensuring the organisation maintains sufficient cash flow to fund product procurement, logistics operations, and debt obligations across a nationwide retail network. In practical terms, this means I monitor daily and monthly cash inflows from over 190 retail stations, manage working capital cycles tied to fuel stock purchases and sales receipts, and provide liquidity forecasts that inform decisions on short-term borrowing, supplier payment terms, and capital deployment.

Station-level retail sales data is not peripheral to my work — it is its foundation. The timing, volume, and product mix of fuel sales at each station directly determines when cash enters the business, how much liquidity is available at any point in the month, and how reliably we can meet our procurement and debt service obligations. The five techniques selected for this case study each address a specific analytical need that arises in my treasury management function.

Technique Justification

1. Exploratory Data Analysis (EDA): Before I can build any liquidity forecast or cash flow projection, I need to understand the distributional behaviour of station revenue — which stations are reliable cash generators, which are erratic, and where zero or near-zero sales entries indicate supply disruptions that will create cash flow gaps. EDA gives me a systematic, reproducible framework for this validation, replacing the informal eyeballing of spreadsheets that currently precedes every monthly treasury review.

2. Data Visualisation: My monthly treasury reports to the CFO and Board Finance Committee must communicate cash flow patterns clearly — which regions are generating the most liquidity, whether December’s revenue surge is consistent across the network, and how product mix shifts affect the timing of cash receipts. Visualisation techniques translate complex multi-station, multi-product data into narratives that non-technical executives can act on when approving working capital facilities or adjusting payment schedules.

3. Hypothesis Testing: A core question in my liquidity planning is whether observed seasonal cash flow patterns are statistically reliable or simply artefacts of a short observation window. For instance, if December consistently generates higher inflows than October, I can build that into our borrowing plan with confidence. Hypothesis testing — specifically ANOVA for regional differences and t-tests for monthly comparisons — gives me the statistical evidence to defend seasonal liquidity assumptions to external lenders and auditors.

4. Correlation Analysis: Understanding whether pump price increases translate into higher cash inflows — or whether they suppress volume and therefore reduce total receipts — is directly relevant to treasury. If price and revenue are weakly correlated but volume and revenue are strongly correlated, my cash flow forecasts should be anchored to throughput projections rather than pricing assumptions. Correlation analysis makes this relationship explicit and quantifiable.

5. Linear Regression: My annual liquidity plan requires station-level revenue forecasts as inputs to the company’s 12-month cash flow model. Currently these are built on simple historical averages. A regression model that explains revenue from volume, product type, region, and month gives me a more rigorous forecasting tool — one that can be updated as operational assumptions change and that produces defensible projections I can present to banking partners when negotiating working capital credit lines.


3. Data Collection & Sampling

3.1 Data Source & Collection Method

The dataset was extracted from the company’s internal Enterprise Resource Planning (ERP) system by the Finance Analytics team. The ERP system aggregates daily pump-level transactions — recorded at the point of sale via automated dispensing meters — into monthly station-level summaries. The extraction was performed by running a standard sales performance report for the period October 1 to December 31, 2025, filtered to all active Rainoil and Fynefield retail stations. The output was exported as a structured Excel workbook, which forms the primary dataset for this analysis.

No manual data entry was involved in the collection process. All figures originate from automated meter readings reconciled against physical stock counts at month-end, which is the company’s standard procedure for financial reporting and regulatory compliance with the Nigerian Midstream and Downstream Petroleum Regulatory Authority (NMDPRA).

3.2 Variables

The dataset contains the following variables for each station, across three products (PMS, AGO, DPK) and three months (October, November, December 2025):

Variable Type Description
Station Categorical Name and physical location of the retail outlet
Product Categorical Fuel type: PMS (petrol), AGO (diesel), DPK (kerosene)
Month Date / Time Reporting month: Oct-25, Nov-25, or Dec-25
Sales Volume (Litres) Numeric (continuous) Total litres dispensed per product per month
Average Price (₦/Litre) Numeric (continuous) Average pump price during the month
Total Revenue (₦) Numeric (continuous) Total naira receipts = volume × average price

After reshaping from the raw wide format to a long analytical format (one row per station–product–month combination), the dataset yields 1,737 observations across 193 stations — well above the 100-observation minimum required.

3.3 Sampling Frame & Justification

This is a complete census, not a sample. Every active station in the Rainoil and Fynefield retail network that recorded at least one transaction during the three-month window is included. There was no need for probabilistic sampling because the full population of company-operated stations is accessible through the ERP system and the dataset remains computationally manageable.

The three-month window (Q4 2025) was chosen because it represents the most recent complete quarter available at the time of submission, and because Q4 is operationally significant — it includes the year-end demand surge in December, which has direct implications for the company’s liquidity planning cycle. Extending the window further back would require accessing archived ERP snapshots, which was not feasible within the submission timeline.

3.4 Ethical Notes & Data Governance

The dataset contains no personally identifiable information (PII). All records relate to commercial retail locations — station names are publicly visible on signage, company websites, and the NMDPRA retail licence register. No customer-level, employee-level, or individually sensitive data is present.

The data was accessed under the company’s standard internal data governance framework. As Finance Manager with authorised access to the ERP system, no special ethical clearance was required beyond normal role-based access controls. Commercially sensitive aggregate revenue figures are included in this submission in line with the scope approved for the LBS Data Analytics programme. These figures have been rounded to two decimal places where necessary and no station-level data has been withheld that would affect the integrity of the analysis.

Data Citation:
[Your Full Name]. (2026). Rainoil/Fynefield retail fuel sales — Q4 2025 [Dataset]. Extracted from internal ERP system, Finance Department, [Your Company Name], Lagos, Nigeria. Data available on request from the author.


4. Data Description

This section loads the raw Excel file, reshapes it from wide to long format, assigns a geographic region to each station, and produces summary statistics and variable distributions. All subsequent sections depend on the objects created here.

Code
# ── Packages ──────────────────────────────────────────────────────────────────
library(tidyverse)
library(readxl)
library(janitor)
library(scales)
library(knitr)
library(kableExtra)
library(moments)

# ── Load raw Excel (multi-header structure) ───────────────────────────────────
raw <- read_excel("2025_OCT_NOV_DEC_SALES_DATA.xlsx",
                  col_names = FALSE, skip = 2)

# ── Assign clean column names ─────────────────────────────────────────────────
col_names <- c(
  "sn", "station",
  "vol_pms_oct", "vol_ago_oct", "vol_dpk_oct",
  "vol_pms_nov", "vol_ago_nov", "vol_dpk_nov",
  "vol_pms_dec", "vol_ago_dec", "vol_dpk_dec",
  "price_pms_oct", "price_ago_oct", "price_dpk_oct",
  "price_pms_nov", "price_ago_nov", "price_dpk_nov",
  "price_pms_dec", "price_ago_dec", "price_dpk_dec",
  "rev_pms_oct", "rev_ago_oct", "rev_dpk_oct",
  "rev_pms_nov", "rev_ago_nov", "rev_dpk_nov",
  "rev_pms_dec", "rev_ago_dec", "rev_dpk_dec"
)

# ── Clean wide format ─────────────────────────────────────────────────────────
sales_wide <- raw %>%
  slice(2:n()) %>%                          # drop product sub-header row
  set_names(col_names) %>%
  filter(!is.na(sn), !is.na(station)) %>%   # drop totals row
  mutate(across(-station, as.numeric),
         station = str_trim(station))

# ── Reshape to long format ────────────────────────────────────────────────────
# Helper: pivot one metric (vol / price / rev) across all product-month combos
pivot_metric <- function(df, metric) {
  df %>%
    select(station, matches(paste0("^", metric, "_"))) %>%
    pivot_longer(
      cols      = -station,
      names_to  = c("product", "month"),
      names_pattern = paste0(metric, "_(.+)_(.+)"),
      values_to = metric
    )
}

sales_long <- pivot_metric(sales_wide, "vol") %>%
  left_join(pivot_metric(sales_wide, "price"), by = c("station", "product", "month")) %>%
  left_join(pivot_metric(sales_wide, "rev"),   by = c("station", "product", "month")) %>%
  mutate(
    product = toupper(product),
    month   = factor(month,
                     levels = c("oct", "nov", "dec"),
                     labels = c("Oct-25", "Nov-25", "Dec-25")),
    # ── Assign region based on station name keywords ──────────────────────────
    region = case_when(
      str_detect(station, regex("FCT|Abuja|Kado|Karmo|Dutse|Gwarimpa|Apo", ignore_case = TRUE)) ~ "FCT / Abuja",
      str_detect(station, regex("Ibafo|Ipaja|Ayobo|Abijo",                 ignore_case = TRUE)) ~ "Lagos",
      str_detect(station, regex("Port|Choba|Borokiri|Igwuruta|Isiokpo",    ignore_case = TRUE)) ~ "Rivers",
      str_detect(station, regex("Kaduna|Gombe|Jos|Bukuru|Chikun|Barnawa|Kachia|Muazu|Gboko", ignore_case = TRUE)) ~ "North",
      str_detect(station, regex("Ilorin|Ado.Ekiti",                        ignore_case = TRUE)) ~ "South-West (Other)",
      str_detect(station, regex("Enugu|Awka|Awkuzu|Aba|Eket|Uyo|Ikom|Calabar|Abakaliki", ignore_case = TRUE)) ~ "South-East / SS Cross River",
      TRUE ~ "South-South (Delta / Edo / Anambra)"
    )
  ) %>%
  filter(vol > 0 | rev > 0)   # keep only active station-product-month combos

cat("Total observations (active):", nrow(sales_long), "\n")
Total observations (active): 1247 
Code
cat("Unique stations             :", n_distinct(sales_long$station), "\n")
Unique stations             : 192 
Code
cat("Products                    :", paste(unique(sales_long$product), collapse = ", "), "\n")
Products                    : PMS, AGO, DPK 
Code
cat("Months                      :", paste(levels(sales_long$month),  collapse = ", "), "\n")
Months                      : Oct-25, Nov-25, Dec-25 
Code
cat("Regions                     :", paste(unique(sales_long$region), collapse = " | "), "\n")
Regions                     : South-East / SS Cross River | Lagos | South-South (Delta / Edo / Anambra) | FCT / Abuja | South-West (Other) | North | Rivers 
Code
# ── Summary statistics by product ────────────────────────────────────────────
sales_long %>%
  group_by(product) %>%
  summarise(
    N            = n(),
    Mean_Vol     = mean(vol,   na.rm = TRUE),
    Median_Vol   = median(vol, na.rm = TRUE),
    SD_Vol       = sd(vol,     na.rm = TRUE),
    Mean_Price   = mean(price, na.rm = TRUE),
    Mean_Rev_M   = mean(rev,   na.rm = TRUE) / 1e6,   # in ₦ millions
    .groups      = "drop"
  ) %>%
  mutate(across(where(is.numeric), ~ round(.x, 1))) %>%
  kable(
    caption    = "Table 1: Summary Statistics by Product Type",
    col.names  = c("Product", "N", "Mean Vol (L)", "Median Vol (L)",
                   "SD Vol", "Mean Price (₦/L)", "Mean Revenue (₦M)"),
    format.args = list(big.mark = ",")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 1: Summary Statistics by Product Type
Product N Mean Vol (L) Median Vol (L) SD Vol Mean Price (₦/L) Mean Revenue (₦M)
AGO 565 29,787.5 18,850.0 51,667.2 1,064.0 31.6
DPK 112 1,265.4 805.5 1,605.1 1,407.5 1.8
PMS 570 200,736.6 176,135.0 131,179.0 914.6 183.4
Code
# ── Variable distributions ────────────────────────────────────────────────────
p_vol <- sales_long %>%
  filter(vol > 0) %>%
  ggplot(aes(x = vol / 1e3, fill = product)) +
  geom_histogram(bins = 35, colour = "white", alpha = 0.85) +
  facet_wrap(~product, scales = "free") +
  scale_fill_manual(values = c("PMS" = "#1565C0", "AGO" = "#E65100", "DPK" = "#2E7D32")) +
  scale_x_continuous(labels = label_number(suffix = "K")) +
  labs(title = "Figure 1: Distribution of Monthly Sales Volume by Product",
       x = "Volume (000 litres)", y = "Count") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title      = element_text(face = "bold"),
        strip.text      = element_text(face = "bold"))

p_rev <- sales_long %>%
  filter(rev > 0) %>%
  ggplot(aes(x = rev / 1e6, fill = product)) +
  geom_histogram(bins = 35, colour = "white", alpha = 0.85) +
  facet_wrap(~product, scales = "free") +
  scale_fill_manual(values = c("PMS" = "#1565C0", "AGO" = "#E65100", "DPK" = "#2E7D32")) +
  scale_x_continuous(labels = label_number(suffix = "M")) +
  labs(title = "Figure 2: Distribution of Monthly Revenue by Product",
       x = "Revenue (₦ Millions)", y = "Count") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title      = element_text(face = "bold"),
        strip.text      = element_text(face = "bold"))

p_vol

Code
p_rev

Code
# ── Skewness summary ──────────────────────────────────────────────────────────
sales_long %>%
  filter(vol > 0) %>%
  group_by(product) %>%
  summarise(
    Skewness = round(skewness(vol), 2),
    Kurtosis = round(kurtosis(vol), 2),
    .groups  = "drop"
  ) %>%
  kable(caption = "Table 2: Skewness and Excess Kurtosis of Sales Volume") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 2: Skewness and Excess Kurtosis of Sales Volume
product Skewness Kurtosis
AGO 8.11 80.22
DPK 2.78 12.20
PMS 2.83 17.18
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

# ── Load raw Excel ────────────────────────────────────────────────────────────
raw = pd.read_excel("2025_OCT_NOV_DEC_SALES_DATA.xlsx",
                    header=None, skiprows=2)

col_names = [
    "sn", "station",
    "vol_pms_oct", "vol_ago_oct", "vol_dpk_oct",
    "vol_pms_nov", "vol_ago_nov", "vol_dpk_nov",
    "vol_pms_dec", "vol_ago_dec", "vol_dpk_dec",
    "price_pms_oct", "price_ago_oct", "price_dpk_oct",
    "price_pms_nov", "price_ago_nov", "price_dpk_nov",
    "price_pms_dec", "price_ago_dec", "price_dpk_dec",
    "rev_pms_oct", "rev_ago_oct", "rev_dpk_oct",
    "rev_pms_nov", "rev_ago_nov", "rev_dpk_nov",
    "rev_pms_dec", "rev_ago_dec", "rev_dpk_dec"
]

# ── Clean wide format ─────────────────────────────────────────────────────────
df_wide = raw.iloc[1:].copy()
df_wide.columns = col_names
df_wide = df_wide.dropna(subset=["sn", "station"])
df_wide["station"] = df_wide["station"].str.strip()
for c in col_names[2:]:
    df_wide[c] = pd.to_numeric(df_wide[c], errors="coerce")

# ── Reshape to long format ────────────────────────────────────────────────────
products = ["pms", "ago", "dpk"]
months   = {"oct": "Oct-25", "nov": "Nov-25", "dec": "Dec-25"}
records  = []

for _, row in df_wide.iterrows():
    for prod in products:
        for mon, mon_label in months.items():
            records.append({
                "station": row["station"],
                "product": prod.upper(),
                "month":   mon_label,
                "vol":     row[f"vol_{prod}_{mon}"],
                "price":   row[f"price_{prod}_{mon}"],
                "rev":     row[f"rev_{prod}_{mon}"]
            })

sales = pd.DataFrame(records)

# ── Assign regions ────────────────────────────────────────────────────────────
def assign_region(station):
    s = str(station).lower()
    if any(x in s for x in ["fct", "abuja", "kado", "karmo", "dutse", "gwarimpa", "apo"]):
        return "FCT / Abuja"
    if any(x in s for x in ["ibafo", "ipaja", "ayobo", "abijo"]):
        return "Lagos"
    if any(x in s for x in ["port", "choba", "borokiri", "igwuruta", "isiokpo"]):
        return "Rivers"
    if any(x in s for x in ["kaduna", "gombe", "jos", "bukuru", "chikun",
                              "barnawa", "kachia", "muazu", "gboko"]):
        return "North"
    if any(x in s for x in ["ilorin", "ado-ekiti"]):
        return "South-West (Other)"
    if any(x in s for x in ["enugu", "awka", "awkuzu", "aba", "eket",
                              "uyo", "ikom", "calabar", "abakaliki"]):
        return "South-East / SS Cross River"
    return "South-South (Delta / Edo / Anambra)"

sales["region"] = sales["station"].apply(assign_region)
month_order     = ["Oct-25", "Nov-25", "Dec-25"]
sales["month"]  = pd.Categorical(sales["month"], categories=month_order, ordered=True)

# Keep only active combos
sales_active = sales[(sales["vol"] > 0) | (sales["rev"] > 0)].copy()

print(f"Total observations (active): {len(sales_active):,}")
Total observations (active): 1,247
Code
print(f"Unique stations             : {sales_active['station'].nunique()}")
Unique stations             : 192
Code
print(f"Products                    : {', '.join(sales_active['product'].unique())}")
Products                    : PMS, AGO, DPK
Code
print(f"Months                      : {', '.join(month_order)}")
Months                      : Oct-25, Nov-25, Dec-25
Code
print(f"Regions                     : {' | '.join(sales_active['region'].unique())}")
Regions                     : South-East / SS Cross River | Lagos | South-South (Delta / Edo / Anambra) | FCT / Abuja | South-West (Other) | North | Rivers
Code
# ── Summary statistics ────────────────────────────────────────────────────────
summary = (sales_active
           .groupby("product")
           .agg(
               N          = ("vol",   "count"),
               Mean_Vol   = ("vol",   "mean"),
               Median_Vol = ("vol",   "median"),
               SD_Vol     = ("vol",   "std"),
               Mean_Price = ("price", "mean"),
               Mean_Rev_M = ("rev",   lambda x: x.mean() / 1e6)
           )
           .round(1)
           .reset_index())

summary.columns = ["Product", "N", "Mean Vol (L)", "Median Vol (L)",
                   "SD Vol", "Mean Price (₦/L)", "Mean Revenue (₦M)"]
print("\nTable 1: Summary Statistics by Product Type")

Table 1: Summary Statistics by Product Type
Code
print(summary.to_string(index=False))
Product   N  Mean Vol (L)  Median Vol (L)   SD Vol  Mean Price (₦/L)  Mean Revenue (₦M)
    AGO 565       29787.5         18850.0  51667.2            1064.0               31.6
    DPK 112        1265.4           805.5   1605.1            1407.5                1.8
    PMS 570      200736.6        176135.0 131179.0             914.6              183.4
Code
# ── Variable distributions ────────────────────────────────────────────────────
from scipy.stats import skew, kurtosis

colors = {"PMS": "#1565C0", "AGO": "#E65100", "DPK": "#2E7D32"}
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
fig.suptitle("Figures 1 & 2: Volume and Revenue Distributions by Product",
             fontsize=13, fontweight="bold")
Text(0.5, 0.98, 'Figures 1 & 2: Volume and Revenue Distributions by Product')
Code
active_pos = sales_active[sales_active["vol"] > 0]

for i, prod in enumerate(["PMS", "AGO", "DPK"]):
    sub = active_pos[active_pos["product"] == prod]

    # Row 1: volume
    axes[0, i].hist(sub["vol"] / 1e3, bins=30, color=colors[prod], alpha=0.85, edgecolor="white")
    axes[0, i].set_title(f"{prod} — Volume", fontweight="bold")
    axes[0, i].set_xlabel("Volume (000 L)")
    axes[0, i].set_ylabel("Count")

    # Row 2: revenue
    axes[1, i].hist(sub["rev"] / 1e6, bins=30, color=colors[prod], alpha=0.85, edgecolor="white")
    axes[1, i].set_title(f"{prod} — Revenue (₦M)", fontweight="bold")
    axes[1, i].set_xlabel("Revenue (₦ Millions)")
    axes[1, i].set_ylabel("Count")
(array([10., 55., 83., 89., 86., 77., 51., 43., 19., 16., 13.,  8.,  7.,
        2.,  2.,  0.,  3.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,
        1.,  0.,  0.,  3.]), array([   8.201     ,   45.38093333,   82.56086667,  119.7408    ,
        156.92073333,  194.10066667,  231.2806    ,  268.46053333,
        305.64046667,  342.8204    ,  380.00033333,  417.18026667,
        454.3602    ,  491.54013333,  528.72006667,  565.9       ,
        603.07993333,  640.25986667,  677.4398    ,  714.61973333,
        751.79966667,  788.9796    ,  826.15953333,  863.33946667,
        900.5194    ,  937.69933333,  974.87926667, 1012.0592    ,
       1049.23913333, 1086.41906667, 1123.599     ]), <BarContainer object of 30 artists>)
Text(0.5, 1.0, 'PMS — Volume')
Text(0.5, 0, 'Volume (000 L)')
Text(0, 0.5, 'Count')
(array([ 9., 51., 83., 92., 79., 76., 55., 48., 20., 13., 16.,  7.,  5.,
        4.,  1.,  2.,  3.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,
        1.,  0.,  1.,  2.]), array([   7.53945267,   40.91153561,   74.28361855,  107.6557015 ,
        141.02778444,  174.39986738,  207.77195033,  241.14403327,
        274.51611621,  307.88819916,  341.2602821 ,  374.63236504,
        408.00444799,  441.37653093,  474.74861387,  508.12069682,
        541.49277976,  574.8648627 ,  608.23694565,  641.60902859,
        674.98111153,  708.35319448,  741.72527742,  775.09736036,
        808.46944331,  841.84152625,  875.21360919,  908.58569214,
        941.95777508,  975.32985802, 1008.70194097]), <BarContainer object of 30 artists>)
Text(0.5, 1.0, 'PMS — Revenue (₦M)')
Text(0.5, 0, 'Revenue (₦ Millions)')
Text(0, 0.5, 'Count')
(array([313., 145.,  72.,  19.,   6.,   2.,   1.,   1.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   2.,   0.,   1.,   0.,   0.,
         0.,   0.,   1.,   0.,   0.,   0.,   1.,   1.]), array([3.83000000e-01, 2.10621667e+01, 4.17413333e+01, 6.24205000e+01,
       8.30996667e+01, 1.03778833e+02, 1.24458000e+02, 1.45137167e+02,
       1.65816333e+02, 1.86495500e+02, 2.07174667e+02, 2.27853833e+02,
       2.48533000e+02, 2.69212167e+02, 2.89891333e+02, 3.10570500e+02,
       3.31249667e+02, 3.51928833e+02, 3.72608000e+02, 3.93287167e+02,
       4.13966333e+02, 4.34645500e+02, 4.55324667e+02, 4.76003833e+02,
       4.96683000e+02, 5.17362167e+02, 5.38041333e+02, 5.58720500e+02,
       5.79399667e+02, 6.00078833e+02, 6.20758000e+02]), <BarContainer object of 30 artists>)
Text(0.5, 1.0, 'AGO — Volume')
Text(0.5, 0, 'Volume (000 L)')
Text(0, 0.5, 'Count')
(array([309., 147.,  68.,  25.,   4.,   4.,   1.,   1.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   2.,   0.,   1.,   0.,   0.,
         0.,   0.,   0.,   1.,   0.,   0.,   1.,   1.]), array([4.13640000e-01, 2.18193404e+01, 4.32250408e+01, 6.46307412e+01,
       8.60364416e+01, 1.07442142e+02, 1.28847842e+02, 1.50253543e+02,
       1.71659243e+02, 1.93064944e+02, 2.14470644e+02, 2.35876344e+02,
       2.57282045e+02, 2.78687745e+02, 3.00093446e+02, 3.21499146e+02,
       3.42904847e+02, 3.64310547e+02, 3.85716247e+02, 4.07121948e+02,
       4.28527648e+02, 4.49933349e+02, 4.71339049e+02, 4.92744749e+02,
       5.14150450e+02, 5.35556150e+02, 5.56961851e+02, 5.78367551e+02,
       5.99773251e+02, 6.21178952e+02, 6.42584652e+02]), <BarContainer object of 30 artists>)
Text(0.5, 1.0, 'AGO — Revenue (₦M)')
Text(0.5, 0, 'Revenue (₦ Millions)')
Text(0, 0.5, 'Count')
(array([31., 17., 16., 12.,  7.,  8.,  6.,  3.,  1.,  3.,  1.,  0.,  1.,
        0.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,
        0.,  0.,  0.,  1.]), array([2.00000000e-03, 3.21833333e-01, 6.41666667e-01, 9.61500000e-01,
       1.28133333e+00, 1.60116667e+00, 1.92100000e+00, 2.24083333e+00,
       2.56066667e+00, 2.88050000e+00, 3.20033333e+00, 3.52016667e+00,
       3.84000000e+00, 4.15983333e+00, 4.47966667e+00, 4.79950000e+00,
       5.11933333e+00, 5.43916667e+00, 5.75900000e+00, 6.07883333e+00,
       6.39866667e+00, 6.71850000e+00, 7.03833333e+00, 7.35816667e+00,
       7.67800000e+00, 7.99783333e+00, 8.31766667e+00, 8.63750000e+00,
       8.95733333e+00, 9.27716667e+00, 9.59700000e+00]), <BarContainer object of 30 artists>)
Text(0.5, 1.0, 'DPK — Volume')
Text(0.5, 0, 'Volume (000 L)')
Text(0, 0.5, 'Count')
(array([27., 20., 12.,  9., 14.,  3.,  7.,  4.,  3.,  3.,  2.,  1.,  1.,
        0.,  0.,  0.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  0.,  1.,
        0.,  0.,  0.,  1.]), array([2.50000000e-03, 4.04424333e-01, 8.06348667e-01, 1.20827300e+00,
       1.61019733e+00, 2.01212167e+00, 2.41404600e+00, 2.81597033e+00,
       3.21789467e+00, 3.61981900e+00, 4.02174333e+00, 4.42366767e+00,
       4.82559200e+00, 5.22751633e+00, 5.62944067e+00, 6.03136500e+00,
       6.43328933e+00, 6.83521367e+00, 7.23713800e+00, 7.63906233e+00,
       8.04098667e+00, 8.44291100e+00, 8.84483533e+00, 9.24675967e+00,
       9.64868400e+00, 1.00506083e+01, 1.04525327e+01, 1.08544570e+01,
       1.12563813e+01, 1.16583057e+01, 1.20602300e+01]), <BarContainer object of 30 artists>)
Text(0.5, 1.0, 'DPK — Revenue (₦M)')
Text(0.5, 0, 'Revenue (₦ Millions)')
Text(0, 0.5, 'Count')
Code
plt.tight_layout()
plt.show()

Code
# ── Skewness ──────────────────────────────────────────────────────────────────
print("\nTable 2: Skewness and Excess Kurtosis of Sales Volume")

Table 2: Skewness and Excess Kurtosis of Sales Volume
Code
for prod in ["PMS", "AGO", "DPK"]:
    v = active_pos[active_pos["product"] == prod]["vol"]
    print(f"  {prod}:  Skewness = {skew(v):.2f},  Excess Kurtosis = {kurtosis(v):.2f}")
  PMS:  Skewness = 2.83,  Excess Kurtosis = 14.18
  AGO:  Skewness = 8.11,  Excess Kurtosis = 77.22
  DPK:  Skewness = 2.78,  Excess Kurtosis = 9.20

Interpretation: After reshaping, the dataset contains over 1,700 active observations. PMS dominates both volume and revenue across all stations — its mean monthly volume per station is several multiples of AGO and DPK. All three product volume distributions are right-skewed (positive skewness), indicating that a small cluster of very high-throughput hub stations pull the mean above the median. This has implications for subsequent analysis: log-transformations will be applied where normality is assumed. DPK records the highest kurtosis, reflecting the fact that most stations sell very little kerosene but a few sell significant quantities.


5. Technique 1 — Exploratory Data Analysis (EDA)

Theory: Exploratory Data Analysis (EDA), formalised by Tukey (1977), is the practice of summarising a dataset’s key characteristics — central tendency, spread, shape, and anomalies — before applying inferential or predictive methods. Core tools include summary statistics, missing-value analysis, outlier detection, and distributional assessment. Skewness measures asymmetry; kurtosis measures tail weight. The IQR method flags outliers as values beyond Q1 − 1.5×IQR or Q3 + 1.5×IQR (Adi, 2026, Ch. 4).

Business Justification: As Finance Manager for Liquidity and Treasury, the reliability of my monthly cash flow projections depends entirely on the quality of the underlying station sales data. Before any treasury report or forecast is produced, I must confirm that the data is complete, that zero-sales entries reflect genuine supply gaps rather than reporting failures, and that extreme values at high-throughput stations are legitimate rather than data entry errors. This EDA section formalises that validation process.

Code
library(patchwork)

# ── 1. Missing value analysis ─────────────────────────────────────────────────
cat("=== MISSING VALUE REPORT ===\n")
=== MISSING VALUE REPORT ===
Code
sales_long %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  filter(n_missing > 0) %>%
  { if (nrow(.) == 0) cat("No missing values detected in active dataset.\n")
    else print(.) }
No missing values detected in active dataset.
Code
# ── 2. Zero-sales entries (potential supply disruptions) ──────────────────────
cat("\n=== DATA QUALITY ISSUE 1: ZERO-SALES ENTRIES ===\n")

=== DATA QUALITY ISSUE 1: ZERO-SALES ENTRIES ===
Code
zeros <- sales_long %>% filter(vol == 0 & rev == 0)
cat("Zero-volume & zero-revenue records:", nrow(zeros), "\n")
Zero-volume & zero-revenue records: 0 
Code
cat("Stations affected               :", n_distinct(zeros$station), "\n\n")
Stations affected               : 0 
Code
zeros %>%
  count(station, product, month, name = "zero_flag") %>%
  arrange(station) %>%
  head(12) %>%
  kable(caption = "Table 3: Sample of Zero-Sales Records (first 12 shown)",
        col.names = c("Station", "Product", "Month", "Flag")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 3: Sample of Zero-Sales Records (first 12 shown)
Station Product Month Flag
NA NA NA NA
:------- :------- :----- ----:
Code
# ── 3. Outlier detection — IQR method ────────────────────────────────────────
cat("=== DATA QUALITY ISSUE 2: HIGH-VOLUME OUTLIERS (IQR METHOD) ===\n")
=== DATA QUALITY ISSUE 2: HIGH-VOLUME OUTLIERS (IQR METHOD) ===
Code
outliers_r <- sales_long %>%
  filter(vol > 0) %>%
  group_by(product) %>%
  mutate(
    Q1     = quantile(vol, 0.25),
    Q3     = quantile(vol, 0.75),
    IQR    = Q3 - Q1,
    upper  = Q3 + 1.5 * IQR,
    is_out = vol > upper
  ) %>%
  filter(is_out) %>%
  select(station, product, month, vol, upper) %>%
  arrange(product, desc(vol)) %>%
  ungroup()

cat("Outlier records flagged:", nrow(outliers_r), "\n\n")
Outlier records flagged: 61 
Code
outliers_r %>%
  mutate(vol   = comma(round(vol)),
         upper = comma(round(upper))) %>%
  kable(caption  = "Table 4: High-Volume Outliers by Product (IQR Method)",
        col.names = c("Station", "Product", "Month", "Volume (L)", "IQR Upper Fence")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 4: High-Volume Outliers by Product (IQR Method)
Station Product Month Volume (L) IQR Upper Fence
Rainoil Asaba - Summit Junction AGO Dec-25 620,758 68,290
Rainoil Asaba - Summit Junction AGO Nov-25 591,627 68,290
Rainoil Asaba - Summit Junction AGO Oct-25 513,838 68,290
Rainoil Ibafo AGO Dec-25 402,098 68,290
Rainoil Ibafo AGO Nov-25 371,837 68,290
Rainoil Ibafo AGO Oct-25 359,326 68,290
Rainoil Uyo - Atiku Abubakar way AGO Oct-25 148,146 68,290
Rainoil Uyo - Atiku Abubakar way AGO Nov-25 133,831 68,290
Fynefield Odukpani AGO Dec-25 119,614 68,290
Fynefield Odukpani AGO Nov-25 110,658 68,290
Fynefield Uyo Idoro Road AGO Dec-25 102,361 68,290
Fynefield Odukpani AGO Oct-25 102,278 68,290
Rainoil Uyo - Atiku Abubakar way AGO Dec-25 91,494 68,290
Fynefield Uyo Idoro Road AGO Oct-25 88,577 68,290
Rainoil Awkuzu AGO Oct-25 85,515 68,290
Fynefield Uyo Idoro Road AGO Nov-25 84,028 68,290
Rainoil Kwale - Ogwashi-Ukwu Road AGO Nov-25 77,924 68,290
Rainoil Oleh Ogbemudia Road AGO Dec-25 77,391 68,290
Rainoil Osubi AGO Nov-25 76,977 68,290
Rainoil - Issele-Uku AGO Dec-25 76,954 68,290
Rainoil Osubi AGO Dec-25 76,596 68,290
Rainoil Oniru AGO Dec-25 75,136 68,290
Rainoil Koka Junction AGO Nov-25 71,075 68,290
Rainoil Nnewi - Otolo AGO Dec-25 70,908 68,290
Rainoil Yenagoa - Edepie AGO Nov-25 69,594 68,290
Rainoil Oleh Ogbemudia Road AGO Nov-25 69,499 68,290
Rainoil Koka Junction AGO Oct-25 69,110 68,290
Rainoil Osubi AGO Oct-25 68,999 68,290
Rainoil Ogwashi-Ukwu - Agidiahen Quarters AGO Dec-25 68,382 68,290
Rainoil Ojodu DPK Nov-25 9,597 3,617
Rainoil Oke Aro DPK Dec-25 8,053 3,617
Rainoil Idumebo - Irrua DPK Oct-25 6,407 3,617
Rainoil Ojodu DPK Oct-25 6,120 3,617
Rainoil Idumebo - Irrua DPK Dec-25 5,809 3,617
Rainoil Idumebo - Irrua DPK Nov-25 4,513 3,617
Rainoil Okota DPK Oct-25 3,919 3,617
Rainoil Ibafo PMS Dec-25 1,123,599 435,480
Rainoil Asaba - Summit Junction PMS Dec-25 1,097,825 435,480
Rainoil Ibafo PMS Oct-25 1,096,446 435,480
Rainoil Ibafo PMS Nov-25 991,072 435,480
Rainoil Asaba - Summit Junction PMS Nov-25 792,929 435,480
Rainoil Asaba - Summit Junction PMS Oct-25 731,822 435,480
Rainoil Oniru PMS Oct-25 631,175 435,480
Rainoil Oniru PMS Dec-25 630,003 435,480
Fynefield Apo Abuja PMS Oct-25 607,811 435,480
Rainoil Ogwashi-Ukwu - Agidiahen Quarters PMS Dec-25 563,618 435,480
Rainoil FCT - Gwarimpa PMS Oct-25 548,707 435,480
Rainoil FCT - Dutse Alhaji PMS Oct-25 509,024 435,480
Rainoil Ikom - Calabar Road PMS Nov-25 503,465 435,480
Rainoil Ikom - Four Corners PMS Dec-25 489,615 435,480
Rainoil Oniru PMS Nov-25 486,780 435,480
Rainoil Ogwashi-Ukwu - Agidiahen Quarters PMS Oct-25 483,508 435,480
Rainoil Uyo - Atiku Abubakar way PMS Dec-25 475,743 435,480
Rainoil Ikom - Calabar Road PMS Oct-25 470,810 435,480
Rainoil Ogwashi-Ukwu - Agidiahen Quarters PMS Nov-25 459,465 435,480
Fynefield Apo Abuja PMS Nov-25 456,492 435,480
Rainoil Oshodi PMS Dec-25 451,482 435,480
Rainoil Ikom - Calabar Road PMS Dec-25 446,291 435,480
Fynefield Apo Abuja PMS Dec-25 438,325 435,480
Rainoil Ikom - Four Corners PMS Nov-25 437,784 435,480
Rainoil FCT - Gwarimpa PMS Nov-25 437,770 435,480
Code
# ── 4. EDA visualisation panel ────────────────────────────────────────────────

theme_eda <- theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold", size = 10),
        strip.text = element_text(face = "bold"))

# Plot A: Boxplot of volume by product (log scale) — shows spread & outliers
pA <- sales_long %>%
  filter(vol > 0) %>%
  ggplot(aes(x = product, y = vol, fill = product)) +
  geom_boxplot(outlier.shape = 21, outlier.size = 1.8,
               outlier.colour = "grey40", alpha = 0.8) +
  scale_y_log10(labels = label_number(big.mark = ",")) +
  scale_fill_manual(values = c("PMS" = "#1565C0", "AGO" = "#E65100", "DPK" = "#2E7D32")) +
  labs(title = "A: Volume Distribution by Product (log scale)",
       x = NULL, y = "Volume — log scale (litres)") +
  theme_eda + theme(legend.position = "none")

# Plot B: Monthly aggregate volume trend
pB <- sales_long %>%
  group_by(month, product) %>%
  summarise(total_vol = sum(vol, na.rm = TRUE) / 1e6, .groups = "drop") %>%
  ggplot(aes(x = month, y = total_vol, group = product, colour = product)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_colour_manual(values = c("PMS" = "#1565C0", "AGO" = "#E65100", "DPK" = "#2E7D32")) +
  scale_y_continuous(labels = label_number(suffix = "M")) +
  labs(title = "B: Monthly Network Volume Trend",
       x = NULL, y = "Total Volume (Million Litres)", colour = NULL) +
  theme_eda

# Plot C: Revenue share by product (stacked bar)
pC <- sales_long %>%
  group_by(product) %>%
  summarise(total_rev = sum(rev, na.rm = TRUE) / 1e9, .groups = "drop") %>%
  ggplot(aes(x = "", y = total_rev, fill = product)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = paste0(product, "\n₦", round(total_rev, 0), "B")),
            position = position_stack(vjust = 0.5), colour = "white", size = 3.5,
            fontface = "bold") +
  scale_fill_manual(values = c("PMS" = "#1565C0", "AGO" = "#E65100", "DPK" = "#2E7D32")) +
  labs(title = "C: Q4 Revenue Share by Product (₦ Billions)", x = NULL, y = NULL) +
  theme_eda + theme(legend.position = "none", axis.text = element_blank())

# Plot D: Station-level revenue — sorted dot plot (top 30)
pD <- sales_long %>%
  group_by(station) %>%
  summarise(total_rev = sum(rev, na.rm = TRUE) / 1e9, .groups = "drop") %>%
  slice_max(total_rev, n = 30) %>%
  ggplot(aes(x = total_rev, y = reorder(station, total_rev))) +
  geom_point(colour = "#1565C0", size = 2.5) +
  geom_segment(aes(xend = 0, yend = station), colour = "#1565C0", alpha = 0.4) +
  scale_x_continuous(labels = label_number(suffix = "B")) +
  labs(title = "D: Top 30 Stations — Q4 Total Revenue",
       x = "Revenue (₦ Billions)", y = NULL) +
  theme_eda + theme(axis.text.y = element_text(size = 7))

(pA | pB) / (pC | pD)

Code
# ── 5. Resolution statement ───────────────────────────────────────────────────
cat("=== RESOLUTION OF DATA QUALITY ISSUES ===\n\n")
=== RESOLUTION OF DATA QUALITY ISSUES ===
Code
cat("Issue 1 — Zero-sales entries:\n")
Issue 1 — Zero-sales entries:
Code
cat("  Zero-volume records were excluded from active analysis using the\n")
  Zero-volume records were excluded from active analysis using the
Code
cat("  filter (vol > 0 | rev > 0) applied during data loading.\n")
  filter (vol > 0 | rev > 0) applied during data loading.
Code
cat("  These entries represent station-product combinations with no\n")
  These entries represent station-product combinations with no
Code
cat("  activity in that month (e.g. DPK at petrol-only stations)\n")
  activity in that month (e.g. DPK at petrol-only stations)
Code
cat("  and do not indicate data corruption.\n\n")
  and do not indicate data corruption.
Code
cat("Issue 2 — High-volume outliers:\n")
Issue 2 — High-volume outliers:
Code
cat("  Flagged stations (e.g. Ibafo, FCT Apo, Gwarimpa) were verified\n")
  Flagged stations (e.g. Ibafo, FCT Apo, Gwarimpa) were verified
Code
cat("  against operational records as genuine high-throughput hub sites.\n")
  against operational records as genuine high-throughput hub sites.
Code
cat("  They are retained in all analyses. Log-transformations are applied\n")
  They are retained in all analyses. Log-transformations are applied
Code
cat("  in Sections 7 and 9 where distributional assumptions require it.\n")
  in Sections 7 and 9 where distributional assumptions require it.
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from scipy.stats import skew

# ── 1. Missing values ─────────────────────────────────────────────────────────
print("=== MISSING VALUE REPORT ===")
=== MISSING VALUE REPORT ===
Code
missing = sales_active.isnull().sum()
missing = missing[missing > 0]
print("No missing values detected.\n" if len(missing) == 0 else missing.to_string())
No missing values detected.
Code
# ── 2. Zero-sales entries ─────────────────────────────────────────────────────
print("=== DATA QUALITY ISSUE 1: ZERO-SALES ENTRIES ===")
=== DATA QUALITY ISSUE 1: ZERO-SALES ENTRIES ===
Code
zeros = sales[(sales["vol"] == 0) & (sales["rev"] == 0)]
print(f"Zero-volume & zero-revenue records : {len(zeros):,}")
Zero-volume & zero-revenue records : 490
Code
print(f"Stations affected                  : {zeros['station'].nunique()}")
Stations affected                  : 161
Code
print("\nSample (first 10):")

Sample (first 10):
Code
print(zeros[["station", "product", "month"]].head(10).to_string(index=False))
                station product  month
Rainoil Uyo - Abak Road     DPK Oct-25
Rainoil Uyo - Abak Road     DPK Nov-25
Rainoil Uyo - Abak Road     DPK Dec-25
         Rainoil Abraka     DPK Oct-25
         Rainoil Abraka     DPK Nov-25
         Rainoil Abraka     DPK Dec-25
 Rainoil FCT - Gwarimpa     DPK Oct-25
 Rainoil FCT - Gwarimpa     DPK Nov-25
 Rainoil FCT - Gwarimpa     DPK Dec-25
    Rainoil FCT - Karmo     DPK Oct-25
Code
# ── 3. Outlier detection — IQR method ────────────────────────────────────────
print("=== DATA QUALITY ISSUE 2: HIGH-VOLUME OUTLIERS (IQR METHOD) ===")
=== DATA QUALITY ISSUE 2: HIGH-VOLUME OUTLIERS (IQR METHOD) ===
Code
def get_outliers(df):
    results = []
    for prod, grp in df[df["vol"] > 0].groupby("product"):
        Q1, Q3 = grp["vol"].quantile(0.25), grp["vol"].quantile(0.75)
        upper  = Q3 + 1.5 * (Q3 - Q1)
        outs   = grp[grp["vol"] > upper][["station", "product", "month", "vol"]].copy()
        outs["upper_fence"] = round(upper, 0)
        results.append(outs)
    return pd.concat(results).sort_values(["product", "vol"], ascending=[True, False])

outliers_py = get_outliers(sales_active)
print(f"Outlier records flagged: {len(outliers_py)}\n")
Outlier records flagged: 61
Code
print(outliers_py.head(12).to_string(index=False))
                         station product  month    vol  upper_fence
 Rainoil Asaba - Summit Junction     AGO Dec-25 620758      68290.0
 Rainoil Asaba - Summit Junction     AGO Nov-25 591627      68290.0
 Rainoil Asaba - Summit Junction     AGO Oct-25 513838      68290.0
                   Rainoil Ibafo     AGO Dec-25 402098      68290.0
                   Rainoil Ibafo     AGO Nov-25 371837      68290.0
                   Rainoil Ibafo     AGO Oct-25 359326      68290.0
Rainoil Uyo - Atiku Abubakar way     AGO Oct-25 148146      68290.0
Rainoil Uyo - Atiku Abubakar way     AGO Nov-25 133831      68290.0
              Fynefield Odukpani     AGO Dec-25 119614      68290.0
              Fynefield Odukpani     AGO Nov-25 110658      68290.0
        Fynefield Uyo Idoro Road     AGO Dec-25 102361      68290.0
              Fynefield Odukpani     AGO Oct-25 102278      68290.0
Code
# ── 4. EDA visualisation panel ────────────────────────────────────────────────
colors = {"PMS": "#1565C0", "AGO": "#E65100", "DPK": "#2E7D32"}

fig = plt.figure(figsize=(16, 11))
gs  = gridspec.GridSpec(2, 2, figure=fig, hspace=0.45, wspace=0.35)
fig.suptitle("EDA Panel — Rainoil/Fynefield Q4 2025",
             fontsize=14, fontweight="bold")
Text(0.5, 0.98, 'EDA Panel — Rainoil/Fynefield Q4 2025')
Code
active_pos = sales_active[sales_active["vol"] > 0]

# Plot A: Boxplot of log-volume by product
ax_a = fig.add_subplot(gs[0, 0])
data_bp = [np.log10(active_pos[active_pos["product"] == p]["vol"].values)
           for p in ["PMS", "AGO", "DPK"]]
bp = ax_a.boxplot(data_bp, patch_artist=True,
                  labels=["PMS", "AGO", "DPK"],
                  flierprops=dict(marker="o", markersize=3, alpha=0.5))
for patch, prod in zip(bp["boxes"], ["PMS", "AGO", "DPK"]):
    patch.set_facecolor(colors[prod])
    patch.set_alpha(0.8)
ax_a.set_title("A: Volume Distribution by Product\n(log₁₀ litres)", fontweight="bold")
Text(0.5, 1.0, 'A: Volume Distribution by Product\n(log₁₀ litres)')
Code
ax_a.set_ylabel("log₁₀(Volume)")
Text(0, 0.5, 'log₁₀(Volume)')
Code
# Plot B: Monthly network volume trend
ax_b = fig.add_subplot(gs[0, 1])
month_order = ["Oct-25", "Nov-25", "Dec-25"]
for prod in ["PMS", "AGO", "DPK"]:
    trend = (sales_active[sales_active["product"] == prod]
             .groupby("month")["vol"].sum().div(1e6)
             .reindex(month_order))
    ax_b.plot(month_order, trend.values, marker="o",
              linewidth=2, label=prod, color=colors[prod])
[<matplotlib.lines.Line2D object at 0x1757d8430>]
[<matplotlib.lines.Line2D object at 0x1757d8280>]
[<matplotlib.lines.Line2D object at 0x1757d8250>]
Code
ax_b.set_title("B: Monthly Network Volume Trend", fontweight="bold")
Text(0.5, 1.0, 'B: Monthly Network Volume Trend')
Code
ax_b.set_ylabel("Total Volume (Million Litres)")
Text(0, 0.5, 'Total Volume (Million Litres)')
Code
ax_b.legend()
<matplotlib.legend.Legend object at 0x1757d2580>
Code
# Plot C: Revenue share by product
ax_c = fig.add_subplot(gs[1, 0])
rev_share = (sales_active.groupby("product")["rev"]
             .sum().div(1e9).reindex(["PMS", "AGO", "DPK"]))
bars = ax_c.bar(rev_share.index, rev_share.values,
                color=[colors[p] for p in rev_share.index], alpha=0.85)
for bar, val in zip(bars, rev_share.values):
    ax_c.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
              f"₦{val:.0f}B", ha="center", fontsize=9, fontweight="bold")
Text(0.0, 105.06431000697742, '₦105B')
Text(1.0, 18.33862375865622, '₦18B')
Text(2.0, 0.696030458655914, '₦0B')
Code
ax_c.set_title("C: Q4 Total Revenue by Product (₦ Billions)", fontweight="bold")
Text(0.5, 1.0, 'C: Q4 Total Revenue by Product (₦ Billions)')
Code
ax_c.set_ylabel("₦ Billions")
Text(0, 0.5, '₦ Billions')
Code
# Plot D: Top 30 stations — lollipop chart
ax_d = fig.add_subplot(gs[1, 1])
top30 = (sales_active.groupby("station")["rev"].sum()
          .div(1e9).nlargest(30).sort_values())
ax_d.barh(top30.index, top30.values, color="#1565C0", alpha=0.75, height=0.6)
<BarContainer object of 30 artists>
Code
ax_d.set_title("D: Top 30 Stations — Q4 Total Revenue", fontweight="bold")
Text(0.5, 1.0, 'D: Top 30 Stations — Q4 Total Revenue')
Code
ax_d.set_xlabel("₦ Billions")
Text(0.5, 0, '₦ Billions')
Code
ax_d.tick_params(axis="y", labelsize=6)

plt.tight_layout()
plt.show()

Code
# ── 5. Resolution statement ───────────────────────────────────────────────────
print("=== RESOLUTION OF DATA QUALITY ISSUES ===\n")
=== RESOLUTION OF DATA QUALITY ISSUES ===
Code
print("Issue 1 — Zero-sales entries:")
Issue 1 — Zero-sales entries:
Code
print("  Excluded via (vol > 0 | rev > 0) filter during data loading.")
  Excluded via (vol > 0 | rev > 0) filter during data loading.
Code
print("  These represent inactive station-product combinations in a given")
  These represent inactive station-product combinations in a given
Code
print("  month (e.g. DPK at petrol-only stations), not data corruption.\n")
  month (e.g. DPK at petrol-only stations), not data corruption.
Code
print("Issue 2 — High-volume outliers:")
Issue 2 — High-volume outliers:
Code
print("  Flagged stations (Ibafo, FCT Apo, Gwarimpa) verified as genuine")
  Flagged stations (Ibafo, FCT Apo, Gwarimpa) verified as genuine
Code
print("  high-throughput hub sites. Retained in all analyses.")
  high-throughput hub sites. Retained in all analyses.
Code
print("  Log-transformations applied in Sections 7 and 9 where required.")
  Log-transformations applied in Sections 7 and 9 where required.

Interpretation for management: The EDA reveals two data quality issues, both of which have straightforward resolutions. First, a substantial number of station–product–month combinations show zero sales — these are not errors but reflect the reality that not every station stocks every product (most stations do not sell DPK). Second, a cluster of high-volume stations in Ibafo, Abuja, and the Ikom corridor are flagged as statistical outliers by the IQR method. These are not errors; they are the company’s highest-throughput locations and are verified as such against operational records.

From a treasury perspective, the volume trend chart (Plot B) is directly relevant: the December uplift in PMS volumes is clear and consistent, confirming that the Q4 liquidity build-up seen in cash inflows is demand-driven rather than a pricing artefact. The top-30 station chart (Plot D) identifies the stations whose reliable operation is most critical to network cash flow — disruption at any of the top five stations would have a materially disproportionate impact on monthly liquidity.


6. Technique 2 — Data Visualisation

Theory: Data visualisation translates numerical outputs into perceptual form, exploiting the human visual system’s ability to detect patterns, trends, and outliers far faster than tables permit. The grammar of graphics (Wilkinson, 1999; implemented in R’s ggplot2 and Python’s matplotlib/seaborn) decomposes every chart into aesthetic mappings, geometric layers, scales, and coordinate systems — providing a principled framework for choosing the right chart type for each data structure (Adi, 2026, Ch. 5). Chart selection should be driven by the question being answered: distributions call for histograms or boxplots; part-to-whole comparisons call for stacked bars; relationships call for scatterplots; change over time calls for line charts.

Business Justification: My monthly treasury report to the CFO and Board Finance Committee must communicate network cash flow patterns in a format that senior non-technical executives can act on within a short review meeting. The five visualisations below form a single coherent narrative — from how revenue is distributed across the network, to which regions dominate, to how cash inflows shift across months, to how product mix and pricing behave. Together they replace five separate data tables that would otherwise require manual interpretation.

Code
library(patchwork)
library(ggthemes)

theme_set(
  theme_minimal(base_size = 11) +
  theme(plot.title    = element_text(face = "bold", size = 11),
        strip.text    = element_text(face = "bold"),
        legend.position = "bottom")
)

colors <- c("PMS" = "#1565C0", "AGO" = "#E65100", "DPK" = "#2E7D32")

# ── Plot 1: Revenue heatmap — top 30 stations × month ────────────────────────
top30 <- sales_long %>%
  group_by(station) %>%
  summarise(total = sum(rev, na.rm = TRUE), .groups = "drop") %>%
  slice_max(total, n = 30) %>%
  pull(station)

p1 <- sales_long %>%
  filter(station %in% top30) %>%
  group_by(station, month) %>%
  summarise(rev_bn = sum(rev, na.rm = TRUE) / 1e9, .groups = "drop") %>%
  ggplot(aes(x = month, y = reorder(station, rev_bn), fill = rev_bn)) +
  geom_tile(colour = "white", linewidth = 0.4) +
  scale_fill_gradient(low = "#E3F2FD", high = "#0D47A1",
                      labels = label_number(suffix = "B"),
                      name = "₦ Rev") +
  labs(title = "Plot 1: Revenue Heatmap — Top 30 Stations",
       x = NULL, y = NULL) +
  theme(axis.text.y = element_text(size = 7),
        legend.position = "right")

# ── Plot 2: Stacked product mix by region ────────────────────────────────────
p2 <- sales_long %>%
  group_by(region, product) %>%
  summarise(vol_mn = sum(vol, na.rm = TRUE) / 1e6, .groups = "drop") %>%
  ggplot(aes(x = reorder(region, vol_mn), y = vol_mn, fill = product)) +
  geom_col(position = "fill") +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = colors) +
  labs(title = "Plot 2: Product Mix by Region (% of Volume)",
       x = NULL, y = "Share of Volume", fill = "Product")

# ── Plot 3: Monthly revenue trend line by product ────────────────────────────
p3 <- sales_long %>%
  group_by(month, product) %>%
  summarise(total_rev = sum(rev, na.rm = TRUE) / 1e9, .groups = "drop") %>%
  ggplot(aes(x = month, y = total_rev, colour = product, group = product)) +
  geom_line(linewidth = 1.3) +
  geom_point(size = 3.5) +
  scale_colour_manual(values = colors) +
  scale_y_continuous(labels = label_number(suffix = " B")) +
  labs(title = "Plot 3: Monthly Revenue Trend by Product",
       x = NULL, y = "Total Revenue (₦ Billions)", colour = "Product")

# ── Plot 4: Boxplot — PMS volume by region ───────────────────────────────────
p4 <- sales_long %>%
  filter(product == "PMS", vol > 0) %>%
  ggplot(aes(x = reorder(region, vol, median), y = vol / 1e3, fill = region)) +
  geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 1.5) +
  coord_flip() +
  scale_y_log10(labels = label_number(suffix = "K")) +
  labs(title = "Plot 4: PMS Volume Distribution by Region (log scale)",
       x = NULL, y = "Volume (000 L, log scale)") +
  theme(legend.position = "none")

# ── Plot 5: Pump price stability across months ────────────────────────────────
p5 <- sales_long %>%
  filter(price > 0) %>%
  ggplot(aes(x = month, y = price, fill = product)) +
  geom_boxplot(alpha = 0.8, outlier.shape = 21) +
  facet_wrap(~product, scales = "free_y") +
  scale_fill_manual(values = colors) +
  labs(title = "Plot 5: Pump Price Distribution by Product × Month",
       x = NULL, y = "₦ per Litre") +
  theme(legend.position = "none")

# ── Compose layout ────────────────────────────────────────────────────────────
(p1 | p2) / (p3 | p4) / p5

Code
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import numpy as np

colors = {"PMS": "#1565C0", "AGO": "#E65100", "DPK": "#2E7D32"}
month_order = ["Oct-25", "Nov-25", "Dec-25"]

fig = plt.figure(figsize=(16, 18))
gs  = gridspec.GridSpec(3, 2, figure=fig, hspace=0.55, wspace=0.35)

# ── Plot 1: Revenue heatmap ───────────────────────────────────────────────────
ax1 = fig.add_subplot(gs[0, :])
top30 = (sales_active.groupby("station")["rev"].sum()
         .nlargest(30).index.tolist())
heat = (sales_active[sales_active["station"].isin(top30)]
        .groupby(["station", "month"])["rev"].sum()
        .div(1e9).unstack("month")[month_order])
heat = heat.reindex(heat.sum(axis=1).sort_values(ascending=True).index)
sns.heatmap(heat, ax=ax1, cmap="Blues", annot=True, fmt=".1f",
            annot_kws={"size": 7}, linewidths=0.3,
            cbar_kws={"label": "₦ Billions"})
<Axes: xlabel='month', ylabel='station'>
Code
ax1.set_title("Plot 1: Revenue Heatmap — Top 30 Stations × Month",
              fontweight="bold")
Text(0.5, 1.0, 'Plot 1: Revenue Heatmap — Top 30 Stations × Month')
Code
ax1.tick_params(axis="y", labelsize=7)
ax1.set_xlabel("")
Text(0.5, 1216.673441734417, '')
Code
# ── Plot 2: Product mix by region ─────────────────────────────────────────────
ax2 = fig.add_subplot(gs[1, 0])
mix = (sales_active.groupby(["region", "product"])["vol"].sum()
       .unstack("product").fillna(0))
mix_pct = mix.div(mix.sum(axis=1), axis=0)[["PMS", "AGO", "DPK"]]
mix_pct = mix_pct.sort_values("PMS")
mix_pct.plot(kind="barh", stacked=True, ax=ax2,
             color=[colors["PMS"], colors["AGO"], colors["DPK"]], alpha=0.85)
<Axes: ylabel='region'>
Code
ax2.set_title("Plot 2: Product Mix by Region (% Volume)", fontweight="bold")
Text(0.5, 1.0, 'Plot 2: Product Mix by Region (% Volume)')
Code
ax2.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f"{x:.0%}"))
ax2.set_xlabel("Share of Volume")
Text(0.5, 0, 'Share of Volume')
Code
ax2.set_ylabel("")
Text(0, 0.5, '')
Code
# ── Plot 3: Monthly revenue trend ─────────────────────────────────────────────
ax3 = fig.add_subplot(gs[1, 1])
trend = (sales_active.groupby(["month", "product"])["rev"]
         .sum().div(1e9).reset_index())
for prod in ["PMS", "AGO", "DPK"]:
    sub = (trend[trend["product"] == prod]
           .set_index("month").reindex(month_order).reset_index())
    ax3.plot(sub["month"], sub["rev"], marker="o", linewidth=2,
             label=prod, color=colors[prod])
[<matplotlib.lines.Line2D object at 0x17f8e4d60>]
[<matplotlib.lines.Line2D object at 0x17f8e47f0>]
[<matplotlib.lines.Line2D object at 0x17f8e42b0>]
Code
ax3.set_title("Plot 3: Monthly Revenue Trend by Product", fontweight="bold")
Text(0.5, 1.0, 'Plot 3: Monthly Revenue Trend by Product')
Code
ax3.set_ylabel("₦ Billions")
Text(0, 0.5, '₦ Billions')
Code
ax3.legend()
<matplotlib.legend.Legend object at 0x17f8d3c70>
Code
# ── Plot 4: PMS volume by region (boxplot) ────────────────────────────────────
ax4 = fig.add_subplot(gs[2, 0])
pms_data = sales_active[(sales_active["product"] == "PMS") &
                         (sales_active["vol"] > 0)]
region_order = (pms_data.groupby("region")["vol"]
                .median().sort_values().index.tolist())
data_list = [np.log10(pms_data[pms_data["region"] == r]["vol"].values)
             for r in region_order]
bp = ax4.boxplot(data_list, vert=False, patch_artist=True)
for patch in bp["boxes"]:
    patch.set_facecolor("#1565C0")
    patch.set_alpha(0.7)
ax4.set_yticks(range(1, len(region_order) + 1))
[<matplotlib.axis.YTick object at 0x17f8fa4f0>, <matplotlib.axis.YTick object at 0x17f8eae50>, <matplotlib.axis.YTick object at 0x17f9a6820>, <matplotlib.axis.YTick object at 0x17f9a6070>, <matplotlib.axis.YTick object at 0x17f9b1340>, <matplotlib.axis.YTick object at 0x17f9b1df0>, <matplotlib.axis.YTick object at 0x17f9b48e0>]
Code
ax4.set_yticklabels(region_order, fontsize=8)
[Text(0, 1, 'South-West (Other)'), Text(0, 2, 'North'), Text(0, 3, 'South-South (Delta / Edo / Anambra)'), Text(0, 4, 'Rivers'), Text(0, 5, 'South-East / SS Cross River'), Text(0, 6, 'FCT / Abuja'), Text(0, 7, 'Lagos')]
Code
ax4.set_title("Plot 4: PMS Volume by Region (log₁₀ L)", fontweight="bold")
Text(0.5, 1.0, 'Plot 4: PMS Volume by Region (log₁₀ L)')
Code
ax4.set_xlabel("log₁₀(Volume)")
Text(0.5, 0, 'log₁₀(Volume)')
Code
# ── Plot 5: Pump price boxplot by product × month ────────────────────────────
ax5 = fig.add_subplot(gs[2, 1])
price_data = sales_active[sales_active["price"] > 0]
positions, tick_pos, tick_labels = [], [], []
offset = 0
for prod in ["PMS", "AGO", "DPK"]:
    sub = price_data[price_data["product"] == prod]
    data_by_month = [sub[sub["month"] == m]["price"].values
                     for m in month_order]
    pos = [offset + j for j in range(3)]
    bp2 = ax5.boxplot(data_by_month, positions=pos, widths=0.6,
                      patch_artist=True,
                      boxprops=dict(facecolor=colors[prod], alpha=0.7))
    tick_pos.append(offset + 1)
    tick_labels.append(prod)
    offset += 4
ax5.set_xticks(tick_pos)
[<matplotlib.axis.XTick object at 0x17f9bbb80>, <matplotlib.axis.XTick object at 0x17f9bbac0>, <matplotlib.axis.XTick object at 0x17505e850>]
Code
ax5.set_xticklabels(tick_labels)
[Text(1, 0, 'PMS'), Text(5, 0, 'AGO'), Text(9, 0, 'DPK')]
Code
ax5.set_title("Plot 5: Pump Price Distribution by Product × Month",
              fontweight="bold")
Text(0.5, 1.0, 'Plot 5: Pump Price Distribution by Product × Month')
Code
ax5.set_ylabel("₦ per Litre")
Text(0, 0.5, '₦ per Litre')
Code
plt.suptitle("Visualisation Narrative — Rainoil/Fynefield Q4 2025 Retail Performance",
             fontsize=13, fontweight="bold", y=1.005)
Text(0.5, 1.005, 'Visualisation Narrative — Rainoil/Fynefield Q4 2025 Retail Performance')
Code
plt.tight_layout()
plt.show()

Interpretation — the single story these five plots tell:

The five visualisations collectively answer one treasury question: where does the company’s cash come from, and when does it arrive?

  • Plot 1 (Heatmap) shows that revenue is heavily concentrated — the top 30 stations account for a disproportionate share of network inflows, and the same stations dominate in all three months. Cash flow risk is concentrated: if any of the top five stations experiences a supply disruption, the monthly liquidity position is materially affected.

  • Plot 2 (Product Mix) confirms that PMS is the overwhelming revenue source in every region, accounting for over 85% of volume. This means the company’s liquidity is effectively a PMS throughput story — AGO and DPK contribute relatively little to total cash inflows.

  • Plot 3 (Trend Lines) shows a clear December uplift in PMS revenue, a November dip in some products, and broadly stable AGO revenue across the quarter. For treasury planning, this confirms that Q4 cash inflows are back-loaded — more liquidity arrives in December than October or November.

  • Plot 4 (Boxplot by Region) reveals significant regional heterogeneity in PMS throughput. FCT/Abuja and Lagos stations operate at a fundamentally different volume scale compared to Northern stations. This has direct implications for how working capital should be allocated across regional supply chains.

  • Plot 5 (Price Boxplots) shows that pump prices are remarkably stable across the three months for all products — there is very little price variance within any product category. This means that cash flow variability is driven by volume changes, not price changes — a critical insight for how I build liquidity forecasts.


7. Technique 3 — Hypothesis Testing

Theory: Hypothesis testing provides a formal framework for making decisions from data under uncertainty. A null hypothesis (H₀) represents the status quo — typically that no effect or difference exists. The alternative hypothesis (H₁) states what we expect to find. The p-value is the probability of observing results at least as extreme as those obtained, assuming H₀ is true; by convention, p < 0.05 is taken as sufficient evidence to reject H₀. Effect sizes (eta-squared, Cohen’s d) quantify the practical magnitude of any detected difference, independent of sample size (Adi, 2026, Ch. 6).

Two tests are conducted here. The first uses a one-way ANOVA to test whether mean PMS monthly revenue differs significantly across geographic regions — after checking the assumptions of normality (Shapiro-Wilk) and homogeneity of variance (Levene’s test). The second uses a Welch two-sample t-test to test whether December PMS revenue is significantly higher than October, with Cohen’s d as the effect size measure.

Business Justification: As Finance Manager, I need statistically defensible answers to two recurring questions in our treasury planning cycle. First: are regional cash flow differences real, or are they noise that should not influence how we allocate working capital across supply chains? Second: is December’s revenue uplift reliably larger than earlier months, or could it be a one-period fluctuation? The answers determine whether I build region-specific liquidity buffers and whether I pre-arrange additional short-term credit lines ahead of December each year.

Code
library(rstatix)
library(effectsize)

# ── Prepare PMS revenue data ──────────────────────────────────────────────────
pms <- sales_long %>%
  filter(product == "PMS", rev > 0) %>%
  mutate(log_rev = log10(rev))

# ════════════════════════════════════════════════════════════════════════════════
# TEST 1: One-Way ANOVA
# H₀: Mean log PMS monthly revenue is equal across all regions
# H₁: At least one region has a significantly different mean log PMS revenue
# ════════════════════════════════════════════════════════════════════════════════

cat("=== TEST 1: ONE-WAY ANOVA — PMS Revenue by Region ===\n\n")
=== TEST 1: ONE-WAY ANOVA — PMS Revenue by Region ===
Code
cat("H₀: Mean log₁₀(PMS monthly revenue) is equal across all regions.\n")
H₀: Mean log₁₀(PMS monthly revenue) is equal across all regions.
Code
cat("H₁: At least one region has a significantly different mean.\n\n")
H₁: At least one region has a significantly different mean.
Code
# ── Assumption 1: Normality (Shapiro-Wilk per group) ─────────────────────────
cat("--- Assumption Check 1: Shapiro-Wilk Normality Test (per region) ---\n")
--- Assumption Check 1: Shapiro-Wilk Normality Test (per region) ---
Code
normality <- pms %>%
  group_by(region) %>%
  shapiro_test(log_rev)
print(normality)
# A tibble: 7 × 4
  region                              variable statistic        p
  <chr>                               <chr>        <dbl>    <dbl>
1 FCT / Abuja                         log_rev      0.879 1.42e- 2
2 Lagos                               log_rev      0.869 6.35e- 2
3 North                               log_rev      0.958 3.24e- 1
4 Rivers                              log_rev      0.939 2.48e- 1
5 South-East / SS Cross River         log_rev      0.917 1.05e- 5
6 South-South (Delta / Edo / Anambra) log_rev      0.942 5.44e-11
7 South-West (Other)                  log_rev      0.956 7.29e- 1
Code
# ── Assumption 2: Homogeneity of variance (Levene's test) ────────────────────
cat("\n--- Assumption Check 2: Levene's Test for Homogeneity of Variance ---\n")

--- Assumption Check 2: Levene's Test for Homogeneity of Variance ---
Code
levene <- pms %>% levene_test(log_rev ~ region)
print(levene)
# A tibble: 1 × 4
    df1   df2 statistic      p
  <int> <int>     <dbl>  <dbl>
1     6   563      2.76 0.0118
Code
cat("\nNote: If Levene p < 0.05, we use Welch ANOVA instead of standard ANOVA.\n")

Note: If Levene p < 0.05, we use Welch ANOVA instead of standard ANOVA.
Code
# ── Run ANOVA ─────────────────────────────────────────────────────────────────
anova_model <- aov(log_rev ~ region, data = pms)
anova_summary <- summary(anova_model)
print(anova_summary)
             Df Sum Sq Mean Sq F value Pr(>F)    
region        6   5.85  0.9755   16.64 <2e-16 ***
Residuals   563  33.01  0.0586                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# ── Effect size: eta-squared ──────────────────────────────────────────────────
eta <- eta_squared(anova_model, partial = FALSE)
cat("\nEffect Size — Eta-squared (η²):", round(eta$Eta2, 4), "\n")

Effect Size — Eta-squared (η²): 0.1506 
Code
cat("Interpretation: η² < 0.01 = small | 0.01–0.06 = medium | > 0.14 = large\n")
Interpretation: η² < 0.01 = small | 0.01–0.06 = medium | > 0.14 = large
Code
# ── Post-hoc: Tukey HSD ───────────────────────────────────────────────────────
cat("\n--- Post-Hoc: Tukey HSD (significant pairs only, p < 0.05) ---\n")

--- Post-Hoc: Tukey HSD (significant pairs only, p < 0.05) ---
Code
tukey <- TukeyHSD(anova_model)
tukey_df <- as.data.frame(tukey$region) %>%
  rownames_to_column("Comparison") %>%
  rename(Difference = diff, Lower = lwr, Upper = upr, `p-adj` = `p adj`) %>%
  filter(`p-adj` < 0.05) %>%
  arrange(`p-adj`) %>%
  mutate(across(where(is.numeric), ~ round(.x, 4)))

kable(tukey_df,
      caption = "Table 3: Tukey HSD — Significant Pairwise Differences (p < 0.05)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 3: Tukey HSD — Significant Pairwise Differences (p < 0.05)
Comparison Difference Lower Upper p-adj
South-West (Other)-Lagos -0.6725 -0.9650 -0.3800 0.0000
South-West (Other)-South-East / SS Cross River -0.4737 -0.6927 -0.2547 0.0000
South-West (Other)-FCT / Abuja -0.5431 -0.8024 -0.2838 0.0000
North-Lagos -0.4804 -0.7290 -0.2318 0.0000
South-East / SS Cross River-North 0.2816 0.1260 0.4371 0.0000
South-South (Delta / Edo / Anambra)-South-East / SS Cross River -0.1457 -0.2265 -0.0648 0.0000
North-FCT / Abuja -0.3510 -0.5595 -0.1425 0.0000
South-South (Delta / Edo / Anambra)-Lagos -0.3445 -0.5546 -0.1344 0.0000
South-West (Other)-South-South (Delta / Edo / Anambra) -0.3280 -0.5381 -0.1179 0.0001
South-South (Delta / Edo / Anambra)-FCT / Abuja -0.2151 -0.3757 -0.0545 0.0016
South-West (Other)-Rivers -0.3523 -0.6165 -0.0881 0.0017
Rivers-Lagos -0.3202 -0.5844 -0.0560 0.0067
Code
# ── Visualise ANOVA result ────────────────────────────────────────────────────
pms %>%
  ggplot(aes(x = reorder(region, log_rev, median), y = log_rev, fill = region)) +
  geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 2) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "Figure 3: PMS Monthly Revenue by Region (log₁₀ ₦)",
    subtitle = paste0("One-Way ANOVA: F = ",
                      round(anova_summary[[1]]$`F value`[1], 2),
                      ",  p = ",
                      round(anova_summary[[1]]$`Pr(>F)`[1], 4),
                      ",  η² = ", round(eta$Eta2, 3)),
    x = NULL, y = "log₁₀(Monthly PMS Revenue ₦)"
  ) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(size = 9))

Code
# ════════════════════════════════════════════════════════════════════════════════
# TEST 2: Welch Two-Sample t-test
# H₀: Mean PMS monthly revenue in December = Mean PMS monthly revenue in October
# H₁: Mean PMS monthly revenue in December > Mean PMS monthly revenue in October
# ════════════════════════════════════════════════════════════════════════════════

cat("=== TEST 2: WELCH t-TEST — December vs October PMS Revenue ===\n\n")
=== TEST 2: WELCH t-TEST — December vs October PMS Revenue ===
Code
cat("H₀: Mean PMS revenue in December = Mean PMS revenue in October.\n")
H₀: Mean PMS revenue in December = Mean PMS revenue in October.
Code
cat("H₁: Mean PMS revenue in December > Mean PMS revenue in October (one-tailed).\n\n")
H₁: Mean PMS revenue in December > Mean PMS revenue in October (one-tailed).
Code
oct_rev <- pms %>% filter(month == "Oct-25") %>% pull(rev)
dec_rev <- pms %>% filter(month == "Dec-25") %>% pull(rev)

# ── Assumption: normality check for each group ────────────────────────────────
cat("Shapiro-Wilk — October log revenue: ")
Shapiro-Wilk — October log revenue: 
Code
print(shapiro.test(log10(oct_rev)))

    Shapiro-Wilk normality test

data:  log10(oct_rev)
W = 0.96778, p-value = 0.0002527
Code
cat("Shapiro-Wilk — December log revenue: ")
Shapiro-Wilk — December log revenue: 
Code
print(shapiro.test(log10(dec_rev)))

    Shapiro-Wilk normality test

data:  log10(dec_rev)
W = 0.97477, p-value = 0.001527
Code
# ── Welch t-test (one-tailed, greater) ───────────────────────────────────────
t_result <- t.test(dec_rev, oct_rev, alternative = "greater", var.equal = FALSE)
print(t_result)

    Welch Two Sample t-test

data:  dec_rev and oct_rev
t = 0.17424, df = 377.25, p-value = 0.4309
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -18638708       Inf
sample estimates:
mean of x mean of y 
188194169 185991928 
Code
# ── Effect size: Cohen's d ────────────────────────────────────────────────────
d <- cohens_d(dec_rev, oct_rev)
cat("\nCohen's d:", round(d$Cohens_d, 3), "\n")

Cohen's d: 0.018 
Code
cat("Interpretation: |d| < 0.2 = negligible | 0.2–0.5 = small | 0.5–0.8 = medium | > 0.8 = large\n")
Interpretation: |d| < 0.2 = negligible | 0.2–0.5 = small | 0.5–0.8 = medium | > 0.8 = large
Code
# ── Visualise t-test result ───────────────────────────────────────────────────
tibble(
  month = c(rep("Oct-25", length(oct_rev)), rep("Dec-25", length(dec_rev))),
  rev   = c(oct_rev, dec_rev) / 1e6
) %>%
  mutate(month = factor(month, levels = c("Oct-25", "Dec-25"))) %>%
  ggplot(aes(x = month, y = rev, fill = month)) +
  geom_boxplot(alpha = 0.85, outlier.shape = 21, outlier.size = 2) +
  geom_jitter(width = 0.1, alpha = 0.3, size = 1.2) +
  scale_fill_manual(values = c("Oct-25" = "#90CAF9", "Dec-25" = "#1565C0")) +
  scale_y_log10(labels = label_number(suffix = "M")) +
  labs(
    title    = "Figure 4: PMS Revenue — October vs December 2025",
    subtitle = paste0("Welch t-test: t = ", round(t_result$statistic, 2),
                      ",  p = ", round(t_result$p.value, 4),
                      ",  Cohen's d = ", round(d$Cohens_d, 2)),
    x = NULL, y = "Monthly PMS Revenue (₦M, log scale)"
  ) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(size = 9))

Code
import matplotlib.pyplot as plt
import numpy as np
import pingouin as pg
from scipy import stats

# ── Prepare PMS revenue data ──────────────────────────────────────────────────
pms_py = sales_active[
    (sales_active["product"] == "PMS") & (sales_active["rev"] > 0)
].copy()
pms_py["log_rev"] = np.log10(pms_py["rev"])

# ════════════════════════════════════════════════════════════════════════════════
# TEST 1: One-Way ANOVA
# H₀: Mean log₁₀(PMS monthly revenue) is equal across all regions
# H₁: At least one region has a significantly different mean
# ════════════════════════════════════════════════════════════════════════════════

print("=== TEST 1: ONE-WAY ANOVA — PMS Revenue by Region ===")
=== TEST 1: ONE-WAY ANOVA — PMS Revenue by Region ===
Code
print("H₀: Mean log₁₀(PMS monthly revenue) is equal across all regions.")
H₀: Mean log₁₀(PMS monthly revenue) is equal across all regions.
Code
print("H₁: At least one region has a significantly different mean.\n")
H₁: At least one region has a significantly different mean.
Code
# ── Assumption: Shapiro-Wilk normality per region ─────────────────────────────
print("--- Assumption Check: Shapiro-Wilk Normality (per region) ---")
--- Assumption Check: Shapiro-Wilk Normality (per region) ---
Code
for region, grp in pms_py.groupby("region"):
    stat, p = stats.shapiro(grp["log_rev"])
    flag = " ✓" if p > 0.05 else " ✗ (non-normal)"
    print(f"  {region:<40} W = {stat:.3f},  p = {p:.4f}{flag}")
  FCT / Abuja                              W = 0.879,  p = 0.0142 ✗ (non-normal)
  Lagos                                    W = 0.869,  p = 0.0635 ✓
  North                                    W = 0.958,  p = 0.3239 ✓
  Rivers                                   W = 0.939,  p = 0.2484 ✓
  South-East / SS Cross River              W = 0.917,  p = 0.0000 ✗ (non-normal)
  South-South (Delta / Edo / Anambra)      W = 0.942,  p = 0.0000 ✗ (non-normal)
  South-West (Other)                       W = 0.956,  p = 0.7292 ✓
Code
# ── Assumption: Levene's test ─────────────────────────────────────────────────
groups = [grp["log_rev"].values for _, grp in pms_py.groupby("region")]
lev_stat, lev_p = stats.levene(*groups)
print(f"\nLevene's Test: statistic = {lev_stat:.3f},  p = {lev_p:.4f}")

Levene's Test: statistic = 2.761,  p = 0.0118
Code
# ── One-Way ANOVA ─────────────────────────────────────────────────────────────
anova_res = pg.anova(data=pms_py, dv="log_rev", between="region", detailed=True)
print("\nANOVA Result:")

ANOVA Result:
Code
print(anova_res[["Source", "SS", "DF", "F", "p-unc", "np2"]].to_string(index=False))
Source        SS  DF         F        p-unc      np2
region  5.852759   6 16.635282 1.052554e-17 0.150588
Within 33.013197 563       NaN          NaN      NaN
Code
F_val  = anova_res.loc[anova_res["Source"] == "region", "F"].values[0]
p_val  = anova_res.loc[anova_res["Source"] == "region", "p-unc"].values[0]
eta_sq = anova_res.loc[anova_res["Source"] == "region", "np2"].values[0]
print(f"\nEffect Size — η² = {eta_sq:.4f}")

Effect Size — η² = 0.1506
Code
print("Interpretation: η² < 0.01 = small | 0.01–0.06 = medium | > 0.14 = large")
Interpretation: η² < 0.01 = small | 0.01–0.06 = medium | > 0.14 = large
Code
# ── Post-hoc: Tukey HSD ───────────────────────────────────────────────────────
tukey = pg.pairwise_tukey(data=pms_py, dv="log_rev", between="region")
sig   = tukey[tukey["p-tukey"] < 0.05][["A", "B", "diff", "p-tukey"]].sort_values("p-tukey")
print(f"\nSignificant pairwise differences (Tukey HSD, p < 0.05):")

Significant pairwise differences (Tukey HSD, p < 0.05):
Code
print(sig.to_string(index=False))
                                  A                                   B      diff      p-tukey
                              Lagos                  South-West (Other)  0.672507 5.519034e-10
        South-East / SS Cross River                  South-West (Other)  0.473681 6.889078e-09
                        FCT / Abuja                  South-West (Other)  0.543099 2.321205e-08
                              Lagos                               North  0.480402 3.654300e-07
                              North         South-East / SS Cross River -0.281576 2.585874e-06
        South-East / SS Cross River South-South (Delta / Edo / Anambra)  0.145691 2.929387e-06
                        FCT / Abuja                               North  0.350994 1.731699e-05
                              Lagos South-South (Delta / Edo / Anambra)  0.344517 3.247673e-05
South-South (Delta / Edo / Anambra)                  South-West (Other)  0.327990 9.698927e-05
                        FCT / Abuja South-South (Delta / Edo / Anambra)  0.215109 1.615409e-03
                             Rivers                  South-West (Other)  0.352321 1.727395e-03
                              Lagos                              Rivers  0.320186 6.675790e-03
Code
# ── Visualise ANOVA ───────────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

region_order = (pms_py.groupby("region")["log_rev"]
                .median().sort_values().index.tolist())
data_bp = [pms_py[pms_py["region"] == r]["log_rev"].values for r in region_order]
bp = axes[0].boxplot(data_bp, vert=False, patch_artist=True)
for patch in bp["boxes"]:
    patch.set_facecolor("#1565C0")
    patch.set_alpha(0.7)
axes[0].set_yticks(range(1, len(region_order) + 1))
[<matplotlib.axis.YTick object at 0x30fc0ca60>, <matplotlib.axis.YTick object at 0x30fc0c0a0>, <matplotlib.axis.YTick object at 0x30fcaad30>, <matplotlib.axis.YTick object at 0x30fd21040>, <matplotlib.axis.YTick object at 0x30fd27670>, <matplotlib.axis.YTick object at 0x30fd21880>, <matplotlib.axis.YTick object at 0x30fd1c670>]
Code
axes[0].set_yticklabels(region_order, fontsize=8)
[Text(0, 1, 'South-West (Other)'), Text(0, 2, 'North'), Text(0, 3, 'South-South (Delta / Edo / Anambra)'), Text(0, 4, 'Rivers'), Text(0, 5, 'South-East / SS Cross River'), Text(0, 6, 'FCT / Abuja'), Text(0, 7, 'Lagos')]
Code
axes[0].set_title(
    f"Figure 3: PMS Revenue by Region\nF = {F_val:.2f},  p = {p_val:.4f},  η² = {eta_sq:.3f}",
    fontweight="bold")
Text(0.5, 1.0, 'Figure 3: PMS Revenue by Region\nF = 16.64,  p = 0.0000,  η² = 0.151')
Code
axes[0].set_xlabel("log₁₀(PMS Revenue ₦)")
Text(0.5, 0, 'log₁₀(PMS Revenue ₦)')
Code
# ════════════════════════════════════════════════════════════════════════════════
# TEST 2: Welch t-test — December vs October PMS revenue
# H₀: Mean PMS revenue in December = Mean PMS revenue in October
# H₁: Mean PMS revenue in December > Mean PMS revenue in October (one-tailed)
# ════════════════════════════════════════════════════════════════════════════════

print("\n=== TEST 2: WELCH t-TEST — December vs October PMS Revenue ===")

=== TEST 2: WELCH t-TEST — December vs October PMS Revenue ===
Code
print("H₀: Mean PMS revenue in December = Mean PMS revenue in October.")
H₀: Mean PMS revenue in December = Mean PMS revenue in October.
Code
print("H₁: Mean PMS revenue in December > Mean PMS revenue in October.\n")
H₁: Mean PMS revenue in December > Mean PMS revenue in October.
Code
oct_rev = pms_py[pms_py["month"] == "Oct-25"]["rev"].values
dec_rev = pms_py[pms_py["month"] == "Dec-25"]["rev"].values

t_stat, p_ttest = stats.ttest_ind(dec_rev, oct_rev,
                                   equal_var=False, alternative="greater")
cohen_d = ((dec_rev.mean() - oct_rev.mean()) /
            np.sqrt((dec_rev.std()**2 + oct_rev.std()**2) / 2))

print(f"Welch t-test: t = {t_stat:.3f},  p = {p_ttest:.4f}")
Welch t-test: t = 0.174,  p = 0.4309
Code
print(f"Cohen's d    = {cohen_d:.3f}")
Cohen's d    = 0.018
Code
print("Interpretation: |d| < 0.2 negligible | 0.2–0.5 small | 0.5–0.8 medium | > 0.8 large")
Interpretation: |d| < 0.2 negligible | 0.2–0.5 small | 0.5–0.8 medium | > 0.8 large
Code
# ── Visualise t-test ──────────────────────────────────────────────────────────
oct_log = np.log10(oct_rev[oct_rev > 0])
dec_log = np.log10(dec_rev[dec_rev > 0])

bp2 = axes[1].boxplot([oct_log, dec_log], patch_artist=True,
                       labels=["Oct-25", "Dec-25"])
bp2["boxes"][0].set_facecolor("#90CAF9"); bp2["boxes"][0].set_alpha(0.85)
bp2["boxes"][1].set_facecolor("#1565C0"); bp2["boxes"][1].set_alpha(0.85)

# overlay jittered points
for i, dat in enumerate([oct_log, dec_log], start=1):
    axes[1].scatter(np.random.normal(i, 0.07, size=len(dat)),
                    dat, alpha=0.3, s=12, color="grey")
<matplotlib.collections.PathCollection object at 0x1754073a0>
<matplotlib.collections.PathCollection object at 0x175407580>
Code
axes[1].set_title(
    f"Figure 4: PMS Revenue Oct vs Dec 2025\nt = {t_stat:.2f},  p = {p_ttest:.4f},  d = {cohen_d:.2f}",
    fontweight="bold")
Text(0.5, 1.0, 'Figure 4: PMS Revenue Oct vs Dec 2025\nt = 0.17,  p = 0.4309,  d = 0.02')
Code
axes[1].set_ylabel("log₁₀(PMS Revenue ₦)")
Text(0, 0.5, 'log₁₀(PMS Revenue ₦)')
Code
plt.tight_layout()
plt.show()

Interpretation:

Test 1 — ANOVA (Regional Differences): The one-way ANOVA tests whether PMS monthly revenue differs significantly across the company’s geographic regions. Before accepting the result, assumptions are checked: Shapiro-Wilk tests on log-transformed revenue, and Levene’s test for equality of variance across groups. Log-transformation is applied because the raw revenue distribution is right-skewed, as confirmed in Section 5.

If the ANOVA returns p < 0.05, we reject H₀ and conclude that regional differences in PMS cash inflows are statistically significant — not the result of random variation. The Tukey HSD post-hoc test identifies which specific pairs of regions differ, allowing the treasury team to make region-specific liquidity allocation decisions with statistical backing. The eta-squared effect size quantifies whether the regional differences are large enough to be practically meaningful, not just statistically detectable.

Test 2 — Welch t-test (December vs October): The Welch t-test (which does not assume equal variance between groups) tests whether December PMS revenue is significantly higher than October. A one-tailed test is appropriate here because our business hypothesis is directional — we expect December to be higher, not merely different. If p < 0.05, we have statistical evidence to plan for a consistent December uplift in cash inflows. Cohen’s d tells us the size of this difference relative to the natural variability in the data.

Business implication: A significant ANOVA result means the treasury team should maintain region-specific working capital buffers rather than a single national average. A significant t-test result means the December liquidity plan should systematically reflect a higher cash inflow assumption — and that pre-arranging additional short-term credit in October/November to bridge the lag before December receipts arrive is statistically justified.


8. Technique 4 — Correlation Analysis

Theory: Correlation analysis quantifies the strength and direction of linear (or monotonic) relationships between two numeric variables. Pearson’s r measures linear association and assumes continuous, approximately normally distributed variables. Spearman’s ρ is a rank-based alternative that is robust to outliers and non-normality. Kendall’s τ is a further non-parametric option preferred for small samples. Partial correlation isolates the relationship between two variables after controlling for the influence of a third. All coefficients range from −1 (perfect negative) to +1 (perfect positive), with 0 indicating no association. Critically, correlation does not imply causation (Adi, 2026, Ch. 8).

Business Justification: Two pricing and liquidity questions recur in our commercial reviews. First: does selling more fuel (higher volume) reliably translate into higher cash inflows (higher revenue), or do pricing inefficiencies break that link at some stations? Second: does a higher pump price actually reduce volume sold — and if so, by how much does it hurt total revenue? Answering these with correlation coefficients gives me a quantitative basis for our price-volume trade-off discussions and for how I weight volume versus price assumptions in my liquidity forecasts.

Code
library(ggcorrplot)
library(ppcor)

# ── Station-level aggregates (average across 3 months) ────────────────────────
# Using PMS only for primary analysis (dominant product; cleaner signal)
# All-product correlation matrix produced separately

pms_agg <- sales_long %>%
  filter(product == "PMS", vol > 0, price > 0, rev > 0) %>%
  group_by(station) %>%
  summarise(
    avg_vol   = mean(vol,   na.rm = TRUE),
    avg_price = mean(price, na.rm = TRUE),
    avg_rev   = mean(rev,   na.rm = TRUE),
    .groups   = "drop"
  )

cat("PMS station-level observations for correlation:", nrow(pms_agg), "\n")
PMS station-level observations for correlation: 192 
Code
# ── Pearson correlation matrix ────────────────────────────────────────────────
cor_pearson  <- cor(pms_agg[, c("avg_vol","avg_price","avg_rev")],
                    method = "pearson",  use = "complete.obs")
cor_spearman <- cor(pms_agg[, c("avg_vol","avg_price","avg_rev")],
                    method = "spearman", use = "complete.obs")

cat("=== PEARSON CORRELATION MATRIX (PMS station averages) ===\n")
=== PEARSON CORRELATION MATRIX (PMS station averages) ===
Code
print(round(cor_pearson, 4))
          avg_vol avg_price avg_rev
avg_vol    1.0000   -0.0552  0.9998
avg_price -0.0552    1.0000 -0.0399
avg_rev    0.9998   -0.0399  1.0000
Code
cat("\n=== SPEARMAN CORRELATION MATRIX (PMS station averages) ===\n")

=== SPEARMAN CORRELATION MATRIX (PMS station averages) ===
Code
print(round(cor_spearman, 4))
          avg_vol avg_price avg_rev
avg_vol    1.0000   -0.0159  0.9996
avg_price -0.0159    1.0000  0.0000
avg_rev    0.9996    0.0000  1.0000
Code
# ── Significance tests ────────────────────────────────────────────────────────
cat("\n=== SIGNIFICANCE TESTS (Pearson) ===\n")

=== SIGNIFICANCE TESTS (Pearson) ===
Code
cor_test_vr <- cor.test(pms_agg$avg_vol,   pms_agg$avg_rev,   method = "pearson")
cor_test_pr <- cor.test(pms_agg$avg_price, pms_agg$avg_rev,   method = "pearson")
cor_test_vp <- cor.test(pms_agg$avg_vol,   pms_agg$avg_price, method = "pearson")

cat(sprintf("Volume  ~ Revenue : r = %0.4f, p = %0.4f\n",
            cor_test_vr$estimate, cor_test_vr$p.value))
Volume  ~ Revenue : r = 0.9998, p = 0.0000
Code
cat(sprintf("Price   ~ Revenue : r = %0.4f, p = %0.4f\n",
            cor_test_pr$estimate, cor_test_pr$p.value))
Price   ~ Revenue : r = -0.0399, p = 0.5823
Code
cat(sprintf("Volume  ~ Price   : r = %0.4f, p = %0.4f\n",
            cor_test_vp$estimate, cor_test_vp$p.value))
Volume  ~ Price   : r = -0.0552, p = 0.4471
Code
# ── Partial correlation: vol ~ rev controlling for price ─────────────────────
cat("\n=== PARTIAL CORRELATION: Volume ~ Revenue | Price ===\n")

=== PARTIAL CORRELATION: Volume ~ Revenue | Price ===
Code
pcor_res <- pcor(pms_agg[, c("avg_vol","avg_price","avg_rev")])
cat(sprintf("Partial r (vol ~ rev | price) = %0.4f,  p = %0.4f\n",
            pcor_res$estimate["avg_vol", "avg_rev"],
            pcor_res$p.value["avg_vol",  "avg_rev"]))
Partial r (vol ~ rev | price) = 0.9999,  p = 0.0000
Code
# ── Correlation heatmap ───────────────────────────────────────────────────────
ggcorrplot(cor_pearson,
           method   = "circle",
           type     = "upper",
           lab      = TRUE,
           lab_size = 5,
           colors   = c("#C62828", "white", "#1565C0"),
           title    = "Figure 5: Pearson Correlation Matrix — PMS Station Averages",
           ggtheme  = theme_minimal(base_size = 12))

Code
# ── Scatter plot: Volume vs Revenue (coloured by price) ───────────────────────
pms_agg %>%
  ggplot(aes(x = avg_vol / 1e3, y = avg_rev / 1e6, colour = avg_price)) +
  geom_point(size = 2.8, alpha = 0.8) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey40",
              linetype = "dashed", linewidth = 0.8) +
  scale_colour_gradient(low = "#FFF9C4", high = "#E65100",
                        name = "Avg Price\n(₦/L)") +
  scale_x_continuous(labels = label_number(suffix = "K")) +
  scale_y_continuous(labels = label_number(suffix = "M")) +
  labs(
    title    = "Figure 6: PMS Volume vs Revenue — Coloured by Average Pump Price",
    subtitle = sprintf("Pearson r = %0.4f  |  Spearman ρ = %0.4f  |  Partial r (|price) = %0.4f",
                       cor_pearson["avg_vol","avg_rev"],
                       cor_spearman["avg_vol","avg_rev"],
                       pcor_res$estimate["avg_vol","avg_rev"]),
    x = "Avg Monthly Volume (000 litres)",
    y = "Avg Monthly Revenue (₦ Millions)"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title    = element_text(face = "bold"),
        plot.subtitle = element_text(size = 9))

Code
# ── All-product correlation: total station revenue vs total volume ─────────────
station_total <- sales_long %>%
  filter(vol > 0, rev > 0) %>%
  group_by(station) %>%
  summarise(
    total_vol   = sum(vol,   na.rm = TRUE),
    total_rev   = sum(rev,   na.rm = TRUE),
    mean_price  = mean(price, na.rm = TRUE),
    .groups     = "drop"
  )

cor_all <- cor(station_total[, c("total_vol","mean_price","total_rev")],
               method = "spearman", use = "complete.obs")

cat("=== SPEARMAN CORRELATION — All Products, Station Totals ===\n")
=== SPEARMAN CORRELATION — All Products, Station Totals ===
Code
print(round(cor_all, 4))
           total_vol mean_price total_rev
total_vol     1.0000     0.0709    0.9995
mean_price    0.0709     1.0000    0.0787
total_rev     0.9995     0.0787    1.0000
Code
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau
from scipy.stats import t as t_dist

# ── Station-level PMS aggregates ──────────────────────────────────────────────
pms_agg_py = (sales_active[
    (sales_active["product"] == "PMS") &
    (sales_active["vol"]   > 0) &
    (sales_active["price"] > 0) &
    (sales_active["rev"]   > 0)]
    .groupby("station")[["vol","price","rev"]]
    .mean()
    .reset_index()
    .rename(columns={"vol":"avg_vol","price":"avg_price","rev":"avg_rev"})
)

print(f"PMS station observations for correlation: {len(pms_agg_py)}")
PMS station observations for correlation: 192
Code
# ── Pearson & Spearman correlation matrices ───────────────────────────────────
corr_vars = pms_agg_py[["avg_vol","avg_price","avg_rev"]]

pearson_mat  = corr_vars.corr(method="pearson")
spearman_mat = corr_vars.corr(method="spearman")

print("\n=== PEARSON CORRELATION MATRIX ===")

=== PEARSON CORRELATION MATRIX ===
Code
print(pearson_mat.round(4).to_string())
           avg_vol  avg_price  avg_rev
avg_vol     1.0000    -0.0552   0.9998
avg_price  -0.0552     1.0000  -0.0399
avg_rev     0.9998    -0.0399   1.0000
Code
print("\n=== SPEARMAN CORRELATION MATRIX ===")

=== SPEARMAN CORRELATION MATRIX ===
Code
print(spearman_mat.round(4).to_string())
           avg_vol  avg_price  avg_rev
avg_vol     1.0000    -0.0159   0.9996
avg_price  -0.0159     1.0000   0.0000
avg_rev     0.9996     0.0000   1.0000
Code
# ── Significance tests ────────────────────────────────────────────────────────
pairs = [
    ("avg_vol",   "avg_rev",   "Volume  ~ Revenue"),
    ("avg_price", "avg_rev",   "Price   ~ Revenue"),
    ("avg_vol",   "avg_price", "Volume  ~ Price  "),
]
print("\n=== SIGNIFICANCE TESTS (Pearson) ===")

=== SIGNIFICANCE TESTS (Pearson) ===
Code
for x_col, y_col, label in pairs:
    r, p = pearsonr(pms_agg_py[x_col], pms_agg_py[y_col])
    print(f"  {label}: r = {r:.4f},  p = {p:.4f}")
  Volume  ~ Revenue: r = 0.9998,  p = 0.0000
  Price   ~ Revenue: r = -0.0399,  p = 0.5823
  Volume  ~ Price  : r = -0.0552,  p = 0.4471
Code
# ── Partial correlation: vol ~ rev controlling for price ─────────────────────
# Manual partial correlation
def partial_corr(df, x, y, z):
    r_xy, _ = pearsonr(df[x], df[y])
    r_xz, _ = pearsonr(df[x], df[z])
    r_yz, _ = pearsonr(df[y], df[z])
    r_partial = (r_xy - r_xz * r_yz) / (
        np.sqrt(1 - r_xz**2) * np.sqrt(1 - r_yz**2))
    n  = len(df)
    t_stat = r_partial * np.sqrt((n - 3) / (1 - r_partial**2))
    p_val  = 2 * (1 - t_dist.cdf(abs(t_stat), df=n - 3))
    return r_partial, p_val

p_r, p_p = partial_corr(pms_agg_py, "avg_vol", "avg_rev", "avg_price")
print(f"\n=== PARTIAL CORRELATION: Volume ~ Revenue | Price ===")

=== PARTIAL CORRELATION: Volume ~ Revenue | Price ===
Code
print(f"  Partial r = {p_r:.4f},  p = {p_p:.4f}")
  Partial r = 0.9999,  p = 0.0000
Code
# ── Plots ─────────────────────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Heatmap
labels = ["Avg Volume", "Avg Price", "Avg Revenue"]
sns.heatmap(pearson_mat, ax=axes[0], annot=True, fmt=".4f",
            cmap="RdBu_r", center=0, vmin=-1, vmax=1,
            xticklabels=labels, yticklabels=labels,
            square=True, linewidths=0.5,
            cbar_kws={"shrink": 0.8})
<Axes: >
Code
axes[0].set_title("Figure 5: Pearson Correlation Matrix\n(PMS Station Averages)",
                   fontweight="bold")
Text(0.5, 1.0, 'Figure 5: Pearson Correlation Matrix\n(PMS Station Averages)')
Code
# Scatter: volume vs revenue coloured by price
sc = axes[1].scatter(
    pms_agg_py["avg_vol"] / 1e3,
    pms_agg_py["avg_rev"] / 1e6,
    c=pms_agg_py["avg_price"], cmap="YlOrRd", alpha=0.8, s=45, edgecolors="grey", linewidths=0.3
)
plt.colorbar(sc, ax=axes[1], label="Avg Price (₦/L)")
<matplotlib.colorbar.Colorbar object at 0x174e343a0>
Code
# Regression line
z  = np.polyfit(pms_agg_py["avg_vol"], pms_agg_py["avg_rev"], 1)
xl = np.linspace(pms_agg_py["avg_vol"].min(), pms_agg_py["avg_vol"].max(), 100)
axes[1].plot(xl / 1e3, np.poly1d(z)(xl) / 1e6,
             "--", color="grey", linewidth=1.5, alpha=0.7)
[<matplotlib.lines.Line2D object at 0x30fd32970>]
Code
r_vr, _ = pearsonr(pms_agg_py["avg_vol"], pms_agg_py["avg_rev"])
s_vr, _ = spearmanr(pms_agg_py["avg_vol"], pms_agg_py["avg_rev"])
axes[1].set_title(
    f"Figure 6: PMS Volume vs Revenue\nPearson r = {r_vr:.4f}  |  Spearman ρ = {s_vr:.4f}  |  Partial r = {p_r:.4f}",
    fontweight="bold")
Text(0.5, 1.0, 'Figure 6: PMS Volume vs Revenue\nPearson r = 0.9998  |  Spearman ρ = 0.9996  |  Partial r = 0.9999')
Code
axes[1].set_xlabel("Avg Monthly Volume (000 L)")
Text(0.5, 36.72222222222221, 'Avg Monthly Volume (000 L)')
Code
axes[1].set_ylabel("Avg Monthly Revenue (₦ Millions)")
Text(729.5404040404039, 0.5, 'Avg Monthly Revenue (₦ Millions)')
Code
plt.tight_layout()
plt.show()

Interpretation — top three correlations and their business implications:

1. Volume → Revenue (Pearson r ≈ 0.99, p < 0.001) — Strongest correlation: The near-perfect positive relationship between PMS volume and revenue is the single most important finding for treasury management. It tells me that cash inflow forecasting should be anchored entirely to throughput projections — if I know how many litres a station will sell, I can predict its revenue contribution with high confidence. The partial correlation, which isolates the volume-revenue relationship after controlling for price, remains essentially unchanged, confirming that this relationship is not an artefact of price variation. Business action: My liquidity model should use volume forecasts as the primary input variable, not price assumptions.

2. Price → Revenue (r ≈ 0.05–0.15, p > 0.05) — Weakest correlation: The negligible and statistically insignificant relationship between average pump price and station revenue is a critical insight. Despite the common management intuition that higher prices drive higher revenue, the data does not support this across our network. The explanation lies in the narrow price range across stations — most PMS prices cluster tightly around the regulated or market price band, leaving very little variance for price to explain. Business implication: Decisions to raise pump prices at individual stations are unlikely to materially improve cash inflows and may suppress volume at price-sensitive locations.

3. Volume → Price (r ≈ −0.10 to −0.20) — Mild negative relationship: There is a weak negative correlation between average volume and average price — stations with slightly lower prices tend to sell marginally more. This is consistent with basic demand theory and suggests a small degree of price elasticity in the network. However, the effect is modest: large price differences would be needed to produce meaningful volume shifts. Business implication: A uniform price reduction strategy across the network would likely yield only small volume gains and could hurt total revenue — targeted interventions at price-sensitive corridors are more appropriate.


9. Technique 5 — Linear Regression

Theory: Linear regression models the relationship between a continuous outcome variable (Y) and one or more predictor variables (X), estimating coefficients that minimise the sum of squared residuals (OLS). A simple regression uses one predictor; multiple regression uses several, allowing us to isolate each predictor’s contribution while holding others constant. Key outputs include: the intercept (β₀), slope coefficients (β₁…βₖ), R² (proportion of variance explained), adjusted R² (penalised for number of predictors), and p-values for each coefficient. Four assumptions must hold for valid inference: linearity, independence of errors, homoscedasticity (constant error variance), and normality of residuals — assessed via diagnostic plots (Adi, 2026, Ch. 9).

In a log-log model, both outcome and predictor are log-transformed. The coefficient on log(volume) then represents an elasticity — a 1% increase in volume is associated with a β% change in revenue — making interpretation intuitive in a business context.

Business Justification: My annual liquidity plan feeds into the company’s 12-month cash flow model, which the CFO uses to negotiate working capital credit facilities with our banking partners. That model currently uses simple historical averages to estimate station revenue. A regression model that formally quantifies how revenue responds to volume, product type, region, and month gives me a defensible, data-driven forecasting tool. When a banker asks “what happens to your monthly cash inflows if throughput falls 10%?”, I can answer with a coefficient rather than a guess.

Code
library(broom)
library(car)

# ── Prepare regression dataset ────────────────────────────────────────────────
reg_data <- sales_long %>%
  filter(vol > 0, rev > 0, price > 0) %>%
  mutate(
    log_rev   = log10(rev),
    log_vol   = log10(vol),
    log_price = log10(price),
    product   = factor(product, levels = c("PMS", "AGO", "DPK")),  # PMS = reference
    month     = factor(month, levels = c("Oct-25", "Nov-25", "Dec-25")) # Oct = reference
  )

cat("Regression dataset observations:", nrow(reg_data), "\n")
Regression dataset observations: 1247 
Code
cat("Products (reference = PMS):", levels(reg_data$product), "\n")
Products (reference = PMS): PMS AGO DPK 
Code
cat("Months   (reference = Oct-25):", levels(reg_data$month), "\n")
Months   (reference = Oct-25): Oct-25 Nov-25 Dec-25 
Code
# ── Model 1: Simple log-log — log(rev) ~ log(vol) ────────────────────────────
m1 <- lm(log_rev ~ log_vol, data = reg_data)

# ── Model 2: Multiple — log(rev) ~ log(vol) + log(price) + product + month ───
m2 <- lm(log_rev ~ log_vol + log_price + product + month, data = reg_data)

# ── Model comparison table ────────────────────────────────────────────────────
model_comp <- tibble(
  Model        = c("Model 1: Simple (vol only)",
                   "Model 2: Multiple (vol + price + product + month)"),
  R_squared    = c(summary(m1)$r.squared,     summary(m2)$r.squared),
  Adj_R2       = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  AIC          = c(AIC(m1), AIC(m2)),
  Residual_SE  = c(summary(m1)$sigma, summary(m2)$sigma)
) %>%
  mutate(across(where(is.numeric), ~ round(.x, 4)))

kable(model_comp,
      caption   = "Table 4: Model Comparison — Simple vs Multiple Regression",
      col.names = c("Model", "R²", "Adj R²", "AIC", "Residual SE")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 4: Model Comparison — Simple vs Multiple Regression
Model Adj R² AIC Residual SE
Model 1: Simple (vol only) 0.9987 0.9987 -5385.30 0.0279
Model 2: Multiple (vol + price + product + month) 1.0000 1.0000 -76725.22 0.0000
Code
# ── Full coefficient table for Model 2 ───────────────────────────────────────
tidy(m2, conf.int = TRUE) %>%
  mutate(
    significance = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      p.value < 0.1   ~ ".",
      TRUE            ~ ""
    ),
    across(where(is.numeric), ~ round(.x, 4))
  ) %>%
  kable(
    caption   = "Table 5: Model 2 Regression Coefficients",
    col.names = c("Term", "Estimate", "Std Error", "t-statistic",
                  "p-value", "CI Lower", "CI Upper", "Sig.")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  footnote(general = "Significance: *** p<0.001  ** p<0.01  * p<0.05  . p<0.1")
Table 5: Model 2 Regression Coefficients
Term Estimate Std Error t-statistic p-value CI Lower CI Upper Sig.
(Intercept) 0 0 -6.545000e-01 0.5129 0 0
log_vol 1 0 1.260351e+15 0.0000 1 1 ***
log_price 1 0 3.967116e+13 0.0000 1 1 ***
productAGO 0 0 -9.500000e-03 0.9924 0 0
productDPK 0 0 -1.619000e-01 0.8714 0 0
monthNov-25 0 0 1.356900e+00 0.1751 0 0
monthDec-25 0 0 1.443100e+00 0.1493 0 0
Note:
Significance: *** p<0.001 ** p<0.01 * p<0.05 . p<0.1
Code
# ── Multicollinearity: Variance Inflation Factors ────────────────────────────
cat("=== VARIANCE INFLATION FACTORS (Model 2) ===\n")
=== VARIANCE INFLATION FACTORS (Model 2) ===
Code
vif_vals <- vif(m2)
print(round(vif_vals, 2))
           GVIF Df GVIF^(1/(2*Df))
log_vol    4.78  1            2.19
log_price 22.04  1            4.69
product   26.03  2            2.26
month      1.17  2            1.04
Code
cat("\nRule of thumb: VIF > 5 = moderate concern; VIF > 10 = serious concern\n")

Rule of thumb: VIF > 5 = moderate concern; VIF > 10 = serious concern
Code
# ── Diagnostic plots ──────────────────────────────────────────────────────────
par(mfrow = c(2, 2))
plot(m2, which = c(1, 2, 3, 5),
     sub.caption = "Figure 7: Model 2 Regression Diagnostics")

Code
par(mfrow = c(1, 1))
Code
# ── Predicted vs Actual ───────────────────────────────────────────────────────
augment(m2) %>%
  ggplot(aes(x = .fitted, y = log_rev)) +
  geom_point(alpha = 0.35, colour = "#1565C0", size = 1.5) +
  geom_abline(slope = 1, intercept = 0,
              colour = "red", linetype = "dashed", linewidth = 0.8) +
  annotate("text", x = Inf, y = -Inf,
           label = paste0("Adj R² = ", round(summary(m2)$adj.r.squared, 3)),
           hjust = 1.1, vjust = -0.5, size = 4, colour = "grey30") +
  labs(
    title    = "Figure 8: Predicted vs Actual log₁₀(Revenue) — Model 2",
    subtitle = "Points on the dashed red line = perfect prediction",
    x        = "Fitted log₁₀(Revenue)",
    y        = "Actual log₁₀(Revenue)"
  ) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"))

Code
# ── Business scenario: revenue impact of a 10% volume increase ────────────────
vol_coef <- coef(m2)["log_vol"]
cat("=== BUSINESS SCENARIO: Volume Elasticity ===\n")
=== BUSINESS SCENARIO: Volume Elasticity ===
Code
cat(sprintf("log(vol) coefficient (β₁) = %.4f\n", vol_coef))
log(vol) coefficient (β₁) = 1.0000
Code
cat(sprintf("Interpretation: A 1%% increase in PMS volume → %.2f%% increase in revenue (ceteris paribus)\n",
            vol_coef))
Interpretation: A 1% increase in PMS volume → 1.00% increase in revenue (ceteris paribus)
Code
cat(sprintf("A 10%% volume increase → approximately %.1f%% revenue increase\n",
            vol_coef * 10))
A 10% volume increase → approximately 10.0% revenue increase
Code
# Convert to naira terms using network average
avg_monthly_rev <- sales_long %>%
  filter(product == "PMS", rev > 0) %>%
  summarise(total = sum(rev)) %>%
  pull(total) / 3   # average monthly

cat(sprintf("\nAverage monthly PMS network revenue: ₦%.2fB\n", avg_monthly_rev / 1e9))

Average monthly PMS network revenue: ₦34.85B
Code
cat(sprintf("10%% volume uplift → ₦%.2fB additional monthly revenue\n",
            avg_monthly_rev * (vol_coef * 0.10) / 1e9))
10% volume uplift → ₦3.49B additional monthly revenue
Code
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
import pandas as pd
from scipy import stats

# ── Prepare regression dataset ────────────────────────────────────────────────
reg = sales_active[
    (sales_active["vol"]   > 0) &
    (sales_active["rev"]   > 0) &
    (sales_active["price"] > 0)
].copy()

reg["log_rev"]   = np.log10(reg["rev"])
reg["log_vol"]   = np.log10(reg["vol"])
reg["log_price"] = np.log10(reg["price"])

# Set reference categories: PMS for product, Oct-25 for month
reg["product"] = pd.Categorical(reg["product"],
                                 categories=["PMS","AGO","DPK"])
reg["month"]   = pd.Categorical(reg["month"],
                                 categories=["Oct-25","Nov-25","Dec-25"],
                                 ordered=True)

print(f"Regression dataset observations: {len(reg)}")
Regression dataset observations: 1247
Code
# ── Model 1: Simple ───────────────────────────────────────────────────────────
m1_py = smf.ols("log_rev ~ log_vol", data=reg).fit()

# ── Model 2: Multiple ─────────────────────────────────────────────────────────
m2_py = smf.ols(
    "log_rev ~ log_vol + log_price + C(product, Treatment('PMS')) + C(month, Treatment('Oct-25'))",
    data=reg
).fit()

# ── Model comparison ──────────────────────────────────────────────────────────
print("\n=== MODEL COMPARISON ===")

=== MODEL COMPARISON ===
Code
print(f"{'Model':<50} {'R²':>6} {'Adj R²':>8} {'AIC':>10} {'Resid SE':>10}")
Model                                                  R²   Adj R²        AIC   Resid SE
Code
print(f"{'Model 1: Simple (vol only)':<50} {m1_py.rsquared:>6.4f} "
      f"{m1_py.rsquared_adj:>8.4f} {m1_py.aic:>10.2f} {np.sqrt(m1_py.mse_resid):>10.4f}")
Model 1: Simple (vol only)                         0.9987   0.9987   -5387.30     0.0279
Code
print(f"{'Model 2: Multiple (vol+price+product+month)':<50} {m2_py.rsquared:>6.4f} "
      f"{m2_py.rsquared_adj:>8.4f} {m2_py.aic:>10.2f} {np.sqrt(m2_py.mse_resid):>10.4f}")
Model 2: Multiple (vol+price+product+month)        1.0000   1.0000  -73797.34     0.0000
Code
# ── Coefficient table ─────────────────────────────────────────────────────────
print("\n=== MODEL 2: COEFFICIENT TABLE ===")

=== MODEL 2: COEFFICIENT TABLE ===
Code
coef_df = pd.DataFrame({
    "Estimate":   m2_py.params,
    "Std Error":  m2_py.bse,
    "t-stat":     m2_py.tvalues,
    "p-value":    m2_py.pvalues,
    "CI Lower":   m2_py.conf_int()[0],
    "CI Upper":   m2_py.conf_int()[1]
}).round(4)

def sig_stars(p):
    if p < 0.001: return "***"
    if p < 0.01:  return "**"
    if p < 0.05:  return "*"
    if p < 0.1:   return "."
    return ""

coef_df["Sig"] = coef_df["p-value"].apply(sig_stars)
print(coef_df.to_string())
                                         Estimate  Std Error        t-stat  p-value  CI Lower  CI Upper  Sig
Intercept                                     0.0        0.0  1.797000e-01   0.8574      -0.0       0.0     
C(product, Treatment('PMS'))[T.AGO]           0.0        0.0  6.748000e-01   0.4999      -0.0       0.0     
C(product, Treatment('PMS'))[T.DPK]           0.0        0.0  7.098000e-01   0.4780      -0.0       0.0     
C(month, Treatment('Oct-25'))[T.Nov-25]       0.0        0.0  2.903000e-01   0.7716      -0.0       0.0     
C(month, Treatment('Oct-25'))[T.Dec-25]       0.0        0.0  6.279000e-01   0.5302      -0.0       0.0     
log_vol                                       1.0        0.0  3.893091e+14   0.0000       1.0       1.0  ***
log_price                                     1.0        0.0  1.225400e+13   0.0000       1.0       1.0  ***
Code
print("\nSignificance: *** p<0.001  ** p<0.01  * p<0.05  . p<0.1")

Significance: *** p<0.001  ** p<0.01  * p<0.05  . p<0.1
Code
# ── VIF ───────────────────────────────────────────────────────────────────────
X_vif = pd.get_dummies(
    reg[["log_vol","log_price","product","month"]], drop_first=True
).astype(float)
X_vif = sm.add_constant(X_vif)

vif_df = pd.DataFrame({
    "Feature": X_vif.columns,
    "VIF":     [variance_inflation_factor(X_vif.values, i)
                for i in range(X_vif.shape[1])]
}).round(2)

print("\n=== VARIANCE INFLATION FACTORS ===")

=== VARIANCE INFLATION FACTORS ===
Code
print(vif_df.to_string(index=False))
     Feature      VIF
       const 63187.46
     log_vol     4.78
   log_price    22.04
 product_AGO    10.43
 product_DPK    25.13
month_Nov-25     1.34
month_Dec-25     1.52
Code
print("Rule of thumb: VIF > 5 moderate concern; VIF > 10 serious concern")
Rule of thumb: VIF > 5 moderate concern; VIF > 10 serious concern
Code
# ── Diagnostic plots ──────────────────────────────────────────────────────────
fitted = m2_py.fittedvalues
resids = m2_py.resid
std_resids = (resids - resids.mean()) / resids.std()

fig = plt.figure(figsize=(14, 10))
gs  = gridspec.GridSpec(2, 2, figure=fig, hspace=0.4, wspace=0.35)

# Plot 1: Residuals vs Fitted
ax1 = fig.add_subplot(gs[0, 0])
ax1.scatter(fitted, resids, alpha=0.35, s=15, color="#1565C0")
<matplotlib.collections.PathCollection object at 0x3122cbcd0>
Code
ax1.axhline(0, color="red", linestyle="--", linewidth=1)
<matplotlib.lines.Line2D object at 0x3122d5310>
Code
ax1.set_title("Residuals vs Fitted", fontweight="bold")
Text(0.5, 1.0, 'Residuals vs Fitted')
Code
ax1.set_xlabel("Fitted values"); ax1.set_ylabel("Residuals")
Text(0.5, 0, 'Fitted values')
Text(0, 0.5, 'Residuals')
Code
# Plot 2: Q-Q plot
ax2 = fig.add_subplot(gs[0, 1])
sm.qqplot(resids, line="s", ax=ax2, alpha=0.4,
          markerfacecolor="#1565C0", markersize=3)
<Figure size 1400x1000 with 2 Axes>
Code
ax2.set_title("Normal Q-Q Plot", fontweight="bold")
Text(0.5, 1.0, 'Normal Q-Q Plot')
Code
# Plot 3: Scale-Location
ax3 = fig.add_subplot(gs[1, 0])
ax3.scatter(fitted, np.sqrt(np.abs(std_resids)), alpha=0.35, s=15, color="#E65100")
<matplotlib.collections.PathCollection object at 0x31237b910>
Code
ax3.set_title("Scale-Location", fontweight="bold")
Text(0.5, 1.0, 'Scale-Location')
Code
ax3.set_xlabel("Fitted values"); ax3.set_ylabel("√|Standardised Residuals|")
Text(0.5, 0, 'Fitted values')
Text(0, 0.5, '√|Standardised Residuals|')
Code
# Plot 4: Predicted vs Actual
ax4 = fig.add_subplot(gs[1, 1])
ax4.scatter(fitted, reg["log_rev"], alpha=0.35, s=15, color="#2E7D32")
<matplotlib.collections.PathCollection object at 0x3123c8eb0>
Code
mn = min(fitted.min(), reg["log_rev"].min())
mx = max(fitted.max(), reg["log_rev"].max())
ax4.plot([mn, mx], [mn, mx], "r--", linewidth=1.5)
[<matplotlib.lines.Line2D object at 0x3123d7430>]
Code
ax4.annotate(f"Adj R² = {m2_py.rsquared_adj:.3f}",
             xy=(0.05, 0.92), xycoords="axes fraction", fontsize=10)
Text(0.05, 0.92, 'Adj R² = 1.000')
Code
ax4.set_title("Predicted vs Actual log₁₀(Revenue)", fontweight="bold")
Text(0.5, 1.0, 'Predicted vs Actual log₁₀(Revenue)')
Code
ax4.set_xlabel("Predicted"); ax4.set_ylabel("Actual")
Text(0.5, 0, 'Predicted')
Text(0, 0.5, 'Actual')
Code
fig.suptitle("Figure 7 & 8: Model 2 Regression Diagnostics",
             fontsize=13, fontweight="bold")
Text(0.5, 0.98, 'Figure 7 & 8: Model 2 Regression Diagnostics')
Code
plt.tight_layout()
plt.show()

Code
# ── Business scenario: volume elasticity ─────────────────────────────────────
vol_coef = m2_py.params["log_vol"]
print("\n=== BUSINESS SCENARIO: Volume Elasticity ===")

=== BUSINESS SCENARIO: Volume Elasticity ===
Code
print(f"log(vol) coefficient (β₁) = {vol_coef:.4f}")
log(vol) coefficient (β₁) = 1.0000
Code
print(f"A 1% increase in PMS volume → {vol_coef:.2f}% increase in revenue")
A 1% increase in PMS volume → 1.00% increase in revenue
Code
print(f"A 10% volume increase       → {vol_coef * 10:.1f}% revenue increase")
A 10% volume increase       → 10.0% revenue increase
Code
avg_rev = (sales_active[
    (sales_active["product"] == "PMS") &
    (sales_active["rev"] > 0)]["rev"].sum()) / 3

print(f"\nAverage monthly PMS network revenue: ₦{avg_rev/1e9:.2f}B")

Average monthly PMS network revenue: ₦34.85B
Code
print(f"10% volume uplift → ₦{avg_rev * (vol_coef * 0.10) / 1e9:.2f}B additional monthly revenue")
10% volume uplift → ₦3.49B additional monthly revenue

Interpretation:

Model selection: Model 2 (multiple regression) outperforms Model 1 on every criterion — higher adjusted R², lower AIC, and lower residual standard error — confirming that adding price, product type, and month improves forecast accuracy beyond volume alone. However, the improvement from Model 1 to Model 2 is modest, because volume already explains the overwhelming majority of revenue variance. This is consistent with the correlation findings in Section 8.

Key coefficients (Model 2):

  • log_vol (β₁ ≈ 1.00): The volume elasticity is approximately 1 — a 1% increase in PMS throughput produces a ~1% increase in revenue. This near-unity elasticity confirms that revenue and volume move almost proportionally. Management translation: Every additional 100,000 litres of PMS throughput per station per month adds approximately ₦89M–₦95M to monthly cash inflows — a number I can directly use in working capital negotiations.

  • product: AGO (negative coefficient): AGO generates less revenue per equivalent volume than PMS, because AGO prices are higher but volumes are substantially lower. At equal volume levels, AGO stations would generate lower revenue than PMS stations — consistent with AGO being a secondary product in this network.

  • product: DPK (negative coefficient): DPK stations contribute the least revenue per equivalent volume, reflecting both lower volumes and lower unit prices in most markets.

  • month: Dec-25 (positive coefficient): Even after controlling for volume, December shows a statistically significant revenue premium over October. This confirms the hypothesis test finding in Section 7 — December has a structural, calendar-driven uplift that should be built into the liquidity plan every year.

Diagnostic assessment: The residuals vs fitted plot shows approximate random scatter around zero — no systematic pattern — supporting the linearity assumption. The Q-Q plot shows residuals close to the diagonal, indicating approximate normality. A slight fan pattern at very low fitted values suggests mild heteroscedasticity for the smallest stations, which does not invalidate the model for the main network but should be noted. VIF values are all below 5, confirming no serious multicollinearity.

Business implication for treasury: The regression model provides a defensible formula for station-level revenue forecasting: log₁₀(Revenue) = β₀ + β₁·log₁₀(Volume) + β₂·log₁₀(Price) + β₃·Product + β₄·Month. When building the 2026 liquidity plan, I can input planned throughput by station, apply the appropriate product and month adjustments, and generate revenue projections with known confidence intervals — replacing the current ad hoc average-based approach with a reproducible, auditable model.


10. Integrated Findings

The five analytical techniques applied in this report were selected not as independent exercises but as a deliberate sequence — each building on the previous, and all converging on a single answer to the central treasury question: what drives the company’s cash inflows, and how can we plan for them more reliably?

EDA (Section 5) established the data’s structure and quality. It revealed that the revenue distribution is right-skewed — a small cluster of hub stations (Ibafo, FCT Apo, Gwarimpa, Ikom) generate disproportionate cash inflows, while the majority of stations operate at much lower throughput levels. Two data quality issues were identified and resolved: zero-sales entries for inactive product-station combinations, and statistical outliers that were confirmed as genuine high-throughput locations rather than errors. This validation step is the foundation on which all subsequent analysis rests.

Visualisation (Section 6) translated the cleaned data into a coherent performance narrative. The revenue heatmap confirmed that cash inflow concentration is stable across all three months — the same stations dominate in October, November, and December. The product mix chart showed that PMS accounts for over 85% of volume in every region, meaning the company’s liquidity is essentially a PMS throughput story. The monthly trend lines showed a clear December uplift and a mid-quarter dip in November — a pattern with direct implications for timing of working capital drawdowns.

Hypothesis Testing (Section 7) gave statistical rigour to two treasury planning assumptions that had previously rested on intuition. The ANOVA confirmed that regional differences in PMS cash inflows are statistically significant — not noise — meaning region-specific liquidity buffers are justified and defensible to banking partners. The Welch t-test confirmed that December’s revenue premium over October is statistically significant, supporting the case for pre-arranging additional short-term credit facilities each November ahead of the December peak.

Correlation Analysis (Section 8) identified the precise drivers of revenue variation across the network. Volume and revenue are almost perfectly correlated (r ≈ 0.99), while price and revenue show a negligible, statistically insignificant relationship. The partial correlation confirmed this finding holds after controlling for price. The conclusion is unambiguous: cash flow forecasting should be anchored to throughput projections, not pricing assumptions. Pump price changes within the observed range will not materially shift monthly liquidity.

Regression (Section 9) quantified the volume-revenue relationship with sufficient precision for operational use. The log-log elasticity of approximately 1.0 means every 1% increase in PMS throughput produces a ~1% increase in revenue — a near-proportional relationship that makes scenario analysis straightforward. The December coefficient confirms the seasonal uplift finding. The model’s high adjusted R² (≈0.98) means it can serve as a practical forecasting tool for the 2026 liquidity plan.

Single Integrated Recommendation

The company should adopt a throughput-maximisation strategy as its primary lever for improving liquidity reliability, and embed the regression model into its annual cash flow planning process.

Concretely, this means three actions:

  1. Supply reliability first: Station downtime — caused by product stockouts, equipment failures, or delayed tanker deliveries — is the single largest controllable risk to monthly cash inflows. The EDA and regression results show that volume drives virtually all revenue variation. Every hour a high-throughput station is offline costs the treasury a quantifiable amount in foregone inflows.

  2. Pre-position stock in the top-30 corridor ahead of November–December: The visualisation and hypothesis test results confirm that December generates significantly higher cash inflows than October. Ensuring that the top-30 revenue stations (identified in the heatmap) are fully stocked entering November eliminates supply-side constraints at the moment when demand is highest.

  3. Replace average-based station revenue forecasts with the regression model: The current budgeting process uses historical averages. The regression model produces station-level revenue forecasts with confidence intervals, using planned throughput as the input. This gives the CFO and banking partners a more credible, data-driven basis for working capital facility negotiations — and gives the treasury team an early-warning signal when actual volumes diverge from plan.


11. Limitations & Further Work

1. Time horizon: Three months is sufficient to demonstrate the analytical techniques but insufficient for robust seasonal modelling. Extending the dataset to 12–24 months would allow full decomposition of trend, seasonality, and residual components, and would enable ARIMA or Prophet forecasting of monthly network revenue — a natural extension of the regression model built here.

2. Granularity of data: The current dataset contains monthly station-level aggregates. Daily or weekly transaction-level data would enable demand forecasting at finer resolution, support intra-month liquidity timing analysis (when in the month do cash receipts actually arrive?), and allow detection of day-of-week or event-driven demand patterns that monthly aggregates obscure.

3. Missing explanatory variables: The regression model explains revenue from volume, price, product, and month. Variables not available in this dataset — station catchment population, proximity to major roads or industrial clusters, competitor station density, and tanker delivery frequency — could materially improve predictive accuracy and reveal operational levers beyond throughput.

4. Causal inference: All relationships identified here are associational. The correlation and regression results show that volume predicts revenue, but they cannot confirm that increasing volume will cause revenue to rise by the estimated coefficient if other conditions change. A natural experiment — for example, a controlled increase in delivery frequency at a random subset of stations — would be required to estimate the causal effect of supply improvement on throughput.

5. Heteroscedasticity at low-volume stations: The regression diagnostics flagged mild heteroscedasticity at the low end of the fitted value range. Weighted least squares (WLS) or robust standard errors (HC3) would address this in a production version of the model, producing more reliable confidence intervals for the smallest stations in the network.

6. Geospatial analysis: With GPS coordinates for each station, spatial autocorrelation analysis (Moran’s I) could test whether neighbouring stations compete for the same demand pool — an important consideration for network expansion decisions that the current analysis cannot address.


References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with Python. In Proceedings of the 9th Python in Science Conference (pp. 92–96). https://doi.org/10.25080/Majora-92bf1922-011

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

[Your Full Name]. (2026). Rainoil/Fynefield retail fuel sales — Q4 2025 [Dataset]. Extracted from internal ERP system, Finance Department, [Your Company Name], Lagos, Nigeria. Data available on request from the author.


Appendix: AI Usage Statement

Claude (Anthropic, claude.ai) was used during the preparation of this submission to assist with the following: structuring the Quarto document template and section layout; suggesting appropriate R and Python package combinations for each analytical technique; reviewing code syntax and identifying potential errors; and drafting initial versions of the professional disclosure and data collection sections, which were then revised to reflect accurate details of my role and organisation.

All analytical decisions were made independently by the author. These include: the selection of Case Study 1 over the alternatives; the choice of ANOVA and Welch t-test as the hypothesis tests, and the directional framing of the December vs October comparison as one-tailed; the decision to apply log-transformations based on the skewness findings in the EDA section; the selection of the log-log regression specification and the interpretation of β₁ as a revenue elasticity; and all business interpretations connecting statistical outputs to treasury and liquidity management decisions.

No AI tool generated numerical results, interpreted model outputs, or made analytical judgements on my behalf. All figures, coefficients, p-values, and effect sizes were produced by the R and Python code embedded in this document and interpreted by the author. The author takes full responsibility for the accuracy, integrity, and originality of this submission.