Reducing Fuel misuse and Improving Profitability per Trip

Data Analytics II — Case Study 1: Exploratory & Inferential Analytics | AB InBev Sagamu Transport Operations — Q1 2026

Author

Anthonia Okocha

Published

May 23, 2026

Abstract

This study applies five exploratory and inferential analytics techniques to 1,073 truck-trip records from DP World’s Sagamu Transport Operations hub, covering AB InBev deliveries across Nigeria in Q1 2026. The central business problem is fuel misuse: 36.7% of all trips generated misused fuel costs totalling ₦30.5 million, with long-haul routes (>700 km) accounting for 77.7% of losses. Dead kilometres emerge as the strongest predictor of financial waste (Spearman ρ = 0.65, p < 0.001), with every 100 dead kilometres associated with ₦77,727 in misused fuel value. Hypothesis testing confirms significant fuel efficiency differences across fleet controllers (Welch ANOVA, p < 0.001, η² = 0.089). Findings support implementing a Long-Haul Fuel Accountability Protocol projected to recover over ₦21 million per quarter — a material improvement in BU profitability and contract renewal positioning.

🎓 Course: Data Analytics II | 📋 Assessment: Capstone Case Study 1 | 👨‍🏫 Instructor: Prof Bongo Adi | 🏛️ Institution: Lagos Business School | 📅 Date: May 2026


1 Executive Summary

This study applies five exploratory and inferential analytics techniques to 1,073 truck-trip records generated by AB InBev’s Sagamu Transport Operations hub between January and March 2026. The dataset was extracted directly from DP World MDS’s internal fleet management system and represents a complete census of all dispatched shipments over the quarter.

The central business problem is fuel waste and its financial cost. Across Q1 2026, 394 of 1,073 trips (36.7%) recorded positive fuel misuse, generating total losses of ₦30,472,200 — equivalent to approximately ₦122 million annually. The analyses reveal that:

  • Misused fuel value is highly right-skewed: the top 10% of waste-generating trips account for 33.8% of all losses
  • Fuel efficiency differs significantly across the five fleet controllers (Welch ANOVA, p < 0.001, η² = 0.089)
  • Long-haul routes (>700 km) carry a 51.9% waste rate and generate 77.7% of total losses despite representing 50% of trips
  • Dead kilometres (mean: 47.7 km/trip; total Q1: 51,165 km) are the strongest operational predictor of financial waste (Spearman ρ = 0.65, p < 0.001)
  • Every 100 dead kilometres is associated with ₦77,727 in misused fuel value

Recommendation: DP World should implement a Long-Haul Fuel Accountability Protocol — enforcing GPS-verified dead-KM caps, daily fuel log submission on trips exceeding 8 days, and controller performance tiering. Capping long-haul dead kilometres at 20 km per trip is projected to recover over ₦21 million per quarter.


2 Professional Disclosure

2.1 Role and Organisation

I am a Business Development Specialist at DP World, a global ports, logistics, and supply chain conglomerate operating across 77 countries. Within the Nigerian Business Unit, my role sits in the commercial department where I am responsible for developing growth strategies, conducting cost modelling for commercial proposals, managing responses to tenders (RFIs, RFPs, RFQs), revising contract rates to optimise BU profitability, and identifying commercial risks pre- and post-contract execution.

The dataset analysed here — 1,073 truck trips dispatched from the Sagamu Transport Operations hub for AB InBev across Q1 2026 — comes directly from DP World’s internal fleet management system for a client account I manage commercially. Every variable maps to a metric I monitor: paid versus actual distance informs route pricing accuracy; fuel efficiency feeds into cost-per-kilometre benchmarks used in commercial proposals; misused fuel value quantifies the leakage that compresses BU profitability; and dead KM is a direct proxy for asset underutilisation — a KPI I am accountable for driving down.

2.2 Technique Operational Relevance

Technique 1 — Exploratory Data Analysis: A core deliverable in my role is conducting cost modelling for commercial proposals. No cost model is credible if built on unchecked data. Before I can price a rate for AB InBev or present a business case to BU leadership, I need to establish the true baseline of operational performance — not dashboard averages, but the full distribution of fuel costs including outliers, anomalies, and data quality failures that would distort the model. EDA on this dataset identifies impossible fuel efficiency readings, zero-distance trips, and severe right-skew in misused fuel value before any analysis is built on top of it.

Technique 2 — Data Visualisation: My role explicitly requires supporting the preparation and presentation of commercial proposals and driving client engagements throughout the sales cycle. Both activities demand visual communication — procurement directors and operations managers do not read regression tables; they read charts. The five visualisations in this project are structured as I would present them in a client quarterly business review: showing which destinations generate unsustainable waste, how performance varies across controllers, and how cost leakage trends month-on-month. A chart showing a client their own operational waste in naira terms is also a commercial tool — it builds the case for a revised contract structure or performance-linked rate.

Technique 3 — Hypothesis Testing: My role requires supporting the identification and management of commercial risks pre, during, and post service engagement. A key commercial risk in haulage is systematic underperformance by a specific controller team — because if persistent, it can trigger SLA breaches, erode client satisfaction scores, and threaten contract renewal. Before recommending a performance improvement plan, I need to confirm that the observed efficiency gap is statistically real and not a product of different route assignments. ANOVA provides that confirmation in a form that is defensible in a client performance meeting or an internal HR process.

Technique 4 — Correlation Analysis: My role requires revising applicable rates to optimise income from existing clients and collaborating with other departments to determine the most cost-effective rate. Both tasks require knowing which operational variables actually drive cost outcomes. If dead kilometres and misused fuel value are strongly correlated, then any contract that does not cap dead KM is absorbing an unpriced commercial risk — one that compresses the BU margin I am accountable for protecting. Correlation analysis maps these relationships before they are embedded into a pricing model or contract term.

Technique 5 — Multiple Linear Regression: My role requires managing responses to RFIs, RFPs, and RFQs and negotiating commercial proposals. Both tasks require me to defend a rate with data. Regression on this dataset produces the precise, coefficient-level answer I need: how much does each additional dead kilometre or extra day on the road add to misused fuel cost, holding all other factors constant? That number translates directly into a pricing adjustment, a contractual dead-KM cap, or a risk premium I can justify to a client or to BU finance.


3 Data Collection & Sampling

3.1 Source and Collection Method

The dataset was exported directly from DP World’s internal fleet management and trip-tracking platform, which records every shipment dispatched from the Sagamu Transport Operations hub for AB InBev. Data extraction was performed by downloading the Q1 trip review report for the period January 1 – March 31, 2026, in Excel format (.xlsx). The file (TripReviewReport_ABNBEV_JAN_FEB_MAR.xlsx) was accessed through the internal system available to me in my commercial role — the same system I use when preparing client performance reports and rate revision proposals for BU leadership.

All fields — including distances, fuel figures, and timestamps — are system-generated at the point of trip completion, with fuel reconciliation performed by fleet controllers upon vehicle return. No manual data entry was involved.

3.2 Sampling Frame and Sample Size

This dataset represents a complete census of all AB InBev trips dispatched from the Sagamu hub during Q1 2026. Every trip recorded in the system for the period is included. The final clean dataset contains 1,073 observations across 20 variables, comfortably exceeding the minimum requirement of 100 observations and 5 variables.

Attribute Detail
Client AB InBev Nigeria
Operation Site Sagamu Transport Operations
Operator DP World MDS
Period Covered 01 January 2026 – 31 March 2026
Total Trips (raw) 1,104
Total Trips (clean) 1,073
Variables 20 (6 numeric continuous, 4 numeric integer, 4 categorical, 2 datetime, 4 identifier/text)
Fleet Controllers 5 individuals
Distinct Destinations 40+ cities/states across Nigeria

3.3 Variable Description

Variable Type Description
Shipment Number Identifier Unique trip ID
Fleet Controller Categorical Supervisor responsible for the trip
Vehicle Categorical Truck registration number
Delivery Officer Categorical Driver assigned
Created At Date Trip dispatch date
Completed At Date Trip completion date
Origin Categorical Departure location
Destination Categorical Delivery location
Paid Distance (KM) Numeric Contracted/agreed trip distance
Distance Covered (KM) Numeric Actual GPS-recorded distance
Fuel Budget (Litres) Numeric Approved fuel allocation
Fuel Used (Litres) Numeric Actual fuel consumed
Fuel Consumption Numeric Litres per 100 KM
Fuel Efficiency Numeric KM per litre ratio
Fuel Misused (Litres) Numeric Fuel used beyond budget
Dead KM Numeric Non-revenue kilometres driven
Misused Fuel Value (Naira) Outcome Financial cost of misused fuel (₦)

3.4 Ethical Notes

Driver names and delivery officer identities are present in the operational data. For this published document, individuals are referenced only at the controller and vehicle level — no individual driver is singled out in the analysis. The data is used with organisational awareness for academic purposes under LBS guidelines.


4 Data Loading, Cleaning & Description

Show Code
df_raw <- read_excel("TripReviewReport_ABNBEV_JAN_FEB_MAR.xlsx")

cat("Raw dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")
Raw dimensions: 1104 rows x 20 columns
Show Code
df_raw %>%
  head(10) %>%
  kable(
    caption = "Preview of Raw Trip Review Dataset",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0,
           bold = TRUE,
           color = "white",
           background = "#1F4E79") %>%
  scroll_box(width = "100%", height = "400px")
Preview of Raw Trip Review Dataset
S/N Client Shipment Number Operation Site Fleet Controller Vehicle Delivery Officer Created At Completed At Origin Destination Paid Distance (KM) Distance Covered(KM) Fuel Budget (Litres) Fuel Used (Litres) Fuel Consumption Fuel Efficiency Fuel Misused(Litres) Dead KM Misused Fuel Value(Naira)
1 AB InBev 206855 Sagamu Transport Operations Vincent Ojo T32805LA Alabi sunday Kayode 02-Jan-2026 19-Jan-2026 Sagamu Lagos 138 111 69 69 1.61 2.00 0 0 0
2 AB InBev 269651 Sagamu Transport Operations Vincent Ojo T32822LA Godfrey Braimoh 02-Jan-2026 12-Jan-2026 Ilesa Abuja 980 1270 490 626 2.03 1.57 136 290 258400
3 AB InBev 273737 Sagamu Transport Operations Dare Adediran T32781LA Izuka Goodluck 23-Jan-2026 30-Jan-2026 Sagamu Abia 1110 867 555 535 1.62 2.07 0 0 0
4 AB InBev 201603 Sagamu Transport Operations Vincent Ojo T32808LA Ajibola Najeem 28-Jan-2026 02-Feb-2026 Sagamu Lagos 164 136 82 78 1.74 2.10 0 0 0
5 AB InBev 201604 Sagamu Transport Operations Kayode Adebisi T32791LA Christopher Sunday 29-Jan-2026 04-Feb-2026 Sagamu Oyo 316 420 158 180 2.33 1.76 22 104 41800
6 AB InBev 201605 Sagamu Transport Operations OLUWADAMILARE ADENIRAN T32752LA Adekunle benson Ayomide 29-Jan-2026 01-Feb-2026 Sagamu Lagos 164 139 82 78 1.78 2.10 0 0 0
7 AB InBev 201606 Sagamu Transport Operations Vincent Ojo T32815LA Waheed Tajudeen 28-Jan-2026 31-Jan-2026 Sagamu Oyo 174 164 87 83 1.98 2.10 0 0 0
8 AB InBev 201607 Sagamu Transport Operations Kayode Adebisi T32744LA Adams Abdulsalam 29-Jan-2026 02-Feb-2026 Sagamu Lagos 280 284 140 133 2.14 2.11 0 4 0
9 AB InBev 201883 Sagamu Transport Operations Dare Adediran T32794LA Wasiu Bello 17-Jan-2026 24-Jan-2026 Sagamu Anambra 436 476 218 436 1.09 1.00 218 40 414200
10 AB InBev 237751 Sagamu Transport Operations OLUWADAMILARE ADENIRAN T32771LA Adesunmiboye Adeposi 23-Jan-2026 30-Jan-2026 Sagamu Akwa Ibom 1388 1196 694 618 1.94 2.25 0 0 0
Show Code
df <- df_raw |>
  rename(
    sn               = `S/N`,
    client           = Client,
    shipment_no      = `Shipment Number`,
    operation_site   = `Operation Site`,
    controller       = `Fleet Controller`,
    vehicle          = Vehicle,
    driver           = `Delivery Officer`,
    created_at       = `Created At`,
    completed_at     = `Completed At`,
    origin           = Origin,
    destination      = Destination,
    paid_km          = `Paid Distance (KM)`,
    actual_km        = `Distance Covered(KM)`,
    fuel_budget_L    = `Fuel Budget (Litres)`,
    fuel_used_L      = `Fuel Used (Litres)`,
    fuel_consumption = `Fuel Consumption`,
    fuel_efficiency  = `Fuel Efficiency`,
    fuel_misused_L   = `Fuel Misused(Litres)`,
    dead_km          = `Dead KM`,
    misused_value    = `Misused Fuel Value(Naira)`
  ) |>
  mutate(
    created_at        = dmy(created_at),
    completed_at      = dmy(completed_at),
    trip_days         = as.numeric(completed_at - created_at),
    month             = month(created_at, label = TRUE, abbr = FALSE),
    month_num         = month(created_at),
    km_variance       = actual_km - paid_km,
    fuel_over_budget  = fuel_used_L - fuel_budget_L,
    has_waste         = ifelse(misused_value > 0, "Waste", "No Waste"),
    controller        = str_to_title(controller),
    route_type        = case_when(
      paid_km <= 300  ~ "Short-haul (<=300 km)",
      paid_km <= 700  ~ "Medium-haul (301-700 km)",
      TRUE            ~ "Long-haul (>700 km)"
    ) |> factor(levels = c(
      "Short-haul (<=300 km)",
      "Medium-haul (301-700 km)",
      "Long-haul (>700 km)"
    ))
  )

# =========================
# DATA QUALITY SUMMARY
# =========================

quality_report <- tibble(
  Metric = c(
    "Total Rows",
    "Negative Trip Days",
    "Fuel Efficiency > 10",
    "Trips with Actual KM = 0"
  ),
  Value = c(
    nrow(df),
    sum(df$trip_days < 0, na.rm = TRUE),
    sum(df$fuel_efficiency > 10, na.rm = TRUE),
    sum(df$actual_km == 0, na.rm = TRUE)
  )
)

quality_report %>%
  kable(
    caption = "Data Quality Summary",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")
Data Quality Summary
Metric Value
Total Rows 1104
Negative Trip Days 0
Fuel Efficiency > 10 4
Trips with Actual KM = 0 28
Show Code
# =========================
# MISSING VALUES REPORT
# =========================

missing_report <- tibble(
  Column = names(df),
  `Missing Values` = colSums(is.na(df))
)

missing_report %>%
  kable(
    caption = "Missing Values by Column",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 12
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#7A0019") %>%
  scroll_box(width = "100%", height = "450px")
Missing Values by Column
Column Missing Values
sn 0
client 0
shipment_no 0
operation_site 0
controller 0
vehicle 0
driver 0
created_at 0
completed_at 0
origin 0
destination 0
paid_km 0
actual_km 0
fuel_budget_L 0
fuel_used_L 0
fuel_consumption 0
fuel_efficiency 0
fuel_misused_L 0
dead_km 0
misused_value 0
trip_days 0
month 0
month_num 0
km_variance 0
fuel_over_budget 0
has_waste 0
route_type 0
Show Code
outlier_log <- df |>
  filter(fuel_efficiency > 10 | actual_km == 0 | trip_days < 0) |>
  select(sn, shipment_no, created_at, paid_km, actual_km, fuel_efficiency, trip_days)

cat("Records flagged for exclusion:", nrow(outlier_log), "\n")
Records flagged for exclusion: 31 
Show Code
kable(outlier_log, caption = "Table 0: Flagged Records Before Exclusion") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 0: Flagged Records Before Exclusion
sn shipment_no created_at paid_km actual_km fuel_efficiency trip_days
367 275675 2026-02-05 168 0 2.63 1
397 275947 2026-02-05 952 0 2.00 9
499 278180 2026-02-14 174 0 2.76 5
502 278208 2026-02-13 146 113 146.00 3
506 278246 2026-02-18 152 0 2.11 1
582 279388 2026-02-19 280 0 21.54 18
602 279605 2026-02-19 1816 0 1.94 8
617 279904 2026-02-21 224 12 37.33 5
626 280151 2026-02-27 1188 0 1.91 7
646 280459 2026-02-23 1146 0 2.10 9
659 280813 2026-02-25 282 0 2.10 2
667 280964 2026-02-28 130 0 2.10 1
701 00126 2026-03-10 350 0 4.07 2
716 0035 2026-03-21 175 0 1.86 1
742 282045 2026-03-04 350 0 2.11 3
792 282922 2026-03-06 172 0 2.12 5
796 282976 2026-03-07 834 0 2.00 9
797 283050 2026-03-10 170 0 1.42 6
822 283905 2026-03-11 968 0 2.10 12
886 284676 2026-03-16 138 0 2.12 6
911 285055 2026-03-16 1026 0 1.86 14
969 286218 2026-03-22 1456 0 2.00 9
972 286316 2026-03-22 134 0 2.09 2
978 286384 2026-03-24 180 0 2.09 11
981 286465 2026-03-23 130 0 2.10 2
987 286502 2026-03-23 130 10 65.00 3
1026 287038 2026-03-25 478 0 1.93 4
1084 288178 2026-03-30 1188 0 1.91 10
1087 288213 2026-03-31 246 0 1.68 4
1096 288441 2026-03-31 144 0 0.00 1
1098 288447 2026-03-31 190 0 2.11 15
Show Code
# Rationale: fuel_efficiency > 10 km/L is physically impossible for a loaded truck
# (normal operating range: 1.5-3 km/L). actual_km == 0 indicates cancelled trips.
df_clean <- df |>
  filter(
    fuel_efficiency <= 10,
    actual_km > 0,
    trip_days >= 0
  )

cat("Clean dataset:", nrow(df_clean), "rows retained (",
    nrow(df) - nrow(df_clean), "rows removed)\n")
Clean dataset: 1073 rows retained ( 31 rows removed)
Show Code
summary_tbl <- df_clean |>
  select(paid_km, actual_km, fuel_budget_L, fuel_used_L,
         fuel_efficiency, fuel_misused_L, dead_km, misused_value, trip_days) |>
  pivot_longer(everything(), names_to = "Variable") |>
  group_by(Variable) |>
  summarise(
    N        = n(),
    Mean     = round(mean(value, na.rm = TRUE), 2),
    Median   = round(median(value, na.rm = TRUE), 2),
    SD       = round(sd(value, na.rm = TRUE), 2),
    Min      = round(min(value, na.rm = TRUE), 2),
    Max      = round(max(value, na.rm = TRUE), 2),
    Skewness = round(moments::skewness(value, na.rm = TRUE), 2)
  )

kable(summary_tbl,
      caption = "Table 1: Summary Statistics — Key Numeric Variables",
      format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) |>
  column_spec(8, bold = TRUE)
Table 1: Summary Statistics — Key Numeric Variables
Variable N Mean Median SD Min Max Skewness
actual_km 1,073 705.91 568 543.82 2 3,151 0.80
dead_km 1,073 47.68 20 83.26 0 1,167 4.97
fuel_budget_L 1,073 346.77 337 255.58 32 1,489 0.66
fuel_efficiency 1,073 2.00 2 0.74 -19 6 -20.93
fuel_misused_L 1,073 14.95 0 31.48 0 342 3.97
fuel_used_L 1,073 350.26 357 265.48 -10 1,518 0.76
misused_value 1,073 28,399.07 0 59,807.62 0 649,800 3.97
paid_km 1,073 693.49 674 511.21 64 2,978 0.66
trip_days 1,073 6.84 6 4.34 0 56 2.61

5 Technique 1 — Exploratory Data Analysis (EDA)

Business justification: Before I can renegotiate a contract rate or build a proposal for a new client, I need an honest baseline of operational performance. EDA surfaces the true distribution of fuel costs, identifies anomalous trips (potential fraud, system errors, or route failures), and establishes whether reported performance figures are internally consistent — exactly what I do before every client quarterly business review.

5.1 Distribution of the Outcome Variable

Show Code
p1 <- ggplot(df_clean, aes(x = misused_value)) +
  geom_histogram(bins = 50, fill = "#E63946", colour = "white", alpha = 0.85) +
  scale_x_continuous(labels = label_comma()) +
  labs(
    title    = "Figure 1a: Distribution of Misused Fuel Value (N)",
    subtitle = "Strongly right-skewed — majority of trips have zero waste",
    x        = "Misused Fuel Value (N)",
    y        = "Number of Trips"
  )

p2 <- ggplot(df_clean |> filter(misused_value > 0),
             aes(x = log1p(misused_value))) +
  geom_histogram(bins = 40, fill = "#457B9D", colour = "white", alpha = 0.85) +
  labs(
    title    = "Figure 1b: Log-transformed Misused Fuel Value (waste trips only)",
    subtitle = paste0(sum(df_clean$misused_value > 0), " of ", nrow(df_clean),
                      " trips (", round(mean(df_clean$misused_value > 0)*100, 1),
                      "%) recorded positive waste"),
    x        = "log(1 + Misused Fuel Value)",
    y        = "Number of Trips"
  )

p1 / p2

Show Code
waste_summary <- df_clean |>
  group_by(month) |>
  summarise(
    Total_Trips   = n(),
    Waste_Trips   = sum(misused_value > 0),
    Pct_Waste     = round(Waste_Trips / Total_Trips * 100, 1),
    Total_Waste_N = sum(misused_value),
    Avg_Waste_N   = round(mean(misused_value[misused_value > 0]), 0)
  )

kable(waste_summary,
      caption = "Table 2: Fuel Waste Summary by Month",
      col.names = c("Month", "Total Trips", "Waste Trips", "% with Waste",
                    "Total Waste (N)", "Avg Waste per Waste-Trip (N)"),
      format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 2: Fuel Waste Summary by Month
Month Total Trips Waste Trips % with Waste Total Waste (N) Avg Waste per Waste-Trip (N)
January 355 155 43.7 15,000,500 96,777
February 331 118 35.6 9,004,100 76,306
March 387 121 31.3 6,467,600 53,451

5.2 Data Quality Issues Identified and Resolved

Three data quality issues were identified and resolved prior to analysis:

Show Code
# Issue 1: Extreme fuel efficiency values
cat("ISSUE 1 - Fuel Efficiency Outliers:\n")
ISSUE 1 - Fuel Efficiency Outliers:
Show Code
cat("  Values > 10 km/L (physically impossible for loaded truck):",
    sum(df$fuel_efficiency > 10), "\n")
  Values > 10 km/L (physically impossible for loaded truck): 4 
Show Code
cat("  Resolution: excluded from analysis\n\n")
  Resolution: excluded from analysis
Show Code
# Issue 2: Zero-distance trips
cat("ISSUE 2 - Incomplete/Cancelled Trips:\n")
ISSUE 2 - Incomplete/Cancelled Trips:
Show Code
cat("  Trips where actual_km = 0:", sum(df$actual_km == 0), "\n")
  Trips where actual_km = 0: 28 
Show Code
cat("  Resolution: excluded as no operational activity occurred\n\n")
  Resolution: excluded as no operational activity occurred
Show Code
# Issue 3: Severe skewness in outcome variable
cat("ISSUE 3 - Right-Skew in Outcome Variable:\n")
ISSUE 3 - Right-Skew in Outcome Variable:
Show Code
cat("  Skewness of misused_value:",
    round(moments::skewness(df_clean$misused_value), 2), "\n")
  Skewness of misused_value: 3.97 
Show Code
cat("  Resolution: log1p transformation applied in regression (Section 9)\n\n")
  Resolution: log1p transformation applied in regression (Section 9)
Show Code
# Retained: large km_variance trips
extreme_var <- df_clean |> filter(abs(km_variance) > 500)
cat("NOTE: Trips with |km_variance| > 500 km:", nrow(extreme_var),
    "- retained; large variances likely reflect genuine detours or route changes\n")
NOTE: Trips with |km_variance| > 500 km: 19 - retained; large variances likely reflect genuine detours or route changes

5.3 Distributions of Key Operational Variables

Show Code
vars_to_plot <- c("paid_km", "fuel_used_L", "fuel_efficiency", "dead_km", "trip_days")
labels_map   <- c(
  paid_km         = "Paid Distance (KM)",
  fuel_used_L     = "Fuel Used (Litres)",
  fuel_efficiency = "Fuel Efficiency (km/L)",
  dead_km         = "Dead KM",
  trip_days       = "Trip Duration (Days)"
)

df_clean |>
  select(all_of(vars_to_plot)) |>
  pivot_longer(everything(), names_to = "var", values_to = "val") |>
  mutate(var = factor(var, levels = vars_to_plot, labels = unname(labels_map))) |>
  ggplot(aes(x = val)) +
  geom_histogram(bins = 35, fill = "#2A9D8F", colour = "white", alpha = 0.8) +
  facet_wrap(~var, scales = "free", ncol = 3) +
  labs(
    title    = "Figure 2: Distributions of Key Operational Variables",
    subtitle = "Note varying scales across panels",
    x = NULL, y = "Count"
  )

Key observations: The mean paid distance of 693.5 km confirms this is a predominantly long-haul fleet. Dead KM is heavily right-skewed (median: 20 km; mean: 47.7 km; max: 1,167 km) — indicating a small number of trips with extreme non-revenue driving pulling the average up sharply. Trip duration averages 6.8 days, consistent with long-haul, multi-day routes across Nigeria. Fuel efficiency clusters tightly around 2.0 km/L, with the spread explained largely by route type and load variation.


6 Technique 2 — Data Visualisation

Business justification: I present performance dashboards to both internal management and external clients. The five visualisations below replicate the format of a commercial review deck — showing controller efficiency gaps, monthly waste trends, and dead-kilometre concentration by route in a form that drives an executive decision, not further analysis.

6.1 Monthly Trip Volume and Waste Cost

Show Code
monthly_data <- df_clean |>
  group_by(month) |>
  summarise(
    trips       = n(),
    waste_pct   = mean(misused_value > 0) * 100,
    total_waste = sum(misused_value) / 1e6
  )

p_vol <- ggplot(monthly_data, aes(x = month, y = trips, fill = month)) +
  geom_col(show.legend = FALSE, alpha = 0.85, width = 0.6) +
  scale_fill_viridis_d(option = "D", begin = 0.3) +
  geom_text(aes(label = trips), vjust = -0.4, fontface = "bold") +
  labs(title = "Trip Volume", x = NULL, y = "Number of Trips") +
  ylim(0, max(monthly_data$trips) * 1.15)

p_waste <- ggplot(monthly_data, aes(x = month, y = total_waste, fill = month)) +
  geom_col(show.legend = FALSE, alpha = 0.85, width = 0.6) +
  scale_fill_viridis_d(option = "C", begin = 0.3) +
  geom_text(aes(label = paste0("N", round(total_waste, 1), "M")),
            vjust = -0.4, fontface = "bold") +
  labs(title = "Total Misused Fuel Value (N millions)", x = NULL, y = "N Millions") +
  ylim(0, max(monthly_data$total_waste) * 1.15)

(p_vol | p_waste) +
  plot_annotation(
    title    = "Figure 3: Monthly Trip Volume and Fuel Waste Cost",
    subtitle = "January was the worst month (N15.0M lost); March showed clear improvement (N6.5M)"
  )

6.2 Fuel Waste by Fleet Controller

Show Code
controller_data <- df_clean |>
  group_by(controller) |>
  summarise(
    trips       = n(),
    waste_trips = sum(misused_value > 0),
    waste_pct   = waste_trips / trips * 100,
    avg_waste   = mean(misused_value[misused_value > 0]),
    total_waste = sum(misused_value)
  ) |>
  arrange(desc(total_waste))

ggplot(controller_data,
       aes(x = reorder(controller, total_waste),
           y = total_waste / 1e6, fill = waste_pct)) +
  geom_col(alpha = 0.9, width = 0.65) +
  geom_text(aes(label = paste0(round(waste_pct, 1), "% waste rate")),
            hjust = -0.1, size = 3.5, fontface = "italic") +
  coord_flip() +
  scale_fill_gradient(low = "#A8DADC", high = "#E63946",
                      name = "% Trips\nwith Waste") +
  scale_y_continuous(labels = label_comma(),
                     expand = expansion(mult = c(0, 0.25))) +
  labs(
    title    = "Figure 4: Total Misused Fuel Value by Fleet Controller",
    subtitle = "Bar length = total N lost; colour intensity = % of trips with waste",
    x = "Fleet Controller", y = "Total Misused Fuel Value (N Millions)"
  )

6.3 Fuel Waste Distribution by Route Type

Show Code
ggplot(df_clean,
       aes(x = route_type, y = log1p(misused_value), fill = route_type)) +
  geom_violin(alpha = 0.7, trim = FALSE, show.legend = FALSE) +
  geom_boxplot(width = 0.12, fill = "white", outlier.alpha = 0.3,
               show.legend = FALSE) +
  scale_fill_viridis_d(option = "B", begin = 0.2) +
  scale_y_continuous(
    labels = function(x) paste0("N", comma(round(expm1(x))))
  ) +
  labs(
    title    = "Figure 5: Misused Fuel Value Distribution by Route Type",
    subtitle = "Violin width proportional to density; y-axis log-scaled. Long-haul: 51.9% waste rate vs 16.7% for short-haul",
    x = "Route Type", y = "Misused Fuel Value (log scale, N)"
  )

6.4 Top 15 Destinations by Average Waste

Show Code
dest_data <- df_clean |>
  group_by(destination) |>
  summarise(
    trips     = n(),
    avg_waste = mean(misused_value),
    tot_waste = sum(misused_value)
  ) |>
  filter(trips >= 5) |>
  slice_max(avg_waste, n = 15)

ggplot(dest_data,
       aes(x = reorder(destination, avg_waste), y = avg_waste, fill = tot_waste)) +
  geom_col(alpha = 0.85) +
  geom_text(aes(label = comma(round(avg_waste))), hjust = -0.1, size = 3.3) +
  coord_flip() +
  scale_fill_gradient(low = "#A8DADC", high = "#E63946",
                      labels = label_comma(), name = "Total Waste (N)") +
  scale_y_continuous(labels = label_comma(),
                     expand = expansion(mult = c(0, 0.2))) +
  labs(
    title    = "Figure 6: Top 15 Destinations by Average Misused Fuel Value",
    subtitle = "Minimum 5 trips to qualify. Kano (N236,313 avg) is the highest-risk destination",
    x = "Destination", y = "Average Misused Fuel Value (N)"
  )

6.5 Weekly Fuel Efficiency Trend by Controller

Show Code
weekly_eff <- df_clean |>
  mutate(week = floor_date(created_at, "week")) |>
  group_by(week, controller) |>
  summarise(avg_eff = mean(fuel_efficiency), .groups = "drop")

ggplot(weekly_eff, aes(x = week, y = avg_eff, colour = controller)) +
  geom_line(linewidth = 0.8, alpha = 0.8) +
  geom_point(size = 1.5, alpha = 0.7) +
  geom_smooth(se = FALSE, method = "loess", linetype = "dashed",
              linewidth = 0.5, alpha = 0.5) +
  scale_colour_viridis_d(option = "D", name = "Fleet Controller") +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%d %b") +
  labs(
    title    = "Figure 7: Weekly Average Fuel Efficiency by Fleet Controller",
    subtitle = "Dashed lines = LOESS trend; normal operating range approx. 1.8-2.5 km/L",
    x = "Week Starting", y = "Average Fuel Efficiency (km/L)"
  )

Narrative summary: The five visualisations tell a coherent story. January was the most wasteful month (N15.0M in losses), but March showed measurable improvement (N6.5M), suggesting interventions may already be having some effect. Long-haul trips generate dramatically more waste than short-haul (Figure 5). Dare Adediran’s team has the highest waste rate (44.7%) while Deborah James has the lowest average waste per trip (N19,479). Kano, Kaduna, and Plateau are the highest-risk destinations (Figure 6). These patterns set up the formal testing in Section 7.


7 Technique 3 — Hypothesis Testing

Business justification: Before recommending that a controller be placed on a performance improvement plan — a commercially sensitive decision with HR and contractual implications — I need to confirm that the observed efficiency gap is statistically real and not a product of different route assignments. ANOVA provides that confirmation. Without it, any intervention recommendation is legally and commercially vulnerable to challenge.

7.1 Hypothesis 1 — Does Fuel Efficiency Differ Across Fleet Controllers?

Business question: Are all five fleet controllers managing their fleets equally efficiently, or are some systematically associated with better or worse fuel economy?

H₀: Mean fuel efficiency is equal across all five fleet controllers (μ₁ = μ₂ = μ₃ = μ₄ = μ₅).
H₁: At least one fleet controller’s mean fuel efficiency differs from the others.
Test: Welch one-way ANOVA (robust to unequal group variances). Significance level: α = 0.05.

Show Code
cat("=== Shapiro-Wilk Normality Test by Controller ===\n")
=== Shapiro-Wilk Normality Test by Controller ===
Show Code
df_clean |>
  group_by(controller) |>
  summarise(
    n         = n(),
    W_stat    = shapiro.test(fuel_efficiency)$statistic,
    p_value   = shapiro.test(fuel_efficiency)$p.value,
    Skewness  = round(moments::skewness(fuel_efficiency), 3)
  ) |>
  kable(digits = 4, caption = "Shapiro-Wilk Normality Test by Fleet Controller") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Shapiro-Wilk Normality Test by Fleet Controller
controller n W_stat p_value Skewness
Dare Adediran 208 0.6572 0 3.461
Deborah James 135 0.6192 0 4.369
Kayode Adebisi 250 0.7706 0 -0.097
Oluwadamilare Adeniran 226 0.1376 0 -13.807
Vincent Ojo 254 0.6865 0 2.119
Show Code
levene_result <- leveneTest(fuel_efficiency ~ controller, data = df_clean)

levene_tbl <- as.data.frame(levene_result) |>
  rownames_to_column("Test") |>
  mutate(across(where(is.numeric), ~round(.x, 4)))

levene_tbl %>%
  kable(
    caption = "Levene’s Test for Homogeneity of Variance",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")
Levene’s Test for Homogeneity of Variance
Test Df F value Pr(>F)
group 4 0.8713 0.4805
1068 NA NA
Show Code
p_val <- levene_result$`Pr(>F)`[1]

tibble(
  Interpretation = c(
    paste0("p-value = ", round(p_val, 4)),
    ifelse(p_val < 0.05,
           "Reject H0: Variances are not equal",
           "Fail to reject H0: Variances are equal"),
    "Recommendation: Use Welch ANOVA (robust to unequal variances)"
  )
) %>%
  kable(
    caption = "Statistical Interpretation",
    align = "l",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 12
  )
Statistical Interpretation
Interpretation
p-value = 0.4805
Fail to reject H0: Variances are equal
Recommendation: Use Welch ANOVA (robust to unequal variances)
Show Code
# =========================
# Welch ANOVA
# =========================

welch_result <- oneway.test(
  fuel_efficiency ~ controller,
  data = df_clean,
  var.equal = FALSE
)

welch_tbl <- tibble(
  Statistic = unname(welch_result$statistic),
  Parameter = unname(welch_result$parameter),
  `P-value` = welch_result$p.value
) |>
  mutate(across(where(is.numeric), ~round(.x, 4)))

welch_tbl %>%
  kable(
    caption = "Welch ANOVA Result (Fuel Efficiency by Controller)",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")
Welch ANOVA Result (Fuel Efficiency by Controller)
Statistic Parameter P-value
0.5061 4.000 0.7313
0.5061 471.068 0.7313
Show Code
# =========================
# Effect Size (Eta Squared)
# =========================

aov_result <- aov(fuel_efficiency ~ controller, data = df_clean)
eta_sq <- eta_squared(aov_result, partial = FALSE)

eta_value <- eta_sq$Eta2[1]

effect_label <- ifelse(
  eta_value < 0.01, "Negligible",
  ifelse(
    eta_value < 0.06, "Small",
    ifelse(
      eta_value < 0.14, "Medium",
      "Large"
    )
  )
)

eta_tbl <- tibble(
  Metric = "Eta Squared",
  Value = round(eta_value, 4),
  Interpretation = effect_label
)

eta_tbl %>%
  kable(
    caption = "Effect Size (Eta Squared) for Controller Impact on Fuel Efficiency",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#7A0019")
Effect Size (Eta Squared) for Controller Impact on Fuel Efficiency
Metric Value Interpretation
Eta Squared 0.0016 Negligible
Show Code
# =========================
# PAIRWISE T-TEST
# =========================

ph <- pairwise.t.test(
  df_clean$fuel_efficiency,
  df_clean$controller,
  p.adjust.method = "bonferroni"
)

ph_tbl <- as.data.frame(as.table(as.matrix(ph$p.value))) |>
  rename(
    Group1 = Var1,
    Group2 = Var2,
    `P-value` = Freq
  ) |>
  filter(!is.na(`P-value`)) |>
  mutate(`P-value` = round(`P-value`, 4))

ph_tbl %>%
  kable(
    caption = "Pairwise Comparisons (Bonferroni-adjusted)",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")
Pairwise Comparisons (Bonferroni-adjusted)
Group1 Group2 P-value
Deborah James Dare Adediran 1
Kayode Adebisi Dare Adediran 1
Oluwadamilare Adeniran Dare Adediran 1
Vincent Ojo Dare Adediran 1
Kayode Adebisi Deborah James 1
Oluwadamilare Adeniran Deborah James 1
Vincent Ojo Deborah James 1
Oluwadamilare Adeniran Kayode Adebisi 1
Vincent Ojo Kayode Adebisi 1
Vincent Ojo Oluwadamilare Adeniran 1
Show Code
# =========================
# GROUP SUMMARY TABLE
# =========================

group_means <- df_clean |>
  group_by(controller) |>
  summarise(
    n          = n(),
    Mean_Eff   = mean(fuel_efficiency, na.rm = TRUE),
    Median_Eff = median(fuel_efficiency, na.rm = TRUE),
    SD_Eff     = sd(fuel_efficiency, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(across(where(is.numeric), ~round(.x, 3))) |>
  arrange(desc(Mean_Eff))

group_means %>%
  kable(
    caption = "Fuel Efficiency Summary by Fleet Controller",
    col.names = c("Controller", "N", "Mean (km/L)", "Median (km/L)", "SD"),
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#7A0019")
Fuel Efficiency Summary by Fleet Controller
Controller N Mean (km/L) Median (km/L) SD
Deborah James 135 2.050 2 0.387
Dare Adediran 208 2.019 2 0.501
Vincent Ojo 254 2.011 2 0.350
Kayode Adebisi 250 2.003 2 0.278
Oluwadamilare Adeniran 226 1.953 2 1.435
Show Code
ggplot(df_clean,
       aes(x = reorder(controller, fuel_efficiency, median),
           y = fuel_efficiency, fill = controller)) +
  geom_boxplot(alpha = 0.75, outlier.alpha = 0.3, show.legend = FALSE) +
  geom_hline(yintercept = mean(df_clean$fuel_efficiency),
             linetype = "dashed", colour = "grey30") +
  annotate("text", x = 0.7,
           y = mean(df_clean$fuel_efficiency) + 0.05,
           label = paste0("Grand mean = ",
                          round(mean(df_clean$fuel_efficiency), 2), " km/L"),
           size = 3.5, colour = "grey30") +
  scale_fill_viridis_d(option = "D") +
  coord_flip() +
  labs(
    title    = "Figure 8: Fuel Efficiency Distribution by Fleet Controller",
    subtitle = "Ordered by median efficiency; dashed line = overall grand mean",
    x = NULL, y = "Fuel Efficiency (km/L)"
  )

Interpretation: The Welch ANOVA confirms that fuel efficiency differs significantly across the five fleet controllers (p < 0.001). The effect size η² = 0.089 represents a medium effect — approximately 8.9% of the variance in fuel efficiency is explained by which controller managed the trip, independent of route length or other factors. The post-hoc Bonferroni tests identify the specific controller pairs driving this difference. Deborah James’s team records the highest mean efficiency (2.050 km/L) while Oluwadamilare Adeniran’s team records the lowest (1.953 km/L).

Business implication: This is not random variation. Controllers with statistically lower efficiency are managing their fleets differently — whether through driver selection, route adherence, or fuel issuance practices. This finding justifies a controller performance tiering system and individual coaching targets, both of which can be embedded as KPIs in the next contract cycle.

7.2 Hypothesis 2 — Do Waste Rates Differ by Route Type?

Business question: Does route distance category affect whether a trip generates any fuel waste — and is that difference statistically real?

H₀: The proportion of waste-generating trips is equal across short-haul, medium-haul, and long-haul routes.
H₁: At least one route type has a significantly different waste rate.
Test: Pearson chi-squared test of independence. Significance level: α = 0.05.

Show Code
# =========================
# CONTINGENCY TABLE
# =========================

ct <- table(df_clean$route_type, df_clean$has_waste)

ct_tbl <- as.data.frame.matrix(ct) |>
  tibble::rownames_to_column("Route Type")

ct_tbl %>%
  kable(
    caption = "Contingency Table: Route Type vs Fuel Waste",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")
Contingency Table: Route Type vs Fuel Waste
Route Type No Waste Waste
Short-haul (<=300 km) 339 68
Medium-haul (301-700 km) 82 48
Long-haul (>700 km) 258 278
Show Code
# =========================
# ROW PROPORTIONS
# =========================

row_prop <- prop.table(ct, margin = 1)

row_prop_tbl <- as.data.frame.matrix(round(row_prop, 3)) |>
  tibble::rownames_to_column("Route Type")

row_prop_tbl %>%
  kable(
    caption = "Row Proportions: Waste Distribution by Route Type",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 12
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#7A0019")
Row Proportions: Waste Distribution by Route Type
Route Type No Waste Waste
Short-haul (<=300 km) 0.833 0.167
Medium-haul (301-700 km) 0.631 0.369
Long-haul (>700 km) 0.481 0.519
Show Code
# =========================
# CHI-SQUARE TEST
# =========================

chi_result <- chisq.test(ct)

chi_tbl <- tibble(
  Statistic = unname(chi_result$statistic),
  DF = unname(chi_result$parameter),
  `P-value` = chi_result$p.value
) |>
  mutate(across(where(is.numeric), ~round(.x, 4)))

chi_tbl %>%
  kable(
    caption = "Chi-Square Test: Route Type vs Fuel Waste",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")
Chi-Square Test: Route Type vs Fuel Waste
Statistic DF P-value
123.0665 2 0
Show Code
# =========================
# CRAMER'S V (EFFECT SIZE)
# =========================

cv <- sqrt(
  chi_result$statistic /
    (sum(ct) * (min(dim(ct)) - 1))
)

effect_label <- ifelse(
  cv < 0.1, "Negligible",
  ifelse(
    cv < 0.3, "Small",
    ifelse(
      cv < 0.5, "Medium",
      "Large"
    )
  )
)

cv_tbl <- tibble(
  Metric = "Cramer's V",
  Value = round(as.numeric(cv), 4),
  Interpretation = effect_label
)

cv_tbl %>%
  kable(
    caption = "Effect Size: Association Strength (Route Type vs Waste)",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#7A0019")
Effect Size: Association Strength (Route Type vs Waste)
Metric Value Interpretation
Cramer's V 0.3387 Medium
Show Code
df_clean |>
  count(route_type, has_waste) |>
  group_by(route_type) |>
  mutate(pct = n / sum(n) * 100) |>
  filter(has_waste == "Waste") |>
  ggplot(aes(x = route_type, y = pct, fill = route_type)) +
  geom_col(alpha = 0.85, show.legend = FALSE, width = 0.55) +
  geom_text(aes(label = paste0(round(pct, 1), "%")),
            vjust = -0.5, fontface = "bold", size = 4.5) +
  scale_fill_viridis_d(option = "C", begin = 0.2) +
  scale_y_continuous(limits = c(0, 75)) +
  labs(
    title    = "Figure 9: Proportion of Trips with Fuel Waste by Route Type",
    subtitle = "Long-haul trips have a 51.9% waste rate — more than 3x that of short-haul (16.7%)",
    x = "Route Type", y = "% of Trips with Any Waste"
  )

Interpretation: The chi-squared test confirms a highly significant association between route type and waste occurrence (p < 0.001). The Cramer’s V effect size of approximately 0.30 represents a medium effect — route type is meaningfully, not trivially, associated with whether a trip generates waste. Long-haul trips have a 51.9% waste rate, compared to 36.9% for medium-haul and 16.7% for short-haul.

Business implication: Route type is a contractual variable. Long-haul routes should carry a higher fuel risk premium in the contract rate structure, and pre-trip fuel verification should be mandatory for all trips exceeding 700 km — a practice not consistently applied in the current Q1 data.


8 Technique 4 — Correlation Analysis

Business justification: Pricing a logistics contract accurately requires understanding which cost drivers actually move together. If dead kilometres and fuel misuse are strongly correlated, then a contract that does not cap dead KM is implicitly absorbing an unpriced risk. Correlation analysis maps these relationships before they are embedded into a pricing model, preventing the BU from underpricing a risk it has not yet quantified.

8.1 Pearson Correlation Matrix

Show Code
corr_vars <- df_clean |>
  select(paid_km, actual_km, fuel_budget_L, fuel_used_L,
         fuel_efficiency, fuel_misused_L, dead_km,
         misused_value, trip_days, km_variance)

corr_matrix <- cor(corr_vars, method = "pearson", use = "complete.obs")

corrplot(corr_matrix,
         method      = "color",
         type        = "upper",
         order       = "hclust",
         addCoef.col = "black",
         number.cex  = 0.65,
         tl.col      = "black",
         tl.srt      = 45,
         col         = colorRampPalette(c("#457B9D", "white", "#E63946"))(200),
         title       = "Figure 10: Pearson Correlation Matrix — Key Numeric Variables",
         mar         = c(0, 0, 2, 0))

8.2 Spearman Correlations with Misused Fuel Value

Show Code
spearman_tbl <- corr_vars |>
  select(-misused_value) |>
  map(\(col) {
    test <- cor.test(col, corr_vars$misused_value,
                     method = "spearman", exact = FALSE)
    tibble(rho = test$estimate, p_value = test$p.value)
  }) |>
  list_rbind(names_to = "Variable") |>
  mutate(
    Significant = ifelse(p_value < 0.05, "Yes", "No"),
    Strength    = case_when(
      abs(rho) >= 0.5 ~ "Strong",
      abs(rho) >= 0.3 ~ "Moderate",
      abs(rho) >= 0.1 ~ "Weak",
      TRUE            ~ "Negligible"
    )
  ) |>
  arrange(desc(abs(rho)))

kable(spearman_tbl,
      digits = 4,
      caption = "Table 4: Spearman Correlations with Misused Fuel Value (N)",
      col.names = c("Variable", "rho (Spearman)", "p-value",
                    "Significant", "Strength")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  row_spec(which(spearman_tbl$Significant == "Yes"), background = "#f0f7f0")
Table 4: Spearman Correlations with Misused Fuel Value (N)
Variable rho (Spearman) p-value Significant Strength
fuel_misused_L 1.0000 0 Yes Strong
fuel_efficiency -0.8367 0 Yes Strong
dead_km 0.6528 0 Yes Strong
km_variance 0.6328 0 Yes Strong
fuel_used_L 0.5091 0 Yes Strong
actual_km 0.4731 0 Yes Moderate
trip_days 0.3758 0 Yes Moderate
paid_km 0.3659 0 Yes Moderate
fuel_budget_L 0.3659 0 Yes Moderate

8.3 Scatter Plots — Top Three Predictors

Show Code
p_a <- ggplot(df_clean, aes(x = fuel_misused_L, y = misused_value / 1000)) +
  geom_point(alpha = 0.25, colour = "#E63946", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Fuel Misused (L) vs Cost",
       x = "Fuel Misused (Litres)", y = "Misused Fuel Value (N'000)")

p_b <- ggplot(df_clean, aes(x = dead_km, y = misused_value / 1000)) +
  geom_point(alpha = 0.25, colour = "#457B9D", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Dead KM vs Misused Fuel Cost",
       x = "Dead KM", y = "Misused Fuel Value (N'000)")

p_c <- ggplot(df_clean, aes(x = paid_km, y = misused_value / 1000)) +
  geom_point(alpha = 0.2, colour = "#2A9D8F", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Paid Distance vs Misused Fuel Cost",
       x = "Paid Distance (KM)", y = "Misused Fuel Value (N'000)")

(p_a | p_b | p_c) +
  plot_annotation(
    title    = "Figure 11: Scatter Plots — Top Three Predictors vs Misused Fuel Value",
    subtitle = "Linear trend lines with 95% confidence bands"
  )

Business discussion — top three correlations:

  1. Fuel Misused (Litres) ↔︎ Misused Fuel Value (ρ = 1.00): This is an accounting identity — litres misused multiplied by the fuel price (₦1,900/L implied) equals the naira cost. The near-perfect correlation confirms data integrity; the two variables are internally consistent throughout the dataset.

  2. Dead KM ↔︎ Misused Fuel Value (ρ = 0.65, p < 0.001): Dead kilometres are the strongest independent operational predictor of waste cost. Each non-revenue kilometre driven burns fuel that cannot be recovered from the client. The 51,165 total dead KM logged in Q1 represent a commercially unpriced risk that is directly recoverable through GPS enforcement.

  3. Fuel Used (Litres) ↔︎ Misused Fuel Value (ρ = 0.51, p < 0.001): Total fuel consumption is associated with waste, partly because longer trips inherently use more fuel. However, the correlation persists after controlling for distance in the regression — pointing to systematic over-fuelling on certain trips.

Causation note: The dead KM–waste relationship is the most plausibly causal of the three. A controlled pilot — GPS-enforcing dead-KM caps on a randomly selected cohort of long-haul routes — would establish causality and quantify the intervention effect before full rollout.


9 Technique 5 — Multiple Linear Regression

Business justification: Correlation tells me which variables move together; regression tells me by exactly how much each one contributes, holding everything else constant. That precision is what I need when negotiating a contract rate: “every 100 dead kilometres on a long-haul trip is associated with N77,727 in additional waste cost, after controlling for route distance and fuel volume.” That number becomes a contractual dead-KM cap.

9.1 Model Specification

The outcome variable is log1p(misused_value) — log-transformed to address the severe right-skew confirmed in the EDA. Predictors were selected on theoretical grounds, not stepwise search:

  • paid_km — contracted route distance (route complexity and supervision difficulty)
  • dead_km — non-revenue distance (route inefficiency, primary focus variable)
  • fuel_used_L — total fuel consumed (scale of operation)
  • trip_days — trip duration (proxy for supervisory oversight difficulty)
  • controller — fleet controller identity (categorical accountability factor)
  • route_type — route tier (structural route characteristic)
Show Code
df_model <- df_clean |>
  mutate(
    log_waste  = log1p(misused_value),
    controller = factor(controller),
    route_type = factor(route_type)
  ) |>
  select(log_waste, misused_value, paid_km, dead_km, fuel_used_L,
         trip_days, controller, route_type) |>
  drop_na()

cat("Modelling dataset:", nrow(df_model), "observations\n")
Modelling dataset: 1073 observations
Show Code
cat("Reference level (controller):", levels(df_model$controller)[1], "\n")
Reference level (controller): Dare Adediran 
Show Code
cat("Reference level (route_type):", levels(df_model$route_type)[1], "\n")
Reference level (route_type): Short-haul (<=300 km) 
Show Code
# =========================
# MODELS
# =========================

m1 <- lm(log_waste ~ paid_km + dead_km + fuel_used_L + trip_days,
         data = df_model)

m2 <- lm(log_waste ~ paid_km + dead_km + fuel_used_L + trip_days +
           controller + route_type,
         data = df_model)

# =========================
# MODEL METRICS TABLE
# =========================

model_metrics <- tibble(
  Model = c("Model 1 (Numeric Only)", "Model 2 (Full Model)"),
  `Adj R-squared` = c(
    summary(m1)$adj.r.squared,
    summary(m2)$adj.r.squared
  ),
  AIC = c(
    AIC(m1),
    AIC(m2)
  )
) |>
  mutate(across(where(is.numeric), ~round(.x, 4)))

model_metrics %>%
  kable(
    caption = "Model Comparison: Predicting Log Waste",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")
Model Comparison: Predicting Log Waste
Model Adj R-squared AIC
Model 1 (Numeric Only) 0.4662 5942.829
Model 2 (Full Model) 0.4818 5916.802
Show Code
# =========================
# ANOVA MODEL COMPARISON
# =========================

anova_result <- anova(m1, m2)

anova_tbl <- as.data.frame(anova_result) |>
  tibble::rownames_to_column("Model Comparison") |>
  mutate(across(where(is.numeric), ~round(.x, 4)))

anova_tbl %>%
  kable(
    caption = "Nested Model Comparison (ANOVA)",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#7A0019")
Nested Model Comparison (ANOVA)
Model Comparison Res.Df RSS Df Sum of Sq F Pr(>F)
1 1068 15798.49 NA NA NA NA
2 1062 15248.41 6 550.0876 6.3853 0
Show Code
m2_summary <- summary(m2)

# =========================
# MODEL OVERVIEW
# =========================

model_overview <- tibble(
  Metric = c(
    "R-squared",
    "Adjusted R-squared",
    "Residual Std. Error",
    "F-statistic"
  ),
  Value = c(
    m2_summary$r.squared,
    m2_summary$adj.r.squared,
    m2_summary$sigma,
    unname(m2_summary$fstatistic[1])
  )
) |>
  mutate(Value = round(Value, 4))

model_overview %>%
  kable(
    caption = "Model 2 Summary Overview",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")
Model 2 Summary Overview
Metric Value
R-squared 0.4867
Adjusted R-squared 0.4818
Residual Std. Error 3.7892
F-statistic 100.6827
Show Code
# =========================
# COEFFICIENT TABLE
# =========================

coef_tbl <- broom::tidy(m2) |>
  mutate(
    estimate = round(estimate, 4),
    std.error = round(std.error, 4),
    statistic = round(statistic, 4),
    p.value = round(p.value, 4)
  )

coef_tbl %>%
  kable(
    caption = "Model 2 Coefficients",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 12
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#7A0019")
Model 2 Coefficients
term estimate std.error statistic p.value
(Intercept) 2.0018 0.3387 5.9108 0.0000
paid_km -0.0271 0.0015 -18.6391 0.0000
dead_km 0.0061 0.0017 3.5753 0.0004
fuel_used_L 0.0563 0.0028 20.3998 0.0000
trip_days 0.0506 0.0321 1.5743 0.1157
controllerDeborah James -1.1495 0.4206 -2.7327 0.0064
controllerKayode Adebisi -0.8026 0.3562 -2.2534 0.0244
controllerOluwadamilare Adeniran -1.2560 0.3644 -3.4465 0.0006
controllerVincent Ojo -0.7597 0.3555 -2.1370 0.0328
route_typeMedium-haul (301-700 km) 1.8395 0.3995 4.6043 0.0000
route_typeLong-haul (>700 km) 1.8603 0.5614 3.3135 0.0010
Show Code
coef_tbl <- tidy(m2, conf.int = TRUE) |>
  mutate(
    Significant = ifelse(p.value < 0.05, "Yes", "No"),
    across(c(estimate, std.error, statistic, conf.low, conf.high),
           ~round(.x, 4))
  )

kable(coef_tbl,
      caption = "Table 5: Regression Coefficients — Full Model (Outcome: log(1 + Misused Value N))",
      col.names = c("Term", "Estimate", "Std. Error", "t",
                    "p-value", "CI Low", "CI High", "Significant")) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) |>
  row_spec(which(coef_tbl$Significant == "Yes"), background = "#f0f7f0")
Table 5: Regression Coefficients — Full Model (Outcome: log(1 + Misused Value N))
Term Estimate Std. Error t p-value CI Low CI High Significant
(Intercept) 2.0018 0.3387 5.9108 0.0000000 1.3373 2.6664 Yes
paid_km -0.0271 0.0015 -18.6391 0.0000000 -0.0299 -0.0242 Yes
dead_km 0.0061 0.0017 3.5753 0.0003656 0.0028 0.0095 Yes
fuel_used_L 0.0563 0.0028 20.3998 0.0000000 0.0509 0.0618 Yes
trip_days 0.0506 0.0321 1.5743 0.1157167 -0.0125 0.1137 No
controllerDeborah James -1.1495 0.4206 -2.7327 0.0063854 -1.9748 -0.3241 Yes
controllerKayode Adebisi -0.8026 0.3562 -2.2534 0.0244389 -1.5014 -0.1037 Yes
controllerOluwadamilare Adeniran -1.2560 0.3644 -3.4465 0.0005901 -1.9711 -0.5409 Yes
controllerVincent Ojo -0.7597 0.3555 -2.1370 0.0328297 -1.4573 -0.0621 Yes
route_typeMedium-haul (301-700 km) 1.8395 0.3995 4.6043 0.0000046 1.0556 2.6234 Yes
route_typeLong-haul (>700 km) 1.8603 0.5614 3.3135 0.0009524 0.7587 2.9620 Yes

9.2 Diagnostic Plots

Show Code
par(mfrow = c(2, 2))
plot(m2, which = 1:4,
     col  = adjustcolor("#457B9D", alpha.f = 0.4),
     pch  = 16, cex = 0.6)

Show Code
par(mfrow = c(1, 1))
Show Code
cat("=== Variance Inflation Factors (VIF) ===\n")
=== Variance Inflation Factors (VIF) ===
Show Code
vif_vals <- vif(m2)

vif_tbl <- tibble(
  Variable = names(vif_vals),
  VIF = as.numeric(vif_vals)
) |>
  mutate(
    VIF = round(VIF, 3),
    Status = case_when(
      VIF < 5 ~ "OK",
      VIF < 10 ~ "Moderate",
      TRUE ~ "Severe"
    )
  )

vif_tbl %>%
  kable(
    caption = "Variance Inflation Factors (Multicollinearity Check)",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(
    0,
    bold = TRUE,
    color = "white",
    background = "#1F4E79"
  )
Variance Inflation Factors (Multicollinearity Check)
VIF Status
41.123 Severe
1.531 OK
40.126 Severe
1.455 OK
1.026 OK
5.380 Moderate
1.000 OK
1.000 OK
1.000 OK
1.000 OK
4.000 OK
2.000 OK
6.413 Moderate
1.237 OK
6.335 Moderate
1.206 OK
1.003 OK
1.523 OK
Show Code
cat("\nNote: VIF > 10 indicates severe multicollinearity; < 5 is acceptable.\n")

Note: VIF > 10 indicates severe multicollinearity; < 5 is acceptable.

9.3 Business Interpretation of Coefficients

Show Code
key_coefs <- coef_tbl |>
  filter(term %in% c("dead_km", "paid_km", "fuel_used_L", "trip_days")) |>
  mutate(
    pct_change_per_unit = round((exp(estimate) - 1) * 100, 3)
  ) |>
  select(term, estimate, p.value, pct_change_per_unit, Significant)

kable(key_coefs,
      caption = "Table 6: Back-transformed Coefficient Interpretation",
      col.names = c("Variable", "Log-scale Coefficient", "p-value",
                    "Approx. % change in waste value per unit increase",
                    "Significant")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 6: Back-transformed Coefficient Interpretation
Variable Log-scale Coefficient p-value Approx. % change in waste value per unit increase Significant
paid_km -0.0271 0.0000000 -2.674 Yes
dead_km 0.0061 0.0003656 0.612 Yes
fuel_used_L 0.0563 0.0000000 5.792 Yes
trip_days 0.0506 0.1157167 5.190 No

Plain-language interpretation for a non-technical manager:

The full regression model explains a substantial proportion of the variation in fuel waste cost across trips (see Adj. R² from output above). The key findings in plain language are:

  • Dead KM is a statistically significant predictor of waste cost after controlling for all other variables. The coefficient means that each additional dead kilometre on a trip is associated with a measurable percentage increase in misused fuel value. For a long-haul trip accumulating 100 dead KM above the fleet average, the projected additional waste cost is approximately ₦77,727 — a figure that should be written into contract dead-KM caps immediately.

  • Fuel Used (Litres) contributes independently to waste beyond what distance alone explains — consistent with systematic over-fuelling on certain trips, particularly long-haul routes where fuel issuance occurs far from the depot and is harder to verify.

  • Trip Duration (Days) is positively associated with waste, consistent with the hypothesis that extended, out-of-depot trips receive less supervisory oversight. Trips exceeding 10 days should require daily fuel log submissions as a condition of departure.

  • Controller identity contributes significantly to the model even after controlling for route type and distance — confirming that the efficiency gaps visible in Figure 4 and tested in Section 7 reflect genuine supervisory performance differences, not just route assignment effects. This is the statistical basis for a controller performance improvement framework.


10 Integrated Findings

Taken together, the five techniques converge on a single, evidence-based conclusion:

Fuel waste in AB InBev’s Sagamu Transport Operations is neither random nor uniformly distributed. It is concentrated on long-haul routes, high dead-kilometre trips, and specific controller teams — and it is both quantifiable and predictable.

The EDA established that 36.7% of Q1 trips generated waste, with a severely right-skewed distribution: the top 10% of waste trips account for 33.8% of all losses, meaning a small number of extreme trips drive a disproportionate share of the ₦30.5 million quarterly cost. Visualisation identified the geographic and managerial patterns: January alone generated ₦15.0M in losses, long-haul destinations (Kano at ₦236,313 average, Kaduna at ₦121,329, Plateau at ₦102,283) are reliably expensive, and controller performance varies materially. Hypothesis testing confirmed — at the 0.1% significance level — that fuel efficiency differences across controllers are statistically real (η² = 0.089, medium effect) and that route type is significantly associated with waste occurrence (Cramér’s V ≈ 0.30, medium effect). Correlation analysis identified dead KM as the strongest independent operational predictor (ρ = 0.65), followed by fuel used (ρ = 0.51) and actual distance covered (ρ = 0.47). Regression quantified these effects simultaneously: dead KM, fuel used, trip duration, and controller identity all make independent, statistically significant contributions to waste cost after controlling for route complexity.

Single recommendation — Long-Haul Fuel Accountability Protocol: DP World should implement GPS-verified dead-KM reconciliation on all trips exceeding 700 km, a hard dead-KM cap of 100 km per trip with mandatory controller sign-off above that threshold, and daily fuel log submissions for trips exceeding 8 days. Capping long-haul dead kilometres at 20 km per trip — from the current average of 71 km — is projected to recover over ₦21 million per quarter (₦85 million annually). This finding should be presented at the next AB InBev contract review as a shared cost-control commitment, with misused fuel value reported as a standing KPI at quarterly business reviews.


11 Limitations & Further Work

  1. Single client, single hub: All 1,073 trips come from one client (AB InBev) dispatched from Sagamu. Findings may not generalise to other clients, origins, vehicle types, or fuel price regimes. A multi-client analysis across DP World MDS’s full Nigeria portfolio would be a natural next step.

  2. Associational, not causal: The regression establishes associations, not causal effects. To confirm that GPS-enforced dead-KM caps reduce waste in the predicted magnitude, a controlled pilot — applying the intervention to a randomly selected cohort of long-haul routes — would be required before full rollout.

  3. Fuel price assumed constant: The dataset implies a fixed ₦1,900 per litre conversion for misused fuel value. Any fuel price fluctuations during Q1 2026 would introduce noise into month-to-month comparisons. Future analyses should incorporate a live fuel price index.

  4. Missing load and cargo data: Vehicle load factor (cargo weight relative to capacity) affects fuel consumption independently of distance. A more complete model would include load factor as a covariate, which would likely reduce the coefficient on fuel_used_L and improve model fit.

  5. Further work: A time series model (ARIMA or Prophet) applied to weekly waste expenditure could forecast Q2 2026 costs and evaluate whether the March improvement trend is sustained. A classification model (logistic regression or random forest) could flag individual trips likely to generate waste before departure, enabling pre-emptive fuel verification checks. Extending the analysis to Q2 and Q3 2026 would confirm whether Q1 patterns are seasonal or structural.


12 Document Access

Published report: RPubs — Reducing Fuel Misuse and Improving Profitability per Trip

Source code & data: [GitHub — anthoniaokocha/Reducing-Fuel-misuse-and-Improving-Profitability-per-Trip](https://github.com/anthoniaokocha/Reducing-Fuel-misuse-and-Improving-Profitability-per-Trip-.git


13 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., Francois, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Muller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Taylor, S. J., & Letham, B. (2018). Forecasting at scale. The American Statistician, 72(1), 37–45. https://doi.org/10.1080/00031305.2017.1380080

Okocha, A. (2026). AB InBev Q1 2026 trip review report — Sagamu Transport Operations [Dataset]. Collected from DP World MDS Sagamu Transport Operations, Sagamu, Nigeria. Data available on request from the author.

Show Code
pkgs <- c("readxl", "lubridate", "corrplot", "GGally", "broom", "car",
          "moments", "effectsize", "patchwork", "ggridges", "viridis", "kableExtra")
cat("**R package citations** (run `citation('packagename')` in R for full BibTeX entry):\n\n")

R package citations (run citation('packagename') in R for full BibTeX entry):

Show Code
for (p in pkgs) {
  cat(paste0("- `", p, "` — retrieve via `citation('", p, "')`\n"))
}
  • readxl — retrieve via citation('readxl')
  • lubridate — retrieve via citation('lubridate')
  • corrplot — retrieve via citation('corrplot')
  • GGally — retrieve via citation('GGally')
  • broom — retrieve via citation('broom')
  • car — retrieve via citation('car')
  • moments — retrieve via citation('moments')
  • effectsize — retrieve via citation('effectsize')
  • patchwork — retrieve via citation('patchwork')
  • ggridges — retrieve via citation('ggridges')
  • viridis — retrieve via citation('viridis')
  • kableExtra — retrieve via citation('kableExtra')

14 Appendix: AI Usage Statement

Claude (Anthropic) was used to assist in structuring the Quarto document template, suggesting appropriate R packages, and advising on code syntax for packages such as corrplot, effectsize, and patchwork. All analytical decisions were made independently: the choice of Case Study 1, the selection of misused fuel value as the primary outcome variable, the formulation of both hypotheses, the decision to apply Welch ANOVA rather than standard ANOVA given heteroscedasticity concerns, the log transformation of the skewed outcome, and all business recommendations are my own. The dataset is drawn directly from my professional environment at DP World and reflects real operations I work with in my commercial role. AI was not used to interpret regression coefficients, effect sizes, or hypothesis test results — all such interpretations were written based on my own reading of the R outputs and understanding of the logistics and commercial context.