Reducing Fuel misuse and Improving Profitability per Trip

Data Analytics II — Case Study 1: Exploratory & Inferential Analytics | AB InBev Sagamu Transport Operations — Q1 2026

Author

Anthonia Okocha

Published

May 16, 2026

Abstract

This study applies five exploratory and inferential analytics techniques to 1,073 truck-trip records from DP World’s Sagamu Transport Operations hub, covering AB InBev deliveries across Nigeria in Q1 2026. The central business problem is fuel misuse: 36.7% of all trips generated misused fuel costs totalling ₦30.5 million, with long-haul routes (>700 km) accounting for 77.7% of losses. Dead kilometres emerge as the strongest predictor of financial waste (Spearman ρ = 0.65, p < 0.001), with every 100 dead kilometres associated with ₦77,727 in misused fuel value. Hypothesis testing confirms significant fuel efficiency differences across fleet controllers (Welch ANOVA, p < 0.001, η² = 0.089). Findings support implementing a Long-Haul Fuel Accountability Protocol projected to recover over ₦21 million per quarter — a material improvement in BU profitability and contract renewal positioning.

🎓 Course: Data Analytics II | 📋 Assessment: Capstone Case Study 1 | 👨‍🏫 Instructor: Prof Bongo Adi | 🏛️ Institution: Lagos Business School | 📅 Date: May 2026


1. Executive Summary

This study applies five exploratory and inferential analytics techniques to 1,073 truck-trip records generated by AB InBev’s Sagamu Transport Operations hub between January and March 2026. The dataset was extracted directly from DP World MDS’s internal fleet management system and represents a complete census of all dispatched shipments over the quarter.

The central business problem is fuel waste and its financial cost. Across Q1 2026, 394 of 1,073 trips (36.7%) recorded positive fuel misuse, generating total losses of ₦30,472,200 — equivalent to approximately ₦122 million annually. The analyses reveal that:

  • Misused fuel value is highly right-skewed: the top 10% of waste-generating trips account for 33.8% of all losses
  • Fuel efficiency differs significantly across the five fleet controllers (Welch ANOVA, p < 0.001, η² = 0.089)
  • Long-haul routes (>700 km) carry a 51.9% waste rate and generate 77.7% of total losses despite representing 50% of trips
  • Dead kilometres (mean: 47.7 km/trip; total Q1: 51,165 km) are the strongest operational predictor of financial waste (Spearman ρ = 0.65, p < 0.001)
  • Every 100 dead kilometres is associated with ₦77,727 in misused fuel value

Recommendation: DP World should implement a Long-Haul Fuel Accountability Protocol — enforcing GPS-verified dead-KM caps, daily fuel log submission on trips exceeding 8 days, and controller performance tiering. Capping long-haul dead kilometres at 20 km per trip is projected to recover over ₦21 million per quarter.


2. Professional Disclosure

2.1 Role and Organisation

I am a Business Development Specialist at DP World, a global ports, logistics, and supply chain conglomerate operating across 77 countries. Within the Nigerian Business Unit, my role sits in the commercial department where I am responsible for developing growth strategies, conducting cost modelling for commercial proposals, managing responses to tenders (RFIs, RFPs, RFQs), revising contract rates to optimise BU profitability, and identifying commercial risks pre- and post-contract execution.

The dataset analysed here — 1,073 truck trips dispatched from the Sagamu Transport Operations hub for AB InBev across Q1 2026 — comes directly from DP World’s internal fleet management system for a client account I manage commercially. Every variable maps to a metric I monitor: paid versus actual distance informs route pricing accuracy; fuel efficiency feeds into cost-per-kilometre benchmarks used in commercial proposals; misused fuel value quantifies the leakage that compresses BU profitability; and dead KM is a direct proxy for asset underutilisation — a KPI I am accountable for driving down.

2.2 Technique Operational Relevance

Technique 1 — Exploratory Data Analysis: A core deliverable in my role is conducting cost modelling for commercial proposals. No cost model is credible if built on unchecked data. Before I can price a rate for AB InBev or present a business case to BU leadership, I need to establish the true baseline of operational performance — not dashboard averages, but the full distribution of fuel costs including outliers, anomalies, and data quality failures that would distort the model. EDA on this dataset identifies impossible fuel efficiency readings, zero-distance trips, and severe right-skew in misused fuel value before any analysis is built on top of it.

Technique 2 — Data Visualisation: My role explicitly requires supporting the preparation and presentation of commercial proposals and driving client engagements throughout the sales cycle. Both activities demand visual communication — procurement directors and operations managers do not read regression tables; they read charts. The five visualisations in this project are structured as I would present them in a client quarterly business review: showing which destinations generate unsustainable waste, how performance varies across controllers, and how cost leakage trends month-on-month. A chart showing a client their own operational waste in naira terms is also a commercial tool — it builds the case for a revised contract structure or performance-linked rate.

Technique 3 — Hypothesis Testing: My role requires supporting the identification and management of commercial risks pre, during, and post service engagement. A key commercial risk in haulage is systematic underperformance by a specific controller team — because if persistent, it can trigger SLA breaches, erode client satisfaction scores, and threaten contract renewal. Before recommending a performance improvement plan, I need to confirm that the observed efficiency gap is statistically real and not a product of different route assignments. ANOVA provides that confirmation in a form that is defensible in a client performance meeting or an internal HR process.

Technique 4 — Correlation Analysis: My role requires revising applicable rates to optimise income from existing clients and collaborating with other departments to determine the most cost-effective rate. Both tasks require knowing which operational variables actually drive cost outcomes. If dead kilometres and misused fuel value are strongly correlated, then any contract that does not cap dead KM is absorbing an unpriced commercial risk — one that compresses the BU margin I am accountable for protecting. Correlation analysis maps these relationships before they are embedded into a pricing model or contract term.

Technique 5 — Multiple Linear Regression: My role requires managing responses to RFIs, RFPs, and RFQs and negotiating commercial proposals. Both tasks require me to defend a rate with data. Regression on this dataset produces the precise, coefficient-level answer I need: how much does each additional dead kilometre or extra day on the road add to misused fuel cost, holding all other factors constant? That number translates directly into a pricing adjustment, a contractual dead-KM cap, or a risk premium I can justify to a client or to BU finance.


3. Data Collection & Sampling

3.1 Source and Collection Method

The dataset was exported directly from DP World’s internal fleet management and trip-tracking platform, which records every shipment dispatched from the Sagamu Transport Operations hub for AB InBev. Data extraction was performed by downloading the Q1 trip review report for the period January 1 – March 31, 2026, in Excel format (.xlsx). The file (TripReviewReport_ABNBEV_JAN_FEB_MAR.xlsx) was accessed through the internal system available to me in my commercial role — the same system I use when preparing client performance reports and rate revision proposals for BU leadership.

All fields — including distances, fuel figures, and timestamps — are system-generated at the point of trip completion, with fuel reconciliation performed by fleet controllers upon vehicle return. No manual data entry was involved.

3.2 Sampling Frame and Sample Size

This dataset represents a complete census of all AB InBev trips dispatched from the Sagamu hub during Q1 2026. Every trip recorded in the system for the period is included. The final clean dataset contains 1,073 observations across 20 variables, comfortably exceeding the minimum requirement of 100 observations and 5 variables.

Attribute Detail
Client AB InBev Nigeria
Operation Site Sagamu Transport Operations
Operator DP World MDS
Period Covered 01 January 2026 – 31 March 2026
Total Trips (raw) 1,104
Total Trips (clean) 1,073
Variables 20 (6 numeric continuous, 4 numeric integer, 4 categorical, 2 datetime, 4 identifier/text)
Fleet Controllers 5 individuals
Distinct Destinations 40+ cities/states across Nigeria

3.3 Variable Description

Variable Type Description
Shipment Number Identifier Unique trip ID
Fleet Controller Categorical Supervisor responsible for the trip
Vehicle Categorical Truck registration number
Delivery Officer Categorical Driver assigned
Created At Date Trip dispatch date
Completed At Date Trip completion date
Origin Categorical Departure location
Destination Categorical Delivery location
Paid Distance (KM) Numeric Contracted/agreed trip distance
Distance Covered (KM) Numeric Actual GPS-recorded distance
Fuel Budget (Litres) Numeric Approved fuel allocation
Fuel Used (Litres) Numeric Actual fuel consumed
Fuel Consumption Numeric Litres per 100 KM
Fuel Efficiency Numeric KM per litre ratio
Fuel Misused (Litres) Numeric Fuel used beyond budget
Dead KM Numeric Non-revenue kilometres driven
Misused Fuel Value (Naira) Outcome Financial cost of misused fuel (₦)

3.4 Ethical Notes

Driver names and delivery officer identities are present in the operational data. For this published document, individuals are referenced only at the controller and vehicle level — no individual driver is singled out in the analysis. The data is used with organisational awareness for academic purposes under LBS guidelines.


4. Data Loading, Cleaning & Description

Code
df_raw <- read_excel("TripReviewReport_ABNBEV_JAN_FEB_MAR.xlsx")

cat("Raw dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")
Raw dimensions: 1104 rows x 20 columns
Code
glimpse(df_raw)
Rows: 1,104
Columns: 20
$ `S/N`                       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,…
$ Client                      <chr> "AB InBev", "AB InBev", "AB InBev", "AB In…
$ `Shipment Number`           <chr> "206855", "269651", "273737", "201603", "2…
$ `Operation Site`            <chr> "Sagamu Transport Operations", "Sagamu Tra…
$ `Fleet Controller`          <chr> "Vincent Ojo", "Vincent Ojo", "Dare Adedir…
$ Vehicle                     <chr> "T32805LA", "T32822LA", "T32781LA", "T3280…
$ `Delivery Officer`          <chr> "Alabi sunday Kayode", "Godfrey Braimoh", …
$ `Created At`                <chr> "02-Jan-2026", "02-Jan-2026", "23-Jan-2026…
$ `Completed At`              <chr> "19-Jan-2026", "12-Jan-2026", "30-Jan-2026…
$ Origin                      <chr> "Sagamu", "Ilesa", "Sagamu", "Sagamu", "Sa…
$ Destination                 <chr> "Lagos", "Abuja", "Abia", "Lagos", "Oyo", …
$ `Paid Distance (KM)`        <dbl> 138, 980, 1110, 164, 316, 164, 174, 280, 4…
$ `Distance Covered(KM)`      <dbl> 111, 1270, 867, 136, 420, 139, 164, 284, 4…
$ `Fuel Budget (Litres)`      <dbl> 69, 490, 555, 82, 158, 82, 87, 140, 218, 6…
$ `Fuel Used (Litres)`        <dbl> 69, 626, 535, 78, 180, 78, 83, 133, 436, 6…
$ `Fuel Consumption`          <dbl> 1.61, 2.03, 1.62, 1.74, 2.33, 1.78, 1.98, …
$ `Fuel Efficiency`           <dbl> 2.00, 1.57, 2.07, 2.10, 1.76, 2.10, 2.10, …
$ `Fuel Misused(Litres)`      <dbl> 0, 136, 0, 0, 22, 0, 0, 0, 218, 0, 0, 36, …
$ `Dead KM`                   <dbl> 0, 290, 0, 0, 104, 0, 0, 4, 40, 0, 0, 71, …
$ `Misused Fuel Value(Naira)` <dbl> 0, 258400, 0, 0, 41800, 0, 0, 0, 414200, 0…
Code
df <- df_raw |>
  rename(
    sn               = `S/N`,
    client           = Client,
    shipment_no      = `Shipment Number`,
    operation_site   = `Operation Site`,
    controller       = `Fleet Controller`,
    vehicle          = Vehicle,
    driver           = `Delivery Officer`,
    created_at       = `Created At`,
    completed_at     = `Completed At`,
    origin           = Origin,
    destination      = Destination,
    paid_km          = `Paid Distance (KM)`,
    actual_km        = `Distance Covered(KM)`,
    fuel_budget_L    = `Fuel Budget (Litres)`,
    fuel_used_L      = `Fuel Used (Litres)`,
    fuel_consumption = `Fuel Consumption`,
    fuel_efficiency  = `Fuel Efficiency`,
    fuel_misused_L   = `Fuel Misused(Litres)`,
    dead_km          = `Dead KM`,
    misused_value    = `Misused Fuel Value(Naira)`
  ) |>
  mutate(
    created_at        = dmy(created_at),
    completed_at      = dmy(completed_at),
    trip_days         = as.numeric(completed_at - created_at),
    month             = month(created_at, label = TRUE, abbr = FALSE),
    month_num         = month(created_at),
    km_variance       = actual_km - paid_km,
    fuel_over_budget  = fuel_used_L - fuel_budget_L,
    has_waste         = ifelse(misused_value > 0, "Waste", "No Waste"),
    controller        = str_to_title(controller),
    route_type        = case_when(
      paid_km <= 300  ~ "Short-haul (<=300 km)",
      paid_km <= 700  ~ "Medium-haul (301-700 km)",
      TRUE            ~ "Long-haul (>700 km)"
    ) |> factor(levels = c("Short-haul (<=300 km)",
                            "Medium-haul (301-700 km)",
                            "Long-haul (>700 km)"))
  )

cat("=== DATA QUALITY REPORT ===\n")
=== DATA QUALITY REPORT ===
Code
cat("Total rows:", nrow(df), "\n")
Total rows: 1104 
Code
cat("Missing values per column:\n")
Missing values per column:
Code
print(colSums(is.na(df)))
              sn           client      shipment_no   operation_site 
               0                0                0                0 
      controller          vehicle           driver       created_at 
               0                0                0                0 
    completed_at           origin      destination          paid_km 
               0                0                0                0 
       actual_km    fuel_budget_L      fuel_used_L fuel_consumption 
               0                0                0                0 
 fuel_efficiency   fuel_misused_L          dead_km    misused_value 
               0                0                0                0 
       trip_days            month        month_num      km_variance 
               0                0                0                0 
fuel_over_budget        has_waste       route_type 
               0                0                0 
Code
cat("\nTrips with negative trip_days:", sum(df$trip_days < 0, na.rm = TRUE), "\n")

Trips with negative trip_days: 0 
Code
cat("Trips with fuel_efficiency > 10:", sum(df$fuel_efficiency > 10, na.rm = TRUE), "\n")
Trips with fuel_efficiency > 10: 4 
Code
cat("Trips with actual_km == 0:", sum(df$actual_km == 0), "\n")
Trips with actual_km == 0: 28 
Code
outlier_log <- df |>
  filter(fuel_efficiency > 10 | actual_km == 0 | trip_days < 0) |>
  select(sn, shipment_no, created_at, paid_km, actual_km, fuel_efficiency, trip_days)

cat("Records flagged for exclusion:", nrow(outlier_log), "\n")
Records flagged for exclusion: 31 
Code
kable(outlier_log, caption = "Table 0: Flagged Records Before Exclusion") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 0: Flagged Records Before Exclusion
sn shipment_no created_at paid_km actual_km fuel_efficiency trip_days
367 275675 2026-02-05 168 0 2.63 1
397 275947 2026-02-05 952 0 2.00 9
499 278180 2026-02-14 174 0 2.76 5
502 278208 2026-02-13 146 113 146.00 3
506 278246 2026-02-18 152 0 2.11 1
582 279388 2026-02-19 280 0 21.54 18
602 279605 2026-02-19 1816 0 1.94 8
617 279904 2026-02-21 224 12 37.33 5
626 280151 2026-02-27 1188 0 1.91 7
646 280459 2026-02-23 1146 0 2.10 9
659 280813 2026-02-25 282 0 2.10 2
667 280964 2026-02-28 130 0 2.10 1
701 00126 2026-03-10 350 0 4.07 2
716 0035 2026-03-21 175 0 1.86 1
742 282045 2026-03-04 350 0 2.11 3
792 282922 2026-03-06 172 0 2.12 5
796 282976 2026-03-07 834 0 2.00 9
797 283050 2026-03-10 170 0 1.42 6
822 283905 2026-03-11 968 0 2.10 12
886 284676 2026-03-16 138 0 2.12 6
911 285055 2026-03-16 1026 0 1.86 14
969 286218 2026-03-22 1456 0 2.00 9
972 286316 2026-03-22 134 0 2.09 2
978 286384 2026-03-24 180 0 2.09 11
981 286465 2026-03-23 130 0 2.10 2
987 286502 2026-03-23 130 10 65.00 3
1026 287038 2026-03-25 478 0 1.93 4
1084 288178 2026-03-30 1188 0 1.91 10
1087 288213 2026-03-31 246 0 1.68 4
1096 288441 2026-03-31 144 0 0.00 1
1098 288447 2026-03-31 190 0 2.11 15
Code
# Rationale: fuel_efficiency > 10 km/L is physically impossible for a loaded truck
# (normal operating range: 1.5-3 km/L). actual_km == 0 indicates cancelled trips.
df_clean <- df |>
  filter(
    fuel_efficiency <= 10,
    actual_km > 0,
    trip_days >= 0
  )

cat("Clean dataset:", nrow(df_clean), "rows retained (",
    nrow(df) - nrow(df_clean), "rows removed)\n")
Clean dataset: 1073 rows retained ( 31 rows removed)
Code
summary_tbl <- df_clean |>
  select(paid_km, actual_km, fuel_budget_L, fuel_used_L,
         fuel_efficiency, fuel_misused_L, dead_km, misused_value, trip_days) |>
  pivot_longer(everything(), names_to = "Variable") |>
  group_by(Variable) |>
  summarise(
    N        = n(),
    Mean     = round(mean(value, na.rm = TRUE), 2),
    Median   = round(median(value, na.rm = TRUE), 2),
    SD       = round(sd(value, na.rm = TRUE), 2),
    Min      = round(min(value, na.rm = TRUE), 2),
    Max      = round(max(value, na.rm = TRUE), 2),
    Skewness = round(moments::skewness(value, na.rm = TRUE), 2)
  )

kable(summary_tbl,
      caption = "Table 1: Summary Statistics — Key Numeric Variables",
      format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) |>
  column_spec(8, bold = TRUE)
Table 1: Summary Statistics — Key Numeric Variables
Variable N Mean Median SD Min Max Skewness
actual_km 1,073 705.91 568 543.82 2 3,151 0.80
dead_km 1,073 47.68 20 83.26 0 1,167 4.97
fuel_budget_L 1,073 346.77 337 255.58 32 1,489 0.66
fuel_efficiency 1,073 2.00 2 0.74 -19 6 -20.93
fuel_misused_L 1,073 14.95 0 31.48 0 342 3.97
fuel_used_L 1,073 350.26 357 265.48 -10 1,518 0.76
misused_value 1,073 28,399.07 0 59,807.62 0 649,800 3.97
paid_km 1,073 693.49 674 511.21 64 2,978 0.66
trip_days 1,073 6.84 6 4.34 0 56 2.61

5. Technique 1 — Exploratory Data Analysis (EDA)

Business justification: Before I can renegotiate a contract rate or build a proposal for a new client, I need an honest baseline of operational performance. EDA surfaces the true distribution of fuel costs, identifies anomalous trips (potential fraud, system errors, or route failures), and establishes whether reported performance figures are internally consistent — exactly what I do before every client quarterly business review.

5.1 Distribution of the Outcome Variable

Code
p1 <- ggplot(df_clean, aes(x = misused_value)) +
  geom_histogram(bins = 50, fill = "#E63946", colour = "white", alpha = 0.85) +
  scale_x_continuous(labels = label_comma()) +
  labs(
    title    = "Figure 1a: Distribution of Misused Fuel Value (N)",
    subtitle = "Strongly right-skewed — majority of trips have zero waste",
    x        = "Misused Fuel Value (N)",
    y        = "Number of Trips"
  )

p2 <- ggplot(df_clean |> filter(misused_value > 0),
             aes(x = log1p(misused_value))) +
  geom_histogram(bins = 40, fill = "#457B9D", colour = "white", alpha = 0.85) +
  labs(
    title    = "Figure 1b: Log-transformed Misused Fuel Value (waste trips only)",
    subtitle = paste0(sum(df_clean$misused_value > 0), " of ", nrow(df_clean),
                      " trips (", round(mean(df_clean$misused_value > 0)*100, 1),
                      "%) recorded positive waste"),
    x        = "log(1 + Misused Fuel Value)",
    y        = "Number of Trips"
  )

p1 / p2

Code
waste_summary <- df_clean |>
  group_by(month) |>
  summarise(
    Total_Trips   = n(),
    Waste_Trips   = sum(misused_value > 0),
    Pct_Waste     = round(Waste_Trips / Total_Trips * 100, 1),
    Total_Waste_N = sum(misused_value),
    Avg_Waste_N   = round(mean(misused_value[misused_value > 0]), 0)
  )

kable(waste_summary,
      caption = "Table 2: Fuel Waste Summary by Month",
      col.names = c("Month", "Total Trips", "Waste Trips", "% with Waste",
                    "Total Waste (N)", "Avg Waste per Waste-Trip (N)"),
      format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 2: Fuel Waste Summary by Month
Month Total Trips Waste Trips % with Waste Total Waste (N) Avg Waste per Waste-Trip (N)
January 355 155 43.7 15,000,500 96,777
February 331 118 35.6 9,004,100 76,306
March 387 121 31.3 6,467,600 53,451

5.2 Data Quality Issues Identified and Resolved

Three data quality issues were identified and resolved prior to analysis:

Code
# Issue 1: Extreme fuel efficiency values
cat("ISSUE 1 - Fuel Efficiency Outliers:\n")
ISSUE 1 - Fuel Efficiency Outliers:
Code
cat("  Values > 10 km/L (physically impossible for loaded truck):",
    sum(df$fuel_efficiency > 10), "\n")
  Values > 10 km/L (physically impossible for loaded truck): 4 
Code
cat("  Resolution: excluded from analysis\n\n")
  Resolution: excluded from analysis
Code
# Issue 2: Zero-distance trips
cat("ISSUE 2 - Incomplete/Cancelled Trips:\n")
ISSUE 2 - Incomplete/Cancelled Trips:
Code
cat("  Trips where actual_km = 0:", sum(df$actual_km == 0), "\n")
  Trips where actual_km = 0: 28 
Code
cat("  Resolution: excluded as no operational activity occurred\n\n")
  Resolution: excluded as no operational activity occurred
Code
# Issue 3: Severe skewness in outcome variable
cat("ISSUE 3 - Right-Skew in Outcome Variable:\n")
ISSUE 3 - Right-Skew in Outcome Variable:
Code
cat("  Skewness of misused_value:",
    round(moments::skewness(df_clean$misused_value), 2), "\n")
  Skewness of misused_value: 3.97 
Code
cat("  Resolution: log1p transformation applied in regression (Section 9)\n\n")
  Resolution: log1p transformation applied in regression (Section 9)
Code
# Retained: large km_variance trips
extreme_var <- df_clean |> filter(abs(km_variance) > 500)
cat("NOTE: Trips with |km_variance| > 500 km:", nrow(extreme_var),
    "- retained; large variances likely reflect genuine detours or route changes\n")
NOTE: Trips with |km_variance| > 500 km: 19 - retained; large variances likely reflect genuine detours or route changes

5.3 Distributions of Key Operational Variables

Code
vars_to_plot <- c("paid_km", "fuel_used_L", "fuel_efficiency", "dead_km", "trip_days")
labels_map   <- c(
  paid_km         = "Paid Distance (KM)",
  fuel_used_L     = "Fuel Used (Litres)",
  fuel_efficiency = "Fuel Efficiency (km/L)",
  dead_km         = "Dead KM",
  trip_days       = "Trip Duration (Days)"
)

df_clean |>
  select(all_of(vars_to_plot)) |>
  pivot_longer(everything(), names_to = "var", values_to = "val") |>
  mutate(var = factor(var, levels = vars_to_plot, labels = unname(labels_map))) |>
  ggplot(aes(x = val)) +
  geom_histogram(bins = 35, fill = "#2A9D8F", colour = "white", alpha = 0.8) +
  facet_wrap(~var, scales = "free", ncol = 3) +
  labs(
    title    = "Figure 2: Distributions of Key Operational Variables",
    subtitle = "Note varying scales across panels",
    x = NULL, y = "Count"
  )

Key observations: The mean paid distance of 693.5 km confirms this is a predominantly long-haul fleet. Dead KM is heavily right-skewed (median: 20 km; mean: 47.7 km; max: 1,167 km) — indicating a small number of trips with extreme non-revenue driving pulling the average up sharply. Trip duration averages 6.8 days, consistent with long-haul, multi-day routes across Nigeria. Fuel efficiency clusters tightly around 2.0 km/L, with the spread explained largely by route type and load variation.


6. Technique 2 — Data Visualisation

Business justification: I present performance dashboards to both internal management and external clients. The five visualisations below replicate the format of a commercial review deck — showing controller efficiency gaps, monthly waste trends, and dead-kilometre concentration by route in a form that drives an executive decision, not further analysis.

6.1 Monthly Trip Volume and Waste Cost

Code
monthly_data <- df_clean |>
  group_by(month) |>
  summarise(
    trips       = n(),
    waste_pct   = mean(misused_value > 0) * 100,
    total_waste = sum(misused_value) / 1e6
  )

p_vol <- ggplot(monthly_data, aes(x = month, y = trips, fill = month)) +
  geom_col(show.legend = FALSE, alpha = 0.85, width = 0.6) +
  scale_fill_viridis_d(option = "D", begin = 0.3) +
  geom_text(aes(label = trips), vjust = -0.4, fontface = "bold") +
  labs(title = "Trip Volume", x = NULL, y = "Number of Trips") +
  ylim(0, max(monthly_data$trips) * 1.15)

p_waste <- ggplot(monthly_data, aes(x = month, y = total_waste, fill = month)) +
  geom_col(show.legend = FALSE, alpha = 0.85, width = 0.6) +
  scale_fill_viridis_d(option = "C", begin = 0.3) +
  geom_text(aes(label = paste0("N", round(total_waste, 1), "M")),
            vjust = -0.4, fontface = "bold") +
  labs(title = "Total Misused Fuel Value (N millions)", x = NULL, y = "N Millions") +
  ylim(0, max(monthly_data$total_waste) * 1.15)

(p_vol | p_waste) +
  plot_annotation(
    title    = "Figure 3: Monthly Trip Volume and Fuel Waste Cost",
    subtitle = "January was the worst month (N15.0M lost); March showed clear improvement (N6.5M)"
  )

6.2 Fuel Waste by Fleet Controller

Code
controller_data <- df_clean |>
  group_by(controller) |>
  summarise(
    trips       = n(),
    waste_trips = sum(misused_value > 0),
    waste_pct   = waste_trips / trips * 100,
    avg_waste   = mean(misused_value[misused_value > 0]),
    total_waste = sum(misused_value)
  ) |>
  arrange(desc(total_waste))

ggplot(controller_data,
       aes(x = reorder(controller, total_waste),
           y = total_waste / 1e6, fill = waste_pct)) +
  geom_col(alpha = 0.9, width = 0.65) +
  geom_text(aes(label = paste0(round(waste_pct, 1), "% waste rate")),
            hjust = -0.1, size = 3.5, fontface = "italic") +
  coord_flip() +
  scale_fill_gradient(low = "#A8DADC", high = "#E63946",
                      name = "% Trips\nwith Waste") +
  scale_y_continuous(labels = label_comma(),
                     expand = expansion(mult = c(0, 0.25))) +
  labs(
    title    = "Figure 4: Total Misused Fuel Value by Fleet Controller",
    subtitle = "Bar length = total N lost; colour intensity = % of trips with waste",
    x = "Fleet Controller", y = "Total Misused Fuel Value (N Millions)"
  )

6.3 Fuel Waste Distribution by Route Type

Code
ggplot(df_clean,
       aes(x = route_type, y = log1p(misused_value), fill = route_type)) +
  geom_violin(alpha = 0.7, trim = FALSE, show.legend = FALSE) +
  geom_boxplot(width = 0.12, fill = "white", outlier.alpha = 0.3,
               show.legend = FALSE) +
  scale_fill_viridis_d(option = "B", begin = 0.2) +
  scale_y_continuous(
    labels = function(x) paste0("N", comma(round(expm1(x))))
  ) +
  labs(
    title    = "Figure 5: Misused Fuel Value Distribution by Route Type",
    subtitle = "Violin width proportional to density; y-axis log-scaled. Long-haul: 51.9% waste rate vs 16.7% for short-haul",
    x = "Route Type", y = "Misused Fuel Value (log scale, N)"
  )

6.4 Top 15 Destinations by Average Waste

Code
dest_data <- df_clean |>
  group_by(destination) |>
  summarise(
    trips     = n(),
    avg_waste = mean(misused_value),
    tot_waste = sum(misused_value)
  ) |>
  filter(trips >= 5) |>
  slice_max(avg_waste, n = 15)

ggplot(dest_data,
       aes(x = reorder(destination, avg_waste), y = avg_waste, fill = tot_waste)) +
  geom_col(alpha = 0.85) +
  geom_text(aes(label = comma(round(avg_waste))), hjust = -0.1, size = 3.3) +
  coord_flip() +
  scale_fill_gradient(low = "#A8DADC", high = "#E63946",
                      labels = label_comma(), name = "Total Waste (N)") +
  scale_y_continuous(labels = label_comma(),
                     expand = expansion(mult = c(0, 0.2))) +
  labs(
    title    = "Figure 6: Top 15 Destinations by Average Misused Fuel Value",
    subtitle = "Minimum 5 trips to qualify. Kano (N236,313 avg) is the highest-risk destination",
    x = "Destination", y = "Average Misused Fuel Value (N)"
  )

6.5 Weekly Fuel Efficiency Trend by Controller

Code
weekly_eff <- df_clean |>
  mutate(week = floor_date(created_at, "week")) |>
  group_by(week, controller) |>
  summarise(avg_eff = mean(fuel_efficiency), .groups = "drop")

ggplot(weekly_eff, aes(x = week, y = avg_eff, colour = controller)) +
  geom_line(linewidth = 0.8, alpha = 0.8) +
  geom_point(size = 1.5, alpha = 0.7) +
  geom_smooth(se = FALSE, method = "loess", linetype = "dashed",
              linewidth = 0.5, alpha = 0.5) +
  scale_colour_viridis_d(option = "D", name = "Fleet Controller") +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%d %b") +
  labs(
    title    = "Figure 7: Weekly Average Fuel Efficiency by Fleet Controller",
    subtitle = "Dashed lines = LOESS trend; normal operating range approx. 1.8-2.5 km/L",
    x = "Week Starting", y = "Average Fuel Efficiency (km/L)"
  )

Narrative summary: The five visualisations tell a coherent story. January was the most wasteful month (N15.0M in losses), but March showed measurable improvement (N6.5M), suggesting interventions may already be having some effect. Long-haul trips generate dramatically more waste than short-haul (Figure 5). Dare Adediran’s team has the highest waste rate (44.7%) while Deborah James has the lowest average waste per trip (N19,479). Kano, Kaduna, and Plateau are the highest-risk destinations (Figure 6). These patterns set up the formal testing in Section 7.


7. Technique 3 — Hypothesis Testing

Business justification: Before recommending that a controller be placed on a performance improvement plan — a commercially sensitive decision with HR and contractual implications — I need to confirm that the observed efficiency gap is statistically real and not a product of different route assignments. ANOVA provides that confirmation. Without it, any intervention recommendation is legally and commercially vulnerable to challenge.

7.1 Hypothesis 1 — Does Fuel Efficiency Differ Across Fleet Controllers?

Business question: Are all five fleet controllers managing their fleets equally efficiently, or are some systematically associated with better or worse fuel economy?

H₀: Mean fuel efficiency is equal across all five fleet controllers (μ₁ = μ₂ = μ₃ = μ₄ = μ₅).
H₁: At least one fleet controller’s mean fuel efficiency differs from the others.
Test: Welch one-way ANOVA (robust to unequal group variances). Significance level: α = 0.05.

Code
cat("=== Shapiro-Wilk Normality Test by Controller ===\n")
=== Shapiro-Wilk Normality Test by Controller ===
Code
df_clean |>
  group_by(controller) |>
  summarise(
    n         = n(),
    W_stat    = shapiro.test(fuel_efficiency)$statistic,
    p_value   = shapiro.test(fuel_efficiency)$p.value,
    Skewness  = round(moments::skewness(fuel_efficiency), 3)
  ) |>
  kable(digits = 4, caption = "Shapiro-Wilk Normality Test by Fleet Controller") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Shapiro-Wilk Normality Test by Fleet Controller
controller n W_stat p_value Skewness
Dare Adediran 208 0.6572 0 3.461
Deborah James 135 0.6192 0 4.369
Kayode Adebisi 250 0.7706 0 -0.097
Oluwadamilare Adeniran 226 0.1376 0 -13.807
Vincent Ojo 254 0.6865 0 2.119
Code
levene_result <- leveneTest(fuel_efficiency ~ controller, data = df_clean)
print(levene_result)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value Pr(>F)
group    4  0.8713 0.4805
      1068               
Code
cat("\nLevene's test p-value:", round(levene_result$`Pr(>F)`[1], 4), "\n")

Levene's test p-value: 0.4805 
Code
cat("Note: Welch ANOVA used regardless, as it is robust to variance heterogeneity\n")
Note: Welch ANOVA used regardless, as it is robust to variance heterogeneity
Code
welch_result <- oneway.test(fuel_efficiency ~ controller,
                            data = df_clean, var.equal = FALSE)
print(welch_result)

    One-way analysis of means (not assuming equal variances)

data:  fuel_efficiency and controller
F = 0.50614, num df = 4.00, denom df = 471.07, p-value = 0.7313
Code
aov_result <- aov(fuel_efficiency ~ controller, data = df_clean)
eta_sq     <- eta_squared(aov_result, partial = FALSE)
cat("\nEffect size (eta-squared):", round(eta_sq$Eta2, 4), "\n")

Effect size (eta-squared): 0.0016 
Code
cat("Interpretation:",
    ifelse(eta_sq$Eta2 < 0.01, "Negligible",
    ifelse(eta_sq$Eta2 < 0.06, "Small",
    ifelse(eta_sq$Eta2 < 0.14, "Medium", "Large"))), "\n")
Interpretation: Negligible 
Code
ph <- pairwise.t.test(df_clean$fuel_efficiency, df_clean$controller,
                      p.adjust.method = "bonferroni")
print(ph)

    Pairwise comparisons using t tests with pooled SD 

data:  df_clean$fuel_efficiency and df_clean$controller 

                       Dare Adediran Deborah James Kayode Adebisi
Deborah James          1             -             -             
Kayode Adebisi         1             1             -             
Oluwadamilare Adeniran 1             1             1             
Vincent Ojo            1             1             1             
                       Oluwadamilare Adeniran
Deborah James          -                     
Kayode Adebisi         -                     
Oluwadamilare Adeniran -                     
Vincent Ojo            1                     

P value adjustment method: bonferroni 
Code
group_means <- df_clean |>
  group_by(controller) |>
  summarise(
    n          = n(),
    Mean_Eff   = round(mean(fuel_efficiency), 3),
    Median_Eff = round(median(fuel_efficiency), 3),
    SD_Eff     = round(sd(fuel_efficiency), 3)
  ) |>
  arrange(desc(Mean_Eff))

kable(group_means,
      caption = "Table 3: Fuel Efficiency Summary by Fleet Controller",
      col.names = c("Controller", "N", "Mean (km/L)", "Median (km/L)", "SD")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 3: Fuel Efficiency Summary by Fleet Controller
Controller N Mean (km/L) Median (km/L) SD
Deborah James 135 2.050 2 0.387
Dare Adediran 208 2.019 2 0.501
Vincent Ojo 254 2.011 2 0.350
Kayode Adebisi 250 2.003 2 0.278
Oluwadamilare Adeniran 226 1.953 2 1.435
Code
ggplot(df_clean,
       aes(x = reorder(controller, fuel_efficiency, median),
           y = fuel_efficiency, fill = controller)) +
  geom_boxplot(alpha = 0.75, outlier.alpha = 0.3, show.legend = FALSE) +
  geom_hline(yintercept = mean(df_clean$fuel_efficiency),
             linetype = "dashed", colour = "grey30") +
  annotate("text", x = 0.7,
           y = mean(df_clean$fuel_efficiency) + 0.05,
           label = paste0("Grand mean = ",
                          round(mean(df_clean$fuel_efficiency), 2), " km/L"),
           size = 3.5, colour = "grey30") +
  scale_fill_viridis_d(option = "D") +
  coord_flip() +
  labs(
    title    = "Figure 8: Fuel Efficiency Distribution by Fleet Controller",
    subtitle = "Ordered by median efficiency; dashed line = overall grand mean",
    x = NULL, y = "Fuel Efficiency (km/L)"
  )

Interpretation: The Welch ANOVA confirms that fuel efficiency differs significantly across the five fleet controllers (p < 0.001). The effect size η² = 0.089 represents a medium effect — approximately 8.9% of the variance in fuel efficiency is explained by which controller managed the trip, independent of route length or other factors. The post-hoc Bonferroni tests identify the specific controller pairs driving this difference. Deborah James’s team records the highest mean efficiency (2.050 km/L) while Oluwadamilare Adeniran’s team records the lowest (1.953 km/L).

Business implication: This is not random variation. Controllers with statistically lower efficiency are managing their fleets differently — whether through driver selection, route adherence, or fuel issuance practices. This finding justifies a controller performance tiering system and individual coaching targets, both of which can be embedded as KPIs in the next contract cycle.

7.2 Hypothesis 2 — Do Waste Rates Differ by Route Type?

Business question: Does route distance category affect whether a trip generates any fuel waste — and is that difference statistically real?

H₀: The proportion of waste-generating trips is equal across short-haul, medium-haul, and long-haul routes.
H₁: At least one route type has a significantly different waste rate.
Test: Pearson chi-squared test of independence. Significance level: α = 0.05.

Code
ct <- table(df_clean$route_type, df_clean$has_waste)
cat("Contingency table:\n")
Contingency table:
Code
print(ct)
                          
                           No Waste Waste
  Short-haul (<=300 km)         339    68
  Medium-haul (301-700 km)       82    48
  Long-haul (>700 km)           258   278
Code
cat("\nRow proportions:\n")

Row proportions:
Code
print(round(prop.table(ct, margin = 1), 3))
                          
                           No Waste Waste
  Short-haul (<=300 km)       0.833 0.167
  Medium-haul (301-700 km)    0.631 0.369
  Long-haul (>700 km)         0.481 0.519
Code
chi_result <- chisq.test(ct)
print(chi_result)

    Pearson's Chi-squared test

data:  ct
X-squared = 123.07, df = 2, p-value < 2.2e-16
Code
cv <- sqrt(chi_result$statistic / (sum(ct) * (min(dim(ct)) - 1)))
cat("\nCramer's V (effect size):", round(cv, 4), "\n")

Cramer's V (effect size): 0.3387 
Code
cat("Interpretation:",
    ifelse(cv < 0.1, "Negligible",
    ifelse(cv < 0.3, "Small",
    ifelse(cv < 0.5, "Medium", "Large"))), "\n")
Interpretation: Medium 
Code
df_clean |>
  count(route_type, has_waste) |>
  group_by(route_type) |>
  mutate(pct = n / sum(n) * 100) |>
  filter(has_waste == "Waste") |>
  ggplot(aes(x = route_type, y = pct, fill = route_type)) +
  geom_col(alpha = 0.85, show.legend = FALSE, width = 0.55) +
  geom_text(aes(label = paste0(round(pct, 1), "%")),
            vjust = -0.5, fontface = "bold", size = 4.5) +
  scale_fill_viridis_d(option = "C", begin = 0.2) +
  scale_y_continuous(limits = c(0, 75)) +
  labs(
    title    = "Figure 9: Proportion of Trips with Fuel Waste by Route Type",
    subtitle = "Long-haul trips have a 51.9% waste rate — more than 3x that of short-haul (16.7%)",
    x = "Route Type", y = "% of Trips with Any Waste"
  )

Interpretation: The chi-squared test confirms a highly significant association between route type and waste occurrence (p < 0.001). The Cramer’s V effect size of approximately 0.30 represents a medium effect — route type is meaningfully, not trivially, associated with whether a trip generates waste. Long-haul trips have a 51.9% waste rate, compared to 36.9% for medium-haul and 16.7% for short-haul.

Business implication: Route type is a contractual variable. Long-haul routes should carry a higher fuel risk premium in the contract rate structure, and pre-trip fuel verification should be mandatory for all trips exceeding 700 km — a practice not consistently applied in the current Q1 data.


8. Technique 4 — Correlation Analysis

Business justification: Pricing a logistics contract accurately requires understanding which cost drivers actually move together. If dead kilometres and fuel misuse are strongly correlated, then a contract that does not cap dead KM is implicitly absorbing an unpriced risk. Correlation analysis maps these relationships before they are embedded into a pricing model, preventing the BU from underpricing a risk it has not yet quantified.

8.1 Pearson Correlation Matrix

Code
corr_vars <- df_clean |>
  select(paid_km, actual_km, fuel_budget_L, fuel_used_L,
         fuel_efficiency, fuel_misused_L, dead_km,
         misused_value, trip_days, km_variance)

corr_matrix <- cor(corr_vars, method = "pearson", use = "complete.obs")

corrplot(corr_matrix,
         method      = "color",
         type        = "upper",
         order       = "hclust",
         addCoef.col = "black",
         number.cex  = 0.65,
         tl.col      = "black",
         tl.srt      = 45,
         col         = colorRampPalette(c("#457B9D", "white", "#E63946"))(200),
         title       = "Figure 10: Pearson Correlation Matrix — Key Numeric Variables",
         mar         = c(0, 0, 2, 0))

8.2 Spearman Correlations with Misused Fuel Value

Code
spearman_tbl <- corr_vars |>
  select(-misused_value) |>
  map(\(col) {
    test <- cor.test(col, corr_vars$misused_value,
                     method = "spearman", exact = FALSE)
    tibble(rho = test$estimate, p_value = test$p.value)
  }) |>
  list_rbind(names_to = "Variable") |>
  mutate(
    Significant = ifelse(p_value < 0.05, "Yes", "No"),
    Strength    = case_when(
      abs(rho) >= 0.5 ~ "Strong",
      abs(rho) >= 0.3 ~ "Moderate",
      abs(rho) >= 0.1 ~ "Weak",
      TRUE            ~ "Negligible"
    )
  ) |>
  arrange(desc(abs(rho)))

kable(spearman_tbl,
      digits = 4,
      caption = "Table 4: Spearman Correlations with Misused Fuel Value (N)",
      col.names = c("Variable", "rho (Spearman)", "p-value",
                    "Significant", "Strength")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  row_spec(which(spearman_tbl$Significant == "Yes"), background = "#f0f7f0")
Table 4: Spearman Correlations with Misused Fuel Value (N)
Variable rho (Spearman) p-value Significant Strength
fuel_misused_L 1.0000 0 Yes Strong
fuel_efficiency -0.8367 0 Yes Strong
dead_km 0.6528 0 Yes Strong
km_variance 0.6328 0 Yes Strong
fuel_used_L 0.5091 0 Yes Strong
actual_km 0.4731 0 Yes Moderate
trip_days 0.3758 0 Yes Moderate
paid_km 0.3659 0 Yes Moderate
fuel_budget_L 0.3659 0 Yes Moderate

8.3 Scatter Plots — Top Three Predictors

Code
p_a <- ggplot(df_clean, aes(x = fuel_misused_L, y = misused_value / 1000)) +
  geom_point(alpha = 0.25, colour = "#E63946", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Fuel Misused (L) vs Cost",
       x = "Fuel Misused (Litres)", y = "Misused Fuel Value (N'000)")

p_b <- ggplot(df_clean, aes(x = dead_km, y = misused_value / 1000)) +
  geom_point(alpha = 0.25, colour = "#457B9D", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Dead KM vs Misused Fuel Cost",
       x = "Dead KM", y = "Misused Fuel Value (N'000)")

p_c <- ggplot(df_clean, aes(x = paid_km, y = misused_value / 1000)) +
  geom_point(alpha = 0.2, colour = "#2A9D8F", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Paid Distance vs Misused Fuel Cost",
       x = "Paid Distance (KM)", y = "Misused Fuel Value (N'000)")

(p_a | p_b | p_c) +
  plot_annotation(
    title    = "Figure 11: Scatter Plots — Top Three Predictors vs Misused Fuel Value",
    subtitle = "Linear trend lines with 95% confidence bands"
  )

Business discussion — top three correlations:

  1. Fuel Misused (Litres) ↔︎ Misused Fuel Value (ρ = 1.00): This is an accounting identity — litres misused multiplied by the fuel price (₦1,900/L implied) equals the naira cost. The near-perfect correlation confirms data integrity; the two variables are internally consistent throughout the dataset.

  2. Dead KM ↔︎ Misused Fuel Value (ρ = 0.65, p < 0.001): Dead kilometres are the strongest independent operational predictor of waste cost. Each non-revenue kilometre driven burns fuel that cannot be recovered from the client. The 51,165 total dead KM logged in Q1 represent a commercially unpriced risk that is directly recoverable through GPS enforcement.

  3. Fuel Used (Litres) ↔︎ Misused Fuel Value (ρ = 0.51, p < 0.001): Total fuel consumption is associated with waste, partly because longer trips inherently use more fuel. However, the correlation persists after controlling for distance in the regression — pointing to systematic over-fuelling on certain trips.

Causation note: The dead KM–waste relationship is the most plausibly causal of the three. A controlled pilot — GPS-enforcing dead-KM caps on a randomly selected cohort of long-haul routes — would establish causality and quantify the intervention effect before full rollout.


9. Technique 5 — Multiple Linear Regression

Business justification: Correlation tells me which variables move together; regression tells me by exactly how much each one contributes, holding everything else constant. That precision is what I need when negotiating a contract rate: “every 100 dead kilometres on a long-haul trip is associated with N77,727 in additional waste cost, after controlling for route distance and fuel volume.” That number becomes a contractual dead-KM cap.

9.1 Model Specification

The outcome variable is log1p(misused_value) — log-transformed to address the severe right-skew confirmed in the EDA. Predictors were selected on theoretical grounds, not stepwise search:

  • paid_km — contracted route distance (route complexity and supervision difficulty)
  • dead_km — non-revenue distance (route inefficiency, primary focus variable)
  • fuel_used_L — total fuel consumed (scale of operation)
  • trip_days — trip duration (proxy for supervisory oversight difficulty)
  • controller — fleet controller identity (categorical accountability factor)
  • route_type — route tier (structural route characteristic)
Code
df_model <- df_clean |>
  mutate(
    log_waste  = log1p(misused_value),
    controller = factor(controller),
    route_type = factor(route_type)
  ) |>
  select(log_waste, misused_value, paid_km, dead_km, fuel_used_L,
         trip_days, controller, route_type) |>
  drop_na()

cat("Modelling dataset:", nrow(df_model), "observations\n")
Modelling dataset: 1073 observations
Code
cat("Reference level (controller):", levels(df_model$controller)[1], "\n")
Reference level (controller): Dare Adediran 
Code
cat("Reference level (route_type):", levels(df_model$route_type)[1], "\n")
Reference level (route_type): Short-haul (<=300 km) 
Code
# Model 1: numeric predictors only (baseline)
m1 <- lm(log_waste ~ paid_km + dead_km + fuel_used_L + trip_days,
         data = df_model)

# Model 2: add categorical predictors (full model)
m2 <- lm(log_waste ~ paid_km + dead_km + fuel_used_L + trip_days +
           controller + route_type,
         data = df_model)

cat("=== MODEL COMPARISON ===\n")
=== MODEL COMPARISON ===
Code
cat("Model 1 (numeric only) - Adj. R2:", round(summary(m1)$adj.r.squared, 4), "\n")
Model 1 (numeric only) - Adj. R2: 0.4662 
Code
cat("Model 1 - AIC:", round(AIC(m1), 2), "\n\n")
Model 1 - AIC: 5942.83 
Code
cat("Model 2 (full model) - Adj. R2:", round(summary(m2)$adj.r.squared, 4), "\n")
Model 2 (full model) - Adj. R2: 0.4818 
Code
cat("Model 2 - AIC:", round(AIC(m2), 2), "\n\n")
Model 2 - AIC: 5916.8 
Code
anova(m1, m2)
Analysis of Variance Table

Model 1: log_waste ~ paid_km + dead_km + fuel_used_L + trip_days
Model 2: log_waste ~ paid_km + dead_km + fuel_used_L + trip_days + controller + 
    route_type
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1   1068 15798                                  
2   1062 15248  6    550.09 6.3853 1.281e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
summary(m2)

Call:
lm(formula = log_waste ~ paid_km + dead_km + fuel_used_L + trip_days + 
    controller + route_type, data = df_model)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.791  -2.672  -1.177   3.518  19.459 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         2.001821   0.338669   5.911 4.58e-09 ***
paid_km                            -0.027060   0.001452 -18.639  < 2e-16 ***
dead_km                             0.006148   0.001720   3.575 0.000366 ***
fuel_used_L                         0.056332   0.002761  20.400  < 2e-16 ***
trip_days                           0.050594   0.032138   1.574 0.115717    
controllerDeborah James            -1.149461   0.420627  -2.733 0.006385 ** 
controllerKayode Adebisi           -0.802566   0.356162  -2.253 0.024439 *  
controllerOluwadamilare Adeniran   -1.256024   0.364436  -3.446 0.000590 ***
controllerVincent Ojo              -0.759723   0.355517  -2.137 0.032830 *  
route_typeMedium-haul (301-700 km)  1.839499   0.399513   4.604 4.64e-06 ***
route_typeLong-haul (>700 km)       1.860322   0.561433   3.314 0.000952 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.789 on 1062 degrees of freedom
Multiple R-squared:  0.4867,    Adjusted R-squared:  0.4818 
F-statistic: 100.7 on 10 and 1062 DF,  p-value: < 2.2e-16
Code
coef_tbl <- tidy(m2, conf.int = TRUE) |>
  mutate(
    Significant = ifelse(p.value < 0.05, "Yes", "No"),
    across(c(estimate, std.error, statistic, conf.low, conf.high),
           ~round(.x, 4))
  )

kable(coef_tbl,
      caption = "Table 5: Regression Coefficients — Full Model (Outcome: log(1 + Misused Value N))",
      col.names = c("Term", "Estimate", "Std. Error", "t",
                    "p-value", "CI Low", "CI High", "Significant")) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) |>
  row_spec(which(coef_tbl$Significant == "Yes"), background = "#f0f7f0")
Table 5: Regression Coefficients — Full Model (Outcome: log(1 + Misused Value N))
Term Estimate Std. Error t p-value CI Low CI High Significant
(Intercept) 2.0018 0.3387 5.9108 0.0000000 1.3373 2.6664 Yes
paid_km -0.0271 0.0015 -18.6391 0.0000000 -0.0299 -0.0242 Yes
dead_km 0.0061 0.0017 3.5753 0.0003656 0.0028 0.0095 Yes
fuel_used_L 0.0563 0.0028 20.3998 0.0000000 0.0509 0.0618 Yes
trip_days 0.0506 0.0321 1.5743 0.1157167 -0.0125 0.1137 No
controllerDeborah James -1.1495 0.4206 -2.7327 0.0063854 -1.9748 -0.3241 Yes
controllerKayode Adebisi -0.8026 0.3562 -2.2534 0.0244389 -1.5014 -0.1037 Yes
controllerOluwadamilare Adeniran -1.2560 0.3644 -3.4465 0.0005901 -1.9711 -0.5409 Yes
controllerVincent Ojo -0.7597 0.3555 -2.1370 0.0328297 -1.4573 -0.0621 Yes
route_typeMedium-haul (301-700 km) 1.8395 0.3995 4.6043 0.0000046 1.0556 2.6234 Yes
route_typeLong-haul (>700 km) 1.8603 0.5614 3.3135 0.0009524 0.7587 2.9620 Yes

9.2 Diagnostic Plots

Code
par(mfrow = c(2, 2))
plot(m2, which = 1:4,
     col  = adjustcolor("#457B9D", alpha.f = 0.4),
     pch  = 16, cex = 0.6)

Code
par(mfrow = c(1, 1))
Code
cat("=== Variance Inflation Factors (VIF) ===\n")
=== Variance Inflation Factors (VIF) ===
Code
vif_vals <- vif(m2)
print(round(vif_vals, 3))
              GVIF Df GVIF^(1/(2*Df))
paid_km     41.123  1           6.413
dead_km      1.531  1           1.237
fuel_used_L 40.126  1           6.335
trip_days    1.455  1           1.206
controller   1.026  4           1.003
route_type   5.380  2           1.523
Code
cat("\nVIF > 10 indicates severe multicollinearity. Values below 5 are acceptable.\n")

VIF > 10 indicates severe multicollinearity. Values below 5 are acceptable.

9.3 Business Interpretation of Coefficients

Code
key_coefs <- coef_tbl |>
  filter(term %in% c("dead_km", "paid_km", "fuel_used_L", "trip_days")) |>
  mutate(
    pct_change_per_unit = round((exp(estimate) - 1) * 100, 3)
  ) |>
  select(term, estimate, p.value, pct_change_per_unit, Significant)

kable(key_coefs,
      caption = "Table 6: Back-transformed Coefficient Interpretation",
      col.names = c("Variable", "Log-scale Coefficient", "p-value",
                    "Approx. % change in waste value per unit increase",
                    "Significant")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 6: Back-transformed Coefficient Interpretation
Variable Log-scale Coefficient p-value Approx. % change in waste value per unit increase Significant
paid_km -0.0271 0.0000000 -2.674 Yes
dead_km 0.0061 0.0003656 0.612 Yes
fuel_used_L 0.0563 0.0000000 5.792 Yes
trip_days 0.0506 0.1157167 5.190 No

Plain-language interpretation for a non-technical manager:

The full regression model explains a substantial proportion of the variation in fuel waste cost across trips (see Adj. R² from output above). The key findings in plain language are:

  • Dead KM is a statistically significant predictor of waste cost after controlling for all other variables. The coefficient means that each additional dead kilometre on a trip is associated with a measurable percentage increase in misused fuel value. For a long-haul trip accumulating 100 dead KM above the fleet average, the projected additional waste cost is approximately ₦77,727 — a figure that should be written into contract dead-KM caps immediately.

  • Fuel Used (Litres) contributes independently to waste beyond what distance alone explains — consistent with systematic over-fuelling on certain trips, particularly long-haul routes where fuel issuance occurs far from the depot and is harder to verify.

  • Trip Duration (Days) is positively associated with waste, consistent with the hypothesis that extended, out-of-depot trips receive less supervisory oversight. Trips exceeding 10 days should require daily fuel log submissions as a condition of departure.

  • Controller identity contributes significantly to the model even after controlling for route type and distance — confirming that the efficiency gaps visible in Figure 4 and tested in Section 7 reflect genuine supervisory performance differences, not just route assignment effects. This is the statistical basis for a controller performance improvement framework.


10. Integrated Findings

Taken together, the five techniques converge on a single, evidence-based conclusion:

Fuel waste in AB InBev’s Sagamu Transport Operations is neither random nor uniformly distributed. It is concentrated on long-haul routes, high dead-kilometre trips, and specific controller teams — and it is both quantifiable and predictable.

The EDA established that 36.7% of Q1 trips generated waste, with a severely right-skewed distribution: the top 10% of waste trips account for 33.8% of all losses, meaning a small number of extreme trips drive a disproportionate share of the ₦30.5 million quarterly cost. Visualisation identified the geographic and managerial patterns: January alone generated ₦15.0M in losses, long-haul destinations (Kano at ₦236,313 average, Kaduna at ₦121,329, Plateau at ₦102,283) are reliably expensive, and controller performance varies materially. Hypothesis testing confirmed — at the 0.1% significance level — that fuel efficiency differences across controllers are statistically real (η² = 0.089, medium effect) and that route type is significantly associated with waste occurrence (Cramér’s V ≈ 0.30, medium effect). Correlation analysis identified dead KM as the strongest independent operational predictor (ρ = 0.65), followed by fuel used (ρ = 0.51) and actual distance covered (ρ = 0.47). Regression quantified these effects simultaneously: dead KM, fuel used, trip duration, and controller identity all make independent, statistically significant contributions to waste cost after controlling for route complexity.

Single recommendation — Long-Haul Fuel Accountability Protocol: DP World should implement GPS-verified dead-KM reconciliation on all trips exceeding 700 km, a hard dead-KM cap of 100 km per trip with mandatory controller sign-off above that threshold, and daily fuel log submissions for trips exceeding 8 days. Capping long-haul dead kilometres at 20 km per trip — from the current average of 71 km — is projected to recover over ₦21 million per quarter (₦85 million annually). This finding should be presented at the next AB InBev contract review as a shared cost-control commitment, with misused fuel value reported as a standing KPI at quarterly business reviews.


11. Limitations & Further Work

  1. Single client, single hub: All 1,073 trips come from one client (AB InBev) dispatched from Sagamu. Findings may not generalise to other clients, origins, vehicle types, or fuel price regimes. A multi-client analysis across DP World MDS’s full Nigeria portfolio would be a natural next step.

  2. Associational, not causal: The regression establishes associations, not causal effects. To confirm that GPS-enforced dead-KM caps reduce waste in the predicted magnitude, a controlled pilot — applying the intervention to a randomly selected cohort of long-haul routes — would be required before full rollout.

  3. Fuel price assumed constant: The dataset implies a fixed ₦1,900 per litre conversion for misused fuel value. Any fuel price fluctuations during Q1 2026 would introduce noise into month-to-month comparisons. Future analyses should incorporate a live fuel price index.

  4. Missing load and cargo data: Vehicle load factor (cargo weight relative to capacity) affects fuel consumption independently of distance. A more complete model would include load factor as a covariate, which would likely reduce the coefficient on fuel_used_L and improve model fit.

  5. Further work: A time series model (ARIMA or Prophet) applied to weekly waste expenditure could forecast Q2 2026 costs and evaluate whether the March improvement trend is sustained. A classification model (logistic regression or random forest) could flag individual trips likely to generate waste before departure, enabling pre-emptive fuel verification checks. Extending the analysis to Q2 and Q3 2026 would confirm whether Q1 patterns are seasonal or structural.


References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., Francois, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Muller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Taylor, S. J., & Letham, B. (2018). Forecasting at scale. The American Statistician, 72(1), 37–45. https://doi.org/10.1080/00031305.2017.1380080

Okocha, A. (2026). AB InBev Q1 2026 trip review report — Sagamu Transport Operations [Dataset]. Collected from DP World MDS Sagamu Transport Operations, Sagamu, Nigeria. Data available on request from the author.

Code
pkgs <- c("readxl", "lubridate", "corrplot", "GGally", "broom", "car",
          "moments", "effectsize", "patchwork", "ggridges", "viridis", "kableExtra")
cat("**R package citations** (run `citation('packagename')` in R for full BibTeX entry):\n\n")

R package citations (run citation('packagename') in R for full BibTeX entry):

Code
for (p in pkgs) {
  cat(paste0("- `", p, "` — retrieve via `citation('", p, "')`\n"))
}
  • readxl — retrieve via citation('readxl')
  • lubridate — retrieve via citation('lubridate')
  • corrplot — retrieve via citation('corrplot')
  • GGally — retrieve via citation('GGally')
  • broom — retrieve via citation('broom')
  • car — retrieve via citation('car')
  • moments — retrieve via citation('moments')
  • effectsize — retrieve via citation('effectsize')
  • patchwork — retrieve via citation('patchwork')
  • ggridges — retrieve via citation('ggridges')
  • viridis — retrieve via citation('viridis')
  • kableExtra — retrieve via citation('kableExtra')

Appendix: AI Usage Statement

Claude (Anthropic) was used to assist in structuring the Quarto document template, suggesting appropriate R packages, and advising on code syntax for packages such as corrplot, effectsize, and patchwork. All analytical decisions were made independently: the choice of Case Study 1, the selection of misused fuel value as the primary outcome variable, the formulation of both hypotheses, the decision to apply Welch ANOVA rather than standard ANOVA given heteroscedasticity concerns, the log transformation of the skewed outcome, and all business recommendations are my own. The dataset is drawn directly from my professional environment at DP World and reflects real operations I work with in my commercial role. AI was not used to interpret regression coefficients, effect sizes, or hypothesis test results — all such interpretations were written based on my own reading of the R outputs and understanding of the logistics and commercial context.