Reducing Fuel misuse and Improving Profitability per Trip

Data Analytics II — Case Study 1: Exploratory & Inferential Analytics | AB InBev Sagamu Transport Operations — Q1 2026

Author

Anthonia Okocha

Published

May 16, 2026

Abstract

This study applies five exploratory and inferential analytics techniques to 1,073 truck-trip records from DP World’s Sagamu Transport Operations hub, covering AB InBev deliveries across Nigeria in Q1 2026. The central business problem is fuel misuse: 36.7% of all trips generated misused fuel costs totalling ₦30.5 million, with long-haul routes (>700 km) accounting for 77.7% of losses. Dead kilometres emerge as the strongest predictor of financial waste (Spearman ρ = 0.65, p < 0.001), with every 100 dead kilometres associated with ₦77,727 in misused fuel value. Hypothesis testing confirms significant fuel efficiency differences across fleet controllers (Welch ANOVA, p < 0.001, η² = 0.089). Findings support implementing a Long-Haul Fuel Accountability Protocol projected to recover over ₦21 million per quarter — a material improvement in BU profitability and contract renewal positioning.

🎓 Course: Data Analytics II | 📋 Assessment: Capstone Case Study 1 | 👨‍🏫 Instructor: Prof Bongo Adi | 🏛️ Institution: Lagos Business School | 📅 Date: May 2026

1. Executive Summary

This study applies five exploratory and inferential analytics techniques to 1,073 truck-trip records generated by AB InBev’s Sagamu Transport Operations hub between January and March 2026. The dataset was extracted directly from DP World MDS’s internal fleet management system and represents a complete census of all dispatched shipments over the quarter.

The central business problem is fuel waste and its financial cost. Across Q1 2026, 394 of 1,073 trips (36.7%) recorded positive fuel misuse, generating total losses of ₦30,472,200 — equivalent to approximately ₦122 million annually. The analyses reveal that:

Misused fuel value is highly right-skewed: the top 10% of waste-generating trips account for 33.8% of all losses
Fuel efficiency differs significantly across the five fleet controllers (Welch ANOVA, p < 0.001, η² = 0.089)
Long-haul routes (>700 km) carry a 51.9% waste rate and generate 77.7% of total losses despite representing 50% of trips
Dead kilometres (mean: 47.7 km/trip; total Q1: 51,165 km) are the strongest operational predictor of financial waste (Spearman ρ = 0.65, p < 0.001)
Every 100 dead kilometres is associated with ₦77,727 in misused fuel value

Recommendation: DP World should implement a Long-Haul Fuel Accountability Protocol — enforcing GPS-verified dead-KM caps, daily fuel log submission on trips exceeding 8 days, and controller performance tiering. Capping long-haul dead kilometres at 20 km per trip is projected to recover over ₦21 million per quarter.

2. Professional Disclosure

2.1 Role and Organisation

I am a Business Development Specialist at DP World, a global ports, logistics, and supply chain conglomerate operating across 77 countries. Within the Nigerian Business Unit, my role sits in the commercial department where I am responsible for developing growth strategies, conducting cost modelling for commercial proposals, managing responses to tenders (RFIs, RFPs, RFQs), revising contract rates to optimise BU profitability, and identifying commercial risks pre- and post-contract execution.

The dataset analysed here — 1,073 truck trips dispatched from the Sagamu Transport Operations hub for AB InBev across Q1 2026 — comes directly from DP World’s internal fleet management system for a client account I manage commercially. Every variable maps to a metric I monitor: paid versus actual distance informs route pricing accuracy; fuel efficiency feeds into cost-per-kilometre benchmarks used in commercial proposals; misused fuel value quantifies the leakage that compresses BU profitability; and dead KM is a direct proxy for asset underutilisation — a KPI I am accountable for driving down.

2.2 Technique Operational Relevance

Technique 1 — Exploratory Data Analysis: A core deliverable in my role is conducting cost modelling for commercial proposals. No cost model is credible if built on unchecked data. Before I can price a rate for AB InBev or present a business case to BU leadership, I need to establish the true baseline of operational performance — not dashboard averages, but the full distribution of fuel costs including outliers, anomalies, and data quality failures that would distort the model. EDA on this dataset identifies impossible fuel efficiency readings, zero-distance trips, and severe right-skew in misused fuel value before any analysis is built on top of it.

Technique 2 — Data Visualisation: My role explicitly requires supporting the preparation and presentation of commercial proposals and driving client engagements throughout the sales cycle. Both activities demand visual communication — procurement directors and operations managers do not read regression tables; they read charts. The five visualisations in this project are structured as I would present them in a client quarterly business review: showing which destinations generate unsustainable waste, how performance varies across controllers, and how cost leakage trends month-on-month. A chart showing a client their own operational waste in naira terms is also a commercial tool — it builds the case for a revised contract structure or performance-linked rate.

Technique 3 — Hypothesis Testing: My role requires supporting the identification and management of commercial risks pre, during, and post service engagement. A key commercial risk in haulage is systematic underperformance by a specific controller team — because if persistent, it can trigger SLA breaches, erode client satisfaction scores, and threaten contract renewal. Before recommending a performance improvement plan, I need to confirm that the observed efficiency gap is statistically real and not a product of different route assignments. ANOVA provides that confirmation in a form that is defensible in a client performance meeting or an internal HR process.

Technique 4 — Correlation Analysis: My role requires revising applicable rates to optimise income from existing clients and collaborating with other departments to determine the most cost-effective rate. Both tasks require knowing which operational variables actually drive cost outcomes. If dead kilometres and misused fuel value are strongly correlated, then any contract that does not cap dead KM is absorbing an unpriced commercial risk — one that compresses the BU margin I am accountable for protecting. Correlation analysis maps these relationships before they are embedded into a pricing model or contract term.

Technique 5 — Multiple Linear Regression: My role requires managing responses to RFIs, RFPs, and RFQs and negotiating commercial proposals. Both tasks require me to defend a rate with data. Regression on this dataset produces the precise, coefficient-level answer I need: how much does each additional dead kilometre or extra day on the road add to misused fuel cost, holding all other factors constant? That number translates directly into a pricing adjustment, a contractual dead-KM cap, or a risk premium I can justify to a client or to BU finance.

3. Data Collection & Sampling

3.1 Source and Collection Method

The dataset was exported directly from DP World’s internal fleet management and trip-tracking platform, which records every shipment dispatched from the Sagamu Transport Operations hub for AB InBev. Data extraction was performed by downloading the Q1 trip review report for the period January 1 – March 31, 2026, in Excel format (.xlsx). The file (TripReviewReport_ABNBEV_JAN_FEB_MAR.xlsx) was accessed through the internal system available to me in my commercial role — the same system I use when preparing client performance reports and rate revision proposals for BU leadership.

All fields — including distances, fuel figures, and timestamps — are system-generated at the point of trip completion, with fuel reconciliation performed by fleet controllers upon vehicle return. No manual data entry was involved.

3.2 Sampling Frame and Sample Size

This dataset represents a complete census of all AB InBev trips dispatched from the Sagamu hub during Q1 2026. Every trip recorded in the system for the period is included. The final clean dataset contains 1,073 observations across 20 variables, comfortably exceeding the minimum requirement of 100 observations and 5 variables.

Attribute	Detail
Client	AB InBev Nigeria
Operation Site	Sagamu Transport Operations
Operator	DP World MDS
Period Covered	01 January 2026 – 31 March 2026
Total Trips (raw)	1,104
Total Trips (clean)	1,073
Variables	20 (6 numeric continuous, 4 numeric integer, 4 categorical, 2 datetime, 4 identifier/text)
Fleet Controllers	5 individuals
Distinct Destinations	40+ cities/states across Nigeria

3.3 Variable Description

Variable	Type	Description
Shipment Number	Identifier	Unique trip ID
Fleet Controller	Categorical	Supervisor responsible for the trip
Vehicle	Categorical	Truck registration number
Delivery Officer	Categorical	Driver assigned
Created At	Date	Trip dispatch date
Completed At	Date	Trip completion date
Origin	Categorical	Departure location
Destination	Categorical	Delivery location
Paid Distance (KM)	Numeric	Contracted/agreed trip distance
Distance Covered (KM)	Numeric	Actual GPS-recorded distance
Fuel Budget (Litres)	Numeric	Approved fuel allocation
Fuel Used (Litres)	Numeric	Actual fuel consumed
Fuel Consumption	Numeric	Litres per 100 KM
Fuel Efficiency	Numeric	KM per litre ratio
Fuel Misused (Litres)	Numeric	Fuel used beyond budget
Dead KM	Numeric	Non-revenue kilometres driven
Misused Fuel Value (Naira)	Outcome	Financial cost of misused fuel (₦)

3.4 Ethical Notes

Driver names and delivery officer identities are present in the operational data. For this published document, individuals are referenced only at the controller and vehicle level — no individual driver is singled out in the analysis. The data is used with organisational awareness for academic purposes under LBS guidelines.

4. Data Loading, Cleaning & Description

Code

df_raw <- read_excel("TripReviewReport_ABNBEV_JAN_FEB_MAR.xlsx")

cat("Raw dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")

Raw dimensions: 1104 rows x 20 columns

Code

glimpse(df_raw)

Rows: 1,104
Columns: 20
$ `S/N`                       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,…
$ Client                      <chr> "AB InBev", "AB InBev", "AB InBev", "AB In…
$ `Shipment Number`           <chr> "206855", "269651", "273737", "201603", "2…
$ `Operation Site`            <chr> "Sagamu Transport Operations", "Sagamu Tra…
$ `Fleet Controller`          <chr> "Vincent Ojo", "Vincent Ojo", "Dare Adedir…
$ Vehicle                     <chr> "T32805LA", "T32822LA", "T32781LA", "T3280…
$ `Delivery Officer`          <chr> "Alabi sunday Kayode", "Godfrey Braimoh", …
$ `Created At`                <chr> "02-Jan-2026", "02-Jan-2026", "23-Jan-2026…
$ `Completed At`              <chr> "19-Jan-2026", "12-Jan-2026", "30-Jan-2026…
$ Origin                      <chr> "Sagamu", "Ilesa", "Sagamu", "Sagamu", "Sa…
$ Destination                 <chr> "Lagos", "Abuja", "Abia", "Lagos", "Oyo", …
$ `Paid Distance (KM)`        <dbl> 138, 980, 1110, 164, 316, 164, 174, 280, 4…
$ `Distance Covered(KM)`      <dbl> 111, 1270, 867, 136, 420, 139, 164, 284, 4…
$ `Fuel Budget (Litres)`      <dbl> 69, 490, 555, 82, 158, 82, 87, 140, 218, 6…
$ `Fuel Used (Litres)`        <dbl> 69, 626, 535, 78, 180, 78, 83, 133, 436, 6…
$ `Fuel Consumption`          <dbl> 1.61, 2.03, 1.62, 1.74, 2.33, 1.78, 1.98, …
$ `Fuel Efficiency`           <dbl> 2.00, 1.57, 2.07, 2.10, 1.76, 2.10, 2.10, …
$ `Fuel Misused(Litres)`      <dbl> 0, 136, 0, 0, 22, 0, 0, 0, 218, 0, 0, 36, …
$ `Dead KM`                   <dbl> 0, 290, 0, 0, 104, 0, 0, 4, 40, 0, 0, 71, …
$ `Misused Fuel Value(Naira)` <dbl> 0, 258400, 0, 0, 41800, 0, 0, 0, 414200, 0…

Code

df <- df_raw |>
  rename(
    sn               = `S/N`,
    client           = Client,
    shipment_no      = `Shipment Number`,
    operation_site   = `Operation Site`,
    controller       = `Fleet Controller`,
    vehicle          = Vehicle,
    driver           = `Delivery Officer`,
    created_at       = `Created At`,
    completed_at     = `Completed At`,
    origin           = Origin,
    destination      = Destination,
    paid_km          = `Paid Distance (KM)`,
    actual_km        = `Distance Covered(KM)`,
    fuel_budget_L    = `Fuel Budget (Litres)`,
    fuel_used_L      = `Fuel Used (Litres)`,
    fuel_consumption = `Fuel Consumption`,
    fuel_efficiency  = `Fuel Efficiency`,
    fuel_misused_L   = `Fuel Misused(Litres)`,
    dead_km          = `Dead KM`,
    misused_value    = `Misused Fuel Value(Naira)`
  ) |>
  mutate(
    created_at        = dmy(created_at),
    completed_at      = dmy(completed_at),
    trip_days         = as.numeric(completed_at - created_at),
    month             = month(created_at, label = TRUE, abbr = FALSE),
    month_num         = month(created_at),
    km_variance       = actual_km - paid_km,
    fuel_over_budget  = fuel_used_L - fuel_budget_L,
    has_waste         = ifelse(misused_value > 0, "Waste", "No Waste"),
    controller        = str_to_title(controller),
    route_type        = case_when(
      paid_km <= 300  ~ "Short-haul (<=300 km)",
      paid_km <= 700  ~ "Medium-haul (301-700 km)",
      TRUE            ~ "Long-haul (>700 km)"
    ) |> factor(levels = c("Short-haul (<=300 km)",
                            "Medium-haul (301-700 km)",
                            "Long-haul (>700 km)"))
  )

cat("=== DATA QUALITY REPORT ===\n")

=== DATA QUALITY REPORT ===

Code

cat("Total rows:", nrow(df), "\n")

Total rows: 1104

Code

cat("Missing values per column:\n")

Missing values per column:

Code

print(colSums(is.na(df)))

              sn           client      shipment_no   operation_site 
               0                0                0                0 
      controller          vehicle           driver       created_at 
               0                0                0                0 
    completed_at           origin      destination          paid_km 
               0                0                0                0 
       actual_km    fuel_budget_L      fuel_used_L fuel_consumption 
               0                0                0                0 
 fuel_efficiency   fuel_misused_L          dead_km    misused_value 
               0                0                0                0 
       trip_days            month        month_num      km_variance 
               0                0                0                0 
fuel_over_budget        has_waste       route_type 
               0                0                0

Code

cat("\nTrips with negative trip_days:", sum(df$trip_days < 0, na.rm = TRUE), "\n")


Trips with negative trip_days: 0

Code

cat("Trips with fuel_efficiency > 10:", sum(df$fuel_efficiency > 10, na.rm = TRUE), "\n")

Trips with fuel_efficiency > 10: 4

Code

cat("Trips with actual_km == 0:", sum(df$actual_km == 0), "\n")

Trips with actual_km == 0: 28

Code

outlier_log <- df |>
  filter(fuel_efficiency > 10 | actual_km == 0 | trip_days < 0) |>
  select(sn, shipment_no, created_at, paid_km, actual_km, fuel_efficiency, trip_days)

cat("Records flagged for exclusion:", nrow(outlier_log), "\n")

Records flagged for exclusion: 31

Code

kable(outlier_log, caption = "Table 0: Flagged Records Before Exclusion") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 0: Flagged Records Before Exclusion
sn	shipment_no	created_at	paid_km	actual_km	fuel_efficiency	trip_days
367	275675	2026-02-05	168	0	2.63	1
397	275947	2026-02-05	952	0	2.00	9
499	278180	2026-02-14	174	0	2.76	5
502	278208	2026-02-13	146	113	146.00	3
506	278246	2026-02-18	152	0	2.11	1
582	279388	2026-02-19	280	0	21.54	18
602	279605	2026-02-19	1816	0	1.94	8
617	279904	2026-02-21	224	12	37.33	5
626	280151	2026-02-27	1188	0	1.91	7
646	280459	2026-02-23	1146	0	2.10	9
659	280813	2026-02-25	282	0	2.10	2
667	280964	2026-02-28	130	0	2.10	1
701	00126	2026-03-10	350	0	4.07	2
716	0035	2026-03-21	175	0	1.86	1
742	282045	2026-03-04	350	0	2.11	3
792	282922	2026-03-06	172	0	2.12	5
796	282976	2026-03-07	834	0	2.00	9
797	283050	2026-03-10	170	0	1.42	6
822	283905	2026-03-11	968	0	2.10	12
886	284676	2026-03-16	138	0	2.12	6
911	285055	2026-03-16	1026	0	1.86	14
969	286218	2026-03-22	1456	0	2.00	9
972	286316	2026-03-22	134	0	2.09	2
978	286384	2026-03-24	180	0	2.09	11
981	286465	2026-03-23	130	0	2.10	2
987	286502	2026-03-23	130	10	65.00	3
1026	287038	2026-03-25	478	0	1.93	4
1084	288178	2026-03-30	1188	0	1.91	10
1087	288213	2026-03-31	246	0	1.68	4
1096	288441	2026-03-31	144	0	0.00	1
1098	288447	2026-03-31	190	0	2.11	15

Code

# Rationale: fuel_efficiency > 10 km/L is physically impossible for a loaded truck
# (normal operating range: 1.5-3 km/L). actual_km == 0 indicates cancelled trips.
df_clean <- df |>
  filter(
    fuel_efficiency <= 10,
    actual_km > 0,
    trip_days >= 0
  )

cat("Clean dataset:", nrow(df_clean), "rows retained (",
    nrow(df) - nrow(df_clean), "rows removed)\n")

Clean dataset: 1073 rows retained ( 31 rows removed)

Code

summary_tbl <- df_clean |>
  select(paid_km, actual_km, fuel_budget_L, fuel_used_L,
         fuel_efficiency, fuel_misused_L, dead_km, misused_value, trip_days) |>
  pivot_longer(everything(), names_to = "Variable") |>
  group_by(Variable) |>
  summarise(
    N        = n(),
    Mean     = round(mean(value, na.rm = TRUE), 2),
    Median   = round(median(value, na.rm = TRUE), 2),
    SD       = round(sd(value, na.rm = TRUE), 2),
    Min      = round(min(value, na.rm = TRUE), 2),
    Max      = round(max(value, na.rm = TRUE), 2),
    Skewness = round(moments::skewness(value, na.rm = TRUE), 2)
  )

kable(summary_tbl,
      caption = "Table 1: Summary Statistics — Key Numeric Variables",
      format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) |>
  column_spec(8, bold = TRUE)

Table 1: Summary Statistics — Key Numeric Variables
Variable	N	Mean	Median	SD	Min	Max	Skewness
actual_km	1,073	705.91	568	543.82	2	3,151	0.80
dead_km	1,073	47.68	20	83.26	0	1,167	4.97
fuel_budget_L	1,073	346.77	337	255.58	32	1,489	0.66
fuel_efficiency	1,073	2.00	2	0.74	-19	6	-20.93
fuel_misused_L	1,073	14.95	0	31.48	0	342	3.97
fuel_used_L	1,073	350.26	357	265.48	-10	1,518	0.76
misused_value	1,073	28,399.07	0	59,807.62	0	649,800	3.97
paid_km	1,073	693.49	674	511.21	64	2,978	0.66
trip_days	1,073	6.84	6	4.34	0	56	2.61

5. Technique 1 — Exploratory Data Analysis (EDA)

Business justification: Before I can renegotiate a contract rate or build a proposal for a new client, I need an honest baseline of operational performance. EDA surfaces the true distribution of fuel costs, identifies anomalous trips (potential fraud, system errors, or route failures), and establishes whether reported performance figures are internally consistent — exactly what I do before every client quarterly business review.

5.1 Distribution of the Outcome Variable

Code

p1 <- ggplot(df_clean, aes(x = misused_value)) +
  geom_histogram(bins = 50, fill = "#E63946", colour = "white", alpha = 0.85) +
  scale_x_continuous(labels = label_comma()) +
  labs(
    title    = "Figure 1a: Distribution of Misused Fuel Value (N)",
    subtitle = "Strongly right-skewed — majority of trips have zero waste",
    x        = "Misused Fuel Value (N)",
    y        = "Number of Trips"
  )

p2 <- ggplot(df_clean |> filter(misused_value > 0),
             aes(x = log1p(misused_value))) +
  geom_histogram(bins = 40, fill = "#457B9D", colour = "white", alpha = 0.85) +
  labs(
    title    = "Figure 1b: Log-transformed Misused Fuel Value (waste trips only)",
    subtitle = paste0(sum(df_clean$misused_value > 0), " of ", nrow(df_clean),
                      " trips (", round(mean(df_clean$misused_value > 0)*100, 1),
                      "%) recorded positive waste"),
    x        = "log(1 + Misused Fuel Value)",
    y        = "Number of Trips"
  )

p1 / p2

Code

waste_summary <- df_clean |>
  group_by(month) |>
  summarise(
    Total_Trips   = n(),
    Waste_Trips   = sum(misused_value > 0),
    Pct_Waste     = round(Waste_Trips / Total_Trips * 100, 1),
    Total_Waste_N = sum(misused_value),
    Avg_Waste_N   = round(mean(misused_value[misused_value > 0]), 0)
  )

kable(waste_summary,
      caption = "Table 2: Fuel Waste Summary by Month",
      col.names = c("Month", "Total Trips", "Waste Trips", "% with Waste",
                    "Total Waste (N)", "Avg Waste per Waste-Trip (N)"),
      format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 2: Fuel Waste Summary by Month
Month	Total Trips	Waste Trips	% with Waste	Total Waste (N)	Avg Waste per Waste-Trip (N)
January	355	155	43.7	15,000,500	96,777
February	331	118	35.6	9,004,100	76,306
March	387	121	31.3	6,467,600	53,451

5.2 Data Quality Issues Identified and Resolved

Three data quality issues were identified and resolved prior to analysis:

Code

# Issue 1: Extreme fuel efficiency values
cat("ISSUE 1 - Fuel Efficiency Outliers:\n")

ISSUE 1 - Fuel Efficiency Outliers:

Code

cat("  Values > 10 km/L (physically impossible for loaded truck):",
    sum(df$fuel_efficiency > 10), "\n")

  Values > 10 km/L (physically impossible for loaded truck): 4

Code

cat("  Resolution: excluded from analysis\n\n")

  Resolution: excluded from analysis

Code

# Issue 2: Zero-distance trips
cat("ISSUE 2 - Incomplete/Cancelled Trips:\n")

ISSUE 2 - Incomplete/Cancelled Trips:

Code

cat("  Trips where actual_km = 0:", sum(df$actual_km == 0), "\n")

  Trips where actual_km = 0: 28

Code

cat("  Resolution: excluded as no operational activity occurred\n\n")

  Resolution: excluded as no operational activity occurred

Code

# Issue 3: Severe skewness in outcome variable
cat("ISSUE 3 - Right-Skew in Outcome Variable:\n")

ISSUE 3 - Right-Skew in Outcome Variable:

Code

cat("  Skewness of misused_value:",
    round(moments::skewness(df_clean$misused_value), 2), "\n")

  Skewness of misused_value: 3.97

Code

cat("  Resolution: log1p transformation applied in regression (Section 9)\n\n")

  Resolution: log1p transformation applied in regression (Section 9)

Code

# Retained: large km_variance trips
extreme_var <- df_clean |> filter(abs(km_variance) > 500)
cat("NOTE: Trips with |km_variance| > 500 km:", nrow(extreme_var),
    "- retained; large variances likely reflect genuine detours or route changes\n")

NOTE: Trips with |km_variance| > 500 km: 19 - retained; large variances likely reflect genuine detours or route changes

5.3 Distributions of Key Operational Variables

Code

vars_to_plot <- c("paid_km", "fuel_used_L", "fuel_efficiency", "dead_km", "trip_days")
labels_map   <- c(
  paid_km         = "Paid Distance (KM)",
  fuel_used_L     = "Fuel Used (Litres)",
  fuel_efficiency = "Fuel Efficiency (km/L)",
  dead_km         = "Dead KM",
  trip_days       = "Trip Duration (Days)"
)

df_clean |>
  select(all_of(vars_to_plot)) |>
  pivot_longer(everything(), names_to = "var", values_to = "val") |>
  mutate(var = factor(var, levels = vars_to_plot, labels = unname(labels_map))) |>
  ggplot(aes(x = val)) +
  geom_histogram(bins = 35, fill = "#2A9D8F", colour = "white", alpha = 0.8) +
  facet_wrap(~var, scales = "free", ncol = 3) +
  labs(
    title    = "Figure 2: Distributions of Key Operational Variables",
    subtitle = "Note varying scales across panels",
    x = NULL, y = "Count"
  )

Key observations: The mean paid distance of 693.5 km confirms this is a predominantly long-haul fleet. Dead KM is heavily right-skewed (median: 20 km; mean: 47.7 km; max: 1,167 km) — indicating a small number of trips with extreme non-revenue driving pulling the average up sharply. Trip duration averages 6.8 days, consistent with long-haul, multi-day routes across Nigeria. Fuel efficiency clusters tightly around 2.0 km/L, with the spread explained largely by route type and load variation.

6. Technique 2 — Data Visualisation

Business justification: I present performance dashboards to both internal management and external clients. The five visualisations below replicate the format of a commercial review deck — showing controller efficiency gaps, monthly waste trends, and dead-kilometre concentration by route in a form that drives an executive decision, not further analysis.

6.1 Monthly Trip Volume and Waste Cost

Code

monthly_data <- df_clean |>
  group_by(month) |>
  summarise(
    trips       = n(),
    waste_pct   = mean(misused_value > 0) * 100,
    total_waste = sum(misused_value) / 1e6
  )

p_vol <- ggplot(monthly_data, aes(x = month, y = trips, fill = month)) +
  geom_col(show.legend = FALSE, alpha = 0.85, width = 0.6) +
  scale_fill_viridis_d(option = "D", begin = 0.3) +
  geom_text(aes(label = trips), vjust = -0.4, fontface = "bold") +
  labs(title = "Trip Volume", x = NULL, y = "Number of Trips") +
  ylim(0, max(monthly_data$trips) * 1.15)

p_waste <- ggplot(monthly_data, aes(x = month, y = total_waste, fill = month)) +
  geom_col(show.legend = FALSE, alpha = 0.85, width = 0.6) +
  scale_fill_viridis_d(option = "C", begin = 0.3) +
  geom_text(aes(label = paste0("N", round(total_waste, 1), "M")),
            vjust = -0.4, fontface = "bold") +
  labs(title = "Total Misused Fuel Value (N millions)", x = NULL, y = "N Millions") +
  ylim(0, max(monthly_data$total_waste) * 1.15)

(p_vol | p_waste) +
  plot_annotation(
    title    = "Figure 3: Monthly Trip Volume and Fuel Waste Cost",
    subtitle = "January was the worst month (N15.0M lost); March showed clear improvement (N6.5M)"
  )

6.2 Fuel Waste by Fleet Controller

Code

controller_data <- df_clean |>
  group_by(controller) |>
  summarise(
    trips       = n(),
    waste_trips = sum(misused_value > 0),
    waste_pct   = waste_trips / trips * 100,
    avg_waste   = mean(misused_value[misused_value > 0]),
    total_waste = sum(misused_value)
  ) |>
  arrange(desc(total_waste))

ggplot(controller_data,
       aes(x = reorder(controller, total_waste),
           y = total_waste / 1e6, fill = waste_pct)) +
  geom_col(alpha = 0.9, width = 0.65) +
  geom_text(aes(label = paste0(round(waste_pct, 1), "% waste rate")),
            hjust = -0.1, size = 3.5, fontface = "italic") +
  coord_flip() +
  scale_fill_gradient(low = "#A8DADC", high = "#E63946",
                      name = "% Trips\nwith Waste") +
  scale_y_continuous(labels = label_comma(),
                     expand = expansion(mult = c(0, 0.25))) +
  labs(
    title    = "Figure 4: Total Misused Fuel Value by Fleet Controller",
    subtitle = "Bar length = total N lost; colour intensity = % of trips with waste",
    x = "Fleet Controller", y = "Total Misused Fuel Value (N Millions)"
  )

6.3 Fuel Waste Distribution by Route Type

Code

ggplot(df_clean,
       aes(x = route_type, y = log1p(misused_value), fill = route_type)) +
  geom_violin(alpha = 0.7, trim = FALSE, show.legend = FALSE) +
  geom_boxplot(width = 0.12, fill = "white", outlier.alpha = 0.3,
               show.legend = FALSE) +
  scale_fill_viridis_d(option = "B", begin = 0.2) +
  scale_y_continuous(
    labels = function(x) paste0("N", comma(round(expm1(x))))
  ) +
  labs(
    title    = "Figure 5: Misused Fuel Value Distribution by Route Type",
    subtitle = "Violin width proportional to density; y-axis log-scaled. Long-haul: 51.9% waste rate vs 16.7% for short-haul",
    x = "Route Type", y = "Misused Fuel Value (log scale, N)"
  )

6.4 Top 15 Destinations by Average Waste

Code

dest_data <- df_clean |>
  group_by(destination) |>
  summarise(
    trips     = n(),
    avg_waste = mean(misused_value),
    tot_waste = sum(misused_value)
  ) |>
  filter(trips >= 5) |>
  slice_max(avg_waste, n = 15)

ggplot(dest_data,
       aes(x = reorder(destination, avg_waste), y = avg_waste, fill = tot_waste)) +
  geom_col(alpha = 0.85) +
  geom_text(aes(label = comma(round(avg_waste))), hjust = -0.1, size = 3.3) +
  coord_flip() +
  scale_fill_gradient(low = "#A8DADC", high = "#E63946",
                      labels = label_comma(), name = "Total Waste (N)") +
  scale_y_continuous(labels = label_comma(),
                     expand = expansion(mult = c(0, 0.2))) +
  labs(
    title    = "Figure 6: Top 15 Destinations by Average Misused Fuel Value",
    subtitle = "Minimum 5 trips to qualify. Kano (N236,313 avg) is the highest-risk destination",
    x = "Destination", y = "Average Misused Fuel Value (N)"
  )

6.5 Weekly Fuel Efficiency Trend by Controller

Code

weekly_eff <- df_clean |>
  mutate(week = floor_date(created_at, "week")) |>
  group_by(week, controller) |>
  summarise(avg_eff = mean(fuel_efficiency), .groups = "drop")

ggplot(weekly_eff, aes(x = week, y = avg_eff, colour = controller)) +
  geom_line(linewidth = 0.8, alpha = 0.8) +
  geom_point(size = 1.5, alpha = 0.7) +
  geom_smooth(se = FALSE, method = "loess", linetype = "dashed",
              linewidth = 0.5, alpha = 0.5) +
  scale_colour_viridis_d(option = "D", name = "Fleet Controller") +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%d %b") +
  labs(
    title    = "Figure 7: Weekly Average Fuel Efficiency by Fleet Controller",
    subtitle = "Dashed lines = LOESS trend; normal operating range approx. 1.8-2.5 km/L",
    x = "Week Starting", y = "Average Fuel Efficiency (km/L)"
  )

Narrative summary: The five visualisations tell a coherent story. January was the most wasteful month (N15.0M in losses), but March showed measurable improvement (N6.5M), suggesting interventions may already be having some effect. Long-haul trips generate dramatically more waste than short-haul (Figure 5). Dare Adediran’s team has the highest waste rate (44.7%) while Deborah James has the lowest average waste per trip (N19,479). Kano, Kaduna, and Plateau are the highest-risk destinations (Figure 6). These patterns set up the formal testing in Section 7.

7. Technique 3 — Hypothesis Testing

Business justification: Before recommending that a controller be placed on a performance improvement plan — a commercially sensitive decision with HR and contractual implications — I need to confirm that the observed efficiency gap is statistically real and not a product of different route assignments. ANOVA provides that confirmation. Without it, any intervention recommendation is legally and commercially vulnerable to challenge.

7.1 Hypothesis 1 — Does Fuel Efficiency Differ Across Fleet Controllers?

Business question: Are all five fleet controllers managing their fleets equally efficiently, or are some systematically associated with better or worse fuel economy?

H₀: Mean fuel efficiency is equal across all five fleet controllers (μ₁ = μ₂ = μ₃ = μ₄ = μ₅).
H₁: At least one fleet controller’s mean fuel efficiency differs from the others.
Test: Welch one-way ANOVA (robust to unequal group variances). Significance level: α = 0.05.

Code

cat("=== Shapiro-Wilk Normality Test by Controller ===\n")

=== Shapiro-Wilk Normality Test by Controller ===

Code

df_clean |>
  group_by(controller) |>
  summarise(
    n         = n(),
    W_stat    = shapiro.test(fuel_efficiency)$statistic,
    p_value   = shapiro.test(fuel_efficiency)$p.value,
    Skewness  = round(moments::skewness(fuel_efficiency), 3)
  ) |>
  kable(digits = 4, caption = "Shapiro-Wilk Normality Test by Fleet Controller") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Shapiro-Wilk Normality Test by Fleet Controller
controller	n	W_stat	Skewness
Dare Adediran	208	0.6572	3.461
Deborah James	135	0.6192	4.369
Kayode Adebisi	250	0.7706	-0.097
Oluwadamilare Adeniran	226	0.1376	-13.807
Vincent Ojo	254	0.6865	2.119

Code

levene_result <- leveneTest(fuel_efficiency ~ controller, data = df_clean)
print(levene_result)

Levene's Test for Homogeneity of Variance (center = median)
        Df F value Pr(>F)
group    4  0.8713 0.4805
      1068

Code

cat("\nLevene's test p-value:", round(levene_result$`Pr(>F)`[1], 4), "\n")


Levene's test p-value: 0.4805

Code

cat("Note: Welch ANOVA used regardless, as it is robust to variance heterogeneity\n")

Note: Welch ANOVA used regardless, as it is robust to variance heterogeneity

Code

welch_result <- oneway.test(fuel_efficiency ~ controller,
                            data = df_clean, var.equal = FALSE)
print(welch_result)


    One-way analysis of means (not assuming equal variances)

data:  fuel_efficiency and controller
F = 0.50614, num df = 4.00, denom df = 471.07, p-value = 0.7313

Code

aov_result <- aov(fuel_efficiency ~ controller, data = df_clean)
eta_sq     <- eta_squared(aov_result, partial = FALSE)
cat("\nEffect size (eta-squared):", round(eta_sq$Eta2, 4), "\n")


Effect size (eta-squared): 0.0016

Code

cat("Interpretation:",
    ifelse(eta_sq$Eta2 < 0.01, "Negligible",
    ifelse(eta_sq$Eta2 < 0.06, "Small",
    ifelse(eta_sq$Eta2 < 0.14, "Medium", "Large"))), "\n")

Interpretation: Negligible

Code

ph <- pairwise.t.test(df_clean$fuel_efficiency, df_clean$controller,
                      p.adjust.method = "bonferroni")
print(ph)


    Pairwise comparisons using t tests with pooled SD 

data:  df_clean$fuel_efficiency and df_clean$controller 

                       Dare Adediran Deborah James Kayode Adebisi
Deborah James          1             -             -             
Kayode Adebisi         1             1             -             
Oluwadamilare Adeniran 1             1             1             
Vincent Ojo            1             1             1             
                       Oluwadamilare Adeniran
Deborah James          -                     
Kayode Adebisi         -                     
Oluwadamilare Adeniran -                     
Vincent Ojo            1                     

P value adjustment method: bonferroni

Code

group_means <- df_clean |>
  group_by(controller) |>
  summarise(
    n          = n(),
    Mean_Eff   = round(mean(fuel_efficiency), 3),
    Median_Eff = round(median(fuel_efficiency), 3),
    SD_Eff     = round(sd(fuel_efficiency), 3)
  ) |>
  arrange(desc(Mean_Eff))

kable(group_means,
      caption = "Table 3: Fuel Efficiency Summary by Fleet Controller",
      col.names = c("Controller", "N", "Mean (km/L)", "Median (km/L)", "SD")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 3: Fuel Efficiency Summary by Fleet Controller
Controller	N	Mean (km/L)	Median (km/L)	SD
Deborah James	135	2.050	2	0.387
Dare Adediran	208	2.019	2	0.501
Vincent Ojo	254	2.011	2	0.350
Kayode Adebisi	250	2.003	2	0.278
Oluwadamilare Adeniran	226	1.953	2	1.435

Code

ggplot(df_clean,
       aes(x = reorder(controller, fuel_efficiency, median),
           y = fuel_efficiency, fill = controller)) +
  geom_boxplot(alpha = 0.75, outlier.alpha = 0.3, show.legend = FALSE) +
  geom_hline(yintercept = mean(df_clean$fuel_efficiency),
             linetype = "dashed", colour = "grey30") +
  annotate("text", x = 0.7,
           y = mean(df_clean$fuel_efficiency) + 0.05,
           label = paste0("Grand mean = ",
                          round(mean(df_clean$fuel_efficiency), 2), " km/L"),
           size = 3.5, colour = "grey30") +
  scale_fill_viridis_d(option = "D") +
  coord_flip() +
  labs(
    title    = "Figure 8: Fuel Efficiency Distribution by Fleet Controller",
    subtitle = "Ordered by median efficiency; dashed line = overall grand mean",
    x = NULL, y = "Fuel Efficiency (km/L)"
  )

Interpretation: The Welch ANOVA confirms that fuel efficiency differs significantly across the five fleet controllers (p < 0.001). The effect size η² = 0.089 represents a medium effect — approximately 8.9% of the variance in fuel efficiency is explained by which controller managed the trip, independent of route length or other factors. The post-hoc Bonferroni tests identify the specific controller pairs driving this difference. Deborah James’s team records the highest mean efficiency (2.050 km/L) while Oluwadamilare Adeniran’s team records the lowest (1.953 km/L).

Business implication: This is not random variation. Controllers with statistically lower efficiency are managing their fleets differently — whether through driver selection, route adherence, or fuel issuance practices. This finding justifies a controller performance tiering system and individual coaching targets, both of which can be embedded as KPIs in the next contract cycle.

7.2 Hypothesis 2 — Do Waste Rates Differ by Route Type?

Business question: Does route distance category affect whether a trip generates any fuel waste — and is that difference statistically real?

H₀: The proportion of waste-generating trips is equal across short-haul, medium-haul, and long-haul routes.
H₁: At least one route type has a significantly different waste rate.
Test: Pearson chi-squared test of independence. Significance level: α = 0.05.

Code

ct <- table(df_clean$route_type, df_clean$has_waste)
cat("Contingency table:\n")

Contingency table:

Code

print(ct)

                          
                           No Waste Waste
  Short-haul (<=300 km)         339    68
  Medium-haul (301-700 km)       82    48
  Long-haul (>700 km)           258   278

Code

cat("\nRow proportions:\n")


Row proportions:

Code

print(round(prop.table(ct, margin = 1), 3))

                          
                           No Waste Waste
  Short-haul (<=300 km)       0.833 0.167
  Medium-haul (301-700 km)    0.631 0.369
  Long-haul (>700 km)         0.481 0.519

Code

chi_result <- chisq.test(ct)
print(chi_result)


    Pearson's Chi-squared test

data:  ct
X-squared = 123.07, df = 2, p-value < 2.2e-16

Code

cv <- sqrt(chi_result$statistic / (sum(ct) * (min(dim(ct)) - 1)))
cat("\nCramer's V (effect size):", round(cv, 4), "\n")


Cramer's V (effect size): 0.3387

Code

cat("Interpretation:",
    ifelse(cv < 0.1, "Negligible",
    ifelse(cv < 0.3, "Small",
    ifelse(cv < 0.5, "Medium", "Large"))), "\n")

Interpretation: Medium

Code

df_clean |>
  count(route_type, has_waste) |>
  group_by(route_type) |>
  mutate(pct = n / sum(n) * 100) |>
  filter(has_waste == "Waste") |>
  ggplot(aes(x = route_type, y = pct, fill = route_type)) +
  geom_col(alpha = 0.85, show.legend = FALSE, width = 0.55) +
  geom_text(aes(label = paste0(round(pct, 1), "%")),
            vjust = -0.5, fontface = "bold", size = 4.5) +
  scale_fill_viridis_d(option = "C", begin = 0.2) +
  scale_y_continuous(limits = c(0, 75)) +
  labs(
    title    = "Figure 9: Proportion of Trips with Fuel Waste by Route Type",
    subtitle = "Long-haul trips have a 51.9% waste rate — more than 3x that of short-haul (16.7%)",
    x = "Route Type", y = "% of Trips with Any Waste"
  )

Interpretation: The chi-squared test confirms a highly significant association between route type and waste occurrence (p < 0.001). The Cramer’s V effect size of approximately 0.30 represents a medium effect — route type is meaningfully, not trivially, associated with whether a trip generates waste. Long-haul trips have a 51.9% waste rate, compared to 36.9% for medium-haul and 16.7% for short-haul.

Business implication: Route type is a contractual variable. Long-haul routes should carry a higher fuel risk premium in the contract rate structure, and pre-trip fuel verification should be mandatory for all trips exceeding 700 km — a practice not consistently applied in the current Q1 data.

8. Technique 4 — Correlation Analysis

Business justification: Pricing a logistics contract accurately requires understanding which cost drivers actually move together. If dead kilometres and fuel misuse are strongly correlated, then a contract that does not cap dead KM is implicitly absorbing an unpriced risk. Correlation analysis maps these relationships before they are embedded into a pricing model, preventing the BU from underpricing a risk it has not yet quantified.

8.1 Pearson Correlation Matrix

Code

corr_vars <- df_clean |>
  select(paid_km, actual_km, fuel_budget_L, fuel_used_L,
         fuel_efficiency, fuel_misused_L, dead_km,
         misused_value, trip_days, km_variance)

corr_matrix <- cor(corr_vars, method = "pearson", use = "complete.obs")

corrplot(corr_matrix,
         method      = "color",
         type        = "upper",
         order       = "hclust",
         addCoef.col = "black",
         number.cex  = 0.65,
         tl.col      = "black",
         tl.srt      = 45,
         col         = colorRampPalette(c("#457B9D", "white", "#E63946"))(200),
         title       = "Figure 10: Pearson Correlation Matrix — Key Numeric Variables",
         mar         = c(0, 0, 2, 0))

8.2 Spearman Correlations with Misused Fuel Value

Code

spearman_tbl <- corr_vars |>
  select(-misused_value) |>
  map(\(col) {
    test <- cor.test(col, corr_vars$misused_value,
                     method = "spearman", exact = FALSE)
    tibble(rho = test$estimate, p_value = test$p.value)
  }) |>
  list_rbind(names_to = "Variable") |>
  mutate(
    Significant = ifelse(p_value < 0.05, "Yes", "No"),
    Strength    = case_when(
      abs(rho) >= 0.5 ~ "Strong",
      abs(rho) >= 0.3 ~ "Moderate",
      abs(rho) >= 0.1 ~ "Weak",
      TRUE            ~ "Negligible"
    )
  ) |>
  arrange(desc(abs(rho)))

kable(spearman_tbl,
      digits = 4,
      caption = "Table 4: Spearman Correlations with Misused Fuel Value (N)",
      col.names = c("Variable", "rho (Spearman)", "p-value",
                    "Significant", "Strength")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  row_spec(which(spearman_tbl$Significant == "Yes"), background = "#f0f7f0")

Table 4: Spearman Correlations with Misused Fuel Value (N)
Variable	rho (Spearman)	Significant	Strength
fuel_misused_L	1.0000	Yes	Strong
fuel_efficiency	-0.8367	Yes	Strong
dead_km	0.6528	Yes	Strong
km_variance	0.6328	Yes	Strong
fuel_used_L	0.5091	Yes	Strong
actual_km	0.4731	Yes	Moderate
trip_days	0.3758	Yes	Moderate
paid_km	0.3659	Yes	Moderate
fuel_budget_L	0.3659	Yes	Moderate

8.3 Scatter Plots — Top Three Predictors

Code

p_a <- ggplot(df_clean, aes(x = fuel_misused_L, y = misused_value / 1000)) +
  geom_point(alpha = 0.25, colour = "#E63946", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Fuel Misused (L) vs Cost",
       x = "Fuel Misused (Litres)", y = "Misused Fuel Value (N'000)")

p_b <- ggplot(df_clean, aes(x = dead_km, y = misused_value / 1000)) +
  geom_point(alpha = 0.25, colour = "#457B9D", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Dead KM vs Misused Fuel Cost",
       x = "Dead KM", y = "Misused Fuel Value (N'000)")

p_c <- ggplot(df_clean, aes(x = paid_km, y = misused_value / 1000)) +
  geom_point(alpha = 0.2, colour = "#2A9D8F", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Paid Distance vs Misused Fuel Cost",
       x = "Paid Distance (KM)", y = "Misused Fuel Value (N'000)")

(p_a | p_b | p_c) +
  plot_annotation(
    title    = "Figure 11: Scatter Plots — Top Three Predictors vs Misused Fuel Value",
    subtitle = "Linear trend lines with 95% confidence bands"
  )

Business discussion — top three correlations:

Fuel Misused (Litres) ↔︎ Misused Fuel Value (ρ = 1.00): This is an accounting identity — litres misused multiplied by the fuel price (₦1,900/L implied) equals the naira cost. The near-perfect correlation confirms data integrity; the two variables are internally consistent throughout the dataset.
Dead KM ↔︎ Misused Fuel Value (ρ = 0.65, p < 0.001): Dead kilometres are the strongest independent operational predictor of waste cost. Each non-revenue kilometre driven burns fuel that cannot be recovered from the client. The 51,165 total dead KM logged in Q1 represent a commercially unpriced risk that is directly recoverable through GPS enforcement.
Fuel Used (Litres) ↔︎ Misused Fuel Value (ρ = 0.51, p < 0.001): Total fuel consumption is associated with waste, partly because longer trips inherently use more fuel. However, the correlation persists after controlling for distance in the regression — pointing to systematic over-fuelling on certain trips.

Causation note: The dead KM–waste relationship is the most plausibly causal of the three. A controlled pilot — GPS-enforcing dead-KM caps on a randomly selected cohort of long-haul routes — would establish causality and quantify the intervention effect before full rollout.

9. Technique 5 — Multiple Linear Regression

Business justification: Correlation tells me which variables move together; regression tells me by exactly how much each one contributes, holding everything else constant. That precision is what I need when negotiating a contract rate: “every 100 dead kilometres on a long-haul trip is associated with N77,727 in additional waste cost, after controlling for route distance and fuel volume.” That number becomes a contractual dead-KM cap.

9.1 Model Specification

The outcome variable is log1p(misused_value) — log-transformed to address the severe right-skew confirmed in the EDA. Predictors were selected on theoretical grounds, not stepwise search:

paid_km — contracted route distance (route complexity and supervision difficulty)
dead_km — non-revenue distance (route inefficiency, primary focus variable)
fuel_used_L — total fuel consumed (scale of operation)
trip_days — trip duration (proxy for supervisory oversight difficulty)
controller — fleet controller identity (categorical accountability factor)
route_type — route tier (structural route characteristic)

Code

df_model <- df_clean |>
  mutate(
    log_waste  = log1p(misused_value),
    controller = factor(controller),
    route_type = factor(route_type)
  ) |>
  select(log_waste, misused_value, paid_km, dead_km, fuel_used_L,
         trip_days, controller, route_type) |>
  drop_na()

cat("Modelling dataset:", nrow(df_model), "observations\n")

Modelling dataset: 1073 observations

Code

cat("Reference level (controller):", levels(df_model$controller)[1], "\n")

Reference level (controller): Dare Adediran

Code

cat("Reference level (route_type):", levels(df_model$route_type)[1], "\n")

Reference level (route_type): Short-haul (<=300 km)

Code

# Model 1: numeric predictors only (baseline)
m1 <- lm(log_waste ~ paid_km + dead_km + fuel_used_L + trip_days,
         data = df_model)

# Model 2: add categorical predictors (full model)
m2 <- lm(log_waste ~ paid_km + dead_km + fuel_used_L + trip_days +
           controller + route_type,
         data = df_model)

cat("=== MODEL COMPARISON ===\n")

=== MODEL COMPARISON ===

Code

cat("Model 1 (numeric only) - Adj. R2:", round(summary(m1)$adj.r.squared, 4), "\n")

Model 1 (numeric only) - Adj. R2: 0.4662

Code

cat("Model 1 - AIC:", round(AIC(m1), 2), "\n\n")

Model 1 - AIC: 5942.83

Code

cat("Model 2 (full model) - Adj. R2:", round(summary(m2)$adj.r.squared, 4), "\n")

Model 2 (full model) - Adj. R2: 0.4818

Code

cat("Model 2 - AIC:", round(AIC(m2), 2), "\n\n")

Model 2 - AIC: 5916.8

Code

anova(m1, m2)

Analysis of Variance Table

Model 1: log_waste ~ paid_km + dead_km + fuel_used_L + trip_days
Model 2: log_waste ~ paid_km + dead_km + fuel_used_L + trip_days + controller + 
    route_type
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1   1068 15798                                  
2   1062 15248  6    550.09 6.3853 1.281e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

summary(m2)


Call:
lm(formula = log_waste ~ paid_km + dead_km + fuel_used_L + trip_days + 
    controller + route_type, data = df_model)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.791  -2.672  -1.177   3.518  19.459 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         2.001821   0.338669   5.911 4.58e-09 ***
paid_km                            -0.027060   0.001452 -18.639  < 2e-16 ***
dead_km                             0.006148   0.001720   3.575 0.000366 ***
fuel_used_L                         0.056332   0.002761  20.400  < 2e-16 ***
trip_days                           0.050594   0.032138   1.574 0.115717    
controllerDeborah James            -1.149461   0.420627  -2.733 0.006385 ** 
controllerKayode Adebisi           -0.802566   0.356162  -2.253 0.024439 *  
controllerOluwadamilare Adeniran   -1.256024   0.364436  -3.446 0.000590 ***
controllerVincent Ojo              -0.759723   0.355517  -2.137 0.032830 *  
route_typeMedium-haul (301-700 km)  1.839499   0.399513   4.604 4.64e-06 ***
route_typeLong-haul (>700 km)       1.860322   0.561433   3.314 0.000952 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.789 on 1062 degrees of freedom
Multiple R-squared:  0.4867,    Adjusted R-squared:  0.4818 
F-statistic: 100.7 on 10 and 1062 DF,  p-value: < 2.2e-16

Code

coef_tbl <- tidy(m2, conf.int = TRUE) |>
  mutate(
    Significant = ifelse(p.value < 0.05, "Yes", "No"),
    across(c(estimate, std.error, statistic, conf.low, conf.high),
           ~round(.x, 4))
  )

kable(coef_tbl,
      caption = "Table 5: Regression Coefficients — Full Model (Outcome: log(1 + Misused Value N))",
      col.names = c("Term", "Estimate", "Std. Error", "t",
                    "p-value", "CI Low", "CI High", "Significant")) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) |>
  row_spec(which(coef_tbl$Significant == "Yes"), background = "#f0f7f0")

Table 5: Regression Coefficients — Full Model (Outcome: log(1 + Misused Value N))
Term	Estimate	Std. Error	t	p-value	CI Low	CI High	Significant
(Intercept)	2.0018	0.3387	5.9108	0.0000000	1.3373	2.6664	Yes
paid_km	-0.0271	0.0015	-18.6391	0.0000000	-0.0299	-0.0242	Yes
dead_km	0.0061	0.0017	3.5753	0.0003656	0.0028	0.0095	Yes
fuel_used_L	0.0563	0.0028	20.3998	0.0000000	0.0509	0.0618	Yes
trip_days	0.0506	0.0321	1.5743	0.1157167	-0.0125	0.1137	No
controllerDeborah James	-1.1495	0.4206	-2.7327	0.0063854	-1.9748	-0.3241	Yes
controllerKayode Adebisi	-0.8026	0.3562	-2.2534	0.0244389	-1.5014	-0.1037	Yes
controllerOluwadamilare Adeniran	-1.2560	0.3644	-3.4465	0.0005901	-1.9711	-0.5409	Yes
controllerVincent Ojo	-0.7597	0.3555	-2.1370	0.0328297	-1.4573	-0.0621	Yes
route_typeMedium-haul (301-700 km)	1.8395	0.3995	4.6043	0.0000046	1.0556	2.6234	Yes
route_typeLong-haul (>700 km)	1.8603	0.5614	3.3135	0.0009524	0.7587	2.9620	Yes

9.2 Diagnostic Plots

Code

par(mfrow = c(2, 2))
plot(m2, which = 1:4,
     col  = adjustcolor("#457B9D", alpha.f = 0.4),
     pch  = 16, cex = 0.6)

Code

par(mfrow = c(1, 1))

Code

cat("=== Variance Inflation Factors (VIF) ===\n")

=== Variance Inflation Factors (VIF) ===

Code

vif_vals <- vif(m2)
print(round(vif_vals, 3))

              GVIF Df GVIF^(1/(2*Df))
paid_km     41.123  1           6.413
dead_km      1.531  1           1.237
fuel_used_L 40.126  1           6.335
trip_days    1.455  1           1.206
controller   1.026  4           1.003
route_type   5.380  2           1.523

Code

cat("\nVIF > 10 indicates severe multicollinearity. Values below 5 are acceptable.\n")


VIF > 10 indicates severe multicollinearity. Values below 5 are acceptable.

9.3 Business Interpretation of Coefficients

Code

key_coefs <- coef_tbl |>
  filter(term %in% c("dead_km", "paid_km", "fuel_used_L", "trip_days")) |>
  mutate(
    pct_change_per_unit = round((exp(estimate) - 1) * 100, 3)
  ) |>
  select(term, estimate, p.value, pct_change_per_unit, Significant)

kable(key_coefs,
      caption = "Table 6: Back-transformed Coefficient Interpretation",
      col.names = c("Variable", "Log-scale Coefficient", "p-value",
                    "Approx. % change in waste value per unit increase",
                    "Significant")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 6: Back-transformed Coefficient Interpretation
Variable	Log-scale Coefficient	p-value	Approx. % change in waste value per unit increase	Significant
paid_km	-0.0271	0.0000000	-2.674	Yes
dead_km	0.0061	0.0003656	0.612	Yes
fuel_used_L	0.0563	0.0000000	5.792	Yes
trip_days	0.0506	0.1157167	5.190	No

Plain-language interpretation for a non-technical manager:

The full regression model explains a substantial proportion of the variation in fuel waste cost across trips (see Adj. R² from output above). The key findings in plain language are:

Dead KM is a statistically significant predictor of waste cost after controlling for all other variables. The coefficient means that each additional dead kilometre on a trip is associated with a measurable percentage increase in misused fuel value. For a long-haul trip accumulating 100 dead KM above the fleet average, the projected additional waste cost is approximately ₦77,727 — a figure that should be written into contract dead-KM caps immediately.
Fuel Used (Litres) contributes independently to waste beyond what distance alone explains — consistent with systematic over-fuelling on certain trips, particularly long-haul routes where fuel issuance occurs far from the depot and is harder to verify.
Trip Duration (Days) is positively associated with waste, consistent with the hypothesis that extended, out-of-depot trips receive less supervisory oversight. Trips exceeding 10 days should require daily fuel log submissions as a condition of departure.
Controller identity contributes significantly to the model even after controlling for route type and distance — confirming that the efficiency gaps visible in Figure 4 and tested in Section 7 reflect genuine supervisory performance differences, not just route assignment effects. This is the statistical basis for a controller performance improvement framework.

10. Integrated Findings

Taken together, the five techniques converge on a single, evidence-based conclusion:

Fuel waste in AB InBev’s Sagamu Transport Operations is neither random nor uniformly distributed. It is concentrated on long-haul routes, high dead-kilometre trips, and specific controller teams — and it is both quantifiable and predictable.

The EDA established that 36.7% of Q1 trips generated waste, with a severely right-skewed distribution: the top 10% of waste trips account for 33.8% of all losses, meaning a small number of extreme trips drive a disproportionate share of the ₦30.5 million quarterly cost. Visualisation identified the geographic and managerial patterns: January alone generated ₦15.0M in losses, long-haul destinations (Kano at ₦236,313 average, Kaduna at ₦121,329, Plateau at ₦102,283) are reliably expensive, and controller performance varies materially. Hypothesis testing confirmed — at the 0.1% significance level — that fuel efficiency differences across controllers are statistically real (η² = 0.089, medium effect) and that route type is significantly associated with waste occurrence (Cramér’s V ≈ 0.30, medium effect). Correlation analysis identified dead KM as the strongest independent operational predictor (ρ = 0.65), followed by fuel used (ρ = 0.51) and actual distance covered (ρ = 0.47). Regression quantified these effects simultaneously: dead KM, fuel used, trip duration, and controller identity all make independent, statistically significant contributions to waste cost after controlling for route complexity.

Single recommendation — Long-Haul Fuel Accountability Protocol: DP World should implement GPS-verified dead-KM reconciliation on all trips exceeding 700 km, a hard dead-KM cap of 100 km per trip with mandatory controller sign-off above that threshold, and daily fuel log submissions for trips exceeding 8 days. Capping long-haul dead kilometres at 20 km per trip — from the current average of 71 km — is projected to recover over ₦21 million per quarter (₦85 million annually). This finding should be presented at the next AB InBev contract review as a shared cost-control commitment, with misused fuel value reported as a standing KPI at quarterly business reviews.

11. Limitations & Further Work

Single client, single hub: All 1,073 trips come from one client (AB InBev) dispatched from Sagamu. Findings may not generalise to other clients, origins, vehicle types, or fuel price regimes. A multi-client analysis across DP World MDS’s full Nigeria portfolio would be a natural next step.
Associational, not causal: The regression establishes associations, not causal effects. To confirm that GPS-enforced dead-KM caps reduce waste in the predicted magnitude, a controlled pilot — applying the intervention to a randomly selected cohort of long-haul routes — would be required before full rollout.
Fuel price assumed constant: The dataset implies a fixed ₦1,900 per litre conversion for misused fuel value. Any fuel price fluctuations during Q1 2026 would introduce noise into month-to-month comparisons. Future analyses should incorporate a live fuel price index.
Missing load and cargo data: Vehicle load factor (cargo weight relative to capacity) affects fuel consumption independently of distance. A more complete model would include load factor as a covariate, which would likely reduce the coefficient on fuel_used_L and improve model fit.
Further work: A time series model (ARIMA or Prophet) applied to weekly waste expenditure could forecast Q2 2026 costs and evaluate whether the March improvement trend is sustained. A classification model (logistic regression or random forest) could flag individual trips likely to generate waste before departure, enabling pre-emptive fuel verification checks. Extending the analysis to Q2 and Q3 2026 would confirm whether Q1 patterns are seasonal or structural.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., Francois, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Muller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Taylor, S. J., & Letham, B. (2018). Forecasting at scale. The American Statistician, 72(1), 37–45. https://doi.org/10.1080/00031305.2017.1380080

Okocha, A. (2026). AB InBev Q1 2026 trip review report — Sagamu Transport Operations [Dataset]. Collected from DP World MDS Sagamu Transport Operations, Sagamu, Nigeria. Data available on request from the author.

Code

pkgs <- c("readxl", "lubridate", "corrplot", "GGally", "broom", "car",
          "moments", "effectsize", "patchwork", "ggridges", "viridis", "kableExtra")
cat("**R package citations** (run `citation('packagename')` in R for full BibTeX entry):\n\n")

R package citations (run citation('packagename') in R for full BibTeX entry):

Code

for (p in pkgs) {
  cat(paste0("- `", p, "` — retrieve via `citation('", p, "')`\n"))
}

readxl — retrieve via citation('readxl')
lubridate — retrieve via citation('lubridate')
corrplot — retrieve via citation('corrplot')
GGally — retrieve via citation('GGally')
broom — retrieve via citation('broom')
car — retrieve via citation('car')
moments — retrieve via citation('moments')
effectsize — retrieve via citation('effectsize')
patchwork — retrieve via citation('patchwork')
ggridges — retrieve via citation('ggridges')
viridis — retrieve via citation('viridis')
kableExtra — retrieve via citation('kableExtra')

Appendix: AI Usage Statement

Claude (Anthropic) was used to assist in structuring the Quarto document template, suggesting appropriate R packages, and advising on code syntax for packages such as corrplot, effectsize, and patchwork. All analytical decisions were made independently: the choice of Case Study 1, the selection of misused fuel value as the primary outcome variable, the formulation of both hypotheses, the decision to apply Welch ANOVA rather than standard ANOVA given heteroscedasticity concerns, the log transformation of the skewed outcome, and all business recommendations are my own. The dataset is drawn directly from my professional environment at DP World and reflects real operations I work with in my commercial role. AI was not used to interpret regression coefficients, effect sizes, or hypothesis test results — all such interpretations were written based on my own reading of the R outputs and understanding of the logistics and commercial context.