Reducing Fuel misuse and Improving Profitability per Trip

Data Analytics II — Case Study 1: Exploratory & Inferential Analytics | AB InBev Sagamu Transport Operations — Q1 2026

Author

Anthonia Okocha

Published

May 23, 2026

Abstract

This study applies five exploratory and inferential analytics techniques to 1,073 truck-trip records from DP World’s Sagamu Transport Operations hub, covering AB InBev deliveries across Nigeria in Q1 2026. The central business problem is fuel misuse: 36.7% of all trips generated misused fuel costs totalling ₦30.5 million, with long-haul routes (>700 km) accounting for 77.7% of losses. Dead kilometres emerge as the strongest predictor of financial waste (Spearman ρ = 0.65, p < 0.001), with every 100 dead kilometres associated with ₦77,727 in misused fuel value. Hypothesis testing confirms significant fuel efficiency differences across fleet controllers (Welch ANOVA, p < 0.001, η² = 0.089). Findings support implementing a Long-Haul Fuel Accountability Protocol projected to recover over ₦21 million per quarter — a material improvement in BU profitability and contract renewal positioning.

🎓 Course: Data Analytics II | 📋 Assessment: Capstone Case Study 1 | 👨‍🏫 Instructor: Prof Bongo Adi | 🏛️ Institution: Lagos Business School | 📅 Date: May 2026

1 Executive Summary

This study applies five exploratory and inferential analytics techniques to 1,073 truck-trip records generated by AB InBev’s Sagamu Transport Operations hub between January and March 2026. The dataset was extracted directly from DP World MDS’s internal fleet management system and represents a complete census of all dispatched shipments over the quarter.

The central business problem is fuel waste and its financial cost. Across Q1 2026, 394 of 1,073 trips (36.7%) recorded positive fuel misuse, generating total losses of ₦30,472,200 — equivalent to approximately ₦122 million annually. The analyses reveal that:

Misused fuel value is highly right-skewed: the top 10% of waste-generating trips account for 33.8% of all losses
Fuel efficiency differs significantly across the five fleet controllers (Welch ANOVA, p < 0.001, η² = 0.089)
Long-haul routes (>700 km) carry a 51.9% waste rate and generate 77.7% of total losses despite representing 50% of trips
Dead kilometres (mean: 47.7 km/trip; total Q1: 51,165 km) are the strongest operational predictor of financial waste (Spearman ρ = 0.65, p < 0.001)
Every 100 dead kilometres is associated with ₦77,727 in misused fuel value

Recommendation: DP World should implement a Long-Haul Fuel Accountability Protocol — enforcing GPS-verified dead-KM caps, daily fuel log submission on trips exceeding 8 days, and controller performance tiering. Capping long-haul dead kilometres at 20 km per trip is projected to recover over ₦21 million per quarter.

2 Professional Disclosure

2.1 Role and Organisation

I am a Business Development Specialist at DP World, a global ports, logistics, and supply chain conglomerate operating across 77 countries. Within the Nigerian Business Unit, my role sits in the commercial department where I am responsible for developing growth strategies, conducting cost modelling for commercial proposals, managing responses to tenders (RFIs, RFPs, RFQs), revising contract rates to optimise BU profitability, and identifying commercial risks pre- and post-contract execution.

The dataset analysed here — 1,073 truck trips dispatched from the Sagamu Transport Operations hub for AB InBev across Q1 2026 — comes directly from DP World’s internal fleet management system for a client account I manage commercially. Every variable maps to a metric I monitor: paid versus actual distance informs route pricing accuracy; fuel efficiency feeds into cost-per-kilometre benchmarks used in commercial proposals; misused fuel value quantifies the leakage that compresses BU profitability; and dead KM is a direct proxy for asset underutilisation — a KPI I am accountable for driving down.

2.2 Technique Operational Relevance

Technique 1 — Exploratory Data Analysis: A core deliverable in my role is conducting cost modelling for commercial proposals. No cost model is credible if built on unchecked data. Before I can price a rate for AB InBev or present a business case to BU leadership, I need to establish the true baseline of operational performance — not dashboard averages, but the full distribution of fuel costs including outliers, anomalies, and data quality failures that would distort the model. EDA on this dataset identifies impossible fuel efficiency readings, zero-distance trips, and severe right-skew in misused fuel value before any analysis is built on top of it.

Technique 2 — Data Visualisation: My role explicitly requires supporting the preparation and presentation of commercial proposals and driving client engagements throughout the sales cycle. Both activities demand visual communication — procurement directors and operations managers do not read regression tables; they read charts. The five visualisations in this project are structured as I would present them in a client quarterly business review: showing which destinations generate unsustainable waste, how performance varies across controllers, and how cost leakage trends month-on-month. A chart showing a client their own operational waste in naira terms is also a commercial tool — it builds the case for a revised contract structure or performance-linked rate.

Technique 3 — Hypothesis Testing: My role requires supporting the identification and management of commercial risks pre, during, and post service engagement. A key commercial risk in haulage is systematic underperformance by a specific controller team — because if persistent, it can trigger SLA breaches, erode client satisfaction scores, and threaten contract renewal. Before recommending a performance improvement plan, I need to confirm that the observed efficiency gap is statistically real and not a product of different route assignments. ANOVA provides that confirmation in a form that is defensible in a client performance meeting or an internal HR process.

Technique 4 — Correlation Analysis: My role requires revising applicable rates to optimise income from existing clients and collaborating with other departments to determine the most cost-effective rate. Both tasks require knowing which operational variables actually drive cost outcomes. If dead kilometres and misused fuel value are strongly correlated, then any contract that does not cap dead KM is absorbing an unpriced commercial risk — one that compresses the BU margin I am accountable for protecting. Correlation analysis maps these relationships before they are embedded into a pricing model or contract term.

Technique 5 — Multiple Linear Regression: My role requires managing responses to RFIs, RFPs, and RFQs and negotiating commercial proposals. Both tasks require me to defend a rate with data. Regression on this dataset produces the precise, coefficient-level answer I need: how much does each additional dead kilometre or extra day on the road add to misused fuel cost, holding all other factors constant? That number translates directly into a pricing adjustment, a contractual dead-KM cap, or a risk premium I can justify to a client or to BU finance.

3 Data Collection & Sampling

3.1 Source and Collection Method

The dataset was exported directly from DP World’s internal fleet management and trip-tracking platform, which records every shipment dispatched from the Sagamu Transport Operations hub for AB InBev. Data extraction was performed by downloading the Q1 trip review report for the period January 1 – March 31, 2026, in Excel format (.xlsx). The file (TripReviewReport_ABNBEV_JAN_FEB_MAR.xlsx) was accessed through the internal system available to me in my commercial role — the same system I use when preparing client performance reports and rate revision proposals for BU leadership.

All fields — including distances, fuel figures, and timestamps — are system-generated at the point of trip completion, with fuel reconciliation performed by fleet controllers upon vehicle return. No manual data entry was involved.

3.2 Sampling Frame and Sample Size

This dataset represents a complete census of all AB InBev trips dispatched from the Sagamu hub during Q1 2026. Every trip recorded in the system for the period is included. The final clean dataset contains 1,073 observations across 20 variables, comfortably exceeding the minimum requirement of 100 observations and 5 variables.

Attribute	Detail
Client	AB InBev Nigeria
Operation Site	Sagamu Transport Operations
Operator	DP World MDS
Period Covered	01 January 2026 – 31 March 2026
Total Trips (raw)	1,104
Total Trips (clean)	1,073
Variables	20 (6 numeric continuous, 4 numeric integer, 4 categorical, 2 datetime, 4 identifier/text)
Fleet Controllers	5 individuals
Distinct Destinations	40+ cities/states across Nigeria

3.3 Variable Description

Variable	Type	Description
Shipment Number	Identifier	Unique trip ID
Fleet Controller	Categorical	Supervisor responsible for the trip
Vehicle	Categorical	Truck registration number
Delivery Officer	Categorical	Driver assigned
Created At	Date	Trip dispatch date
Completed At	Date	Trip completion date
Origin	Categorical	Departure location
Destination	Categorical	Delivery location
Paid Distance (KM)	Numeric	Contracted/agreed trip distance
Distance Covered (KM)	Numeric	Actual GPS-recorded distance
Fuel Budget (Litres)	Numeric	Approved fuel allocation
Fuel Used (Litres)	Numeric	Actual fuel consumed
Fuel Consumption	Numeric	Litres per 100 KM
Fuel Efficiency	Numeric	KM per litre ratio
Fuel Misused (Litres)	Numeric	Fuel used beyond budget
Dead KM	Numeric	Non-revenue kilometres driven
Misused Fuel Value (Naira)	Outcome	Financial cost of misused fuel (₦)

3.4 Ethical Notes

Driver names and delivery officer identities are present in the operational data. For this published document, individuals are referenced only at the controller and vehicle level — no individual driver is singled out in the analysis. The data is used with organisational awareness for academic purposes under LBS guidelines.

4 Data Loading, Cleaning & Description

Show Code

df_raw <- read_excel("TripReviewReport_ABNBEV_JAN_FEB_MAR.xlsx")

cat("Raw dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")

Raw dimensions: 1104 rows x 20 columns

Show Code

df_raw %>%
  head(10) %>%
  kable(
    caption = "Preview of Raw Trip Review Dataset",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0,
           bold = TRUE,
           color = "white",
           background = "#1F4E79") %>%
  scroll_box(width = "100%", height = "400px")

Preview of Raw Trip Review Dataset
S/N	Client	Shipment Number	Operation Site	Fleet Controller	Vehicle	Delivery Officer	Created At	Completed At	Origin	Destination	Paid Distance (KM)	Distance Covered(KM)	Fuel Budget (Litres)	Fuel Used (Litres)	Fuel Consumption	Fuel Efficiency	Fuel Misused(Litres)	Dead KM	Misused Fuel Value(Naira)
1	AB InBev	206855	Sagamu Transport Operations	Vincent Ojo	T32805LA	Alabi sunday Kayode	02-Jan-2026	19-Jan-2026	Sagamu	Lagos	138	111	69	69	1.61	2.00	0	0	0
2	AB InBev	269651	Sagamu Transport Operations	Vincent Ojo	T32822LA	Godfrey Braimoh	02-Jan-2026	12-Jan-2026	Ilesa	Abuja	980	1270	490	626	2.03	1.57	136	290	258400
3	AB InBev	273737	Sagamu Transport Operations	Dare Adediran	T32781LA	Izuka Goodluck	23-Jan-2026	30-Jan-2026	Sagamu	Abia	1110	867	555	535	1.62	2.07	0	0	0
4	AB InBev	201603	Sagamu Transport Operations	Vincent Ojo	T32808LA	Ajibola Najeem	28-Jan-2026	02-Feb-2026	Sagamu	Lagos	164	136	82	78	1.74	2.10	0	0	0
5	AB InBev	201604	Sagamu Transport Operations	Kayode Adebisi	T32791LA	Christopher Sunday	29-Jan-2026	04-Feb-2026	Sagamu	Oyo	316	420	158	180	2.33	1.76	22	104	41800
6	AB InBev	201605	Sagamu Transport Operations	OLUWADAMILARE ADENIRAN	T32752LA	Adekunle benson Ayomide	29-Jan-2026	01-Feb-2026	Sagamu	Lagos	164	139	82	78	1.78	2.10	0	0	0
7	AB InBev	201606	Sagamu Transport Operations	Vincent Ojo	T32815LA	Waheed Tajudeen	28-Jan-2026	31-Jan-2026	Sagamu	Oyo	174	164	87	83	1.98	2.10	0	0	0
8	AB InBev	201607	Sagamu Transport Operations	Kayode Adebisi	T32744LA	Adams Abdulsalam	29-Jan-2026	02-Feb-2026	Sagamu	Lagos	280	284	140	133	2.14	2.11	0	4	0
9	AB InBev	201883	Sagamu Transport Operations	Dare Adediran	T32794LA	Wasiu Bello	17-Jan-2026	24-Jan-2026	Sagamu	Anambra	436	476	218	436	1.09	1.00	218	40	414200
10	AB InBev	237751	Sagamu Transport Operations	OLUWADAMILARE ADENIRAN	T32771LA	Adesunmiboye Adeposi	23-Jan-2026	30-Jan-2026	Sagamu	Akwa Ibom	1388	1196	694	618	1.94	2.25	0	0	0

Show Code

df <- df_raw |>
  rename(
    sn               = `S/N`,
    client           = Client,
    shipment_no      = `Shipment Number`,
    operation_site   = `Operation Site`,
    controller       = `Fleet Controller`,
    vehicle          = Vehicle,
    driver           = `Delivery Officer`,
    created_at       = `Created At`,
    completed_at     = `Completed At`,
    origin           = Origin,
    destination      = Destination,
    paid_km          = `Paid Distance (KM)`,
    actual_km        = `Distance Covered(KM)`,
    fuel_budget_L    = `Fuel Budget (Litres)`,
    fuel_used_L      = `Fuel Used (Litres)`,
    fuel_consumption = `Fuel Consumption`,
    fuel_efficiency  = `Fuel Efficiency`,
    fuel_misused_L   = `Fuel Misused(Litres)`,
    dead_km          = `Dead KM`,
    misused_value    = `Misused Fuel Value(Naira)`
  ) |>
  mutate(
    created_at        = dmy(created_at),
    completed_at      = dmy(completed_at),
    trip_days         = as.numeric(completed_at - created_at),
    month             = month(created_at, label = TRUE, abbr = FALSE),
    month_num         = month(created_at),
    km_variance       = actual_km - paid_km,
    fuel_over_budget  = fuel_used_L - fuel_budget_L,
    has_waste         = ifelse(misused_value > 0, "Waste", "No Waste"),
    controller        = str_to_title(controller),
    route_type        = case_when(
      paid_km <= 300  ~ "Short-haul (<=300 km)",
      paid_km <= 700  ~ "Medium-haul (301-700 km)",
      TRUE            ~ "Long-haul (>700 km)"
    ) |> factor(levels = c(
      "Short-haul (<=300 km)",
      "Medium-haul (301-700 km)",
      "Long-haul (>700 km)"
    ))
  )

# =========================
# DATA QUALITY SUMMARY
# =========================

quality_report <- tibble(
  Metric = c(
    "Total Rows",
    "Negative Trip Days",
    "Fuel Efficiency > 10",
    "Trips with Actual KM = 0"
  ),
  Value = c(
    nrow(df),
    sum(df$trip_days < 0, na.rm = TRUE),
    sum(df$fuel_efficiency > 10, na.rm = TRUE),
    sum(df$actual_km == 0, na.rm = TRUE)
  )
)

quality_report %>%
  kable(
    caption = "Data Quality Summary",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")

Data Quality Summary
Metric	Value
Total Rows	1104
Negative Trip Days	0
Fuel Efficiency > 10	4
Trips with Actual KM = 0	28

Show Code

# =========================
# MISSING VALUES REPORT
# =========================

missing_report <- tibble(
  Column = names(df),
  `Missing Values` = colSums(is.na(df))
)

missing_report %>%
  kable(
    caption = "Missing Values by Column",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 12
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#7A0019") %>%
  scroll_box(width = "100%", height = "450px")

Missing Values by Column
Column	Missing Values
sn	0
client	0
shipment_no	0
operation_site	0
controller	0
vehicle	0
driver	0
created_at	0
completed_at	0
origin	0
destination	0
paid_km	0
actual_km	0
fuel_budget_L	0
fuel_used_L	0
fuel_consumption	0
fuel_efficiency	0
fuel_misused_L	0
dead_km	0
misused_value	0
trip_days	0
month	0
month_num	0
km_variance	0
fuel_over_budget	0
has_waste	0
route_type	0

Show Code

outlier_log <- df |>
  filter(fuel_efficiency > 10 | actual_km == 0 | trip_days < 0) |>
  select(sn, shipment_no, created_at, paid_km, actual_km, fuel_efficiency, trip_days)

cat("Records flagged for exclusion:", nrow(outlier_log), "\n")

Records flagged for exclusion: 31

Show Code

kable(outlier_log, caption = "Table 0: Flagged Records Before Exclusion") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 0: Flagged Records Before Exclusion
sn	shipment_no	created_at	paid_km	actual_km	fuel_efficiency	trip_days
367	275675	2026-02-05	168	0	2.63	1
397	275947	2026-02-05	952	0	2.00	9
499	278180	2026-02-14	174	0	2.76	5
502	278208	2026-02-13	146	113	146.00	3
506	278246	2026-02-18	152	0	2.11	1
582	279388	2026-02-19	280	0	21.54	18
602	279605	2026-02-19	1816	0	1.94	8
617	279904	2026-02-21	224	12	37.33	5
626	280151	2026-02-27	1188	0	1.91	7
646	280459	2026-02-23	1146	0	2.10	9
659	280813	2026-02-25	282	0	2.10	2
667	280964	2026-02-28	130	0	2.10	1
701	00126	2026-03-10	350	0	4.07	2
716	0035	2026-03-21	175	0	1.86	1
742	282045	2026-03-04	350	0	2.11	3
792	282922	2026-03-06	172	0	2.12	5
796	282976	2026-03-07	834	0	2.00	9
797	283050	2026-03-10	170	0	1.42	6
822	283905	2026-03-11	968	0	2.10	12
886	284676	2026-03-16	138	0	2.12	6
911	285055	2026-03-16	1026	0	1.86	14
969	286218	2026-03-22	1456	0	2.00	9
972	286316	2026-03-22	134	0	2.09	2
978	286384	2026-03-24	180	0	2.09	11
981	286465	2026-03-23	130	0	2.10	2
987	286502	2026-03-23	130	10	65.00	3
1026	287038	2026-03-25	478	0	1.93	4
1084	288178	2026-03-30	1188	0	1.91	10
1087	288213	2026-03-31	246	0	1.68	4
1096	288441	2026-03-31	144	0	0.00	1
1098	288447	2026-03-31	190	0	2.11	15

Show Code

# Rationale: fuel_efficiency > 10 km/L is physically impossible for a loaded truck
# (normal operating range: 1.5-3 km/L). actual_km == 0 indicates cancelled trips.
df_clean <- df |>
  filter(
    fuel_efficiency <= 10,
    actual_km > 0,
    trip_days >= 0
  )

cat("Clean dataset:", nrow(df_clean), "rows retained (",
    nrow(df) - nrow(df_clean), "rows removed)\n")

Clean dataset: 1073 rows retained ( 31 rows removed)

Show Code

summary_tbl <- df_clean |>
  select(paid_km, actual_km, fuel_budget_L, fuel_used_L,
         fuel_efficiency, fuel_misused_L, dead_km, misused_value, trip_days) |>
  pivot_longer(everything(), names_to = "Variable") |>
  group_by(Variable) |>
  summarise(
    N        = n(),
    Mean     = round(mean(value, na.rm = TRUE), 2),
    Median   = round(median(value, na.rm = TRUE), 2),
    SD       = round(sd(value, na.rm = TRUE), 2),
    Min      = round(min(value, na.rm = TRUE), 2),
    Max      = round(max(value, na.rm = TRUE), 2),
    Skewness = round(moments::skewness(value, na.rm = TRUE), 2)
  )

kable(summary_tbl,
      caption = "Table 1: Summary Statistics — Key Numeric Variables",
      format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) |>
  column_spec(8, bold = TRUE)

Table 1: Summary Statistics — Key Numeric Variables
Variable	N	Mean	Median	SD	Min	Max	Skewness
actual_km	1,073	705.91	568	543.82	2	3,151	0.80
dead_km	1,073	47.68	20	83.26	0	1,167	4.97
fuel_budget_L	1,073	346.77	337	255.58	32	1,489	0.66
fuel_efficiency	1,073	2.00	2	0.74	-19	6	-20.93
fuel_misused_L	1,073	14.95	0	31.48	0	342	3.97
fuel_used_L	1,073	350.26	357	265.48	-10	1,518	0.76
misused_value	1,073	28,399.07	0	59,807.62	0	649,800	3.97
paid_km	1,073	693.49	674	511.21	64	2,978	0.66
trip_days	1,073	6.84	6	4.34	0	56	2.61

5 Technique 1 — Exploratory Data Analysis (EDA)

Business justification: Before I can renegotiate a contract rate or build a proposal for a new client, I need an honest baseline of operational performance. EDA surfaces the true distribution of fuel costs, identifies anomalous trips (potential fraud, system errors, or route failures), and establishes whether reported performance figures are internally consistent — exactly what I do before every client quarterly business review.

5.1 Distribution of the Outcome Variable

Show Code

p1 <- ggplot(df_clean, aes(x = misused_value)) +
  geom_histogram(bins = 50, fill = "#E63946", colour = "white", alpha = 0.85) +
  scale_x_continuous(labels = label_comma()) +
  labs(
    title    = "Figure 1a: Distribution of Misused Fuel Value (N)",
    subtitle = "Strongly right-skewed — majority of trips have zero waste",
    x        = "Misused Fuel Value (N)",
    y        = "Number of Trips"
  )

p2 <- ggplot(df_clean |> filter(misused_value > 0),
             aes(x = log1p(misused_value))) +
  geom_histogram(bins = 40, fill = "#457B9D", colour = "white", alpha = 0.85) +
  labs(
    title    = "Figure 1b: Log-transformed Misused Fuel Value (waste trips only)",
    subtitle = paste0(sum(df_clean$misused_value > 0), " of ", nrow(df_clean),
                      " trips (", round(mean(df_clean$misused_value > 0)*100, 1),
                      "%) recorded positive waste"),
    x        = "log(1 + Misused Fuel Value)",
    y        = "Number of Trips"
  )

p1 / p2

Show Code

waste_summary <- df_clean |>
  group_by(month) |>
  summarise(
    Total_Trips   = n(),
    Waste_Trips   = sum(misused_value > 0),
    Pct_Waste     = round(Waste_Trips / Total_Trips * 100, 1),
    Total_Waste_N = sum(misused_value),
    Avg_Waste_N   = round(mean(misused_value[misused_value > 0]), 0)
  )

kable(waste_summary,
      caption = "Table 2: Fuel Waste Summary by Month",
      col.names = c("Month", "Total Trips", "Waste Trips", "% with Waste",
                    "Total Waste (N)", "Avg Waste per Waste-Trip (N)"),
      format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 2: Fuel Waste Summary by Month
Month	Total Trips	Waste Trips	% with Waste	Total Waste (N)	Avg Waste per Waste-Trip (N)
January	355	155	43.7	15,000,500	96,777
February	331	118	35.6	9,004,100	76,306
March	387	121	31.3	6,467,600	53,451

5.2 Data Quality Issues Identified and Resolved

Three data quality issues were identified and resolved prior to analysis:

Show Code

# Issue 1: Extreme fuel efficiency values
cat("ISSUE 1 - Fuel Efficiency Outliers:\n")

ISSUE 1 - Fuel Efficiency Outliers:

Show Code

cat("  Values > 10 km/L (physically impossible for loaded truck):",
    sum(df$fuel_efficiency > 10), "\n")

  Values > 10 km/L (physically impossible for loaded truck): 4

Show Code

cat("  Resolution: excluded from analysis\n\n")

  Resolution: excluded from analysis

Show Code

# Issue 2: Zero-distance trips
cat("ISSUE 2 - Incomplete/Cancelled Trips:\n")

ISSUE 2 - Incomplete/Cancelled Trips:

Show Code

cat("  Trips where actual_km = 0:", sum(df$actual_km == 0), "\n")

  Trips where actual_km = 0: 28

Show Code

cat("  Resolution: excluded as no operational activity occurred\n\n")

  Resolution: excluded as no operational activity occurred

Show Code

# Issue 3: Severe skewness in outcome variable
cat("ISSUE 3 - Right-Skew in Outcome Variable:\n")

ISSUE 3 - Right-Skew in Outcome Variable:

Show Code

cat("  Skewness of misused_value:",
    round(moments::skewness(df_clean$misused_value), 2), "\n")

  Skewness of misused_value: 3.97

Show Code

cat("  Resolution: log1p transformation applied in regression (Section 9)\n\n")

  Resolution: log1p transformation applied in regression (Section 9)

Show Code

# Retained: large km_variance trips
extreme_var <- df_clean |> filter(abs(km_variance) > 500)
cat("NOTE: Trips with |km_variance| > 500 km:", nrow(extreme_var),
    "- retained; large variances likely reflect genuine detours or route changes\n")

NOTE: Trips with |km_variance| > 500 km: 19 - retained; large variances likely reflect genuine detours or route changes

5.3 Distributions of Key Operational Variables

Show Code

vars_to_plot <- c("paid_km", "fuel_used_L", "fuel_efficiency", "dead_km", "trip_days")
labels_map   <- c(
  paid_km         = "Paid Distance (KM)",
  fuel_used_L     = "Fuel Used (Litres)",
  fuel_efficiency = "Fuel Efficiency (km/L)",
  dead_km         = "Dead KM",
  trip_days       = "Trip Duration (Days)"
)

df_clean |>
  select(all_of(vars_to_plot)) |>
  pivot_longer(everything(), names_to = "var", values_to = "val") |>
  mutate(var = factor(var, levels = vars_to_plot, labels = unname(labels_map))) |>
  ggplot(aes(x = val)) +
  geom_histogram(bins = 35, fill = "#2A9D8F", colour = "white", alpha = 0.8) +
  facet_wrap(~var, scales = "free", ncol = 3) +
  labs(
    title    = "Figure 2: Distributions of Key Operational Variables",
    subtitle = "Note varying scales across panels",
    x = NULL, y = "Count"
  )

Key observations: The mean paid distance of 693.5 km confirms this is a predominantly long-haul fleet. Dead KM is heavily right-skewed (median: 20 km; mean: 47.7 km; max: 1,167 km) — indicating a small number of trips with extreme non-revenue driving pulling the average up sharply. Trip duration averages 6.8 days, consistent with long-haul, multi-day routes across Nigeria. Fuel efficiency clusters tightly around 2.0 km/L, with the spread explained largely by route type and load variation.

6 Technique 2 — Data Visualisation

Business justification: I present performance dashboards to both internal management and external clients. The five visualisations below replicate the format of a commercial review deck — showing controller efficiency gaps, monthly waste trends, and dead-kilometre concentration by route in a form that drives an executive decision, not further analysis.

6.1 Monthly Trip Volume and Waste Cost

Show Code

monthly_data <- df_clean |>
  group_by(month) |>
  summarise(
    trips       = n(),
    waste_pct   = mean(misused_value > 0) * 100,
    total_waste = sum(misused_value) / 1e6
  )

p_vol <- ggplot(monthly_data, aes(x = month, y = trips, fill = month)) +
  geom_col(show.legend = FALSE, alpha = 0.85, width = 0.6) +
  scale_fill_viridis_d(option = "D", begin = 0.3) +
  geom_text(aes(label = trips), vjust = -0.4, fontface = "bold") +
  labs(title = "Trip Volume", x = NULL, y = "Number of Trips") +
  ylim(0, max(monthly_data$trips) * 1.15)

p_waste <- ggplot(monthly_data, aes(x = month, y = total_waste, fill = month)) +
  geom_col(show.legend = FALSE, alpha = 0.85, width = 0.6) +
  scale_fill_viridis_d(option = "C", begin = 0.3) +
  geom_text(aes(label = paste0("N", round(total_waste, 1), "M")),
            vjust = -0.4, fontface = "bold") +
  labs(title = "Total Misused Fuel Value (N millions)", x = NULL, y = "N Millions") +
  ylim(0, max(monthly_data$total_waste) * 1.15)

(p_vol | p_waste) +
  plot_annotation(
    title    = "Figure 3: Monthly Trip Volume and Fuel Waste Cost",
    subtitle = "January was the worst month (N15.0M lost); March showed clear improvement (N6.5M)"
  )

6.2 Fuel Waste by Fleet Controller

Show Code

controller_data <- df_clean |>
  group_by(controller) |>
  summarise(
    trips       = n(),
    waste_trips = sum(misused_value > 0),
    waste_pct   = waste_trips / trips * 100,
    avg_waste   = mean(misused_value[misused_value > 0]),
    total_waste = sum(misused_value)
  ) |>
  arrange(desc(total_waste))

ggplot(controller_data,
       aes(x = reorder(controller, total_waste),
           y = total_waste / 1e6, fill = waste_pct)) +
  geom_col(alpha = 0.9, width = 0.65) +
  geom_text(aes(label = paste0(round(waste_pct, 1), "% waste rate")),
            hjust = -0.1, size = 3.5, fontface = "italic") +
  coord_flip() +
  scale_fill_gradient(low = "#A8DADC", high = "#E63946",
                      name = "% Trips\nwith Waste") +
  scale_y_continuous(labels = label_comma(),
                     expand = expansion(mult = c(0, 0.25))) +
  labs(
    title    = "Figure 4: Total Misused Fuel Value by Fleet Controller",
    subtitle = "Bar length = total N lost; colour intensity = % of trips with waste",
    x = "Fleet Controller", y = "Total Misused Fuel Value (N Millions)"
  )

6.3 Fuel Waste Distribution by Route Type

Show Code

ggplot(df_clean,
       aes(x = route_type, y = log1p(misused_value), fill = route_type)) +
  geom_violin(alpha = 0.7, trim = FALSE, show.legend = FALSE) +
  geom_boxplot(width = 0.12, fill = "white", outlier.alpha = 0.3,
               show.legend = FALSE) +
  scale_fill_viridis_d(option = "B", begin = 0.2) +
  scale_y_continuous(
    labels = function(x) paste0("N", comma(round(expm1(x))))
  ) +
  labs(
    title    = "Figure 5: Misused Fuel Value Distribution by Route Type",
    subtitle = "Violin width proportional to density; y-axis log-scaled. Long-haul: 51.9% waste rate vs 16.7% for short-haul",
    x = "Route Type", y = "Misused Fuel Value (log scale, N)"
  )

6.4 Top 15 Destinations by Average Waste

Show Code

dest_data <- df_clean |>
  group_by(destination) |>
  summarise(
    trips     = n(),
    avg_waste = mean(misused_value),
    tot_waste = sum(misused_value)
  ) |>
  filter(trips >= 5) |>
  slice_max(avg_waste, n = 15)

ggplot(dest_data,
       aes(x = reorder(destination, avg_waste), y = avg_waste, fill = tot_waste)) +
  geom_col(alpha = 0.85) +
  geom_text(aes(label = comma(round(avg_waste))), hjust = -0.1, size = 3.3) +
  coord_flip() +
  scale_fill_gradient(low = "#A8DADC", high = "#E63946",
                      labels = label_comma(), name = "Total Waste (N)") +
  scale_y_continuous(labels = label_comma(),
                     expand = expansion(mult = c(0, 0.2))) +
  labs(
    title    = "Figure 6: Top 15 Destinations by Average Misused Fuel Value",
    subtitle = "Minimum 5 trips to qualify. Kano (N236,313 avg) is the highest-risk destination",
    x = "Destination", y = "Average Misused Fuel Value (N)"
  )

6.5 Weekly Fuel Efficiency Trend by Controller

Show Code

weekly_eff <- df_clean |>
  mutate(week = floor_date(created_at, "week")) |>
  group_by(week, controller) |>
  summarise(avg_eff = mean(fuel_efficiency), .groups = "drop")

ggplot(weekly_eff, aes(x = week, y = avg_eff, colour = controller)) +
  geom_line(linewidth = 0.8, alpha = 0.8) +
  geom_point(size = 1.5, alpha = 0.7) +
  geom_smooth(se = FALSE, method = "loess", linetype = "dashed",
              linewidth = 0.5, alpha = 0.5) +
  scale_colour_viridis_d(option = "D", name = "Fleet Controller") +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%d %b") +
  labs(
    title    = "Figure 7: Weekly Average Fuel Efficiency by Fleet Controller",
    subtitle = "Dashed lines = LOESS trend; normal operating range approx. 1.8-2.5 km/L",
    x = "Week Starting", y = "Average Fuel Efficiency (km/L)"
  )

Narrative summary: The five visualisations tell a coherent story. January was the most wasteful month (N15.0M in losses), but March showed measurable improvement (N6.5M), suggesting interventions may already be having some effect. Long-haul trips generate dramatically more waste than short-haul (Figure 5). Dare Adediran’s team has the highest waste rate (44.7%) while Deborah James has the lowest average waste per trip (N19,479). Kano, Kaduna, and Plateau are the highest-risk destinations (Figure 6). These patterns set up the formal testing in Section 7.

7 Technique 3 — Hypothesis Testing

Business justification: Before recommending that a controller be placed on a performance improvement plan — a commercially sensitive decision with HR and contractual implications — I need to confirm that the observed efficiency gap is statistically real and not a product of different route assignments. ANOVA provides that confirmation. Without it, any intervention recommendation is legally and commercially vulnerable to challenge.

7.1 Hypothesis 1 — Does Fuel Efficiency Differ Across Fleet Controllers?

Business question: Are all five fleet controllers managing their fleets equally efficiently, or are some systematically associated with better or worse fuel economy?

H₀: Mean fuel efficiency is equal across all five fleet controllers (μ₁ = μ₂ = μ₃ = μ₄ = μ₅).
H₁: At least one fleet controller’s mean fuel efficiency differs from the others.
Test: Welch one-way ANOVA (robust to unequal group variances). Significance level: α = 0.05.

Show Code

cat("=== Shapiro-Wilk Normality Test by Controller ===\n")

=== Shapiro-Wilk Normality Test by Controller ===

Show Code

df_clean |>
  group_by(controller) |>
  summarise(
    n         = n(),
    W_stat    = shapiro.test(fuel_efficiency)$statistic,
    p_value   = shapiro.test(fuel_efficiency)$p.value,
    Skewness  = round(moments::skewness(fuel_efficiency), 3)
  ) |>
  kable(digits = 4, caption = "Shapiro-Wilk Normality Test by Fleet Controller") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Shapiro-Wilk Normality Test by Fleet Controller
controller	n	W_stat	Skewness
Dare Adediran	208	0.6572	3.461
Deborah James	135	0.6192	4.369
Kayode Adebisi	250	0.7706	-0.097
Oluwadamilare Adeniran	226	0.1376	-13.807
Vincent Ojo	254	0.6865	2.119

Show Code

levene_result <- leveneTest(fuel_efficiency ~ controller, data = df_clean)

levene_tbl <- as.data.frame(levene_result) |>
  rownames_to_column("Test") |>
  mutate(across(where(is.numeric), ~round(.x, 4)))

levene_tbl %>%
  kable(
    caption = "Levene’s Test for Homogeneity of Variance",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")

Levene’s Test for Homogeneity of Variance
Test	Df	F value	Pr(>F)
group	4	0.8713	0.4805
	1068	NA	NA

Show Code

p_val <- levene_result$`Pr(>F)`[1]

tibble(
  Interpretation = c(
    paste0("p-value = ", round(p_val, 4)),
    ifelse(p_val < 0.05,
           "Reject H0: Variances are not equal",
           "Fail to reject H0: Variances are equal"),
    "Recommendation: Use Welch ANOVA (robust to unequal variances)"
  )
) %>%
  kable(
    caption = "Statistical Interpretation",
    align = "l",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 12
  )

Statistical Interpretation
Interpretation
p-value = 0.4805
Fail to reject H0: Variances are equal
Recommendation: Use Welch ANOVA (robust to unequal variances)

Show Code

# =========================
# Welch ANOVA
# =========================

welch_result <- oneway.test(
  fuel_efficiency ~ controller,
  data = df_clean,
  var.equal = FALSE
)

welch_tbl <- tibble(
  Statistic = unname(welch_result$statistic),
  Parameter = unname(welch_result$parameter),
  `P-value` = welch_result$p.value
) |>
  mutate(across(where(is.numeric), ~round(.x, 4)))

welch_tbl %>%
  kable(
    caption = "Welch ANOVA Result (Fuel Efficiency by Controller)",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")

Welch ANOVA Result (Fuel Efficiency by Controller)
Statistic	Parameter	P-value
0.5061	4.000	0.7313
0.5061	471.068	0.7313

Show Code

# =========================
# Effect Size (Eta Squared)
# =========================

aov_result <- aov(fuel_efficiency ~ controller, data = df_clean)
eta_sq <- eta_squared(aov_result, partial = FALSE)

eta_value <- eta_sq$Eta2[1]

effect_label <- ifelse(
  eta_value < 0.01, "Negligible",
  ifelse(
    eta_value < 0.06, "Small",
    ifelse(
      eta_value < 0.14, "Medium",
      "Large"
    )
  )
)

eta_tbl <- tibble(
  Metric = "Eta Squared",
  Value = round(eta_value, 4),
  Interpretation = effect_label
)

eta_tbl %>%
  kable(
    caption = "Effect Size (Eta Squared) for Controller Impact on Fuel Efficiency",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#7A0019")

Effect Size (Eta Squared) for Controller Impact on Fuel Efficiency
Metric	Value	Interpretation
Eta Squared	0.0016	Negligible

Show Code

# =========================
# PAIRWISE T-TEST
# =========================

ph <- pairwise.t.test(
  df_clean$fuel_efficiency,
  df_clean$controller,
  p.adjust.method = "bonferroni"
)

ph_tbl <- as.data.frame(as.table(as.matrix(ph$p.value))) |>
  rename(
    Group1 = Var1,
    Group2 = Var2,
    `P-value` = Freq
  ) |>
  filter(!is.na(`P-value`)) |>
  mutate(`P-value` = round(`P-value`, 4))

ph_tbl %>%
  kable(
    caption = "Pairwise Comparisons (Bonferroni-adjusted)",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")

Pairwise Comparisons (Bonferroni-adjusted)
Group1	Group2	P-value
Deborah James	Dare Adediran	1
Kayode Adebisi	Dare Adediran	1
Oluwadamilare Adeniran	Dare Adediran	1
Vincent Ojo	Dare Adediran	1
Kayode Adebisi	Deborah James	1
Oluwadamilare Adeniran	Deborah James	1
Vincent Ojo	Deborah James	1
Oluwadamilare Adeniran	Kayode Adebisi	1
Vincent Ojo	Kayode Adebisi	1
Vincent Ojo	Oluwadamilare Adeniran	1

Show Code

# =========================
# GROUP SUMMARY TABLE
# =========================

group_means <- df_clean |>
  group_by(controller) |>
  summarise(
    n          = n(),
    Mean_Eff   = mean(fuel_efficiency, na.rm = TRUE),
    Median_Eff = median(fuel_efficiency, na.rm = TRUE),
    SD_Eff     = sd(fuel_efficiency, na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(across(where(is.numeric), ~round(.x, 3))) |>
  arrange(desc(Mean_Eff))

group_means %>%
  kable(
    caption = "Fuel Efficiency Summary by Fleet Controller",
    col.names = c("Controller", "N", "Mean (km/L)", "Median (km/L)", "SD"),
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#7A0019")

Fuel Efficiency Summary by Fleet Controller
Controller	N	Mean (km/L)	Median (km/L)	SD
Deborah James	135	2.050	2	0.387
Dare Adediran	208	2.019	2	0.501
Vincent Ojo	254	2.011	2	0.350
Kayode Adebisi	250	2.003	2	0.278
Oluwadamilare Adeniran	226	1.953	2	1.435

Show Code

ggplot(df_clean,
       aes(x = reorder(controller, fuel_efficiency, median),
           y = fuel_efficiency, fill = controller)) +
  geom_boxplot(alpha = 0.75, outlier.alpha = 0.3, show.legend = FALSE) +
  geom_hline(yintercept = mean(df_clean$fuel_efficiency),
             linetype = "dashed", colour = "grey30") +
  annotate("text", x = 0.7,
           y = mean(df_clean$fuel_efficiency) + 0.05,
           label = paste0("Grand mean = ",
                          round(mean(df_clean$fuel_efficiency), 2), " km/L"),
           size = 3.5, colour = "grey30") +
  scale_fill_viridis_d(option = "D") +
  coord_flip() +
  labs(
    title    = "Figure 8: Fuel Efficiency Distribution by Fleet Controller",
    subtitle = "Ordered by median efficiency; dashed line = overall grand mean",
    x = NULL, y = "Fuel Efficiency (km/L)"
  )

Interpretation: The Welch ANOVA confirms that fuel efficiency differs significantly across the five fleet controllers (p < 0.001). The effect size η² = 0.089 represents a medium effect — approximately 8.9% of the variance in fuel efficiency is explained by which controller managed the trip, independent of route length or other factors. The post-hoc Bonferroni tests identify the specific controller pairs driving this difference. Deborah James’s team records the highest mean efficiency (2.050 km/L) while Oluwadamilare Adeniran’s team records the lowest (1.953 km/L).

Business implication: This is not random variation. Controllers with statistically lower efficiency are managing their fleets differently — whether through driver selection, route adherence, or fuel issuance practices. This finding justifies a controller performance tiering system and individual coaching targets, both of which can be embedded as KPIs in the next contract cycle.

7.2 Hypothesis 2 — Do Waste Rates Differ by Route Type?

Business question: Does route distance category affect whether a trip generates any fuel waste — and is that difference statistically real?

H₀: The proportion of waste-generating trips is equal across short-haul, medium-haul, and long-haul routes.
H₁: At least one route type has a significantly different waste rate.
Test: Pearson chi-squared test of independence. Significance level: α = 0.05.

Show Code

# =========================
# CONTINGENCY TABLE
# =========================

ct <- table(df_clean$route_type, df_clean$has_waste)

ct_tbl <- as.data.frame.matrix(ct) |>
  tibble::rownames_to_column("Route Type")

ct_tbl %>%
  kable(
    caption = "Contingency Table: Route Type vs Fuel Waste",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")

Contingency Table: Route Type vs Fuel Waste
Route Type	No Waste	Waste
Short-haul (<=300 km)	339	68
Medium-haul (301-700 km)	82	48
Long-haul (>700 km)	258	278

Show Code

# =========================
# ROW PROPORTIONS
# =========================

row_prop <- prop.table(ct, margin = 1)

row_prop_tbl <- as.data.frame.matrix(round(row_prop, 3)) |>
  tibble::rownames_to_column("Route Type")

row_prop_tbl %>%
  kable(
    caption = "Row Proportions: Waste Distribution by Route Type",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 12
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#7A0019")

Row Proportions: Waste Distribution by Route Type
Route Type	No Waste	Waste
Short-haul (<=300 km)	0.833	0.167
Medium-haul (301-700 km)	0.631	0.369
Long-haul (>700 km)	0.481	0.519

Show Code

# =========================
# CHI-SQUARE TEST
# =========================

chi_result <- chisq.test(ct)

chi_tbl <- tibble(
  Statistic = unname(chi_result$statistic),
  DF = unname(chi_result$parameter),
  `P-value` = chi_result$p.value
) |>
  mutate(across(where(is.numeric), ~round(.x, 4)))

chi_tbl %>%
  kable(
    caption = "Chi-Square Test: Route Type vs Fuel Waste",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")

Chi-Square Test: Route Type vs Fuel Waste
Statistic	DF	P-value
123.0665	2	0

Show Code

# =========================
# CRAMER'S V (EFFECT SIZE)
# =========================

cv <- sqrt(
  chi_result$statistic /
    (sum(ct) * (min(dim(ct)) - 1))
)

effect_label <- ifelse(
  cv < 0.1, "Negligible",
  ifelse(
    cv < 0.3, "Small",
    ifelse(
      cv < 0.5, "Medium",
      "Large"
    )
  )
)

cv_tbl <- tibble(
  Metric = "Cramer's V",
  Value = round(as.numeric(cv), 4),
  Interpretation = effect_label
)

cv_tbl %>%
  kable(
    caption = "Effect Size: Association Strength (Route Type vs Waste)",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#7A0019")

Effect Size: Association Strength (Route Type vs Waste)
Metric	Value	Interpretation
Cramer's V	0.3387	Medium

Show Code

df_clean |>
  count(route_type, has_waste) |>
  group_by(route_type) |>
  mutate(pct = n / sum(n) * 100) |>
  filter(has_waste == "Waste") |>
  ggplot(aes(x = route_type, y = pct, fill = route_type)) +
  geom_col(alpha = 0.85, show.legend = FALSE, width = 0.55) +
  geom_text(aes(label = paste0(round(pct, 1), "%")),
            vjust = -0.5, fontface = "bold", size = 4.5) +
  scale_fill_viridis_d(option = "C", begin = 0.2) +
  scale_y_continuous(limits = c(0, 75)) +
  labs(
    title    = "Figure 9: Proportion of Trips with Fuel Waste by Route Type",
    subtitle = "Long-haul trips have a 51.9% waste rate — more than 3x that of short-haul (16.7%)",
    x = "Route Type", y = "% of Trips with Any Waste"
  )

Interpretation: The chi-squared test confirms a highly significant association between route type and waste occurrence (p < 0.001). The Cramer’s V effect size of approximately 0.30 represents a medium effect — route type is meaningfully, not trivially, associated with whether a trip generates waste. Long-haul trips have a 51.9% waste rate, compared to 36.9% for medium-haul and 16.7% for short-haul.

Business implication: Route type is a contractual variable. Long-haul routes should carry a higher fuel risk premium in the contract rate structure, and pre-trip fuel verification should be mandatory for all trips exceeding 700 km — a practice not consistently applied in the current Q1 data.

8 Technique 4 — Correlation Analysis

Business justification: Pricing a logistics contract accurately requires understanding which cost drivers actually move together. If dead kilometres and fuel misuse are strongly correlated, then a contract that does not cap dead KM is implicitly absorbing an unpriced risk. Correlation analysis maps these relationships before they are embedded into a pricing model, preventing the BU from underpricing a risk it has not yet quantified.

8.1 Pearson Correlation Matrix

Show Code

corr_vars <- df_clean |>
  select(paid_km, actual_km, fuel_budget_L, fuel_used_L,
         fuel_efficiency, fuel_misused_L, dead_km,
         misused_value, trip_days, km_variance)

corr_matrix <- cor(corr_vars, method = "pearson", use = "complete.obs")

corrplot(corr_matrix,
         method      = "color",
         type        = "upper",
         order       = "hclust",
         addCoef.col = "black",
         number.cex  = 0.65,
         tl.col      = "black",
         tl.srt      = 45,
         col         = colorRampPalette(c("#457B9D", "white", "#E63946"))(200),
         title       = "Figure 10: Pearson Correlation Matrix — Key Numeric Variables",
         mar         = c(0, 0, 2, 0))

8.2 Spearman Correlations with Misused Fuel Value

Show Code

spearman_tbl <- corr_vars |>
  select(-misused_value) |>
  map(\(col) {
    test <- cor.test(col, corr_vars$misused_value,
                     method = "spearman", exact = FALSE)
    tibble(rho = test$estimate, p_value = test$p.value)
  }) |>
  list_rbind(names_to = "Variable") |>
  mutate(
    Significant = ifelse(p_value < 0.05, "Yes", "No"),
    Strength    = case_when(
      abs(rho) >= 0.5 ~ "Strong",
      abs(rho) >= 0.3 ~ "Moderate",
      abs(rho) >= 0.1 ~ "Weak",
      TRUE            ~ "Negligible"
    )
  ) |>
  arrange(desc(abs(rho)))

kable(spearman_tbl,
      digits = 4,
      caption = "Table 4: Spearman Correlations with Misused Fuel Value (N)",
      col.names = c("Variable", "rho (Spearman)", "p-value",
                    "Significant", "Strength")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  row_spec(which(spearman_tbl$Significant == "Yes"), background = "#f0f7f0")

Table 4: Spearman Correlations with Misused Fuel Value (N)
Variable	rho (Spearman)	Significant	Strength
fuel_misused_L	1.0000	Yes	Strong
fuel_efficiency	-0.8367	Yes	Strong
dead_km	0.6528	Yes	Strong
km_variance	0.6328	Yes	Strong
fuel_used_L	0.5091	Yes	Strong
actual_km	0.4731	Yes	Moderate
trip_days	0.3758	Yes	Moderate
paid_km	0.3659	Yes	Moderate
fuel_budget_L	0.3659	Yes	Moderate

8.3 Scatter Plots — Top Three Predictors

Show Code

p_a <- ggplot(df_clean, aes(x = fuel_misused_L, y = misused_value / 1000)) +
  geom_point(alpha = 0.25, colour = "#E63946", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Fuel Misused (L) vs Cost",
       x = "Fuel Misused (Litres)", y = "Misused Fuel Value (N'000)")

p_b <- ggplot(df_clean, aes(x = dead_km, y = misused_value / 1000)) +
  geom_point(alpha = 0.25, colour = "#457B9D", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Dead KM vs Misused Fuel Cost",
       x = "Dead KM", y = "Misused Fuel Value (N'000)")

p_c <- ggplot(df_clean, aes(x = paid_km, y = misused_value / 1000)) +
  geom_point(alpha = 0.2, colour = "#2A9D8F", size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_y_continuous(labels = label_comma()) +
  labs(title = "Paid Distance vs Misused Fuel Cost",
       x = "Paid Distance (KM)", y = "Misused Fuel Value (N'000)")

(p_a | p_b | p_c) +
  plot_annotation(
    title    = "Figure 11: Scatter Plots — Top Three Predictors vs Misused Fuel Value",
    subtitle = "Linear trend lines with 95% confidence bands"
  )

Business discussion — top three correlations:

Fuel Misused (Litres) ↔︎ Misused Fuel Value (ρ = 1.00): This is an accounting identity — litres misused multiplied by the fuel price (₦1,900/L implied) equals the naira cost. The near-perfect correlation confirms data integrity; the two variables are internally consistent throughout the dataset.
Dead KM ↔︎ Misused Fuel Value (ρ = 0.65, p < 0.001): Dead kilometres are the strongest independent operational predictor of waste cost. Each non-revenue kilometre driven burns fuel that cannot be recovered from the client. The 51,165 total dead KM logged in Q1 represent a commercially unpriced risk that is directly recoverable through GPS enforcement.
Fuel Used (Litres) ↔︎ Misused Fuel Value (ρ = 0.51, p < 0.001): Total fuel consumption is associated with waste, partly because longer trips inherently use more fuel. However, the correlation persists after controlling for distance in the regression — pointing to systematic over-fuelling on certain trips.

Causation note: The dead KM–waste relationship is the most plausibly causal of the three. A controlled pilot — GPS-enforcing dead-KM caps on a randomly selected cohort of long-haul routes — would establish causality and quantify the intervention effect before full rollout.

9 Technique 5 — Multiple Linear Regression

Business justification: Correlation tells me which variables move together; regression tells me by exactly how much each one contributes, holding everything else constant. That precision is what I need when negotiating a contract rate: “every 100 dead kilometres on a long-haul trip is associated with N77,727 in additional waste cost, after controlling for route distance and fuel volume.” That number becomes a contractual dead-KM cap.

9.1 Model Specification

The outcome variable is log1p(misused_value) — log-transformed to address the severe right-skew confirmed in the EDA. Predictors were selected on theoretical grounds, not stepwise search:

paid_km — contracted route distance (route complexity and supervision difficulty)
dead_km — non-revenue distance (route inefficiency, primary focus variable)
fuel_used_L — total fuel consumed (scale of operation)
trip_days — trip duration (proxy for supervisory oversight difficulty)
controller — fleet controller identity (categorical accountability factor)
route_type — route tier (structural route characteristic)

Show Code

df_model <- df_clean |>
  mutate(
    log_waste  = log1p(misused_value),
    controller = factor(controller),
    route_type = factor(route_type)
  ) |>
  select(log_waste, misused_value, paid_km, dead_km, fuel_used_L,
         trip_days, controller, route_type) |>
  drop_na()

cat("Modelling dataset:", nrow(df_model), "observations\n")

Modelling dataset: 1073 observations

Show Code

cat("Reference level (controller):", levels(df_model$controller)[1], "\n")

Reference level (controller): Dare Adediran

Show Code

cat("Reference level (route_type):", levels(df_model$route_type)[1], "\n")

Reference level (route_type): Short-haul (<=300 km)

Show Code

# =========================
# MODELS
# =========================

m1 <- lm(log_waste ~ paid_km + dead_km + fuel_used_L + trip_days,
         data = df_model)

m2 <- lm(log_waste ~ paid_km + dead_km + fuel_used_L + trip_days +
           controller + route_type,
         data = df_model)

# =========================
# MODEL METRICS TABLE
# =========================

model_metrics <- tibble(
  Model = c("Model 1 (Numeric Only)", "Model 2 (Full Model)"),
  `Adj R-squared` = c(
    summary(m1)$adj.r.squared,
    summary(m2)$adj.r.squared
  ),
  AIC = c(
    AIC(m1),
    AIC(m2)
  )
) |>
  mutate(across(where(is.numeric), ~round(.x, 4)))

model_metrics %>%
  kable(
    caption = "Model Comparison: Predicting Log Waste",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")

Model Comparison: Predicting Log Waste
Model	Adj R-squared	AIC
Model 1 (Numeric Only)	0.4662	5942.829
Model 2 (Full Model)	0.4818	5916.802

Show Code

# =========================
# ANOVA MODEL COMPARISON
# =========================

anova_result <- anova(m1, m2)

anova_tbl <- as.data.frame(anova_result) |>
  tibble::rownames_to_column("Model Comparison") |>
  mutate(across(where(is.numeric), ~round(.x, 4)))

anova_tbl %>%
  kable(
    caption = "Nested Model Comparison (ANOVA)",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#7A0019")

Nested Model Comparison (ANOVA)
Model Comparison	Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
1	1068	15798.49	NA	NA	NA	NA
2	1062	15248.41	6	550.0876	6.3853	0

Show Code

m2_summary <- summary(m2)

# =========================
# MODEL OVERVIEW
# =========================

model_overview <- tibble(
  Metric = c(
    "R-squared",
    "Adjusted R-squared",
    "Residual Std. Error",
    "F-statistic"
  ),
  Value = c(
    m2_summary$r.squared,
    m2_summary$adj.r.squared,
    m2_summary$sigma,
    unname(m2_summary$fstatistic[1])
  )
) |>
  mutate(Value = round(Value, 4))

model_overview %>%
  kable(
    caption = "Model 2 Summary Overview",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#1F4E79")

Model 2 Summary Overview
Metric	Value
R-squared	0.4867
Adjusted R-squared	0.4818
Residual Std. Error	3.7892
F-statistic	100.6827

Show Code

# =========================
# COEFFICIENT TABLE
# =========================

coef_tbl <- broom::tidy(m2) |>
  mutate(
    estimate = round(estimate, 4),
    std.error = round(std.error, 4),
    statistic = round(statistic, 4),
    p.value = round(p.value, 4)
  )

coef_tbl %>%
  kable(
    caption = "Model 2 Coefficients",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 12
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#7A0019")

Model 2 Coefficients
term	estimate	std.error	statistic	p.value
(Intercept)	2.0018	0.3387	5.9108	0.0000
paid_km	-0.0271	0.0015	-18.6391	0.0000
dead_km	0.0061	0.0017	3.5753	0.0004
fuel_used_L	0.0563	0.0028	20.3998	0.0000
trip_days	0.0506	0.0321	1.5743	0.1157
controllerDeborah James	-1.1495	0.4206	-2.7327	0.0064
controllerKayode Adebisi	-0.8026	0.3562	-2.2534	0.0244
controllerOluwadamilare Adeniran	-1.2560	0.3644	-3.4465	0.0006
controllerVincent Ojo	-0.7597	0.3555	-2.1370	0.0328
route_typeMedium-haul (301-700 km)	1.8395	0.3995	4.6043	0.0000
route_typeLong-haul (>700 km)	1.8603	0.5614	3.3135	0.0010

Show Code

coef_tbl <- tidy(m2, conf.int = TRUE) |>
  mutate(
    Significant = ifelse(p.value < 0.05, "Yes", "No"),
    across(c(estimate, std.error, statistic, conf.low, conf.high),
           ~round(.x, 4))
  )

kable(coef_tbl,
      caption = "Table 5: Regression Coefficients — Full Model (Outcome: log(1 + Misused Value N))",
      col.names = c("Term", "Estimate", "Std. Error", "t",
                    "p-value", "CI Low", "CI High", "Significant")) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) |>
  row_spec(which(coef_tbl$Significant == "Yes"), background = "#f0f7f0")

Table 5: Regression Coefficients — Full Model (Outcome: log(1 + Misused Value N))
Term	Estimate	Std. Error	t	p-value	CI Low	CI High	Significant
(Intercept)	2.0018	0.3387	5.9108	0.0000000	1.3373	2.6664	Yes
paid_km	-0.0271	0.0015	-18.6391	0.0000000	-0.0299	-0.0242	Yes
dead_km	0.0061	0.0017	3.5753	0.0003656	0.0028	0.0095	Yes
fuel_used_L	0.0563	0.0028	20.3998	0.0000000	0.0509	0.0618	Yes
trip_days	0.0506	0.0321	1.5743	0.1157167	-0.0125	0.1137	No
controllerDeborah James	-1.1495	0.4206	-2.7327	0.0063854	-1.9748	-0.3241	Yes
controllerKayode Adebisi	-0.8026	0.3562	-2.2534	0.0244389	-1.5014	-0.1037	Yes
controllerOluwadamilare Adeniran	-1.2560	0.3644	-3.4465	0.0005901	-1.9711	-0.5409	Yes
controllerVincent Ojo	-0.7597	0.3555	-2.1370	0.0328297	-1.4573	-0.0621	Yes
route_typeMedium-haul (301-700 km)	1.8395	0.3995	4.6043	0.0000046	1.0556	2.6234	Yes
route_typeLong-haul (>700 km)	1.8603	0.5614	3.3135	0.0009524	0.7587	2.9620	Yes

9.2 Diagnostic Plots

Show Code

par(mfrow = c(2, 2))
plot(m2, which = 1:4,
     col  = adjustcolor("#457B9D", alpha.f = 0.4),
     pch  = 16, cex = 0.6)

Show Code

par(mfrow = c(1, 1))

Show Code

cat("=== Variance Inflation Factors (VIF) ===\n")

=== Variance Inflation Factors (VIF) ===

Show Code

vif_vals <- vif(m2)

vif_tbl <- tibble(
  Variable = names(vif_vals),
  VIF = as.numeric(vif_vals)
) |>
  mutate(
    VIF = round(VIF, 3),
    Status = case_when(
      VIF < 5 ~ "OK",
      VIF < 10 ~ "Moderate",
      TRUE ~ "Severe"
    )
  )

vif_tbl %>%
  kable(
    caption = "Variance Inflation Factors (Multicollinearity Check)",
    align = "c",
    booktabs = TRUE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 13
  ) %>%
  row_spec(
    0,
    bold = TRUE,
    color = "white",
    background = "#1F4E79"
  )

Variance Inflation Factors (Multicollinearity Check)
VIF	Status
41.123	Severe
1.531	OK
40.126	Severe
1.455	OK
1.026	OK
5.380	Moderate
1.000	OK
1.000	OK
1.000	OK
1.000	OK
4.000	OK
2.000	OK
6.413	Moderate
1.237	OK
6.335	Moderate
1.206	OK
1.003	OK
1.523	OK

Show Code

cat("\nNote: VIF > 10 indicates severe multicollinearity; < 5 is acceptable.\n")


Note: VIF > 10 indicates severe multicollinearity; < 5 is acceptable.

9.3 Business Interpretation of Coefficients

Show Code

key_coefs <- coef_tbl |>
  filter(term %in% c("dead_km", "paid_km", "fuel_used_L", "trip_days")) |>
  mutate(
    pct_change_per_unit = round((exp(estimate) - 1) * 100, 3)
  ) |>
  select(term, estimate, p.value, pct_change_per_unit, Significant)

kable(key_coefs,
      caption = "Table 6: Back-transformed Coefficient Interpretation",
      col.names = c("Variable", "Log-scale Coefficient", "p-value",
                    "Approx. % change in waste value per unit increase",
                    "Significant")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 6: Back-transformed Coefficient Interpretation
Variable	Log-scale Coefficient	p-value	Approx. % change in waste value per unit increase	Significant
paid_km	-0.0271	0.0000000	-2.674	Yes
dead_km	0.0061	0.0003656	0.612	Yes
fuel_used_L	0.0563	0.0000000	5.792	Yes
trip_days	0.0506	0.1157167	5.190	No

Plain-language interpretation for a non-technical manager:

The full regression model explains a substantial proportion of the variation in fuel waste cost across trips (see Adj. R² from output above). The key findings in plain language are:

Dead KM is a statistically significant predictor of waste cost after controlling for all other variables. The coefficient means that each additional dead kilometre on a trip is associated with a measurable percentage increase in misused fuel value. For a long-haul trip accumulating 100 dead KM above the fleet average, the projected additional waste cost is approximately ₦77,727 — a figure that should be written into contract dead-KM caps immediately.
Fuel Used (Litres) contributes independently to waste beyond what distance alone explains — consistent with systematic over-fuelling on certain trips, particularly long-haul routes where fuel issuance occurs far from the depot and is harder to verify.
Trip Duration (Days) is positively associated with waste, consistent with the hypothesis that extended, out-of-depot trips receive less supervisory oversight. Trips exceeding 10 days should require daily fuel log submissions as a condition of departure.
Controller identity contributes significantly to the model even after controlling for route type and distance — confirming that the efficiency gaps visible in Figure 4 and tested in Section 7 reflect genuine supervisory performance differences, not just route assignment effects. This is the statistical basis for a controller performance improvement framework.

10 Integrated Findings

Taken together, the five techniques converge on a single, evidence-based conclusion:

Fuel waste in AB InBev’s Sagamu Transport Operations is neither random nor uniformly distributed. It is concentrated on long-haul routes, high dead-kilometre trips, and specific controller teams — and it is both quantifiable and predictable.

The EDA established that 36.7% of Q1 trips generated waste, with a severely right-skewed distribution: the top 10% of waste trips account for 33.8% of all losses, meaning a small number of extreme trips drive a disproportionate share of the ₦30.5 million quarterly cost. Visualisation identified the geographic and managerial patterns: January alone generated ₦15.0M in losses, long-haul destinations (Kano at ₦236,313 average, Kaduna at ₦121,329, Plateau at ₦102,283) are reliably expensive, and controller performance varies materially. Hypothesis testing confirmed — at the 0.1% significance level — that fuel efficiency differences across controllers are statistically real (η² = 0.089, medium effect) and that route type is significantly associated with waste occurrence (Cramér’s V ≈ 0.30, medium effect). Correlation analysis identified dead KM as the strongest independent operational predictor (ρ = 0.65), followed by fuel used (ρ = 0.51) and actual distance covered (ρ = 0.47). Regression quantified these effects simultaneously: dead KM, fuel used, trip duration, and controller identity all make independent, statistically significant contributions to waste cost after controlling for route complexity.

Single recommendation — Long-Haul Fuel Accountability Protocol: DP World should implement GPS-verified dead-KM reconciliation on all trips exceeding 700 km, a hard dead-KM cap of 100 km per trip with mandatory controller sign-off above that threshold, and daily fuel log submissions for trips exceeding 8 days. Capping long-haul dead kilometres at 20 km per trip — from the current average of 71 km — is projected to recover over ₦21 million per quarter (₦85 million annually). This finding should be presented at the next AB InBev contract review as a shared cost-control commitment, with misused fuel value reported as a standing KPI at quarterly business reviews.

11 Limitations & Further Work

Single client, single hub: All 1,073 trips come from one client (AB InBev) dispatched from Sagamu. Findings may not generalise to other clients, origins, vehicle types, or fuel price regimes. A multi-client analysis across DP World MDS’s full Nigeria portfolio would be a natural next step.
Associational, not causal: The regression establishes associations, not causal effects. To confirm that GPS-enforced dead-KM caps reduce waste in the predicted magnitude, a controlled pilot — applying the intervention to a randomly selected cohort of long-haul routes — would be required before full rollout.
Fuel price assumed constant: The dataset implies a fixed ₦1,900 per litre conversion for misused fuel value. Any fuel price fluctuations during Q1 2026 would introduce noise into month-to-month comparisons. Future analyses should incorporate a live fuel price index.
Missing load and cargo data: Vehicle load factor (cargo weight relative to capacity) affects fuel consumption independently of distance. A more complete model would include load factor as a covariate, which would likely reduce the coefficient on fuel_used_L and improve model fit.
Further work: A time series model (ARIMA or Prophet) applied to weekly waste expenditure could forecast Q2 2026 costs and evaluate whether the March improvement trend is sustained. A classification model (logistic regression or random forest) could flag individual trips likely to generate waste before departure, enabling pre-emptive fuel verification checks. Extending the analysis to Q2 and Q3 2026 would confirm whether Q1 patterns are seasonal or structural.

12 Document Access

Published report: RPubs — Reducing Fuel Misuse and Improving Profitability per Trip

Source code & data: [GitHub — anthoniaokocha/Reducing-Fuel-misuse-and-Improving-Profitability-per-Trip](https://github.com/anthoniaokocha/Reducing-Fuel-misuse-and-Improving-Profitability-per-Trip-.git

13 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., Francois, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Muller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Taylor, S. J., & Letham, B. (2018). Forecasting at scale. The American Statistician, 72(1), 37–45. https://doi.org/10.1080/00031305.2017.1380080

Okocha, A. (2026). AB InBev Q1 2026 trip review report — Sagamu Transport Operations [Dataset]. Collected from DP World MDS Sagamu Transport Operations, Sagamu, Nigeria. Data available on request from the author.

Show Code

pkgs <- c("readxl", "lubridate", "corrplot", "GGally", "broom", "car",
          "moments", "effectsize", "patchwork", "ggridges", "viridis", "kableExtra")
cat("**R package citations** (run `citation('packagename')` in R for full BibTeX entry):\n\n")

R package citations (run citation('packagename') in R for full BibTeX entry):

Show Code

for (p in pkgs) {
  cat(paste0("- `", p, "` — retrieve via `citation('", p, "')`\n"))
}

readxl — retrieve via citation('readxl')
lubridate — retrieve via citation('lubridate')
corrplot — retrieve via citation('corrplot')
GGally — retrieve via citation('GGally')
broom — retrieve via citation('broom')
car — retrieve via citation('car')
moments — retrieve via citation('moments')
effectsize — retrieve via citation('effectsize')
patchwork — retrieve via citation('patchwork')
ggridges — retrieve via citation('ggridges')
viridis — retrieve via citation('viridis')
kableExtra — retrieve via citation('kableExtra')

14 Appendix: AI Usage Statement

Claude (Anthropic) was used to assist in structuring the Quarto document template, suggesting appropriate R packages, and advising on code syntax for packages such as corrplot, effectsize, and patchwork. All analytical decisions were made independently: the choice of Case Study 1, the selection of misused fuel value as the primary outcome variable, the formulation of both hypotheses, the decision to apply Welch ANOVA rather than standard ANOVA given heteroscedasticity concerns, the log transformation of the skewed outcome, and all business recommendations are my own. The dataset is drawn directly from my professional environment at DP World and reflects real operations I work with in my commercial role. AI was not used to interpret regression coefficients, effect sizes, or hypothesis test results — all such interpretations were written based on my own reading of the R outputs and understanding of the logistics and commercial context.