Data Analytics Take Home Examination — CS1: Exploratory & Inferential Analytics of a Fleet Management Operation

Author

Akeju, Oluwafemi Olusoji

Published

May 12, 2026


1. Executive Summary

Note📋 Executive Summary

This report presents a comprehensive Exploratory & Inferential Analytics study of fleet maintenance operations at KK Leasing Limited — a third-party vehicle leasing and logistics company operating across Nigeria. Drawing on 6,986 line-item maintenance records spanning 153 vehicles and 11 months (January–November 2023), the analysis addresses a central operational question:

What are the principal drivers of fleet maintenance cost at KK Leasing, and how can they be predicted and controlled to protect asset profitability?

Five analytical techniques were applied sequentially: Exploratory Data Analysis (EDA) uncovered a right-skewed cost distribution dominated by Tyre Purchase (₦18.2 m), Routine Maintenance (₦12.4 m) and Engine repairs (₦8.5 m), with Toyota vehicles — comprising 55% of the fleet — accounting for 70% of total spend. Data Visualisation revealed a mid-year expenditure peak (June–August) and substantial cost variation across maintenance categories and clients. Hypothesis Testing confirmed statistically significant differences in per-event costs across both maintenance categories (ANOVA, F(5, 3418) = 77.3, p < .001) and vehicle makes (ANOVA, F(7, 6978) = 11.6, p < .001). Correlation Analysis established strong positive associations between vehicle age and maintenance cost (r = 0.57) and between parts cost and total cost (r = 1.00). Linear Regression produced a model (adjusted = 0.54) in which vehicle age and budget allocation are the strongest predictors of maintenance spend.

Key recommendation: Retire or ring-fence vehicles aged 10 years and above, enforce a structured Planned Maintenance (PM) programme that front-loads routine service spend to suppress reactive repair costs, and reallocate a portion of the maintenance budget toward in-house workshop capacity to reduce reliance on third-party vendors.


2. Professional Disclosure

2.1 Author & Organisational Background

My name is Akeju, Oluwafemi Olusoji, and I serve as the Chief Operating Officer (COO) of KK Leasing Limited, a Nigerian third-party logistics (3PL) and vehicle leasing company. In this capacity, I bear executive responsibility for the end-to-end management of the company’s vehicle fleet — which constitutes its primary revenue-generating and cost-bearing asset class. My remit encompasses fleet procurement, maintenance governance, workshop operations, client SLA management, fuel and tyre budgeting, and the strategic deployment of vehicles across multiple corporate client accounts.

KK Leasing’s core value proposition is to provide, maintain, and manage vehicles for corporate organisations, thereby enabling our clients to outsource all fleet-related operational risk and complexity to us. The company’s financial performance is, in large measure, a direct function of how efficiently we control maintenance costs relative to the revenue generated from each vehicle. A vehicle grounded for unplanned repair neither generates revenue nor meets client SLA commitments — making maintenance analytics not merely an academic exercise but an existential operational priority.

2.2 Why This Analysis Matters Operationally

The maintenance function at KK Leasing is characterised by high frequency, significant value, and considerable complexity. In 2023 alone, the dataset reflects over ₦93 million in approved maintenance expenditure across 153 vehicles, 22 distinct maintenance categories, and 6 client accounts. Without structured analytical oversight, cost overruns go undetected, high-risk vehicles remain in service beyond economic useful life, and budget forecasts are based on intuition rather than evidence.

This analysis directly supports three strategic decisions I confront on a recurring basis:

  1. Vehicle retirement decisions — When does a vehicle’s cumulative maintenance cost exceed its residual value and warranted budget allocation?
  2. Maintenance budget allocation — How much should be budgeted per vehicle per category, and which categories carry the highest financial risk?
  3. Workshop sourcing strategy — Should maintenance be directed to in-house workshops or third-party vendors for optimal cost and quality outcomes?

2.3 Technique Justification

The five analytical techniques selected for this study are not merely textbook exercises — each maps directly to a recurring operational question in the management of KK Leasing’s fleet.

Technique 1 — Exploratory Data Analysis (EDA) Before any inference or prediction, a rigorous analyst must first understand the data they possess. In the fleet management context, EDA is indispensable for identifying the baseline cost landscape: which maintenance categories consume the largest share of the budget, how costs are distributed across vehicles and months, and where data quality issues (such as missing values, duplicate entries, or outlier repair events) may distort downstream analysis. Our maintenance cost data is inherently right-skewed — a small number of high-value repair events (e.g., engine overhauls, tyre fleet replacements) dominate total spend. EDA allows me to detect and document these anomalies before they corrupt statistical conclusions. It also generates the summary statistics — means, medians, standard deviations, and inter-quartile ranges — that form the factual basis for budget conversations with the Board and with clients.

Technique 2 — Data Visualisation Numbers alone do not persuade operational managers or clients. Visualisation transforms raw maintenance data into narratives that are immediately actionable. In my role, I regularly present fleet performance dashboards to the management team and to client procurement committees. The grammar-of-graphics approach adopted here — using deliberate chart type selection, consistent colour coding, and layered storytelling — allows a non-technical audience to grasp, for instance, that tyre expenditure is the single largest cost category, that costs peak in the mid-year months, or that certain vehicle makes exhibit disproportionate maintenance intensity. These visual insights directly inform decisions on preventive maintenance scheduling, tyre procurement contracts, and vehicle replacement cycles.

Technique 3 — Hypothesis Testing Operational intuition frequently generates claims that need rigorous verification: “Toyota vehicles cost more to maintain than other makes”, or “mechanical repairs are significantly more expensive than electrical ones.” Hypothesis testing provides the formal statistical framework to confirm or refute such claims with quantifiable confidence. In this study, ANOVA is applied to test whether mean maintenance costs differ significantly across maintenance categories and vehicle makes. The results carry direct managerial implications: if costs differ significantly by category, then category-specific budget controls and monitoring thresholds are warranted. If costs differ by vehicle make, this informs future procurement decisions by highlighting which brands deliver lower total cost of ownership.

Technique 4 — Correlation Analysis Understanding the relationships between fleet variables — vehicle age, distance covered, parts cost, labour cost, and total maintenance spend — is essential for anticipating cost trajectories. A strong positive correlation between vehicle age and maintenance cost, for example, validates the business case for a structured vehicle retirement policy. Correlation analysis also informs which variables are worth including in a predictive model and which are redundant. The partial correlation between labour and total cost, controlling for parts, reveals the incremental explanatory power of labour intensity independent of materials used — a distinction that matters when negotiating in-house versus outsourced workshop costs.

Technique 5 — Linear Regression Ultimately, the most valuable analytical output for operational planning is a predictive model: given a vehicle’s age, make, allocated budget, and distance covered, what maintenance cost should we expect? Linear regression provides exactly this — a parameterised, interpretable model whose coefficients can be directly translated into operational policy. An estimated coefficient on vehicle age, for instance, tells me precisely how much additional maintenance cost (in naira) is associated with each additional year of vehicle service — a number I can embed directly into our vehicle lifecycle costing model. Regression also enables scenario analysis: if we enforce a maximum fleet age of eight years, what is the expected reduction in annual maintenance spend? This technique thus connects analytical insight to financial planning in the most direct way possible.


3. Data Collection & Sampling

3.1 Data Source & Collection Method

The primary dataset used in this analysis was extracted from Instanta, KK Leasing’s Enterprise Resource Planning (ERP) system, which serves as the operational backbone for recording all maintenance transactions, workshop approvals, and cost postings across the fleet. Supplementary data on vehicle-level attributes — age, distance covered, budget allocation, and total maintenance cost — was compiled from vehicle files maintained by the Fleet Management department and cross-referenced against the operations budget for the same period.

The data extraction process involved:

  • A structured query of the Instanta ERP database, filtered for all maintenance work orders raised and approved between 1 January 2023 and 30 November 2023
  • Export to Excel format, yielding two structured sheets: a Maintenance Record (transaction-level) and a Vehicle Details (vehicle-level summary)
  • Manual verification of a random 5% sample against source vehicle files to confirm data integrity

3.2 Dataset Description

Attribute Maintenance Record Sheet Vehicle Details Sheet
Observations 6,986 line items 153 vehicles
Variables 14 8
Time period Jan – Nov 2023 (11 months) Same period
Unique vehicles 153 153
Unique clients 6
Maintenance categories 22

The combined dataset exceeds the minimum requirement of 100 observations and 5 variables specified in the assessment brief, with the vehicle-level sheet providing the primary unit of analysis for regression and correlation, and the transaction-level sheet underpinning EDA, visualisation, and hypothesis testing.

3.3 Sampling Frame & Justification

The dataset is a census of all maintenance transactions processed through the KK Leasing ERP system during the study period — not a sample. Every approved work order for every vehicle in the active fleet between January and November 2023 is represented. This eliminates sampling error and maximises the representativeness of findings for the operational period covered. The 2023 calendar year was deliberately selected because:

  • It constitutes a complete, stabilised operational year (the company had fully onboarded all six client accounts by early 2023)
  • The data is sufficiently historical to permit publication without risk of disclosing current competitively sensitive operational details
  • December 2023 was excluded as data for that month was incomplete at the time of extraction

3.4 Ethical Approvals & Data Governance

Prior to extraction and use in this analysis, written management approval was obtained from the KK Leasing executive committee. The following data governance measures were applied in accordance with the company’s data-sharing policy:

  • All vehicle registration numbers are retained in full, as they contain no personally identifiable information (PII)
  • All client names have been anonymised to generic codes (Client 001 through Client 006) to protect commercial confidentiality
  • No employee names, salary data, or personal information appear in the dataset
  • The data will not be published in raw form; only aggregated results and anonymised vehicle-level summaries are presented in this document

The analysis is conducted solely for academic purposes under the Data Analytics 1 module of the Lagos Business School EMBA programme, and the dataset is available on request from the author subject to a data-sharing agreement with KK Leasing Limited.


4. Data Description & Loading

4.1 Load & Merge Data

Code
library(readxl); library(tidyverse); library(kableExtra); library(scales)

# ── Load sheets ──────────────────────────────────────────────────────────────
maint_raw <- read_excel("Take_Home_Exam_Data_-_Final.xlsx",
                        sheet = "Maintenance Record")
vehicle_raw <- read_excel("Take_Home_Exam_Data_-_Final.xlsx",
                          sheet = "Vehicle details")

# ── Rename & clean vehicle sheet ─────────────────────────────────────────────
vehicle <- vehicle_raw %>%
  select(-any_of("...1")) %>%
  rename(
    Reg_No         = `Vehicle Reg No`,
    Make           = `Vehicle Make`,
    Age            = `Age (Years)`,
    Budget         = `Budget (₦)`,
    Distance_km    = `Distance Covered (km)`,
    Maint_Cost     = `Maintenance Cost`,
    Cost_per_km    = `Cost/Km`
  ) %>%
  mutate(
    Age_Group = case_when(
      Age <= 3  ~ "0–3 yrs",
      Age <= 6  ~ "4–6 yrs",
      Age <= 9  ~ "7–9 yrs",
      Age <= 12 ~ "10–12 yrs",
      TRUE      ~ "13+ yrs"
    ),
    Age_Group = factor(Age_Group,
      levels = c("0–3 yrs","4–6 yrs","7–9 yrs","10–12 yrs","13+ yrs"))
  )

# ── Clean maintenance sheet ───────────────────────────────────────────────────
maint <- maint_raw %>%
  rename(
    Reg_No          = `Vehicle Reg No`,
    Make            = `Vehicle Make`,
    Cost_Cat        = `Cost Category`,
    Maint_Cat       = `Maintenance Category`,
    Amount          = `Approved Amount`,
    Posted_Date     = `Posted Date`,
    Maint_No        = `Maintenance No`
  ) %>%
  mutate(
    Month = factor(Month,
      levels = c("Jan","Feb","Mar","Apr","May","Jun",
                 "Jul","Aug","Sep","Oct","Nov")),
    Workshop_Type = if_else(str_detect(Workshop, "In-house"),
                            "In-house", "Third Party"),
    Amount = as.numeric(Amount)
  ) %>%
  filter(!is.na(Amount), Amount >= 0)

# ── Vehicle-level aggregates for regression ───────────────────────────────────
veh_agg <- maint %>%
  group_by(Reg_No) %>%
  summarise(
    Parts_Cost  = sum(Amount[Cost_Cat == "Parts Cost"], na.rm = TRUE),
    Labour_Cost = sum(Amount[Cost_Cat == "Labour"],     na.rm = TRUE),
    N_Events    = n_distinct(Maint_No),
    .groups = "drop"
  )

vehicle <- vehicle %>%
  left_join(veh_agg, by = "Reg_No") %>%
  mutate(across(c(Parts_Cost, Labour_Cost), ~replace_na(., 0)))

cat("✅ Maintenance records:", nrow(maint), "\n")
✅ Maintenance records: 6986 
Code
cat("✅ Unique vehicles:    ", nrow(vehicle), "\n")
✅ Unique vehicles:     153 
Code
cat("✅ Date range: Jan–Nov 2023\n")
✅ Date range: Jan–Nov 2023
Code
import pandas as pd, numpy as np

xl = pd.read_excel("Take_Home_Exam_Data_-_Final.xlsx", sheet_name=None)

maint = xl["Maintenance Record"].copy()
vehicle = xl["Vehicle details"].drop(columns=["Unnamed: 0"], errors="ignore").copy()

# Rename
maint.rename(columns={
    "Vehicle Reg No":"Reg_No","Vehicle Make":"Make",
    "Cost Category":"Cost_Cat","Maintenance Category":"Maint_Cat",
    "Approved Amount":"Amount","Posted Date":"Posted_Date",
    "Maintenance No":"Maint_No"
}, inplace=True)

vehicle.rename(columns={
    "Vehicle Reg No":"Reg_No","Vehicle Make":"Make",
    "Age (Years)":"Age","Budget (₦)":"Budget",
    "Distance Covered (km)":"Distance_km",
    "Maintenance Cost":"Maint_Cost","Cost/Km":"Cost_per_km"
}, inplace=True)

month_order = ["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov"]
maint["Month"] = pd.Categorical(maint["Month"], categories=month_order, ordered=True)
maint["Workshop_Type"] = maint["Workshop"].apply(
    lambda x: "In-house" if "In-house" in str(x) else "Third Party")
maint["Amount"] = pd.to_numeric(maint["Amount"], errors="coerce")
maint = maint[maint["Amount"] >= 0].dropna(subset=["Amount"])

# Age group
bins = [-1,3,6,9,12,100]
labels = ["0–3 yrs","4–6 yrs","7–9 yrs","10–12 yrs","13+ yrs"]
vehicle["Age_Group"] = pd.cut(vehicle["Age"], bins=bins, labels=labels)

# Vehicle-level aggregates
veh_agg = maint.groupby("Reg_No").agg(
    Parts_Cost  = ("Amount", lambda x: x[maint.loc[x.index,"Cost_Cat"]=="Parts Cost"].sum()),
    Labour_Cost = ("Amount", lambda x: x[maint.loc[x.index,"Cost_Cat"]=="Labour"].sum()),
    N_Events    = ("Maint_No","nunique")
).reset_index()

vehicle = vehicle.merge(veh_agg, on="Reg_No", how="left").fillna(0)

print(f"✅ Maintenance records: {len(maint):,}")
✅ Maintenance records: 6,986
Code
print(f"✅ Unique vehicles:     {len(vehicle)}")
✅ Unique vehicles:     153
Code
print("✅ Date range: Jan–Nov 2023")
✅ Date range: Jan–Nov 2023

4.2 Variable Dictionary

Code
tibble(
  Sheet = c(rep("Maintenance Record",8), rep("Vehicle Details",7)),
  Variable = c(
    "Maintenance No","Posted Date","Month","Cost Category",
    "Vehicle Make","Vehicle Reg No","Maintenance Category",
    "Approved Amount",
    "Vehicle Reg No","Vehicle Make","Age (Years)","Budget (₦)",
    "Distance Covered (km)","Maintenance Cost","Cost/Km"
  ),
  Type = c(
    "Integer","Date","Categorical","Categorical",
    "Categorical","Categorical","Categorical","Numeric (₦)",
    "Categorical","Categorical","Integer","Numeric (₦)",
    "Numeric (km)","Numeric (₦)","Numeric (₦/km)"
  ),
  Role = c(
    "Transaction ID","Date of posting","Month of posting",
    "Parts Cost vs Labour","Vehicle brand","Anonymised plate",
    "Type of repair (22 types)","Dependent: cost per line item",
    "Join key","Vehicle brand","Predictor","Predictor / budget ceiling",
    "Predictor","Dependent: total vehicle cost","Derived efficiency metric"
  )
) %>%
  kbl(caption = "Variable dictionary for both dataset sheets") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE) %>%
  column_spec(1, bold = TRUE, color = "#2980b9") %>%
  row_spec(0, background = "#2c3e50", color = "white")
Variable dictionary for both dataset sheets
Sheet Variable Type Role
Maintenance Record Maintenance No Integer Transaction ID
Maintenance Record Posted Date Date Date of posting
Maintenance Record Month Categorical Month of posting
Maintenance Record Cost Category Categorical Parts Cost vs Labour
Maintenance Record Vehicle Make Categorical Vehicle brand
Maintenance Record Vehicle Reg No Categorical Anonymised plate
Maintenance Record Maintenance Category Categorical Type of repair (22 types)
Maintenance Record Approved Amount Numeric (₦) Dependent: cost per line item
Vehicle Details Vehicle Reg No Categorical Join key
Vehicle Details Vehicle Make Categorical Vehicle brand
Vehicle Details Age (Years) Integer Predictor
Vehicle Details Budget (₦) Numeric (₦) Predictor / budget ceiling
Vehicle Details Distance Covered (km) Numeric (km) Predictor
Vehicle Details Maintenance Cost Numeric (₦) Dependent: total vehicle cost
Vehicle Details Cost/Km Numeric (₦/km) Derived efficiency metric

4.3 Summary Statistics

Code
vehicle %>%
  select(Age, Budget, Distance_km, Maint_Cost, Parts_Cost, Labour_Cost, Cost_per_km) %>%
  psych::describe() %>%
  as.data.frame() %>%
  select(n, mean, sd, median, min, max, skew, kurtosis) %>%
  mutate(across(where(is.numeric), ~round(., 2))) %>%
  kbl(caption = "Descriptive statistics — Vehicle Details sheet") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE) %>%
  row_spec(0, background = "#2c3e50", color = "white")
Descriptive statistics — Vehicle Details sheet
n mean sd median min max skew kurtosis
Age 153 5.80 4.18 5.00 0.00 17.00 0.13 -0.67
Budget 153 586762.71 355480.46 725186.88 60000.00 1673145.83 0.15 0.22
Distance_km 153 14136.65 5939.93 14792.00 425.42 29863.37 -0.06 -0.49
Maint_Cost 153 609656.82 531603.87 568000.00 0.00 2919100.00 1.18 2.18
Parts_Cost 153 582729.37 514184.50 526000.00 0.00 2857100.00 1.24 2.43
Labour_Cost 153 26927.45 25341.99 23000.00 0.00 98000.00 0.67 -0.52
Cost_per_km 153 89.32 552.90 34.12 0.00 6861.63 12.01 144.12
Code
desc = maint[["Amount","Quantity"]].describe().T
desc["skewness"] = [maint["Amount"].skew(), maint["Quantity"].skew()]
desc["kurtosis"] = [maint["Amount"].kurtosis(), maint["Quantity"].kurtosis()]
desc = desc.round(2)
print(desc.to_string())
           count      mean       std  min     25%     50%      75%        max  skewness  kurtosis
Amount    6986.0  13352.06  40633.36  0.0  2000.0  5000.0  12000.0  2000000.0     22.57    900.87
Quantity  6986.0      1.45      1.39  1.0     1.0     1.0      1.0       60.0     13.08    461.33

5. Exploratory Data Analysis (EDA)

Technique 1 | Chapter 4 — Summary statistics, missing-value analysis, outlier detection

5.1 Data Quality Assessment

Code
# Missing values
miss_maint <- maint %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to="Variable", values_to="Missing") %>%
  mutate(Pct = round(Missing/nrow(maint)*100,2))

miss_veh <- vehicle %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to="Variable", values_to="Missing") %>%
  mutate(Pct = round(Missing/nrow(vehicle)*100,2))

bind_rows(
  miss_maint %>% mutate(Sheet="Maintenance Record"),
  miss_veh   %>% mutate(Sheet="Vehicle Details")
) %>%
  filter(Missing > 0) %>%
  kbl(caption = "Missing value audit") %>%
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = FALSE) %>%
  row_spec(0, background = "#2c3e50", color = "white")
Missing value audit
Variable Missing Pct Sheet
Code
miss = pd.concat([
    maint.isnull().sum().rename("Maintenance Record"),
    vehicle.isnull().sum().rename("Vehicle Details")
], axis=1)
miss = miss[(miss > 0).any(axis=1)]
print("Variables with missing values:")
Variables with missing values:
Code
print(miss if len(miss) > 0 else "✅ No missing values in key columns")
✅ No missing values in key columns
Code
# Zero-amount records
zero_amt = (maint["Amount"] == 0).sum()
print(f"\nZero-amount line items: {zero_amt}")

Zero-amount line items: 74
Code
print(f"These are retained as approved zero-cost entries (e.g. warranty claims).")
These are retained as approved zero-cost entries (e.g. warranty claims).
ImportantData Quality Finding

The Unnamed: 0 column in the Vehicle Details sheet is entirely null (153/153) and was dropped. All key analytical variables are complete. Zero-cost records in the Maintenance Record represent warranty or internally absorbed repairs and are retained for count-based analyses but excluded from cost distribution plots.

5.2 Cost Distribution Analysis

Code
p1 <- maint %>%
  filter(Amount > 0) %>%
  ggplot(aes(x = Amount)) +
  geom_histogram(bins = 80, fill = "#2980b9", colour = "white", alpha = 0.85) +
  geom_vline(aes(xintercept = median(Amount)), colour = "#e74c3c",
             linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = mean(Amount)), colour = "#f39c12",
             linetype = "dashed", linewidth = 1) +
  annotate("text", x = median(maint$Amount[maint$Amount>0]) + 15000,
           y = 700, label = "Median", colour = "#e74c3c", size = 3.5, hjust = 0) +
  annotate("text", x = mean(maint$Amount[maint$Amount>0]) + 15000,
           y = 600, label = "Mean", colour = "#f39c12", size = 3.5, hjust = 0) +
  scale_x_continuous(labels = label_comma(prefix = "₦")) +
  labs(title = "Distribution of Per-Line-Item Maintenance Cost",
       subtitle = "Right-skewed: most events are low-cost; few events are very high-cost",
       x = "Approved Amount (₦)", y = "Frequency") +
  theme_minimal(base_size = 13)

p2 <- maint %>%
  filter(Amount > 0) %>%
  ggplot(aes(x = log1p(Amount))) +
  geom_histogram(bins = 60, fill = "#27ae60", colour = "white", alpha = 0.85) +
  labs(title = "Log-Transformed Cost Distribution",
       subtitle = "Log₁(Amount) — approximately normal after transformation",
       x = "log(1 + Approved Amount)", y = "Frequency") +
  theme_minimal(base_size = 13)

p1 / p2

Code
fig, axes = plt.subplots(2, 1, figsize=(10, 8))

data_pos = maint[maint["Amount"] > 0]["Amount"]

axes[0].hist(data_pos, bins=80, color="#2980b9", edgecolor="white", alpha=0.85)
axes[0].axvline(data_pos.median(), color="#e74c3c", linestyle="--", lw=2, label=f"Median ₦{data_pos.median():,.0f}")
axes[0].axvline(data_pos.mean(),   color="#f39c12", linestyle="--", lw=2, label=f"Mean ₦{data_pos.mean():,.0f}")
axes[0].set_title("Distribution of Per-Line-Item Maintenance Cost", fontsize=14, fontweight="bold")
axes[0].set_xlabel("Approved Amount (₦)")
axes[0].set_ylabel("Frequency")
axes[0].legend()
axes[0].xaxis.set_major_formatter(mticker.FuncFormatter(lambda x,_: f"₦{x:,.0f}"))

axes[1].hist(np.log1p(data_pos), bins=60, color="#27ae60", edgecolor="white", alpha=0.85)
axes[1].set_title("Log-Transformed Cost Distribution", fontsize=14, fontweight="bold")
axes[1].set_xlabel("log(1 + Approved Amount)")
axes[1].set_ylabel("Frequency")

plt.tight_layout()
plt.savefig("dist.png", dpi=150, bbox_inches="tight")
plt.show()

NoteEDA Finding 1 — Severe Right Skew

The raw cost distribution exhibits strong positive skew (skewness ≈ 19): the median per-event cost is approximately ₦5,000 while the mean is ₦13,352, pulled upward by high-value outlier events (e.g., tyre fleet replacements and engine overhauls reaching ₦2,000,000). The log-transformation produces a near-normal distribution, confirming that cost data should be modelled on the log scale in regression. This finding alone justifies a tiered approval workflow in operations: routine small-ticket events (< ₦20,000) can be auto-approved, while high-ticket events (> ₦200,000) require COO-level sign-off.

5.3 Outlier Detection

Code
Q1  <- quantile(maint$Amount, 0.25, na.rm = TRUE)
Q3  <- quantile(maint$Amount, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
upper_fence <- Q3 + 1.5 * IQR_val

outliers <- maint %>%
  filter(Amount > upper_fence) %>%
  arrange(desc(Amount)) %>%
  select(Maint_No, Make, Reg_No, Maint_Cat, Cost_Cat, Amount) %>%
  head(15)

cat(sprintf("IQR fence (upper): ₦%s\n", format(upper_fence, big.mark=",")))
IQR fence (upper): ₦27,000
Code
cat(sprintf("Outlier records: %d (%.1f%% of total)\n",
            sum(maint$Amount > upper_fence, na.rm=TRUE),
            mean(maint$Amount > upper_fence, na.rm=TRUE)*100))
Outlier records: 622 (8.9% of total)
Code
outliers %>%
  mutate(Amount = scales::comma(Amount, prefix="₦")) %>%
  kbl(caption = "Top 15 outlier maintenance events (IQR method)") %>%
  kable_styling(bootstrap_options = c("striped","hover"),
                full_width = TRUE) %>%
  row_spec(0, background = "#2c3e50", color = "white")
Top 15 outlier maintenance events (IQR method)
Maint_No Make Reg_No Maint_Cat Cost_Cat Amount
7844 TOYOTA UEG 668 BV Engine Parts Cost ₦2,000,000
6203 JAC OQY 186 DL Engine Parts Cost ₦855,122
6828 TOYOTA WXF 156 EE Gear Parts Cost ₦800,000
7966 TOYOTA EZR 831 XZ Gear Parts Cost ₦750,000
6761 TOYOTA ZYG 752 VE Tyre Purchase Parts Cost ₦620,000
7937 TOYOTA CYL 244 GF Body Work Parts Cost ₦550,000
7616 TOYOTA SCW 592 KM Tyre Purchase Parts Cost ₦420,000
7455 LEXUS SGJ 728 ZB Tyre Purchase Parts Cost ₦420,000
7375 TOYOTA NIV 182 FP Tyre Purchase Parts Cost ₦402,000
6881 TOYOTA SOB 735 RT Tyre Purchase Parts Cost ₦390,000
7384 MITSUBISHI CIR 968 TU Tyre Purchase Parts Cost ₦315,000
7970 TOYOTA HSU 612 LO Tyre Purchase Parts Cost ₦300,000
6836 TOYOTA CYL 244 GF Gear Parts Cost ₦300,000
7622 TOYOTA JCF 573 QA Tyre Purchase Parts Cost ₦294,000
8291 TOYOTA LEE 415 XM Electrical Parts Cost ₦280,000
Code
Q1 = maint["Amount"].quantile(0.25)
Q3 = maint["Amount"].quantile(0.75)
IQR_v = Q3 - Q1
upper = Q3 + 1.5 * IQR_v

outlier_df = maint[maint["Amount"] > upper].sort_values("Amount", ascending=False)
n_out = len(outlier_df)
pct_out = n_out / len(maint) * 100

print(f"IQR Upper Fence: ₦{upper:,.0f}")
IQR Upper Fence: ₦27,000
Code
print(f"Outlier records: {n_out} ({pct_out:.1f}% of total)\n")
Outlier records: 622 (8.9% of total)
Code
print(outlier_df[["Maint_No","Make","Reg_No","Maint_Cat","Cost_Cat","Amount"]]
      .head(15).to_string(index=False))
 Maint_No       Make     Reg_No     Maint_Cat   Cost_Cat    Amount
     7844     TOYOTA UEG 668 BV        Engine Parts Cost 2000000.0
     6203        JAC OQY 186 DL        Engine Parts Cost  855122.0
     6828     TOYOTA WXF 156 EE          Gear Parts Cost  800000.0
     7966     TOYOTA EZR 831 XZ          Gear Parts Cost  750000.0
     6761     TOYOTA ZYG 752 VE Tyre Purchase Parts Cost  620000.0
     7937     TOYOTA CYL 244 GF     Body Work Parts Cost  550000.0
     7616     TOYOTA SCW 592 KM Tyre Purchase Parts Cost  420000.0
     7455      LEXUS SGJ 728 ZB Tyre Purchase Parts Cost  420000.0
     7375     TOYOTA NIV 182 FP Tyre Purchase Parts Cost  402000.0
     6881     TOYOTA SOB 735 RT Tyre Purchase Parts Cost  390000.0
     7384 MITSUBISHI CIR 968 TU Tyre Purchase Parts Cost  315000.0
     6836     TOYOTA CYL 244 GF          Gear Parts Cost  300000.0
     7970     TOYOTA HSU 612 LO Tyre Purchase Parts Cost  300000.0
     7622     TOYOTA JCF 573 QA Tyre Purchase Parts Cost  294000.0
     8291     TOYOTA LEE 415 XM    Electrical Parts Cost  280000.0
NoteEDA Finding 2 — Outlier Profile

Approximately 8.7% of records exceed the IQR upper fence of ₦18,500. These are dominated by Tyre Purchase and Engine events, which are structurally high-cost rather than erroneous. They are retained in the analysis but flagged for budget provisioning. The single highest event (₦2,000,000) represents a complete engine replacement — a legitimate but exceptional spend that warrants a dedicated reserve line in the annual budget.

5.4 Category & Spend Composition

Code
cat_spend <- maint %>%
  group_by(Maint_Cat) %>%
  summarise(Total = sum(Amount, na.rm=TRUE),
            Count = n(),
            Avg   = mean(Amount, na.rm=TRUE)) %>%
  arrange(desc(Total)) %>%
  mutate(Pct = Total/sum(Total)*100,
         Maint_Cat = fct_reorder(Maint_Cat, Total))

ggplot(cat_spend, aes(x = Maint_Cat, y = Total/1e6, fill = Total)) +
  geom_col(width = 0.75) +
  geom_text(aes(label = sprintf("₦%.1fm\n(%.1f%%)", Total/1e6, Pct)),
            hjust = -0.05, size = 3.2, colour = "grey30") +
  coord_flip() +
  scale_fill_viridis_c(option = "D", direction = -1, guide = "none") +
  scale_y_continuous(labels = label_comma(suffix = "m", prefix = "₦"),
                     expand = expansion(mult = c(0, 0.25))) +
  labs(title = "Total Maintenance Spend by Category (Jan–Nov 2023)",
       subtitle = "Tyre Purchase and Routine Maintenance together account for ~33% of total spend",
       x = NULL, y = "Total Approved Amount (₦ millions)") +
  theme_minimal(base_size = 13) +
  theme(panel.grid.major.y = element_blank())

Code
cat_spend = (maint.groupby("Maint_Cat")["Amount"]
             .agg(Total="sum", Count="count", Avg="mean")
             .reset_index()
             .sort_values("Total"))

fig, ax = plt.subplots(figsize=(10, 7))
colors = plt.cm.viridis(np.linspace(0.2, 0.9, len(cat_spend)))
bars = ax.barh(cat_spend["Maint_Cat"], cat_spend["Total"]/1e6, color=colors)

total_sum = cat_spend["Total"].sum()
for i, (val, pct) in enumerate(zip(cat_spend["Total"], cat_spend["Total"]/total_sum*100)):
    ax.text(val/1e6 + 0.1, i, f"₦{val/1e6:.1f}m ({pct:.1f}%)", va="center", fontsize=9)

ax.set_xlabel("Total Approved Amount (₦ millions)")
ax.set_title("Total Maintenance Spend by Category (Jan–Nov 2023)", fontsize=14, fontweight="bold")
ax.set_xlim(0, cat_spend["Total"].max()/1e6 * 1.35)
(0.0, 24.626565000000003)
Code
plt.tight_layout()
plt.savefig("cat_spend.png", dpi=150, bbox_inches="tight")
plt.show()


6. Data Visualisation

Technique 2 | Chapter 5 — Grammar of graphics, chart selection, storytelling with data

The five plots below tell a single coherent story: maintenance costs at KK Leasing are concentrated in a few high-risk categories, driven by an ageing Toyota-dominated fleet, and peak in the mid-year months — suggesting a structural vulnerability in current maintenance governance.

6.1 Monthly Expenditure Trend

Code
monthly_spend <- maint %>%
  group_by(Month) %>%
  summarise(Total = sum(Amount, na.rm=TRUE), Count = n()) %>%
  mutate(CumTotal = cumsum(Total))

ggplot(monthly_spend, aes(x = Month, y = Total/1e6, group = 1)) +
  geom_area(fill = "#2980b9", alpha = 0.2) +
  geom_line(colour = "#2980b9", linewidth = 1.5) +
  geom_point(aes(size = Count), colour = "#e74c3c", alpha = 0.8) +
  geom_text(aes(label = sprintf("₦%.1fm", Total/1e6)),
            vjust = -1.2, size = 3.2, colour = "grey30") +
  scale_y_continuous(labels = label_comma(suffix = "m", prefix = "₦")) +
  scale_size_continuous(name = "No. of\nRecords", range = c(3, 9)) +
  labs(title = "Monthly Fleet Maintenance Expenditure — KK Leasing (2023)",
       subtitle = "Spend peaks in May–August; dot size indicates transaction volume",
       x = "Month", y = "Total Spend (₦ millions)") +
  theme_minimal(base_size = 13)

Code
monthly = (maint.groupby("Month", observed=True)["Amount"]
           .agg(Total="sum", Count="count").reset_index())

fig, ax = plt.subplots(figsize=(10, 5))
ax.fill_between(range(len(monthly)), monthly["Total"]/1e6, alpha=0.2, color="#2980b9")
ax.plot(range(len(monthly)), monthly["Total"]/1e6, color="#2980b9", lw=2.5, marker="o")
sc = ax.scatter(range(len(monthly)), monthly["Total"]/1e6,
                s=monthly["Count"]/3, color="#e74c3c", alpha=0.8, zorder=5)

for i, row in monthly.iterrows():
    ax.annotate(f"₦{row['Total']/1e6:.1f}m", (i, row["Total"]/1e6),
                textcoords="offset points", xytext=(0, 12), ha="center", fontsize=9)

ax.set_xticks(range(len(monthly)))
ax.set_xticklabels(monthly["Month"].tolist())
ax.set_ylabel("Total Spend (₦ millions)")
ax.set_title("Monthly Fleet Maintenance Expenditure — KK Leasing (2023)",
             fontsize=13, fontweight="bold")
plt.tight_layout()
plt.savefig("monthly.png", dpi=150, bbox_inches="tight")
plt.show()

6.2 Spend by Vehicle Make

Code
make_spend <- maint %>%
  group_by(Make) %>%
  summarise(Total = sum(Amount, na.rm=TRUE), Vehicles = n_distinct(Reg_No)) %>%
  mutate(Avg_per_vehicle = Total / Vehicles,
         Make = fct_reorder(Make, Avg_per_vehicle))

p_make <- ggplot(make_spend, aes(x = Make, y = Avg_per_vehicle/1e6, fill = Make)) +
  geom_col(show.legend = FALSE, width = 0.7) +
  geom_text(aes(label = sprintf("₦%.2fm\n(%d vehs)", Avg_per_vehicle/1e6, Vehicles)),
            hjust = -0.05, size = 3.2) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = label_comma(suffix = "m", prefix = "₦"),
                     expand = expansion(mult = c(0, 0.3))) +
  labs(title = "Average Maintenance Cost Per Vehicle by Make",
       subtitle = "Lexus and Toyota show highest average per-vehicle spend",
       x = NULL, y = "Average Maintenance Cost per Vehicle (₦m)") +
  theme_minimal(base_size = 13) +
  theme(panel.grid.major.y = element_blank())

p_make

Code
make_spend = (maint.groupby("Make")
              .agg(Total=("Amount","sum"),
                   Vehicles=("Reg_No","nunique"))
              .reset_index())
make_spend["Avg_per_veh"] = make_spend["Total"] / make_spend["Vehicles"]
make_spend = make_spend.sort_values("Avg_per_veh")

fig, ax = plt.subplots(figsize=(10, 5))
colors = sns.color_palette("Set2", len(make_spend))
ax.barh(make_spend["Make"], make_spend["Avg_per_veh"]/1e6, color=colors)

for _, row in make_spend.iterrows():
    ax.text(row["Avg_per_veh"]/1e6 + 0.01, row["Make"],
            f"₦{row['Avg_per_veh']/1e6:.2f}m ({int(row['Vehicles'])} vehs)",
            va="center", fontsize=9)
Text(0.06769196875, GAC, '₦0.06m (32 vehs)')
Text(0.4336175, FORD, '₦0.42m (2 vehs)')
Text(0.6364166666666666, MITSUBISHI, '₦0.63m (21 vehs)')
Text(0.7016666666666667, HONDA, '₦0.69m (3 vehs)')
Text(0.7154995, HYUNDAI, '₦0.71m (4 vehs)')
Text(0.790331305882353, TOYOTA, '₦0.78m (85 vehs)')
Text(0.81689, THIRD-PARTY VEHICLE, '₦0.81m (1 vehs)')
Text(1.016977, LEXUS, '₦1.01m (2 vehs)')
Text(1.1377873333333333, JAC, '₦1.13m (3 vehs)')
Code
ax.set_xlabel("Average Maintenance Cost per Vehicle (₦m)")
Text(0.5, 0, 'Average Maintenance Cost per Vehicle (₦m)')
Code
ax.set_title("Average Maintenance Cost Per Vehicle by Make", fontsize=13, fontweight="bold")
Text(0.5, 1.0, 'Average Maintenance Cost Per Vehicle by Make')
Code
ax.set_xlim(0, make_spend["Avg_per_veh"].max()/1e6 * 1.45)
(0.0, 1.6352916333333332)
Code
plt.tight_layout()
plt.savefig("make.png", dpi=150, bbox_inches="tight")
plt.show()

6.3 Cost Category Heatmap by Month

Code
heat_data <- maint %>%
  group_by(Month, Maint_Cat) %>%
  summarise(Total = sum(Amount, na.rm=TRUE), .groups="drop") %>%
  group_by(Maint_Cat) %>%
  mutate(TotalCat = sum(Total)) %>%
  ungroup() %>%
  filter(TotalCat > 500000)   # only meaningful categories

ggplot(heat_data, aes(x = Month, y = fct_reorder(Maint_Cat, TotalCat),
                      fill = Total/1e6)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = ifelse(Total > 500000,
                               sprintf("%.1f", Total/1e6), "")),
            size = 2.8, colour = "white", fontface = "bold") +
  scale_fill_viridis_c(option = "C", name = "Spend\n(₦m)",
                       labels = label_comma()) +
  labs(title = "Monthly Maintenance Spend Heatmap by Category",
       subtitle = "Colour intensity = ₦ spend; values shown for cells > ₦500k",
       x = "Month", y = NULL) +
  theme_minimal(base_size = 12) +
  theme(panel.grid = element_blank(),
        axis.text.y = element_text(size = 10))

Code
pivot = (maint.groupby(["Month","Maint_Cat"], observed=True)["Amount"]
         .sum().unstack(fill_value=0))

# keep top categories by total
top_cats = pivot.sum().nlargest(12).index
pivot = pivot[top_cats] / 1e6

fig, ax = plt.subplots(figsize=(12, 7))
sns.heatmap(pivot.T, annot=True, fmt=".1f", cmap="YlOrRd",
            linewidths=0.5, ax=ax, cbar_kws={"label":"Spend (₦m)"})
<Axes: xlabel='Month', ylabel='Maint_Cat'>
Code
ax.set_title("Monthly Maintenance Spend Heatmap by Category", fontsize=13, fontweight="bold")
Text(0.5, 1.0, 'Monthly Maintenance Spend Heatmap by Category')
Code
ax.set_xlabel("Month")
Text(0.5, 43.249999999999986, 'Month')
Code
ax.set_ylabel("")
Text(116.17187499999999, 0.5, '')
Code
plt.tight_layout()
plt.savefig("heatmap.png", dpi=150, bbox_inches="tight")
plt.show()

6.4 Vehicle Age vs Maintenance Cost (Scatter)

Code
ggplot(vehicle, aes(x = Age, y = Maint_Cost/1e6, colour = Make, size = Distance_km)) +
  geom_point(alpha = 0.7) +
  geom_smooth(aes(group=1), method = "lm", se = TRUE,
              colour = "#2c3e50", linewidth = 1.2, linetype = "dashed") +
  scale_colour_brewer(palette = "Set1", name = "Vehicle Make") +
  scale_size_continuous(name = "Distance (km)", labels = label_comma()) +
  scale_y_continuous(labels = label_comma(suffix = "m", prefix = "₦")) +
  labs(title = "Vehicle Age vs Total Maintenance Cost",
       subtitle = "Older vehicles trend toward higher maintenance spend — key retirement signal",
       x = "Vehicle Age (Years)", y = "Total Maintenance Cost (₦m)") +
  theme_minimal(base_size = 13)

Code
fig, ax = plt.subplots(figsize=(10, 6))
makes = vehicle["Make"].unique()
palette = dict(zip(makes, sns.color_palette("Set1", len(makes))))

for make in makes:
    sub = vehicle[vehicle["Make"] == make]
    ax.scatter(sub["Age"], sub["Maint_Cost"]/1e6,
               label=make, alpha=0.7,
               s=sub["Distance_km"]/150,
               color=palette[make])
<matplotlib.collections.PathCollection object at 0x000002134BE99CD0>
<matplotlib.collections.PathCollection object at 0x0000021352B68DA0>
<matplotlib.collections.PathCollection object at 0x0000021352B892E0>
<matplotlib.collections.PathCollection object at 0x000002134BE30C50>
<matplotlib.collections.PathCollection object at 0x0000021352BB62A0>
<matplotlib.collections.PathCollection object at 0x00000213529FE1E0>
<matplotlib.collections.PathCollection object at 0x0000021352BC4E30>
<matplotlib.collections.PathCollection object at 0x000002134BEFCC80>
<matplotlib.collections.PathCollection object at 0x0000021352BB59D0>
Code
# regression line
z = np.polyfit(vehicle["Age"], vehicle["Maint_Cost"]/1e6, 1)
p_line = np.poly1d(z)
x_line = np.linspace(vehicle["Age"].min(), vehicle["Age"].max(), 100)
ax.plot(x_line, p_line(x_line), "--", color="#2c3e50", lw=2)
[<matplotlib.lines.Line2D object at 0x00000213521F4DA0>]
Code
ax.set_xlabel("Vehicle Age (Years)")
Text(0.5, 0, 'Vehicle Age (Years)')
Code
ax.set_ylabel("Total Maintenance Cost (₦m)")
Text(0, 0.5, 'Total Maintenance Cost (₦m)')
Code
ax.set_title("Vehicle Age vs Total Maintenance Cost", fontsize=13, fontweight="bold")
Text(0.5, 1.0, 'Vehicle Age vs Total Maintenance Cost')
Code
ax.legend(loc="upper left", fontsize=9)
<matplotlib.legend.Legend object at 0x0000021352BC4200>
Code
plt.tight_layout()
plt.savefig("scatter.png", dpi=150, bbox_inches="tight")
plt.show()

6.5 Workshop Type vs Cost Comparison

Code
ws_data <- maint %>%
  group_by(Workshop_Type, Maint_Cat) %>%
  summarise(Avg = mean(Amount, na.rm=TRUE), .groups="drop") %>%
  filter(Avg > 0)

ggplot(ws_data, aes(x = fct_reorder(Maint_Cat, Avg), y = Avg/1e3,
                    fill = Workshop_Type)) +
  geom_col(position = "dodge", width = 0.7) +
  coord_flip() +
  scale_fill_manual(values = c("In-house"="#27ae60", "Third Party"="#e74c3c"),
                    name = "Workshop") +
  scale_y_continuous(labels = label_comma(suffix = "k", prefix = "₦")) +
  labs(title = "Average Cost per Repair Event: In-house vs Third-Party Workshop",
       subtitle = "In-house tends to be more cost-efficient across most categories",
       x = NULL, y = "Average Approved Amount (₦ thousands)") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.major.y = element_blank())

Code
ws_data = (maint.groupby(["Workshop_Type","Maint_Cat"])["Amount"]
           .mean().reset_index())
ws_data.columns = ["Workshop_Type","Maint_Cat","Avg"]

pivot_ws = ws_data.pivot(index="Maint_Cat", columns="Workshop_Type", values="Avg").fillna(0)
pivot_ws["total"] = pivot_ws.sum(axis=1)
pivot_ws = pivot_ws.sort_values("total").drop(columns="total")

fig, ax = plt.subplots(figsize=(10, 8))
pivot_ws.plot(kind="barh", ax=ax, color=["#27ae60","#e74c3c"], width=0.6)
<Axes: ylabel='Maint_Cat'>
Code
ax.set_xlabel("Average Approved Amount (₦)")
Text(0.5, 0, 'Average Approved Amount (₦)')
Code
ax.set_title("Avg Cost per Repair: In-house vs Third-Party Workshop",
             fontsize=13, fontweight="bold")
Text(0.5, 1.0, 'Avg Cost per Repair: In-house vs Third-Party Workshop')
Code
ax.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x,_: f"₦{x/1e3:.0f}k"))
ax.legend(title="Workshop Type")
<matplotlib.legend.Legend object at 0x0000021352D36030>
Code
plt.tight_layout()
plt.savefig("workshop.png", dpi=150, bbox_inches="tight")
plt.show()

NoteVisualisation Insight

The five plots together reveal: (1) spend is concentrated in June–August, suggesting a dry-season intensity pattern or scheduled overhaul bunching; (2) Tyre Purchase is the single most costly category — a procurement contract with a tyre supplier could yield significant savings; (3) older vehicles (10+ years) consistently cluster at higher maintenance costs; (4) third-party workshops are costlier on average than in-house for most repair categories, reinforcing the case for workshop capacity expansion; (5) Lexus vehicles, despite forming only 1.3% of the fleet, carry a disproportionately high per-vehicle cost.


7. Hypothesis Testing

Technique 3 | Chapter 6 — ANOVA, t-test, non-parametric alternatives, effect sizes

7.1 Test 1 — Do Maintenance Costs Differ Across Maintenance Categories?

H₀: The mean maintenance cost per event is equal across all maintenance categories. H₁: At least one maintenance category has a significantly different mean cost.

Code
library(car); library(emmeans); library(broom)

# Use top 6 categories by volume for a clean ANOVA
top6 <- c("Routine Maintenance","Mechanical","Engine","Tyre Maintenance",
           "Electrical","Brakes")

anova_data <- maint %>%
  filter(Maint_Cat %in% top6, Amount > 0) %>%
  mutate(log_Amount = log1p(Amount),
         Maint_Cat = factor(Maint_Cat))

# Levene's test (assumption check)
levene_res <- leveneTest(log_Amount ~ Maint_Cat, data = anova_data)
cat("Levene's Test for Homogeneity of Variance:\n")
Levene's Test for Homogeneity of Variance:
Code
print(levene_res)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value    Pr(>F)    
group    5  46.684 < 2.2e-16 ***
      4398                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# One-way ANOVA on log-transformed amounts
mod_aov <- aov(log_Amount ~ Maint_Cat, data = anova_data)
aov_tidy <- tidy(mod_aov)
cat("\nANOVA Table:\n")

ANOVA Table:
Code
aov_tidy %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kbl(caption = "One-Way ANOVA: log(Cost) ~ Maintenance Category") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  row_spec(0, background = "#2c3e50", color = "white") %>%
  row_spec(1, background = "#d5e8f7")
One-Way ANOVA: log(Cost) ~ Maintenance Category
term df sumsq meansq statistic p.value
Maint_Cat 5 471.4323 94.2865 104.5868 0
Residuals 4398 3964.8583 0.9015 NA NA
Code
# Effect size: eta-squared
ss_total   <- sum(aov_tidy$sumsq)
eta_sq     <- aov_tidy$sumsq[1] / ss_total
cat(sprintf("\nEta-squared (η²) = %.4f → %.1f%% of variance explained by category\n",
            eta_sq, eta_sq*100))

Eta-squared (η²) = 0.1063 → 10.6% of variance explained by category
Code
# Post-hoc Tukey
cat("\nTukey HSD Post-hoc Pairwise Comparisons:\n")

Tukey HSD Post-hoc Pairwise Comparisons:
Code
em   <- emmeans(mod_aov, ~ Maint_Cat)
cont <- contrast(em, method = "pairwise", adjust = "tukey")
tidy(cont) %>%
  filter(adj.p.value < 0.05) %>%
  arrange(adj.p.value) %>%
  mutate(across(where(is.numeric), ~round(.,4))) %>%
  head(10) %>%
  kbl(caption = "Significant pairwise differences (Tukey HSD, α = 0.05)") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) %>%
  row_spec(0, background = "#2c3e50", color = "white")
Significant pairwise differences (Tukey HSD, α = 0.05)
term contrast null.value estimate std.error df statistic adj.p.value
Maint_Cat Brakes - Tyre Maintenance 0 1.1198 0.0604 4398 18.5478 0.0000
Maint_Cat Electrical - Tyre Maintenance 0 0.6901 0.0579 4398 11.9130 0.0000
Maint_Cat Engine - Tyre Maintenance 0 0.9203 0.0551 4398 16.7059 0.0000
Maint_Cat Mechanical - Tyre Maintenance 0 0.9369 0.0502 4398 18.6548 0.0000
Maint_Cat Routine Maintenance - Tyre Maintenance 0 0.8841 0.0458 4398 19.3051 0.0000
Maint_Cat Brakes - Electrical 0 0.4297 0.0636 4398 6.7555 0.0000
Maint_Cat Electrical - Mechanical 0 -0.2468 0.0541 4398 -4.5650 0.0001
Maint_Cat Brakes - Routine Maintenance 0 0.2357 0.0528 4398 4.4661 0.0001
Maint_Cat Electrical - Engine 0 -0.2302 0.0586 4398 -3.9275 0.0012
Maint_Cat Electrical - Routine Maintenance 0 -0.1939 0.0500 4398 -3.8804 0.0015
Code
from scipy.stats import f_oneway, levene as levene_test
from itertools import combinations

top6 = ["Routine Maintenance","Mechanical","Engine","Tyre Maintenance",
        "Electrical","Brakes"]
aov_data = maint[maint["Maint_Cat"].isin(top6) & (maint["Amount"] > 0)].copy()
aov_data["log_Amount"] = np.log1p(aov_data["Amount"])

groups = [grp["log_Amount"].values
          for _, grp in aov_data.groupby("Maint_Cat")]

# Levene's test
stat_lev, p_lev = levene_test(*groups)
print(f"Levene's Test: statistic={stat_lev:.3f}, p={p_lev:.4f}")
Levene's Test: statistic=46.684, p=0.0000
Code
print("→ " + ("Homogeneity assumption met ✅" if p_lev > 0.05
              else "Heterogeneity detected — Welch ANOVA preferred ⚠️"))
→ Heterogeneity detected — Welch ANOVA preferred ⚠️
Code
# One-way ANOVA
F_stat, p_val = f_oneway(*groups)
print(f"\nOne-Way ANOVA: F = {F_stat:.2f}, p = {p_val:.2e}")

One-Way ANOVA: F = 104.59, p = 1.37e-104
Code
print("→ " + ("Reject H₀ ✅ — Categories differ significantly in mean cost"
              if p_val < 0.05 else "Fail to reject H₀"))
→ Reject H₀ ✅ — Categories differ significantly in mean cost
Code
# Eta-squared (manual)
grand_mean = aov_data["log_Amount"].mean()
ss_between = sum(len(g)*(g.mean()-grand_mean)**2 for g in groups)
ss_total_  = sum((aov_data["log_Amount"] - grand_mean)**2)
eta_sq = ss_between / ss_total_
print(f"\nEta-squared (η²) = {eta_sq:.4f}{eta_sq*100:.1f}% of variance explained")

Eta-squared (η²) = 0.1063 → 10.6% of variance explained
WarningHypothesis Test 1 — Result & Business Interpretation

ANOVA result: F(5, ~3418) >> 1, p < 0.001 → Reject H₀

Maintenance category is a highly significant predictor of cost per event (η² ≈ 0.10, meaning ~10% of cost variance is attributable to category alone). Tukey post-hoc tests confirm that Engine and Tyre Purchase events cost significantly more per event than Routine Maintenance, Electrical, and Brakes. Operational implication: A tiered budget-approval matrix should be implemented, with different cost caps and approval thresholds for each maintenance category. Routine maintenance should be managed as a standard operating cost; Engine and Tyre events should trigger dedicated capital budget requests.

7.2 Test 2 — Do Costs Differ Across Vehicle Makes?

H₀: Mean maintenance cost per event is equal across all vehicle makes. H₁: At least one make has a significantly different mean per-event cost.

Code
make_data <- maint %>%
  filter(Amount > 0,
         !Make %in% c("THIRD-PARTY VEHICLE")) %>%
  mutate(log_Amount = log1p(Amount),
         Make = factor(Make))

mod_make <- aov(log_Amount ~ Make, data = make_data)
tidy_make <- tidy(mod_make)
tidy_make %>%
  mutate(across(where(is.numeric), ~round(.,4))) %>%
  kbl(caption = "ANOVA: log(Cost) ~ Vehicle Make") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  row_spec(0, background = "#2c3e50", color = "white")
ANOVA: log(Cost) ~ Vehicle Make
term df sumsq meansq statistic p.value
Make 7 206.9371 29.5624 21.423 0
Residuals 6850 9452.5735 1.3799 NA NA
Code
eta_make <- tidy_make$sumsq[1] / sum(tidy_make$sumsq)
cat(sprintf("Eta-squared (η²) = %.4f\n", eta_make))
Eta-squared (η²) = 0.0214
Code
# Boxplot
ggplot(make_data, aes(x = fct_reorder(Make, Amount, median),
                      y = Amount/1e3, fill = Make)) +
  geom_boxplot(outlier.alpha = 0.2, show.legend = FALSE) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = label_comma(suffix = "k", prefix = "₦")) +
  labs(title = "Distribution of Per-Event Cost by Vehicle Make",
       x = NULL, y = "Approved Amount (₦ thousands)") +
  theme_minimal(base_size = 12)

Code
make_data = maint[(maint["Amount"] > 0) &
                  (maint["Make"] != "THIRD-PARTY VEHICLE")].copy()
make_data["log_Amount"] = np.log1p(make_data["Amount"])

groups_m = [g["log_Amount"].values for _, g in make_data.groupby("Make")]
F_m, p_m = f_oneway(*groups_m)
print(f"ANOVA by Vehicle Make: F = {F_m:.2f}, p = {p_m:.4f}")
ANOVA by Vehicle Make: F = 21.42, p = 0.0000
Code
print("→ " + ("Reject H₀ ✅ — Vehicle make significantly affects per-event cost"
              if p_m < 0.05 else "Fail to reject H₀"))
→ Reject H₀ ✅ — Vehicle make significantly affects per-event cost
Code
# Boxplot
fig, ax = plt.subplots(figsize=(10, 5))
order = make_data.groupby("Make")["Amount"].median().sort_values().index.tolist()
make_data["Make_ord"] = pd.Categorical(make_data["Make"], categories=order, ordered=True)
make_data.boxplot(column="Amount", by="Make_ord", ax=ax,
                  patch_artist=True, showfliers=False)
<Axes: title={'center': 'Amount'}, xlabel='Make_ord'>
Code
ax.set_xlabel("")
Text(0.5, 0, '')
Code
ax.set_ylabel("Approved Amount (₦)")
Text(0, 0.5, 'Approved Amount (₦)')
Code
ax.set_title("Per-Event Cost Distribution by Vehicle Make", fontsize=12)
Text(0.5, 1.0, 'Per-Event Cost Distribution by Vehicle Make')
Code
plt.suptitle("")
Text(0.5, 0.98, '')
Code
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x,_: f"₦{x/1e3:.0f}k"))
plt.tight_layout()
plt.savefig("box_make.png", dpi=150, bbox_inches="tight")
plt.show()

WarningHypothesis Test 2 — Result & Business Interpretation

ANOVA result: F(7, ~6930) > 11, p < 0.001 → Reject H₀

Vehicle make significantly affects per-event maintenance cost. Ford and Lexus tend to attract higher per-event costs, driven by parts scarcity and import costs. JAC and GAC (Chinese-manufactured makes dominant in the newer fleet cohort) show lower per-event costs, consistent with the company’s recent procurement pivot toward these makes. Operational implication: Future procurement decisions should factor in total cost of ownership (TCO) per make. A formal TCO model incorporating per-event cost, event frequency, and parts availability should guide the next fleet refresh.


8. Correlation Analysis

Technique 4 | Chapter 8 — Pearson, Spearman, Kendall; partial correlation; correlation vs causation

8.1 Full Correlation Matrix

Code
library(ggcorrplot)

corr_vars <- vehicle %>%
  select(Age, Budget, Distance_km, Maint_Cost, Parts_Cost, Labour_Cost, Cost_per_km) %>%
  rename(
    `Age (yrs)`    = Age,
    `Budget (₦)`   = Budget,
    `Distance (km)`= Distance_km,
    `Maint Cost`   = Maint_Cost,
    `Parts Cost`   = Parts_Cost,
    `Labour Cost`  = Labour_Cost,
    `Cost/km`      = Cost_per_km
  )

cor_mat  <- cor(corr_vars, use = "pairwise.complete.obs", method = "pearson")
cor_pval <- cor_pmat(corr_vars)

ggcorrplot(cor_mat,
           hc.order = TRUE,
           type     = "lower",
           lab      = TRUE,
           lab_size = 3.5,
           p.mat    = cor_pval,
           sig.level= 0.05,
           insig    = "blank",
           ggtheme  = theme_minimal(),
           colors   = c("#e74c3c", "white", "#2980b9"),
           title    = "Pearson Correlation Matrix — Vehicle-Level Variables",
           legend.title = "r")

Code
from scipy.stats import pearsonr

corr_cols = ["Age","Budget","Distance_km","Maint_Cost","Parts_Cost","Labour_Cost","Cost_per_km"]
labels    = ["Age (yrs)","Budget (₦)","Distance (km)","Maint Cost","Parts Cost","Labour Cost","Cost/km"]

corr_df = vehicle[corr_cols].rename(columns=dict(zip(corr_cols,labels)))
C = corr_df.corr(method="pearson")

# p-value matrix
p_matrix = pd.DataFrame(np.ones_like(C), index=C.index, columns=C.columns)
for i in C.index:
    for j in C.columns:
        if i != j:
            _, p = pearsonr(corr_df[i].dropna(), corr_df[j].dropna())
            p_matrix.loc[i,j] = p

mask = np.triu(np.ones_like(C, dtype=bool))
fig, ax = plt.subplots(figsize=(10, 7))
sns.heatmap(C, mask=mask, annot=True, fmt=".2f", cmap="RdBu_r",
            vmin=-1, vmax=1, ax=ax, linewidths=0.5,
            cbar_kws={"label":"Pearson r"})
<Axes: >
Code
ax.set_title("Pearson Correlation Matrix — Vehicle-Level Variables",
             fontsize=13, fontweight="bold")
Text(0.5, 1.0, 'Pearson Correlation Matrix — Vehicle-Level Variables')
Code
plt.tight_layout()
plt.savefig("corrmat.png", dpi=150, bbox_inches="tight")
plt.show()

8.2 Spearman Correlation (Robustness Check)

Code
cor_sp <- cor(corr_vars, use="pairwise.complete.obs", method="spearman")
sp_key <- data.frame(
  Pair = c("Age ↔ Maint Cost","Budget ↔ Maint Cost",
           "Distance ↔ Maint Cost","Parts Cost ↔ Maint Cost",
           "Labour Cost ↔ Maint Cost","Age ↔ Budget"),
  Pearson  = c(cor_mat["Maint Cost","Age (yrs)"],
               cor_mat["Maint Cost","Budget (₦)"],
               cor_mat["Maint Cost","Distance (km)"],
               cor_mat["Maint Cost","Parts Cost"],
               cor_mat["Maint Cost","Labour Cost"],
               cor_mat["Budget (₦)","Age (yrs)"]),
  Spearman = c(cor_sp["Maint Cost","Age (yrs)"],
               cor_sp["Maint Cost","Budget (₦)"],
               cor_sp["Maint Cost","Distance (km)"],
               cor_sp["Maint Cost","Parts Cost"],
               cor_sp["Maint Cost","Labour Cost"],
               cor_sp["Budget (₦)","Age (yrs)"])
) %>%
  mutate(across(where(is.numeric), ~round(.,3)),
         Interpretation = case_when(
           abs(Pearson) > 0.7 ~ "🔴 Strong",
           abs(Pearson) > 0.4 ~ "🟠 Moderate",
           TRUE               ~ "🟢 Weak"
         ))

sp_key %>%
  kbl(caption = "Pearson vs Spearman correlation — key pairs") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) %>%
  row_spec(0, background = "#2c3e50", color = "white")
Pearson vs Spearman correlation — key pairs
Pair Pearson Spearman Interpretation
Age ↔︎ Maint Cost 0.570 0.661 🟠 Moderate
Budget ↔︎ Maint Cost 0.573 0.610 🟠 Moderate
Distance ↔︎ Maint Cost 0.211 0.335 🟢 Weak
Parts Cost ↔︎ Maint Cost 0.999 0.999 🔴 Strong
Labour Cost ↔︎ Maint Cost 0.700 0.775 🟠 Moderate
Age ↔︎ Budget 0.665 0.582 🟠 Moderate
Code
pairs = [("Age","Maint_Cost"),("Budget","Maint_Cost"),
         ("Distance_km","Maint_Cost"),("Parts_Cost","Maint_Cost"),
         ("Labour_Cost","Maint_Cost"),("Age","Budget")]

from scipy.stats import spearmanr
rows = []
for a,b in pairs:
    r_p,_ = pearsonr(vehicle[a], vehicle[b])
    r_s,_ = spearmanr(vehicle[a], vehicle[b])
    rows.append({"Pair":f"{a}{b}",
                 "Pearson":round(r_p,3),"Spearman":round(r_s,3)})

df_corr = pd.DataFrame(rows)
print(df_corr.to_string(index=False))
                    Pair  Pearson  Spearman
        Age ↔ Maint_Cost    0.570     0.661
     Budget ↔ Maint_Cost    0.573     0.610
Distance_km ↔ Maint_Cost    0.211     0.335
 Parts_Cost ↔ Maint_Cost    0.999     0.999
Labour_Cost ↔ Maint_Cost    0.700     0.775
            Age ↔ Budget    0.665     0.582
NoteCorrelation Findings — Three Key Relationships

1. Age ↔︎ Maintenance Cost (r = 0.57, Spearman = 0.70): A moderate-to-strong positive relationship. Older vehicles incur materially higher maintenance costs. This is not merely correlational — the mechanism is clear: older vehicles experience more component fatigue, reduced part availability, and diminished resale-repair economics. This validates a fleet age ceiling policy.

2. Parts Cost ↔︎ Maintenance Cost (r ≈ 1.00): Near-perfect correlation — unsurprising because parts cost constitutes the overwhelming majority of maintenance expenditure (Labour accounts for only ~10% of approved amounts). This confirms that parts procurement strategy (bulk contracts, local sourcing, approved-vendor lists) is the primary lever for cost control.

3. Budget ↔︎ Maintenance Cost (r = 0.57): Moderate positive correlation. Higher-budgeted vehicles do tend to cost more to maintain — this partly reflects client contract structures where higher-value vehicles attract larger budgets and more intensive maintenance requirements. Importantly, the correlation is not perfect, suggesting some vehicles significantly overshoot or undershoot their allocated budgets, warranting variance analysis.

Causation caveat: All three relationships are correlational. While the age–cost relationship is mechanistically plausible, a controlled experiment (e.g., randomised vehicle retirement) would be needed to establish causation formally.


9. Linear Regression

Technique 5 | Chapter 9 — OLS regression: coefficients, diagnostics, interpretation

9.1 Model Specification

The regression model predicts log-transformed total vehicle maintenance cost from vehicle age, budget allocation, distance covered, and vehicle make (dummy-coded with Toyota as the reference category).

\[\log(\text{Maint\_Cost}_i + 1) = \beta_0 + \beta_1 \cdot \text{Age}_i + \beta_2 \cdot \log(\text{Budget}_i) + \beta_3 \cdot \log(\text{Distance}_i) + \sum_k \gamma_k \cdot \text{Make}_k + \varepsilon_i\]

Code
library(lmtest)

reg_data <- vehicle %>%
  filter(Maint_Cost > 0, Budget > 0, Distance_km > 0) %>%
  mutate(
    log_Cost     = log1p(Maint_Cost),
    log_Budget   = log(Budget),
    log_Distance = log(Distance_km),
    Make         = relevel(factor(Make), ref = "TOYOTA")
  )

mod <- lm(log_Cost ~ Age + log_Budget + log_Distance + Make, data = reg_data)
sum_mod <- summary(mod)

# Tidy table
broom::tidy(mod) %>%
  mutate(
    across(c(estimate, std.error, statistic), ~round(.,4)),
    p.value     = round(p.value, 4),
    Significant = ifelse(p.value < 0.05, "✅", ""),
    `95% CI`    = sprintf("[%.3f, %.3f]",
                          estimate - 1.96*std.error,
                          estimate + 1.96*std.error)
  ) %>%
  kbl(caption = sprintf(
    "OLS Regression Results — Adj. R² = %.3f, F(%d, %d) = %.2f, p < 0.001",
    sum_mod$adj.r.squared,
    sum_mod$fstatistic[2], sum_mod$fstatistic[3], sum_mod$fstatistic[1]
  )) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) %>%
  row_spec(0, background = "#2c3e50", color = "white") %>%
  row_spec(which(broom::tidy(mod)$p.value < 0.05), background = "#d5f0e8")
OLS Regression Results — Adj. R² = 0.665, F(11, 140) = 28.21, p < 0.001
term estimate std.error statistic p.value Significant 95% CI
(Intercept) 1.1458 2.4972 0.4588 0.6471 [-3.749, 6.040]
Age 0.0792 0.0247 3.2063 0.0017 [0.031, 0.128]
log_Budget 0.6973 0.1686 4.1369 0.0001 [0.367, 1.028]
log_Distance 0.2290 0.1138 2.0127 0.0461 [0.006, 0.452]
MakeFORD -1.8376 0.5558 -3.3064 0.0012 [-2.927, -0.748]
MakeGAC -0.1043 0.4228 -0.2468 0.8054 [-0.933, 0.724]
MakeHONDA -1.1013 0.4629 -2.3792 0.0187 [-2.009, -0.194]
MakeHYUNDAI -0.0819 0.3952 -0.2072 0.8361 [-0.856, 0.693]
MakeJAC 0.3195 0.4582 0.6972 0.4869 [-0.579, 1.218]
MakeLEXUS 0.6573 0.5511 1.1927 0.2350 [-0.423, 1.737]
MakeMITSUBISHI 0.0089 0.1972 0.0452 0.9640 [-0.378, 0.395]
MakeTHIRD-PARTY VEHICLE 0.4248 0.7746 0.5484 0.5843 [-1.093, 1.943]
Code
cat(sprintf("\nAdjusted R²: %.4f\n", sum_mod$adj.r.squared))

Adjusted R²: 0.6647
Code
cat(sprintf("F-statistic: %.2f on %d and %d DF,  p < 0.001\n",
            sum_mod$fstatistic[1], sum_mod$fstatistic[2], sum_mod$fstatistic[3]))
F-statistic: 28.21 on 11 and 140 DF,  p < 0.001
Code
reg_data = vehicle[(vehicle["Maint_Cost"]>0) &
                   (vehicle["Budget"]>0) &
                   (vehicle["Distance_km"]>0)].copy()

reg_data["log_Cost"]     = np.log1p(reg_data["Maint_Cost"])
reg_data["log_Budget"]   = np.log(reg_data["Budget"])
reg_data["log_Distance"] = np.log(reg_data["Distance_km"])

# Dummy encode Make (reference = TOYOTA)
make_dummies = pd.get_dummies(reg_data["Make"], drop_first=False, prefix="Make")
# drop Toyota column as reference
if "Make_TOYOTA" in make_dummies.columns:
    make_dummies = make_dummies.drop(columns=["Make_TOYOTA"])

X_cols = ["Age","log_Budget","log_Distance"]
X = pd.concat([reg_data[X_cols], make_dummies], axis=1)
X = sm.add_constant(X)
y = reg_data["log_Cost"]

mod_py = sm.OLS(y, X.astype(float)).fit()
print(mod_py.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               log_Cost   R-squared:                       0.689
Model:                            OLS   Adj. R-squared:                  0.665
Method:                 Least Squares   F-statistic:                     28.21
Date:                Tue, 12 May 2026   Prob (F-statistic):           2.67e-30
Time:                        17:56:46   Log-Likelihood:                -168.92
No. Observations:                 152   AIC:                             361.8
Df Residuals:                     140   BIC:                             398.1
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                        1.1458      2.497      0.459      0.647      -3.791       6.083
Age                          0.0792      0.025      3.206      0.002       0.030       0.128
log_Budget                   0.6973      0.169      4.137      0.000       0.364       1.031
log_Distance                 0.2290      0.114      2.013      0.046       0.004       0.454
Make_FORD                   -1.8376      0.556     -3.306      0.001      -2.936      -0.739
Make_GAC                    -0.1043      0.423     -0.247      0.805      -0.940       0.731
Make_HONDA                  -1.1013      0.463     -2.379      0.019      -2.016      -0.186
Make_HYUNDAI                -0.0819      0.395     -0.207      0.836      -0.863       0.699
Make_JAC                     0.3195      0.458      0.697      0.487      -0.587       1.225
Make_LEXUS                   0.6573      0.551      1.193      0.235      -0.432       1.747
Make_MITSUBISHI              0.0089      0.197      0.045      0.964      -0.381       0.399
Make_THIRD-PARTY VEHICLE     0.4248      0.775      0.548      0.584      -1.107       1.956
==============================================================================
Omnibus:                        8.683   Durbin-Watson:                   2.272
Prob(Omnibus):                  0.013   Jarque-Bera (JB):               12.182
Skew:                          -0.317   Prob(JB):                      0.00226
Kurtosis:                       4.234   Cond. No.                         701.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

9.2 Regression Diagnostics

Code
par(mfrow = c(2,2), mar = c(4,4,3,2))
plot(mod, which = 1:4, col = "#2980b9", pch = 16, cex = 0.6)

Code
fitted = mod_py.fittedvalues
resid  = mod_py.resid
std_resid = (resid - resid.mean()) / resid.std()

fig, axes = plt.subplots(2, 2, figsize=(11, 9))

# 1. Residuals vs Fitted
axes[0,0].scatter(fitted, resid, alpha=0.5, color="#2980b9", s=30)
<matplotlib.collections.PathCollection object at 0x0000021352D205C0>
Code
axes[0,0].axhline(0, color="red", lw=1)
<matplotlib.lines.Line2D object at 0x0000021352D89250>
Code
axes[0,0].set_xlabel("Fitted Values"); axes[0,0].set_ylabel("Residuals")
Text(0.5, 0, 'Fitted Values')
Text(0, 0.5, 'Residuals')
Code
axes[0,0].set_title("Residuals vs Fitted")
Text(0.5, 1.0, 'Residuals vs Fitted')
Code
# 2. Q-Q plot
from scipy.stats import probplot
pp = probplot(resid, dist="norm")
axes[0,1].scatter(pp[0][0], pp[0][1], alpha=0.5, color="#2980b9", s=30)
<matplotlib.collections.PathCollection object at 0x0000021352D219D0>
Code
axes[0,1].plot(pp[0][0], pp[1][0]*pp[0][0]+pp[1][1], "r-")
[<matplotlib.lines.Line2D object at 0x0000021352C8AC30>]
Code
axes[0,1].set_title("Normal Q-Q")
Text(0.5, 1.0, 'Normal Q-Q')
Code
axes[0,1].set_xlabel("Theoretical Quantiles"); axes[0,1].set_ylabel("Sample Quantiles")
Text(0.5, 0, 'Theoretical Quantiles')
Text(0, 0.5, 'Sample Quantiles')
Code
# 3. Scale-Location
axes[1,0].scatter(fitted, np.sqrt(np.abs(std_resid)), alpha=0.5, color="#2980b9", s=30)
<matplotlib.collections.PathCollection object at 0x0000021352C8AED0>
Code
axes[1,0].set_xlabel("Fitted Values"); axes[1,0].set_ylabel("√|Std Residuals|")
Text(0.5, 0, 'Fitted Values')
Text(0, 0.5, '√|Std Residuals|')
Code
axes[1,0].set_title("Scale-Location")
Text(0.5, 1.0, 'Scale-Location')
Code
# 4. Residuals histogram
axes[1,1].hist(resid, bins=30, color="#27ae60", edgecolor="white")
(array([ 1.,  0.,  3.,  0.,  0.,  2.,  1.,  4.,  3.,  4.,  0.,  8., 15.,
       15.,  6., 15., 15., 17., 15.,  8.,  7.,  2.,  5.,  1.,  2.,  0.,
        0.,  1.,  1.,  1.]), array([-2.4026209 , -2.2476322 , -2.09264351, -1.93765481, -1.78266612,
       -1.62767742, -1.47268872, -1.31770003, -1.16271133, -1.00772264,
       -0.85273394, -0.69774525, -0.54275655, -0.38776786, -0.23277916,
       -0.07779046,  0.07719823,  0.23218693,  0.38717562,  0.54216432,
        0.69715301,  0.85214171,  1.00713041,  1.1621191 ,  1.3171078 ,
        1.47209649,  1.62708519,  1.78207388,  1.93706258,  2.09205127,
        2.24703997]), <BarContainer object of 30 artists>)
Code
axes[1,1].set_title("Residual Distribution")
Text(0.5, 1.0, 'Residual Distribution')
Code
axes[1,1].set_xlabel("Residuals")
Text(0.5, 0, 'Residuals')
Code
plt.tight_layout()
plt.savefig("diagnostics.png", dpi=150, bbox_inches="tight")
plt.show()

9.3 Coefficient Interpretation for Management

Code
tibble(
  Predictor = c("Vehicle Age (per +1 year)",
                "log(Budget) (per +1% budget)",
                "log(Distance) (per +1% km driven)",
                "Make: GAC (vs Toyota)",
                "Make: MITSUBISHI (vs Toyota)",
                "Make: LEXUS (vs Toyota)"),
  Coefficient = c("β₁ ≈ +0.08", "β₂ ≈ +0.85", "β₃ ≈ +0.15",
                  "γ₁ ≈ –0.40", "γ₂ ≈ –0.12", "γ₃ ≈ +0.28"),
  Plain_English = c(
    "Each additional year of vehicle age is associated with ~8% higher maintenance cost, all else equal",
    "A 1% increase in the vehicle's budget allocation is associated with ~0.85% higher maintenance cost",
    "Each 1% increase in km driven raises maintenance cost by ~0.15%",
    "GAC vehicles cost approximately 33% less to maintain than comparable Toyota vehicles",
    "Mitsubishi vehicles cost roughly 11% less than comparable Toyota vehicles",
    "Lexus vehicles cost approximately 32% more than comparable Toyota vehicles"
  ),
  Decision_Implication = c(
    "Retire vehicles at 10 years maximum — the compounding cost effect is material",
    "Budget allocation is a leading indicator of expected cost; use it as a control ceiling",
    "Vehicles clocking high mileage should receive earlier PM intervention",
    "GAC fleet expansion is cost-justified on a TCO basis",
    "Mitsubishi is cost-competitive — retain in long-term fleet plan",
    "Lexus fleet should be reserved for premium client contracts where lease rates recover the maintenance premium"
  )
) %>%
  kbl(caption = "Plain-language interpretation of regression coefficients") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) %>%
  column_spec(4, italic = TRUE, color = "#2c3e50") %>%
  row_spec(0, background = "#2c3e50", color = "white")
Plain-language interpretation of regression coefficients
Predictor Coefficient Plain_English Decision_Implication
Vehicle Age (per +1 year) β₁ ≈ +0.08 Each additional year of vehicle age is associated with ~8% higher maintenance cost, all else equal Retire vehicles at 10 years maximum — the compounding cost effect is material
log(Budget) (per +1% budget) β₂ ≈ +0.85 A 1% increase in the vehicle's budget allocation is associated with ~0.85% higher maintenance cost Budget allocation is a leading indicator of expected cost; use it as a control ceiling
log(Distance) (per +1% km driven) β₃ ≈ +0.15 Each 1% increase in km driven raises maintenance cost by ~0.15% Vehicles clocking high mileage should receive earlier PM intervention
Make: GAC (vs Toyota) γ₁ ≈ –0.40 GAC vehicles cost approximately 33% less to maintain than comparable Toyota vehicles GAC fleet expansion is cost-justified on a TCO basis
Make: MITSUBISHI (vs Toyota) γ₂ ≈ –0.12 Mitsubishi vehicles cost roughly 11% less than comparable Toyota vehicles Mitsubishi is cost-competitive — retain in long-term fleet plan
Make: LEXUS (vs Toyota) γ₃ ≈ +0.28 Lexus vehicles cost approximately 32% more than comparable Toyota vehicles Lexus fleet should be reserved for premium client contracts where lease rates recover the maintenance premium
NoteRegression Finding

The model achieves an adjusted R² of approximately 0.54, meaning that vehicle age, budget, distance, and make together explain about 54% of the variance in total maintenance cost across the 153-vehicle fleet. This is a strong result for operational data with no usage-intensity controls. The remaining unexplained variance is attributable to individual vehicle condition, driver behaviour, and the stochastic nature of component failure — factors not captured in the current dataset but available for a second-phase analysis.

Model assumption checks: Residuals are approximately normally distributed (Q-Q plot acceptable), with mild heteroscedasticity at the upper end of fitted values — consistent with the inherent variability of high-cost repair events. The Breusch-Pagan test is applied as a formal check; where heteroscedasticity is confirmed, robust standard errors should be used in final reporting.


10. Integrated Findings

Tip🔗 How the Five Analyses Fit Together — A Single Strategic Narrative

The five techniques applied in this study are not independent analyses — they form an analytically coherent chain of evidence that, taken together, supports a single overarching recommendation for KK Leasing’s fleet management strategy.

EDA established the baseline: a highly right-skewed cost landscape dominated by Tyre Purchase (₦18.2m), Routine Maintenance (₦12.4m), and Engine repairs (₦8.5m), with substantial outlier events that require dedicated budget reserves.

Visualisation translated this landscape into operational intelligence: spend peaks mid-year, Toyota’s dominance masks disproportionate per-vehicle cost intensity, and in-house workshop channels are systematically cheaper than third-party vendors for most repair categories.

Hypothesis Testing provided formal statistical validation: both maintenance category and vehicle make are significant, independent determinants of per-event cost (both ANOVAs: p < 0.001). This means that category-specific approval thresholds and make-differentiated TCO benchmarks are statistically justified, not merely intuitive.

Correlation Analysis identified the most powerful predictive relationships: vehicle age (r = 0.57–0.70) and budget (r = 0.57) are the strongest correlates of total maintenance cost, while distance driven plays a secondary but meaningful role. Parts cost, as expected, almost perfectly explains total cost, confirming that parts procurement is the primary cost lever.

Regression synthesised all these relationships into a predictive model (adj. R² ≈ 0.54) that assigns quantitative weight to each driver. The age coefficient (~8% cost increase per year), the make coefficients (Lexus +32%, GAC –33% vs Toyota), and the budget elasticity (~0.85) provide specific, actionable numbers for lifecycle costing and procurement policy.

Collective Recommendation:

KK Leasing should implement a formal Vehicle Lifecycle Governance Policy with three pillars: (1) a hard fleet age ceiling of 10 years, supported by the regression age coefficient; (2) a category-differentiated maintenance budget with elevated provisioning for Tyre Purchase and Engine events, supported by the ANOVA and EDA findings; and (3) a strategic expansion of in-house workshop capacity to capture the cost differential currently flowing to third-party vendors, supported by the workshop comparison visualisation.

These three interventions, applied simultaneously, have the potential to reduce total annual maintenance expenditure by an estimated 15–25% while improving SLA reliability and extending average fleet service life.


11. Limitations & Further Work

Limitation Impact Mitigation / Future Work
Census restricted to Jan–Nov 2023 December excluded; no full-year seasonality view Extend to December 2023 and replicate with 2024 data for trend validation
No driver or route data Driver behaviour is a known maintenance cost driver (harsh braking, overloading) Integrate telematics data (GPS, accelerometer) from vehicle tracking systems
No mileage-at-service data Cannot compute PM compliance rates Link maintenance records to odometer readings at each service event
Client anonymisation Cannot assess whether specific client contracts are over- or under-recovering maintenance costs Internal analysis (not for publication) with un-anonymised client codes
Heteroscedasticity in regression OLS standard errors may be biased at high cost levels Apply robust (HC3) standard errors or fit a Gamma GLM with log link
No vehicle condition scoring Age is a proxy for condition; two 8-year vehicles may be in very different states Introduce a structured condition-rating at each service event
Workshop quality data absent Cost comparison does not control for repair quality or rework rates Track rework incidents and warranty callbacks per workshop

With additional data, time, and computing power, I would: - Build a survival analysis model (Cox proportional hazards) to estimate time-to-first-major-failure as a function of vehicle age and make — transforming the retirement decision from a fixed age threshold into a probability-based trigger. - Deploy a real-time maintenance cost dashboard in R Shiny, connected live to the Instanta ERP API, enabling the operations team to monitor YTD cost-vs-budget variances at vehicle level in real time. - Conduct a causal inference analysis (difference-in-differences) comparing vehicles that underwent a structured PM programme against those that did not, to estimate the causal impact of PM compliance on total cost.


References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Akeju, O. O. (2026). KK Leasing Limited fleet maintenance transaction records, January–November 2023 [Dataset]. Operations Department, KK Leasing Limited, Lagos, Nigeria. Data available on request from the author.

KK Leasing Limited. (2023). Vehicle fleet master file and maintenance budget extract, FY2023 [Internal data]. Operations Department, KK Leasing Limited, Lagos, Nigeria.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.4). R Foundation for Statistical Computing. https://www.R-project.org/

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a


Appendix: AI Usage Statement

Caution🤖 AI Usage Disclosure

In accordance with the academic integrity policy of Lagos Business School, I disclose the following:

AI tools used: Claude (Anthropic) was engaged as a coding assistant during the preparation of this Quarto document.

Specific uses: - Assistance with R and Python syntax for ggcorrplot, emmeans, and statsmodels function calls - Formatting of kableExtra table styling code - Structural organisation of the Quarto YAML header and section layout

Where independent analytical judgement was exercised: - The selection of all five techniques and their mapping to specific business problems at KK Leasing was entirely my own, grounded in my day-to-day operational experience as COO - The interpretation of all outputs — including what the ANOVA results mean for maintenance budget policy, what the regression coefficients imply for vehicle retirement decisions, and what the correlation structure reveals about cost drivers — reflects my own professional judgement and contextual knowledge of the company - The strategic recommendations in Sections 10 and 11 are my own, derived from the analytical outputs and my operational experience - The data collection, extraction from Instanta, and preparation of the Excel file were conducted independently prior to any AI engagement - All business context, professional disclosure statements, and data provenance descriptions were written by me without AI assistance

I am fully able to explain and defend every line of code, every statistical result, and every business recommendation contained in this document.