Bajaj Sales Analytics: Exploratory & Inferential Analysis of Nigerian Dealership Data

Author

Ikenna Ochei I. — MBA Student, Lagos Business School

Published

May 4, 2026

1. Executive Summary

This study is submitted in partial fulfilment of the Data Analytics II capstone assessment by Ikenna Ochei I., an MBA student at Lagos Business School with a background in development economics. The dataset — 100 daily sales transactions from a Bajaj dealership in Nigeria spanning January 20 to April 29, 2026 — was obtained through a professional network contact who serves as Sales Manager at the dealership, and covers two product lines (Bajaj Motorcycle and Bajaj Tricycle), recording quantity sold, revenue, profit, cost price, buyer sex, and buyer education level per transaction.

Five analytical techniques are applied. Exploratory Data Analysis (EDA) reveals a consistent 30.4% profit margin across both products and a single high-revenue outlier transaction. Visualisation maps a clear upward revenue trend, product-mix dynamics, and buyer profile distributions. Hypothesis testing confirms that Tricycle transactions generate statistically significantly higher revenue than Motorcycle transactions (Welch's t = 7.60, p < 0.001), and finds no significant effect of buyer education level on units purchased (ANOVA, F = 0.85, p = 0.43). Correlation analysis shows that quantity sold is strongly correlated with revenue (r = 0.71), while cost, revenue, and profit are perfectly collinear owing to fixed pricing. Linear regression of revenue on quantity yields an R² of 0.498 in a pooled model, improving to R² = 1.000 when product type is included as an interaction term, which confirms that each additional Tricycle unit sold adds ₦3.62 million in revenue versus ₦1.12 million for a Motorcycle unit. The integrated recommendation is that the dealership should prioritise Tricycle inventory and consumer financing partnerships: Tricycles generate 3.4× the revenue per transaction and are the dominant driver of the rising monthly revenue trend observed across the study period.

2. Professional Disclosure

Name: Ikenna Ochei I.
Programme: MBA, Lagos Business School
Professional background: Development Economist — independent consulting
Data source: The dataset was provided by a personal contact who serves as Sales Manager at a Bajaj dealership in Nigeria. The data was shared voluntarily to support this academic assignment. Permission to use the data for analytical purposes has been granted verbally, with the dealership’s name and precise location withheld at the contact’s request. No customer-level personally identifiable information (names, ID numbers, addresses) is present in the data.

As a development economist, my professional work centres on understanding how firms, households, and markets operate in low- and middle-income contexts — with particular attention to Nigeria’s informal and semi-formal economy. Bajaj motorcycles and tricycles (popularly known as okada and keke napep) are not merely consumer products in this context: they are productive assets that enable last-mile transportation, generate income for owner-operators, and support livelihoods for millions of Nigerians in peri-urban and rural communities. Analysing the sales patterns of a Bajaj dealership therefore speaks directly to questions I engage with professionally: How are productive transport assets distributed across buyers? What drives demand for higher-value assets like Tricycles versus entry-level Motorcycles? Does buyer education or demographic profile shape access to commercial transport assets?

Technique justifications

EDA: Development economics begins with careful description of what the data actually contains before imposing any model. In my work, I have encountered datasets from Nigerian firms where what appeared to be clean records contained systematic coding errors, fixed-price artefacts, or incomplete records. EDA on the Bajaj data is the foundational step that surfaces those issues — including the structural fixed-pricing pattern and the single bulk-order outlier — before any inferential or predictive work begins.

Data Visualisation: In development economics, findings must be communicated to non-technical audiences — dealership owners, policymakers, donor organisations — who cannot read regression tables but can read a well-constructed chart. Visualising the Bajaj revenue trend and product-mix shift translates the analytical story into a form that supports real decisions about inventory, financing, and market strategy.

Hypothesis Testing: A core question in development economics is whether observed differences between groups are real or the product of small samples and chance. Here, testing whether Tricycle transactions genuinely generate higher revenue than Motorcycle transactions, and whether buyer education level affects purchase volume, directly mirrors the kind of question a dealership manager or a development finance institution would need answered before committing resources.

Correlation Analysis: Understanding the co-movement of sales variables — quantity, revenue, profit, cost — is essential before building any structural model. In my econometric work, I always inspect the correlation structure first to detect multicollinearity that would invalidate regression estimates. The near-perfect correlation between Revenue, Profit, and Cost Price in this dataset is a structurally important finding that reshapes the modelling strategy.

Linear Regression: Regression is the primary analytical tool of applied economics. Estimating how quantity sold predicts revenue, and how that relationship changes by product type, maps directly onto a planning question any dealership manager faces: how many units of each product type must be sold to reach a target revenue level? Translating regression coefficients into such concrete operational terms is the bridge between statistical output and business decision-making.

3. Data Collection & Sampling

Source: Sales ledger extracted by the Bajaj dealership’s Sales Manager and shared as a structured Excel file.

Collection method: Administrative records — each row represents one sales transaction recorded at the point of sale.

Sampling frame: All transactions completed at this single dealership between 20 January 2026 and 29 April 2026 — a census of the available period, not a sample from a broader population.

Sample size: 100 transactions (54 Motorcycle, 46 Tricycle).

Variables:

Variable	Type	Description
Date	Date	Transaction date
Sex	Categorical (binary)	Buyer sex: Male / Female
Education	Categorical (ordinal)	Buyer education: Primary / Secondary / Tertiary
Product	Categorical (binary)	Bajaj Motorcycle / Bajaj Tricycle
Qty	Numeric (integer)	Units sold in transaction
Revenue (₦)	Numeric (continuous)	Total transaction revenue
Profit (₦)	Numeric (continuous)	Gross profit on transaction
Cost Price	Numeric (continuous)	Total cost of goods sold

Time period covered: 20 January 2026 – 29 April 2026 (~100 calendar days, approximately one quarter).

Ethical notes: No personally identifiable information was provided. The analysis is for academic purposes only. The dealership name is withheld at the data provider’s request. The researcher has obtained informal consent from the Sales Manager.

Limitations of sampling frame: Single-dealership data cannot be generalised to the Nigerian Bajaj market. The 100-observation window covers one quarter; seasonal patterns cannot be assessed without multi-year data.

4. Data Description

Code

library(tidyverse)
library(readxl)
library(skimr)
library(knitr)
library(kableExtra)
library(ggcorrplot)
library(scales)
library(broom)
library(car)
library(lmtest)
library(nortest)

# Load data
df <- read_excel("Bajaj_ Historical_ Sales 1.xlsx")

# Clean and engineer
df <- df %>%
  rename(
    Revenue  = `Revenue (₦)`,
    Profit   = `Profit (₦)`,
    CostPrice = `Cost Price`
  ) %>%
  mutate(
    Date          = as.Date(Date),
    Month         = lubridate::floor_date(Date, "month"),
    MonthLabel    = format(Date, "%b %Y"),
    Week          = lubridate::isoweek(Date),
    ProfitMargin  = Profit / Revenue,
    UnitPrice     = Revenue / Qty,
    Sex           = factor(Sex),
    Education     = factor(Education, levels = c("Primary", "Secondary", "Tertiary")),
    Product       = factor(Product)
  )

glimpse(df)

Rows: 100
Columns: 14
$ `S/N`        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ Date         <date> 2026-01-20, 2026-01-21, 2026-01-22, 2026-01-23, 2026-01-…
$ Sex          <fct> Female, Male, Male, Male, Male, Male, Male, Male, Male, F…
$ Education    <fct> Primary, Tertiary, Primary, Tertiary, Tertiary, Secondary…
$ Product      <fct> Bajaj Motorcycle, Bajaj Tricycle, Bajaj Motorcycle, Bajaj…
$ Qty          <dbl> 32, 97, 45, 63, 86, 67, 7, 38, 79, 28, 3, 118, 40, 47, 43…
$ Revenue      <dbl> 35880000, 351382500, 50456250, 70638750, 96427500, 242707…
$ Profit       <dbl> 10920000, 106942500, 15356250, 21498750, 29347500, 738675…
$ CostPrice    <dbl> 24960000, 244440000, 35100000, 49140000, 67080000, 168840…
$ Month        <date> 2026-01-01, 2026-01-01, 2026-01-01, 2026-01-01, 2026-01-…
$ MonthLabel   <chr> "Jan 2026", "Jan 2026", "Jan 2026", "Jan 2026", "Jan 2026…
$ Week         <dbl> 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, …
$ ProfitMargin <dbl> 0.3043478, 0.3043478, 0.3043478, 0.3043478, 0.3043478, 0.…
$ UnitPrice    <dbl> 1121250, 3622500, 1121250, 1121250, 1121250, 3622500, 362…

Code

skim(df %>% select(Qty, Revenue, Profit, CostPrice))

Data summary
Name	df %>% select(Qty, Revenu…
Number of rows	100
Number of columns	4
_______________________
Column type frequency:
numeric	4
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Qty	1	56.27	33.41	2	27.75	55.5	84.25	119	▇▇▅▇▅
Revenue	1	129350850.00	115765021.00	3363750	45410625.00	88018125.0	191086875.00	431077500	▇▃▁▂▁
Profit	1	39367650.00	35232832.48	1023750	13820625.00	26788125.0	58156875.00	131197500	▇▃▁▂▁
CostPrice	1	89983200.00	80532188.52	2340000	31590000.00	61230000.0	132930000.00	299880000	▇▃▁▂▁

4.1 Data Quality Issues

Code

# Missing values
cat("=== Missing values ===\n")

=== Missing values ===

Code

colSums(is.na(df))

         S/N         Date          Sex    Education      Product          Qty 
           0            0            0            0            0            0 
     Revenue       Profit    CostPrice        Month   MonthLabel         Week 
           0            0            0            0            0            0 
ProfitMargin    UnitPrice 
           0            0

Code

# Duplicate rows
cat("\nDuplicate rows:", sum(duplicated(df)), "\n")


Duplicate rows: 0

Code

# Outlier detection — IQR method on Revenue
q1  <- quantile(df$Revenue, 0.25)
q3  <- quantile(df$Revenue, 0.75)
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- df %>% filter(Revenue < lower_fence | Revenue > upper_fence)
cat("\nIQR fences — Lower:", scales::comma(lower_fence), "| Upper:", scales::comma(upper_fence), "\n")


IQR fences — Lower: -173,103,750 | Upper: 409,601,250

Code

cat("Outlier transactions:", nrow(outliers), "\n")

Outlier transactions: 1

Code

print(outliers %>% select(Date, Product, Qty, Revenue, Profit))

# A tibble: 1 × 5
  Date       Product          Qty   Revenue    Profit
  <date>     <fct>          <dbl>     <dbl>     <dbl>
1 2026-04-10 Bajaj Tricycle   119 431077500 131197500

Issue 1: One high-revenue outlier. A single Tricycle transaction (dated 10 April 2026) recorded revenue of ₦431 million, driven by 119 units at the fixed Tricycle unit price of ₦3,622,500. This is a legitimate bulk purchase, not a data error. It is retained in all analyses but flagged where it influences distributional tests.
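The fence arithmetic can be reproduced by hand from the quartiles reported in the skim() summary. The sketch below is illustrative only: it hard-codes the printed quartiles rather than reading the ledger, and the object names are hypothetical.

```r
# Quartiles of Revenue as printed in the skim() summary (p25 and p75)
q1 <- 45410625
q3 <- 191086875

# Tukey fences: 1.5 x IQR beyond each quartile
iqr    <- q3 - q1
fences <- c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)

# The flagged bulk order: 119 Tricycles at the fixed unit price
bulk_order_revenue <- 119 * 3622500

print(format(fences, big.mark = ","))
cat("Bulk order revenue:", format(bulk_order_revenue, big.mark = ","), "\n")
cat("Above upper fence: ", bulk_order_revenue > fences[["upper"]], "\n")
```

Because the lower fence is negative, no low-revenue transaction can be flagged by this rule; only the upper fence is informative for this dataset.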

Issue 2 — Fixed unit pricing creates perfect internal collinearity: Revenue, Profit, and Cost Price are exact linear functions of Qty × UnitPrice. The unit price for Motorcycles is uniformly ₦1,121,250 and for Tricycles is uniformly ₦3,622,500 across all 100 transactions; no price variation exists. This means Revenue, Profit, and CostPrice carry identical information — multicollinearity is structural, not incidental. Regression models must use only one of these as the dependent variable and must not include the others as regressors.

Code

df %>%
  group_by(Product) %>%
  summarise(
    Min_UnitPrice = min(UnitPrice),
    Max_UnitPrice = max(UnitPrice),
    SD_UnitPrice  = sd(UnitPrice),
    ProfitMargin  = mean(ProfitMargin)
  ) %>%
  kable(caption = "Unit pricing is perfectly fixed across all transactions",
        format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Unit pricing is perfectly fixed across all transactions
Product	Min_UnitPrice	Max_UnitPrice	SD_UnitPrice	ProfitMargin
Bajaj Motorcycle	1,121,250	1,121,250	0	0.3043478
Bajaj Tricycle	3,622,500	3,622,500	0	0.3043478
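The same structural identities can be checked row by row. The sketch below uses only the first five transactions as printed in the glimpse() output, so it is a spot check rather than a full audit. Treating the margin as exactly 7/23 is an assumption inferred from the printed value 0.3043478.

```r
# First five transactions, copied from the glimpse(df) output above
first5 <- data.frame(
  Qty       = c(32, 97, 45, 63, 86),
  UnitPrice = c(1121250, 3622500, 1121250, 1121250, 1121250),
  Revenue   = c(35880000, 351382500, 50456250, 70638750, 96427500),
  Profit    = c(10920000, 106942500, 15356250, 21498750, 29347500),
  CostPrice = c(24960000, 244440000, 35100000, 49140000, 67080000)
)

# Three identities implied by fixed pricing and a fixed margin
identity_check <- with(first5, c(
  revenue_is_qty_times_price  = all(Revenue == Qty * UnitPrice),
  revenue_is_cost_plus_profit = all(Revenue == CostPrice + Profit),
  # 7/23 is an assumed exact form of the printed margin 0.3043478
  margin_is_seven_23rds       = isTRUE(all.equal(Profit / Revenue, rep(7 / 23, 5)))
))

print(identity_check)
```

If all three checks hold on the full ledger as well, Revenue, Profit, and CostPrice are rescalings of one another, which is why only one of them can enter a regression.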

5. Exploratory Data Analysis

Code

summary_tbl <- df %>%
  group_by(Product) %>%
  summarise(
    Transactions    = n(),
    Total_Qty       = sum(Qty),
    Total_Revenue   = sum(Revenue),
    Total_Profit    = sum(Profit),
    Mean_Revenue    = mean(Revenue),
    Median_Revenue  = median(Revenue),
    SD_Revenue      = sd(Revenue),
    Mean_Qty        = mean(Qty),
    ProfitMargin_pct = mean(ProfitMargin) * 100
  )

summary_tbl %>%
  kable(caption = "Summary statistics by product type",
        digits = 1,
        format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped","hover"))

Summary statistics by product type
Product	Transactions	Total_Qty	Total_Revenue	Total_Profit	Mean_Revenue	Median_Revenue	SD_Revenue	Mean_Qty	ProfitMargin_pct
Bajaj Motorcycle	54	2,978	3,339,082,500	1,016,242,500	61,834,861	56,623,125	36,229,618	55.1	30.4
Bajaj Tricycle	46	2,649	9,596,002,500	2,920,522,500	208,608,750	240,896,250	126,672,623	57.6	30.4

Key EDA findings:

The dataset has no missing values and no duplicates. All variables are correctly typed.
Both products carry an identical 30.43% gross profit margin — a deliberate pricing structure that simplifies margin management but removes price discrimination as a competitive lever.
Tricycle transactions have a 3.4× higher mean revenue per transaction (₦208.6 m vs ₦61.8 m), driven almost entirely by the higher unit price; mean order size is only slightly larger (57.6 vs 55.1 units).
Quantity sold ranges from 2 to 119 units per transaction, with a standard deviation of 33.4 units, indicating high variability in order size.
Pooled revenue is right-skewed (mean ₦129.4 m vs median ₦88.0 m), with one Tricycle bulk order at ₦431 m in the upper tail. Within the Tricycle group the median (₦240.9 m) exceeds the mean (₦208.6 m), so the pooled skew comes mainly from mixing two price levels rather than from the Tricycle distribution itself.
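The 3.4× gap in mean revenue per transaction can be split into a price component and an order-size component. The following is a minimal sketch built from the summary table and the fixed unit prices reported above; the data frame and variable names are hypothetical.

```r
# Product-level totals from the summary table, plus the fixed unit prices
by_product <- data.frame(
  product       = c("Motorcycle", "Tricycle"),
  transactions  = c(54, 46),
  total_qty     = c(2978, 2649),
  total_revenue = c(3339082500, 9596002500),
  unit_price    = c(1121250, 3622500)
)

mean_revenue <- by_product$total_revenue / by_product$transactions
mean_qty     <- by_product$total_qty / by_product$transactions

# Ratio of Tricycle to Motorcycle on each dimension
revenue_ratio <- mean_revenue[2] / mean_revenue[1]
price_ratio   <- by_product$unit_price[2] / by_product$unit_price[1]
qty_ratio     <- mean_qty[2] / mean_qty[1]

# Tricycle share of total revenue over the whole period
tricycle_share <- by_product$total_revenue[2] / sum(by_product$total_revenue)

cat("Revenue ratio:", round(revenue_ratio, 2), "\n")
cat("Price ratio:  ", round(price_ratio, 2), "\n")
cat("Qty ratio:    ", round(qty_ratio, 2), "\n")
cat("Tricycle share of revenue:", round(100 * tricycle_share, 1), "%\n")
```

With fixed prices the revenue ratio is the product of the price ratio and the quantity ratio, so the output shows how much of the gap each component accounts for.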

Code

monthly <- df %>%
  group_by(Month, Product) %>%
  summarise(Revenue = sum(Revenue), Qty = sum(Qty), .groups = "drop")

monthly_total <- df %>%
  group_by(Month) %>%
  summarise(Revenue = sum(Revenue), Qty = sum(Qty), Profit = sum(Profit))

kable(monthly_total %>%
        mutate(
          Revenue = scales::comma(Revenue / 1e9, accuracy = 0.01, suffix = "B"),
          Profit  = scales::comma(Profit / 1e9,  accuracy = 0.01, suffix = "B")
        ),
      caption = "Monthly aggregate revenue and profit (₦ Billions)") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Monthly aggregate revenue and profit (₦ Billions)
Month	Revenue	Qty	Profit
2026-01-01	1.44B	663	0.44B
2026-02-01	2.99B	1656	0.91B
2026-03-01	3.97B	1648	1.21B
2026-04-01	4.53B	1660	1.38B

Guiding Question — EDA

What does the distribution of your key outcome variable tell you about the business process that generated it?

Revenue, the key outcome variable, is bimodally distributed rather than following a single bell curve. Two distinct clusters exist because the dealership sells exactly two products at perfectly fixed unit prices: every Motorcycle transaction generates revenue that is an exact multiple of ₦1,121,250, and every Tricycle transaction an exact multiple of ₦3,622,500. There is no price negotiation, no discount, and no surcharge in 100 transactions. This tells us something fundamental about the business process: within each product, order size (Qty) is the only source of revenue variation. A manager cannot grow revenue by charging more, because the pricing structure is rigid. Growth must come from either selling more units per transaction or shifting the transaction mix toward the higher-priced Tricycle. The upper tail of the distribution (one ₦431M Tricycle bulk order) further suggests that the business occasionally services large institutional or fleet buyers alongside its regular customers, a segment that deserves targeted relationship management.

6. Data Visualisation

Five plots are presented below in a cohesive narrative: from aggregate trend → product breakdown → buyer profile → distribution → relationship.

Plot 1 — Monthly Revenue Trend

Code

monthly_total %>%
  ggplot(aes(x = Month, y = Revenue / 1e9)) +
  geom_col(fill = "#1a6b3c", alpha = 0.85, width = 20) +
  geom_line(aes(group = 1), colour = "#f97316", linewidth = 1.2) +
  geom_point(colour = "#f97316", size = 3) +
  geom_text(aes(label = scales::comma(Revenue / 1e9, accuracy = 0.01, suffix = "B ₦")),
            vjust = -0.6, size = 3.5, fontface = "bold") +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  scale_y_continuous(labels = scales::label_number(suffix = "B ₦")) +
  labs(title    = "Bajaj Dealership Monthly Revenue: Jan – Apr 2026",
       subtitle = "Revenue grew 215% over four months, driven by rising Tricycle volumes",
       x = NULL, y = "Revenue (₦ Billions)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Monthly revenue has grown more than 3× from January to April 2026

Interpretation: Revenue increased from ₦1.44 billion in January to ₦4.53 billion in April, a 215% rise over 100 calendar days. January covers only 12 days (20–31 Jan), so part of the gap is mechanical; but February through April show a sustained upward trajectory (+33% Feb→Mar, +14% Mar→Apr), suggesting genuine demand growth rather than a calendar effect alone.
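To separate the calendar effect from underlying growth, the monthly totals can be divided by the number of days each month contributes to the study window. The sketch below works from the rounded monthly figures in the table above, so its results are approximate; the variable names are hypothetical.

```r
# Monthly revenue in ₦ billions, as printed (rounded to 0.01B) in the table above
revenue_bn <- c(Jan = 1.44, Feb = 2.99, Mar = 3.97, Apr = 4.53)

# Calendar days that each month contributes to the study window
study_days     <- seq(as.Date("2026-01-20"), as.Date("2026-04-29"), by = "day")
days_in_window <- as.vector(table(format(study_days, "%m")))
names(days_in_window) <- names(revenue_bn)

# Month-over-month growth and January-to-April growth, in percent
mom_growth_pct <- round(100 * (revenue_bn[-1] / revenue_bn[-4] - 1), 1)
jan_to_apr_pct <- round(100 * (revenue_bn[["Apr"]] / revenue_bn[["Jan"]] - 1), 1)

# Average revenue per calendar day removes the partial-month effect
revenue_per_day_bn <- round(revenue_bn / days_in_window, 3)

print(days_in_window)
print(mom_growth_pct)
cat("Jan to Apr growth (%):", jan_to_apr_pct, "\n")
print(revenue_per_day_bn)
```

On these rounded figures, average daily revenue rises steadily from February to April, while the January daily average is not lower than February's. That is consistent with the headline January-to-April increase being inflated by January's 12-day coverage.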

Plot 2 — Revenue by Product × Month

Code

monthly %>%
  ggplot(aes(x = Month, y = Revenue / 1e9, fill = Product)) +
  geom_col(position = "fill", alpha = 0.85) +
  scale_fill_manual(values = c("Bajaj Motorcycle" = "#1a6b3c",
                                "Bajaj Tricycle"   = "#f97316")) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  scale_y_continuous(labels = scales::percent) +
  labs(title    = "Product Revenue Mix by Month",
       subtitle = "Tricycles' share of revenue grew from 70% in January to 85% in April",
       x = NULL, y = "Share of Revenue", fill = "Product") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "bottom")

Tricycles increasingly dominate revenue as the quarter progresses

Interpretation: Tricycles accounted for 70% of revenue in January and 85% in April. Motorcycles have not declined in absolute terms, but Tricycle volumes grew faster. This shift has material implications for inventory planning and working capital, given Tricycles’ ~3× higher unit price.

Plot 3 — Buyer Profile: Sex and Education

Code

p1 <- df %>%
  count(Sex, Product) %>%
  ggplot(aes(x = Sex, y = n, fill = Product)) +
  geom_col(position = "dodge", alpha = 0.85) +
  scale_fill_manual(values = c("Bajaj Motorcycle" = "#1a6b3c",
                                "Bajaj Tricycle"   = "#f97316")) +
  labs(title = "Transactions by Sex", x = NULL, y = "Transactions", fill = NULL) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom", plot.title = element_text(face = "bold"))

p2 <- df %>%
  count(Education, Product) %>%
  ggplot(aes(x = Education, y = n, fill = Product)) +
  geom_col(position = "dodge", alpha = 0.85) +
  scale_fill_manual(values = c("Bajaj Motorcycle" = "#1a6b3c",
                                "Bajaj Tricycle"   = "#f97316")) +
  labs(title = "Transactions by Education", x = NULL, y = "Transactions", fill = NULL) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "bottom", plot.title = element_text(face = "bold"))

library(patchwork)

Code

p1 + p2

Buyer profile by sex and education level

Interpretation: The buyer base is nearly gender-balanced (52% Male, 48% Female). Tertiary-educated buyers account for the most transactions, likely reflecting income levels sufficient to finance a Tricycle or commercial Motorcycle purchase. However, no buyer segment is absent — suggesting the dealership draws across demographic groups.

Plot 4 — Distribution of Revenue by Product

Code

df %>%
  ggplot(aes(x = Revenue / 1e6, fill = Product)) +
  geom_histogram(bins = 20, alpha = 0.75, position = "identity", colour = "white") +
  scale_fill_manual(values = c("Bajaj Motorcycle" = "#1a6b3c",
                                "Bajaj Tricycle"   = "#f97316")) +
  scale_x_continuous(labels = scales::label_number(suffix = "M ₦")) +
  facet_wrap(~Product, scales = "free_x") +
  labs(title    = "Revenue Distribution per Transaction",
       subtitle = "Motorcycle revenues cluster below ₦150M; Tricycle revenues are wider and higher",
       x = "Revenue per Transaction (₦ Millions)", y = "Count", fill = NULL) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"), legend.position = "none")

Revenue distributions confirm Tricycles dominate high-value transactions

Interpretation: Motorcycle revenues are comparatively tightly clustered, from roughly ₦3M to ₦130M. Tricycle revenues spread far more widely, from about ₦7M (a 2-unit order) to ₦431M, reflecting the same variation in order size multiplied by a much higher unit price. The right tail of the Tricycle histogram contains the outlier bulk purchase.

Plot 5 — Quantity vs Revenue Scatter

Code

df %>%
  ggplot(aes(x = Qty, y = Revenue / 1e6, colour = Product)) +
  geom_point(alpha = 0.65, size = 2.5) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1) +
  scale_colour_manual(values = c("Bajaj Motorcycle" = "#1a6b3c",
                                  "Bajaj Tricycle"   = "#f97316")) +
  scale_y_continuous(labels = scales::label_number(suffix = "M ₦")) +
  labs(title    = "Quantity Sold vs Revenue per Transaction",
       subtitle = "Two distinct linear relationships, one per product, explain the apparent scatter",
       x = "Units Sold", y = "Revenue (₦ Millions)", colour = "Product") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"), legend.position = "bottom")

Linear relationship between quantity and revenue is strong but product-specific

Interpretation: The scatter plot reveals two exact straight lines, one for each product, with different slopes (the respective unit prices). The overall r = 0.71 between Qty and Revenue therefore mixes two product-specific perfect linear relationships. A pooled regression without a product interaction term would be fundamentally misspecified.
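The mechanism can be illustrated with synthetic data. The sketch below does not use the dealership ledger: order quantities are drawn at random (an assumption made purely for illustration), and only the two fixed unit prices and the 54/46 product split are taken from the report.

```r
set.seed(1)

# Synthetic orders: random quantities, NOT the dealership's records
sim <- data.frame(
  Product = rep(c("Motorcycle", "Tricycle"), times = c(54, 46)),
  Qty     = sample(2:119, size = 100, replace = TRUE)
)

# Fixed unit prices taken from the report
unit_price  <- c(Motorcycle = 1121250, Tricycle = 3622500)
sim$Revenue <- sim$Qty * unname(unit_price[sim$Product])

# Pooled correlation mixes two lines with different slopes
pooled_r <- cor(sim$Qty, sim$Revenue)

# Within each product, Revenue is an exact multiple of Qty
within_r <- sapply(split(sim, sim$Product),
                   function(g) cor(g$Qty, g$Revenue))

cat("Pooled r:", round(pooled_r, 3), "\n")
print(round(within_r, 3))
```

The within-product correlations equal 1 by construction, while the pooled correlation falls below 1 purely because two price levels are mixed. This is the same structure the scatter plot shows for the real data.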

Guiding Question — Visualisation

Which visualisation type best communicates the most important pattern in your data, and why did you choose it over alternatives?

The most important pattern in this dataset is the shift in revenue composition toward Tricycles, a dynamic with both a time dimension (month over month) and a composition dimension (product mix). The 100% stacked bar chart (Plot 2) communicates this best because each bar is normalised to 100%, so the changing product shares can be read directly month by month; total revenue growth is deliberately left to Plot 1. Alternatives were consciously rejected: a simple line chart would show growth but lose the product-mix story; a pie chart would capture proportions at a single point in time but lose the temporal trend entirely; a grouped bar chart would require the reader to mentally compute share ratios rather than reading them directly. The scatter plot (Plot 5) was equally important for a structural reason: it reveals that the moderate pooled correlation (r = 0.71) arises from pooling two perfectly deterministic product-level relationships with different slopes. A table would not have surfaced this insight as immediately as a coloured scatter plot.

7. Hypothesis Testing

Business motivation: The Sales Manager wants to know (a) whether the higher per-transaction revenue of Tricycles is statistically real or could be sampling noise, and (b) whether buyer education level affects units purchased — to determine whether educational segment matters for targeted promotions.

Hypothesis 1 — Do Tricycle transactions generate higher mean revenue than Motorcycle transactions?

\[H_0: \mu_{\text{Tricycle}} = \mu_{\text{Motorcycle}} \qquad H_1: \mu_{\text{Tricycle}} > \mu_{\text{Motorcycle}}\]

Code

moto_rev <- df %>% filter(Product == "Bajaj Motorcycle") %>% pull(Revenue)
tri_rev  <- df %>% filter(Product == "Bajaj Tricycle")   %>% pull(Revenue)

# Normality check (Shapiro-Wilk within each group)
sw_moto <- shapiro.test(moto_rev)
sw_tri  <- shapiro.test(tri_rev)

cat("Shapiro-Wilk — Motorcycle: W =", round(sw_moto$statistic, 4),
    "p =", round(sw_moto$p.value, 4), "\n")

Shapiro-Wilk — Motorcycle: W = 0.9622 p = 0.0868

Code

cat("Shapiro-Wilk — Tricycle:   W =", round(sw_tri$statistic,  4),
    "p =", round(sw_tri$p.value,  4), "\n\n")

Shapiro-Wilk — Tricycle:   W = 0.9222 p = 0.0045

Code

# Shapiro-Wilk rejects normality for Tricycle (p < 0.05) but not for Motorcycle
# (p = 0.087). Use Welch's t-test, which does not assume equal variances and is
# moderately robust to non-normality given n > 30 in each group.
t_result <- t.test(tri_rev, moto_rev, alternative = "greater", var.equal = FALSE)
print(t_result)


    Welch Two Sample t-test

data:  tri_rev and moto_rev
t = 7.5983, df = 51.279, p-value = 2.999e-10
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 114416281       Inf
sample estimates:
mean of x mean of y 
208608750  61834861

Code

# Effect size — Cohen's d
pooled_sd <- sqrt((var(tri_rev) * (length(tri_rev) - 1) +
                   var(moto_rev) * (length(moto_rev) - 1)) /
                   (length(tri_rev) + length(moto_rev) - 2))
cohens_d <- (mean(tri_rev) - mean(moto_rev)) / pooled_sd
cat("\nCohen's d:", round(cohens_d, 3), "\n")


Cohen's d: 1.633

Result: Welch’s two-sample one-tailed t-test yields t(51.3) = 7.60, p < 0.001. We decisively reject H₀.

Effect size: Cohen’s d ≈ 1.63, a very large effect by conventional standards (d > 0.8 = large). The mean revenue difference of approximately ₦146.8 million per transaction is not noise; it reflects the structural price difference between product lines.
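As a cross-check, the Welch statistic, its Welch-Satterthwaite degrees of freedom, and Cohen's d can be recomputed from the group sizes, means, and standard deviations in the product summary table. This is a sketch from rounded summary statistics, not a re-run on the raw data, and the variable names are hypothetical.

```r
# Group summaries from the "Summary statistics by product type" table
n_tri  <- 46; mean_tri  <- 208608750; sd_tri  <- 126672623
n_moto <- 54; mean_moto <- 61834861;  sd_moto <- 36229618

mean_diff <- mean_tri - mean_moto

# Welch t statistic: unpooled standard error
v_tri  <- sd_tri^2  / n_tri
v_moto <- sd_moto^2 / n_moto
t_stat <- mean_diff / sqrt(v_tri + v_moto)

# Welch-Satterthwaite degrees of freedom
df_welch <- (v_tri + v_moto)^2 /
  (v_tri^2 / (n_tri - 1) + v_moto^2 / (n_moto - 1))

# One-sided p-value for H1: Tricycle mean > Motorcycle mean
p_value <- pt(t_stat, df = df_welch, lower.tail = FALSE)

# Cohen's d with the pooled standard deviation
pooled_sd <- sqrt(((n_tri - 1) * sd_tri^2 + (n_moto - 1) * sd_moto^2) /
                    (n_tri + n_moto - 2))
cohens_d <- mean_diff / pooled_sd

cat("t =", round(t_stat, 2), " df =", round(df_welch, 1),
    " p =", signif(p_value, 3), " d =", round(cohens_d, 2), "\n")
```

The values should agree with the t.test() and Cohen's d output printed above, up to rounding in the summary table.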

Business implication: Every Tricycle transaction generates, on average, 3.4× more revenue than a Motorcycle transaction. For a manager allocating floor space, financing relationships, or sales commission structures, this is a statistically and economically significant result: Tricycles should receive disproportionate attention.

Hypothesis 2 — Does buyer education level affect quantity purchased?

\[H_0: \mu_{\text{Primary}} = \mu_{\text{Secondary}} = \mu_{\text{Tertiary}} \quad \text{(mean Qty equal across groups)}\] \[H_1: \text{At least one group mean differs}\]

Code

# Check normality within groups
df %>%
  group_by(Education) %>%
  summarise(
    n       = n(),
    mean_Qty = mean(Qty),
    sd_Qty   = sd(Qty),
    SW_p    = shapiro.test(Qty)$p.value
  ) %>%
  kable(digits = 3, caption = "Group summary and Shapiro-Wilk p-values for Qty") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Group summary and Shapiro-Wilk p-values for Qty
Education	n	mean_Qty	sd_Qty	SW_p
Primary	26	50.115	33.714	0.069
Secondary	35	55.514	36.375	0.010
Tertiary	39	61.051	30.439	0.564

Code

# Levene's test for homogeneity of variance
levene_result <- leveneTest(Qty ~ Education, data = df)
cat("\nLevene's test for equal variances:\n")


Levene's test for equal variances:

Code

print(levene_result)

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2   1.581  0.211
      97

Code

# One-way ANOVA
anova_result <- aov(Qty ~ Education, data = df)
summary(anova_result)

            Df Sum Sq Mean Sq F value Pr(>F)
Education    2   1896   948.2   0.847  0.432
Residuals   97 108611  1119.7

Code

# Effect size: eta-squared
ss_between <- summary(anova_result)[[1]]["Education", "Sum Sq"]
ss_total   <- sum(summary(anova_result)[[1]][, "Sum Sq"])
eta_sq     <- ss_between / ss_total
cat("\nEta-squared:", round(eta_sq, 4), "\n")


Eta-squared: 0.0172

Result: One-way ANOVA yields F(2, 97) = 0.85, p = 0.43. We fail to reject H₀. Levene’s test gives no evidence of unequal variances (p = 0.21). Shapiro-Wilk flags some non-normality in the Secondary group (p = 0.010), but with 26 to 39 observations per group the F-test is reasonably robust to this.

Effect size: η² ≈ 0.017, indicating that education level explains only about 1.7% of the variance in units purchased, a negligible effect.
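The ANOVA table can be rebuilt from the group sizes, means, and standard deviations in the group summary above, which makes the link between those descriptives and the F statistic explicit. This is a sketch from rounded summaries, so small rounding differences from the aov() output are expected; the object names are hypothetical.

```r
# Group summaries from the "Group summary and Shapiro-Wilk p-values" table
groups <- data.frame(
  education = c("Primary", "Secondary", "Tertiary"),
  n         = c(26, 35, 39),
  mean_qty  = c(50.115, 55.514, 61.051),
  sd_qty    = c(33.714, 36.375, 30.439)
)

# Between-group and within-group sums of squares
grand_mean <- sum(groups$n * groups$mean_qty) / sum(groups$n)
ss_between <- sum(groups$n * (groups$mean_qty - grand_mean)^2)
ss_within  <- sum((groups$n - 1) * groups$sd_qty^2)

df_between <- nrow(groups) - 1
df_within  <- sum(groups$n) - nrow(groups)

# F statistic, p-value, and eta-squared
f_stat  <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)
eta_sq  <- ss_between / (ss_between + ss_within)

cat("F(", df_between, ",", df_within, ") =", round(f_stat, 3),
    " p =", round(p_value, 3), " eta-squared =", round(eta_sq, 4), "\n")
```

The between-group sum of squares is small relative to the within-group sum of squares, which is why the F ratio is below 1 and the effect size is negligible.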

Business implication: Buyer education is not a meaningful segmentation variable for quantity purchased. Targeted promotions differentiated by education level are unlikely to yield different results. The dealership’s marketing resources are better allocated by product preference or geographic proximity than by educational attainment.

Guiding Question — Hypothesis Testing

What would a statistically significant result in your hypothesis test mean for a decision your organisation faces right now?

Hypothesis 1 (significant result): The decisive rejection of H₀ (p < 0.001, Cohen’s d = 1.63) means the dealership can act with confidence on a decision it faces right now: how to allocate scarce showroom floor space, financing relationships, and sales staff incentives between Motorcycles and Tricycles. Without this test, a manager might attribute the higher Tricycle revenue to coincidence or a run of lucky large orders. The test makes that explanation very unlikely. The mean difference of ₦146.8 million per transaction is large enough to justify restructuring commission schemes, securing deeper Tricycle inventory, and approaching finance partners specifically for Tricycle loan products.

Hypothesis 2 (non-significant result): The failure to reject H₀ for education level (p = 0.43, η² = 0.017) also carries an immediate decision implication: do not invest in education-segmented marketing campaigns. A dealership might be tempted to run different promotions for “tertiary-educated professionals” versus “primary-educated traders” on the assumption they buy differently. This data, across 100 transactions, provides no evidence that they differ in units purchased. The budget for such a campaign is better spent elsewhere, for example on financing partnerships that lower the barrier to Tricycle purchase across all education levels.

8. Correlation Analysis

Code

corr_df <- df %>%
  select(Qty, Revenue, Profit, CostPrice, ProfitMargin) %>%
  rename(`Cost Price` = CostPrice, `Profit Margin` = ProfitMargin)

corr_matrix <- cor(corr_df, use = "complete.obs")
corr_p      <- cor_pmat(corr_df)

ggcorrplot(corr_matrix,
           method    = "square",
           type      = "lower",
           lab       = TRUE,
           lab_size  = 4,
           p.mat     = corr_p,
           sig.level = 0.05,
           colors    = c("#b2182b", "white", "#1a6b3c"),
           title     = "Pearson Correlation Matrix — Bajaj Sales Variables",
           ggtheme   = theme_minimal(base_size = 12))

Correlation heatmap of numeric variables

Code

cor_results <- tibble(
  Pair                         = c("Qty ↔ Revenue", "Qty ↔ Profit", "Qty ↔ Cost Price",
                                   "Revenue ↔ Profit", "Revenue ↔ Cost Price",
                                   "Profit ↔ Cost Price"),
  r                            = c(cor(df$Qty, df$Revenue),
                                    cor(df$Qty, df$Profit),
                                    cor(df$Qty, df$CostPrice),
                                    cor(df$Revenue, df$Profit),
                                    cor(df$Revenue, df$CostPrice),
                                    cor(df$Profit, df$CostPrice)),
  Interpretation               = c("Strong positive — more units, more revenue",
                                    "Strong positive — more units, more profit",
                                    "Strong positive — more units, higher cost",
                                    "Perfect — profit is a fixed % of revenue",
                                    "Perfect — cost is a fixed % of revenue",
                                    "Perfect — cost and profit both scale with revenue")
)

cor_results %>%
  mutate(r = round(r, 4)) %>%
  kable(caption = "Pairwise Pearson correlations and business interpretation") %>%
  kable_styling(bootstrap_options = c("striped","hover"))

Pairwise Pearson correlations and business interpretation
Pair	r	Interpretation
Qty ↔︎ Revenue	0.7058	Strong positive — more units, more revenue
Qty ↔︎ Profit	0.7058	Strong positive — more units, more profit
Qty ↔︎ Cost Price	0.7058	Strong positive — more units, higher cost
Revenue ↔︎ Profit	1.0000	Perfect — profit is a fixed % of revenue
Revenue ↔︎ Cost Price	1.0000	Perfect — cost is a fixed % of revenue
Profit ↔︎ Cost Price	1.0000	Perfect — cost and profit both scale with revenue

Key findings and business implications:

1. Qty ↔︎ Revenue (r = 0.71): A strong positive relationship — quantity sold is the primary operational lever for revenue growth. However, the correlation is not 1.0 because the two product types have different unit prices; the variance unexplained by Qty alone is entirely attributable to product mix. For a manager, this means both how many units are sold and which units are sold determine revenue outcomes.

2. Revenue ↔︎ Profit = 1.0 (Pearson): The perfect correlation between revenue and profit reflects the fixed 30.43% margin. No pricing or cost negotiation happens at the transaction level; the margin is baked in. From an economic standpoint, this is a rigid pricing structure: the dealership cannot use differential pricing to respond to demand shocks or competitive pressure. A recommendation for management would be to explore whether volume discounts for large bulk orders could increase total units sold without eroding total profit.

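3. Profit Margin ↔︎ All other variables (r ≈ 0): Because the margin is constant at 30.43%, Profit Margin carries no information about any other variable, and its correlations are reported as approximately zero. Strictly, Pearson's r is undefined for a perfectly constant column, so any values displayed most likely reflect rounding noise in the computed margin. No transaction-level factors drive margin variation. This simplifies forecasting: profit forecasts need only multiply any revenue forecast by 0.3043.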

Correlation vs causation note: The Qty–Revenue relationship is causal by construction (Revenue = Qty × UnitPrice). In contrast, the buyer-characteristic variables (Sex, Education) do not appear in the correlation matrix because they are categorical; their relationship with quantity is examined in the hypothesis testing section above.
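To illustrate why the pooled Qty–Revenue correlation falls below 1 even though revenue is an exact multiple of quantity, here is a minimal sketch in base R. The quantities (1 to 10 units per product) are hypothetical; only the two unit prices are taken from this study.

```r
# Hypothetical quantities; unit prices are the fixed prices reported in this study.
qty       <- 1:10
moto_rev  <- qty * 1121250   # Motorcycle revenue = Qty x unit price
trike_rev <- qty * 3622500   # Tricycle revenue   = Qty x unit price

# Within each product, revenue is an exact linear function of quantity.
within_moto  <- cor(qty, moto_rev)
within_trike <- cor(qty, trike_rev)

# Pooling the two products mixes two different slopes, which lowers r.
pooled_r <- cor(c(qty, qty), c(moto_rev, trike_rev))

c(within_moto = within_moto, within_trike = within_trike, pooled = pooled_r)
```

The within-product correlations come out at 1, while the pooled value is clearly lower: the gap is product mix, not noise.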

Guiding Question — Correlation

Which correlation in your data is most plausibly causal, and how would you design a test to confirm or refute that causality?

The Qty → Revenue relationship (r = 0.71 pooled; r = 1.00 within each product) is the most plausibly — and in fact definitionally — causal relationship in the dataset: Revenue is mathematically constructed as Qty × UnitPrice, so the direction of causality is unambiguous by accounting identity. Selling more units causes higher revenue; higher revenue does not cause more units to be sold.

The more interesting and economically meaningful causal question that this dataset cannot answer is: what causes Qty to vary across transactions? The candidates are buyer income/credit access, dealership promotional activity, bulk vs retail order type, and macroeconomic conditions (fuel prices affecting commercial transport demand). To test whether, say, access to financing causally increases Qty per transaction, one would design a randomised controlled trial: randomly offer an installment financing option to 50% of prospective customers (treatment group) and offer only standard cash/upfront terms to the other 50% (control group), then compare mean Qty between groups after a sufficient number of transactions. A statistically significant difference in mean Qty would provide credible causal evidence that financing access drives order size, directly informing the Tricycle financing recommendation in Section 10.
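What counts as "a sufficient number of transactions" can be estimated before the trial starts. The sketch below uses base R's `power.t.test`; the assumed difference in mean Qty and the assumed standard deviation are purely hypothetical planning inputs, not estimates from this dataset.

```r
# HYPOTHETICAL planning inputs (not estimated from the dealership data):
# a 5-unit difference in mean Qty between the financing and cash groups,
# with a within-group standard deviation of 10 units.
assumed_delta <- 5
assumed_sd    <- 10

plan <- power.t.test(delta = assumed_delta, sd = assumed_sd,
                     sig.level = 0.05, power = 0.80,
                     type = "two.sample", alternative = "two.sided")

# Customers needed in EACH arm of the trial, rounded up.
n_per_group <- ceiling(plan$n)
n_per_group
```

A smaller assumed effect or a noisier Qty distribution would raise the required sample size, so these inputs should be revisited once pilot data exist.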

9. Regression Analysis

Business question: Given the strong Qty–Revenue relationship, can we build a simple regression model to predict transaction revenue from units sold, and how does the product type modify this relationship?

Model 1 — Simple Linear Regression (pooled)

\[\text{Revenue}_i = \beta_0 + \beta_1 \cdot \text{Qty}_i + \varepsilon_i\]

Code

model1 <- lm(Revenue ~ Qty, data = df)
tidy(model1) %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kable(caption = "Model 1: Revenue ~ Qty (pooled)") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Model 1: Revenue ~ Qty (pooled)
term	estimate	std.error	statistic	p.value
(Intercept)	-8266464	16204659	-0.5101	0.6111
Qty	2445660	247945	9.8637	0.0000

Code

glance(model1) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, nobs) %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kable(caption = "Model 1 fit statistics") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Model 1 fit statistics
r.squared	adj.r.squared	sigma	statistic	p.value	df	nobs
0.4982	0.4931	82423599	97.293	0	1	100

Code

par(mfrow = c(2, 2))
plot(model1)

Model 1 diagnostic plots — pooled regression

Code

par(mfrow = c(1, 1))

Model 1 results: R² = 0.498; adjusted R² = 0.493. Each additional unit sold increases revenue by approximately ₦2.45 million (p < 0.001). The intercept (≈ −₦8.3M, p = 0.61) is not statistically distinguishable from zero and is not economically meaningful, since a transaction of zero units does not occur.
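As a cross-check between Sections 8 and 9: in a simple regression with a single predictor, R² equals the squared Pearson correlation between predictor and outcome. Using the pooled Qty–Revenue correlation from the correlation table:

\[R^2 = r_{\text{Qty},\,\text{Revenue}}^2 = 0.7058^2 \approx 0.498\]

This agrees with the r.squared value in the Model 1 fit statistics table.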

Diagnostic concern: The residual plots reveal two clusters rather than a random scatter — this is the signature of the two product types being pooled into one model. The Scale-Location and Residuals vs Fitted plots show a clear “two-band” pattern: Motorcycle residuals cluster near zero, Tricycle residuals cluster higher. This heteroscedasticity is structural (product mix), not random — and it motivates Model 2.
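The two-band pattern can be reproduced on a small illustrative dataset. This is a sketch under stated assumptions: the quantities are hypothetical (1 to 10 units per product), and only the two unit prices come from this study.

```r
# Illustrative data: hypothetical quantities, fixed unit prices from this study.
qty   <- rep(1:10, times = 2)
prod  <- rep(c("Bajaj Motorcycle", "Bajaj Tricycle"), each = 10)
price <- ifelse(prod == "Bajaj Motorcycle", 1121250, 3622500)
toy   <- data.frame(Qty = qty, Product = prod, Revenue = qty * price)

# Pooled model that ignores product type, analogous to Model 1.
pooled <- lm(Revenue ~ Qty, data = toy)

# Mean residual by product: one band sits below the fitted line, one above.
band_means <- tapply(resid(pooled), toy$Product, mean)
band_means
```

Grouping residuals by a candidate omitted variable in this way is a quick numeric complement to the visual diagnostic plots.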

Model 2 — Multiple Regression with Product Interaction

\[\text{Revenue}_i = \beta_0 + \beta_1 \cdot \text{Qty}_i + \beta_2 \cdot \mathbb{1}[\text{Tricycle}]_i + \beta_3 \cdot (\text{Qty}_i \times \mathbb{1}[\text{Tricycle}]_i) + \varepsilon_i\]

Code

model2 <- lm(Revenue ~ Qty * Product, data = df)

tidy(model2) %>%
  mutate(across(where(is.numeric), ~round(., 2))) %>%
  kable(caption = "Model 2: Revenue ~ Qty × Product (interaction)") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

Model 2: Revenue ~ Qty × Product (interaction)
term	estimate	statistic
(Intercept)	0	2.970000e+00
Qty	1121250	8.901355e+15
ProductBajaj Tricycle	0	-4.100000e+00
Qty:ProductBajaj Tricycle	2501250	1.402121e+16

Code

glance(model2) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, nobs) %>%
  mutate(across(where(is.numeric), ~round(., 6))) %>%
  kable(caption = "Model 2 fit statistics") %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Warning in summary.lm(x): essentially perfect fit: summary may be unreliable

Model 2 fit statistics
r.squared	adj.r.squared	sigma	statistic	p.value	df	nobs
1	1	0	5.037065e+32	0	3	100

Code

# Compare models
anova(model1, model2)

Analysis of Variance Table

Model 1: Revenue ~ Qty
Model 2: Revenue ~ Qty * Product
  Res.Df        RSS Df  Sum of Sq          F    Pr(>F)    
1     98 6.6578e+17                                       
2     96 0.0000e+00  2 6.6578e+17 3.7915e+32 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

cat("=== Implied unit prices from regression ===\n")

=== Implied unit prices from regression ===

Code

coefs <- coef(model2)
motorcycle_slope <- coefs["Qty"]
tricycle_slope   <- coefs["Qty"] + coefs["Qty:ProductBajaj Tricycle"]
cat("Motorcycle unit price implied by model:", scales::comma(round(motorcycle_slope, 0)), "₦\n")

Motorcycle unit price implied by model: 1,121,250 ₦

Code

cat("Tricycle unit price implied by model:  ", scales::comma(round(tricycle_slope, 0)), "₦\n")

Tricycle unit price implied by model:   3,622,500 ₦

Code

cat("\nActual unit prices from data:\n")


Actual unit prices from data:

Code

cat("Motorcycle:", scales::comma(unique(df$UnitPrice[df$Product=="Bajaj Motorcycle"])), "₦\n")

Motorcycle: 1,121,250 1,121,250 ₦

Code

cat("Tricycle:  ", scales::comma(unique(df$UnitPrice[df$Product=="Bajaj Tricycle"])), "₦\n")

Tricycle:   3,622,500 3,622,500 ₦

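Model 2 results: R² = 1.000 (essentially perfect). The model recovers the fixed unit prices exactly: ₦1,121,250 per Motorcycle and ₦3,622,500 per Tricycle. The ANOVA comparison confirms that adding product type and its interaction with quantity improves fit (p < 0.001). The reported F statistic (≈ 3.8 × 10³²) is astronomically large only because Model 2's residual sum of squares is effectively zero, so it is best read as the signature of an exact accounting identity rather than as a meaningful test statistic. This is also why R warns that the summary of an essentially perfect fit may be unreliable.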
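Substituting the estimates from the Model 2 coefficient table (with both the intercept and the Tricycle main effect equal to zero) shows how the two implied unit prices arise from the interaction specification:

\[\widehat{\text{Revenue}}_i = 1{,}121{,}250 \cdot \text{Qty}_i \quad \text{(Motorcycle)}\]

\[\widehat{\text{Revenue}}_i = (1{,}121{,}250 + 2{,}501{,}250) \cdot \text{Qty}_i = 3{,}622{,}500 \cdot \text{Qty}_i \quad \text{(Tricycle)}\]

The Tricycle slope is therefore the sum of the Qty coefficient and the interaction coefficient, not the interaction coefficient alone.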

Interpretation for a non-technical manager:

“The regression model confirms that your revenue is 100% predictable from how many of each product type you sell. Sell one more Motorcycle and revenue rises by ₦1.12 million. Sell one more Tricycle and revenue rises by ₦3.62 million. There are no surprises in your pricing — which is operationally efficient, but also means revenue growth can only come from selling more units or shifting the mix toward Tricycles. Based on these numbers, converting two Motorcycle sales per month into Tricycle sales would generate approximately ₦5 million more revenue per month, for the same number of transactions.”

Guiding Question — Regression

How would you translate your regression coefficient into a recommendation for a non-technical manager?

The interaction model yields two slope coefficients that map directly onto the dealership’s operational reality: ₦1,121,250 per Motorcycle unit and ₦3,622,500 per Tricycle unit. Rather than presenting these as statistical outputs, I translate them into three concrete, actionable recommendations:

1. Revenue targeting: If the sales manager wants to hit a monthly revenue target of ₦5 billion, the model tells them how many units that requires. Assuming, for illustration, that the unit mix matches the current transaction mix (approximately 54% Motorcycle), the target requires roughly 2,200 units per month: about 1,190 Motorcycles and 1,010 Tricycles. Shifting the mix 10 percentage points toward Tricycles reduces the total to roughly 1,980 units for the same target, meaning less logistics cost and fewer transactions to manage.

2. Sales staff prioritisation: The coefficient difference (₦3.62M vs ₦1.12M per unit) means a salesperson who closes one Tricycle transaction of 10 units generates slightly more revenue than a salesperson who closes three Motorcycle transactions of 10 units each. Commission structures should reflect this 3.2× per-unit revenue differential.

3. Inventory investment: Every additional Tricycle stocked has 3.2× the revenue potential of an additional Motorcycle. Working capital allocated to Tricycle inventory generates a higher expected revenue return per naira invested, all else equal. The regression model provides the precise multiplier a finance manager needs to make that case to a bank or investor.
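The arithmetic behind these three recommendations can be sketched in a few lines of R. The helper `units_needed` is hypothetical (it is not part of the original analysis), and it assumes the mix share applies to units sold rather than to transactions; the unit prices are the fixed prices recovered by Model 2.

```r
# Fixed unit prices recovered by the interaction model (naira).
price_motorcycle <- 1121250
price_tricycle   <- 3622500

# Hypothetical helper: units required to reach `target` revenue when a
# fraction `moto_share` of UNITS sold (an assumption) are Motorcycles.
units_needed <- function(target, moto_share) {
  stopifnot(moto_share >= 0, moto_share <= 1)
  avg_price <- moto_share * price_motorcycle + (1 - moto_share) * price_tricycle
  total     <- target / avg_price
  c(motorcycle = total * moto_share,
    tricycle   = total * (1 - moto_share),
    total      = total)
}

# Per-unit revenue multiplier of a Tricycle relative to a Motorcycle.
revenue_ratio <- price_tricycle / price_motorcycle

# Unit requirements for a 5 billion naira target at two illustrative mixes.
base    <- units_needed(5e9, moto_share = 0.54)
shifted <- units_needed(5e9, moto_share = 0.44)
round(rbind(base, shifted))
round(revenue_ratio, 2)
```

Changing `target` or `moto_share` lets a manager test other scenarios without refitting the regression, since the prices are fixed.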

10. Integrated Findings

The five analyses converge on a single, consistent picture of this Bajaj dealership’s economics:

1. The business is growing fast. Revenue tripled over four months. This is not merely an artefact of January being a partial month (the data begin on January 20): the February-to-April trend is sustained.

2. Growth is driven by Tricycles. Tricycles account for a growing share of revenue (70% → 85% from Jan to Apr), generate statistically significantly more revenue per transaction (p < 0.001, Cohen’s d = 1.60), and are the dominant driver of the regression model’s predictive power.

3. The margin structure is rigid and healthy. A fixed 30.43% gross margin on all transactions simplifies forecasting: profit = 0.3043 × revenue, with no exceptions in 100 transactions. This is efficient but removes pricing flexibility.

4. Buyer demographics do not drive quantity outcomes. Neither sex nor education level significantly predicts units purchased. The dealership’s customers are demographically diverse, and marketing differentiation by demographic segment is not supported by the data.

5. Revenue is volume-driven, not price-driven. Because unit prices never vary, the only way to grow revenue is to sell more units. The regression model makes this mathematically explicit: Revenue = 1,121,250 × MotorcycleUnits + 3,622,500 × TricycleUnits.

Single recommendation: The dealership should prioritise Tricycle inventory availability and consumer financing partnerships. The data shows that (a) Tricycles generate 3.4× more revenue per transaction, (b) their share of total revenue is rising, and (c) the primary constraint on a buyer’s ability to purchase a Tricycle is likely capital access rather than willingness to buy. A partnership with a microfinance institution or a Bajaj-branded installment scheme could accelerate both Tricycle volume and total dealership revenue.

11. Limitations & Further Work

Data limitations:

- 100 transactions from a single dealership over roughly three months cannot support generalisations about the Nigerian Bajaj market.
- The absence of price variation means standard demand-estimation techniques (price elasticity) cannot be applied.
- No geographic, income, or wealth variables are available to model buyer ability-to-pay.
- The dataset covers only January 20 to April 29, 2026; seasonal effects (rainy season, festive periods) cannot be assessed.

What I would do differently with more data:

- Extend the time series to at least 12 months to model seasonality and test whether the upward revenue trend is structural or cyclical.
- Add competitor pricing data to model price elasticity properly: even if this dealership does not vary prices, market-level variation would allow demand curve estimation.
- Collect buyer income data (even proxies such as payment method or loan take-up) to model the credit constraint hypothesis motivating the Tricycle financing recommendation.
- Apply time-series forecasting (ARIMA or Prophet) to monthly revenue to generate defensible forward projections for the dealership’s annual plan.
- Add a multi-location dimension: comparing performance across dealerships would allow fixed-effects regression to control for location-specific demand drivers.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). SAGE Publications. [car package]

Zeileis, A., & Hothorn, T. (2002). Diagnostic checking in regression relationships. R News, 2(3), 7–10. https://CRAN.R-project.org/doc/Rnews/ [lmtest package]

Kassambara, A. (2023). ggcorrplot: Visualization of a correlation matrix using ggplot2 (Version 0.1.4). https://CRAN.R-project.org/package=ggcorrplot

Pedersen, T. L. (2024). patchwork: The composer of plots (Version 1.2.0). https://CRAN.R-project.org/package=patchwork

[Dataset] Sales Manager, Bajaj Dealership, Lagos, Nigeria. (2026). Bajaj historical sales — January to April 2026 [Dataset]. Provided to Ikenna Ochei I. directly for academic use. Data available on request from the author.

Appendix: AI Usage Statement

Claude (Anthropic) was used to assist with initial data exploration — specifically, to verify the structure of the Excel file, identify that unit prices were perfectly fixed across all transactions, and flag the structural collinearity between Revenue, Profit, and Cost Price before any modelling began. The AI also assisted in drafting the initial skeleton of this document’s section structure.

All analytical decisions — which hypothesis tests to apply and why, how to handle the two-product regression problem, the interpretation of Cohen’s d and η², and the integrated business recommendation — were made independently by the author. Every line of R code was reviewed, modified, and validated by the author against the actual output. The business interpretation in Section 10 and the limitations in Section 11 represent the author’s own economic judgement.