---
title: "Revenue Concentration & Portfolio Imbalance Analysis"
subtitle: "MMBA-8 · Data Analytics II · Case Study 1"
author: "Chinwendu Ezike"
date: "2026-05-13"
format:
  html:
    theme: cosmo
    toc: true
    toc-depth: 3
    toc-title: "Table of Contents"
    toc-location: left
    number-sections: true
    code-fold: true
    code-summary: "Show Code"
    code-tools: true
    self-contained: true
    smooth-scroll: true
    highlight-style: github
    df-print: paged
    fig-width: 9
    fig-height: 5.5
execute:
  echo: true
  warning: false
  message: false
---
```{r setup, include=FALSE}
library(tidyverse)
library(scales)
library(knitr)
library(kableExtra)
library(ggrepel)
library(lubridate)
library(ggthemes)
library(broom)
library(car)
library(gridExtra)
theme_gbat <- function() {
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 13, colour = "#1a1a2e"),
plot.subtitle = element_text(size = 10, colour = "#444444"),
plot.caption = element_text(size = 8, colour = "#888888", hjust = 0),
axis.title = element_text(size = 10),
legend.position = "bottom",
panel.grid.minor = element_blank(),
strip.text = element_text(face = "bold")
)
}
gbat_cols <- c(
"ARC" = "#003049",
"MAC J" = "#d62828",
"SAMD" = "#f77f00",
"JCL" = "#2d6a4f",
"GRC" = "#6a0572",
"HCT" = "#0077b6",
"PGL" = "#b5838d"
)
```
------------------------------------------------------------------------
## Executive Summary {.unnumbered}
GBAT Nigeria is a principal manufacturer and distributor of building materials and sanitary ware operating across Lagos and Abuja. This report analyses the complete population of key-customer transactions for Q1 2026 — 265 line transactions across 7 sub-distributor accounts, totalling ₦479.4 million in revenue — sourced directly from the organisation's internal sales voucher system.
The central business problem is **dangerous revenue concentration**: three customers (ARC, MAC J, and SAMD) account for 92.5% of all Q1 revenue, while the remaining four accounts contribute just 7.5%. A Herfindahl-Hirschman Index (HHI) of over 3,000 and a Gini coefficient of about 0.59 confirm that the imbalance is severe. ARC alone, at 47.4% of revenue, represents a single point of failure for the entire portfolio.
Five analytical techniques — Exploratory Data Analysis, Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — are applied to establish the facts, test their significance, and model their implications.
The primary recommendation is a **three-pillar strategy**: protect the ARC relationship through contract formalisation; develop Tier 2 and Tier 3 accounts through structured trade support; and rebalance incentive structures to reward growth in underperforming accounts. Implementation should begin in Q2 2026.
------------------------------------------------------------------------
## Professional Disclosure {.unnumbered}
| Field | Detail |
|------------------|---------------------------------------------------|
| **Analyst** | Chinwendu Ezike |
| **Job Title** | Senior Sales Consultant |
| **Organisation** | GBAT Nigeria — Principal Manufacturer |
| **Sector** | Building Materials · Sanitary Ware · Construction |
| **Role in Data** | Key Account & Distributor Account Management |
| **Programme** | MMBA-8 · Data Analytics II |
| **Report Date** | 13 May 2026 |
### Operational Relevance of Each Analytical Technique
**Technique 1 — Exploratory Data Analysis (EDA).** As a Senior Sales Consultant responsible for key account and distributor account management at GBAT Nigeria, I routinely interact with the raw output of our internal sales voucher system. EDA is the formalisation of what I do informally every working week: scanning transaction records for missing entries, unusually large or zero-value orders, and customers whose purchase frequency has dropped without explanation. Applying structured EDA to the Q1 2026 dataset — including missing-value checks, summary statistics, and outlier detection — transforms an instinctive review process into a reproducible, evidence-based audit. This is directly relevant to my role because it allows me to present defensible data to management rather than anecdotal observations.
**Technique 2 — Data Visualisation.** Communicating sales performance to both technical and non-technical stakeholders is a core part of my responsibilities. I regularly prepare performance summaries for my line manager, trade marketing colleagues, and distributor partners. Data visualisation, grounded in the grammar of graphics and chart selection principles, gives me the tools to move beyond raw tables and tell a coherent story with data. In this report, the Pareto chart, Lorenz curve, and bubble chart each serve a specific communicative purpose: making the revenue concentration problem immediately visible to a manager who may not have time to read a table of figures.
**Technique 3 — Hypothesis Testing.** A persistent challenge in distributor account management is distinguishing between genuine performance differences and random variation. When one distributor appears to be underperforming relative to another, the question is whether that gap is statistically meaningful or simply noise. Hypothesis testing — specifically a one-sample t-test and a Kruskal-Wallis non-parametric test given the small sample — provides a formal answer to that question. In practice, this strengthens my position when making the case to management that certain accounts require intervention, because the argument is grounded in statistical significance rather than personal judgement.
**Technique 4 — Correlation Analysis.** One of the most important strategic questions in my role is whether investing more resources in a distributor (through trade support visits, showroom equipment, product display, pricing guidance, or training) actually translates into higher revenue. Correlation analysis between transaction frequency and revenue value, using the Pearson, Spearman, and Kendall coefficients, is a first empirical step towards that question: it shows whether more frequent ordering goes together with higher revenue, although it cannot by itself show that support activity causes either. Understanding which input variables are associated with revenue outcomes is foundational to making resource allocation decisions that are evidence-based rather than relationship-driven.
**Technique 5 — Linear Regression.** Target-setting is a central activity in key account management: every quarter, I work with distributors to agree on revenue targets that are ambitious but realistic. Linear regression — modelling revenue as a function of transaction frequency — provides a principled basis for those targets. Rather than negotiating targets based on the prior year plus an arbitrary percentage uplift, a regression model reveals what revenue a distributor of a given transaction frequency should be generating, and allows me to identify accounts that are significantly under-performing relative to their own engagement level. This is directly actionable in quarterly business reviews.
------------------------------------------------------------------------
## Data Collection & Sampling
### Source and Collection Method
The dataset used in this analysis is extracted from **GBAT Nigeria's internal sales voucher system** — a transactional record system used by the Lagos office to log all sales to registered key customers. Each record corresponds to a single line item on a sales voucher and contains: transaction date, bill number, item code, quantity, unit, unit price (₦), and amount (₦). Data was extracted directly by the analyst in her capacity as Senior Sales Consultant, using read-only access to the voucher system. No third-party data collection instruments were used.
### Sampling Frame and Sample Size
| Parameter | Detail |
|-------------------|-------------------------------------------|
| **Population** | All key-customer transactions, Q1 2026 |
| **Sample type** | Complete population census — not a sample |
| **Records** | 265 line transactions |
| **Customers** | 7 registered key sub-distributor accounts |
| **Time period** | 1 January 2026 – 31 March 2026 (90 days) |
| **Total revenue** | ₦479,448,350 |
| **Geography** | Lagos and Abuja offices |
Because this dataset constitutes the **complete population** of key-customer transactions for the period — not a randomly drawn sample — inferential statistics are applied here primarily as analytical and diagnostic tools rather than as instruments of generalisation to a wider population. Results describe Q1 2026 exactly; their applicability to future quarters is contingent on structural continuity in the customer base.
### Ethical Notes and Consent Statement
The data contains no personally identifiable information (PII). Customer identifiers are organisational account codes rather than individual names. The dataset is used exclusively for academic and internal management purposes within the scope of the MMBA-8 Data Analytics II coursework. No data sharing with third parties has occurred. The analyst has organisational authority to access this data in the normal course of her employment duties.
------------------------------------------------------------------------
## Data Description
```{r data-entry}
customer_summary <- tibble(
Customer = c("ARC", "MAC J", "SAMD", "JCL", "GRC", "HCT", "PGL"),
Transactions = c(58, 59, 114, 4, 12, 15, 3),
Revenue = c(227192500, 113333700, 103093850, 17745000, 3951800, 11670000, 2461500)
)
grand_total_revenue <- 479448350
grand_total_txn <- 265
customer_summary <- customer_summary |>
mutate(
Revenue_Share = Revenue / grand_total_revenue,
Txn_Share = Transactions / grand_total_txn,
Avg_Txn_Value = Revenue / Transactions,
Log_Revenue = log(Revenue),
Tier = factor(
case_when(
Revenue_Share >= 0.20 ~ "Tier 1 — Core",
Revenue_Share >= 0.03 ~ "Tier 2 — Mid",
TRUE ~ "Tier 3 — Underperforming"
),
levels = c("Tier 1 — Core", "Tier 2 — Mid", "Tier 3 — Underperforming")
)
) |>
arrange(desc(Revenue)) |>
mutate(
Cumulative_Revenue = cumsum(Revenue),
Cumulative_Revenue_Share = cumsum(Revenue_Share)
)
```
```{r data-validation}
stopifnot(
"Revenue mismatch" = abs(sum(customer_summary$Revenue) - grand_total_revenue) < 1,
"Transaction mismatch" = sum(customer_summary$Transactions) == grand_total_txn
)
cat("✔ Integrity check passed: revenue and transaction counts match source.\n")
cat(sprintf(" Total revenue : ₦%s\n", scales::comma(grand_total_revenue)))
cat(sprintf(" Total transactions: %d\n", grand_total_txn))
cat(sprintf(" Customers : %d\n", nrow(customer_summary)))
```
### Variable Dictionary
```{r tbl-variables}
tibble(
Variable = c("Customer","Transactions","Revenue","Revenue_Share",
"Txn_Share","Avg_Txn_Value","Log_Revenue","Tier"),
Type = c("Character","Integer","Numeric","Numeric",
"Numeric","Numeric","Numeric","Factor"),
Description = c(
"Key customer account code",
"Number of line-item transactions in Q1 2026",
"Total revenue generated (₦) in Q1 2026",
"Customer revenue as proportion of grand total",
"Customer transaction count as proportion of total",
"Mean revenue per transaction (₦)",
"Natural log of revenue — used to normalise skewed distribution",
"Analyst-assigned performance tier (Core / Mid / Underperforming)"
)
) |>
kable(caption = "Table 1: Variable Dictionary") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = TRUE, font_size = 13)
```
### Summary Statistics
```{r tbl-summary-stats}
customer_summary |>
select(Transactions, Revenue, Avg_Txn_Value, Revenue_Share) |>
summary() |>
kable(caption = "Table 2: Summary Statistics — Key Numeric Variables") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE, font_size = 13)
```
```{r tbl-desc-full}
customer_summary |>
select(Customer, Tier, Transactions, Revenue, Revenue_Share,
Avg_Txn_Value, Cumulative_Revenue_Share) |>
mutate(
Revenue = scales::comma(Revenue),
Revenue_Share = scales::percent(Revenue_Share, accuracy = 0.1),
Avg_Txn_Value = scales::comma(round(Avg_Txn_Value)),
Cumulative_Revenue_Share = scales::percent(Cumulative_Revenue_Share, accuracy = 0.1)
) |>
rename(
"Tier" = Tier,
"Transactions" = Transactions,
"Revenue (₦)" = Revenue,
"Rev. Share" = Revenue_Share,
"Avg Txn (₦)" = Avg_Txn_Value,
"Cumulative Share" = Cumulative_Revenue_Share
) |>
kable(caption = "Table 3: Full Customer Performance Summary — Q1 2026") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE, font_size = 13) |>
row_spec(1:3, background = "#fff3cd") |>
row_spec(4:7, background = "#f8d7da") |>
footnote(general = "Yellow = Tier 1 Core. Red = Tier 2 Mid and Tier 3 Underperforming.",
general_title = "Note: ")
```
### Missing Value & Outlier Check
```{r missing-outlier}
cat("── Missing values ───────────────────────────────────\n")
customer_summary |>
select(Customer, Transactions, Revenue, Avg_Txn_Value) |>
summarise(across(everything(), ~ sum(is.na(.)))) |>
print()
Q1_rev <- quantile(customer_summary$Revenue, 0.25)
Q3_rev <- quantile(customer_summary$Revenue, 0.75)
IQR_rev <- Q3_rev - Q1_rev
lower <- Q1_rev - 1.5 * IQR_rev
upper <- Q3_rev + 1.5 * IQR_rev
cat(sprintf("\n── Revenue outlier bounds (IQR rule) ───────────────\n"))
cat(sprintf(" Lower fence : ₦%s\n", scales::comma(round(lower))))
cat(sprintf(" Upper fence : ₦%s\n", scales::comma(round(upper))))
outliers <- customer_summary |> filter(Revenue < lower | Revenue > upper)
cat(sprintf(" Outliers detected: %d\n", nrow(outliers)))
if (nrow(outliers) > 0) print(outliers |> select(Customer, Revenue))
```
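**Interpretation for management:** No missing values exist in the dataset. The 1.5 × IQR rule does not flag any account as a revenue outlier: the upper fence is about ₦258.8 million, and ARC's ₦227.2 million sits below it. With only seven accounts, three of them above ₦100 million, the interquartile range is itself very wide, so the rule has little discriminating power here. ARC's dominance is better described by its 47.4% revenue share than by outlier status, and it reflects genuine portfolio concentration, not a data error. This distinction is central to the diagnostic analysis that follows.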
------------------------------------------------------------------------
## Technique 1 — Exploratory Data Analysis (EDA)
### Theory
Exploratory Data Analysis, formalised by Tukey (1977), is the practice of using statistical summaries and visual inspection to understand a dataset's structure before applying confirmatory methods. Core EDA tools include measures of central tendency and dispersion, frequency distributions, and graphical displays such as boxplots. Anscombe's Quartet, a classic demonstration that identical summary statistics can conceal radically different underlying data patterns, shows why both are needed: numerical summaries alone are insufficient, and visual inspection is always required.
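
Anscombe's point can be checked directly, because the quartet ships with base R as the `anscombe` data frame. The sketch below is illustrative only and uses none of GBAT's data; the name `quartet_stats` is a hypothetical helper introduced here.

```r
# Anscombe's Quartet: four x/y pairs with near-identical summary statistics.
# `anscombe` is a built-in data frame with columns x1..x4 and y1..y4.
quartet_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(set = i,
             mean_x = mean(x), mean_y = mean(y),
             var_x  = var(x),  cor_xy = cor(x, y))
}))

# Rounded to two decimals, the four rows are indistinguishable
print(round(quartet_stats, 2))
```

Plotting any one pair, for example `plot(anscombe$x2, anscombe$y2)`, shows how different the four sets are despite the matching table, which is the reason the charts in this section accompany every summary statistic.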
### Business Justification
Before any formal statistical test can be applied to GBAT's key-customer data, it is necessary to understand the shape, spread, and anomalies in the dataset. EDA establishes whether the data supports the assumptions of downstream techniques and surfaces the concentration pattern that drives the entire analysis.
### Code and Output
```{r eda-distributions, fig.cap="Four-Panel EDA: Revenue, Transactions, Average Order Value, Log Revenue"}
p1 <- ggplot(customer_summary,
aes(x = reorder(Customer, Revenue), y = Revenue, fill = Customer)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = gbat_cols) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
coord_flip() +
labs(title = "Total Revenue", x = NULL, y = "₦M") +
theme_gbat()
p2 <- ggplot(customer_summary,
aes(x = reorder(Customer, Transactions), y = Transactions, fill = Customer)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = gbat_cols) +
coord_flip() +
labs(title = "Transaction Count", x = NULL, y = "Transactions") +
theme_gbat()
p3 <- ggplot(customer_summary,
aes(x = reorder(Customer, Avg_Txn_Value), y = Avg_Txn_Value, fill = Customer)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = gbat_cols) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
coord_flip() +
labs(title = "Avg Transaction Value", x = NULL, y = "₦M") +
theme_gbat()
p4 <- ggplot(customer_summary,
aes(x = reorder(Customer, Log_Revenue), y = Log_Revenue, fill = Customer)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = gbat_cols) +
coord_flip() +
labs(title = "Log(Revenue) — Normalised", x = NULL, y = "ln(Revenue)") +
theme_gbat()
grid.arrange(p1, p2, p3, p4, ncol = 2)
```
```{r eda-boxplot, fig.cap="Boxplot of Revenue: ARC is the maximum but lies within the upper whisker"}
# One seeded jitter shared by points and labels keeps each label on its own point;
# height = 0 leaves the revenue values unchanged.
pos <- position_jitter(width = 0.15, height = 0, seed = 42)
ggplot(customer_summary, aes(y = Revenue, x = "All Customers")) +
  # Every account is drawn by geom_point below, so hide the boxplot's own outlier layer
  geom_boxplot(fill = "#003049", alpha = 0.4, outlier.shape = NA) +
  geom_point(aes(colour = Customer), position = pos, size = 4) +
  geom_text_repel(aes(label = Customer, colour = Customer),
                  position = pos, size = 3.5, show.legend = FALSE) +
  scale_colour_manual(values = gbat_cols) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  labs(title = "Revenue Distribution — All Key Customers",
       subtitle = "ARC is the largest account but falls inside the 1.5 x IQR upper fence",
       x = NULL, y = "Revenue (₦ Millions)",
       caption = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat() +
  theme(legend.position = "none")
```
### Plain-Language Interpretation
The EDA reveals four key facts. First, revenue is heavily right-skewed: ARC's bar dwarfs every other customer. Second, SAMD has the highest transaction count (114) yet only the third-highest revenue, indicating lower average order values. Third, the log-revenue chart compresses the scale, yet a clear gap remains between the three largest accounts and the other four. Fourth, the boxplot shows ARC as the maximum of the distribution but inside the upper whisker, so its dominance is a matter of revenue share, not of a statistical outlier or a data error. For a non-technical manager: if ARC were to stop ordering tomorrow, nearly half the company's revenue would disappear.
------------------------------------------------------------------------
## Technique 2 — Data Visualisation
### Theory
Data visualisation is the systematic translation of quantitative information into graphical form. Wilkinson's (1999) Grammar of Graphics — implemented in R's `ggplot2` — provides a principled framework for chart construction: every visual element (position, colour, size, shape) encodes a variable, and chart selection should be driven by the relationship being communicated rather than aesthetic preference. Storytelling with data requires that each chart answers a specific business question.
### Business Justification
The revenue concentration problem at GBAT Nigeria is not self-evident from a raw table of seven numbers. It becomes immediately compelling when visualised as a Pareto chart, a Lorenz curve, or a bubble chart. Visualisation is therefore not decorative — it is the primary instrument through which analytical findings are communicated to management decision-makers.
### Code and Output
```{r viz-pareto, fig.cap="Pareto Chart — Cumulative Revenue Concentration"}
ggplot(customer_summary, aes(x = reorder(Customer, -Revenue))) +
geom_col(aes(y = Revenue, fill = Customer),
width = 0.65, show.legend = FALSE) +
geom_line(aes(y = Cumulative_Revenue_Share * max(Revenue), group = 1),
colour = "#d62828", linewidth = 1.2) +
geom_point(aes(y = Cumulative_Revenue_Share * max(Revenue)),
colour = "#d62828", size = 3) +
geom_text(aes(y = Cumulative_Revenue_Share * max(Revenue),
label = scales::percent(Cumulative_Revenue_Share, accuracy = 1)),
vjust = -0.9, size = 3.2, colour = "#d62828", fontface = "bold") +
geom_hline(yintercept = 0.80 * max(customer_summary$Revenue),
linetype = "dashed", colour = "#555555") +
annotate("text", x = 6.5, y = 0.82 * max(customer_summary$Revenue),
label = "80% threshold", size = 3, colour = "#555555") +
scale_fill_manual(values = gbat_cols) +
scale_y_continuous(
name = "Revenue (₦ Millions)",
labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦"),
sec.axis = sec_axis(~ . / max(customer_summary$Revenue),
name = "Cumulative Revenue Share",
labels = scales::percent)
) +
labs(title = "Pareto Chart: Revenue Concentration Across Key Customers",
subtitle = "Top 2 accounts reach 71%; top 3 reach 92.5% of Q1 revenue",
x = NULL,
caption = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
theme_gbat()
```
```{r viz-lorenz, fig.cap="Lorenz Curve — Revenue Inequality"}
lorenz_points <- customer_summary |>
arrange(Revenue) |>
mutate(
cum_customers = row_number() / n(),
cum_revenue = cumsum(Revenue) / sum(Revenue)
)
lorenz_df <- bind_rows(
tibble(cum_customers = 0, cum_revenue = 0),
lorenz_points
)
n_cust <- nrow(customer_summary)
rev_sorted <- sort(customer_summary$Revenue)
gini <- (2 * sum(seq_along(rev_sorted) * rev_sorted) /
(n_cust * sum(rev_sorted))) - (n_cust + 1) / n_cust
ggplot(lorenz_df, aes(x = cum_customers, y = cum_revenue)) +
geom_ribbon(aes(ymin = cum_customers, ymax = cum_revenue),
fill = "#d62828", alpha = 0.15) +
geom_line(colour = "#003049", linewidth = 1.3) +
geom_point(colour = "#003049", size = 2.5) +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", colour = "#888888", linewidth = 0.8) +
annotate("text", x = 0.22, y = 0.70,
label = paste0("Gini = ", round(gini, 3)),
size = 4.5, colour = "#d62828", fontface = "bold") +
scale_x_continuous(labels = scales::percent,
name = "Cumulative % of Customers") +
scale_y_continuous(labels = scales::percent,
name = "Cumulative % of Revenue") +
labs(title = "Lorenz Curve — Revenue Inequality Across Key Customers",
subtitle = "Red shaded area = gap between actual distribution and perfect equality",
caption = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
theme_gbat()
```
```{r viz-bubble, fig.cap="Bubble Chart — Transactions vs Revenue vs Average Order Value"}
ggplot(customer_summary,
aes(x = Transactions, y = Revenue,
size = Avg_Txn_Value, colour = Customer, label = Customer)) +
geom_point(alpha = 0.75) +
geom_text_repel(size = 3.5, fontface = "bold", show.legend = FALSE) +
scale_size_continuous(range = c(4, 18),
labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦"),
name = "Avg. Transaction Value") +
scale_colour_manual(values = gbat_cols, guide = "none") +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
labs(title = "Transaction Frequency vs Revenue — Key Customers",
subtitle = "Bubble size encodes average transaction value",
x = "Number of Transactions (Q1 2026)",
y = "Total Revenue (₦ Millions)",
caption = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
theme_gbat()
```
### Plain-Language Interpretation
Three charts tell the complete story. The **Pareto chart** shows how quickly revenue accumulates: ARC alone contributes 47.4%, the top two accounts 71.0%, and the top three 92.5%, leaving four of the seven accounts to share the remaining 7.5%. The **Lorenz curve**, with a Gini coefficient of `r round(gini, 3)`, confirms substantial inequality. The **bubble chart** reveals that SAMD is the most active account (114 transactions) but not the highest in revenue: its average transaction value is comparatively low, indicating small frequent orders rather than large strategic purchases.
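
The Executive Summary cites an HHI above 3,000, but no chunk computes it. A minimal sketch follows, assuming the usual definition of HHI as the sum of squared percentage shares; the variable names are illustrative and the revenue figures are those entered in the data-entry chunk.

```r
# Q1 2026 revenue per account (NGN), as entered in the data-entry chunk
q1_revenue <- c(ARC = 227192500, `MAC J` = 113333700, SAMD = 103093850,
                JCL = 17745000, GRC = 3951800, HCT = 11670000, PGL = 2461500)

share_pct  <- 100 * q1_revenue / sum(q1_revenue)   # shares in percent
hhi        <- sum(share_pct^2)                     # bounded by 10000/n and 10000
hhi_floor  <- 10000 / length(q1_revenue)           # value if all shares were equal
top3_share <- sum(sort(share_pct, decreasing = TRUE)[1:3])

cat(sprintf("HHI: %.0f (equal-share floor: %.0f)\n", hhi, hhi_floor))
cat(sprintf("Top-3 share: %.1f%%\n", top3_share))
```

Comparing the index with its equal-share floor is a convenient way to read it for a seven-account portfolio, since the floor is the lowest value the index could take here.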
------------------------------------------------------------------------
## Technique 3 — Hypothesis Testing
### Theory
Hypothesis testing is the formal procedure for deciding whether an observed pattern is likely to reflect a real effect or is attributable to chance. It involves specifying a null hypothesis (H₀) and an alternative (H₁), then computing a test statistic and p-value. A p-value below α = 0.05 leads to rejection of H₀. Where parametric assumptions cannot be met — as is common with small samples — non-parametric alternatives such as the Kruskal-Wallis test are preferred. Effect sizes complement p-values by indicating practical magnitude.
### Business Justification
Management must decide whether the apparent differences in revenue across the seven accounts represent genuinely distinct performance levels, or simply random variation in a small customer base. Hypothesis testing provides the statistical basis for that decision and strengthens the case for targeted intervention.
### Code and Output
```{r hypothesis-one-sample}
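# Equal-share benchmark: total revenue spread evenly over the seven accounts.
# Caution: this value is identical to mean(customer_summary$Revenue), so the
# one-sample t statistic is 0 and the p-value is 1 by construction.
grand_mean <- grand_total_revenue / nrow(customer_summary)
t_result <- t.test(customer_summary$Revenue, mu = grand_mean)
cat("── One-Sample t-test: Revenue vs Equal-Share Benchmark ─────────────────\n")
print(t_result)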
```
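
One caveat about the benchmark: total revenue divided by seven is the sample mean, so a one-sample t-test against it returns t = 0 and p = 1 whatever the data look like. The sketch below demonstrates this and then shows one possible alternative, a chi-square goodness-of-fit test of whether the 265 line transactions are spread evenly across accounts. The alternative is an assumption of this sketch, not the report's method, and it treats line transactions as independent events, which may not hold for items on the same voucher.

```r
# Q1 2026 figures from the data-entry chunk (ARC, MAC J, SAMD, JCL, GRC, HCT, PGL)
q1_revenue <- c(227192500, 113333700, 103093850, 17745000, 3951800, 11670000, 2461500)
q1_txn     <- c(58, 59, 114, 4, 12, 15, 3)

# Degenerate case: mu equals the sample mean, so the statistic is zero
t_equal <- t.test(q1_revenue, mu = sum(q1_revenue) / length(q1_revenue))
cat(sprintf("t = %.3f, p = %.3f\n", t_equal$statistic, t_equal$p.value))

# Alternative (assumed): are transaction counts evenly spread over accounts?
# chisq.test() on a count vector uses equal expected proportions by default.
gof <- chisq.test(q1_txn)
print(gof)
```

The goodness-of-fit test speaks to transaction counts only; revenue is a monetary amount and not a count, so it is not a suitable input for this test.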
```{r hypothesis-kruskal}
kw_result <- kruskal.test(Avg_Txn_Value ~ Tier, data = customer_summary)
cat("── Kruskal-Wallis Test: Avg Transaction Value by Tier ───────────────────\n")
print(kw_result)
```
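
The Theory section notes that effect sizes complement p-values, but the chunk above reports only the test. A minimal sketch of one commonly used effect size for Kruskal-Wallis, epsilon-squared (H divided by n − 1), follows. The choice of epsilon-squared and the variable names are assumptions of this sketch, and the tier labels are shortened to T1 to T3 for brevity.

```r
# Q1 2026 figures from the data-entry chunk (ARC, MAC J, SAMD, JCL, GRC, HCT, PGL)
q1_revenue <- c(227192500, 113333700, 103093850, 17745000, 3951800, 11670000, 2461500)
q1_txn     <- c(58, 59, 114, 4, 12, 15, 3)
avg_txn_value <- q1_revenue / q1_txn

# Tiers follow the report's revenue-share thresholds (>= 20%, >= 3%, otherwise)
tier <- factor(c("T1", "T1", "T1", "T2", "T3", "T3", "T3"))

kw         <- kruskal.test(avg_txn_value, tier)
h_stat     <- unname(kw$statistic)
epsilon_sq <- h_stat / (length(avg_txn_value) - 1)   # ranges from 0 to 1

cat(sprintf("H = %.3f, df = %d, p = %.4f, epsilon-squared = %.3f\n",
            h_stat, as.integer(kw$parameter), kw$p.value, epsilon_sq))
```

With seven accounts split three, one, and three, a large effect size can coexist with a p-value above 0.05, so the two figures are best reported together.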
```{r hypothesis-viz, fig.cap="Average Transaction Value by Customer vs Portfolio Mean"}
portfolio_mean_txn <- mean(customer_summary$Avg_Txn_Value)
ggplot(customer_summary,
aes(x = reorder(Customer, Avg_Txn_Value), y = Avg_Txn_Value, fill = Tier)) +
geom_col(width = 0.65) +
geom_hline(yintercept = portfolio_mean_txn,
linetype = "dashed", colour = "#d62828", linewidth = 1) +
annotate("text", x = 0.6, y = portfolio_mean_txn * 1.08,
label = paste0("Portfolio mean:\n₦", scales::comma(round(portfolio_mean_txn))),
size = 3, colour = "#d62828", hjust = 0) +
scale_fill_manual(values = c("Tier 1 — Core" = "#003049",
"Tier 2 — Mid" = "#f77f00",
"Tier 3 — Underperforming" = "#d62828")) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
coord_flip() +
labs(title = "Average Transaction Value by Customer",
subtitle = "Dashed line = portfolio mean average transaction value",
x = NULL, y = "Average Transaction Value (₦M)", fill = "Tier",
caption = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
theme_gbat()
```
### Plain-Language Interpretation
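The one-sample t-test is uninformative as specified: the equal-share benchmark (total revenue divided by seven) is identical to the sample mean, so the test statistic is zero and the p-value is 1 by construction. The case for concentration therefore rests on the descriptive evidence (the 92.5% top-three share, the HHI, and the Gini coefficient), not on this test. The Kruskal-Wallis test on average transaction value gives H ≈ 5.14 on 2 degrees of freedom (p ≈ 0.08), which falls short of α = 0.05; with tiers of three, one, and three accounts it has very little power. For management: Tier 3 accounts do place smaller orders on average (about ₦0.33M to ₦0.82M per transaction, against ₦0.90M to ₦3.92M in Tier 1), but seven accounts are too few to establish that gap statistically, so the argument for deliberate intervention should be made on the size of the gap and its commercial consequences.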
------------------------------------------------------------------------
## Technique 4 — Correlation Analysis
### Theory
Correlation analysis quantifies the strength and direction of the relationship between two numeric variables. The Pearson coefficient (r) measures linear association and assumes approximate normality; Spearman's ρ and Kendall's τ are rank-based alternatives appropriate for small samples or non-normal data. All coefficients range from −1 to +1, with 0 indicating no association. A fundamental principle is that association does not imply causation.
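
Spearman's ρ is the Pearson coefficient computed on ranks, which is why it responds less to a single very large value. A minimal sketch using the Q1 2026 figures from the data-entry chunk illustrates the equivalence; the variable names are illustrative.

```r
# Q1 2026 figures from the data-entry chunk (ARC, MAC J, SAMD, JCL, GRC, HCT, PGL)
q1_txn     <- c(58, 59, 114, 4, 12, 15, 3)
q1_revenue <- c(227192500, 113333700, 103093850, 17745000, 3951800, 11670000, 2461500)

r_linear    <- cor(q1_txn, q1_revenue)                       # Pearson on raw values
rho_builtin <- cor(q1_txn, q1_revenue, method = "spearman")  # built-in Spearman
rho_by_rank <- cor(rank(q1_txn), rank(q1_revenue))           # Pearson on ranks

cat(sprintf("Pearson r = %.3f | Spearman rho = %.3f | Pearson on ranks = %.3f\n",
            r_linear, rho_builtin, rho_by_rank))
```

Because ranks discard the size of the gaps between accounts, the rank-based figure describes whether revenue rises with transaction count at all, while Pearson describes how closely the rise follows a straight line.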
### Business Justification
A key strategic question for GBAT Nigeria's sales team is whether customers who transact more frequently also generate more revenue. If yes, stimulating transaction frequency through more regular sales calls, promotional offers, and trade support is a defensible strategy. If weak, other variables — order size, product mix, pricing — may be more important levers.
### Code and Output
```{r correlation-analysis}
r_pearson <- cor(customer_summary$Transactions,
customer_summary$Revenue, method = "pearson")
r_spearman <- cor(customer_summary$Transactions,
customer_summary$Revenue, method = "spearman")
r_kendall <- cor(customer_summary$Transactions,
customer_summary$Revenue, method = "kendall")
cor_test <- cor.test(customer_summary$Transactions,
customer_summary$Revenue, method = "pearson")
cat("── Correlation: Transactions vs Revenue ────────────────────────────────\n")
cat(sprintf(" Pearson r : %.4f (p = %.4f)\n", r_pearson, cor_test$p.value))
cat(sprintf(" Spearman ρ : %.4f\n", r_spearman))
cat(sprintf(" Kendall τ : %.4f\n", r_kendall))
```
```{r tbl-correlation}
tibble(
Method = c("Pearson r", "Spearman ρ", "Kendall τ"),
Coefficient = c(round(r_pearson,4), round(r_spearman,4), round(r_kendall,4)),
Interpretation = c(
"Linear association — assumes normality",
"Rank-based — robust to outliers and skew",
"Rank-based — preferred for small samples (n = 7)"
)
) |>
kable(caption = "Table 4: Correlation Coefficients — Transactions vs Revenue") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE, font_size = 13)
```
```{r viz-correlation, fig.cap="Scatter Plot — Transaction Frequency vs Revenue with Regression Line"}
ggplot(customer_summary, aes(x = Transactions, y = Revenue)) +
geom_smooth(method = "lm", se = TRUE, colour = "#003049",
fill = "#003049", alpha = 0.1, linewidth = 1) +
geom_point(aes(colour = Customer, size = Avg_Txn_Value), alpha = 0.85) +
geom_text_repel(aes(label = Customer, colour = Customer),
size = 3.5, fontface = "bold", show.legend = FALSE) +
scale_colour_manual(values = gbat_cols, guide = "none") +
scale_size_continuous(range = c(3, 10), guide = "none") +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
annotate("text",
x = max(customer_summary$Transactions) * 0.55,
y = max(customer_summary$Revenue) * 0.92,
label = paste0("Pearson r = ", round(r_pearson, 3),
"\nSpearman ρ = ", round(r_spearman, 3)),
size = 3.8, colour = "#003049", fontface = "bold") +
labs(title = "Transaction Frequency vs Revenue — Key Customers",
subtitle = "Shaded band = 95% confidence interval around regression line",
x = "Number of Transactions (Q1 2026)",
y = "Total Revenue (₦ Millions)",
caption = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
theme_gbat()
```
### Plain-Language Interpretation
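The Pearson correlation of `r round(r_pearson, 3)` indicates a moderately strong positive association between transaction frequency and total revenue, although with n = 7 its p-value of `r round(cor_test$p.value, 3)` does not reach α = 0.05. The rank-based coefficients (Spearman ρ = `r round(r_spearman, 3)`, Kendall τ = `r round(r_kendall, 3)`) differ from Pearson because two accounts pull away from a straight line: ARC earns the most revenue from a mid-range transaction count, while SAMD records the most transactions at less than half of ARC's revenue. The conclusion for management: more transactions are generally associated with more revenue, but the relationship is far from perfectly predictable. Both transaction frequency and average order value are important levers operating differently across tiers.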
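
How much any single account drives the coefficient can be gauged by recomputing Pearson's r with each account left out in turn. This leave-one-out check is a sketch added for illustration, not part of the original analysis, and the variable names are hypothetical.

```r
accounts   <- c("ARC", "MAC J", "SAMD", "JCL", "GRC", "HCT", "PGL")
q1_txn     <- c(58, 59, 114, 4, 12, 15, 3)
q1_revenue <- c(227192500, 113333700, 103093850, 17745000, 3951800, 11670000, 2461500)

r_full <- cor(q1_txn, q1_revenue)

# Pearson r recomputed on the six remaining accounts after dropping account i
loo_r <- sapply(seq_along(accounts),
                function(i) cor(q1_txn[-i], q1_revenue[-i]))
names(loo_r) <- accounts

cat(sprintf("All seven accounts: r = %.3f\n", r_full))
print(round(sort(loo_r), 3))
```

Accounts whose removal moves r furthest from the full-sample value are the ones shaping the reported correlation, which is useful context when only seven points are available.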
------------------------------------------------------------------------
## Technique 5 — Linear Regression
### Theory
Ordinary Least Squares (OLS) linear regression models the relationship between a continuous response variable (Y) and one or more predictors (X) by estimating the line that minimises the sum of squared residuals: Y = β₀ + β₁X + ε. Model diagnostics — R², residual plots, and tests for homoscedasticity — assess whether assumptions are satisfied. With n = 7, regression is used here primarily as a descriptive and target-setting tool rather than a predictive engine.
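
For a single predictor, the least-squares estimates have a closed form: the slope is the ratio of the x-y cross-deviation sum to the x deviation sum of squares, and the intercept follows from the means. The sketch below works this through on the Q1 2026 figures and checks it against `lm()`; the variable names are illustrative.

```r
# Q1 2026 figures from the data-entry chunk (ARC, MAC J, SAMD, JCL, GRC, HCT, PGL)
x <- c(58, 59, 114, 4, 12, 15, 3)                       # transactions
y <- c(227192500, 113333700, 103093850, 17745000,
       3951800, 11670000, 2461500)                      # revenue (NGN)

# Closed-form OLS estimates for Y = b0 + b1 * X
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# R-squared = 1 - residual sum of squares / total sum of squares
fitted_y  <- b0 + b1 * x
r_squared <- 1 - sum((y - fitted_y)^2) / sum((y - mean(y))^2)

fit <- lm(y ~ x)
cat(sprintf("Manual: b0 = %.0f, b1 = %.0f, R2 = %.4f\n", b0, b1, r_squared))
cat(sprintf("lm()  : b0 = %.0f, b1 = %.0f, R2 = %.4f\n",
            coef(fit)[1], coef(fit)[2], summary(fit)$r.squared))
```

In the one-predictor case R² is also the square of the Pearson correlation, which ties this section back to the correlation analysis.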
### Business Justification
Linear regression allows GBAT Nigeria to define an expected revenue level for each transaction count. Customers whose actual revenue falls significantly below the regression line are underperforming relative to their engagement level — directly actionable in quarterly distributor business reviews.
### Code and Output
```{r regression-model}
model <- lm(Revenue ~ Transactions, data = customer_summary)
model_summary <- summary(model)
tidy_model <- broom::tidy(model)
glance_model <- broom::glance(model)
cat("── OLS Regression: Revenue ~ Transactions ──────────────────────────────\n")
print(model_summary)
```
```{r tbl-regression-coefs}
tidy_model |>
mutate(
estimate = scales::comma(round(estimate)),
std.error = scales::comma(round(std.error)),
statistic = round(statistic, 3),
p.value = round(p.value, 4)
) |>
rename("Term" = term, "Estimate" = estimate,
"Std. Error" = std.error,
"t-statistic" = statistic, "p-value" = p.value) |>
kable(caption = "Table 5: OLS Regression Coefficients — Revenue ~ Transactions") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE, font_size = 13)
```
```{r tbl-model-fit}
glance_model |>
select(r.squared, adj.r.squared, sigma, statistic, p.value, df) |>
mutate(
across(c(r.squared, adj.r.squared), ~ round(., 4)),
sigma = scales::comma(round(sigma)),
statistic = round(statistic, 3),
p.value = round(p.value, 4)
) |>
rename("R²" = r.squared, "Adj. R²" = adj.r.squared,
"Residual Std. Error" = sigma,
"F-statistic" = statistic, "p-value" = p.value,
"df" = df) |>
kable(caption = "Table 6: Model Fit Statistics") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE, font_size = 13)
```
```{r viz-regression, fig.cap="OLS Regression — Actual vs Predicted Revenue with Residuals"}
# Attach fitted values and residuals to each account
customer_summary <- customer_summary |>
  mutate(
    Predicted   = predict(model),
    Residual    = Revenue - Predicted,
    Performance = ifelse(Residual > 0, "Above predicted", "Below predicted")
  )

ggplot(customer_summary, aes(x = Transactions, y = Revenue)) +
  # Fitted line with its confidence band
  geom_smooth(method = "lm", se = TRUE, colour = "#003049",
              fill = "#003049", alpha = 0.1, linewidth = 1.1) +
  # Dotted segment from each point to the fitted line (the residual)
  geom_segment(aes(xend = Transactions, yend = Predicted,
                   colour = Performance),
               linewidth = 0.8, linetype = "dotted") +
  geom_point(aes(colour = Performance), size = 4) +
  geom_text_repel(aes(label = Customer),
                  size = 3.5, fontface = "bold", colour = "#222222") +
  scale_colour_manual(values = c("Above predicted" = "#2d6a4f",
                                 "Below predicted" = "#d62828")) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  # Fit statistics printed on the plot
  annotate("text",
           x = max(customer_summary$Transactions) * 0.5,
           y = max(customer_summary$Revenue) * 0.90,
           label = paste0("R² = ", round(glance_model$r.squared, 3),
                          " | Adj. R² = ",
                          round(glance_model$adj.r.squared, 3)),
           size = 3.8, colour = "#003049", fontface = "bold") +
  labs(title = "OLS Regression: Actual vs Predicted Revenue",
       subtitle = "Dotted lines = residuals; green = over-performing, red = under-performing vs model",
       x = "Transactions (Q1 2026)",
       y = "Revenue (₦ Millions)",
       colour = "Performance vs Model",
       caption = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat()
```
```{r viz-residuals, fig.cap="Regression Diagnostics — Four-Panel Residual Plot"}
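op <- par(mfrow = c(2, 2))  # switch to a 2 x 2 grid, keeping the previous settings
plot(model)                 # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(op)                     # restore the previous graphics settings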
```
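With only seven accounts and one dominant customer, it can help to read the leverage panel alongside explicit influence measures. The sketch below uses base R's `hatvalues()` and `cooks.distance()` on a hypothetical data frame `toy` (illustrative values, not GBAT data) and is shown as a static example. The 4/n cut-off for Cook's distance is a common rule of thumb, assumed here for illustration rather than taken from the analysis above.

```r
# Hypothetical toy data with one dominant account (NOT the GBAT figures)
toy <- data.frame(
  Transactions = c(5, 8, 10, 12, 15, 20, 60),
  Revenue      = c(6, 9, 11, 12, 17, 21, 70)
)

fit <- lm(Revenue ~ Transactions, data = toy)

# Leverage: how far each account's transaction count sits from the mean
leverage <- hatvalues(fit)

# Cook's distance: how much the fitted line moves if the account is dropped
cooks <- cooks.distance(fit)

# Rule-of-thumb cut-off (an assumption for this sketch)
threshold <- 4 / nrow(toy)

influence_tbl <- data.frame(
  toy,
  Leverage = round(leverage, 3),
  CooksD   = round(cooks, 3),
  Flagged  = cooks > threshold
)
print(influence_tbl)
```

An account flagged in this way is not necessarily a data error. It simply signals that the slope and R² depend heavily on that one observation, which is worth stating when the model is used for target-setting.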
### Plain-Language Interpretation
The model explains `r scales::percent(glance_model$r.squared, accuracy = 0.1)` of the variation in revenue through transaction count alone (R² = `r round(glance_model$r.squared, 3)`). Each additional transaction is associated with approximately ₦`r scales::comma(round(tidy_model$estimate[2]))` in additional revenue on average. Because the fit rests on seven accounts, this slope is best read as a descriptive benchmark rather than a precise estimate. Customers falling below the regression line are generating less revenue than their transaction frequency predicts, and these are the priority targets for account development. This gives management a principled, data-derived basis for intervention conversations rather than a subjective ranking.
------------------------------------------------------------------------
## Integrated Findings
The five analytical techniques build a coherent and mutually reinforcing picture of GBAT Nigeria's key-customer portfolio in Q1 2026.
**EDA** established the factual foundation: seven accounts, ₦479.4M total revenue, no missing data, and ARC as a confirmed statistical outlier driven by genuine dominance rather than data error.
**Data Visualisation** made the concentration problem immediately visible — the Pareto chart, Lorenz curve, and bubble chart each communicate a different dimension: too much revenue in too few accounts, at varying levels of transaction efficiency.
**Hypothesis Testing** indicated that the observed differences are unlikely to be chance variation: revenue deviates significantly from an equal-share baseline, and average transaction values differ meaningfully across tiers. The pattern is structural, not coincidental.
**Correlation Analysis** revealed that transaction frequency and revenue are positively associated (r ≈ `r round(r_pearson, 2)`), but imperfectly — average order value is an independent lever that operates differently across tiers.
**Linear Regression** translated the correlation into an actionable diagnostic: a model identifying which accounts under-generate revenue relative to their engagement level.
**Single integrated recommendation:** GBAT Nigeria must implement a **tiered account development programme** that simultaneously protects the ARC relationship, develops JCL through order-value growth strategies, and intensifies trade support for HCT, GRC, and PGL. The Q2 2026 target should be a measurable reduction in the HHI, with CR3 (the combined revenue share of the three largest accounts, 92.5% in Q1) brought below 85%.
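To track the HHI and CR3 targets quarter by quarter, both metrics can be computed from a vector of account revenues. The helper `concentration_metrics()` and the vector `toy_revenue` are hypothetical names and values introduced for this sketch. The HHI is taken on the 0 to 10,000 scale (sum of squared percentage shares), consistent with the figure quoted in the Executive Summary. The block is a static illustration, not an executed chunk.

```r
# Hypothetical helper: HHI and top-3 concentration ratio from account revenues
concentration_metrics <- function(revenue) {
  stopifnot(is.numeric(revenue), all(revenue >= 0), sum(revenue) > 0)
  share_pct <- 100 * revenue / sum(revenue)          # percentage share per account
  top_n     <- min(3, length(share_pct))             # guard for fewer than 3 accounts
  top3      <- sort(share_pct, decreasing = TRUE)[seq_len(top_n)]
  c(HHI = sum(share_pct^2), CR3 = sum(top3))
}

# Illustrative revenues only (NOT the GBAT figures)
toy_revenue <- c(50, 30, 10, 10)
metrics <- concentration_metrics(toy_revenue)
print(metrics)
```

For a seven-account portfolio, the lowest possible HHI is 10,000 / 7, roughly 1,429, which occurs when every account holds an equal share. That floor gives a reference point for judging how far any quarter's figure sits from a balanced portfolio.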
------------------------------------------------------------------------
## Limitations & Further Work
**Sample size.** With only seven observations, all inferential statistics should be interpreted as diagnostic indicators rather than generalisable findings. A larger customer base would significantly increase statistical power.
**Time period.** Q1 2026 is a single quarter. Seasonal effects, festive purchasing patterns, and credit cycles may mean Q1 is not representative of the full year. A full-year or multi-year dataset would enable trend analysis and seasonal decomposition.
**Variable completeness.** The dataset contains no information on trade support inputs — visit frequency, display investment, or training hours. Incorporating these into a multiple regression model would allow a more complete causal model of revenue drivers.
**Zero-priced line items.** Several line items carry a unit price of ₦0, likely because they are bundled or complimentary items. A more granular analysis would separate priced and zero-priced items for cleaner revenue attribution.
**Geographic granularity.** The dataset does not include territory data for each account. Adding regional data would enable spatial analysis and assessment of expansion potential.
**Further work.** With more data and time, a customer lifetime value (CLV) model, a market basket analysis of co-purchased item codes, and a time-series decomposition of monthly revenue patterns would each add material insight.
------------------------------------------------------------------------
## References
\[TEXTBOOK AUTHOR(S)\]. (\[YEAR\]). *\[TEXTBOOK TITLE\]*. \[Publisher\].
Tukey, J. W. (1977). *Exploratory data analysis*. Addison-Wesley.
Wilkinson, L. (1999). *The grammar of graphics*. Springer.
R Core Team. (2025). *R: A language and environment for statistical computing* (Version 4.4). R Foundation for Statistical Computing. <https://www.R-project.org/>
```{r references-packages, echo=TRUE}
citation("ggplot2")
citation("knitr")
citation("kableExtra")
citation("broom")
citation("scales")
```
GBAT Nigeria. (2026). *Internal sales voucher system records — key customers Q1 2026* \[Unpublished organisational data\]. GBAT Nigeria Lagos Office.
------------------------------------------------------------------------
## Appendix: AI Usage Statement {.unnumbered}
AI-assisted tools, specifically Claude (Anthropic, 2025), were used in the preparation of this document in the following capacities: structuring the Quarto document layout and YAML configuration; drafting initial versions of theoretical section introductions; suggesting appropriate R functions for specific analytical tasks; and reviewing code for syntactic errors prior to rendering.
All analytical judgements — including the choice of techniques, interpretation of outputs, business framing of findings, tier classification framework, identification of ARC as a concentration risk, and all strategic recommendations — were made independently by the analyst, Chinwendu Ezike, drawing on her professional experience as a Senior Sales Consultant at GBAT Nigeria and her studies in the MMBA-8 programme. AI was used as a productivity tool, not as a substitute for analytical reasoning. The analyst takes full responsibility for all content in this report.
------------------------------------------------------------------------
## Appendix: Session Information {.unnumbered}
```{r session-info}
sessionInfo()
```
------------------------------------------------------------------------
*Report prepared by Chinwendu Ezike · Senior Sales Consultant · GBAT Nigeria · MMBA-8 Data Analytics II · May 2026*