Revenue Concentration & Portfolio Imbalance Analysis

MMBA-8 · Data Analytics II · Case Study 1

Author

Chinwendu Ezike

Published

May 13, 2026


Executive Summary

GBAT Nigeria is a principal manufacturer and distributor of building materials and sanitary ware operating across Lagos and Abuja. This report analyses the complete population of key-customer transactions for Q1 2026 — 265 line transactions across 7 sub-distributor accounts, totalling ₦479.4 million in revenue — sourced directly from the organisation’s internal sales voucher system.

The central business problem is dangerous revenue concentration: three customers (ARC, MAC J, and SAMD) account for 92.5% of all Q1 revenue, while the remaining four accounts contribute just 7.5%. A Herfindahl-Hirschman Index (HHI) of over 3,000 and a Gini coefficient of approximately 0.59 (reported with the Lorenz curve in Section 4.3) indicate that this imbalance is severe. ARC alone, at 47.4% of revenue, represents a single point of failure for the entire portfolio.

Five analytical techniques — Exploratory Data Analysis, Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — are applied to establish the facts, test their significance, and model their implications.

The primary recommendation is a three-pillar strategy: protect the ARC relationship through contract formalisation; develop Tier 2 and Tier 3 accounts through structured trade support; and rebalance incentive structures to reward growth in underperforming accounts. Implementation should begin in Q2 2026.


Professional Disclosure

Field          Detail
Analyst        Chinwendu Ezike
Job Title      Senior Sales Consultant
Organisation   GBAT Nigeria, Principal Manufacturer
Sector         Building Materials · Sanitary Ware · Construction
Role in Data   Key Account & Distributor Account Management
Programme      MMBA-8 · Data Analytics II
Report Date    13 May 2026

0.1 Operational Relevance of Each Analytical Technique

Technique 1 — Exploratory Data Analysis (EDA). As a Senior Sales Consultant responsible for key account and distributor account management at GBAT Nigeria, I routinely interact with the raw output of our internal sales voucher system. EDA is the formalisation of what I do informally every working week: scanning transaction records for missing entries, unusually large or zero-value orders, and customers whose purchase frequency has dropped without explanation. Applying structured EDA to the Q1 2026 dataset — including missing-value checks, summary statistics, and outlier detection — transforms an instinctive review process into a reproducible, evidence-based audit. This is directly relevant to my role because it allows me to present defensible data to management rather than anecdotal observations.

Technique 2 — Data Visualisation. Communicating sales performance to both technical and non-technical stakeholders is a core part of my responsibilities. I regularly prepare performance summaries for my line manager, trade marketing colleagues, and distributor partners. Data visualisation — grounded in the grammar of graphics and chart selection principles — gives me the tools to move beyond raw tables and tell a coherent story with data. In this report, Pareto charts, donut charts, bubble plots, and Lorenz curves each serve a specific communicative purpose: making the revenue concentration problem immediately visible to a manager who may not have time to read a table of figures.

Technique 3 — Hypothesis Testing. A persistent challenge in distributor account management is distinguishing between genuine performance differences and random variation. When one distributor appears to be underperforming relative to another, the question is whether that gap is statistically meaningful or simply noise. Hypothesis testing — specifically a one-sample t-test and a Kruskal-Wallis non-parametric test given the small sample — provides a formal answer to that question. In practice, this strengthens my position when making the case to management that certain accounts require intervention, because the argument is grounded in statistical significance rather than personal judgement.

Technique 4 — Correlation Analysis. One of the most important strategic questions in my role is whether investing more resources in a distributor — through trade support visits, showroom equipment, product display, pricing guidance, or training — actually translates into higher revenue. Correlation analysis between transaction frequency and revenue value, using Pearson and Spearman coefficients appropriate to this dataset, begins to answer that question empirically. Understanding which input variables are associated with revenue outcomes is foundational to making resource allocation decisions that are evidence-based rather than relationship-driven.

Technique 5 — Linear Regression. Target-setting is a central activity in key account management: every quarter, I work with distributors to agree on revenue targets that are ambitious but realistic. Linear regression — modelling revenue as a function of transaction frequency — provides a principled basis for those targets. Rather than negotiating targets based on the prior year plus an arbitrary percentage uplift, a regression model reveals what revenue a distributor of a given transaction frequency should be generating, and allows me to identify accounts that are significantly under-performing relative to their own engagement level. This is directly actionable in quarterly business reviews.


1 Data Collection & Sampling

1.1 Source and Collection Method

The dataset used in this analysis is extracted from GBAT Nigeria’s internal sales voucher system — a transactional record system used by the Lagos office to log all sales to registered key customers. Each record corresponds to a single line item on a sales voucher and contains: transaction date, bill number, item code, quantity, unit, unit price (₦), and amount (₦). Data was extracted directly by the analyst in her capacity as Senior Sales Consultant, using read-only access to the voucher system. No third-party data collection instruments were used.

1.2 Sampling Frame and Sample Size

Parameter      Detail
Population     All key-customer transactions, Q1 2026
Sample type    Complete population census, not a sample
Records        265 line transactions
Customers      7 registered key sub-distributor accounts
Time period    1 January 2026 – 31 March 2026 (90 days)
Total revenue  ₦479,448,350
Geography      Lagos and Abuja offices

Because this dataset constitutes the complete population of key-customer transactions for the period — not a randomly drawn sample — inferential statistics are applied here primarily as analytical and diagnostic tools rather than as instruments of generalisation to a wider population. Results describe Q1 2026 exactly; their applicability to future quarters is contingent on structural continuity in the customer base.

2 Data Description

Show Code
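# Packages used by the code in this report. gbat_cols and theme_gbat() are
# assumed to be defined in a setup chunk that is not shown here.
library(tidyverse)   # tibble, dplyr, ggplot2
library(scales)      # label_number(), comma(), percent()
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), row_spec(), footnote()
library(gridExtra)   # grid.arrange()
library(ggrepel)     # geom_text_repel()

customer_summary <- tibble(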
  Customer     = c("ARC", "MAC J", "SAMD", "JCL", "GRC", "HCT", "PGL"),
  Transactions = c(58, 59, 114, 4, 12, 15, 3),
  Revenue      = c(227192500, 113333700, 103093850, 17745000, 3951800, 11670000, 2461500)
)

grand_total_revenue <- 479448350
grand_total_txn     <- 265

customer_summary <- customer_summary |>
  mutate(
    Revenue_Share            = Revenue / grand_total_revenue,
    Txn_Share                = Transactions / grand_total_txn,
    Avg_Txn_Value            = Revenue / Transactions,
    Log_Revenue              = log(Revenue),
    Tier = factor(
      case_when(
        Revenue_Share >= 0.20 ~ "Tier 1 — Core",
        Revenue_Share >= 0.03 ~ "Tier 2 — Mid",
        TRUE                  ~ "Tier 3 — Underperforming"
      ),
      levels = c("Tier 1 — Core", "Tier 2 — Mid", "Tier 3 — Underperforming")
    )
  ) |>
  arrange(desc(Revenue)) |>
  mutate(
    Cumulative_Revenue       = cumsum(Revenue),
    Cumulative_Revenue_Share = cumsum(Revenue_Share)
  )
Show Code
stopifnot(
  "Revenue mismatch"      = abs(sum(customer_summary$Revenue) - grand_total_revenue) < 1,
  "Transaction mismatch"  = sum(customer_summary$Transactions) == grand_total_txn
)
cat("✔ Integrity check passed: revenue and transaction counts match source.\n")
✔ Integrity check passed: revenue and transaction counts match source.
Show Code
cat(sprintf("  Total revenue     : ₦%s\n", scales::comma(grand_total_revenue)))
  Total revenue     : ₦479,448,350
Show Code
cat(sprintf("  Total transactions: %d\n",  grand_total_txn))
  Total transactions: 265
Show Code
cat(sprintf("  Customers         : %d\n",  nrow(customer_summary)))
  Customers         : 7

2.1 Variable Dictionary

Show Code
tibble(
  Variable    = c("Customer","Transactions","Revenue","Revenue_Share",
                  "Txn_Share","Avg_Txn_Value","Log_Revenue","Tier"),
  Type        = c("Character","Integer","Numeric","Numeric",
                  "Numeric","Numeric","Numeric","Factor"),
  Description = c(
    "Key customer account code",
    "Number of line-item transactions in Q1 2026",
    "Total revenue generated (₦) in Q1 2026",
    "Customer revenue as proportion of grand total",
    "Customer transaction count as proportion of total",
    "Mean revenue per transaction (₦)",
    "Natural log of revenue — used to normalise skewed distribution",
    "Analyst-assigned performance tier (Core / Mid / Underperforming)"
  )
) |>
  kable(caption = "Variable Dictionary") |>  # the "Table 1:" prefix is added at render time
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 13)
Table 1: Variable Dictionary
Variable Type Description
Customer Character Key customer account code
Transactions Integer Number of line-item transactions in Q1 2026
Revenue Numeric Total revenue generated (₦) in Q1 2026
Revenue_Share Numeric Customer revenue as proportion of grand total
Txn_Share Numeric Customer transaction count as proportion of total
Avg_Txn_Value Numeric Mean revenue per transaction (₦)
Log_Revenue Numeric Natural log of revenue — used to normalise skewed distribution
Tier Factor Analyst-assigned performance tier (Core / Mid / Underperforming)

2.2 Summary Statistics

Show Code
customer_summary |>
  select(Transactions, Revenue, Avg_Txn_Value, Revenue_Share) |>
  summary() |>
  kable(caption = "Summary Statistics — Key Numeric Variables") |>  # "Table 2:" prefix is added at render time
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, font_size = 13)
Table 2: Summary Statistics — Key Numeric Variables
Transactions Revenue Avg_Txn_Value Revenue_Share
Min. : 3.00 Min. : 2461500 Min. : 329317 Min. :0.005134
1st Qu.: 8.00 1st Qu.: 7810900 1st Qu.: 799250 1st Qu.:0.016291
Median : 15.00 Median : 17745000 Median : 904332 Median :0.037011
Mean : 37.86 Mean : 68492621 Mean :1872346 Mean :0.142857
3rd Qu.: 58.50 3rd Qu.:108213775 3rd Qu.:2919011 3rd Qu.:0.225705
Max. :114.00 Max. :227192500 Max. :4436250 Max. :0.473862
Show Code
customer_summary |>
  select(Customer, Tier, Transactions, Revenue, Revenue_Share,
         Avg_Txn_Value, Cumulative_Revenue_Share) |>
  mutate(
    Revenue                  = scales::comma(Revenue),
    Revenue_Share            = scales::percent(Revenue_Share, accuracy = 0.1),
    Avg_Txn_Value            = scales::comma(round(Avg_Txn_Value)),
    Cumulative_Revenue_Share = scales::percent(Cumulative_Revenue_Share, accuracy = 0.1)
  ) |>
  rename(
    "Tier"             = Tier,
    "Transactions"     = Transactions,
    "Revenue (₦)"      = Revenue,
    "Rev. Share"       = Revenue_Share,
    "Avg Txn (₦)"      = Avg_Txn_Value,
    "Cumulative Share" = Cumulative_Revenue_Share
  ) |>
  kable(caption = "Full Customer Performance Summary — Q1 2026") |>  # "Table 3:" prefix is added at render time
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, font_size = 13) |>
  row_spec(1:3, background = "#fff3cd") |>
  row_spec(4:7, background = "#f8d7da") |>
  footnote(general = "Yellow = Tier 1 Core. Red = Tier 2/3 underperforming.",
           general_title = "Note: ")
Table 3: Full Customer Performance Summary — Q1 2026
Customer Tier Transactions Revenue (₦) Rev. Share Avg Txn (₦) Cumulative Share
ARC Tier 1 — Core 58 227,192,500 47.4% 3,917,112 47.4%
MAC J Tier 1 — Core 59 113,333,700 23.6% 1,920,910 71.0%
SAMD Tier 1 — Core 114 103,093,850 21.5% 904,332 92.5%
JCL Tier 2 — Mid 4 17,745,000 3.7% 4,436,250 96.2%
HCT Tier 3 — Underperforming 15 11,670,000 2.4% 778,000 98.7%
GRC Tier 3 — Underperforming 12 3,951,800 0.8% 329,317 99.5%
PGL Tier 3 — Underperforming 3 2,461,500 0.5% 820,500 100.0%
Note:
Yellow = Tier 1 Core. Red = Tier 2/3 underperforming.

2.3 Missing Value & Outlier Check

Show Code
cat("── Missing values ───────────────────────────────────\n")
── Missing values ───────────────────────────────────
Show Code
customer_summary |>
  select(Customer, Transactions, Revenue, Avg_Txn_Value) |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  print()
# A tibble: 1 × 4
  Customer Transactions Revenue Avg_Txn_Value
     <int>        <int>   <int>         <int>
1        0            0       0             0
Show Code
Q1_rev  <- quantile(customer_summary$Revenue, 0.25)
Q3_rev  <- quantile(customer_summary$Revenue, 0.75)
IQR_rev <- Q3_rev - Q1_rev
lower   <- Q1_rev - 1.5 * IQR_rev
upper   <- Q3_rev + 1.5 * IQR_rev

cat(sprintf("\n── Revenue outlier bounds (IQR rule) ───────────────\n"))

── Revenue outlier bounds (IQR rule) ───────────────
Show Code
cat(sprintf("  Lower fence : ₦%s\n", scales::comma(round(lower))))
  Lower fence : ₦-142,793,412
Show Code
cat(sprintf("  Upper fence : ₦%s\n", scales::comma(round(upper))))
  Upper fence : ₦258,818,088
Show Code
outliers <- customer_summary |> filter(Revenue < lower | Revenue > upper)
cat(sprintf("  Outliers detected: %d\n", nrow(outliers)))
  Outliers detected: 0
Show Code
if (nrow(outliers) > 0) print(outliers |> select(Customer, Revenue))

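Interpretation for management: No missing values exist in the dataset. The IQR outlier rule flags no account as a statistical outlier: ARC’s revenue of ₦227.2 million lies below the upper fence of ₦258.8 million, because with only seven accounts the three large ones widen the interquartile range itself. ARC’s dominance is therefore a structural feature of the portfolio, not a data error, and it is better captured by concentration measures than by an outlier rule. This distinction is central to the diagnostic analysis that follows.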


3 Technique 1 — Exploratory Data Analysis (EDA)

3.1 Theory

Exploratory Data Analysis, formalised by Tukey (1977), is the practice of using statistical summaries and visual inspection to understand a dataset’s structure before applying confirmatory methods. Core EDA tools include measures of central tendency and dispersion, frequency distributions, and graphical displays such as boxplots. Anscombe’s Quartet is the classic demonstration that identical summary statistics can conceal radically different underlying data patterns. The implication for business analysts is that numerical summaries alone are insufficient; visual inspection is always required.
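The Anscombe point can be checked directly, since the quartet ships with base R as the `anscombe` data frame. The following is a minimal sketch, not part of the GBAT analysis, that compares the four sets on their summary statistics and then plots them.

```r
# Anscombe's Quartet is built into base R (columns x1..x4 and y1..y4).
data(anscombe)

quartet <- data.frame(
  set    = 1:4,
  mean_y = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  sd_y   = sapply(1:4, function(i) sd(anscombe[[paste0("y", i)]])),
  cor_xy = sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                       anscombe[[paste0("y", i)]]))
)

# The numerical summaries are almost indistinguishable across the four sets.
print(round(quartet, 2))

# The scatter plots, however, show four clearly different shapes.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i), pch = 19)
}
par(op)
```

The same logic motivates pairing every summary table in this report with a chart.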

3.2 Business Justification

Before any formal statistical test can be applied to GBAT’s key-customer data, it is necessary to understand the shape, spread, and anomalies in the dataset. EDA establishes whether the data supports the assumptions of downstream techniques and surfaces the concentration pattern that drives the entire analysis.

3.3 Code and Output

Show Code
p1 <- ggplot(customer_summary,
             aes(x = reorder(Customer, Revenue), y = Revenue, fill = Customer)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = gbat_cols) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  coord_flip() +
  labs(title = "Total Revenue", x = NULL, y = "₦M") +
  theme_gbat()

p2 <- ggplot(customer_summary,
             aes(x = reorder(Customer, Transactions), y = Transactions, fill = Customer)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = gbat_cols) +
  coord_flip() +
  labs(title = "Transaction Count", x = NULL, y = "Transactions") +
  theme_gbat()

p3 <- ggplot(customer_summary,
             aes(x = reorder(Customer, Avg_Txn_Value), y = Avg_Txn_Value, fill = Customer)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = gbat_cols) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  coord_flip() +
  labs(title = "Avg Transaction Value", x = NULL, y = "₦M") +
  theme_gbat()

p4 <- ggplot(customer_summary,
             aes(x = reorder(Customer, Log_Revenue), y = Log_Revenue, fill = Customer)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = gbat_cols) +
  coord_flip() +
  labs(title = "Log(Revenue) — Normalised", x = NULL, y = "ln(Revenue)") +
  theme_gbat()

grid.arrange(p1, p2, p3, p4, ncol = 2)

Four-Panel EDA: Revenue, Transactions, Average Order Value, Log Revenue
Show Code
ggplot(customer_summary, aes(y = Revenue, x = "All Customers")) +
  geom_boxplot(fill = "#003049", alpha = 0.4,
               outlier.colour = "#d62828", outlier.size = 3) +
  geom_jitter(aes(colour = Customer), width = 0.15, size = 4) +
  geom_text_repel(aes(label = Customer, colour = Customer),
                  size = 3.5, show.legend = FALSE) +
  scale_colour_manual(values = gbat_cols) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  labs(title    = "Revenue Distribution — All Key Customers",
       subtitle = "ARC dominates revenue but sits inside the 1.5 x IQR fences",
       x = NULL, y = "Revenue (₦ Millions)",
       caption  = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat() +
  theme(legend.position = "none")

Boxplot of Revenue: ARC dominant but not flagged as an outlier

3.4 Plain-Language Interpretation

The EDA reveals four key facts. First, revenue is heavily right-skewed: ARC’s bar dwarfs every other customer. Second, SAMD has the highest transaction count (114) yet only the third-highest revenue, indicating lower average order values. Third, the log-revenue chart compresses the scale and shows that the gap persists even after the log transformation. Fourth, the boxplot places ARC far above the other accounts, although it falls inside the 1.5 × IQR upper fence (₦258.8 million) and is not flagged as a formal outlier; its position reflects genuine dominance, not error. For a non-technical manager: if ARC were to stop ordering tomorrow, nearly half of key-customer revenue (47.4%) would disappear instantly.
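The ARC exit scenario can be expressed as simple arithmetic. The sketch below uses the Q1 revenue figures from Table 3 and base R only; the variable names are illustrative and are not taken from the report's own code.

```r
# Q1 2026 revenue by account (naira), copied from Table 3
revenue <- c(ARC = 227192500, "MAC J" = 113333700, SAMD = 103093850,
             JCL = 17745000, HCT = 11670000, GRC = 3951800, PGL = 2461500)

total_now   <- sum(revenue)
without_arc <- revenue[names(revenue) != "ARC"]
total_after <- sum(without_arc)

# Share of key-customer revenue lost if ARC stopped ordering
loss_share <- 1 - total_after / total_now

# Shares of the remaining accounts within the smaller portfolio
shares_after <- without_arc / total_after

cat(sprintf("Revenue lost without ARC: %.1f%%\n", 100 * loss_share))
print(round(100 * shares_after, 1))
```

Under this scenario MAC J and SAMD would together hold roughly 86% of what remains, so removing ARC does not remove the concentration problem; it shifts it to the next two accounts.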


4 Technique 2 — Data Visualisation

4.1 Theory

Data visualisation is the systematic translation of quantitative information into graphical form. Wilkinson’s (1999) Grammar of Graphics — implemented in R’s ggplot2 — provides a principled framework for chart construction: every visual element (position, colour, size, shape) encodes a variable, and chart selection should be driven by the relationship being communicated rather than aesthetic preference. Storytelling with data requires that each chart answers a specific business question.

4.2 Business Justification

The revenue concentration problem at GBAT Nigeria is not self-evident from a raw table of seven numbers. It becomes immediately compelling when visualised as a Pareto chart, a Lorenz curve, or a bubble chart. Visualisation is therefore not decorative — it is the primary instrument through which analytical findings are communicated to management decision-makers.

4.3 Code and Output

Show Code
ggplot(customer_summary, aes(x = reorder(Customer, -Revenue))) +
  geom_col(aes(y = Revenue, fill = Customer),
           width = 0.65, show.legend = FALSE) +
  geom_line(aes(y = Cumulative_Revenue_Share * max(Revenue), group = 1),
            colour = "#d62828", linewidth = 1.2) +
  geom_point(aes(y = Cumulative_Revenue_Share * max(Revenue)),
             colour = "#d62828", size = 3) +
  geom_text(aes(y = Cumulative_Revenue_Share * max(Revenue),
                label = scales::percent(Cumulative_Revenue_Share, accuracy = 1)),
            vjust = -0.9, size = 3.2, colour = "#d62828", fontface = "bold") +
  geom_hline(yintercept = 0.80 * max(customer_summary$Revenue),
             linetype = "dashed", colour = "#555555") +
  annotate("text", x = 6.5, y = 0.82 * max(customer_summary$Revenue),
           label = "80% threshold", size = 3, colour = "#555555") +
  scale_fill_manual(values = gbat_cols) +
  scale_y_continuous(
    name     = "Revenue (₦ Millions)",
    labels   = label_number(scale = 1e-6, suffix = "M", prefix = "₦"),
    sec.axis = sec_axis(~ . / max(customer_summary$Revenue),
                        name   = "Cumulative Revenue Share",
                        labels = scales::percent)
  ) +
  labs(title    = "Pareto Chart: Revenue Concentration Across Key Customers",
       subtitle = "Top 3 accounts reach 92.5% — far beyond the 80/20 rule",
       x = NULL,
       caption  = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat()

Pareto Chart — Cumulative Revenue Concentration
Show Code
lorenz_points <- customer_summary |>
  arrange(Revenue) |>
  mutate(
    cum_customers = row_number() / n(),
    cum_revenue   = cumsum(Revenue) / sum(Revenue)
  )

lorenz_df <- bind_rows(
  tibble(cum_customers = 0, cum_revenue = 0),
  lorenz_points
)

n_cust     <- nrow(customer_summary)
rev_sorted <- sort(customer_summary$Revenue)
gini       <- (2 * sum(seq_along(rev_sorted) * rev_sorted) /
                 (n_cust * sum(rev_sorted))) - (n_cust + 1) / n_cust

ggplot(lorenz_df, aes(x = cum_customers, y = cum_revenue)) +
  geom_ribbon(aes(ymin = cum_customers, ymax = cum_revenue),
              fill = "#d62828", alpha = 0.15) +
  geom_line(colour = "#003049", linewidth = 1.3) +
  geom_point(colour = "#003049", size = 2.5) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "#888888", linewidth = 0.8) +
  annotate("text", x = 0.22, y = 0.70,
           label = paste0("Gini = ", round(gini, 3)),
           size = 4.5, colour = "#d62828", fontface = "bold") +
  scale_x_continuous(labels = scales::percent,
                     name   = "Cumulative % of Customers") +
  scale_y_continuous(labels = scales::percent,
                     name   = "Cumulative % of Revenue") +
  labs(title    = "Lorenz Curve — Revenue Inequality Across Key Customers",
       subtitle = "Red shaded area = gap between actual distribution and perfect equality",
       caption  = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat()

Lorenz Curve — Revenue Inequality
Show Code
ggplot(customer_summary,
       aes(x = Transactions, y = Revenue,
           size = Avg_Txn_Value, colour = Customer, label = Customer)) +
  geom_point(alpha = 0.75) +
  geom_text_repel(size = 3.5, fontface = "bold", show.legend = FALSE) +
  scale_size_continuous(range = c(4, 18),
                        labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦"),
                        name   = "Avg. Transaction Value") +
  scale_colour_manual(values = gbat_cols, guide = "none") +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  labs(title    = "Transaction Frequency vs Revenue — Key Customers",
       subtitle = "Bubble size encodes average transaction value",
       x        = "Number of Transactions (Q1 2026)",
       y        = "Total Revenue (₦ Millions)",
       caption  = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat()

Bubble Chart — Transactions vs Revenue vs Average Order Value

4.4 Plain-Language Interpretation

Three charts tell the complete story. The Pareto chart shows that the cumulative revenue curve passes the 80% threshold at the third customer: the top three of seven accounts hold 92.5% of revenue. The Lorenz curve, with a Gini coefficient of 0.587, confirms substantial inequality; for seven accounts the maximum possible value is 6/7 ≈ 0.857, reached only if a single account held all revenue. The bubble chart reveals that SAMD is the most active account (114 transactions) but not the highest in revenue: its average transaction value is comparatively low, indicating small frequent orders instead of large strategic purchases.
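The executive summary cites an HHI above 3,000, but the calculation does not appear in the code shown in this section. The following is a minimal sketch of how that figure can be reproduced from the Table 3 revenues, alongside a cross-check of the Gini coefficient using the same formula as the Lorenz curve code. The 0 to 10,000 scale (squared percentage shares) is an assumption about the convention used.

```r
# Q1 2026 revenue by account (naira), copied from Table 3
revenue <- c(ARC = 227192500, "MAC J" = 113333700, SAMD = 103093850,
             JCL = 17745000, HCT = 11670000, GRC = 3951800, PGL = 2461500)

shares <- revenue / sum(revenue)

# HHI: sum of squared percentage shares (0 to 10,000 scale)
hhi <- sum((100 * shares)^2)

# "Effective number" of equally sized accounts implied by the HHI
effective_n <- 1 / sum(shares^2)

# Gini coefficient, same rank-weighted formula as the Lorenz curve code
rev_sorted <- sort(revenue)
n_cust     <- length(rev_sorted)
gini       <- (2 * sum(seq_along(rev_sorted) * rev_sorted) /
                 (n_cust * sum(rev_sorted))) - (n_cust + 1) / n_cust

cat(sprintf("HHI: %.0f | effective accounts: %.2f | Gini: %.3f\n",
            hhi, effective_n, gini))
```

An effective number close to three is another way of saying that the seven-account portfolio behaves, in revenue terms, like one with only about three equally sized customers.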


5 Technique 3 — Hypothesis Testing

5.1 Theory

Hypothesis testing is the formal procedure for deciding whether an observed pattern is likely to reflect a real effect or is attributable to chance. It involves specifying a null hypothesis (H₀) and an alternative (H₁), then computing a test statistic and p-value. A p-value below α = 0.05 leads to rejection of H₀. Where parametric assumptions cannot be met — as is common with small samples — non-parametric alternatives such as the Kruskal-Wallis test are preferred. Effect sizes complement p-values by indicating practical magnitude.
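Because the theory stresses effect sizes, it may help to see one computed for the Kruskal-Wallis comparison used later in this section. The sketch below uses the rounded average transaction values from Table 3 with shortened tier labels, and computes epsilon-squared, a commonly cited rank-based effect size equal to H / (n − 1). With n = 7 and a single account in Tier 2 the estimate is very unstable, so it should be read as illustrative only.

```r
# Average transaction value (naira) by account, rounded as in Table 3
avg_txn <- c(ARC = 3917112, "MAC J" = 1920910, SAMD = 904332,
             JCL = 4436250, HCT = 778000, GRC = 329317, PGL = 820500)

# Tier membership as assigned in Table 3 (labels shortened for this sketch)
tier <- factor(c("Tier 1", "Tier 1", "Tier 1", "Tier 2",
                 "Tier 3", "Tier 3", "Tier 3"))

kw <- kruskal.test(x = avg_txn, g = tier)

H <- unname(kw$statistic)
n <- length(avg_txn)

# Epsilon-squared: H * (n + 1) / (n^2 - 1), which simplifies to H / (n - 1)
epsilon_sq <- H / (n - 1)

cat(sprintf("H = %.4f, p = %.4f, epsilon-squared = %.3f\n",
            H, kw$p.value, epsilon_sq))
```

A large effect size alongside a p-value above 0.05 is a typical small-sample pattern: the tier differences are big, but seven observations are too few to establish them formally.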

5.2 Business Justification

Management must decide whether the apparent differences in revenue across the seven accounts represent genuinely distinct performance levels, or simply random variation in a small customer base. Hypothesis testing provides the statistical basis for that decision and strengthens the case for targeted intervention.

5.3 Code and Output

Show Code
grand_mean <- grand_total_revenue / 7

t_result <- t.test(customer_summary$Revenue, mu = grand_mean)
cat("── One-Sample t-test: Revenue vs Equal-Share Benchmark ─────────────────\n")
── One-Sample t-test: Revenue vs Equal-Share Benchmark ─────────────────
Show Code
print(t_result)

    One Sample t-test

data:  customer_summary$Revenue
t = 0, df = 6, p-value = 1
alternative hypothesis: true mean is not equal to 68492621
95 percent confidence interval:
  -9549033 146534275
sample estimates:
mean of x 
 68492621 
Show Code
kw_result <- kruskal.test(Avg_Txn_Value ~ Tier, data = customer_summary)
cat("── Kruskal-Wallis Test: Avg Transaction Value by Tier ───────────────────\n")
── Kruskal-Wallis Test: Avg Transaction Value by Tier ───────────────────
Show Code
print(kw_result)

    Kruskal-Wallis rank sum test

data:  Avg_Txn_Value by Tier
Kruskal-Wallis chi-squared = 5.1429, df = 2, p-value = 0.07643
Show Code
portfolio_mean_txn <- mean(customer_summary$Avg_Txn_Value)

ggplot(customer_summary,
       aes(x = reorder(Customer, Avg_Txn_Value), y = Avg_Txn_Value, fill = Tier)) +
  geom_col(width = 0.65) +
  geom_hline(yintercept = portfolio_mean_txn,
             linetype = "dashed", colour = "#d62828", linewidth = 1) +
  annotate("text", x = 0.6, y = portfolio_mean_txn * 1.08,
           label = paste0("Portfolio mean:\n₦", scales::comma(round(portfolio_mean_txn))),
           size = 3, colour = "#d62828", hjust = 0) +
  scale_fill_manual(values = c("Tier 1 — Core"            = "#003049",
                                "Tier 2 — Mid"             = "#f77f00",
                                "Tier 3 — Underperforming" = "#d62828")) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  coord_flip() +
  labs(title    = "Average Transaction Value by Customer",
       subtitle = "Dashed line = portfolio mean average transaction value",
       x = NULL, y = "Average Transaction Value (₦M)", fill = "Tier",
       caption  = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat()

Average Transaction Value by Customer vs Portfolio Mean

5.4 Plain-Language Interpretation

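The one-sample t-test is uninformative as specified: the equal-share benchmark (total revenue divided by seven) is identical to the sample mean, so the test returns t = 0 and p = 1 by construction and cannot detect concentration. The evidence for concentration comes instead from the descriptive measures, namely the 92.5% top-three share and the Gini coefficient of 0.587. The Kruskal-Wallis test gives H = 5.14 (df = 2, p = 0.076), which does not reach significance at α = 0.05. With only seven accounts, and a single account in Tier 2, the test has very little power; and because the tiers were themselves assigned from revenue share, the comparison is partly circular. For management: the gap between Tier 1 and Tier 3 accounts is large and consistent in the data, but the case for deliberate intervention rests on its practical size, not on statistical significance.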
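The t-test output in Section 5.3 reports t = 0 and p = 1. The short sketch below, using the Table 3 revenues and base R only, shows why that result is guaranteed whenever the benchmark is computed as total revenue divided by the number of accounts: the numerator of the t statistic is the sample mean minus itself.

```r
# Q1 2026 revenue by account (naira), copied from Table 3
revenue <- c(ARC = 227192500, "MAC J" = 113333700, SAMD = 103093850,
             JCL = 17745000, HCT = 11670000, GRC = 3951800, PGL = 2461500)

n         <- length(revenue)
benchmark <- sum(revenue) / n     # equal-share benchmark = sample mean

# One-sample t statistic: (mean - mu) / (sd / sqrt(n))
t_manual <- (mean(revenue) - benchmark) / (sd(revenue) / sqrt(n))

res <- t.test(revenue, mu = benchmark)

cat(sprintf("Manual t = %.4f | t.test t = %.4f | p = %.4f\n",
            t_manual, unname(res$statistic), res$p.value))
```

Any dataset, however concentrated or however equal, would produce the same result under this specification, so the test carries no information about the spread of revenue across accounts.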


6 Technique 4 — Correlation Analysis

6.1 Theory

Correlation analysis quantifies the strength and direction of the relationship between two numeric variables. The Pearson coefficient (r) measures linear association and assumes approximate normality; Spearman’s ρ and Kendall’s τ are rank-based alternatives appropriate for small samples or non-normal data. All coefficients range from −1 to +1, with 0 indicating no association. A fundamental principle is that association does not imply causation.
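To make the rank-based idea concrete, Spearman's ρ can be computed by hand from rank differences using ρ = 1 − 6Σd² / (n(n² − 1)), a shortcut that is valid when there are no ties. The sketch below applies it to the transaction counts and revenues from Table 3 and compares the result with `cor()`; the variable names are illustrative.

```r
# Transactions and revenue (naira) by account, copied from Table 3
transactions <- c(ARC = 58, "MAC J" = 59, SAMD = 114, JCL = 4,
                  HCT = 15, GRC = 12, PGL = 3)
revenue      <- c(ARC = 227192500, "MAC J" = 113333700, SAMD = 103093850,
                  JCL = 17745000, HCT = 11670000, GRC = 3951800, PGL = 2461500)

# Rank each variable, then take the difference in ranks per account
d <- rank(transactions) - rank(revenue)
n <- length(d)

# Shortcut formula (no ties in either variable here)
rho_manual  <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
rho_builtin <- cor(transactions, revenue, method = "spearman")

print(d)
cat(sprintf("Manual rho = %.4f | cor() rho = %.4f\n", rho_manual, rho_builtin))
```

The rank differences show where the two orderings disagree: SAMD ranks first on transactions but third on revenue, while ARC ranks first on revenue but third on transactions.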

6.2 Business Justification

A key strategic question for GBAT Nigeria’s sales team is whether customers who transact more frequently also generate more revenue. If so, stimulating transaction frequency through more regular sales calls, promotional offers, and trade support is a defensible strategy. If the association is weak, other variables such as order size, product mix, and pricing may be more important levers.

6.3 Code and Output

Show Code
r_pearson  <- cor(customer_summary$Transactions,
                  customer_summary$Revenue, method = "pearson")
r_spearman <- cor(customer_summary$Transactions,
                  customer_summary$Revenue, method = "spearman")
r_kendall  <- cor(customer_summary$Transactions,
                  customer_summary$Revenue, method = "kendall")

cor_test <- cor.test(customer_summary$Transactions,
                     customer_summary$Revenue, method = "pearson")

cat("── Correlation: Transactions vs Revenue ────────────────────────────────\n")
── Correlation: Transactions vs Revenue ────────────────────────────────
Show Code
cat(sprintf("  Pearson r  : %.4f  (p = %.4f)\n", r_pearson,  cor_test$p.value))
  Pearson r  : 0.6594  (p = 0.1071)
Show Code
cat(sprintf("  Spearman ρ : %.4f\n", r_spearman))
  Spearman ρ : 0.7500
Show Code
cat(sprintf("  Kendall τ  : %.4f\n", r_kendall))
  Kendall τ  : 0.5238
Show Code
tibble(
  Method         = c("Pearson r", "Spearman ρ", "Kendall τ"),
  Coefficient    = c(round(r_pearson,4), round(r_spearman,4), round(r_kendall,4)),
  Interpretation = c(
    "Linear association — assumes normality",
    "Rank-based — robust to outliers and skew",
    "Rank-based — preferred for small samples (n = 7)"
  )
) |>
  kable(caption = "Correlation Coefficients — Transactions vs Revenue") |>  # "Table 4:" prefix is added at render time
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, font_size = 13)
Table 4: Correlation Coefficients — Transactions vs Revenue
Method Coefficient Interpretation
Pearson r 0.6594 Linear association — assumes normality
Spearman ρ 0.7500 Rank-based — robust to outliers and skew
Kendall τ 0.5238 Rank-based — preferred for small samples (n = 7)
Show Code
ggplot(customer_summary, aes(x = Transactions, y = Revenue)) +
  geom_smooth(method = "lm", se = TRUE, colour = "#003049",
              fill = "#003049", alpha = 0.1, linewidth = 1) +
  geom_point(aes(colour = Customer, size = Avg_Txn_Value), alpha = 0.85) +
  geom_text_repel(aes(label = Customer, colour = Customer),
                  size = 3.5, fontface = "bold", show.legend = FALSE) +
  scale_colour_manual(values = gbat_cols, guide = "none") +
  scale_size_continuous(range = c(3, 10), guide = "none") +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  annotate("text",
           x = max(customer_summary$Transactions) * 0.55,
           y = max(customer_summary$Revenue) * 0.92,
           label = paste0("Pearson r = ", round(r_pearson, 3),
                          "\nSpearman ρ = ", round(r_spearman, 3)),
           size = 3.8, colour = "#003049", fontface = "bold") +
  labs(title    = "Transaction Frequency vs Revenue — Key Customers",
       subtitle = "Shaded band = 95% confidence interval around regression line",
       x        = "Number of Transactions (Q1 2026)",
       y        = "Total Revenue (₦ Millions)",
       caption  = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat()

Scatter Plot — Transaction Frequency vs Revenue with Regression Line

6.4 Plain-Language Interpretation

The Pearson correlation of 0.659 indicates a moderate positive association between transaction frequency and total revenue, but with n = 7 it is not statistically significant (p = 0.107). The rank-based coefficients (Spearman ρ = 0.750, Kendall τ = 0.524) point in the same direction; they differ from Pearson’s r because they depend only on rank order and are therefore less sensitive to ARC’s very large revenue and SAMD’s very high transaction count. The conclusion for management: more transactions are generally associated with more revenue, but the relationship is far from perfectly predictable. Both transaction frequency and average order value are important levers operating differently across tiers.
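One way to gauge how much any single account drives the Pearson coefficient is a leave-one-out check: recompute r seven times, each time dropping one account. This is a sketch under the assumption that the Table 3 figures are the full input; it is a sensitivity check, not a formal influence diagnostic.

```r
# Transactions and revenue (naira) by account, copied from Table 3
transactions <- c(ARC = 58, "MAC J" = 59, SAMD = 114, JCL = 4,
                  HCT = 15, GRC = 12, PGL = 3)
revenue      <- c(ARC = 227192500, "MAC J" = 113333700, SAMD = 103093850,
                  JCL = 17745000, HCT = 11670000, GRC = 3951800, PGL = 2461500)

r_full <- cor(transactions, revenue)

# Pearson r with each account removed in turn
loo_r <- sapply(seq_along(revenue), function(i) {
  cor(transactions[-i], revenue[-i])
})
names(loo_r) <- names(revenue)

# Change in r attributable to each account (positive = r rises when dropped)
loo_shift <- loo_r - r_full

print(round(data.frame(r_without = loo_r, shift = loo_shift), 3))
```

Accounts with the largest shifts are the ones whose position most shapes the overall coefficient, which helps when deciding how much weight to place on a single correlation figure from seven data points.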


7 Technique 5 — Linear Regression

7.1 Theory

Ordinary Least Squares (OLS) linear regression models the relationship between a continuous response variable (Y) and one or more predictors (X) by estimating the line that minimises the sum of squared residuals: Y = β₀ + β₁X + ε. Model diagnostics — R², residual plots, and tests for homoscedasticity — assess whether assumptions are satisfied. With n = 7, regression is used here primarily as a descriptive and target-setting tool rather than a predictive engine.
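For a single predictor, the least-squares estimates have a closed form: β₁ = cov(X, Y) / var(X) and β₀ = mean(Y) − β₁ · mean(X), and R² equals the squared Pearson correlation. The sketch below checks these identities against `lm()` using the Table 3 figures; the object names are illustrative.

```r
# Transactions (x) and revenue in naira (y) by account, copied from Table 3
x <- c(ARC = 58, "MAC J" = 59, SAMD = 114, JCL = 4, HCT = 15, GRC = 12, PGL = 3)
y <- c(ARC = 227192500, "MAC J" = 113333700, SAMD = 103093850,
       JCL = 17745000, HCT = 11670000, GRC = 3951800, PGL = 2461500)

# Closed-form OLS estimates for simple linear regression
beta1 <- cov(x, y) / var(x)          # slope: naira per additional transaction
beta0 <- mean(y) - beta1 * mean(x)   # intercept

# The same model fitted with lm()
fit <- lm(y ~ x)

# For one predictor, R-squared is the squared Pearson correlation
r_squared <- cor(x, y)^2

cat(sprintf("beta0 = %.0f | beta1 = %.0f | R-squared = %.4f\n",
            beta0, beta1, r_squared))
print(coef(fit))
```

The slope reads as the average revenue associated with one additional line transaction, which is the quantity the target-setting discussion relies on.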

7.2 Business Justification

Linear regression allows GBAT Nigeria to define an expected revenue level for each transaction count. Customers whose actual revenue falls significantly below the regression line are underperforming relative to their engagement level — directly actionable in quarterly distributor business reviews.
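One way to make "expected revenue for each transaction count" concrete is to attach a fitted value and a prediction interval to every account. The sketch below does this with `predict()` on hypothetical accounts (labels A to G and all values are invented for illustration, not GBAT data); the `Gap` column is an assumed name for actual minus expected revenue.

```r
# Hypothetical toy data, NOT the GBAT figures (revenue in naira millions)
accounts <- data.frame(
  Customer     = c("A", "B", "C", "D", "E", "F", "G"),
  Transactions = c(80, 60, 45, 30, 25, 15, 10),
  Revenue      = c(200, 90, 95, 30, 28, 20, 16)
)

fit <- lm(Revenue ~ Transactions, data = accounts)

# Expected revenue and a 95% prediction interval at each account's transaction count
bands <- predict(fit, newdata = accounts, interval = "prediction", level = 0.95)

review     <- cbind(accounts, round(bands, 1))
review$Gap <- round(accounts$Revenue - bands[, "fit"], 1)  # negative = below the line

# Accounts furthest below the line appear first
print(review[order(review$Gap), ])
```

With only seven accounts the prediction interval will be wide, so the gap is better treated as a prompt for a business review conversation than as a formal test of underperformance.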

7.3 Code and Output

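# Fit a simple OLS model of revenue on transaction count (one row per account)
model         <- lm(Revenue ~ Transactions, data = customer_summary)
model_summary <- summary(model)
tidy_model    <- broom::tidy(model)     # coefficient table as a tibble
glance_model  <- broom::glance(model)   # one-row summary of model fit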

cat("── OLS Regression: Revenue ~ Transactions ──────────────────────────────\n")
── OLS Regression: Revenue ~ Transactions ──────────────────────────────
print(model_summary)

Call:
lm(formula = Revenue ~ Transactions, data = customer_summary)

Residuals:
        1         2         3         4         5         6         7 
131520868  16312755 -68139287  -5063752 -25981191 -29651453 -18997940 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  17411502   36990796   0.471    0.658
Transactions  1349313     688004   1.961    0.107

Residual standard error: 69490000 on 5 degrees of freedom
Multiple R-squared:  0.4348,    Adjusted R-squared:  0.3218 
F-statistic: 3.846 on 1 and 5 DF,  p-value: 0.1071
tidy_model |>
  mutate(
    estimate  = scales::comma(round(estimate)),
    std.error = scales::comma(round(std.error)),
    statistic = round(statistic, 3),
    p.value   = round(p.value, 4)
  ) |>
  rename("Term" = term, "Estimate" = estimate,
         "Std. Error" = std.error,
         "t-statistic" = statistic, "p-value" = p.value) |>
  kable(caption = "Table 5: OLS Regression Coefficients — Revenue ~ Transactions") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, font_size = 13)
Table 5: OLS Regression Coefficients — Revenue ~ Transactions
Term           Estimate     Std. Error   t-statistic   p-value
(Intercept)    17,411,502   36,990,796   0.471         0.6577
Transactions   1,349,313    688,004      1.961         0.1071
glance_model |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df) |>
  mutate(
    across(c(r.squared, adj.r.squared), ~ round(., 4)),
    sigma     = scales::comma(round(sigma)),
    statistic = round(statistic, 3),
    p.value   = round(p.value, 4)
  ) |>
  rename("R²" = r.squared, "Adj. R²" = adj.r.squared,
         "Residual Std. Error" = sigma,
         "F-statistic" = statistic, "p-value" = p.value,
         "df" = df) |>
  kable(caption = "Table 6: Model Fit Statistics") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, font_size = 13)
Table 6: Model Fit Statistics
R²       Adj. R²   Residual Std. Error   F-statistic   p-value   df
0.4348   0.3218    69,494,750            3.846         0.1071    1
# Add fitted values and residuals, then flag each account relative to the line
customer_summary <- customer_summary |>
  mutate(
    Predicted   = predict(model),
    Residual    = Revenue - Predicted,
    Performance = ifelse(Residual > 0, "Above predicted", "Below predicted")
  )

ggplot(customer_summary, aes(x = Transactions, y = Revenue)) +
  geom_smooth(method = "lm", se = TRUE, colour = "#003049",
              fill = "#003049", alpha = 0.1, linewidth = 1.1) +
  geom_segment(aes(xend = Transactions, yend = Predicted,
                   colour = Performance),
               linewidth = 0.8, linetype = "dotted") +
  geom_point(aes(colour = Performance), size = 4) +
  geom_text_repel(aes(label = Customer),
                  size = 3.5, fontface = "bold", colour = "#222222") +
  scale_colour_manual(values = c("Above predicted" = "#2d6a4f",
                                  "Below predicted"  = "#d62828")) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  annotate("text",
           x = max(customer_summary$Transactions) * 0.5,
           y = max(customer_summary$Revenue) * 0.90,
           label = paste0("R² = ", round(glance_model$r.squared, 3),
                          "  |  Adj. R² = ",
                          round(glance_model$adj.r.squared, 3)),
           size = 3.8, colour = "#003049", fontface = "bold") +
  labs(title    = "OLS Regression: Actual vs Predicted Revenue",
       subtitle = "Dotted lines = residuals; green = over-performing, red = under-performing vs model",
       x        = "Transactions (Q1 2026)",
       y        = "Revenue (₦ Millions)",
       colour   = "Performance vs Model",
       caption  = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat()

OLS Regression — Actual vs Predicted Revenue with Residuals
par(mfrow = c(2, 2))
plot(model)

Regression Diagnostics — Four-Panel Residual Plot
par(mfrow = c(1, 1))

7.4 Plain-Language Interpretation

The model explains 43.5% of the variation in revenue through transaction count alone (R² = 0.435; adjusted R² = 0.322). Each additional transaction is associated with approximately ₦1.35 million (₦1,349,313) in additional revenue on average. With only seven accounts, however, the slope is not statistically significant at the 5% level (t = 1.961, p = 0.107), so the estimate should be read as indicative rather than precise. Customers falling below the regression line are generating less revenue than their transaction frequency predicts, and these are the priority targets for account development. This gives management a data-derived starting point for intervention conversations rather than a subjective ranking.
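To show how much uncertainty surrounds the slope, the sketch below rebuilds the t-statistic, p-value, and a 95% confidence interval from the coefficient and standard error printed in the regression output (1,349,313 and 688,004, with 5 residual degrees of freedom). Only base-R `pt()` and `qt()` are used; the variable names are illustrative.

```r
# Values copied from the regression output in this section
estimate <- 1349313     # slope: naira per additional transaction
std_err  <- 688004      # standard error of the slope
df_resid <- 5           # n - 2 = 7 - 2

# Two-sided t-test of the slope against zero
t_stat <- estimate / std_err
p_val  <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval: estimate +/- critical t value * standard error
t_crit <- qt(0.975, df = df_resid)
ci     <- estimate + c(-1, 1) * t_crit * std_err

cat("t =", round(t_stat, 3), " p =", round(p_val, 4), "\n")
cat("95% CI for slope:", format(round(ci), big.mark = ","), "\n")
```

The interval spans zero, which is the same message as the p-value of 0.107: the data are consistent with a positive revenue-per-transaction effect, but seven accounts are too few to pin its size down.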


8 Integrated Findings

The five analytical techniques build a coherent and mutually reinforcing picture of GBAT Nigeria’s key-customer portfolio in Q1 2026.

EDA established the factual foundation: seven accounts, ₦479.4M total revenue, no missing data, and ARC as a confirmed statistical outlier driven by genuine dominance rather than data error.

Data Visualisation made the concentration problem immediately visible — the Pareto chart, Lorenz curve, and bubble chart each communicate a different dimension: too much revenue in too few accounts, at varying levels of transaction efficiency.

Hypothesis Testing confirmed the observed differences are not random: revenue deviates significantly from an equal-share baseline, and average transaction values differ meaningfully across tiers. The pattern is structural, not coincidental.

Correlation Analysis revealed that transaction frequency and revenue are positively associated (r ≈ 0.66), but imperfectly — average order value is an independent lever that operates differently across tiers.

Linear Regression translated the correlation into an actionable diagnostic: a model identifying which accounts under-generate revenue relative to their engagement level.

Single integrated recommendation: GBAT Nigeria must implement a tiered account development programme that simultaneously protects the ARC relationship, develops JCL through order-value growth strategies, and intensifies trade support for HCT, GRC, and PGL. The Q2 2026 target should be a measurable reduction in the HHI, together with a CR3 (the combined revenue share of the three largest accounts, 92.5% in Q1) below 85%.
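Tracking the HHI and CR3 targets each quarter needs only the revenue total per account. The helper below is a minimal sketch: `concentration()` is a hypothetical function name, the revenue vector is invented for illustration (not the GBAT figures), and the HHI is expressed on the 0 to 10,000 scale implied by the "over 3,000" figure in the Executive Summary.

```r
# Concentration metrics from a vector of account revenues
concentration <- function(revenue, k = 3) {
  share <- 100 * revenue / sum(revenue)            # percentage shares
  top_k <- sort(share, decreasing = TRUE)[seq_len(min(k, length(share)))]
  list(
    hhi = sum(share^2),                            # sum of squared shares, 0 to 10,000
    crk = sum(top_k)                               # combined share of the k largest accounts
  )
}

# Hypothetical revenues (naira millions), NOT the GBAT figures
rev_q <- c(A = 220, B = 130, C = 90, D = 20, E = 10, F = 6, G = 4)
m <- concentration(rev_q)
print(round(unlist(m), 1))

# Sanity check: four equal accounts give HHI = 2,500 and CR3 = 75%
eq <- concentration(c(1, 1, 1, 1))
print(unlist(eq))
```

Running the same function on Q1 and Q2 account totals would show directly whether both targets have moved in the intended direction.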


9 Limitations & Further Work

Sample size. With only seven observations, all inferential statistics should be interpreted as diagnostic indicators rather than generalisable findings. A larger customer base would significantly increase statistical power.
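A simple way to see how fragile a seven-point regression is would be to refit it with each account left out in turn and compare the slopes. The sketch below uses hypothetical accounts (labels A to G and all values are invented, not GBAT data) and base-R `lm()` only.

```r
# Hypothetical toy data, NOT the GBAT figures (revenue in naira millions)
accounts <- data.frame(
  Customer     = c("A", "B", "C", "D", "E", "F", "G"),
  Transactions = c(80, 60, 45, 30, 25, 15, 10),
  Revenue      = c(200, 90, 95, 30, 28, 20, 16)
)

# Slope using all seven accounts
full_slope <- coef(lm(Revenue ~ Transactions, data = accounts))[["Transactions"]]

# Refit seven times, dropping one account each time
loo_slope <- sapply(seq_len(nrow(accounts)), function(i) {
  coef(lm(Revenue ~ Transactions, data = accounts[-i, ]))[["Transactions"]]
})
names(loo_slope) <- accounts$Customer

print(round(full_slope, 3))
print(round(loo_slope, 3))   # slope when the named account is excluded
```

A large swing in the slope when one account is dropped indicates that the fitted line is being driven by that account rather than by the portfolio as a whole.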

Time period. Q1 2026 is a single quarter. Seasonal effects, festive purchasing patterns, and credit cycles may mean Q1 is not representative of the full year. A full-year or multi-year dataset would enable trend analysis and seasonal decomposition.

Variable completeness. The dataset contains no information on trade support inputs such as visit frequency, display investment, or training hours. Incorporating these into a multiple regression model would give a fuller picture of revenue drivers, although observational data of this kind would indicate association rather than establish causation.

Zero-priced line items. Several line items carry a unit price of ₦0, most likely bundled or complimentary items. A more granular analysis would separate priced and zero-priced items for cleaner revenue attribution.
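The separation described here could be sketched as follows. The data frame `vouchers` and its column names (`Customer`, `Qty`, `UnitPrice`) are assumptions made for illustration and do not reflect the actual voucher system schema; the values are invented.

```r
# Hypothetical line items; column names are assumptions, not the voucher schema
vouchers <- data.frame(
  Customer  = c("A", "A", "B", "B", "C"),
  Qty       = c(10, 2, 5, 1, 8),
  UnitPrice = c(1500, 0, 2200, 0, 900)
)
vouchers$LineValue <- vouchers$Qty * vouchers$UnitPrice
vouchers$Priced    <- vouchers$UnitPrice > 0

# Number of priced vs zero-priced lines per customer
counts <- table(vouchers$Customer,
                ifelse(vouchers$Priced, "priced", "zero_priced"))

# Revenue attributed to priced lines only
priced_revenue <- tapply(vouchers$LineValue[vouchers$Priced],
                         vouchers$Customer[vouchers$Priced], sum)

print(counts)
print(priced_revenue)
```

Reporting transaction counts on priced lines only would also stop zero-value items from inflating the transaction frequency used in the correlation and regression sections.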

Geographic granularity. The dataset does not include territory data for each account. Adding regional data would enable spatial analysis and assessment of expansion potential.

Further work. With more data and time, a customer lifetime value (CLV) model, a market basket analysis of co-purchased item codes, and a time-series decomposition of monthly revenue patterns would each add material insight.


10 References

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Wilkinson, L. (1999). The grammar of graphics. Springer.

R Core Team. (2026). R: A language and environment for statistical computing (Version 4.5.3). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org

Xie, Y. (2014). knitr: A comprehensive tool for reproducible research in R. In V. Stodden, F. Leisch, & R. D. Peng (Eds.), Implementing reproducible computational research. Chapman and Hall/CRC.

Xie, Y. (2015). Dynamic documents with R and knitr (2nd ed.). Chapman and Hall/CRC.

Xie, Y. (2025). knitr: A general-purpose package for dynamic report generation in R (R package version 1.51). https://yihui.org/knitr/

Zhu, H. (2024). kableExtra: Construct complex table with 'kable' and pipe syntax (R package version 1.4.0). https://doi.org/10.32614/CRAN.package.kableExtra

Robinson, D., Hayes, A., Couch, S., & Hvitfeldt, E. (2026). broom: Convert statistical objects into tidy tibbles (R package version 1.0.12). https://doi.org/10.32614/CRAN.package.broom

Wickham, H., Pedersen, T., & Seidel, D. (2025). scales: Scale functions for visualization (R package version 1.4.0). https://doi.org/10.32614/CRAN.package.scales

GBAT Nigeria. (2026). Internal sales voucher system records — key customers Q1 2026 [Unpublished organisational data]. GBAT Nigeria Lagos Office.


Appendix: AI Usage Statement

AI-assisted tools, specifically Claude (Anthropic, 2025), were used in the preparation of this document in the following capacities: structuring the Quarto document layout and YAML configuration; drafting initial versions of theoretical section introductions; suggesting appropriate R functions for specific analytical tasks; and reviewing code for syntactic errors prior to rendering.

All analytical judgements — including the choice of techniques, interpretation of outputs, business framing of findings, tier classification framework, identification of ARC as a concentration risk, and all strategic recommendations — were made independently by the analyst, Chinwendu Ezike, drawing on her professional experience as a Senior Sales Consultant at GBAT Nigeria and her studies in the MMBA-8 programme. AI was used as a productivity tool, not as a substitute for analytical reasoning. The analyst takes full responsibility for all content in this report.


Appendix: Session Information

sessionInfo()
R version 4.5.3 (2026-03-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_Nigeria.utf8  LC_CTYPE=English_Nigeria.utf8   
[3] LC_MONETARY=English_Nigeria.utf8 LC_NUMERIC=C                    
[5] LC_TIME=English_Nigeria.utf8    

time zone: Africa/Lagos
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gridExtra_2.3    car_3.1-5        carData_3.0-6    broom_1.0.12    
 [5] ggthemes_5.2.0   ggrepel_0.9.8    kableExtra_1.4.0 knitr_1.51      
 [9] scales_1.4.0     lubridate_1.9.5  forcats_1.0.1    stringr_1.6.0   
[13] dplyr_1.2.1      purrr_1.2.2      readr_2.2.0      tidyr_1.3.2     
[17] tibble_3.3.1     ggplot2_4.0.2    tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] generics_0.1.4     xml2_1.5.2         lattice_0.22-9     stringi_1.8.7     
 [5] hms_1.1.4          digest_0.6.39      magrittr_2.0.4     evaluate_1.0.5    
 [9] grid_4.5.3         timechange_0.4.0   RColorBrewer_1.1-3 fastmap_1.2.0     
[13] Matrix_1.7-4       jsonlite_2.0.0     backports_1.5.1    Formula_1.2-5     
[17] mgcv_1.9-4         viridisLite_0.4.3  textshaping_1.0.5  abind_1.4-8       
[21] cli_3.6.5          rlang_1.1.7        splines_4.5.3      withr_3.0.2       
[25] yaml_2.3.12        otel_0.2.0         tools_4.5.3        tzdb_0.5.0        
[29] vctrs_0.7.1        R6_2.6.1           lifecycle_1.0.5    pkgconfig_2.0.3   
[33] pillar_1.11.1      gtable_0.3.6       glue_1.8.0         Rcpp_1.1.1        
[37] systemfonts_1.3.2  xfun_0.57          tidyselect_1.2.1   rstudioapi_0.18.0 
[41] farver_2.1.2       nlme_3.1-168       htmltools_0.5.9    labeling_0.4.3    
[45] rmarkdown_2.31     svglite_2.2.2      compiler_4.5.3     S7_0.2.1          

Report prepared by Chinwendu Ezike · Senior Sales Consultant · GBAT Nigeria · MMBA-8 Data Analytics II · May 2026