Revenue Concentration & Portfolio Imbalance Analysis

MMBA-8 · Data Analytics II · Case Study 1

Author

Chinwendu Ezike

Published

May 13, 2026

Executive Summary

GBAT Nigeria is a principal manufacturer and distributor of building materials and sanitary ware operating across Lagos and Abuja. This report analyses the complete population of key-customer transactions for Q1 2026 — 265 line transactions across 7 sub-distributor accounts, totalling ₦479.4 million in revenue — sourced directly from the organisation’s internal sales voucher system.

The central business problem is dangerous revenue concentration: three customers (ARC, MAC J, and SAMD) account for 92.5% of all Q1 revenue, while the remaining four accounts contribute just 7.5%. A Herfindahl-Hirschman Index (HHI) above 3,000 and a Gini coefficient of 0.587 both point to a heavily concentrated portfolio. ARC alone, at 47.4% of revenue, represents a single point of failure for the entire portfolio.

Five analytical techniques — Exploratory Data Analysis, Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — are applied to establish the facts, test their significance, and model their implications.

The primary recommendation is a three-pillar strategy: protect the ARC relationship through contract formalisation; develop Tier 2 and Tier 3 accounts through structured trade support; and rebalance incentive structures to reward growth in underperforming accounts. Implementation should begin in Q2 2026.

Professional Disclosure

Field	Detail
Analyst	Chinwendu Ezike
Job Title	Senior Sales Consultant
Organisation	GBAT Nigeria — Principal Manufacturer
Sector	Building Materials · Sanitary Ware · Construction
Role in Data	Key Account & Distributor Account Management
Programme	MMBA-8 · Data Analytics II
Report Date	13 May 2026

0.1 Operational Relevance of Each Analytical Technique

Technique 1 — Exploratory Data Analysis (EDA). As a Senior Sales Consultant responsible for key account and distributor account management at GBAT Nigeria, I routinely interact with the raw output of our internal sales voucher system. EDA is the formalisation of what I do informally every working week: scanning transaction records for missing entries, unusually large or zero-value orders, and customers whose purchase frequency has dropped without explanation. Applying structured EDA to the Q1 2026 dataset — including missing-value checks, summary statistics, and outlier detection — transforms an instinctive review process into a reproducible, evidence-based audit. This is directly relevant to my role because it allows me to present defensible data to management rather than anecdotal observations.

Technique 2 — Data Visualisation. Communicating sales performance to both technical and non-technical stakeholders is a core part of my responsibilities. I regularly prepare performance summaries for my line manager, trade marketing colleagues, and distributor partners. Data visualisation — grounded in the grammar of graphics and chart selection principles — gives me the tools to move beyond raw tables and tell a coherent story with data. In this report, Pareto charts, donut charts, bubble plots, and Lorenz curves each serve a specific communicative purpose: making the revenue concentration problem immediately visible to a manager who may not have time to read a table of figures.

Technique 3 — Hypothesis Testing. A persistent challenge in distributor account management is distinguishing between genuine performance differences and random variation. When one distributor appears to be underperforming relative to another, the question is whether that gap is statistically meaningful or simply noise. Hypothesis testing — specifically a one-sample t-test and a Kruskal-Wallis non-parametric test given the small sample — provides a formal answer to that question. In practice, this strengthens my position when making the case to management that certain accounts require intervention, because the argument is grounded in statistical significance rather than personal judgement.

Technique 4 — Correlation Analysis. One of the most important strategic questions in my role is whether investing more resources in a distributor — through trade support visits, showroom equipment, product display, pricing guidance, or training — actually translates into higher revenue. Correlation analysis between transaction frequency and revenue value, using Pearson and Spearman coefficients appropriate to this dataset, begins to answer that question empirically. Understanding which input variables are associated with revenue outcomes is foundational to making resource allocation decisions that are evidence-based rather than relationship-driven.

Technique 5 — Linear Regression. Target-setting is a central activity in key account management: every quarter, I work with distributors to agree on revenue targets that are ambitious but realistic. Linear regression — modelling revenue as a function of transaction frequency — provides a principled basis for those targets. Rather than negotiating targets based on the prior year plus an arbitrary percentage uplift, a regression model reveals what revenue a distributor of a given transaction frequency should be generating, and allows me to identify accounts that are significantly under-performing relative to their own engagement level. This is directly actionable in quarterly business reviews.

1 Data Collection & Sampling

1.1 Source and Collection Method

The dataset used in this analysis is extracted from GBAT Nigeria’s internal sales voucher system — a transactional record system used by the Lagos office to log all sales to registered key customers. Each record corresponds to a single line item on a sales voucher and contains: transaction date, bill number, item code, quantity, unit, unit price (₦), and amount (₦). Data was extracted directly by the analyst in her capacity as Senior Sales Consultant, using read-only access to the voucher system. No third-party data collection instruments were used.

1.2 Sampling Frame and Sample Size

Parameter	Detail
Population	All key-customer transactions, Q1 2026
Sample type	Complete population census — not a sample
Records	265 line transactions
Customers	7 registered key sub-distributor accounts
Time period	1 January 2026 – 31 March 2026 (90 days)
Total revenue	₦479,448,350
Geography	Lagos and Abuja offices

Because this dataset constitutes the complete population of key-customer transactions for the period — not a randomly drawn sample — inferential statistics are applied here primarily as analytical and diagnostic tools rather than as instruments of generalisation to a wider population. Results describe Q1 2026 exactly; their applicability to future quarters is contingent on structural continuity in the customer base.

2 Data Description

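# Packages used by the code in this report; gbat_cols and theme_gbat() are
# assumed to be defined in a setup chunk that is not shown here.
library(tidyverse)   # tibble, dplyr, ggplot2
library(scales)      # comma(), percent(), label_number()
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), row_spec(), footnote()
library(gridExtra)   # grid.arrange()
library(ggrepel)     # geom_text_repel()

customer_summary <- tibble(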
  Customer     = c("ARC", "MAC J", "SAMD", "JCL", "GRC", "HCT", "PGL"),
  Transactions = c(58, 59, 114, 4, 12, 15, 3),
  Revenue      = c(227192500, 113333700, 103093850, 17745000, 3951800, 11670000, 2461500)
)

grand_total_revenue <- 479448350
grand_total_txn     <- 265

customer_summary <- customer_summary |>
  mutate(
    Revenue_Share            = Revenue / grand_total_revenue,
    Txn_Share                = Transactions / grand_total_txn,
    Avg_Txn_Value            = Revenue / Transactions,
    Log_Revenue              = log(Revenue),
    Tier = factor(
      case_when(
        Revenue_Share >= 0.20 ~ "Tier 1 — Core",
        Revenue_Share >= 0.03 ~ "Tier 2 — Mid",
        TRUE                  ~ "Tier 3 — Underperforming"
      ),
      levels = c("Tier 1 — Core", "Tier 2 — Mid", "Tier 3 — Underperforming")
    )
  ) |>
  arrange(desc(Revenue)) |>
  mutate(
    Cumulative_Revenue       = cumsum(Revenue),
    Cumulative_Revenue_Share = cumsum(Revenue_Share)
  )

stopifnot(
  "Revenue mismatch"      = abs(sum(customer_summary$Revenue) - grand_total_revenue) < 1,
  "Transaction mismatch"  = sum(customer_summary$Transactions) == grand_total_txn
)
cat("✔ Integrity check passed: revenue and transaction counts match source.\n")

✔ Integrity check passed: revenue and transaction counts match source.

cat(sprintf("  Total revenue     : ₦%s\n", scales::comma(grand_total_revenue)))

  Total revenue     : ₦479,448,350

cat(sprintf("  Total transactions: %d\n",  grand_total_txn))

  Total transactions: 265

cat(sprintf("  Customers         : %d\n",  nrow(customer_summary)))

  Customers         : 7

2.1 Variable Dictionary

tibble(
  Variable    = c("Customer","Transactions","Revenue","Revenue_Share",
                  "Txn_Share","Avg_Txn_Value","Log_Revenue","Tier"),
  Type        = c("Character","Integer","Numeric","Numeric",
                  "Numeric","Numeric","Numeric","Factor"),
  Description = c(
    "Key customer account code",
    "Number of line-item transactions in Q1 2026",
    "Total revenue generated (₦) in Q1 2026",
    "Customer revenue as proportion of grand total",
    "Customer transaction count as proportion of total",
    "Mean revenue per transaction (₦)",
    "Natural log of revenue — used to normalise skewed distribution",
    "Analyst-assigned performance tier (Core / Mid / Underperforming)"
  )
) |>
  kable(caption = "Table 1: Variable Dictionary") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 13)

Table 1: Variable Dictionary

Variable	Type	Description
Customer	Character	Key customer account code
Transactions	Integer	Number of line-item transactions in Q1 2026
Revenue	Numeric	Total revenue generated (₦) in Q1 2026
Revenue_Share	Numeric	Customer revenue as proportion of grand total
Txn_Share	Numeric	Customer transaction count as proportion of total
Avg_Txn_Value	Numeric	Mean revenue per transaction (₦)
Log_Revenue	Numeric	Natural log of revenue — used to normalise skewed distribution
Tier	Factor	Analyst-assigned performance tier (Core / Mid / Underperforming)

2.2 Summary Statistics

customer_summary |>
  select(Transactions, Revenue, Avg_Txn_Value, Revenue_Share) |>
  summary() |>
  kable(caption = "Table 2: Summary Statistics — Key Numeric Variables") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, font_size = 13)

Table 2: Summary Statistics — Key Numeric Variables

Transactions	Revenue	Avg_Txn_Value	Revenue_Share
Min. : 3.00	Min. : 2461500	Min. : 329317	Min. :0.005134
1st Qu.: 8.00	1st Qu.: 7810900	1st Qu.: 799250	1st Qu.:0.016291
Median : 15.00	Median : 17745000	Median : 904332	Median :0.037011
Mean : 37.86	Mean : 68492621	Mean :1872346	Mean :0.142857
3rd Qu.: 58.50	3rd Qu.:108213775	3rd Qu.:2919011	3rd Qu.:0.225705
Max. :114.00	Max. :227192500	Max. :4436250	Max. :0.473862

customer_summary |>
  select(Customer, Tier, Transactions, Revenue, Revenue_Share,
         Avg_Txn_Value, Cumulative_Revenue_Share) |>
  mutate(
    Revenue                  = scales::comma(Revenue),
    Revenue_Share            = scales::percent(Revenue_Share, accuracy = 0.1),
    Avg_Txn_Value            = scales::comma(round(Avg_Txn_Value)),
    Cumulative_Revenue_Share = scales::percent(Cumulative_Revenue_Share, accuracy = 0.1)
  ) |>
  rename(
    "Tier"             = Tier,
    "Transactions"     = Transactions,
    "Revenue (₦)"      = Revenue,
    "Rev. Share"       = Revenue_Share,
    "Avg Txn (₦)"      = Avg_Txn_Value,
    "Cumulative Share" = Cumulative_Revenue_Share
  ) |>
  kable(caption = "Table 3: Full Customer Performance Summary — Q1 2026") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, font_size = 13) |>
  row_spec(1:3, background = "#fff3cd") |>
  row_spec(4:7, background = "#f8d7da") |>
  footnote(general = "Yellow = Tier 1 Core. Red = Tier 2/3 underperforming.",
           general_title = "Note: ")

Table 3: Full Customer Performance Summary — Q1 2026

Customer	Tier	Transactions	Revenue (₦)	Rev. Share	Avg Txn (₦)	Cumulative Share
ARC	Tier 1 — Core	58	227,192,500	47.4%	3,917,112	47.4%
MAC J	Tier 1 — Core	59	113,333,700	23.6%	1,920,910	71.0%
SAMD	Tier 1 — Core	114	103,093,850	21.5%	904,332	92.5%
JCL	Tier 2 — Mid	4	17,745,000	3.7%	4,436,250	96.2%
HCT	Tier 3 — Underperforming	15	11,670,000	2.4%	778,000	98.7%
GRC	Tier 3 — Underperforming	12	3,951,800	0.8%	329,317	99.5%
PGL	Tier 3 — Underperforming	3	2,461,500	0.5%	820,500	100.0%
Note:
Yellow = Tier 1 Core. Red = Tier 2/3 underperforming.

2.3 Missing Value & Outlier Check

cat("── Missing values ───────────────────────────────────\n")

── Missing values ───────────────────────────────────

customer_summary |>
  select(Customer, Transactions, Revenue, Avg_Txn_Value) |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  print()

# A tibble: 1 × 4
  Customer Transactions Revenue Avg_Txn_Value
     <int>        <int>   <int>         <int>
1        0            0       0             0

Q1_rev  <- quantile(customer_summary$Revenue, 0.25)
Q3_rev  <- quantile(customer_summary$Revenue, 0.75)
IQR_rev <- Q3_rev - Q1_rev
lower   <- Q1_rev - 1.5 * IQR_rev
upper   <- Q3_rev + 1.5 * IQR_rev

cat(sprintf("\n── Revenue outlier bounds (IQR rule) ───────────────\n"))


── Revenue outlier bounds (IQR rule) ───────────────

cat(sprintf("  Lower fence : ₦%s\n", scales::comma(round(lower))))

  Lower fence : ₦-142,793,412

cat(sprintf("  Upper fence : ₦%s\n", scales::comma(round(upper))))

  Upper fence : ₦258,818,088

outliers <- customer_summary |> filter(Revenue < lower | Revenue > upper)
cat(sprintf("  Outliers detected: %d\n", nrow(outliers)))

  Outliers detected: 0

if (nrow(outliers) > 0) print(outliers |> select(Customer, Revenue))

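Interpretation for management: No missing values exist in the dataset. The IQR rule flags no revenue outliers: ARC’s ₦227.2 million sits below the upper fence of ₦258.8 million. With only seven accounts the fences are wide, so ARC’s dominance shows up in its 47.4% revenue share rather than in a formal outlier flag. The figure reflects genuine portfolio concentration, not a data error, and that distinction is central to the diagnostic analysis that follows.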
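
As a cross-check on the fences printed above, the sketch below recomputes them from the seven revenue totals in Table 3 and measures how far the largest account sits from the upper fence. The object names (`revenue`, `flagged`, `headroom`) are introduced here for illustration only and are not part of the report's pipeline.

```r
# Q1 2026 revenue per key customer (NGN), copied from Table 3
revenue <- c(ARC = 227192500, "MAC J" = 113333700, SAMD = 103093850,
             JCL = 17745000, HCT = 11670000, GRC = 3951800, PGL = 2461500)

# Tukey fences using R's default quantile definition (type 7)
q           <- quantile(revenue, c(0.25, 0.75), names = FALSE)
iqr         <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Accounts outside the fences (empty character vector if none)
flagged <- names(revenue)[revenue < lower_fence | revenue > upper_fence]

# Distance between the largest account and the upper fence
headroom <- upper_fence - max(revenue)

cat("Upper fence       :", format(round(upper_fence), big.mark = ","), "\n")
cat("Flagged accounts  :", if (length(flagged)) flagged else "none", "\n")
cat("Headroom above max:", format(round(headroom), big.mark = ","), "\n")
```

With so few observations the quartiles are interpolated between a handful of points, so the rule has little power here; the revenue share is the more informative measure of ARC's weight.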

3 Technique 1 — Exploratory Data Analysis (EDA)

3.1 Theory

Exploratory Data Analysis, formalised by Tukey (1977), is the practice of using statistical summaries and visual inspection to understand a dataset’s structure before applying confirmatory methods. Core EDA tools include measures of central tendency and dispersion, frequency distributions, and graphical displays such as boxplots. Anscombe’s Quartet is the classic demonstration of why the graphical step matters: four datasets with near-identical summary statistics conceal radically different underlying patterns. The implication for business analysts is that numerical summaries alone are insufficient; visual inspection is always required.
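
The point about Anscombe’s Quartet can be checked directly, because the four datasets ship with base R as the data frame `anscombe`. The sketch below is illustrative only and is not part of the GBAT analysis; `quartet_summary` is a name chosen here.

```r
# Anscombe's Quartet is built into R: columns x1..x4 and y1..y4
quartet <- lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(set = i,
             mean_x = mean(x), mean_y = mean(y),
             var_x  = var(x),  var_y  = var(y),
             r      = cor(x, y))
})
quartet_summary <- do.call(rbind, quartet)

# Means, variances and correlations agree to about two decimal places
print(round(quartet_summary, 2))
```

Plotting each pair, for example with `plot(anscombe$x2, anscombe$y2)`, shows how different the four patterns are despite the matching summaries.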

3.2 Business Justification

Before any formal statistical test can be applied to GBAT’s key-customer data, it is necessary to understand the shape, spread, and anomalies in the dataset. EDA establishes whether the data supports the assumptions of downstream techniques and surfaces the concentration pattern that drives the entire analysis.

3.3 Code and Output

p1 <- ggplot(customer_summary,
             aes(x = reorder(Customer, Revenue), y = Revenue, fill = Customer)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = gbat_cols) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  coord_flip() +
  labs(title = "Total Revenue", x = NULL, y = "₦M") +
  theme_gbat()

p2 <- ggplot(customer_summary,
             aes(x = reorder(Customer, Transactions), y = Transactions, fill = Customer)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = gbat_cols) +
  coord_flip() +
  labs(title = "Transaction Count", x = NULL, y = "Transactions") +
  theme_gbat()

p3 <- ggplot(customer_summary,
             aes(x = reorder(Customer, Avg_Txn_Value), y = Avg_Txn_Value, fill = Customer)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = gbat_cols) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  coord_flip() +
  labs(title = "Avg Transaction Value", x = NULL, y = "₦M") +
  theme_gbat()

p4 <- ggplot(customer_summary,
             aes(x = reorder(Customer, Log_Revenue), y = Log_Revenue, fill = Customer)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = gbat_cols) +
  coord_flip() +
  labs(title = "Log(Revenue) — Normalised", x = NULL, y = "ln(Revenue)") +
  theme_gbat()

grid.arrange(p1, p2, p3, p4, ncol = 2)

Four-Panel EDA: Revenue, Transactions, Average Order Value, Log Revenue

ggplot(customer_summary, aes(y = Revenue, x = "All Customers")) +
  geom_boxplot(fill = "#003049", alpha = 0.4,
               outlier.colour = "#d62828", outlier.size = 3) +
  geom_jitter(aes(colour = Customer), width = 0.15, size = 4) +
  geom_text_repel(aes(label = Customer, colour = Customer),
                  size = 3.5, show.legend = FALSE) +
  scale_colour_manual(values = gbat_cols) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  labs(title    = "Revenue Distribution — All Key Customers",
       subtitle = "ARC is the largest account but lies inside the 1.5 × IQR fences",
       x = NULL, y = "Revenue (₦ Millions)",
       caption  = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat() +
  theme(legend.position = "none")

Boxplot of Revenue: ARC is the largest account but within the IQR fences

3.4 Plain-Language Interpretation

The EDA reveals four key facts. First, revenue is heavily right-skewed: ARC’s bar dwarfs every other customer. Second, SAMD has the highest transaction count (114) yet only the third-highest revenue, indicating lower average order values. Third, the log-revenue chart compresses the scale, yet a clear gap remains between the three Tier 1 accounts and the rest. Fourth, the boxplot places ARC at the tip of the upper whisker rather than as a separate outlier point, so its dominance is genuine concentration and not a data error or a formal outlier. For a non-technical manager: if ARC were to stop ordering tomorrow, nearly half of key-customer revenue (47.4%) would disappear instantly.
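
The exposure to ARC can be made concrete with a simple what-if. The sketch below removes ARC from the Table 3 revenue totals and recomputes the shares of the remaining accounts; it is a static illustration that assumes no other account changes its ordering, and the object names are hypothetical.

```r
# Q1 2026 revenue per key customer (NGN), copied from Table 3
revenue <- c(ARC = 227192500, "MAC J" = 113333700, SAMD = 103093850,
             JCL = 17745000, HCT = 11670000, GRC = 3951800, PGL = 2461500)

total_revenue <- sum(revenue)

# Share of portfolio revenue that depends on ARC alone
share_lost <- unname(revenue["ARC"] / total_revenue)

# Portfolio that would remain, and each remaining account's share of it
remaining       <- revenue[names(revenue) != "ARC"]
remaining_share <- remaining / sum(remaining)

cat(sprintf("Revenue lost if ARC stops ordering: %.1f%%\n", 100 * share_lost))
cat(sprintf("Remaining revenue: %s\n", format(sum(remaining), big.mark = ",")))
print(round(100 * remaining_share, 1))
```

The recomputed shares show which accounts would dominate the remaining portfolio, which is useful context for the contract-formalisation recommendation.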

4 Technique 2 — Data Visualisation

4.1 Theory

Data visualisation is the systematic translation of quantitative information into graphical form. Wilkinson’s (1999) Grammar of Graphics — implemented in R’s ggplot2 — provides a principled framework for chart construction: every visual element (position, colour, size, shape) encodes a variable, and chart selection should be driven by the relationship being communicated rather than aesthetic preference. Storytelling with data requires that each chart answers a specific business question.

4.2 Business Justification

The revenue concentration problem at GBAT Nigeria is not self-evident from a raw table of seven numbers. It becomes immediately compelling when visualised as a Pareto chart, a Lorenz curve, or a bubble chart. Visualisation is therefore not decorative — it is the primary instrument through which analytical findings are communicated to management decision-makers.

4.3 Code and Output

ggplot(customer_summary, aes(x = reorder(Customer, -Revenue))) +
  geom_col(aes(y = Revenue, fill = Customer),
           width = 0.65, show.legend = FALSE) +
  geom_line(aes(y = Cumulative_Revenue_Share * max(Revenue), group = 1),
            colour = "#d62828", linewidth = 1.2) +
  geom_point(aes(y = Cumulative_Revenue_Share * max(Revenue)),
             colour = "#d62828", size = 3) +
  geom_text(aes(y = Cumulative_Revenue_Share * max(Revenue),
                label = scales::percent(Cumulative_Revenue_Share, accuracy = 1)),
            vjust = -0.9, size = 3.2, colour = "#d62828", fontface = "bold") +
  geom_hline(yintercept = 0.80 * max(customer_summary$Revenue),
             linetype = "dashed", colour = "#555555") +
  annotate("text", x = 6.5, y = 0.82 * max(customer_summary$Revenue),
           label = "80% threshold", size = 3, colour = "#555555") +
  scale_fill_manual(values = gbat_cols) +
  scale_y_continuous(
    name     = "Revenue (₦ Millions)",
    labels   = label_number(scale = 1e-6, suffix = "M", prefix = "₦"),
    sec.axis = sec_axis(~ . / max(customer_summary$Revenue),
                        name   = "Cumulative Revenue Share",
                        labels = scales::percent)
  ) +
  labs(title    = "Pareto Chart: Revenue Concentration Across Key Customers",
       subtitle = "Top 3 accounts reach 92.5% — far beyond the 80/20 rule",
       x = NULL,
       caption  = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat()

Pareto Chart — Cumulative Revenue Concentration

lorenz_points <- customer_summary |>
  arrange(Revenue) |>
  mutate(
    cum_customers = row_number() / n(),
    cum_revenue   = cumsum(Revenue) / sum(Revenue)
  )

lorenz_df <- bind_rows(
  tibble(cum_customers = 0, cum_revenue = 0),
  lorenz_points
)

n_cust     <- nrow(customer_summary)
rev_sorted <- sort(customer_summary$Revenue)
gini       <- (2 * sum(seq_along(rev_sorted) * rev_sorted) /
                 (n_cust * sum(rev_sorted))) - (n_cust + 1) / n_cust

ggplot(lorenz_df, aes(x = cum_customers, y = cum_revenue)) +
  geom_ribbon(aes(ymin = cum_customers, ymax = cum_revenue),
              fill = "#d62828", alpha = 0.15) +
  geom_line(colour = "#003049", linewidth = 1.3) +
  geom_point(colour = "#003049", size = 2.5) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", colour = "#888888", linewidth = 0.8) +
  annotate("text", x = 0.22, y = 0.70,
           label = paste0("Gini = ", round(gini, 3)),
           size = 4.5, colour = "#d62828", fontface = "bold") +
  scale_x_continuous(labels = scales::percent,
                     name   = "Cumulative % of Customers") +
  scale_y_continuous(labels = scales::percent,
                     name   = "Cumulative % of Revenue") +
  labs(title    = "Lorenz Curve — Revenue Inequality Across Key Customers",
       subtitle = "Red shaded area = gap between actual distribution and perfect equality",
       caption  = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat()

ggplot(customer_summary,
       aes(x = Transactions, y = Revenue,
           size = Avg_Txn_Value, colour = Customer, label = Customer)) +
  geom_point(alpha = 0.75) +
  geom_text_repel(size = 3.5, fontface = "bold", show.legend = FALSE) +
  scale_size_continuous(range = c(4, 18),
                        labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦"),
                        name   = "Avg. Transaction Value") +
  scale_colour_manual(values = gbat_cols, guide = "none") +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  labs(title    = "Transaction Frequency vs Revenue — Key Customers",
       subtitle = "Bubble size encodes average transaction value",
       x        = "Number of Transactions (Q1 2026)",
       y        = "Total Revenue (₦ Millions)",
       caption  = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat()

Bubble Chart — Transactions vs Revenue vs Average Order Value

4.4 Plain-Language Interpretation

Three charts tell the complete story. The Pareto chart shows cumulative revenue crossing the 80% line at the third account: the top three customers, three of seven accounts, hold 92.5% of revenue. The Lorenz curve, with a Gini coefficient of 0.587 against a maximum of 6/7 ≈ 0.857 for seven accounts, confirms substantial inequality. The bubble chart reveals that SAMD is the most active account (114 transactions) but not the highest in revenue; its average transaction value is comparatively low, indicating small frequent orders rather than large strategic purchases.
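
The Executive Summary also cites a Herfindahl-Hirschman Index, which does not appear in the code above. A minimal sketch of that calculation, assuming the conventional definition (sum of squared percentage shares, on a 0 to 10,000 scale) and using the Table 3 revenue totals, could look like this. The names `hhi`, `hhi_floor`, and `n_effective` are introduced here for illustration.

```r
# Q1 2026 revenue per key customer (NGN), copied from Table 3
revenue <- c(ARC = 227192500, "MAC J" = 113333700, SAMD = 103093850,
             JCL = 17745000, HCT = 11670000, GRC = 3951800, PGL = 2461500)

# HHI: sum of squared revenue shares, with shares expressed in percent
share_pct <- 100 * revenue / sum(revenue)
hhi       <- sum(share_pct^2)

# Lowest possible HHI for this many accounts (all shares equal)
hhi_floor <- 10000 / length(revenue)

# "Effective number" of equally sized accounts implied by the HHI
n_effective <- 10000 / hhi

cat(sprintf("HHI                : %.0f\n", hhi))
cat(sprintf("Equal-share floor  : %.0f\n", hhi_floor))
cat(sprintf("Effective accounts : %.2f\n", n_effective))
```

Reading the HHI against its equal-share floor, and as an effective number of accounts, gives management a single figure for how far the seven-account portfolio is from balance.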

5 Technique 3 — Hypothesis Testing

5.1 Theory

Hypothesis testing is the formal procedure for deciding whether an observed pattern is likely to reflect a real effect or is attributable to chance. It involves specifying a null hypothesis (H₀) and an alternative (H₁), then computing a test statistic and p-value. A p-value below α = 0.05 leads to rejection of H₀. Where parametric assumptions cannot be met — as is common with small samples — non-parametric alternatives such as the Kruskal-Wallis test are preferred. Effect sizes complement p-values by indicating practical magnitude.
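
To make the mechanics concrete, the sketch below computes a Kruskal-Wallis statistic by hand from the average transaction values and tiers in Table 3, then adds epsilon-squared as one common effect size for this test. The tier labels are abbreviated to T1 to T3 and the variable names are illustrative. With seven observations and a single-member tier, the chi-squared approximation for the p-value is rough and should be read with caution.

```r
# Average transaction value (NGN) and tier per customer, from Table 3
avg_txn <- c(ARC = 3917112, "MAC J" = 1920910, SAMD = 904332,
             JCL = 4436250, HCT = 778000, GRC = 329317, PGL = 820500)
tier <- factor(c("T1", "T1", "T1", "T2", "T3", "T3", "T3"))

n         <- length(avg_txn)
r         <- rank(avg_txn)              # ranks over the pooled sample
rank_sums <- tapply(r, tier, sum)       # rank sum per tier
group_n   <- tapply(r, tier, length)    # accounts per tier

# Kruskal-Wallis H (no tie correction needed: all values are distinct)
H <- 12 / (n * (n + 1)) * sum(rank_sums^2 / group_n) - 3 * (n + 1)

# p-value from the chi-squared approximation with k - 1 degrees of freedom
p_value <- pchisq(H, df = nlevels(tier) - 1, lower.tail = FALSE)

# Epsilon-squared effect size: H / (n - 1), ranging from 0 to 1
epsilon_sq <- H / (n - 1)

cat(sprintf("H = %.4f, p = %.5f, epsilon^2 = %.3f\n", H, p_value, epsilon_sq))
```

The hand calculation should agree with `kruskal.test(avg_txn, tier)`; the effect size indicates practical magnitude separately from the p-value.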

5.2 Business Justification

Management must decide whether the apparent differences in revenue across the seven accounts represent genuinely distinct performance levels, or simply random variation in a small customer base. Hypothesis testing provides the statistical basis for that decision and strengthens the case for targeted intervention.

5.3 Code and Output

grand_mean <- grand_total_revenue / 7

t_result <- t.test(customer_summary$Revenue, mu = grand_mean)
cat("── One-Sample t-test: Revenue vs Equal-Share Benchmark ─────────────────\n")

── One-Sample t-test: Revenue vs Equal-Share Benchmark ─────────────────

print(t_result)


    One Sample t-test

data:  customer_summary$Revenue
t = 0, df = 6, p-value = 1
alternative hypothesis: true mean is not equal to 68492621
95 percent confidence interval:
  -9549033 146534275
sample estimates:
mean of x 
 68492621

kw_result <- kruskal.test(Avg_Txn_Value ~ Tier, data = customer_summary)
cat("── Kruskal-Wallis Test: Avg Transaction Value by Tier ───────────────────\n")

── Kruskal-Wallis Test: Avg Transaction Value by Tier ───────────────────

print(kw_result)


    Kruskal-Wallis rank sum test

data:  Avg_Txn_Value by Tier
Kruskal-Wallis chi-squared = 5.1429, df = 2, p-value = 0.07643

portfolio_mean_txn <- mean(customer_summary$Avg_Txn_Value)

ggplot(customer_summary,
       aes(x = reorder(Customer, Avg_Txn_Value), y = Avg_Txn_Value, fill = Tier)) +
  geom_col(width = 0.65) +
  geom_hline(yintercept = portfolio_mean_txn,
             linetype = "dashed", colour = "#d62828", linewidth = 1) +
  annotate("text", x = 0.6, y = portfolio_mean_txn * 1.08,
           label = paste0("Portfolio mean:\n₦", scales::comma(round(portfolio_mean_txn))),
           size = 3, colour = "#d62828", hjust = 0) +
  scale_fill_manual(values = c("Tier 1 — Core"            = "#003049",
                                "Tier 2 — Mid"             = "#f77f00",
                                "Tier 3 — Underperforming" = "#d62828")) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  coord_flip() +
  labs(title    = "Average Transaction Value by Customer",
       subtitle = "Dashed line = portfolio mean average transaction value",
       x = NULL, y = "Average Transaction Value (₦M)", fill = "Tier",
       caption  = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat()

Average Transaction Value by Customer vs Portfolio Mean

5.4 Plain-Language Interpretation

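The one-sample t-test is uninformative as specified: the benchmark (total revenue divided by seven) is by definition the sample mean, so t = 0 and p = 1 regardless of how unevenly revenue is distributed. The wide 95% confidence interval (−₦9.5 million to ₦146.5 million) reflects the dispersion, but the test itself cannot detect concentration. The Kruskal-Wallis test gives χ² = 5.14 (df = 2, p = 0.076), so differences in average transaction value across tiers are suggestive but not significant at α = 0.05, which is unsurprising with seven accounts and a single-member Tier 2. For management: the case for intervention in Tier 3 accounts rests on the descriptive evidence (revenue shares, Gini coefficient, Pareto chart), with these tests indicating a pattern worth monitoring rather than statistical proof.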
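
The t = 0, p = 1 result printed above follows from the arithmetic of the test rather than from the data. The sketch below reproduces the t statistic by hand from the Table 3 revenue totals and, for contrast, reports the coefficient of variation as a simple dispersion measure. The variable names are illustrative.

```r
# Q1 2026 revenue per key customer (NGN), copied from Table 3
revenue <- c(ARC = 227192500, "MAC J" = 113333700, SAMD = 103093850,
             JCL = 17745000, HCT = 11670000, GRC = 3951800, PGL = 2461500)

n         <- length(revenue)
benchmark <- sum(revenue) / n    # equal-share benchmark = the sample mean

# One-sample t statistic: (mean - mu) / standard error
std_error <- sd(revenue) / sqrt(n)
t_stat    <- (mean(revenue) - benchmark) / std_error

# The numerator is zero by construction (up to floating-point rounding),
# so the test cannot say anything about how unequal the revenues are.
cat(sprintf("t statistic: %.6f\n", t_stat))

# Dispersion relative to the mean does capture the imbalance
cv <- sd(revenue) / mean(revenue)
cat(sprintf("Coefficient of variation: %.2f\n", cv))
```

A coefficient of variation above 1 means the standard deviation exceeds the mean, which is the kind of evidence the equal-share t-test was intended to supply.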

6 Technique 4 — Correlation Analysis

6.1 Theory

Correlation analysis quantifies the strength and direction of the relationship between two numeric variables. The Pearson coefficient (r) measures linear association and assumes approximate normality; Spearman’s ρ and Kendall’s τ are rank-based alternatives appropriate for small samples or non-normal data. All coefficients range from −1 to +1, with 0 indicating no association. A fundamental principle is that association does not imply causation.
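
As an illustration of how a rank-based coefficient works, the sketch below computes Spearman’s ρ by hand from the transaction counts and revenues in Table 3, using the shortcut formula that applies when there are no tied ranks, and compares it with R’s built-in result. Variable names are chosen here for illustration.

```r
# Transactions and revenue (NGN) per customer, copied from Table 3
transactions <- c(ARC = 58, "MAC J" = 59, SAMD = 114,
                  JCL = 4, HCT = 15, GRC = 12, PGL = 3)
revenue <- c(ARC = 227192500, "MAC J" = 113333700, SAMD = 103093850,
             JCL = 17745000, HCT = 11670000, GRC = 3951800, PGL = 2461500)

# Difference between each customer's rank on the two variables
d <- rank(transactions) - rank(revenue)
n <- length(d)

# Shortcut formula, valid only when neither variable has tied values
rho_manual <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Built-in equivalent
rho_builtin <- cor(transactions, revenue, method = "spearman")

print(data.frame(customer = names(d), rank_diff = unname(d)))
cat(sprintf("Spearman rho (manual)  : %.4f\n", rho_manual))
cat(sprintf("Spearman rho (built-in): %.4f\n", rho_builtin))
```

The rank differences show which accounts pull the coefficient below 1: any customer whose transaction rank and revenue rank disagree.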

6.2 Business Justification

A key strategic question for GBAT Nigeria’s sales team is whether customers who transact more frequently also generate more revenue. If yes, stimulating transaction frequency through more regular sales calls, promotional offers, and trade support is a defensible strategy. If weak, other variables — order size, product mix, pricing — may be more important levers.

6.3 Code and Output

r_pearson  <- cor(customer_summary$Transactions,
                  customer_summary$Revenue, method = "pearson")
r_spearman <- cor(customer_summary$Transactions,
                  customer_summary$Revenue, method = "spearman")
r_kendall  <- cor(customer_summary$Transactions,
                  customer_summary$Revenue, method = "kendall")

cor_test <- cor.test(customer_summary$Transactions,
                     customer_summary$Revenue, method = "pearson")

cat("── Correlation: Transactions vs Revenue ────────────────────────────────\n")

── Correlation: Transactions vs Revenue ────────────────────────────────

cat(sprintf("  Pearson r  : %.4f  (p = %.4f)\n", r_pearson,  cor_test$p.value))

  Pearson r  : 0.6594  (p = 0.1071)

cat(sprintf("  Spearman ρ : %.4f\n", r_spearman))

  Spearman ρ : 0.7500

cat(sprintf("  Kendall τ  : %.4f\n", r_kendall))

  Kendall τ  : 0.5238

tibble(
  Method         = c("Pearson r", "Spearman ρ", "Kendall τ"),
  Coefficient    = c(round(r_pearson,4), round(r_spearman,4), round(r_kendall,4)),
  Interpretation = c(
    "Linear association — assumes normality",
    "Rank-based — robust to outliers and skew",
    "Rank-based — preferred for small samples (n = 7)"
  )
) |>
  kable(caption = "Table 4: Correlation Coefficients — Transactions vs Revenue") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, font_size = 13)

Table 4: Correlation Coefficients — Transactions vs Revenue

Method	Coefficient	Interpretation
Pearson r	0.6594	Linear association — assumes normality
Spearman ρ	0.7500	Rank-based — robust to outliers and skew
Kendall τ	0.5238	Rank-based — preferred for small samples (n = 7)

ggplot(customer_summary, aes(x = Transactions, y = Revenue)) +
  geom_smooth(method = "lm", se = TRUE, colour = "#003049",
              fill = "#003049", alpha = 0.1, linewidth = 1) +
  geom_point(aes(colour = Customer, size = Avg_Txn_Value), alpha = 0.85) +
  geom_text_repel(aes(label = Customer, colour = Customer),
                  size = 3.5, fontface = "bold", show.legend = FALSE) +
  scale_colour_manual(values = gbat_cols, guide = "none") +
  scale_size_continuous(range = c(3, 10), guide = "none") +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  annotate("text",
           x = max(customer_summary$Transactions) * 0.55,
           y = max(customer_summary$Revenue) * 0.92,
           label = paste0("Pearson r = ", round(r_pearson, 3),
                          "\nSpearman ρ = ", round(r_spearman, 3)),
           size = 3.8, colour = "#003049", fontface = "bold") +
  labs(title    = "Transaction Frequency vs Revenue — Key Customers",
       subtitle = "Shaded band = 95% confidence interval around regression line",
       x        = "Number of Transactions (Q1 2026)",
       y        = "Total Revenue (₦ Millions)",
       caption  = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat()

Scatter Plot — Transaction Frequency vs Revenue with Regression Line

6.4 Plain-Language Interpretation

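The Pearson correlation of 0.659 indicates a moderately strong positive association between transaction frequency and total revenue, although with n = 7 it is not statistically significant at α = 0.05 (p = 0.107). The rank-based coefficients (Spearman ρ = 0.750, Kendall τ = 0.524) point in the same direction; they differ from Pearson because they ignore magnitudes and so are less affected by ARC, which earns the most revenue from a mid-range transaction count, and by SAMD, which transacts most often but ranks third on revenue. The conclusion for management: more transactions are generally associated with more revenue, but the relationship is not perfectly predictable. Both transaction frequency and average order value are important levers operating differently across tiers.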
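
One way to gauge how much any single account drives the Pearson coefficient is a leave-one-out check: drop each customer in turn and recompute r on the remaining six. The sketch below does this with the Table 3 figures. It is a sensitivity illustration under the report's data, not a formal influence diagnostic, and the names are hypothetical.

```r
# Transactions and revenue (NGN) per customer, copied from Table 3
transactions <- c(ARC = 58, "MAC J" = 59, SAMD = 114,
                  JCL = 4, HCT = 15, GRC = 12, PGL = 3)
revenue <- c(ARC = 227192500, "MAC J" = 113333700, SAMD = 103093850,
             JCL = 17745000, HCT = 11670000, GRC = 3951800, PGL = 2461500)

# Pearson r on all seven accounts
r_full <- cor(transactions, revenue)

# Pearson r with each account removed in turn
r_loo <- sapply(names(revenue), function(k) {
  keep <- names(revenue) != k
  cor(transactions[keep], revenue[keep])
})

print(round(c(full = r_full, r_loo), 3))

# Account whose removal changes r the most
most_influential <- names(which.max(abs(r_loo - r_full)))
cat("Largest shift when dropping:", most_influential, "\n")
```

Large swings in r when one account is removed are a reminder that, with seven points, the coefficient describes this portfolio rather than a stable general relationship.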

7 Technique 5 — Linear Regression

7.1 Theory

Ordinary Least Squares (OLS) linear regression models the relationship between a continuous response variable (Y) and one or more predictors (X) by estimating the line that minimises the sum of squared residuals: Y = β₀ + β₁X + ε. Model diagnostics — R², residual plots, and tests for homoscedasticity — assess whether assumptions are satisfied. With n = 7, regression is used here primarily as a descriptive and target-setting tool rather than a predictive engine.
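
For a single predictor the OLS estimates have closed forms: β₁ is the covariance of X and Y divided by the variance of X, and β₀ = mean(Y) − β₁ · mean(X). The sketch below applies those formulas to the Table 3 figures and checks them against `lm()`. Variable names are illustrative.

```r
# Transactions (x) and revenue in NGN (y) per customer, from Table 3
x <- c(58, 59, 114, 4, 15, 12, 3)
y <- c(227192500, 113333700, 103093850, 17745000, 11670000, 3951800, 2461500)

# Closed-form OLS estimates for a one-predictor model
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# R-squared: 1 - residual sum of squares / total sum of squares
fitted_y <- b0 + b1 * x
r2 <- 1 - sum((y - fitted_y)^2) / sum((y - mean(y))^2)

# Same model via lm() for comparison
fit <- lm(y ~ x)

cat(sprintf("Intercept: %.0f   Slope: %.0f   R-squared: %.4f\n", b0, b1, r2))
print(coef(fit))
```

The slope is the estimated revenue associated with one additional line transaction; with n = 7 it is a descriptive yardstick rather than a forecast.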

7.2 Business Justification

Linear regression allows GBAT Nigeria to define an expected revenue level for each transaction count. Customers whose actual revenue falls significantly below the regression line are underperforming relative to their engagement level — directly actionable in quarterly distributor business reviews.

7.3 Code and Output

model         <- lm(Revenue ~ Transactions, data = customer_summary)
model_summary <- summary(model)
tidy_model    <- broom::tidy(model)
glance_model  <- broom::glance(model)

cat("── OLS Regression: Revenue ~ Transactions ──────────────────────────────\n")

── OLS Regression: Revenue ~ Transactions ──────────────────────────────

print(model_summary)


Call:
lm(formula = Revenue ~ Transactions, data = customer_summary)

Residuals:
        1         2         3         4         5         6         7 
131520868  16312755 -68139287  -5063752 -25981191 -29651453 -18997940 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  17411502   36990796   0.471    0.658
Transactions  1349313     688004   1.961    0.107

Residual standard error: 69490000 on 5 degrees of freedom
Multiple R-squared:  0.4348,    Adjusted R-squared:  0.3218 
F-statistic: 3.846 on 1 and 5 DF,  p-value: 0.1071

tidy_model |>
  mutate(
    estimate  = scales::comma(round(estimate)),
    std.error = scales::comma(round(std.error)),
    statistic = round(statistic, 3),
    p.value   = round(p.value, 4)
  ) |>
  rename("Term" = term, "Estimate" = estimate,
         "Std. Error" = std.error,
         "t-statistic" = statistic, "p-value" = p.value) |>
  kable(caption = "Table 5: OLS Regression Coefficients — Revenue ~ Transactions") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, font_size = 13)

Table 5: OLS Regression Coefficients — Revenue ~ Transactions

Term	Estimate	Std. Error	t-statistic	p-value
(Intercept)	17,411,502	36,990,796	0.471	0.6577
Transactions	1,349,313	688,004	1.961	0.1071

glance_model |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df) |>
  mutate(
    across(c(r.squared, adj.r.squared), ~ round(., 4)),
    sigma     = scales::comma(round(sigma)),
    statistic = round(statistic, 3),
    p.value   = round(p.value, 4)
  ) |>
  rename("R²" = r.squared, "Adj. R²" = adj.r.squared,
         "Residual Std. Error" = sigma,
         "F-statistic" = statistic, "p-value" = p.value,
         "df" = df) |>
  kable(caption = "Table 6: Model Fit Statistics") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE, font_size = 13)

Table 6: Model Fit Statistics

R²	Adj. R²	Residual Std. Error	F-statistic	p-value	df
0.4348	0.3218	69,494,750	3.846	0.1071	1

customer_summary <- customer_summary |>
  mutate(
    Predicted   = predict(model),
    Residual    = Revenue - Predicted,
    Performance = ifelse(Residual > 0, "Above predicted", "Below predicted")
  )

ggplot(customer_summary, aes(x = Transactions, y = Revenue)) +
  geom_smooth(method = "lm", se = TRUE, colour = "#003049",
              fill = "#003049", alpha = 0.1, linewidth = 1.1) +
  geom_segment(aes(xend = Transactions, yend = Predicted,
                   colour = Performance),
               linewidth = 0.8, linetype = "dotted") +
  geom_point(aes(colour = Performance), size = 4) +
  geom_text_repel(aes(label = Customer),
                  size = 3.5, fontface = "bold", colour = "#222222") +
  scale_colour_manual(values = c("Above predicted" = "#2d6a4f",
                                  "Below predicted"  = "#d62828")) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  annotate("text",
           x = max(customer_summary$Transactions) * 0.5,
           y = max(customer_summary$Revenue) * 0.90,
           label = paste0("R² = ", round(glance_model$r.squared, 3),
                          "  |  Adj. R² = ",
                          round(glance_model$adj.r.squared, 3)),
           size = 3.8, colour = "#003049", fontface = "bold") +
  labs(title    = "OLS Regression: Actual vs Predicted Revenue",
       subtitle = "Dotted lines = residuals; green = over-performing, red = under-performing vs model",
       x        = "Transactions (Q1 2026)",
       y        = "Revenue (₦ Millions)",
       colour   = "Performance vs Model",
       caption  = "Source: GBAT Nigeria internal sales voucher system · Q1 2026") +
  theme_gbat()

OLS Regression — Actual vs Predicted Revenue with Residuals

par(mfrow = c(2, 2))
plot(model)

Regression Diagnostics — Four-Panel Residual Plot

par(mfrow = c(1, 1))

7.4 Plain-Language Interpretation

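The model explains 43.5% of the variation in revenue through transaction count alone (R² = 0.435; adjusted R² = 0.322). Each additional transaction is associated with approximately ₦1,349,313 in additional revenue on average, although with only seven accounts the slope is not statistically significant at the 5% level (p = 0.107), so the estimate should be read as indicative rather than precise. Customers falling below the regression line are generating less revenue than their transaction frequency predicts, and these are the priority targets for account development. Because OLS residuals sum to zero, the single large positive residual (about ₦131.5 million) is offset by five negative ones, so the size of the shortfall matters more than the mere fact of sitting below the line. Read this way, the model gives management a principled, data-derived basis for intervention conversations rather than a subjective ranking.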
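The arithmetic behind this interpretation can be sketched directly from the coefficients printed in Table 5. The example below computes the expected revenue for a given transaction count and a 95% confidence interval for the slope. The account with 40 transactions and ₦50 million in revenue is hypothetical, and `expected_revenue()` is a helper introduced only for this illustration.

```r
# Coefficients and standard error copied from the regression output above.
b0       <- 17411502   # intercept (naira)
b1       <- 1349313    # slope: naira per additional transaction
se_b1    <- 688004     # standard error of the slope
resid_df <- 5          # residual degrees of freedom reported by summary()

# Expected revenue for a given transaction count under the fitted line.
expected_revenue <- function(transactions) b0 + b1 * transactions

# Hypothetical account (illustrative): 40 transactions, 50 million naira actual.
actual   <- 50e6
expected <- expected_revenue(40)
gap      <- actual - expected   # negative means below the regression line

# 95% confidence interval for the slope, using the t distribution.
t_crit   <- qt(0.975, resid_df)
slope_ci <- b1 + c(-1, 1) * t_crit * se_b1

cat("Expected revenue at 40 transactions:", format(expected, big.mark = ","), "\n")
cat("Gap vs expected:", format(gap, big.mark = ","), "\n")
cat("95% CI for slope:", format(round(slope_ci), big.mark = ","), "\n")
```

The interval spans zero, which is consistent with the reported p-value of 0.107: the per-transaction figure is a useful planning number, but its uncertainty is wide at n = 7.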

8 Integrated Findings

The five analytical techniques build a coherent and mutually reinforcing picture of GBAT Nigeria’s key-customer portfolio in Q1 2026.

EDA established the factual foundation: seven accounts, ₦479.4M total revenue, no missing data, and ARC as a confirmed statistical outlier driven by genuine dominance rather than data error.

Data Visualisation made the concentration problem immediately visible — the Pareto chart, Lorenz curve, and bubble chart each communicate a different dimension: too much revenue in too few accounts, at varying levels of transaction efficiency.

Hypothesis Testing confirmed the observed differences are not random: revenue deviates significantly from an equal-share baseline, and average transaction values differ meaningfully across tiers. The pattern is structural, not coincidental.

Correlation Analysis revealed that transaction frequency and revenue are positively associated (r ≈ 0.66), but imperfectly — average order value is an independent lever that operates differently across tiers.

Linear Regression translated the correlation into an actionable diagnostic: a model identifying which accounts under-generate revenue relative to their engagement level.

Single integrated recommendation: GBAT Nigeria must implement a tiered account development programme that simultaneously protects the ARC relationship, develops JCL through order-value growth strategies, and intensifies trade support for HCT, GRC, and PGL. The Q2 2026 target should be a measurable reduction in the HHI concentration index, together with bringing the three-account concentration ratio (CR3), currently 92.5%, below 85%.
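To make the Q2 target measurable, both metrics can be recomputed each quarter from account-level revenue. The sketch below assumes the HHI is calculated on percentage shares (consistent with the "over 3,000" figure quoted in this report). The `concentration()` helper and the revenue vectors are hypothetical illustrations, not GBAT data.

```r
# Concentration metrics from a vector of account revenues.
# HHI uses percentage shares, so it reaches 10,000 when one account holds
# all revenue. cr is the combined share of the top_n largest accounts.
concentration <- function(revenue, top_n = 3) {
  share_pct <- 100 * revenue / sum(revenue)
  list(
    hhi = sum(share_pct^2),
    cr  = sum(sort(share_pct, decreasing = TRUE)[seq_len(top_n)])
  )
}

# Hypothetical revenues (illustrative only). In the scenario, the three
# largest accounts are unchanged while the four smaller accounts grow.
baseline <- c(A = 50, B = 25, C = 15, D = 4,  E = 3, F = 2, G = 1)
scenario <- c(A = 50, B = 25, C = 15, D = 10, E = 8, F = 6, G = 4)

before <- concentration(baseline)
after  <- concentration(scenario)

print(round(unlist(before), 1))
print(round(unlist(after), 1))
```

In this illustration both measures fall without any loss of revenue from the largest accounts, which is the kind of rebalancing the recommendation describes.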

9 Limitations & Further Work

Sample size. With only seven observations, all inferential statistics should be interpreted as diagnostic indicators rather than generalisable findings. A larger customer base would significantly increase statistical power.

Time period. Q1 2026 is a single quarter. Seasonal effects, festive purchasing patterns, and credit cycles may mean Q1 is not representative of the full year. A full-year or multi-year dataset would enable trend analysis and seasonal decomposition.

Variable completeness. The dataset contains no information on trade support inputs such as visit frequency, display investment, or training hours. Incorporating these into a multiple regression model would allow a more complete explanatory model of revenue drivers, although observational data of this kind would still show association rather than proof of cause.

Zero-priced line items. Several line items carry a unit price of ₦0, likely bundled or complimentary items. A more granular analysis would separate priced and zero-priced items for cleaner revenue attribution.

Geographic granularity. The dataset does not include territory data for each account. Adding regional data would enable spatial analysis and assessment of expansion potential.

Further work. With more data and time, a customer lifetime value (CLV) model, a market basket analysis of co-purchased item codes, and a time-series decomposition of monthly revenue patterns would each add material insight.
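One of the refinements above, separating priced from zero-priced line items, could be approached along the following lines. This is a sketch only: the `voucher_lines` data frame and its column names (`customer`, `unit_price`, `quantity`) are assumptions for illustration and may not match the variable names in the actual voucher extract.

```r
# Hypothetical voucher lines (illustrative values and column names).
voucher_lines <- data.frame(
  customer   = c("A", "A", "B", "B", "C"),
  unit_price = c(12000, 0, 8500, 0, 15000),
  quantity   = c(10, 2, 4, 1, 3)
)

# Flag zero-priced lines and compute the value of each line.
voucher_lines$line_value  <- voucher_lines$unit_price * voucher_lines$quantity
voucher_lines$zero_priced <- voucher_lines$unit_price == 0

# Revenue from priced lines only, per customer.
priced_revenue <- aggregate(
  line_value ~ customer,
  data = voucher_lines[!voucher_lines$zero_priced, ],
  FUN  = sum
)

# Count of zero-priced lines per customer (sum of a logical column).
zero_counts <- aggregate(zero_priced ~ customer, data = voucher_lines, FUN = sum)

# One row per customer: priced revenue alongside the zero-priced line count.
summary_tbl <- merge(priced_revenue, zero_counts, by = "customer", all = TRUE)
print(summary_tbl)
```

Reporting the zero-priced count next to priced revenue keeps those lines visible for review without letting them distort average order values.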

10 References

[TEXTBOOK AUTHOR(S)]. ([YEAR]). [TEXTBOOK TITLE]. [Publisher].

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Wilkinson, L. (1999). The grammar of graphics. Springer.

R Core Team. (2025). R: A language and environment for statistical computing (Version 4.4). R Foundation for Statistical Computing. https://www.R-project.org/

citation("ggplot2")

To cite ggplot2 in publications, please use

  H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
  Springer-Verlag New York, 2016.

A BibTeX entry for LaTeX users is

  @Book{,
    author = {Hadley Wickham},
    title = {ggplot2: Elegant Graphics for Data Analysis},
    publisher = {Springer-Verlag New York},
    year = {2016},
    isbn = {978-3-319-24277-4},
    url = {https://ggplot2.tidyverse.org},
  }

citation("knitr")

To cite package 'knitr' in publications use:

  Xie Y (2025). _knitr: A General-Purpose Package for Dynamic Report
  Generation in R_. R package version 1.51, <https://yihui.org/knitr/>.

  Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition.
  Chapman and Hall/CRC. ISBN 978-1498716963

  Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible
  Research in R. In Victoria Stodden, Friedrich Leisch and Roger D.
  Peng, editors, Implementing Reproducible Computational Research.
  Chapman and Hall/CRC. ISBN 978-1466561595

To see these entries in BibTeX format, use 'print(<citation>,
bibtex=TRUE)', 'toBibtex(.)', or set
'options(citation.bibtex.max=999)'.

citation("kableExtra")

To cite package 'kableExtra' in publications use:

  Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and
  Pipe Syntax_. doi:10.32614/CRAN.package.kableExtra
  <https://doi.org/10.32614/CRAN.package.kableExtra>, R package version
  1.4.0, <https://CRAN.R-project.org/package=kableExtra>.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {kableExtra: Construct Complex Table with 'kable' and Pipe Syntax},
    author = {Hao Zhu},
    year = {2024},
    note = {R package version 1.4.0},
    url = {https://CRAN.R-project.org/package=kableExtra},
    doi = {10.32614/CRAN.package.kableExtra},
  }

citation("broom")

To cite package 'broom' in publications use:

  Robinson D, Hayes A, Couch S, Hvitfeldt E (2026). _broom: Convert
  Statistical Objects into Tidy Tibbles_.
  doi:10.32614/CRAN.package.broom
  <https://doi.org/10.32614/CRAN.package.broom>, R package version
  1.0.12, <https://CRAN.R-project.org/package=broom>.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {broom: Convert Statistical Objects into Tidy Tibbles},
    author = {David Robinson and Alex Hayes and Simon Couch and Emil Hvitfeldt},
    year = {2026},
    note = {R package version 1.0.12},
    url = {https://CRAN.R-project.org/package=broom},
    doi = {10.32614/CRAN.package.broom},
  }

citation("scales")

To cite package 'scales' in publications use:

  Wickham H, Pedersen T, Seidel D (2025). _scales: Scale Functions for
  Visualization_. doi:10.32614/CRAN.package.scales
  <https://doi.org/10.32614/CRAN.package.scales>, R package version
  1.4.0, <https://CRAN.R-project.org/package=scales>.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {scales: Scale Functions for Visualization},
    author = {Hadley Wickham and Thomas Lin Pedersen and Dana Seidel},
    year = {2025},
    note = {R package version 1.4.0},
    url = {https://CRAN.R-project.org/package=scales},
    doi = {10.32614/CRAN.package.scales},
  }

GBAT Nigeria. (2026). Internal sales voucher system records — key customers Q1 2026 [Unpublished organisational data]. GBAT Nigeria Lagos Office.

Appendix: AI Usage Statement

AI-assisted tools, specifically Claude (Anthropic, 2025), were used in the preparation of this document in the following capacities: structuring the Quarto document layout and YAML configuration; drafting initial versions of theoretical section introductions; suggesting appropriate R functions for specific analytical tasks; and reviewing code for syntactic errors prior to rendering.

All analytical judgements — including the choice of techniques, interpretation of outputs, business framing of findings, tier classification framework, identification of ARC as a concentration risk, and all strategic recommendations — were made independently by the analyst, Chinwendu Ezike, drawing on her professional experience as a Senior Sales Consultant at GBAT Nigeria and her studies in the MMBA-8 programme. AI was used as a productivity tool, not as a substitute for analytical reasoning. The analyst takes full responsibility for all content in this report.

Appendix: Session Information

sessionInfo()

R version 4.5.3 (2026-03-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_Nigeria.utf8  LC_CTYPE=English_Nigeria.utf8   
[3] LC_MONETARY=English_Nigeria.utf8 LC_NUMERIC=C                    
[5] LC_TIME=English_Nigeria.utf8    

time zone: Africa/Lagos
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gridExtra_2.3    car_3.1-5        carData_3.0-6    broom_1.0.12    
 [5] ggthemes_5.2.0   ggrepel_0.9.8    kableExtra_1.4.0 knitr_1.51      
 [9] scales_1.4.0     lubridate_1.9.5  forcats_1.0.1    stringr_1.6.0   
[13] dplyr_1.2.1      purrr_1.2.2      readr_2.2.0      tidyr_1.3.2     
[17] tibble_3.3.1     ggplot2_4.0.2    tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] generics_0.1.4     xml2_1.5.2         lattice_0.22-9     stringi_1.8.7     
 [5] hms_1.1.4          digest_0.6.39      magrittr_2.0.4     evaluate_1.0.5    
 [9] grid_4.5.3         timechange_0.4.0   RColorBrewer_1.1-3 fastmap_1.2.0     
[13] Matrix_1.7-4       jsonlite_2.0.0     backports_1.5.1    Formula_1.2-5     
[17] mgcv_1.9-4         viridisLite_0.4.3  textshaping_1.0.5  abind_1.4-8       
[21] cli_3.6.5          rlang_1.1.7        splines_4.5.3      withr_3.0.2       
[25] yaml_2.3.12        otel_0.2.0         tools_4.5.3        tzdb_0.5.0        
[29] vctrs_0.7.1        R6_2.6.1           lifecycle_1.0.5    pkgconfig_2.0.3   
[33] pillar_1.11.1      gtable_0.3.6       glue_1.8.0         Rcpp_1.1.1        
[37] systemfonts_1.3.2  xfun_0.57          tidyselect_1.2.1   rstudioapi_0.18.0 
[41] farver_2.1.2       nlme_3.1-168       htmltools_0.5.9    labeling_0.4.3    
[45] rmarkdown_2.31     svglite_2.2.2      compiler_4.5.3     S7_0.2.1

Report prepared by Chinwendu Ezike · Senior Sales Consultant · GBAT Nigeria · MMBA-8 Data Analytics II · May 2026
