Analysis of Customer Revenue and Product Adoption in a Commercial Bank

Author

Ayokunnumi Abraham

Published

May 12, 2026

1. Executive Summary

This analysis examines the financial performance and product adoption patterns of 1,802 commercial banking customers. The primary objective was to identify drivers of total revenue and evaluate the impact of business segments and product usage on profitability. Using exploratory data analysis, it was discovered that revenue is highly positively skewed, with a small number of high-value accounts contributing disproportionately to the total. Inferential testing confirmed a significant difference in revenue between small and medium business segments (\(p < 0.001\)), while loan status did not significantly impact average balances. A correlation analysis revealed a near-perfect relationship between average balance and total revenue (\(r = 0.99\)). Finally, a multiple linear regression model (R² = 0.98) demonstrated that for every unit increase in average balance, revenue increases by 0.058 units, with the West region showing the highest regional revenue uplift. Based on these findings, it is recommended that the bank focuses on migrating small segment customers to the medium segment through targeted credit facilities to maximize revenue growth.

2. Professional Disclosure

Job Title: Relationship Manager, Commercial Banking
Organisation Type: Financial Services / Banking Sector

Technique Relevance:

EDA: Essential for identifying outliers in transaction volumes that could indicate fraud or reporting errors.
Visualisation: Critical for communicating complex financial trends to non-technical regional heads.
Hypothesis Testing: Used to validate whether marketing campaigns for digital banking actually lead to higher balances.
Correlation: Helps identify which products (e.g., POS or Mobile Banking) are most strongly linked to revenue.
Regression: Allows for the forecasting of annual revenue based on projected deposit growth.

3. Data Collection & Sampling

Source: Internal organizational records (anonymized for academic purposes).
Collection Method: Data extraction from the Core Banking System (CBS) via SQL query.
Sampling Frame: Active commercial accounts as of Q1 2026.
Sample Size: 1,802 observations.
Time Period: January 2026 to April 2026.
Ethical Note: All personally identifiable information (PII) has been removed. Customer numbers are replaced with generic IDs (e.g., “Acc_11”).

4. Data Description

The dataset contains 1,802 rows and 17 variables, including:

Categorical: REGION, SUB_BUS_SGT, CATEGORY.
Numeric: 2026_AVG_BALANCE, TOTAL REVENUE, 2026_DR_TOVER.
Date: AC_OPEN_DATE.

5. Analysis 1: Exploratory Data Analysis (EDA)

During the EDA phase, two primary data quality issues were identified:

Skewness: Total Revenue exhibited an extreme positive skewness of 14.51, indicating that the mean is heavily influenced by a few top-tier customers.
Outliers: 17 accounts were identified as statistical outliers in AVG_BALANCE (Z-score > 3).

Handling: To address skewness for visualization, a log-transformation was applied. Outliers were retained for the final model to reflect the actual business reality of “whale” accounts in commercial banking.

Chart 1 – Raw Revenue Distribution

Code

ggplot(df, aes(x = TOTAL.REVENUE)) +
  geom_histogram(bins = 60, fill = "#d73027", color = "white") +
  scale_x_continuous(labels = scales::comma) +
  labs(
    title = "Raw Distribution of Total Revenue",
    x = "Total Revenue (₦)",
    y = "Number of Customers"
  )

The histogram illustrates the extreme positive skewness described in the Executive Summary: the vast majority of accounts cluster near zero, while a small number of “whale” accounts extend the tail to the right. This asymmetry (skewness = 14.51) makes log-transformation essential before any visual comparison.

Chart 2 – Outlier Detection: Average Balance (Z-Score)

Code

df |>
  mutate(
    z_score = (X2026_AVG_BALANCE - mean(X2026_AVG_BALANCE, na.rm = TRUE)) /
              sd(X2026_AVG_BALANCE, na.rm = TRUE),
    outlier = abs(z_score) > 3
  ) |>
  ggplot(aes(x = seq_along(z_score), y = z_score, color = outlier)) +
  geom_point(size = 1.2, alpha = 0.6) +
  geom_hline(yintercept = c(-3, 3), linetype = "dashed", color = "black") +
  scale_color_manual(values = c("FALSE" = "#2c7bb6", "TRUE" = "#d73027"),
                     labels = c("Normal", "Outlier (|Z| > 3)")) +
  labs(
    title = "Z-Score Plot: Average Balance Outlier Detection",
    x = "Customer Index",
    y = "Z-Score",
    color = NULL
  )

Red points represent the 17 accounts whose average balances exceed three standard deviations from the mean. These are the “whale” accounts referenced in the Executive Summary. Retaining them preserves the true revenue concentration in the portfolio, which is business-critical for understanding performance.

6. Analysis 2: Data Visualisation

Four charts explore revenue patterns across distribution, segment, region, and balance.

Chart 3 – Revenue Distribution (Log-Transformed)

Code

ggplot(df, aes(x = log(TOTAL.REVENUE + 1))) +
  geom_histogram(bins = 40, fill = "#2c7bb6", color = "white") +
  labs(
    title = "Distribution of Total Revenue (Log-Transformed)",
    x = "Log(Total Revenue + 1)",
    y = "Number of Customers"
  )

Warning in log(TOTAL.REVENUE + 1): NaNs produced

Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_bin()`).

After log-transformation, the distribution becomes approximately bell-shaped and readable. The bimodal structure visible here reflects the two distinct customer populations — Small and Medium segment accounts — which the hypothesis testing section later confirms have significantly different revenue profiles.

Chart 4 – Revenue by Business Segment

Code

ggplot(df, aes(x = SUB_BUS_SGT, y = log(TOTAL.REVENUE + 1))) +
  geom_boxplot(fill = "#2c7bb6") +
  labs(
    title = "Revenue Distribution by Business Segment",
    x = "Business Segment",
    y = "Log(Total Revenue + 1)"
  )

Warning in log(TOTAL.REVENUE + 1): NaNs produced

Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_boxplot()`).

The Medium segment shows a consistently higher revenue distribution than the Small segment — both in median and spread. This visual pattern aligns with the hypothesis test result (\(p < 0.001\)) and underpins the Executive Summary’s recommendation to migrate Small segment accounts upward.

Chart 5 – Total Revenue by Region

Code

df |>
  group_by(REGION) |>
  summarise(Total_Revenue = sum(TOTAL.REVENUE, na.rm = TRUE) / 1e6) |>
  ggplot(aes(x = reorder(REGION, Total_Revenue), y = Total_Revenue)) +
  geom_col(fill = "#2c7bb6") +
  labs(
    title = "Total Revenue by Region",
    x = "Region",
    y = "Total Revenue (₦ Millions)"
  )

South leads the portfolio in total revenue contribution, followed closely by Lagos. The West region, despite generating lower absolute revenue, is identified in the regression model as producing the highest per-account revenue uplift — distinguishing volume from efficiency.

Chart 6 – Average Balance vs Total Revenue

Code

ggplot(df, aes(x = log(X2026_AVG_BALANCE + 1), y = log(TOTAL.REVENUE + 1))) +
  geom_point(alpha = 0.3, size = 1.2, color = "#2c7bb6") +
  labs(
    title = "Average Balance vs Total Revenue (Log Scale)",
    x = "Log(Avg Balance + 1)",
    y = "Log(Total Revenue + 1)"
  )

Warning in log(TOTAL.REVENUE + 1): NaNs produced

Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).

The near-perfect linear relationship on the log scale confirms the correlation of \(r = 0.99\) cited in the Executive Summary. Average balance is overwhelmingly the dominant driver of revenue, which motivates the “Deep Wallet” strategy recommended in the Integrated Findings.

7. Analysis 3: Hypothesis Testing

Hypothesis 1

\(H_0\): There is no difference in mean revenue between ‘SMALL’ and ‘MEDIUM’ segments.
\(H_1\): There is a significant difference in mean revenue between segments.
Result: \(p < 0.001\). We reject the null hypothesis. Medium segments generate significantly more revenue.

Hypothesis 2

\(H_0\): Average balance is independent of loan status.
\(H_1\): Customers with loans maintain higher average balances.
Result: \(p = 0.43\). We fail to reject the null hypothesis.

Chart 7 – Revenue Distribution: Small vs Medium Segment

Code

ggplot(df, aes(x = SUB_BUS_SGT, y = log(TOTAL.REVENUE + 1))) +
  geom_violin(fill = "#2c7bb6", alpha = 0.7) +
  geom_boxplot(width = 0.1, fill = "white", outlier.size = 0.8) +
  labs(
    title = "Revenue Distribution: Small vs Medium Segment",
    x = "Business Segment",
    y = "Log(Total Revenue + 1)"
  )

Warning in log(TOTAL.REVENUE + 1): NaNs produced
Warning in log(TOTAL.REVENUE + 1): NaNs produced

Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_ydensity()`).

Warning: Removed 3 rows containing non-finite outside the scale range
(`stat_boxplot()`).

The violin plot makes the distributional difference between segments visible beyond just medians. The Medium segment shows a wider upper body, indicating a greater concentration of high-revenue accounts. This supports rejecting \(H_0\) and reinforces that segment classification is a meaningful revenue predictor.

Chart 8 – Average Balance by Loan Status

Code

df |>
  mutate(Loan_Status = ifelse(HAS_LOAN == 1, "Has Loan", "No Loan")) |>
  ggplot(aes(x = Loan_Status, y = log(X2026_AVG_BALANCE + 1))) +
  geom_boxplot(fill = "#2c7bb6") +
  labs(
    title = "Average Balance by Loan Status",
    x = "Loan Status",
    y = "Log(Avg Balance + 1)"
  )

Consistent with the non-significant test result (\(p = 0.43\)), the two groups show broadly overlapping balance distributions. Loan holders do not maintain materially higher balances, indicating that loan products are not functioning as cross-sell anchors that deepen the deposit relationship.

8. Analysis 4: Correlation Analysis

The correlation matrix identifies 2026_AVG_BALANCE (\(r = 0.99\)) and 2026_DR_TOVER (\(r = 0.49\)) as the strongest predictors of revenue. Interestingly, digital product adoption (HAS_POS, HAS_MOBILE_BANKING) showed negligible correlation with revenue, suggesting these are utility tools rather than direct revenue drivers.

Chart 9 – Correlation Matrix

Code

cor_vars <- df |>
  select(TOTAL.REVENUE, X2026_AVG_BALANCE, X2026_DR_TOVER,
         HAS_MOBILE_BANKING, HAS_LOAN, HAS_POS, UNION_360, HAS_ONLINE_BANKING) |>
  rename(
    `Total Revenue`  = TOTAL.REVENUE,
    `Avg Balance`    = X2026_AVG_BALANCE,
    `DR Turnover`    = X2026_DR_TOVER,
    `Mobile Banking` = HAS_MOBILE_BANKING,
    `Loan`           = HAS_LOAN,
    `POS`            = HAS_POS,
    `Union 360`      = UNION_360,
    `Online Banking` = HAS_ONLINE_BANKING
  )

cor_matrix <- cor(cor_vars, use = "complete.obs")
cor_melted  <- melt(cor_matrix)

ggplot(cor_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)), size = 3) +
  scale_fill_gradient2(low = "#d73027", mid = "white", high = "#2c7bb6",
                       midpoint = 0, limits = c(-1, 1), name = "r") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Matrix of Key Variables", x = NULL, y = NULL)

The heatmap confirms that Avg Balance (\(r = 0.99\)) is by far the strongest revenue predictor, with DR Turnover a distant second (\(r = 0.49\)). Digital products (Mobile Banking, POS, Online Banking, Union 360) show near-zero correlations with revenue, reinforcing the conclusion that product adoption does not independently drive revenue — the deposit base does.

9. Analysis 5: Linear Regression

A multiple regression model was built to predict TOTAL REVENUE:

Average Balance Coefficient: 0.058. For every ₦1 increase in balance, revenue increases by ₦0.058.
Regional Impact: Accounts in the West region earn ₦18,727 more on average than the baseline (North), holding all other factors constant.
Diagnostics: The R-squared of 0.98 indicates an exceptionally strong fit.

Chart 10 – Actual vs Fitted Revenue

Code

model <- lm(TOTAL.REVENUE ~ X2026_AVG_BALANCE + REGION, data = df)

data.frame(
  Actual = df$TOTAL.REVENUE,
  Fitted = fitted(model)
) |>
  ggplot(aes(x = log(Fitted + 1), y = log(Actual + 1))) +
  geom_point(alpha = 0.3, size = 1.2, color = "#2c7bb6") +
  geom_abline(slope = 1, intercept = 0, color = "#d73027", linewidth = 0.8) +
  labs(
    title = "Actual vs Fitted: Revenue Regression Model",
    x = "Log(Fitted Revenue + 1)",
    y = "Log(Actual Revenue + 1)"
  )

Warning in log(Actual + 1): NaNs produced

Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).

Points tightly clustered along the red 45° line indicate strong model fit. The R² of 0.98 is visually evident here — deviations from the diagonal are limited to the lower range where zero-revenue accounts introduce floor effects. High-value accounts (upper right) are modelled with high precision.

Chart 11 – Regional Revenue Coefficients

Code

tidy(model) |>
  filter(grepl("REGION", term)) |>
  mutate(term = gsub("REGION", "", term)) |>
  ggplot(aes(x = reorder(term, estimate), y = estimate)) +
  geom_col(fill = "#2c7bb6") +
  geom_errorbar(aes(ymin = estimate - std.error * 1.96,
                    ymax = estimate + std.error * 1.96), width = 0.2) +
  labs(
    title = "Regional Revenue Coefficients (vs North Baseline)",
    x = "Region",
    y = "Revenue Uplift (₦)"
  )

West and South show the highest point estimates relative to the North baseline, consistent with the Executive Summary. However, the wide confidence intervals indicate considerable uncertainty in these regional effects — the region-level uplift is directionally informative but should not be over-interpreted in isolation.

10. Integrated Findings

The analysis demonstrates that revenue is almost entirely driven by the deposit base (AVG_BALANCE) rather than transaction frequency or product count. While the “Small” business segment comprises the majority of the portfolio, the “Medium” segment is the engine of profitability.

Recommendation: Shift focus from broad digital product acquisition to “Deep Wallet” strategies aimed at increasing the deposit balances of existing Small Segment customers in the West and South regions.

11. Limitations & Further Work

Limitation: The cross-sectional nature of the data does not capture seasonal fluctuations.

Further Work: Future analysis should incorporate a Time Series approach to model the churn rate of these accounts over a 12-month period.

References

Adi, B. (2026). AI-powered business analytics. Lagos Business School.

McKinney, W. (2010). Data structures for statistical computing in Python.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis.

Appendix: AI Usage Statement

AI was used to assist in writing the Python cleaning scripts and translating them into R syntax. Independent analytical judgment was exercised in the selection of the regression variables and the interpretation of the regional coefficients.