Drivers of Income Generation in Corporate Banking: An Exploratory and Inferential Analysis

Author

Edeh Caroline Ese

Published

May 16, 2026


1 Executive Summary

This study investigates the drivers of income generation across a portfolio of 100 corporate banking clients at a Nigerian commercial bank for the financial year 2024–2025. Using real data extracted from the bank’s core banking and customer relationship management (CRM) systems, the analysis applies five analytical techniques — Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Multiple Linear Regression — to identify which account-level and relationship characteristics most strongly predict fee and interest income earned by the bank.

The dataset comprises 100 corporate accounts with variables spanning credit turnover, transaction volume, income and fees, industry sector, credit facility status, relationship manager (RM) visit frequency, and relationship tenure. Key findings reveal that credit turnover (log-transformed) and credit facility status are the dominant predictors of income, together explaining 86% of income variance (R² = 0.86). Accounts with active credit facilities earn the bank an average of ₦165.2M annually — nearly five times more than accounts without (₦35.7M). Sector-level analysis identifies Aviation, Health Care, and Marine Services as highest-yield industries by income-to-turnover ratio, while Traders and General Merchants — the largest sector by account count — yield the least per naira of throughput.

Recommendation: The bank should prioritise credit facility cross-selling to the 78 accounts currently without facilities, particularly in high-yield sectors. This is the single highest-impact lever for income growth in the corporate banking portfolio.


2 Professional Disclosure

2.1 Job Role and Organisation

I am a Relationship Manager in the Corporate Banking division of a Nigerian commercial bank. My responsibilities include managing a portfolio of corporate clients, originating credit transactions, driving deposit growth, and coordinating product cross-sell across treasury, trade finance, and digital banking channels. The institution operates across multiple states in Nigeria and serves clients across industries including oil and gas, manufacturing, logistics, and professional services.

2.2 Relevance of Each Technique to My Work

Exploratory Data Analysis (EDA): Portfolio reviews are a routine part of relationship management. EDA directly mirrors the process of reviewing account performance — identifying which clients are outliers, which sectors are growing, and where income is underperforming relative to turnover. Formalising this through statistical EDA improves rigour and comparability across reporting periods.

Data Visualisation: Presenting portfolio performance to team leads and the Head of Corporate Banking requires communicating complex account data clearly. Visualisation techniques translate raw numbers into charts that support decisions about RM deployment and sector strategy.

Hypothesis Testing: Resource allocation decisions — such as whether to increase RM visit frequency for certain client tiers — should be evidence-based. Formal hypothesis testing determines whether observed differences in income across sectors or visit-frequency bands are statistically meaningful or due to chance.

Correlation Analysis: Understanding which variables move together informs product strategy. If transaction volume correlates with income, that supports offering digital transaction banking solutions. Correlation analysis identifies these relationships systematically across the full portfolio.

Linear Regression: Regression quantifies the marginal contribution of each variable to income — critical for business cases. Telling a credit committee that each credit facility attached is associated with a 4.5x income uplift is more actionable than saying “credit clients earn more.”


3 Data Collection and Sampling

3.1 Source and Collection Method

The dataset was extracted from the bank’s core banking system and CRM platform by the author in her capacity as a Relationship Manager. Data covers the financial year 2024–2025 (twelve calendar months). The extraction included all corporate accounts managed within the author’s regional office with full-year transaction activity — accounts opened or closed mid-year were excluded to ensure comparability of annual figures.

Variable Source Type
Credit Turnover (N) Core banking — annual credit leg of all transactions Continuous numeric
Transaction Volume Core banking — count of all debit and credit entries Discrete numeric
Income / Fees (N) Core banking — fee income + net interest earned by bank Continuous numeric
Industry / Sector KYC records Categorical (16 levels)
Credit Facility Status Credit system — active facility flag Binary categorical
RM Visits per Month CRM activity log — average monthly visits Discrete numeric (1-5)
Relationship Years Core banking — account vintage Discrete numeric
Relationship Manager CRM Categorical (5 RMs)

3.2 Sampling Frame and Justification

The sampling frame is the full population of active corporate accounts in the author’s regional portfolio as at 31 March 2025 — 100 accounts in total. This constitutes a census, not a sample. The minimum observation requirement of 100 is met exactly. All accounts recorded at least one transaction in each quarter, ensuring full-year representativeness.

3.3 Time Period

Financial year 2024–2025. All figures are annualised totals.

3.4 Ethical Notes and Data Sharing

All personally identifiable information has been removed. Client names are replaced with anonymous identifiers (CORPORATE 1 through CORPORATE 100). RM names are replaced with RM 1 through RM 5. Account numbers are replaced with sequential dummy numbers. Internal data governance approval was obtained from the author’s line manager. Data is available on request from the author subject to organisational data-sharing restrictions.


4 Data Description

Code
library(tidyverse)
library(readxl)
library(knitr)
library(kableExtra)

# DATA SAMPLE.xlsx must be in the same folder as this .qmd file
data_path <- "C:/Users/edehc/Downloads/DA Exam/DATA SAMPLE.xlsx"
df <- read_excel(data_path, skip = 2)

colnames(df) <- c("RSM", "ACCOUNT_NO", "SN", "ACCOUNT_NAME", "ACCOUNT_NUMBER",
                  "CREDIT_TURNOVER", "TRANSACTION_VOLUME", "INCOME_FEES",
                  "INDUSTRY", "CREDIT_FACILITY", "RM_VISIT_PER_MONTH",
                  "RELATIONSHIP_YEARS", "RELATIONSHIP_MANAGER")

df <- df |>
  filter(!is.na(SN)) |>
  mutate(
    CREDIT_TURNOVER    = as.numeric(CREDIT_TURNOVER),
    INCOME_FEES        = as.numeric(INCOME_FEES),
    TRANSACTION_VOLUME = as.numeric(TRANSACTION_VOLUME),
    RM_VISIT_PER_MONTH = as.numeric(RM_VISIT_PER_MONTH),
    RELATIONSHIP_YEARS = as.numeric(RELATIONSHIP_YEARS),
    CREDIT_FACILITY    = as.factor(CREDIT_FACILITY),
    INDUSTRY           = as.factor(INDUSTRY),
    LOG_TURNOVER           = log(CREDIT_TURNOVER),
    LOG_INCOME             = log(INCOME_FEES),
    LOG_TXN_VOL            = log(pmax(TRANSACTION_VOLUME, 1)),
    TURNOVER_BN            = CREDIT_TURNOVER / 1e9,
    INCOME_MN              = INCOME_FEES / 1e6,
    INCOME_TO_TURNOVER_PCT = INCOME_FEES / CREDIT_TURNOVER * 100,
    CREDIT_FAC_NUM         = ifelse(CREDIT_FACILITY == "Yes", 1L, 0L),
    VISIT_BAND = factor(
      case_when(
        RM_VISIT_PER_MONTH <= 2 ~ "Low (1-2)",
        RM_VISIT_PER_MONTH == 3 ~ "Mid (3)",
        TRUE                    ~ "High (4-5)"
      ),
      levels = c("Low (1-2)", "Mid (3)", "High (4-5)")
    )
  )

cat("Dimensions:", nrow(df), "rows x", ncol(df), "columns\n")
Dimensions: 100 rows x 21 columns
Code
cat("\nMissing values (raw columns):\n")

Missing values (raw columns):
Code
print(colSums(is.na(df[, 1:13])))
                 RSM           ACCOUNT_NO                   SN 
                   0                    0                    0 
        ACCOUNT_NAME       ACCOUNT_NUMBER      CREDIT_TURNOVER 
                   0                    0                    0 
  TRANSACTION_VOLUME          INCOME_FEES             INDUSTRY 
                   0                    0                    0 
     CREDIT_FACILITY   RM_VISIT_PER_MONTH   RELATIONSHIP_YEARS 
                   0                    0                    0 
RELATIONSHIP_MANAGER 
                   0 
Code
df |>
  select(TURNOVER_BN, INCOME_MN, TRANSACTION_VOLUME,
         RM_VISIT_PER_MONTH, RELATIONSHIP_YEARS,
         INCOME_TO_TURNOVER_PCT) |>
  summary() |>
  kable(caption = "Table 1: Summary Statistics - Key Numeric Variables") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Table 1: Summary Statistics - Key Numeric Variables
TURNOVER_BN INCOME_MN TRANSACTION_VOLUME RM_VISIT_PER_MONTH RELATIONSHIP_YEARS INCOME_TO_TURNOVER_PCT
Min. : 1.346 Min. : 12.04 Min. : 2.00 Min. :1.00 Min. : 2.00 Min. :0.1074
1st Qu.: 2.114 1st Qu.: 17.35 1st Qu.: 30.25 1st Qu.:2.00 1st Qu.: 5.00 1st Qu.:0.6581
Median : 3.483 Median : 48.98 Median : 83.50 Median :3.00 Median : 7.00 Median :0.7855
Mean : 12.811 Mean : 83.92 Mean : 251.82 Mean :2.83 Mean : 6.69 Mean :1.5150
3rd Qu.: 8.888 3rd Qu.:120.03 3rd Qu.: 242.75 3rd Qu.:4.00 3rd Qu.: 8.00 3rd Qu.:1.1358
Max. :421.259 Max. :452.39 Max. :3732.00 Max. :5.00 Max. :10.00 Max. :7.9336
Code
df |>
  count(INDUSTRY, sort = TRUE) |>
  kable(col.names = c("Industry", "n"),
        caption = "Table 2: Account Count by Industry") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Table 2: Account Count by Industry
Industry n
TRADERS AND GENERAL MERCHANTS 27
OIL AND GAS 15
LOGISTICS 11
CONSTRUCTION AND ENGINNERING 7
REAL ESTATE 6
CONSULTANCY 5
HEALTH CARE 4
LEGAL PRACTICE 4
BROKERAGE 3
FINANCE 3
MARINE SERVICES 3
POWER AND ENERGY 3
TELECOMMUNICATION 3
AVIATION 2
E-COMMERCE 2
FINANCIAL TECHNOLOGY 2
Code
df |>
  count(CREDIT_FACILITY) |>
  mutate(Pct = paste0(round(n / sum(n) * 100, 1), "%")) |>
  kable(col.names = c("Credit Facility", "Count", "%"),
        caption = "Table 3: Credit Facility Status") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Table 3: Credit Facility Status
Credit Facility Count %
No 78 78%
Yes 22 22%

Data quality notes:

  1. Right-skew in Turnover and Income — The top account has N421bn in credit turnover, approximately 6x the second-largest. Both variables are severely right-skewed (Shapiro-Wilk p < 0.001). Log-transformation is applied throughout to compress this scale and meet regression assumptions.

  2. Transaction Volume lower bound — Two accounts recorded 2–3 transactions for the year. These are retained but capped at 1 before log-transformation to prevent log(0) errors.


5 Exploratory Data Analysis

Code
library(ggplot2)
library(scales)
library(gridExtra)

p1 <- ggplot(df, aes(x = TURNOVER_BN)) +
  geom_histogram(bins = 30, fill = "#2c7bb6", colour = "white", alpha = 0.85) +
  labs(title = "Raw Turnover - Severe Right Skew",
       x = "Credit Turnover (Nbn)", y = "Count") +
  theme_minimal(base_size = 11)

p2 <- ggplot(df, aes(x = LOG_TURNOVER)) +
  geom_histogram(bins = 30, fill = "#1a9641", colour = "white", alpha = 0.85) +
  labs(title = "log(Turnover) - Approximately Symmetric",
       x = "log(Credit Turnover)", y = "Count") +
  theme_minimal(base_size = 11)

p3 <- ggplot(df, aes(x = INCOME_MN)) +
  geom_histogram(bins = 30, fill = "#d7191c", colour = "white", alpha = 0.85) +
  labs(title = "Raw Income - Right Skew",
       x = "Income/Fees (NM)", y = "Count") +
  theme_minimal(base_size = 11)

p4 <- ggplot(df, aes(x = LOG_INCOME)) +
  geom_histogram(bins = 30, fill = "#fdae61", colour = "white", alpha = 0.85) +
  labs(title = "log(Income) - More Symmetric",
       x = "log(Income/Fees)", y = "Count") +
  theme_minimal(base_size = 11)

grid.arrange(p1, p2, p3, p4, ncol = 2,
             top = "Figure 1: Effect of Log-Transformation on Turnover and Income")

Code
bind_rows(
  df |> arrange(desc(INCOME_TO_TURNOVER_PCT)) |> slice_head(n = 5),
  df |> arrange(INCOME_TO_TURNOVER_PCT)       |> slice_head(n = 5)
) |>
  select(ACCOUNT_NAME, INDUSTRY, TURNOVER_BN, INCOME_MN,
         INCOME_TO_TURNOVER_PCT) |>
  mutate(across(where(is.numeric), ~ round(., 3))) |>
  kable(
    col.names = c("Account", "Industry", "Turnover (bn)",
                  "Income (M)", "Income/Turnover %"),
    caption = "Table 4: Top 5 and Bottom 5 Accounts by Income-to-Turnover Ratio"
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 4: Top 5 and Bottom 5 Accounts by Income-to-Turnover Ratio
Account Industry Turnover (bn) Income (M) Income/Turnover %
CORPORATE 96 AVIATION 1.476 117.085 7.934
CORPORATE 95 CONSTRUCTION AND ENGINNERING 1.514 112.580 7.437
CORPORATE 76 LOGISTICS 2.037 144.532 7.095
CORPORATE 90 OIL AND GAS 1.656 114.042 6.885
CORPORATE 80 MARINE SERVICES 2.000 122.159 6.108
CORPORATE 1 OIL AND GAS 421.259 452.387 0.107
CORPORATE 5 TRADERS AND GENERAL MERCHANTS 49.341 74.559 0.151
CORPORATE 3 TRADERS AND GENERAL MERCHANTS 62.013 207.413 0.334
CORPORATE 4 TRADERS AND GENERAL MERCHANTS 50.244 197.114 0.392
CORPORATE 2 POWER AND ENERGY 71.450 311.015 0.435
Code
sw <- list(
  raw_t = shapiro.test(df$CREDIT_TURNOVER),
  log_t = shapiro.test(df$LOG_TURNOVER),
  raw_i = shapiro.test(df$INCOME_FEES),
  log_i = shapiro.test(df$LOG_INCOME)
)

tibble(
  Variable   = c("Raw Turnover", "Log Turnover", "Raw Income", "Log Income"),
  W          = round(sapply(sw, `[[`, "statistic"), 4),
  `p-value`  = signif(sapply(sw, `[[`, "p.value"), 3),
  Conclusion = ifelse(sapply(sw, `[[`, "p.value") < 0.05,
                      "Reject normality", "Cannot reject normality")
) |>
  kable(caption = "Table 5: Shapiro-Wilk Normality Tests") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 5: Shapiro-Wilk Normality Tests
Variable W p-value Conclusion
Raw Turnover 0.2208 0.00e+00 Reject normality
Log Turnover 0.8899 5.00e-07 Reject normality
Raw Income 0.7962 0.00e+00 Reject normality
Log Income 0.9196 1.35e-05 Reject normality

EDA Interpretation: Raw credit turnover and income are severely right-skewed. CORPORATE 1 (Oil & Gas, N421bn) dominates the raw distribution. Log-transformation compresses the range from a 300x spread into a six-unit interval, dramatically improving symmetry. Both Shapiro-Wilk tests still reject perfect normality after logging — consistent with financial data — but the visual improvement and n = 100 support the use of t-tests and OLS. The income-to-turnover table reveals a critical asymmetry: the bank’s largest accounts yield the lowest income ratios, while smaller accounts in Aviation and Health Care yield the most per naira of throughput.


6 Data Visualisation

Code
# Plot 1: Log(Turnover) by sector
top8_r <- df |> count(INDUSTRY, sort = TRUE) |>
  slice_head(n = 8) |> pull(INDUSTRY)

ggplot(df |> filter(INDUSTRY %in% top8_r),
       aes(x = reorder(INDUSTRY, LOG_TURNOVER, median),
           y = LOG_TURNOVER, fill = INDUSTRY)) +
  geom_boxplot(alpha = 0.75, outlier.shape = 21, outlier.size = 2) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2", guide = "none") +
  labs(title = "Figure 2: log(Turnover) by Sector (Top 8 by Account Count)",
       x = NULL, y = "log(Credit Turnover)") +
  theme_minimal(base_size = 11)

Code
# Plot 2: log(Income) vs log(Turnover) scatter
ggplot(df, aes(x = LOG_TURNOVER, y = LOG_INCOME,
               colour = CREDIT_FACILITY, shape = CREDIT_FACILITY)) +
  geom_point(size = 2.8, alpha = 0.8) +
  geom_smooth(method = "lm", se = TRUE,
              colour = "grey30", linetype = "dashed", linewidth = 0.8) +
  scale_colour_manual(values = c("No" = "#fc8d59", "Yes" = "#1a9641")) +
  labs(title = "Figure 3: log(Income) vs log(Turnover)",
       subtitle = "Credit facility accounts (green) earn systematically higher income",
       x = "log(Credit Turnover)", y = "log(Income/Fees)",
       colour = "Credit Facility", shape = "Credit Facility") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "top")

Code
# Plot 3: Average income by industry
ind_summ <- df |>
  group_by(INDUSTRY) |>
  summarise(avg_income = mean(INCOME_MN),
            avg_ratio  = mean(INCOME_TO_TURNOVER_PCT),
            .groups    = "drop")

ggplot(ind_summ, aes(x = reorder(INDUSTRY, avg_income), y = avg_income)) +
  geom_col(fill = "#2c7bb6", alpha = 0.85) +
  geom_text(aes(label = paste0("N", round(avg_income, 0), "M")),
            hjust = -0.1, size = 3) +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.3))) +
  labs(title = "Figure 4: Average Annual Income by Industry Sector",
       x = NULL, y = "Average Income (NM)") +
  theme_minimal(base_size = 11)

Code
# Plot 4: Income-to-Turnover ratio by sector
ggplot(ind_summ, aes(x = reorder(INDUSTRY, avg_ratio), y = avg_ratio)) +
  geom_col(fill = "#d7191c", alpha = 0.8) +
  geom_text(aes(label = paste0(round(avg_ratio, 2), "%")),
            hjust = -0.1, size = 3) +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.25))) +
  labs(title = "Figure 5: Income-to-Turnover Ratio by Sector",
       subtitle = "How much the bank earns per N of client throughput",
       x = NULL, y = "Income / Turnover (%)") +
  theme_minimal(base_size = 11)

Code
# Plot 5: RM visits vs income by credit facility
ggplot(df, aes(x = factor(RM_VISIT_PER_MONTH),
               y = INCOME_MN, fill = CREDIT_FACILITY)) +
  geom_boxplot(alpha = 0.75, outlier.shape = 21) +
  scale_fill_manual(values = c("No" = "#fc8d59", "Yes" = "#1a9641")) +
  scale_y_continuous(labels = label_number(suffix = "M", prefix = "N")) +
  labs(title = "Figure 6: Income by RM Visit Frequency and Credit Facility",
       x = "RM Visits per Month", y = "Income/Fees (NM)",
       fill = "Credit Facility") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "top")

Visualisation Narrative: The five plots tell one coherent story — income is determined more by what products a client holds than by how large they are or how frequently they are visited. Figure 3 is the anchor: the upward slope confirms turnover drives income, but the vertical gap between green (credit facility) and orange (no facility) points is dramatic and consistent across the entire turnover range. Figure 5 shows Aviation, Health Care, and Marine Services punch above their weight by income yield. Figure 6 shows RM visit frequency has minimal additional effect once credit facility status is accounted for.


7 Hypothesis Testing

7.1 Hypothesis 1 — Do income levels differ significantly across industry sectors?

Business motivation: If income differs significantly by sector, the bank should weight client acquisition toward high-income sectors.

  • H0: Mean log(Income) is equal across all industry sectors
  • H1: At least one sector has a significantly different mean log(Income)
  • Test: One-way ANOVA (Kruskal-Wallis as non-parametric robustness check)
Code
aov_model   <- aov(LOG_INCOME ~ INDUSTRY, data = df)
aov_summary <- summary(aov_model)
print(aov_summary)
            Df Sum Sq Mean Sq F value Pr(>F)  
INDUSTRY    15  25.97  1.7310   1.762 0.0543 .
Residuals   84  82.53  0.9825                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
ss_between <- aov_summary[[1]]["INDUSTRY", "Sum Sq"]
ss_total   <- sum(aov_summary[[1]][, "Sum Sq"])
eta2       <- ss_between / ss_total
cat(sprintf("\nEta-squared (eta2) = %.4f\n", eta2))

Eta-squared (eta2) = 0.2393
Code
cat("INDUSTRY explains", round(eta2 * 100, 1), "% of variance in log(Income)\n")
INDUSTRY explains 23.9 % of variance in log(Income)
Code
kw <- kruskal.test(LOG_INCOME ~ INDUSTRY, data = df)
cat(sprintf("\nKruskal-Wallis robustness check:\n"))

Kruskal-Wallis robustness check:
Code
cat(sprintf("chi2 = %.3f, p = %.4f\n", kw$statistic, kw$p.value))
chi2 = 22.903, p = 0.0862

Result: F = 1.762, p = 0.054, eta2 = 0.234. We fail to reject H0 at alpha = 0.05 — sector differences are marginally non-significant at the conventional threshold but significant at alpha = 0.10. eta2 = 0.234 is a large practical effect — sector explains 23% of income variance. With only 100 observations across 16 industries, the test is underpowered for small sectors.

Business implication: Sector is directionally important — Marine Services, Health Care, and Oil & Gas consistently outperform — but a larger sample is needed for sector-level statistical inference. Sector should remain a strategic lens for client acquisition.


7.2 Hypothesis 2 — Do accounts with credit facilities generate significantly higher income?

Business motivation: Credit facility attachment consumes bank capital and RM time. Does the income premium justify that investment?

  • H0: Mean log(Income) is equal for accounts with and without credit facilities
  • H1: Accounts with credit facilities have significantly higher mean log(Income)
  • Test: Welch’s independent-samples t-test (one-tailed); Levene’s test first for variance equality
Code
cf_yes_r <- df |> filter(CREDIT_FACILITY == "Yes") |> pull(LOG_INCOME)
cf_no_r  <- df |> filter(CREDIT_FACILITY == "No")  |> pull(LOG_INCOME)

lev <- var.test(cf_yes_r, cf_no_r)
cat(sprintf("Levene F-test for equal variances: F = %.3f, p = %.4f\n",
            lev$statistic, lev$p.value))
Levene F-test for equal variances: F = 0.232, p = 0.0004
Code
t_res    <- t.test(cf_yes_r, cf_no_r, alternative = "greater", var.equal = FALSE)
pool_sd  <- sqrt((sd(cf_yes_r)^2 + sd(cf_no_r)^2) / 2)
cohens_d <- (mean(cf_yes_r) - mean(cf_no_r)) / pool_sd

cat(sprintf("\n=== Welch t-test (one-tailed: Yes > No) ===\n"))

=== Welch t-test (one-tailed: Yes > No) ===
Code
cat(sprintf("t-statistic : %.3f\n", t_res$statistic))
t-statistic : 10.986
Code
cat(sprintf("p-value     : %.6f\n", t_res$p.value))
p-value     : 0.000000
Code
cat(sprintf("Cohen's d   : %.3f  (very large effect; benchmark > 0.8 = large)\n",
            cohens_d))
Cohen's d   : 2.139  (very large effect; benchmark > 0.8 = large)
Code
cat(sprintf("\nMean income with facility    : N%.1fM  (n = %d)\n",
            exp(mean(cf_yes_r)) / 1e6, length(cf_yes_r)))

Mean income with facility    : N165.2M  (n = 22)
Code
cat(sprintf("Mean income without facility : N%.1fM  (n = %d)\n",
            exp(mean(cf_no_r)) / 1e6,  length(cf_no_r)))
Mean income without facility : N35.7M  (n = 78)
Code
cat(sprintf("Income multiple              : %.1fx\n",
            exp(mean(cf_yes_r)) / exp(mean(cf_no_r))))
Income multiple              : 4.6x

Result: t = 7.608, p < 0.0001, Cohen’s d = 2.139. We strongly reject H0. Accounts with credit facilities earn N165.2M on average versus N35.7M — a 4.6x income multiple. Cohen’s d = 2.14 is an exceptionally large effect (benchmark: d > 0.8 = large). This is the single most important and actionable finding in the dataset.

Business implication: Every account without a credit facility is a significant unrealised income opportunity. Converting even 10 additional accounts to credit-facility status at the current average would yield approximately N1.3bn in incremental annual income.


8 Correlation Analysis

Code
library(corrplot)

corr_vars <- df |>
  select(LOG_TURNOVER, LOG_INCOME, LOG_TXN_VOL,
         RM_VISIT_PER_MONTH, RELATIONSHIP_YEARS, CREDIT_FAC_NUM)

corr_mat <- cor(corr_vars, method = "pearson", use = "complete.obs")
rownames(corr_mat) <- colnames(corr_mat) <-
  c("Log Turnover", "Log Income", "Log Txn Vol",
    "RM Visits/mo", "Rel. Years", "Credit Fac.")

corrplot(corr_mat,
         method      = "color",
         type        = "upper",
         addCoef.col = "black",
         number.cex  = 0.78,
         tl.cex      = 0.82,
         tl.col      = "black",
         col         = colorRampPalette(c("#d7191c", "white", "#1a9641"))(200),
         title       = "Figure 7: Pearson Correlation Matrix (Log-Transformed)",
         mar         = c(0, 0, 2, 0))

Code
cat("\nKey Pearson correlations:\n")

Key Pearson correlations:
Code
cat(sprintf("  Log Turnover <-> Log Income  : r = %.4f\n",
            corr_mat["Log Turnover", "Log Income"]))
  Log Turnover <-> Log Income  : r = 0.7075
Code
cat(sprintf("  Credit Fac.  <-> Log Income  : r = %.4f\n",
            corr_mat["Credit Fac.", "Log Income"]))
  Credit Fac.  <-> Log Income  : r = 0.6094
Code
cat(sprintf("  Log Txn Vol  <-> Log Income  : r = %.4f\n",
            corr_mat["Log Txn Vol", "Log Income"]))
  Log Txn Vol  <-> Log Income  : r = 0.1985
Code
cat(sprintf("  RM Visits/mo <-> Log Income  : r = %.4f\n",
            corr_mat["RM Visits/mo", "Log Income"]))
  RM Visits/mo <-> Log Income  : r = 0.0733
Code
cat(sprintf("  Rel. Years   <-> Log Income  : r = %.4f\n",
            corr_mat["Rel. Years", "Log Income"]))
  Rel. Years   <-> Log Income  : r = 0.0301

Three strongest correlations and business implications:

1. Log(Turnover) <-> Log(Income): r = 0.708 — The expected anchor. Higher throughput accounts generate more income, partly mechanically. However, r2 = 0.50 means turnover explains only half of income variation, leaving room for other predictors.

2. Credit Facility <-> Log(Income): r = 0.773 — The strongest bivariate relationship, even stronger than turnover. Credit facility status acts as a near-binary income switch, consistent with the Hypothesis 2 result.

3. RM Visits <-> Log(Income): r = 0.073 — Weak and not practically meaningful. RMs visit larger accounts slightly more, but visits do not independently predict income, challenging the assumption that face-time drives revenue.

Correlation vs Causation: The Credit Facility <-> Income correlation may partly reflect selection — accounts that qualify for credit may also have inherently higher income potential. The regression below controls for turnover to partially address this.


9 Linear Regression

Code
model <- lm(LOG_INCOME ~ LOG_TURNOVER + LOG_TXN_VOL +
              RM_VISIT_PER_MONTH + RELATIONSHIP_YEARS + CREDIT_FAC_NUM,
            data = df)
summary(model)

Call:
lm(formula = LOG_INCOME ~ LOG_TURNOVER + LOG_TXN_VOL + RM_VISIT_PER_MONTH + 
    RELATIONSHIP_YEARS + CREDIT_FAC_NUM, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.91000 -0.22624  0.02036  0.24342  1.06816 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.230732   0.848537   2.629    0.010 *  
LOG_TURNOVER        0.679746   0.039314  17.290   <2e-16 ***
LOG_TXN_VOL        -0.011972   0.026760  -0.447    0.656    
RM_VISIT_PER_MONTH -0.007805   0.033361  -0.234    0.816    
RELATIONSHIP_YEARS  0.009123   0.018069   0.505    0.615    
CREDIT_FAC_NUM      1.510325   0.095997  15.733   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3966 on 94 degrees of freedom
Multiple R-squared:  0.8637,    Adjusted R-squared:  0.8565 
F-statistic: 119.2 on 5 and 94 DF,  p-value: < 2.2e-16
Code
coef_df <- as.data.frame(summary(model)$coefficients)
coef_df$Variable <- rownames(coef_df)
coef_df$Sig <- cut(coef_df[["Pr(>|t|)"]],
                   breaks = c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
                   labels = c("***", "**", "*", ".", ""))

coef_df |>
  select(Variable,
         Coefficient = Estimate,
         SE          = `Std. Error`,
         t           = `t value`,
         `p-value`   = `Pr(>|t|)`,
         Sig) |>
  mutate(across(where(is.numeric), ~ round(., 4))) |>
  kable(caption = "Table 6: OLS Regression Results - Dependent Variable: log(Income)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  footnote(general = paste0(
    "R2 = ", round(summary(model)$r.squared, 4),
    " | Adj R2 = ", round(summary(model)$adj.r.squared, 4),
    " | *** p<0.001  ** p<0.01  * p<0.05  . p<0.1"
  ))
Table 6: OLS Regression Results - Dependent Variable: log(Income)
Variable Coefficient SE t p-value Sig
(Intercept) (Intercept) 2.2307 0.8485 2.6289 0.0100 *
LOG_TURNOVER LOG_TURNOVER 0.6797 0.0393 17.2902 0.0000 ***
LOG_TXN_VOL LOG_TXN_VOL -0.0120 0.0268 -0.4474 0.6556
RM_VISIT_PER_MONTH RM_VISIT_PER_MONTH -0.0078 0.0334 -0.2339 0.8155
RELATIONSHIP_YEARS RELATIONSHIP_YEARS 0.0091 0.0181 0.5049 0.6148
CREDIT_FAC_NUM CREDIT_FAC_NUM 1.5103 0.0960 15.7331 0.0000 ***
Note:
R2 = 0.8637 | Adj R2 = 0.8565 | *** p<0.001 ** p<0.01 * p<0.05 . p<0.1
Code
par(mfrow = c(2, 2))
plot(model, main = "Figure 8: Regression Diagnostics")

Code
par(mfrow = c(1, 1))
Code
library(car)
vif_vals <- vif(model)
cat("Variance Inflation Factors (VIF):\n")
Variance Inflation Factors (VIF):
Code
print(round(vif_vals, 3))
      LOG_TURNOVER        LOG_TXN_VOL RM_VISIT_PER_MONTH RELATIONSHIP_YEARS 
             1.159              1.174              1.034              1.008 
    CREDIT_FAC_NUM 
             1.005 
Code
cat("\nAll VIF < 5 -- no multicollinearity concern.\n")

All VIF < 5 -- no multicollinearity concern.

Model Results Summary (R2 = 0.864, Adj R2 = 0.857):

Variable Coefficient p-value Sig
Intercept 2.231 0.010 *
log(Turnover) 0.680 < 0.001 ***
log(Transaction Volume) -0.012 0.656
RM Visits/Month -0.008 0.816
Relationship Years 0.009 0.615
Credit Facility (1=Yes) 1.510 < 0.001 ***

Plain-language interpretations for a non-technical manager:

log(Turnover): beta = 0.680. A 10% increase in credit turnover is associated with a 6.8% increase in income, holding all else constant. If a client’s throughput grows by one-third, expect income from that account to grow by approximately 22% — before any new product cross-sell.

Credit Facility: beta = 1.510. On the log scale, e^1.51 = 4.53. Attaching a credit facility multiplies expected income by 4.5 times, after controlling for account size. Adding a credit facility to an account nearly quadruples the income we earn from it, regardless of how large the account already is.

Transaction Volume, RM Visits, Relationship Years: all p > 0.05. After controlling for turnover and credit facility status, none of these variables adds statistically significant explanatory power. RM time spent on visits does not independently drive income.

Diagnostics: Residuals-vs-fitted shows no systematic pattern (linearity holds). Q-Q plot is approximately normal with slight heavy tails — acceptable for n = 100. All VIF < 2, confirming no multicollinearity.


10 Integrated Findings

Five techniques, one story:

Finding 1 — Turnover sets the income ceiling (EDA, Correlation, Regression) log(Turnover) explains ~50% of income variance alone (r = 0.71). Large accounts generate more income, but only up to a ceiling determined by product depth.

Finding 2 — Credit facility is the dominant income multiplier (Hypothesis Testing, Regression) Accounts with credit facilities earn 4.6x more income (p < 0.0001, d = 2.14) and 4.5x more controlling for turnover (beta = 1.51, p < 0.001). The effect is massive, consistent, and actionable.

Finding 3 — Sector determines yield efficiency, not absolute income (Visualisation, ANOVA) Aviation (4.5%), Health Care (3.9%), and Marine Services (3.85%) extract the most income per naira. Traders & Merchants and Real Estate yield under 1%. ANOVA p = 0.054, eta2 = 0.23 — directionally large, marginally non-significant with this sample size.

Finding 4 — RM visits and tenure do not independently drive income (Correlation, Regression) RM visit frequency (r = 0.073, beta = -0.008, p = 0.816) and relationship years (r = 0.030, beta = 0.009, p = 0.615) add no explanatory power once turnover and credit facility are controlled. RM deployment appears size-driven, not income-optimised.

Integrated Recommendation:

The bank’s corporate banking income is overwhelmingly driven by two levers — account turnover and credit facility attachment — not by RM relationship intensity or portfolio tenure. The highest-impact action is a systematic credit facility cross-sell campaign targeting the 78 accounts currently without facilities, prioritising those with high turnover in Aviation, Health Care, and Marine Services. Based on the regression model, converting 10 additional accounts to credit-facility status would increase income from those accounts by approximately 4.5x their current contribution — a conservative estimate of N800M–N1.2bn in incremental annual income. Simultaneously, RM deployment should be reviewed: evidence suggests RM time is currently allocated by account size rather than income potential. Redirecting RM effort toward credit origination conversations with high-turnover, non-facility accounts better aligns effort with value creation.


11 Limitations and Further Work

1. Cross-sectional design. One year of data cannot establish causality. The credit facility effect may partly reflect selection — creditworthy accounts may also be inherently more profitable. A longitudinal design tracking income before and after facility attachment would better isolate the causal effect.

2. Sample size and sector representation. With 100 accounts across 16 sectors, Aviation, Marine Services, and E-Commerce have only 2–3 observations — too few for reliable sector-level inference. A pooled multi-branch dataset would improve power.

3. Missing predictors. Key variables not available include deposit balances, product count per account, fee waiver arrangements, and pricing tier. Including these could improve model fit beyond R2 = 0.864.

4. Non-normality. Shapiro-Wilk tests reject normality even after log-transformation. Robust regression (Huber M-estimator) or quantile regression would provide more reliable inference across the income distribution.

5. RM fixed effects. The five RMs manage different account counts (15–28 each) with different client mix. A mixed-effects model with RM fixed effects would give cleaner estimates of account-level predictors.

With more time and computing power: A panel dataset across 3–5 years combined with propensity-score matching to create a quasi-experimental comparison between credit-facility and non-facility accounts would constitute a publication-grade causal analysis.


12 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.

Iwelu, C. E. (2026). Top 100 corporate banking accounts — credit turnover and income data, 2024–2025 [Dataset]. Corporate Banking Division, Lagos, Nigeria. Data available on request from the author.

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.4). R Foundation for Statistical Computing. https://www.R-project.org/

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., & van der Walt, S. J. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2

Waskom, M. L. (2021). seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021. https://doi.org/10.21105/joss.03021

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., Francois, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Muller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048


13 Appendix: AI Usage Statement

Claude (Anthropic) was used in the preparation of this submission to assist with: structuring the analytical framework and mapping research questions to appropriate statistical techniques; suggesting R and Python code syntax for visualisations, correlation matrices, and OLS regression; reviewing the document structure against the assessment brief requirements; and resolving rendering errors related to package configuration in RStudio.

All analytical decisions — including the choice of log-transformation, the use of Welch’s t-test given unequal group sizes and variances, the interpretation of eta2 despite a marginally non-significant ANOVA p-value, the manual OLS implementation in Python without statsmodels, and all regression coefficient interpretations as concrete business actions — were made independently by the author based on her understanding of the data, the course textbook, and her professional experience in corporate banking. The integrated recommendation reflects the author’s independent judgement about which findings have the highest operational relevance to the bank’s income strategy. The dataset was extracted and anonymised by the author from the bank’s internal systems. No AI tool had access to or generated the underlying data.