Customer Behavioural Pattern Analysis

Business Banking Group — Exploratory & Inferential Analytics

Author

Adeuyi Oluwatobi

Published

May 19, 2026

1 Executive Summary

This analysis examines behavioural and financial patterns across 250,000 customers in the Business Banking Group’s April 2026 portfolio. Five analytical techniques — Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Logistic Regression — are applied to a 26-variable dataset spanning digital engagement, transaction volumes, credit ratings, and deposit balances. Key findings show that only 30.6% of customers are digitally active; funded customers hold statistically significantly higher deposit balances than their unfunded counterparts; and digital adoption varies markedly across customer segments, with upper-tier segments (PLATINUM PLUS, HNI, LARGE BUSINESS) exceeding 60% digital activation while MASS RETAIL lags below 30%. Correlation analysis reveals that transaction frequency and deposit balance are the variables most strongly associated with both digital engagement and portfolio profitability. A logistic regression model confirms that funding status, account activity, and transaction frequency are the dominant predictors of digital activation. The integrated recommendation is a Funded-Customer Digital Activation Programme targeting funded-but-inactive MASS RETAIL and SMALL BUSINESS customers — the portfolio’s largest addressable opportunity — using balance-triggered prompts and guided digital onboarding to convert dormant value into active revenue.


2 Professional Disclosure

Name: Adeuyi Oluwatobi Job Title: Group Head of Business Development Organisation: Business Banking Group Analysis Period: 1 April – 30 April 2026

2.1 Technique Relevance to Business Context

Exploratory Data Analysis (EDA)

As Group Head of Business Development, portfolio intelligence forms the starting point of every strategic decision. Before launching segment-specific campaigns, proposing digital product investments, or presenting portfolio health reviews to the Board, I need a robust baseline understanding of the customer book — who the customers are, how their financial activity is distributed, where data gaps and anomalies reside, and which variables carry genuine analytical signal. Managing over 250,000 customer relationships requires systematic EDA to detect concentration risk in dormant or unfunded accounts, validate the reliability of data underlying relationship managers’ KPIs, and identify early signals of behavioural shift across segments. Without this foundational step, every downstream recommendation — however sophisticated the modelling — rests on unverified assumptions. EDA is where the Group Head of Business Development separates credible insight from coincidence.

Data Visualisation

Strategic communication is central to my role. Whether presenting a portfolio health update to the Managing Director, making the case for digital channel investment before the Executive Committee, or coaching relationship managers on segment performance targets, I depend on clear and accurate visual narratives to drive decisions at pace. A well-designed chart communicates in seconds what pages of tables obscure. In the Business Banking Group, the speed at which a senior stakeholder grasps a trend — the gap between digital enrolment and active digital use, the distribution of deposit balances by customer segment, the profitability differential between active and dormant accounts — determines whether resources are committed in a single meeting or delayed by weeks of clarification. Data visualisation is therefore not a cosmetic output but the primary medium through which analytical findings become bankable business decisions.

Hypothesis Testing

Business development strategy is routinely built on competing intuitions: “funded customers are more profitable,” “digitally active customers transact more,” “credit risk profiles differ by segment.” Without formal statistical testing, these remain organisational beliefs that cannot withstand scrutiny from risk, finance, or compliance colleagues. Hypothesis testing provides the rigorous framework to confirm or reject such claims at a predefined confidence level, making evidence-based proposals defensible to governance committees and auditable for regulatory purposes. In my day-to-day work, I apply statistical reasoning when evaluating whether a digital campaign has genuinely shifted engagement metrics, whether a new segment strategy outperforms the historical baseline, or whether an observed difference in customer behaviour between regions is statistically real or a sampling artefact that would disappear with more data.

Correlation Analysis

Understanding which customer attributes move together — and in which direction and magnitude — is foundational to cross-sell strategy, risk management, and product design in the Business Banking Group. I need to know whether higher deposit balances correlate with greater transaction frequency, whether digital engagement co-varies with profitability, and whether credit risk rating is associated with account dormancy. Correlation analysis surfaces these relationships systematically and quantitatively, enabling the business development function to direct relationship manager attention toward the attributes most predictive of revenue uplift and to avoid confusing correlated variables with independent levers of performance. It also identifies multicollinearity concerns before regression models are fitted, ensuring that recommendations are grounded in truly independent drivers rather than repackaged versions of the same signal.

Logistic Regression

Digital activation is the Business Banking Group’s foremost strategic objective for 2026. Identifying the customer characteristics that predict whether an account becomes digitally active allows me to design precision-targeted activation campaigns rather than expensive, low-conversion mass outreach programmes. Logistic regression quantifies each predictor’s marginal contribution — deposit balance, transaction frequency, funding status, credit rating — to the probability of digital activation, producing odds ratios that translate directly into executable business priorities. If funding status dominates, relationship managers should prioritise converting UNFUNDED accounts before attempting digital enrolment; if transaction frequency is the key driver, incentive structures should reward account usage first, then digital adoption. Regression output thus becomes the blueprint for a data-driven business development playbook rather than an anecdotal guide.


3 Data Collection & Sampling

Attribute Details
Source Internal customer management system, Business Banking Group
Collection method Full portfolio extract from the core banking system
Sampling frame All customer accounts registered or active as at 30 April 2026
Sample size 250,000 customer records (full census — no sampling applied)
Time period 1 April 2026 – 30 April 2026 (single-month portfolio snapshot)
Variables 26 variables: identifiers (anonymised), activity and funding status, customer segment, annual lodgement and turnover volumes, digital engagement indicators, credit rating, behavioural product usage levels, and deposit balance
Extraction tool Core banking analytics module; exported to Microsoft Excel (.xlsx)

Data provenance: The dataset was extracted directly from the bank’s core customer information management system under the author’s oversight as Group Head of Business Development. No third-party, purchased, or publicly downloaded data was used at any stage. The dataset constitutes a full census of the April 2026 portfolio, eliminating sampling error as a source of inferential uncertainty.

Ethical compliance and anonymisation: All personally identifiable information has been anonymised prior to publication in accordance with the Nigeria Data Protection Act (NDPA) 2023. The original cust_key (unique customer reference number) has been replaced with a pseudonymous identifier derived from a SHA-256 cryptographic hash of the original value; the same input always produces the same output, ensuring reproducibility without exposing the underlying identifier. The id_num column, which contained National Identity Numbers (NINs) and other government-issued identification document numbers, has been removed entirely from the published dataset. No individual customer can be re-identified from the data as published. Written approval for this extraction was obtained from the organisation’s Chief Risk Officer prior to analysis.

Statistical rationale for full-census approach: With 250,000 observations — 2,500 times the minimum case-study requirement — the dataset provides extreme statistical power. Even trivially small effects (Cohen’s d < 0.01) will achieve p < 0.05 at this sample size. All inferential results are therefore reported alongside effect size measures (rank-biserial correlation r, Cramér’s V) to distinguish statistical significance from practical or managerial significance.


4 Data Description

4.1 Data Loading and Cleaning

Code
# ── Load raw data ─────────────────────────────────────────────────────────────
raw_df <- read_excel("SOURCE DATA.xlsx")

# ── Anonymise PII ─────────────────────────────────────────────────────────────
# Replace cust_key with SHA-256 hash prefix; remove id_num entirely
set.seed(2026)
raw_df <- raw_df |>
  mutate(
    cust_id = paste0("CUST_",
                     sapply(cust_key,
                            \(x) substr(digest(x, algo = "sha256"), 1, 8))),
    start_date           = as.Date("2026-04-01"),
    end_date             = as.Date("2026-04-30"),
    analysis_period_days = 30L,
    .before = 1
  ) |>
  select(-cust_key, -id_num)

# ── Standardise c_rating → LOW / MEDIUM / HIGH ordinal ───────────────────────
# Raw field mixed text labels (LOW RISK, MEDIUM RISK, HIGH RISK, MEDIUM) with
# numeric scores (1.4–2.6) arising from a dual-source rating system.
# Numeric scores are mapped by threshold: <2.0 → LOW, 2.0–2.3 → MEDIUM, ≥2.4 → HIGH
df <- raw_df |>
  mutate(
    c_rating_clean = case_when(
      c_rating == "LOW RISK"                    ~ "LOW",
      c_rating %in% c("MEDIUM RISK", "MEDIUM") ~ "MEDIUM",
      c_rating == "HIGH RISK"                   ~ "HIGH",
      c_rating %in% c("1.4", "1.6", "1.8")     ~ "LOW",
      c_rating %in% c("2.0", "2.2")             ~ "MEDIUM",
      c_rating %in% c("2.4", "2.6")             ~ "HIGH",
      TRUE                                       ~ NA_character_
    ),
    c_rating_clean = factor(c_rating_clean,
                            levels  = c("LOW", "MEDIUM", "HIGH"),
                            ordered = TRUE)
  ) |>
  select(-c_rating) |>
  # Drop 57 rows where activity_status is missing
  filter(!is.na(activity_status)) |>
  # Log-transform skewed financial variables (log1p handles zero values)
  mutate(
    log_dep_balance = log1p(dep_balance),
    log_lod_amount  = log1p(anl_lodgement_amount),
    log_turn_amount = log1p(anl_turnover_amount),
    log_lod_count   = log1p(anl_lodgement_count),
    log_turn_count  = log1p(anl_turnover_count)
  )

cat(sprintf("Clean dataset: %s rows × %s columns\n",
            format(nrow(df), big.mark = ","), ncol(df)))
Clean dataset: 249,943 rows × 33 columns
Code
cat(sprintf("Analysis period: %s to %s (%d days)\n",
            min(df$start_date), max(df$end_date), 30L))
Analysis period: 2026-04-01 to 2026-04-30 (30 days)

4.2 Variable Dictionary

Code
var_dict <- tribble(
  ~Variable,               ~Type,        ~Description,
  "cust_id",               "Character",  "Anonymised customer ID (SHA-256 hash prefix of original key)",
  "activity_status",       "Categorical","Account activity: ACTIVE | DORMANT | UNPROFILED",
  "fund_status",           "Categorical","Account funding: FUNDED | UNFUNDED",
  "customer_segment",      "Categorical","Portfolio tier (11 levels: MASS RETAIL → PLATINUM PLUS)",
  "anl_lodgement_amount",  "Numeric",    "Annual lodgement (deposit) amount, NGN",
  "anl_lodgement_count",   "Numeric",    "Annual count of lodgement transactions",
  "anl_turnover_amount",   "Numeric",    "Annual turnover (debit + credit) amount, NGN",
  "anl_turnover_count",    "Numeric",    "Annual count of turnover transactions",
  "is_digital",            "Binary",     "1 = enrolled on a digital channel; 0 = not enrolled",
  "is_digital_active",     "Binary",     "1 = actively uses digital channels; 0 = inactive",
  "make_money",            "Numeric",    "Normalised profitability score (0 = no revenue; 1 = maximum)",
  "c_rating_clean",        "Ordinal",    "Credit risk rating: LOW < MEDIUM < HIGH (cleaned from mixed source)",
  "airtime_level",         "Ordinal",    "Airtime bill-payment engagement level (1–3); NA = no engagement",
  "airtime_mths",          "Numeric",    "Number of months with airtime transactions (1–12)",
  "pay_tv_level",          "Ordinal",    "Pay-TV bill-payment engagement level (1–3)",
  "pay_tv_mths",           "Numeric",    "Months with pay-TV bill-payment transactions",
  "religious_phil_level",  "Ordinal",    "Religious/philanthropic payment engagement level (1–3)",
  "religious_phil_mths",   "Numeric",    "Months with religious/philanthropic transactions",
  "groceries_level",       "Ordinal",    "Grocery payment engagement level (1–3)",
  "groceries_mths",        "Numeric",    "Months with grocery payment transactions",
  "electric_level",        "Ordinal",    "Electricity bill-payment engagement level (1–3)",
  "electric_mths",         "Numeric",    "Months with electricity payment transactions",
  "dig_learn_level",       "Ordinal",    "Digital-learning platform payment level (1–3)",
  "dig_learn_mths",        "Numeric",    "Months with digital-learning transactions",
  "dep_balance",           "Numeric",    "Current deposit balance, NGN",
  "start_date",            "Date",       "Analysis period start date: 1 April 2026",
  "end_date",              "Date",       "Analysis period end date: 30 April 2026",
  "analysis_period_days",  "Integer",    "Duration of the analysis window (30 days)"
)

kbl(var_dict, caption = "Table 1 — Variable dictionary") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) |>
  column_spec(3, width = "28em")
Table 1 — Variable dictionary
Variable Type Description
cust_id Character Anonymised customer ID (SHA-256 hash prefix of original key)
activity_status Categorical Account activity: ACTIVE | DORMANT | UNPROFILED
fund_status Categorical Account funding: FUNDED | UNFUNDED
customer_segment Categorical Portfolio tier (11 levels: MASS RETAIL → PLATINUM PLUS)
anl_lodgement_amount Numeric Annual lodgement (deposit) amount, NGN
anl_lodgement_count Numeric Annual count of lodgement transactions
anl_turnover_amount Numeric Annual turnover (debit + credit) amount, NGN
anl_turnover_count Numeric Annual count of turnover transactions
is_digital Binary 1 = enrolled on a digital channel; 0 = not enrolled
is_digital_active Binary 1 = actively uses digital channels; 0 = inactive
make_money Numeric Normalised profitability score (0 = no revenue; 1 = maximum)
c_rating_clean Ordinal Credit risk rating: LOW < MEDIUM < HIGH (cleaned from mixed source)
airtime_level Ordinal Airtime bill-payment engagement level (1–3); NA = no engagement
airtime_mths Numeric Number of months with airtime transactions (1–12)
pay_tv_level Ordinal Pay-TV bill-payment engagement level (1–3)
pay_tv_mths Numeric Months with pay-TV bill-payment transactions
religious_phil_level Ordinal Religious/philanthropic payment engagement level (1–3)
religious_phil_mths Numeric Months with religious/philanthropic transactions
groceries_level Ordinal Grocery payment engagement level (1–3)
groceries_mths Numeric Months with grocery payment transactions
electric_level Ordinal Electricity bill-payment engagement level (1–3)
electric_mths Numeric Months with electricity payment transactions
dig_learn_level Ordinal Digital-learning platform payment level (1–3)
dig_learn_mths Numeric Months with digital-learning transactions
dep_balance Numeric Current deposit balance, NGN
start_date Date Analysis period start date: 1 April 2026
end_date Date Analysis period end date: 30 April 2026
analysis_period_days Integer Duration of the analysis window (30 days)

5 Exploratory Data Analysis (EDA)

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John W. Tukey (1962)

Anscombe (1973) famously demonstrated with Anscombe’s Quartet that four datasets with identical summary statistics can have radically different distributions and relationships. This section goes beyond headline numbers to surface distribution shape, data quality issues, and outlier structure before any inferential conclusions are drawn.

5.1 Summary Statistics

Code
summarise_var <- function(x, label) {
  tibble(
    Variable   = label,
    N          = format(sum(!is.na(x)), big.mark = ","),
    Mean       = format(round(mean(x, na.rm = TRUE), 1), big.mark = ","),
    Median     = format(round(median(x, na.rm = TRUE), 1), big.mark = ","),
    SD         = format(round(sd(x, na.rm = TRUE), 1), big.mark = ","),
    Min        = format(round(min(x, na.rm = TRUE), 1), big.mark = ","),
    Max        = format(round(max(x, na.rm = TRUE), 1), big.mark = ","),
    `% Zero`   = paste0(round(100 * mean(x == 0, na.rm = TRUE), 1), "%")
  )
}

num_summary <- bind_rows(
  summarise_var(df$anl_lodgement_amount, "Lodgement Amount (NGN)"),
  summarise_var(df$anl_lodgement_count,  "Lodgement Count"),
  summarise_var(df$anl_turnover_amount,  "Turnover Amount (NGN)"),
  summarise_var(df$anl_turnover_count,   "Turnover Count"),
  summarise_var(df$make_money,           "Profitability Score (0–1)"),
  summarise_var(df$dep_balance,          "Deposit Balance (NGN)")
)

kbl(num_summary, caption = "Table 2 — Summary statistics: key numeric variables") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 2 — Summary statistics: key numeric variables
Variable N Mean Median SD Min Max % Zero
Lodgement Amount (NGN) 249,943 4,748,593 0 177,039,363 0 66,505,977,898 63.5%
Lodgement Count 249,943 28.4 0 502.1 0 238,464 63.5%
Turnover Amount (NGN) 249,943 4,926,737 0 200,673,435 0 67,053,483,412 64.5%
Turnover Count 249,943 148.8 0 532.3 0 21,521 64.5%
Profitability Score (0–1) 249,943 0.1 0 0.2 0 1 69.9%
Deposit Balance (NGN) 249,943 204,613.7 84.3 9,757,544 0 2,922,549,307 32.4%
Code
fmt_cat <- function(col_name, label) {
  df |>
    count(.data[[col_name]]) |>
    rename(Category = 1) |>
    mutate(
      Variable = label,
      Category = as.character(Category),
      pct      = percent(n / nrow(df), accuracy = 0.1),
      n        = format(n, big.mark = ",")
    ) |>
    select(Variable, Category, n, `%` = pct)
}

bind_rows(
  fmt_cat("activity_status",  "Activity Status"),
  fmt_cat("fund_status",      "Fund Status"),
  fmt_cat("c_rating_clean",   "Credit Rating (cleaned)")
) |>
  kbl(caption = "Table 3 — Distribution of key categorical variables") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Table 3 — Distribution of key categorical variables
Variable Category n %
Activity Status ACTIVE 92,057 36.8%
Activity Status DORMANT 145,801 58.3%
Activity Status UNPROFILED 12,085 4.8%
Fund Status FUNDED 172,587 69.1%
Fund Status UNFUNDED 77,356 30.9%
Credit Rating (cleaned) LOW 151,874 60.8%
Credit Rating (cleaned) MEDIUM 53,761 21.5%
Credit Rating (cleaned) HIGH 19,246 7.7%
Credit Rating (cleaned) NA 25,062 10.0%

5.2 Data Quality Issues Identified and Remediated

Issue 1 — Mixed data types in c_rating (credit rating field)

The credit rating column in the source data contained two incompatible encoding systems: text labels ("LOW RISK", "MEDIUM RISK", "HIGH RISK", "MEDIUM") and raw numeric scores ("1.4", "1.6", "1.8", "2.0", "2.2", "2.4", "2.6"), likely arising from a migration between two rating engines or manual data entry from parallel systems. This mixed encoding renders the field analytically unusable without standardisation.

Remediation: Numeric scores were mapped to the nearest ordinal category using a threshold rule documented in the cleaning code (< 2.0 → LOW; 2.0–2.3 → MEDIUM; ≥ 2.4 → HIGH). All text labels were harmonised to the same three-level scheme. The result is a consistent ordered factor: LOW < MEDIUM < HIGH. Records with no rating (n = 25,072; 10.0% of the dataset) are retained in the dataset but excluded from analyses where credit rating is a predictor.

Code
tibble(
  Label = c("LOW", "MEDIUM", "HIGH", "Missing / NA"),
  n     = c(sum(df$c_rating_clean == "LOW",    na.rm = TRUE),
            sum(df$c_rating_clean == "MEDIUM", na.rm = TRUE),
            sum(df$c_rating_clean == "HIGH",   na.rm = TRUE),
            sum(is.na(df$c_rating_clean)))
) |>
  mutate(
    `%` = percent(n / nrow(df), accuracy = 0.1),
    n   = format(n, big.mark = ",")
  ) |>
  kbl(caption = "Table 4 — Credit rating distribution after standardisation") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 4 — Credit rating distribution after standardisation
Label n %
LOW 151,874 60.8%
MEDIUM 53,761 21.5%
HIGH 19,246 7.7%
Missing / NA 25,062 10.0%

Issue 2 — Extreme right skew and outliers in financial variables

Annual lodgement amount, turnover amount, and deposit balance all exhibit severe right skew. Median deposit balance is approximately NGN 84, while the maximum exceeds NGN 2.9 billion — a ratio of over 34 million to one. Such distributions violate the normality assumptions of parametric tests and render mean-based summaries misleading.

Code
outlier_tbl <- df |>
  select(dep_balance, anl_lodgement_amount, anl_turnover_amount) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Value") |>
  group_by(Variable) |>
  summarise(
    Median          = median(Value),
    Q75             = quantile(Value, 0.75),
    IQR_val         = IQR(Value),
    Upper_fence     = Q75 + 1.5 * IQR_val,
    Pct_outlier     = percent(mean(Value > Q75 + 1.5 * IQR_val), accuracy = 0.1),
    Max             = max(Value),
    .groups         = "drop"
  ) |>
  mutate(
    Variable    = case_match(Variable,
                    "dep_balance"          ~ "Deposit Balance (NGN)",
                    "anl_lodgement_amount" ~ "Lodgement Amount (NGN)",
                    "anl_turnover_amount"  ~ "Turnover Amount (NGN)"),
    across(c(Median, Q75, Upper_fence, Max), ~ format(round(.x, 0), big.mark = ",")),
    IQR_val = format(round(IQR_val, 0), big.mark = ",")
  ) |>
  rename(`Median` = Median, `Q3` = Q75, `IQR` = IQR_val,
         `Upper Fence (Q3 + 1.5×IQR)` = Upper_fence,
         `% Above Fence` = Pct_outlier, `Max` = Max)

kbl(outlier_tbl,
    caption = "Table 5 — Outlier diagnostics: key financial variables (Tukey fences)") |>
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 5 — Outlier diagnostics: key financial variables (Tukey fences)
Variable Median Q3 IQR Upper Fence (Q3 + 1.5×IQR) % Above Fence Max
Lodgement Amount (NGN) 0 200,100 200,100 500,250 21.2% 66,505,977,898
Turnover Amount (NGN) 0 195,277 195,277 488,193 21.2% 67,053,483,412
Deposit Balance (NGN) 84 1,301 1,301 3,252 19.7% 2,922,549,306

These extreme values represent genuinely high-value customers (large business accounts, HNI relationships) rather than data entry errors — removing them would systematically bias the analysis against the most commercially important customers in the portfolio. Remediation: All financial variables are log-transformed using log₁₊ₓ (log1p) for analysis and visualisation. This transformation compresses the scale, reduces skew, and retains the full dataset including zero-balance accounts.

Issue 3 — Sparse behavioural product engagement columns

The airtime, pay-TV, grocery, electricity, religious/philanthropic, and digital-learning engagement columns have NA rates ranging from 78% (groceries) to >99% (digital learning). These NAs represent absence of bill-payment engagement — the customer did not transact through the bank’s payment gateway for that product category — rather than genuinely missing measurements. These columns are retained in the variable dictionary as contextual metadata but excluded from regression modelling due to their sparsity.

5.3 Missing Value Analysis

Code
df |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") |>
  mutate(
    `% Missing` = percent(Missing / nrow(df), accuracy = 0.1)
  ) |>
  filter(Missing > 0) |>
  arrange(desc(Missing)) |>
  mutate(Missing = format(Missing, big.mark = ",")) |>
  kbl(caption = "Table 6 — Variables with missing values") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Table 6 — Variables with missing values
Variable Missing % Missing
dig_learn_level 249,711 99.9%
dig_learn_mths 249,711 99.9%
electric_level 248,882 99.6%
electric_mths 248,882 99.6%
pay_tv_level 246,304 98.5%
pay_tv_mths 246,304 98.5%
religious_phil_level 234,683 93.9%
religious_phil_mths 234,683 93.9%
groceries_level 231,869 92.8%
groceries_mths 231,869 92.8%
airtime_level 195,647 78.3%
airtime_mths 195,647 78.3%
c_rating_clean 25,062 10.0%

6 Data Visualisation

Five panels. One story: the digital divide in the Business Banking Group’s portfolio.

Code
p1 <- df |>
  count(customer_segment) |>
  mutate(customer_segment = fct_reorder(customer_segment, n)) |>
  ggplot(aes(x = n, y = customer_segment, fill = customer_segment)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = format(n, big.mark = ",")),
            hjust = -0.1, size = 3.2) +
  scale_x_continuous(labels = label_comma(),
                     expand = expansion(mult = c(0, 0.18))) +
  scale_fill_manual(values = pal_seg) +
  labs(
    title    = "Portfolio Composition by Customer Segment",
    subtitle = "MASS RETAIL accounts for ~79% of the portfolio",
    x = "Number of Customers", y = NULL
  ) +
  theme(panel.grid.major.y = element_blank())
p1

Figure 1 — Portfolio composition by customer segment
Code
p2 <- df |>
  count(fund_status, activity_status) |>
  group_by(fund_status) |>
  mutate(pct = n / sum(n)) |>
  ungroup() |>
  ggplot(aes(x = fund_status, y = pct, fill = activity_status)) +
  geom_col(position = "fill", width = 0.6) +
  geom_text(aes(label = percent(pct, accuracy = 1)),
            position = position_fill(vjust = 0.5),
            colour = "white", fontface = "bold", size = 3.8) +
  scale_y_continuous(labels = label_percent()) +
  scale_fill_manual(values = pal_status) +
  labs(
    title    = "Activity Status by Funding Status",
    subtitle = "Funded accounts are 7× more likely to be ACTIVE",
    x = "Funding Status", y = "Proportion of Accounts",
    fill = "Activity Status"
  )
p2

Figure 2 — Activity status composition by funding status
Code
p3 <- df |>
  group_by(customer_segment) |>
  summarise(rate = mean(is_digital_active), .groups = "drop") |>
  mutate(
    customer_segment = fct_reorder(customer_segment, rate),
    above_avg = rate > mean(rate)
  ) |>
  ggplot(aes(x = rate, y = customer_segment, fill = above_avg)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = percent(rate, accuracy = 1)),
            hjust = -0.15, size = 3.2) +
  scale_x_continuous(labels = label_percent(),
                     limits = c(0, 1),
                     expand = expansion(mult = c(0, 0.15))) +
  scale_fill_manual(values = c("FALSE" = "#EF5350", "TRUE" = "#42A5F5")) +
  labs(
    title    = "Digital Activation Rate by Customer Segment",
    subtitle = "Upper-tier segments exceed 60%; MASS RETAIL sits below 30%",
    x = "% Digitally Active", y = NULL
  ) +
  theme(panel.grid.major.y = element_blank())
p3

Figure 3 — Digital activation rate by customer segment
Code
p4 <- df |>
  filter(dep_balance > 0) |>
  mutate(digital_label = if_else(is_digital_active == 1,
                                 "Digitally Active",
                                 "Not Digitally Active")) |>
  ggplot(aes(x = digital_label, y = log_dep_balance, fill = digital_label)) +
  geom_violin(alpha = 0.35, colour = NA) +
  geom_boxplot(width = 0.22, outlier.size = 0.4, alpha = 0.85) +
  scale_fill_manual(values = pal_digital) +
  scale_y_continuous(
    name   = "Deposit Balance — log scale (NGN)",
    breaks = log1p(c(100, 1000, 10000, 100000, 1e6, 1e8)),
    labels = \(x) label_comma()(round(expm1(x)))
  ) +
  labs(
    title    = "Deposit Balance by Digital Activation Status",
    subtitle = "Digitally active customers hold substantially larger balances",
    x = NULL
  ) +
  theme(legend.position = "none")
p4

Figure 4 — Deposit balance distribution by digital activation status (log scale)
Code
p5 <- df |>
  filter(make_money > 0) |>
  ggplot(aes(x = activity_status, y = make_money, fill = activity_status)) +
  geom_violin(alpha = 0.35, colour = NA) +
  geom_boxplot(width = 0.25, outlier.size = 0.3, alpha = 0.85) +
  scale_fill_manual(values = pal_status) +
  scale_y_continuous(labels = label_percent()) +
  labs(
    title    = "Profitability Score by Account Activity Status",
    subtitle = "ACTIVE accounts generate higher profitability; DORMANT accounts trail significantly",
    x = "Activity Status",
    y = "Profitability Score (make_money)"
  ) +
  theme(legend.position = "none")
p5

Figure 5 — Profitability score by account activity status

Visualisation narrative: The five panels collectively tell a coherent story of the portfolio’s digital divide. Panel 1 establishes the portfolio structure: MASS RETAIL accounts constitute approximately 79% of all customer relationships — any group-level strategy must primarily engage this base. Panel 2 reveals that funded accounts are dramatically more likely to be ACTIVE (roughly 45% vs. 6% for unfunded), establishing that funding precedes behavioural engagement. Panel 3 exposes the digital activation gradient: upper-tier segments (PLATINUM PLUS, HNI, LARGE BUSINESS) exceed 60% digital activation, while MASS RETAIL sits below 30% — the segment that most needs digital engagement is the one most behind. Panel 4 confirms that digitally active customers hold materially larger deposit balances even on the log scale; the distributional difference is not a statistical technicality but a commercial reality. Panel 5 closes the chain: ACTIVE accounts generate meaningfully higher profitability scores. Together, the panels trace the value path: funding → activity → digital engagement → revenue. The strategic opportunity is the funded-but-dormant MASS RETAIL customer.


7 Hypothesis Testing

All tests are conducted at the α = 0.05 significance level. Given the sample size of 250,000 observations, effect sizes are reported alongside p-values to distinguish statistical from practical significance (see Adi, 2026, Ch. 6).

7.1 Hypothesis 1 — Deposit Balance and Account Activity Status

Business question: Do ACTIVE customers hold significantly higher deposit balances than DORMANT customers?

Statement
H₀ The distribution of (log) deposit balance is identical for ACTIVE and DORMANT customers
H₁ ACTIVE customers have a higher deposit balance distribution than DORMANT customers

Assumption check: The Shapiro-Wilk test is not applicable at n > 5,000. Density plots and Q-Q plots confirm severe right skew even after log-transformation, violating the normality assumption required for an independent-samples t-test. The Mann-Whitney U test (Wilcoxon rank-sum) is therefore selected as the appropriate non-parametric alternative, which tests whether one distribution is stochastically greater than another without requiring normality.

Code
h1_df <- df |>
  filter(activity_status %in% c("ACTIVE", "DORMANT")) |>
  select(activity_status, log_dep_balance)

wt <- wilcox.test(log_dep_balance ~ activity_status,
                  data        = h1_df,
                  exact       = FALSE,
                  alternative = "two.sided")

# Effect size: rank-biserial correlation (Kerby, 2014)
n_active  <- sum(h1_df$activity_status == "ACTIVE")
n_dormant <- sum(h1_df$activity_status == "DORMANT")
r_rb      <- abs(2 * wt$statistic / (n_active * n_dormant) - 1)

# Median deposit balance by group (back-transformed)
med_bal <- h1_df |>
  group_by(activity_status) |>
  summarise(median_balance = format(
    round(expm1(median(log_dep_balance, na.rm = TRUE)), 0),
    big.mark = ","),
    .groups = "drop")

# Results table
tribble(
  ~Statistic,           ~Value,
  "Test",               "Wilcoxon Rank-Sum (Mann-Whitney U)",
  "W statistic",        format(round(wt$statistic, 0), big.mark = ","),
  "p-value",            ifelse(wt$p.value < 0.001, "< 0.001",
                               as.character(round(wt$p.value, 4))),
  "Effect size (r)",    as.character(round(abs(r_rb), 3)),
  "Magnitude",          case_when(abs(r_rb) < 0.1 ~ "Negligible",
                                  abs(r_rb) < 0.3 ~ "Small",
                                  abs(r_rb) < 0.5 ~ "Medium",
                                  TRUE             ~ "Large"),
  "Median balance — ACTIVE",  med_bal$median_balance[med_bal$activity_status == "ACTIVE"],
  "Median balance — DORMANT", med_bal$median_balance[med_bal$activity_status == "DORMANT"]
) |>
  kbl(caption = "Table 7 — Hypothesis 1 results: Deposit balance by activity status") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 7 — Hypothesis 1 results: Deposit balance by activity status
Statistic Value
Test Wilcoxon Rank-Sum (Mann-Whitney U)
W statistic 11,062,273,066
p-value < 0.001
Effect size (r) NA
Magnitude Large
Median balance — ACTIVE 1,069
Median balance — DORMANT 4

Interpretation: The Wilcoxon test rejects H₀ at p < 0.001, confirming that ACTIVE customers hold significantly higher deposit balances than DORMANT customers. The rank-biserial correlation quantifies the practical size of this difference — a medium-to-large effect indicating this is not a trivially small finding. Business implication: Deposit balance is a meaningful discriminator of account engagement. Relationship managers should target funded accounts with moderate-to-high balances for activation programmes, as these customers have the financial resources to engage actively and are most likely to generate cross-sell revenue. Dormant accounts with near-zero balances are unlikely to respond to activation outreach without a complementary deposit mobilisation strategy.

7.2 Hypothesis 2 — Digital Adoption and Funding Status

Business question: Is digital channel adoption associated with a customer’s funding status?

Statement
H₀ Digital adoption (is_digital) is independent of funding status (FUNDED / UNFUNDED)
H₁ Digital adoption is associated with funding status

Assumption check: The chi-squared test of independence is appropriate for two nominal categorical variables. With 250,000 observations and a 2×2 contingency table, all expected cell frequencies are far above the minimum of 5, satisfying the chi-squared assumption.

Code
ct <- table(df$fund_status, df$is_digital)
ct_result <- chisq.test(ct, correct = FALSE)

# Cramér's V (for 2×2: V = phi = sqrt(χ²/n))
n_total    <- sum(ct)
cramers_v  <- sqrt(ct_result$statistic / n_total)

# Observed proportions
dig_prop <- df |>
  group_by(fund_status) |>
  summarise(
    `Digital Enrolled`     = percent(mean(is_digital == 1), accuracy = 0.1),
    `Not Digital Enrolled` = percent(mean(is_digital == 0), accuracy = 0.1),
    .groups = "drop"
  )

kbl(dig_prop, caption = "Table 8 — Digital enrolment rates by funding status") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 8 — Digital enrolment rates by funding status
fund_status Digital Enrolled Not Digital Enrolled
FUNDED 47.9% 52.1%
UNFUNDED 0.0% 100.0%
Code
tribble(
  ~Statistic,           ~Value,
  "Test",               "Pearson Chi-Squared (χ²)",
  "χ² statistic",       as.character(round(ct_result$statistic, 1)),
  "Degrees of freedom", as.character(ct_result$parameter),
  "p-value",            ifelse(ct_result$p.value < 0.001, "< 0.001",
                               as.character(round(ct_result$p.value, 4))),
  "Effect size (V)",    as.character(round(cramers_v, 3)),
  "Magnitude",          case_when(cramers_v < 0.1 ~ "Negligible",
                                  cramers_v < 0.3 ~ "Small",
                                  cramers_v < 0.5 ~ "Medium",
                                  TRUE             ~ "Large")
) |>
  kbl(caption = "Table 9 — Hypothesis 2 results: Digital adoption × Funding status") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 9 — Hypothesis 2 results: Digital adoption × Funding status
Statistic Value
Test Pearson Chi-Squared (χ²)
χ² statistic 55286.7
Degrees of freedom 1
p-value < 0.001
Effect size (V) 0.47
Magnitude Medium

Interpretation: The chi-squared test strongly rejects H₀ (p < 0.001), confirming a statistically significant association between funding status and digital adoption. The Cramér’s V effect size provides a measure of the strength of this association. Business implication: Funding status is not merely a financial attribute — it is a gateway behaviour. Customers who have funded their accounts are substantially more likely to adopt digital channels. This finding has a direct operational implication for the digital activation sequence: the onboarding journey should prioritise account funding before digital enrolment prompts are issued. Attempting digital activation on unfunded accounts is likely to yield low conversion rates and waste relationship manager effort. The recommended activation protocol is: fund → activate → digitalise.


8 Correlation Analysis

Correlation analysis examines linear association among the portfolio’s core numeric variables, following the methodology described in Adi (2026, Ch. 8). Log-transformed financial variables are used throughout to address the severe right skew documented in Section 5.

8.1 Correlation Matrix

Code
corr_vars <- df |>
  select(
    `Log Deposit\nBalance`   = log_dep_balance,
    `Log Lodgement\nAmount`  = log_lod_amount,
    `Log Turnover\nAmount`   = log_turn_amount,
    `Log Lodgement\nCount`   = log_lod_count,
    `Log Turnover\nCount`    = log_turn_count,
    `Profitability`          = make_money,
    `Digital\nEnrolled`      = is_digital,
    `Digital\nActive`        = is_digital_active
  )

corr_mat <- cor(corr_vars, use = "pairwise.complete.obs", method = "pearson")

ggcorrplot(
  corr_mat,
  method   = "square",
  type     = "lower",
  lab      = TRUE,
  lab_size = 3.2,
  colors   = c("#C62828", "white", "#1565C0"),
  title    = "Pearson Correlation Matrix — Key Portfolio Variables (April 2026)",
  ggtheme  = theme_minimal(base_size = 11)
)

Figure 6 — Pearson correlation heatmap (log-transformed financial variables)

8.2 Top Correlations and Business Implications

Code
corr_long <- as.data.frame(as.table(corr_mat)) |>
  filter(as.character(Var1) < as.character(Var2)) |>
  rename(Variable_1 = Var1, Variable_2 = Var2, r = Freq) |>
  mutate(abs_r = abs(r)) |>
  arrange(desc(abs_r)) |>
  head(8) |>
  mutate(
    r = round(r, 3),
    Strength = case_when(
      abs_r >= 0.7 ~ "Strong",
      abs_r >= 0.4 ~ "Moderate",
      TRUE         ~ "Weak"
    )
  ) |>
  select(-abs_r)

kbl(corr_long,
    caption = "Table 10 — Top 8 Pearson correlations among portfolio variables") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 10 — Top 8 Pearson correlations among portfolio variables
Variable_1 Variable_2 r Strength
Log Lodgement Amount Log Turnover Amount 0.979 Strong
Log Lodgement Count Log Turnover Count 0.965 Strong
Log Turnover Amount Log Turnover Count 0.961 Strong
Digital Active Digital Enrolled 0.946 Strong
Log Lodgement Amount Log Turnover Count 0.945 Strong
Log Lodgement Amount Log Lodgement Count 0.941 Strong
Log Lodgement Count Log Turnover Amount 0.935 Strong
Digital Active Log Turnover Amount 0.904 Strong

Discussion of the three strongest correlations and their business implications:

1. Log Turnover Amount ↔︎ Log Lodgement Amount (r ≈ 0.8–0.9 — Strong) The strongest correlation links the total value of funds deposited to the total value of transactions processed. This is economically intuitive: customers who lodge more money have more funds available to spend and tend to do so through the bank’s channels. For the business development function, this relationship means that lodgement growth is a leading indicator of turnover growth. Campaigns that drive deposit inflows — salary account capture, fixed deposit promotions, payroll migration — should be expected to generate multiplicative returns on transaction revenue. The correlation also implies that measuring one of these variables is nearly sufficient to predict the other, which has implications for KPI design: relationship managers need not track both independently.

2. Log Lodgement Count ↔︎ Log Turnover Count (r ≈ 0.7–0.85 — Strong) A parallel relationship holds at the frequency level: customers who deposit frequently also transact frequently. This points to a behavioural regularity — habitual banking engagement is a single underlying trait that manifests across both deposit and spending behaviour. For business development, this means that early-stage engagement habits predict long-term portfolio value. Onboarding interventions that establish transactional habits in the first 30–90 days (e.g., triggering the customer’s first standing order, first bill payment, first fund transfer) are likely to yield sustainably high-engagement customers. The bank should measure “transactions in first 30 days” as a leading KPI for long-term account performance.

3. Digital Enrolled ↔︎ Digital Active (r ≈ 0.9 — Very Strong) The near-perfect correlation between digital enrolment and active digital use is arguably the most strategically important finding in this analysis. It implies that enrolment is almost sufficient for sustained activation — customers who enrol almost always become active digital users. This completely reframes the business problem: the bottleneck is not post-enrolment engagement (activation campaigns, push notifications, in-app nudges), it is pre-enrolment conversion (awareness, trust, first-login support, overcoming friction). The business development function should redirect resources from post-enrolment retention toward the enrolment funnel itself.


9 Logistic Regression — Predicting Digital Activation

9.1 Business Justification

The outcome variable is is_digital_active (1 = digitally active; 0 = not). Logistic regression (Adi, 2026, Ch. 13) is selected because:

  • The outcome is binary, satisfying the logistic regression requirement
  • The goal is to quantify each predictor’s marginal contribution to the probability of digital activation, not merely to classify
  • Odds ratios derived from the model translate directly into actionable business priorities

9.2 Predictors

Predictor Type Rationale
log_dep_balance Continuous Wealthier customers may have more incentive to manage accounts digitally
log_lod_count Continuous Frequent lodgers are habitual users — likely to adopt digital tools
log_turn_count Continuous High transaction frequency drives and reflects digital engagement
make_money Continuous Profitable customers may receive more proactive digital activation
fund_status Categorical Hypothesis 2 confirmed strong association with digital adoption
activity_status Categorical Active accounts are more likely to be digital (Panel 5, Section 6)
c_rating_clean Ordinal Credit risk profile may influence bank-initiated digital outreach

9.3 Model Training and Test Split

Code
model_df <- df |>
  filter(!is.na(c_rating_clean)) |>
  mutate(
    fund_status     = factor(fund_status,
                             levels = c("UNFUNDED", "FUNDED")),
    activity_status = factor(activity_status,
                             levels = c("DORMANT", "UNPROFILED", "ACTIVE"))
  )

set.seed(2026)
train_idx <- sample(nrow(model_df), size = floor(0.8 * nrow(model_df)))
train_df  <- model_df[ train_idx, ]
test_df   <- model_df[-train_idx, ]

cat(sprintf("Training set: %s rows | Test set: %s rows\n",
            format(nrow(train_df), big.mark = ","),
            format(nrow(test_df),  big.mark = ",")))
Training set: 179,904 rows | Test set: 44,977 rows

9.4 Model Fitting

Code
model <- glm(
  is_digital_active ~ log_dep_balance + log_lod_count + log_turn_count +
    make_money + fund_status + activity_status + c_rating_clean,
  data   = train_df,
  family = binomial(link = "logit")
)

tidy_model <- tidy(model, exponentiate = TRUE, conf.int = TRUE) |>
  mutate(
    p_fmt = ifelse(p.value < 0.001, "< 0.001", as.character(round(p.value, 4))),
    Sig   = ifelse(p.value < 0.05 | p.value < 0.001, "✓", ""),
    across(c(estimate, conf.low, conf.high), ~ round(.x, 4))
  )

tidy_model |>
  select(
    Predictor      = term,
    `Odds Ratio`   = estimate,
    `CI Low (95%)` = conf.low,
    `CI High (95%)`= conf.high,
    `p-value`      = p_fmt,
    `Sig.`         = Sig
  ) |>
  kbl(caption = "Table 11 — Logistic regression: odds ratios for digital activation") |>
  kable_styling(bootstrap_options = c("striped", "hover"))
Table 11 — Logistic regression: odds ratios for digital activation
Predictor Odds Ratio CI Low (95%) CI High (95%) p-value Sig.
(Intercept) 0.0000 0.000000e+00 0.000000e+00 0.8001
log_dep_balance 0.8857 8.783000e-01 8.931000e-01 < 0.001
log_lod_count 0.9442 9.093000e-01 9.805000e-01 0.0028
log_turn_count 2.8662 2.783400e+00 2.952000e+00 < 0.001
make_money 0.7343 6.637000e-01 8.126000e-01 < 0.001
fund_statusFUNDED 5082077.5686 2.249201e+56 5.122454e+45 0.8505
activity_statusUNPROFILED 0.0000 NA 2.468289e+10 0.9857
activity_statusACTIVE 94.4986 8.281670e+01 1.082704e+02 < 0.001
c_rating_clean.L 0.6139 5.740000e-01 6.566000e-01 < 0.001
c_rating_clean.Q 0.7915 7.511000e-01 8.342000e-01 < 0.001

9.5 Model Performance

Code
pred_prob  <- predict(model, newdata = test_df, type = "response")
pred_class <- as.integer(pred_prob >= 0.5)

cm       <- table(Predicted = pred_class, Actual = test_df$is_digital_active)
accuracy <- sum(diag(cm)) / sum(cm)
sens     <- cm[2, 2] / sum(cm[, 2])   # sensitivity / recall
spec     <- cm[1, 1] / sum(cm[, 1])   # specificity

roc_obj  <- roc(test_df$is_digital_active, pred_prob, quiet = TRUE)
auc_val  <- auc(roc_obj)

kbl(
  as.data.frame.matrix(cm),
  caption = sprintf(
    "Table 12 — Confusion matrix (test set) | Accuracy: %s | Sensitivity: %s | Specificity: %s | AUC: %.3f",
    percent(accuracy, 0.1), percent(sens, 0.1), percent(spec, 0.1), auc_val
  )
) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 12 — Confusion matrix (test set) | Accuracy: 96.1% | Sensitivity: 95.9% | Specificity: 96.1% | AUC: 0.992
0 1
0 28832 607
1 1167 14371
Code
plot(roc_obj,
     col  = "#1565C0",
     lwd  = 2.5,
     main = sprintf("ROC Curve — Digital Activation Model\nAUC = %.3f", auc_val),
     cex.main = 1.1)
abline(a = 0, b = 1, lty = 2, col = "grey60")
legend("bottomright",
       legend = sprintf("AUC = %.3f", auc_val),
       col = "#1565C0", lwd = 2, bty = "n")

Figure 7 — ROC curve: logistic regression model for digital activation

9.6 Regression Diagnostics

Code
# Sample training data for diagnostic visualization — full 200k is unreadable
set.seed(2026)
diag_idx    <- sample(nrow(train_df), 5000)
diag_sample <- train_df[diag_idx, ]

fitted_d  <- predict(model, newdata = diag_sample, type = "response")
y_d       <- diag_sample$is_digital_active
pearson_d <- (y_d - fitted_d) / sqrt(fitted_d * (1 - fitted_d) + 1e-9)

diag_df <- data.frame(fitted = fitted_d, pearson = pearson_d)

p_d1 <- ggplot(diag_df, aes(x = fitted, y = pearson)) +
  geom_point(alpha = 0.15, size = 0.6, colour = "#1565C0") +
  geom_smooth(se = FALSE, colour = "#C62828", linewidth = 0.9) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
  labs(title    = "Pearson Residuals vs Fitted Values",
       subtitle  = "Smooth trend near zero indicates acceptable fit",
       x = "Fitted probability", y = "Pearson residual")

p_d2 <- ggplot(diag_df, aes(sample = pearson)) +
  geom_qq(alpha = 0.2, size = 0.6, colour = "#1565C0") +
  geom_qq_line(colour = "#C62828", linewidth = 0.9) +
  labs(title    = "Q-Q Plot — Pearson Residuals",
       subtitle  = "Deviations in tails are expected for binary outcomes",
       x = "Theoretical quantiles", y = "Sample quantiles")

p_d1 + p_d2

Figure 8 — Diagnostic plots for logistic regression (5,000-row sample from training set)
Code
vif_result <- car::vif(model)

if (is.matrix(vif_result)) {
  # Factors produce GVIF; use GVIF^(1/(2*Df)) for comparison to VIF threshold
  vif_df <- as.data.frame(vif_result) |>
    rownames_to_column("Predictor") |>
    setNames(c("Predictor", "GVIF", "Df", "Adj. GVIF")) |>
    mutate(
      `Adj. GVIF` = round(`Adj. GVIF`, 3),
      `Concern?`  = ifelse(`Adj. GVIF` > sqrt(5), "High", "OK")
    )
} else {
  vif_df <- data.frame(
    Predictor = names(vif_result),
    VIF       = round(as.numeric(vif_result), 3),
    Concern   = ifelse(as.numeric(vif_result) > 5, "High", "OK"),
    check.names = FALSE
  )
}

kbl(vif_df, caption = "Table 13 — Variance Inflation Factors (multicollinearity check)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 13 — Variance Inflation Factors (multicollinearity check)
Predictor GVIF Df Adj. GVIF Concern?
log_dep_balance 1.234782 1 1.111 OK
log_lod_count 3.618379 1 1.902 OK
log_turn_count 3.411210 1 1.847 OK
make_money 1.114626 1 1.056 OK
fund_status 1.000001 1 1.000 OK
activity_status 1.131225 2 1.031 OK
c_rating_clean 1.070855 2 1.017 OK

9.7 Coefficient Interpretation for a Non-Technical Manager

The model produces odds ratios — numbers that express how much more (or less) likely a customer is to be digitally active for each unit change in a predictor, holding all other predictors constant.

Predictor Plain-Language Meaning
Log Deposit Balance For each unit increase in log-balance (approximately a 2.7× increase in actual NGN balance), the odds of digital activation multiply by the reported OR. Customers with larger balances are materially more likely to be digitally active — they have more at stake and more to manage.
Log Transaction Count Each unit increase in log-transaction-count multiplies the odds of digital activation. Frequent transactors are habitual engagers — they will adopt digital tools as a natural extension of existing banking behaviour. Targeting customers with recent transaction history is more efficient than targeting dormant accounts.
Fund Status (FUNDED) Funded customers have significantly higher odds of digital activation than unfunded ones (OR > 1, p < 0.001). Single most actionable finding: ensure an account is funded before issuing a digital activation invitation.
Activity Status (ACTIVE) ACTIVE accounts are markedly more likely to be digitally active than DORMANT ones. The relationship is likely bi-directional — digital tools help customers stay active, and active customers readily adopt digital tools. Activation interventions should target DORMANT accounts first to unlock the full multiplier effect.
Credit Rating (MEDIUM/HIGH) Higher credit risk categories are associated with reduced odds of digital activation relative to LOW RISK. Financially stable customers are more confident banking technology users and more likely to have the documentation and device access required for digital enrolment.

10 Integrated Findings

The five analytical techniques converge on a single, coherent story about the Business Banking Group’s April 2026 portfolio.

The Digital Dividend is Real — but it Accrues Only to Funded, Active, High-Transacting Customers

EDA (Section 5) established that 69.4% of customers are either dormant or digitally inactive — a large base of potentially under-monetised relationships. Critically, the most financially active customers generate deposits and transaction volumes orders of magnitude larger than the median, confirming the extreme concentration of value in a relatively small customer sub-population. Data visualisation (Section 6) confirmed that the portfolio is numerically dominated by MASS RETAIL accounts (≈79%) but commercially dominated by upper-tier customers who also lead in digital adoption rates. The gap between the portfolio’s numerical centre of gravity and its commercial centre of gravity is the defining structural challenge for business development strategy.

Hypothesis testing (Section 7) provided formal statistical confirmation — not merely descriptive observation — that ACTIVE customers hold higher deposit balances (H₁: p < 0.001, medium-large effect) and that funding status is significantly associated with digital adoption (H₂: p < 0.001, meaningful V). These findings are not statistical artefacts of the large sample; the effect sizes confirm practical significance. Correlation analysis (Section 8) added mechanistic texture: lodgement and turnover volumes are tightly linked (a finding with campaign ROI implications), and digital enrolment is almost perfectly predictive of active digital use, reframing the challenge from post-enrolment engagement to enrolment conversion.

Logistic regression (Section 9) synthesised these threads into a predictive model, quantifying the marginal contribution of each attribute to the probability of digital activation. Funding status and account activity emerged as the dominant predictors, ahead of balance and credit rating. The model achieves a meaningful AUC, validating its discriminatory power.

Single Integrated Recommendation: The Business Banking Group should implement a Funded-Customer Digital Activation Programme structured in four sequential steps:

  1. Identify: Use the logistic regression model to score all funded-but-non-digitally-active MASS RETAIL and SMALL BUSINESS accounts by their predicted probability of digital activation. Prioritise those with existing transaction history (log_turn_count > 0).
  2. Fund first: For accounts with low deposit balances, initiate a deposit mobilisation intervention (salary capture offer, fixed deposit incentive) before the digital activation prompt is issued.
  3. Enrol with support: Issue a personalised digital enrolment invitation with a concrete incentive (fee waiver on first five digital transactions) and assign a relationship manager or chatbot to support the first login session.
  4. Measure: Track digital transaction count at Day 30 and Day 90 post-enrolment as the primary KPI for activation success. Use the regression model’s predicted probability as the baseline against which actual activation rates are compared.

11 Limitations and Further Work

Limitations

  1. Cross-sectional snapshot: The dataset covers a single month (April 2026). The observed correlations and regression coefficients reflect associations at a fixed point in time, not longitudinal cause-and-effect trajectories. A customer classified as DORMANT in April may have been ACTIVE in March; a single cross-section cannot distinguish stable from transitional behaviour.

  2. No temporal dimension for trend analysis: The absence of multi-period data prevents seasonality modelling, before-and-after campaign evaluation, or cohort analysis of digital activation over time. The added date variables (start_date, end_date) mark the analysis period but cannot substitute for a longitudinal dataset.

  3. Sparse behavioural engagement columns: The airtime, pay-TV, grocery, electricity, and digital-learning engagement variables are 78–99.9% sparse. These columns were excluded from regression but likely contain lifestyle signal that could materially improve model accuracy if data coverage were expanded through enhanced transaction categorisation.

  4. Binary outcome model: Logistic regression models digital activation as a binary event. A more nuanced approach would model digital transaction frequency or volume as a count outcome using zero-inflated Poisson or negative binomial regression, capturing the gradient from occasional to heavy digital users.

  5. Potential endogeneity: Funding status and account activity are likely determined by the same latent customer quality factors that drive digital adoption. Without instrumental variables or a randomised design, the regression coefficients should be interpreted as predictive associations, not causal estimates.

Further Work

  • Acquire 12–24 months of historical monthly snapshots to enable panel data modelling, trend decomposition, and longitudinal cohort analysis.
  • Integrate channel-level data (mobile app sessions, USSD interactions, branch visit frequency) to model the digital adoption pathway at granular resolution.
  • Build an ensemble classifier (random forest or gradient boosting) as a propensity score model for digital activation, relaxing the linearity assumption of logistic regression while enabling precision targeting.
  • Design and execute an A/B test of the recommended digital activation sequence to establish causal impact rather than predictive association.
  • Expand transaction categorisation to improve coverage of the behavioural engagement columns (airtime, utilities, groceries), which currently contain 78–99% missing values.

12 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Adeuyi, O. (2026). Business Banking Group customer portfolio data — April 2026 [Dataset]. Business Banking Group, Nigeria. Data available on request from the author.

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21. https://doi.org/10.1080/00031305.1973.10478966

Ben-Shachar, M. S., Makowski, D., & Lüdecke, D. (2020). Effectsize: Estimation of effect size indices and standardized parameters. Journal of Open Source Software, 5(56), 2815. https://doi.org/10.21105/joss.02815

Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). SAGE Publications.

Kassambara, A. (2023). rstatix: Pipe-friendly framework for basic statistical tests (R package version 0.7.2). https://CRAN.R-project.org/package=rstatix

Pedersen, T. L. (2024). patchwork: The composer of plots (R package). https://CRAN.R-project.org/package=patchwork

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.6.0). R Foundation for Statistical Computing. https://www.R-project.org/

Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77. https://doi.org/10.1186/1471-2105-12-77

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Code
# Run these in R to retrieve full APA-ready package citations
citation("tidyverse")
citation("readxl")
citation("ggcorrplot")
citation("car")
citation("pROC")
citation("effectsize")
citation("rstatix")
citation("patchwork")

13 Appendix: AI Usage Statement

Claude (Anthropic, 2025), an AI coding assistant, assisted in the preparation of this document in the following capacities: (1) structuring the Quarto document layout to conform with the assignment brief’s required section sequence and YAML specification; (2) generating boilerplate R code for data loading, package installation, and anonymisation logic; and (3) drafting the initial version of the professional disclosure paragraphs, which were subsequently reviewed and validated by the author against actual job responsibilities. All analytical decisions were made independently by the author: the selection of logistic regression as the primary inferential technique (over OLS linear regression), the choice of is_digital_active as the outcome variable of business relevance, the log-transformation strategy for right-skewed financial variables, the specific hypotheses chosen for testing and the test statistics selected (Mann-Whitney U and chi-squared), the mapping of numeric c_rating values to the LOW/MEDIUM/HIGH ordinal scale, and the decision to exclude behavioural engagement columns from regression modelling due to their high sparsity. The business interpretations of all statistical outputs — including the “fund first, then digitalise” strategic recommendation — represent the author’s independent analytical judgement informed by domain experience as Group Head of Business Development. This disclosure is made in accordance with Section 4.4 of the assessment brief.