Account Onboarding Analytics: Understanding Customer Activation and Balance Patterns at Providus Bank (Port Harcourt Branch)
Author
Ifunanya Akaegbusi
Published
May 10, 2026
1. Executive Summary
This study analyses 14,714 bank accounts opened at the Port Harcourt Branch of Providus Bank between January 2024 and April 2026. As a Relationship Manager, understanding what drives account activation, balance accumulation, and dormancy is central to my daily work — from onboarding new customers to managing portfolio health. Using five analytical techniques — Exploratory Data Analysis, Data Visualisation, Hypothesis Testing, Correlation Analysis, and Logistic Regression — this report identifies the key customer and account-level factors that predict whether a newly opened account remains active or becomes dormant. The findings reveal that account type, restriction status, and customer classification are the strongest predictors of account activity, with corporate accounts maintaining significantly higher balances than individual accounts. The central recommendation is a structured 90-day post-onboarding engagement protocol targeting zero-balance savings accounts, which represent the highest dormancy risk in the branch portfolio.
2. Professional Disclosure
Job Title: Relationship Manager
Organisation: Providus Bank, a full-service commercial bank operating across Nigeria, offering retail banking, corporate banking, digital financial services, and trade finance solutions.
Why these five techniques are operationally relevant to my role:
Exploratory Data Analysis (EDA): As a Relationship Manager, I am responsible for understanding the composition and health of my assigned customer portfolio. EDA allows me to systematically profile customer segments, identify accounts with anomalous balances or missing KYC data, and flag quality issues before they escalate into compliance problems. I routinely need to answer questions such as: how many accounts in my portfolio are truly active, what is the balance distribution across account types, and where are the data gaps?
Data Visualisation: Relationship Managers present portfolio performance reports to branch management and senior stakeholders who are not data specialists. Effective visualisation translates raw account data into narratives that support decisions — for example, showing management that a spike in account openings did not translate into proportional balance growth, or that a particular depositor segment is disproportionately going dormant.
Hypothesis Testing: Branch targets and strategy are often built on assumptions — for example, that corporate accounts generate higher balances than individual accounts, or that certain account types perform better under specific market conditions. Hypothesis testing gives me a statistically rigorous way to validate or reject these assumptions rather than acting on intuition or anecdote.
Correlation Analysis: Understanding which customer attributes correlate with balance growth or dormancy helps me prioritise relationship management effort. If restriction status and dormancy are strongly correlated, I know to investigate PND flags as an early warning indicator and escalate those accounts before they close.
Logistic Regression: Predicting which newly opened accounts are at risk of becoming dormant or closed — before it happens — is a core retention challenge. A regression model that quantifies the contribution of each account and customer characteristic to the probability of remaining active gives me an objective basis for targeted engagement, replacing the current reactive approach with a proactive one.
3. Data Collection & Sampling
Source: Core banking system (Flexcube/internal MIS), Port Harcourt Branch, Providus Bank
Collection Method: System-generated management report extracted from the branch’s account management module, covering all accounts opened within the specified period
Sampling Frame: All accounts opened at the Port Harcourt Branch between 1 January 2024 and 24 April 2026
Sample Size: 14,714 account records
Time Period: 28 months (January 2024 to April 2026)
Variables: 38 variables per record covering account identifiers, balance figures, customer demographics, account classification, status flags, and relationship manager assignments
Ethical Notes: All personally identifiable information (PII) has been anonymised prior to analysis and publication. Fields anonymised include: customer names, email addresses, phone numbers, BVN numbers, TIN, NUBAN account numbers, and physical addresses. These have been replaced with anonymised codes. The data was accessed in the ordinary course of professional duties as a Relationship Manager with authorised system access. No customer consent beyond standard banking terms of service is required for internal analytical use of this data. The dataset is not published in its raw form; only anonymised versions are included in this submission.
Table 2: Summary Statistics for Key Analysis Variables
Factor variables

| skim_variable | n_missing | complete_rate | factor.ordered | factor.n_unique | factor.top_counts |
|---|---|---|---|---|---|
| CUSTOMER_TYPE | 0 | 1.00000 | FALSE | 2 | IND: 11158, COR: 3556 |
| ACCOUNTSTATUS | 0 | 1.00000 | FALSE | 3 | ACT: 13768, CLO: 568, DOR: 378 |
| ACCT_TYPE_TOP | 0 | 1.00000 | FALSE | 6 | PRO: 8523, CUR: 2827, FOR: 1140, PRO: 1105 |
| GENDER | 0 | 1.00000 | FALSE | 3 | MAL: 7151, FEM: 4008, OTH: 3555 |

Numeric variables

| skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|
| CRNT_BAL | 0 | 1.00000 | 1.045285e+06 | 1.939113e+07 | -1725194177 | 0 | 825.735000 | 40801.3625 | 7.958380e+08 | ▁▁▁▇▁ |
| LOG_BALANCE | 0 | 1.00000 | 6.455175e+00 | 5.081359e+00 | 0 | 0 | 6.717484 | 10.6165 | 2.049491e+01 | ▇▆▆▃▁ |
| AVAILABLE_BALANCE | 0 | 1.00000 | 9.340017e+05 | 2.143394e+07 | -1725222688 | 0 | 597.963500 | 38482.0675 | 7.958380e+08 | ▁▁▁▇▁ |
| AGE_CLEAN | 54 | 0.99633 | 2.660662e+01 | 1.668977e+01 | 0 | 17 | 29.000000 | 38.0000 | 9.000000e+01 | ▅▇▅▁▁ |
| ACCOUNT_AGE_DAYS | 0 | 1.00000 | 4.559840e+02 | 2.302533e+02 | 6 | 280 | 462.000000 | 647.0000 | 8.490000e+02 | ▅▆▇▇▇ |
| IS_ACTIVE | 0 | 1.00000 | 9.357075e-01 | 2.452816e-01 | 0 | 1 | 1.000000 | 1.0000 | 1.000000e+00 | ▁▁▁▁▇ |
| IS_RESTRICTED | 0 | 1.00000 | 8.386570e-02 | 2.771957e-01 | 0 | 0 | 0.000000 | 0.0000 | 1.000000e+00 | ▇▁▁▁▁ |
| HAS_BALANCE | 0 | 1.00000 | 7.416746e-01 | 4.377287e-01 | 0 | 0 | 1.000000 | 1.0000 | 1.000000e+00 | ▃▁▁▁▇ |
5. Exploratory Data Analysis (Technique 1)
5.1 Theory Recap
Exploratory Data Analysis (EDA) is the practice of examining datasets to summarise their main characteristics before formal modelling. It encompasses summary statistics, distribution analysis, missing value assessment, and outlier detection. Anscombe’s Quartet (1973) famously demonstrated that datasets with identical summary statistics can have radically different distributions — a warning that mean and standard deviation alone are insufficient to understand data. EDA is therefore the mandatory first step before any inferential or predictive work.
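To make the Anscombe point concrete, the following minimal sketch uses the `anscombe` data frame that ships with base R (it is not part of the branch dataset). It tabulates the mean, standard deviation, and correlation for each of the four x/y pairs so that their near-identical summaries can be compared side by side.

```r
# Anscombe's Quartet ships with base R as the `anscombe` data frame
# (columns x1..x4 and y1..y4). Illustration only, not branch data.
data(anscombe)

quartet_stats <- data.frame(
  set    = 1:4,
  mean_x = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  sd_y   = sapply(1:4, function(i) sd(anscombe[[paste0("y", i)]])),
  r_xy   = sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                       anscombe[[paste0("y", i)]]))
)

# Rounded to two decimals, the four rows are indistinguishable
print(round(quartet_stats, 2))
```

Plotting each pair (for example with `plot(anscombe$x1, anscombe$y1)`) shows four visibly different shapes behind the matching numbers, which is why the distribution plots in this section accompany every summary table.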
5.2 Business Justification
As a Relationship Manager at Providus Bank, EDA is operationally equivalent to a portfolio health review. Before recommending any retention strategy or reporting to branch management, I need to know: what is the actual shape of the balance distribution across my accounts? Are there systematic data quality problems — missing KYC fields, impossible ages, zero-balance accounts that were never funded? Which account segments are driving volume versus value? EDA answers these questions rigorously rather than through anecdote.
5.3 Distribution of Key Numeric Variables
Code
# Balance distribution: raw vs log
p1 <- ggplot(df, aes(x = CRNT_BAL)) +
  geom_histogram(bins = 80, fill = "#2C7BB6", colour = "white", alpha = 0.85) +
  scale_x_continuous(labels = label_comma()) +
  labs(title = "Raw Current Balance Distribution",
       subtitle = "Heavily right-skewed: majority of accounts hold near-zero balances",
       x = "Current Balance (NGN)", y = "Count") +
  theme_minimal(base_size = 12)

p2 <- ggplot(df |> filter(CRNT_BAL >= 0), aes(x = LOG_BALANCE)) +
  geom_histogram(bins = 60, fill = "#1A9641", colour = "white", alpha = 0.85) +
  labs(title = "Log-Transformed Balance Distribution",
       subtitle = "log1p(CRNT_BAL): reveals structure hidden by extreme values",
       x = "log1p(Current Balance)", y = "Count") +
  theme_minimal(base_size = 12)

p1
Code
p2
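The log-transformed histogram relies on `log1p()`, which computes log(1 + x) and therefore maps a zero balance to zero rather than to negative infinity. The sketch below uses made-up balances. Clamping negative (overdrawn) balances to zero with `pmax()` is an assumption, suggested by LOG_BALANCE having a minimum of 0 in Table 2 even though CRNT_BAL takes negative values; the report does not show how LOG_BALANCE was actually derived.

```r
# Hypothetical balances in NGN, including a zero and an overdrawn account
toy_balance <- c(-5000, 0, 825.74, 40801.36, 7.96e8)

# log(0) is -Inf, whereas log1p(0) is exactly 0
log(0)
log1p(0)

# ASSUMPTION: negative balances are clamped to zero before transforming,
# because log1p(x) is undefined (NaN) for x < -1
toy_log_balance <- log1p(pmax(toy_balance, 0))
round(toy_log_balance, 2)
```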
Code
ggplot(df |> filter(!is.na(AGE_CLEAN)), aes(x = AGE_CLEAN)) +
  geom_histogram(bins = 40, fill = "#D7191C", colour = "white", alpha = 0.85) +
  labs(title = "Age Distribution of Account Holders",
       subtitle = "After removing impossible values (age < 0 or > 100) | Corporate accounts excluded (OTHERS gender)",
       x = "Age (years)", y = "Count") +
  theme_minimal(base_size = 12)
5.4 Categorical Variable Distributions
Code
# Account Status
df |>
  count(ACCOUNTSTATUS) |>
  mutate(pct = n / sum(n),
         label = paste0(n, "\n(", percent(pct, accuracy = 0.1), ")")) |>
  ggplot(aes(x = fct_reorder(ACCOUNTSTATUS, n), y = n, fill = ACCOUNTSTATUS)) +
  geom_col(show.legend = FALSE, width = 0.6) +
  geom_text(aes(label = label), hjust = -0.1, size = 3.5) +
  coord_flip() +
  scale_fill_manual(values = c("ACTIVE" = "#1A9641",
                               "DORMANT" = "#FD8D3C",
                               "CLOSED" = "#D7191C")) +
  scale_y_continuous(limits = c(0, 16000)) +
  labs(title = "Account Status Distribution",
       subtitle = "93.6% of accounts are ACTIVE: class imbalance noted for regression",
       x = NULL, y = "Number of Accounts") +
  theme_minimal(base_size = 12)
Code
df |>
  count(CUSTOMER_TYPE, ACCOUNTSTATUS) |>
  group_by(CUSTOMER_TYPE) |>
  mutate(pct = n / sum(n)) |>
  ggplot(aes(x = CUSTOMER_TYPE, y = pct, fill = ACCOUNTSTATUS)) +
  geom_col(position = "fill", width = 0.55) +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("ACTIVE" = "#1A9641",
                               "DORMANT" = "#FD8D3C",
                               "CLOSED" = "#D7191C")) +
  labs(title = "Account Status Breakdown by Customer Type",
       subtitle = "Corporate accounts show higher dormancy rate (6.3%) vs Individual (1.4%)",
       x = "Customer Type", y = "Proportion", fill = "Status") +
  theme_minimal(base_size = 12)
5.5 Anscombe’s Quartet — Why Summary Stats Alone Are Insufficient
Code
# Demonstrate with balance data across account types
balance_summary <- df |>
  filter(ACCT_TYPE_TOP %in% c("PROVIDUS SAVINGS A/C",
                              "CURR.A/C -LOCAL CORPORATE",
                              "PROVIDUS CURRENT ACCOUNT")) |>
  group_by(ACCT_TYPE_TOP) |>
  summarise(
    Mean_Balance   = round(mean(CRNT_BAL, na.rm = TRUE), 2),
    Median_Balance = round(median(CRNT_BAL, na.rm = TRUE), 2),
    SD_Balance     = round(sd(CRNT_BAL, na.rm = TRUE), 2),
    N              = n()
  )

balance_summary |>
  kable(
    caption = "Table 3: Summary Statistics by Account Type (means far exceed medians, indicating heavy right skew)",
    col.names = c("Account Type", "Mean Balance (NGN)",
                  "Median Balance (NGN)", "SD (NGN)", "N")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 3: Summary Statistics by Account Type (means far exceed medians, indicating heavy right skew)

| Account Type | Mean Balance (NGN) | Median Balance (NGN) | SD (NGN) | N |
|---|---|---|---|---|
| CURR.A/C -LOCAL CORPORATE | 3036262.2 | 13659.72 | 40338396 | 2827 |
| PROVIDUS CURRENT ACCOUNT | 297124.5 | 0.00 | 3458447 | 1105 |
| PROVIDUS SAVINGS A/C | 730229.5 | 1365.22 | 10297294 | 8523 |
Code
# Show the actual distributions to prove the means are misleading
df |>
  filter(
    ACCT_TYPE_TOP %in% c("PROVIDUS SAVINGS A/C",
                         "CURR.A/C -LOCAL CORPORATE",
                         "PROVIDUS CURRENT ACCOUNT"),
    CRNT_BAL >= 0
  ) |>
  ggplot(aes(x = LOG_BALANCE, fill = ACCT_TYPE_TOP)) +
  geom_density(alpha = 0.6) +
  facet_wrap(~ACCT_TYPE_TOP, ncol = 1) +
  scale_fill_brewer(palette = "Set1") +
  labs(
    title = "Balance Distributions Are Not What the Means Suggest",
    subtitle = "Echoing Anscombe's Quartet: summary statistics alone hide the shape of each distribution",
    x = "log1p(Current Balance)", y = "Density", fill = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")
6. Data Visualisation (Technique 2)
6.1 Theory Recap
Data visualisation translates numerical patterns into perceptual signals. The grammar of graphics (Wilkinson, 1999; implemented in ggplot2) frames every chart as a mapping of data variables to aesthetic properties — position, colour, size, shape. Effective chart selection is determined by the question being asked: time trends call for line charts, comparisons call for bar charts, distributions call for histograms or density plots, and relationships call for scatterplots. A visualisation narrative uses a sequence of charts to tell a single coherent story rather than presenting isolated graphics.
6.2 Business Justification
Relationship Managers are expected to communicate portfolio insights to branch management, credit committees, and senior stakeholders who do not read regression tables. A well-constructed visualisation narrative compresses analytical findings into a form that drives decisions — making it one of the most practically valuable skills in this role.
6.3 The Five-Plot Narrative: “Who Is Opening Accounts, and What Happens to Them?”
The five plots tell a connected story. Plot 1 shows that account opening peaked in January 2025 and has trended downward since mid-2025, suggesting either a slowdown in the branch’s acquisition drive or seasonal effects. Plot 2 reveals that while individual customers account for the bulk of volume, corporate accounts are proportionally more likely to go dormant — a risk concentration the branch should monitor. Plot 3 confirms that corporate current accounts maintain substantially higher balances than savings accounts, most of which cluster at or near zero — meaning volume growth in savings is not translating into balance growth. Plot 4 shows that closed and dormant accounts skew younger, identifying customers under 30 as a higher-risk segment for early exit. Plot 5 exposes a structural acquisition dependency: Marketers account for over 40% of all accounts opened, meaning the branch’s portfolio quality is heavily tied to the quality of its marketer-driven onboarding pipeline.
7. Hypothesis Testing (Technique 3)
7.1 Theory Recap
Hypothesis testing is a formal framework for making decisions under uncertainty. A null hypothesis (H₀) states that no effect or difference exists; the alternative hypothesis (H₁) proposes that one does. The test produces a p-value: the probability of observing data at least as extreme as the sample if H₀ were true. By convention, p < 0.05 is taken as sufficient evidence to reject H₀. However, statistical significance alone is insufficient: effect size measures (Cohen’s d for continuous outcomes, Cramér’s V for categorical associations) quantify the practical magnitude of any difference found. A result can be statistically significant but operationally trivial, particularly with large samples such as this one (n = 14,714).
7.2 Business Justification
Branch strategy at Providus Bank is routinely built on untested assumptions, for example that corporate accounts outperform individual accounts, or that account type determines balance outcomes. Hypothesis testing replaces assumption with evidence. As a Relationship Manager, being able to present a statistically validated finding to management (“corporate accounts hold significantly higher balances, p < 0.001, with a medium effect size of Cohen’s d ≈ 0.53”) is more persuasive and defensible than presenting an average from a spreadsheet.
7.3 Hypothesis 1 — Do Corporate Accounts Hold Significantly Higher Balances Than Individual Accounts?
H₀: The mean current balance is equal for CORPORATE and INDIVIDUAL customers
H₁: The mean current balance differs between CORPORATE and INDIVIDUAL customers
Test: Welch’s two-sample t-test (does not assume equal variances)
Effect size: Cohen’s d
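For reference, Welch's statistic and its approximate (Welch-Satterthwaite) degrees of freedom can be written as below. The subscripts C (corporate) and I (individual) and the symbols for the group means, standard deviations, and sample sizes are notation introduced here for illustration; they do not appear in the original analysis.

```latex
t = \frac{\bar{x}_C - \bar{x}_I}{\sqrt{\dfrac{s_C^2}{n_C} + \dfrac{s_I^2}{n_I}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_C^2}{n_C} + \dfrac{s_I^2}{n_I}\right)^{2}}
                 {\dfrac{\left(s_C^2/n_C\right)^{2}}{n_C - 1} + \dfrac{\left(s_I^2/n_I\right)^{2}}{n_I - 1}}
```

Because each group contributes its own variance term, no equal-variance assumption is needed, and the degrees of freedom are generally non-integer (as in the df = 4067.1 reported in the test output).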
Step 1 — Check Normality Assumption
Code
# Subset the two groups: positive balances only for meaningful comparison
corp <- df |>
  filter(CUSTOMER_TYPE == "CORPORATE", CRNT_BAL > 0) |>
  pull(LOG_BALANCE)
indv <- df |>
  filter(CUSTOMER_TYPE == "INDIVIDUAL", CRNT_BAL > 0) |>
  pull(LOG_BALANCE)
cat("=== NORMALITY CHECK ===\n")
=== NORMALITY CHECK ===
Code
cat("Corporate accounts with positive balance:", length(corp), "\n")
Corporate accounts with positive balance: 2516
Code
cat("Individual accounts with positive balance:", length(indv), "\n\n")
Individual accounts with positive balance: 8397
Code
# With samples this large, Shapiro-Wilk flags even trivial departures from normality
# (and R's shapiro.test() accepts at most 5000 values), so use a visual check + CLT argument
# Central Limit Theorem: with n > 30, t-test is robust to non-normality
cat("Sample sizes exceed 30 in both groups.\n")
Sample sizes exceed 30 in both groups.
Code
cat("By the Central Limit Theorem, the sampling distribution of the mean\n")
By the Central Limit Theorem, the sampling distribution of the mean
Code
cat("is approximately normal regardless of the population distribution.\n")
is approximately normal regardless of the population distribution.
Code
cat("Welch's t-test is appropriate.\n\n")
Welch's t-test is appropriate.
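Code
# Visual check: draw one random sample per group and use it for both
# the Q-Q points and the reference line
par(mfrow = c(1, 2))
corp_qq <- sample(corp, 500)
indv_qq <- sample(indv, 500)
qqnorm(corp_qq, main = "Q-Q Plot: Corporate (sample n=500)")
qqline(corp_qq, col = "red")
qqnorm(indv_qq, main = "Q-Q Plot: Individual (sample n=500)")
qqline(indv_qq, col = "red")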
Welch Two Sample t-test
data: corp and indv
t = 23.348, df = 4067.1, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.872348 2.215611
sample estimates:
mean of x mean of y
10.276252 8.232272
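Because the test was run on log1p-transformed balances, the difference in means is easier to read once it is back-transformed. The short sketch below simply re-uses the numbers printed in the output above; exponentiating a difference of mean logs gives a ratio of geometric means (strictly, of 1 + balance, since the transform is `log1p`). On these figures, corporate accounts with a positive balance hold roughly 7.7 times the typical individual balance, with a 95% interval of about 6.5 to 9.2 times.

```r
# Values copied from the Welch t-test output above (log1p scale)
mean_corp <- 10.276252
mean_indv <- 8.232272
ci_log    <- c(1.872348, 2.215611)

diff_log <- mean_corp - mean_indv

# exp() of a difference in mean logs is a ratio of geometric means,
# here of (1 + balance) because the transform is log1p
ratio    <- exp(diff_log)
ratio_ci <- exp(ci_log)

round(c(ratio = ratio, lower = ratio_ci[1], upper = ratio_ci[2]), 1)
```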
Step 4 — Calculate Effect Size (Cohen’s d)
Code
# Cohen's d = (mean1 - mean2) / pooled SD
# Note: this pooled SD weights the two group variances equally, not by sample size
pooled_sd <- sqrt((sd(corp)^2 + sd(indv)^2) / 2)
cohens_d <- (mean(corp) - mean(indv)) / pooled_sd
cat("=== EFFECT SIZE ===\n")
=== EFFECT SIZE ===
Code
cat("Cohen's d:", round(cohens_d, 4), "\n")
Cohen's d: 0.5337
Code
cat("Interpretation:\n")
Interpretation:
Code
cat(" |d| < 0.2 = Negligible\n")
|d| < 0.2 = Negligible
Code
cat(" |d| < 0.5 = Small\n")
|d| < 0.5 = Small
Code
cat(" |d| < 0.8 = Medium\n")
|d| < 0.8 = Medium
Code
cat(" |d| >= 0.8 = Large\n")
|d| >= 0.8 = Large
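The bands above can be applied mechanically. The sketch below is illustrative only: `classify_d()` and `pooled_sd_weighted()` are hypothetical helper names, not functions from the original analysis. The second helper shows the sample-size-weighted pooled standard deviation, which is the more common denominator for Cohen's d when group sizes differ as much as they do here.

```r
# Map |d| onto the bands listed above (hypothetical helper)
classify_d <- function(d) {
  cut(abs(d),
      breaks = c(0, 0.2, 0.5, 0.8, Inf),
      labels = c("Negligible", "Small", "Medium", "Large"),
      right  = FALSE)   # intervals are [lower, upper)
}

# n-weighted pooled SD, shown for comparison with the equal-weight
# form sqrt((sd1^2 + sd2^2) / 2)
pooled_sd_weighted <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
}

as.character(classify_d(0.5337))   # value reported for Hypothesis 1
```

On this scale the reported d of 0.5337 sits in the Medium band.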
Step 5 — Visualise the Difference
Code
df |>
  filter(CUSTOMER_TYPE %in% c("CORPORATE", "INDIVIDUAL"), CRNT_BAL > 0) |>
  ggplot(aes(x = CUSTOMER_TYPE, y = LOG_BALANCE, fill = CUSTOMER_TYPE)) +
  geom_violin(alpha = 0.5, trim = TRUE) +
  geom_boxplot(width = 0.12, fill = "white",
               outlier.size = 0.6, outlier.alpha = 0.3) +
  stat_summary(fun = mean, geom = "point",
               shape = 18, size = 4, colour = "black") +
  scale_fill_manual(values = c("CORPORATE" = "#2C7BB6",
                               "INDIVIDUAL" = "#D7191C")) +
  labs(
    title = "Hypothesis 1: Balance Distribution, Corporate vs Individual Accounts",
    subtitle = "Diamond (◆) = group mean | Welch's t-test result reported in interpretation below",
    x = "Customer Type", y = "log1p(Current Balance)",
    fill = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")
Step 6 — Plain-Language Business Interpretation
Code
cat("**Result:** Welch's t-test produces a statistically significant result
(t = 23.35, p < 0.001), allowing us to reject H₀. Corporate accounts hold significantly
higher log-transformed balances than individual accounts at the Port Harcourt Branch.\n\n")
Result: Welch’s t-test produces a statistically significant result (t = 23.35, p < 0.001), allowing us to reject H₀. Corporate accounts hold significantly higher log-transformed balances than individual accounts at the Port Harcourt Branch.
Code
cat("**Effect size:** Cohen's d = 0.53 quantifies the practical magnitude of this
difference. It falls in the medium band of the scale above, so the result is not merely a
statistical artefact of sample size; the difference is operationally meaningful.\n\n")
Effect size: Cohen’s d = 0.53 quantifies the practical magnitude of this difference. It falls in the medium band of the scale above, so the result is not merely a statistical artefact of sample size; the difference is operationally meaningful.
Code
cat("**Business implication:** The branch cannot treat all accounts as
equivalent when setting relationship management priorities. Corporate accounts
represent a disproportionate share of balance value. A deliberate strategy to
improve corporate onboarding quality (faster account activation,
dedicated relationship manager assignment, and early engagement within the first
30 days) would have a measurable impact on total branch deposits. Individual
savings accounts, despite dominating by volume, contribute far less to the
balance sheet and require a different, more cost-efficient engagement model.")
Business implication: The branch cannot treat all accounts as equivalent when setting relationship management priorities. Corporate accounts represent a disproportionate share of balance value. A deliberate strategy to improve corporate onboarding quality (faster account activation, dedicated relationship manager assignment, and early engagement within the first 30 days) would have a measurable impact on total branch deposits. Individual savings accounts, despite dominating by volume, contribute far less to the balance sheet and require a different, more cost-efficient engagement model.
7.4 Hypothesis 2 — Is Account Status Associated with Account Type?
H₀: Account status (ACTIVE / CLOSED / DORMANT) is independent of account type
H₁: Account status and account type are statistically associated
Test: Chi-squared test of independence
Effect size: Cramér’s V
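As a minimal sketch of how the test and effect size fit together, the code below computes Cramér's V from the chi-squared statistic using the standard formula V = sqrt(χ² / (n × (min(rows, cols) − 1))). The counts are hypothetical and chosen purely for illustration; `cramers_v()` is a helper name introduced here, not a function from the original analysis.

```r
# Cramer's V from a contingency table: sqrt(chi2 / (n * (min(r, c) - 1)))
cramers_v <- function(tbl) {
  chi2 <- suppressWarnings(chisq.test(tbl, correct = FALSE))$statistic
  n    <- sum(tbl)
  k    <- min(nrow(tbl), ncol(tbl)) - 1
  unname(sqrt(chi2 / (n * k)))
}

# HYPOTHETICAL counts, for illustration only (not the branch data)
toy_tbl <- matrix(c(900,  40,  60,
                    700, 120, 180),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(type   = c("Type A", "Type B"),
                                  status = c("ACTIVE", "CLOSED", "DORMANT")))

chisq.test(toy_tbl)$p.value < 0.05   # is the association significant?
cramers_v(toy_tbl)                   # how strong is it, on a 0 to 1 scale?
```

V is bounded between 0 (independence) and 1 (perfect association), which makes it comparable across tables of different sizes in a way the raw chi-squared statistic is not.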
Step 1 — Build the Contingency Table
Code
# Use top account types only to ensure adequate cell counts
h2_data <- df |>
  filter(ACCT_TYPE_TOP != "Other") |>
  droplevels()
contingency_tbl <- table(h2_data$ACCT_TYPE_TOP, h2_data$ACCOUNTSTATUS)
cat("=== CONTINGENCY TABLE: Account Type × Account Status ===\n")
=== CONTINGENCY TABLE: Account Type × Account Status ===
Code
# Pre-compute proportions first, then reorder by active rate separately
h2_plot_data <- h2_data |>
  count(ACCT_TYPE_TOP, ACCOUNTSTATUS) |>
  group_by(ACCT_TYPE_TOP) |>
  mutate(pct = n / sum(n)) |>
  ungroup()

# Compute active rate per account type for ordering
active_order <- h2_plot_data |>
  filter(ACCOUNTSTATUS == "ACTIVE") |>
  arrange(pct) |>
  pull(ACCT_TYPE_TOP)

h2_plot_data |>
  mutate(ACCT_TYPE_TOP = factor(ACCT_TYPE_TOP, levels = active_order)) |>
  ggplot(aes(x = ACCT_TYPE_TOP, y = pct, fill = ACCOUNTSTATUS)) +
  geom_col(position = "stack", width = 0.65) +
  geom_text(aes(label = if_else(pct > 0.02, percent(pct, accuracy = 0.1), "")),
            position = position_stack(vjust = 0.5),
            size = 3, colour = "white", fontface = "bold") +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  scale_fill_manual(values = c("ACTIVE" = "#1A9641",
                               "DORMANT" = "#FD8D3C",
                               "CLOSED" = "#D7191C")) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 22)) +
  labs(
    title = "Hypothesis 2: Account Status Composition by Account Type",
    subtitle = "Foreign Currency DOM accounts show highest dormancy; chi-squared association confirmed",
    x = NULL, y = "Proportion of Accounts", fill = "Status"
  ) +
  theme_minimal(base_size = 11)
Step 5 — Plain-Language Business Interpretation
Code
cat("**Result:** The chi-squared test produces a statistically significant result
(p < 0.05), allowing us to reject H₀. Account status is not independent of
account type: certain account types are significantly more likely to be dormant
or closed than others.\n\n")
Result: The chi-squared test produces a statistically significant result (p < 0.05), allowing us to reject H₀. Account status is not independent of account type: certain account types are significantly more likely to be dormant or closed than others.
Code
cat("**Effect size:** Cramér's V indicates the strength of this association.
Even a weak-to-moderate V is meaningful at this sample size because it
translates into hundreds of accounts behaving differently based solely on
their product type.\n\n")
Effect size: Cramér’s V indicates the strength of this association. Even a weak-to-moderate V is meaningful at this sample size because it translates into hundreds of accounts behaving differently based solely on their product type.
Code
cat("**Business implication:** Foreign Currency DOM accounts and certain
current account types show disproportionate dormancy. This suggests a
product-level onboarding problem: customers opening these accounts may not fully
understand the activation requirements, minimum balance expectations, or use
cases. The branch should introduce a product-specific onboarding checklist and
a 30-day follow-up call for high-dormancy account types, rather than applying
a single generic onboarding process to all new accounts.")
Business implication: Foreign Currency DOM accounts and certain current account types show disproportionate dormancy. This suggests a product-level onboarding problem: customers opening these accounts may not fully understand the activation requirements, minimum balance expectations, or use cases. The branch should introduce a product-specific onboarding checklist and a 30-day follow-up call for high-dormancy account types, rather than applying a single generic onboarding process to all new accounts.
7.5 Bonus Hypothesis 3 — Did Balance Quality Differ Between 2024 and 2025 Cohorts?
H₀: Mean log-balance is equal for accounts opened in 2024 vs 2025
H₁: Mean log-balance differs between the two cohorts
Test: Welch’s t-test
Rationale: If onboarding quality deteriorated over time, newer cohorts would show lower balances, which would be a critical strategic signal.
Welch Two Sample t-test
data: cohort_2024 and cohort_2025
t = -6.9785, df = 9806.9, p-value = 3.177e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.7042710 -0.3953851
sample estimates:
mean of x mean of y
8.328341 8.878169
Code
# Effect size
pooled_sd_c <- sqrt((sd(cohort_2024)^2 + sd(cohort_2025)^2) / 2)
cohens_d_c <- (mean(cohort_2024) - mean(cohort_2025)) / pooled_sd_c
cat("\nCohen's d (2024 vs 2025 cohort):", round(cohens_d_c, 4), "\n")
Cohen's d (2024 vs 2025 cohort): -0.1409
Code
df |>
  filter(OPEN_YEAR %in% c(2024, 2025), CRNT_BAL > 0) |>
  mutate(Cohort = factor(OPEN_YEAR)) |>
  ggplot(aes(x = Cohort, y = LOG_BALANCE, fill = Cohort)) +
  geom_violin(alpha = 0.5, trim = TRUE) +
  geom_boxplot(width = 0.12, fill = "white",
               outlier.size = 0.6, outlier.alpha = 0.3) +
  stat_summary(fun = mean, geom = "point",
               shape = 18, size = 4, colour = "black") +
  scale_fill_manual(values = c("2024" = "#2C7BB6", "2025" = "#FD8D3C")) +
  labs(
    title = "Hypothesis 3: Balance Quality, 2024 vs 2025 Opening Cohorts",
    subtitle = "Diamond (◆) = group mean | Tests whether onboarding quality changed year-on-year",
    x = "Account Opening Year", y = "log1p(Current Balance)",
    fill = "Cohort"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")
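Code
cat("**Result and business implication:** The 2025 cohort holds a higher mean
log-balance than the 2024 cohort (8.88 vs 8.33; t = -6.98, p < 0.001), so H₀ is
rejected, but Cohen's d of -0.14 is negligible in practical terms. The data therefore
do not support the concern that onboarding quality declined in 2025. Had the 2025
cohort shown significantly lower balances, that would have pointed to pressure to hit
account-opening volume targets at the expense of customer qualification, and would
have warranted a review of the marketer incentive structure and onboarding
screening criteria.")
Result and business implication: The 2025 cohort holds a higher mean log-balance than the 2024 cohort (8.88 vs 8.33; t = -6.98, p < 0.001), so H₀ is rejected, but Cohen’s d of -0.14 is negligible in practical terms. The data therefore do not support the concern that onboarding quality declined in 2025. Had the 2025 cohort shown significantly lower balances, that would have pointed to pressure to hit account-opening volume targets at the expense of customer qualification, and would have warranted a review of the marketer incentive structure and onboarding screening criteria.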
8. Correlation Analysis (Technique 4)
8.1 Theory Recap
Correlation analysis quantifies the strength and direction of relationships between variables. Pearson’s r measures linear correlation between normally distributed continuous variables, ranging from -1 (perfect negative) to +1 (perfect positive). Spearman’s ρ is a rank-based alternative appropriate for skewed or ordinal data; it is more robust to outliers and non-normality. Kendall’s τ is a third option used when sample sizes are small or tied ranks are numerous. Partial correlation extends this by measuring the relationship between two variables while controlling for the influence of a third, isolating the direct relationship from confounding. Correlation does not imply causation: a strong correlation between two variables may reflect a common cause, reverse causality, or coincidence.
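The difference between the three coefficients is easiest to see on a relationship that is perfectly monotonic but not linear. The toy vectors below are invented for illustration and are not drawn from the branch data.

```r
# Toy data: y rises monotonically with x, but not linearly
x <- 1:20
y <- exp(x / 2)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")

c(pearson = pearson, spearman = spearman, kendall = kendall)
```

The rank-based coefficients both equal 1 because the ordering of x and y agrees perfectly, while Pearson's r falls short of 1 because the points do not lie on a straight line. Heavily skewed balance data tends to behave the same way, which is the case for preferring Spearman's ρ on raw balances.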
8.2 Business Justification
As a Relationship Manager, understanding which account and customer characteristics move together is directly useful for portfolio management. If balance and restriction status are strongly negatively correlated, I know that flagging accounts early before restrictions are applied could preserve balance value. If age and account activity are correlated, I can segment my engagement effort by customer age band. Correlation analysis gives me a structured map of which variables are worth investigating further in regression — and which are redundant.
cat("Interpretation: If partial r differs substantially from zero-order r,\n")
Interpretation: If partial r differs substantially from zero-order r,
Code
cat("account age confounds the balance-activity relationship.\n")
account age confounds the balance-activity relationship.
Code
cat("If they are similar, the relationship is direct and not age-driven.\n")
If they are similar, the relationship is direct and not age-driven.
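The comparison of zero-order and partial r described above can be sketched on simulated data. Everything in this block is an assumption for illustration: `partial_r()` is a hypothetical helper implementing the standard first-order partial correlation formula, and the vectors are random draws in which a third variable z (playing the role of account age) drives both x and y.

```r
# First-order partial correlation of x and y controlling for z
partial_r <- function(x, y, z) {
  r_xy <- cor(x, y)
  r_xz <- cor(x, z)
  r_yz <- cor(y, z)
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

# SIMULATED data: z drives both x and y, which have no direct link
set.seed(42)
z <- rnorm(1000)
x <- z + rnorm(1000)
y <- z + rnorm(1000)

zero_order <- cor(x, y)           # inflated by the shared driver z
partial    <- partial_r(x, y, z)  # close to zero once z is held fixed

# Equivalent check: correlate the residuals after regressing each on z
resid_r <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))
c(zero_order = zero_order, partial = partial, resid_r = resid_r)
```

Here the zero-order correlation is sizeable while the partial correlation is near zero, which is the pattern that would indicate confounding by the control variable.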
8.8 Top Correlations — Business Interpretation
Code
tibble(
  Rank = 1:5,
  Pair = c(
    "Current Balance ↔ Available Balance",
    "Log Balance ↔ Has Balance",
    "Is Restricted ↔ Is Active",
    "Log Balance ↔ Is Active",
    "Has Balance ↔ Is Active"
  ),
  Pearson_r = c(
    round(pearson_mat["Current Bal", "Avail Bal"], 3),
    round(pearson_mat["Log Balance", "Has Balance"], 3),
    round(pearson_mat["Is Restricted", "Is Active"], 3),
    round(pearson_mat["Log Balance", "Is Active"], 3),
    round(pearson_mat["Has Balance", "Is Active"], 3)
  ),
  Business_Meaning = c(
    "Very strong: AVAILABLE_BALANCE and CRNT_BAL are highly collinear. Use only one in regression to avoid multicollinearity.",
    "Strong positive: accounts with any balance are far more likely to show higher log balances. Trivially true but confirms HAS_BALANCE is a valid binary proxy.",
    "Weak negative: restricted accounts are somewhat less likely to be active. Restriction status is a possible early warning signal for dormancy and closure.",
    "Positive: higher balance accounts are more likely to remain active. Balance at opening may predict long-term account health.",
    "Positive: accounts that were ever funded are substantially more likely to remain active. First deposit is the single most critical onboarding milestone."
  )
) |>
  kable(
    caption = "Table 7: Top 5 Correlations and Business Implications",
    col.names = c("Rank", "Variable Pair", "Pearson r", "Business Meaning")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE) |>
  column_spec(4, width = "45%")
Table 7: Top 5 Correlations and Business Implications

| Rank | Variable Pair | Pearson r | Business Meaning |
|---|---|---|---|
| 1 | Current Balance ↔ Available Balance | 0.898 | Very strong: AVAILABLE_BALANCE and CRNT_BAL are highly collinear. Use only one in regression to avoid multicollinearity. |
| 2 | Log Balance ↔ Has Balance | 0.749 | Strong positive: accounts with any balance are far more likely to show higher log balances. Trivially true but confirms HAS_BALANCE is a valid binary proxy. |
| 3 | Is Restricted ↔ Is Active | -0.120 | Weak negative: restricted accounts are somewhat less likely to be active. Restriction status is a possible early warning signal for dormancy and closure. |
| 4 | Log Balance ↔ Is Active | 0.256 | Positive: higher balance accounts are more likely to remain active. Balance at opening may predict long-term account health. |
| 5 | Has Balance ↔ Is Active | 0.334 | Positive: accounts that were ever funded are substantially more likely to remain active. First deposit is the single most critical onboarding milestone. |
8.9 Correlation vs Causation — Critical Note
Code
cat("**Critical methodological note:** The negative correlation between
restriction status and account activity (r = -0.120, weak in magnitude) does not
prove that restrictions *cause* dormancy. The causal direction could plausibly run
either way: restrictions may be applied *because* accounts are already inactive, or
restrictions may *trigger* inactivity by blocking transactions. Equally, a third
variable (customer financial distress, fraud flags, or KYC non-compliance) could drive
both restriction and dormancy simultaneously. Confirming causality would require a
controlled intervention, for example tracking whether accounts that had
restrictions lifted subsequently recovered activity at a higher rate than
unrestricted dormant accounts. This is flagged as a direction for further work
in Section 11.")
Critical methodological note: The negative correlation between restriction status and account activity (r = -0.120, weak in magnitude) does not prove that restrictions cause dormancy. The causal direction could plausibly run either way: restrictions may be applied because accounts are already inactive, or restrictions may trigger inactivity by blocking transactions. Equally, a third variable (customer financial distress, fraud flags, or KYC non-compliance) could drive both restriction and dormancy simultaneously. Confirming causality would require a controlled intervention, for example tracking whether accounts that had restrictions lifted subsequently recovered activity at a higher rate than unrestricted dormant accounts. This is flagged as a direction for further work in Section 11.
9. Logistic Regression (Technique 5)
9.1 Theory Recap
Logistic regression models the probability that a binary outcome equals 1 given a set of predictor variables. Unlike linear regression, it uses the logit link function to constrain predicted probabilities between 0 and 1. The model estimates log-odds coefficients (β) for each predictor; exponentiating these gives odds ratios (OR), which are more interpretable: an OR > 1 means the predictor increases the odds of the outcome, OR < 1 means it decreases them. Model fit is assessed through the confusion matrix, ROC curve, and AUC (Area Under the Curve) — a threshold-independent measure of discriminative ability where 0.5 = random chance and 1.0 = perfect classification. The Hosmer-Lemeshow test checks calibration — whether predicted probabilities match observed outcomes across probability deciles. Variance Inflation Factor (VIF) detects multicollinearity among predictors; values above 5 are concerning, above 10 are severe.
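The link between log-odds coefficients and odds ratios can be seen in a small simulated example. This is a sketch under stated assumptions, not the report's fitted model: the data are generated at random, and the variable names and true coefficients (`-1.2` for restriction, `0.25` for log balance) are invented for illustration.

```r
set.seed(42)

# Simulated accounts (hypothetical; not the branch data).
n <- 2000
is_restricted <- rbinom(n, 1, 0.10)
log_balance   <- rnorm(n, mean = 6, sd = 3)

# True log-odds of being active, then a binary outcome drawn from it.
true_logit <- 1.5 - 1.2 * is_restricted + 0.25 * log_balance
is_active  <- rbinom(n, 1, plogis(true_logit))
sim <- data.frame(is_active, is_restricted, log_balance)

# Logistic regression with the logit link.
fit <- glm(is_active ~ is_restricted + log_balance,
           data = sim, family = binomial(link = "logit"))

# Exponentiating log-odds coefficients gives odds ratios.
odds_ratios <- exp(coef(fit))

# Wald 95% confidence intervals on the odds-ratio scale.
or_ci <- exp(confint.default(fit))

round(cbind(OR = odds_ratios, or_ci), 3)
```

An odds ratio below 1 for `is_restricted` and above 1 for `log_balance` would match the signs built into the simulation. The intercept row is the baseline odds and is not interpreted as a predictor effect.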
9.2 Business Justification
Predicting which newly opened accounts are at risk of becoming dormant or closed — before it happens — is the central retention challenge for a Relationship Manager. A logistic regression model that assigns each account a probability of remaining active, based on observable characteristics at the point of opening, enables proactive rather than reactive engagement. Instead of waiting for an account to go dormant and then attempting recovery, the branch can flag high-risk accounts within the first 30 days and trigger a structured intervention. This transforms account management from a reactive administrative function into a predictive relationship strategy.
# Table 11: metrics derived from the test-set confusion matrix
tibble(
  Metric = c("Accuracy", "Sensitivity (Recall)", "Specificity", "Precision", "F1 Score"),
  Value = round(c(accuracy, sensitivity, specificity, precision, f1), 4),
  Meaning = c(
    "Overall correct classification rate",
    "Of all truly active accounts, % correctly identified",
    "Of all truly inactive accounts, % correctly identified",
    "Of predicted active, % actually active",
    "Harmonic mean of precision and sensitivity"
  )
) |>
  kable(caption = "Table 11: Confusion Matrix Derived Metrics") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 11: Confusion Matrix Derived Metrics

| Metric | Value | Meaning |
|---|---|---|
| Accuracy | 0.9402 | Overall correct classification rate |
| Sensitivity (Recall) | 0.9920 | Of all truly active accounts, % correctly identified |
| Specificity | 0.2096 | Of all truly inactive accounts, % correctly identified |
| Precision | 0.9466 | Of predicted active, % actually active |
| F1 Score | 0.9687 | Harmonic mean of precision and sensitivity |
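For readers who want to see how the metrics in Table 11 follow from the four cells of a confusion matrix, here is a minimal sketch. The function name `classification_metrics` and the example counts are hypothetical; the report's actual test-set counts are not shown in this section. Balanced accuracy is added as an extra metric because it is less flattered by a 93.6% majority class.

```r
# "Positive" means an active account, matching the convention in Table 11.
classification_metrics <- function(tp, fn, tn, fp) {
  sensitivity <- tp / (tp + fn)            # active accounts correctly identified
  specificity <- tn / (tn + fp)            # inactive accounts correctly identified
  precision   <- tp / (tp + fp)            # predicted active that are truly active
  accuracy    <- (tp + tn) / (tp + fn + tn + fp)
  f1          <- 2 * precision * sensitivity / (precision + sensitivity)
  balanced    <- (sensitivity + specificity) / 2
  round(c(accuracy = accuracy, sensitivity = sensitivity,
          specificity = specificity, precision = precision,
          f1 = f1, balanced_accuracy = balanced), 4)
}

# Hypothetical counts for illustration only (not the report's test set).
m <- classification_metrics(tp = 900, fn = 10, tn = 20, fp = 70)
m

# Balanced accuracy implied by the sensitivity and specificity in Table 11.
report_balanced <- mean(c(0.9920, 0.2096))
report_balanced
```

Applied to the published figures, balanced accuracy comes to roughly 0.60, noticeably lower than the headline accuracy of 0.9402.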
9.9 Business Interpretation of Key Coefficients
Reading the Model Output
The logistic regression model produces odds ratios for each predictor. Below are the five most actionable findings for branch management:
1. IS_RESTRICTED (PND Status): Accounts with any payment restriction (debit restriction or total block) have substantially lower odds of remaining active. This is the strongest behavioural predictor in the model. Operationally, a restriction flag should trigger an immediate Relationship Manager call — not a letter, not a system notification. The window between restriction and full dormancy is narrow.
2. LOG_BALANCE: Higher balance at the time of the extract is positively associated with active status. This confirms what practitioners know intuitively but the model now quantifies: accounts that were funded — even partially — are dramatically more likely to remain active. The single most important onboarding milestone is the first deposit. The branch should track ‘time to first deposit’ as a KPI for every new account, targeting completion within 7 days of opening.
3. HAS_BALANCE: Accounts with any balance at all (binary flag) show higher odds of being active even after controlling for the log-transformed balance amount. This means it is not just the size of the balance that matters — the mere fact of having been funded at all is independently predictive. Zero-balance accounts at 30 days should be treated as dormancy-in-progress.
4. CUSTOMER_TYPE (INDIVIDUAL vs CORPORATE): After controlling for balance and restriction status, customer type remains a significant predictor. Corporate accounts exhibit different activity patterns than individual accounts. This justifies maintaining separate onboarding and engagement protocols for the two segments rather than a one-size-fits-all approach.
5. ACCOUNT_AGE_DAYS: Older accounts (those opened earlier in the observation window) are more likely to be active, which partly reflects survivorship — accounts that opened in 2024 and are still active in April 2026 have proven their persistence. However, this also suggests that newer accounts (opened late 2025 or 2026) are at higher risk of early exit and deserve proportionally more intensive early-stage engagement.
9.10 Model Deployment Recommendation
Which Model and Why?
This single logistic regression model is recommended for immediate operational deployment over more complex alternatives (random forest, XGBoost) for three reasons specific to this banking context:
Interpretability: Branch management and compliance teams require explainable predictions. A Relationship Manager cannot act on a black-box score — they need to know why an account is flagged as high-risk. Logistic regression provides exact odds ratios for every predictor, making every prediction auditable.
Regulatory compatibility: Nigerian banking regulation (CBN guidelines) requires that credit and risk decisions be explainable to customers and auditors. A logistic regression satisfies this requirement; ensemble models typically do not without additional explainability tooling.
Practical deployment: The model requires only variables available at account opening — customer type, account type, depositor type, and opening year — plus early indicators available within 30 days (balance funded, restriction status). It can be operationalised as a simple scoring spreadsheet or embedded in the core banking MIS without specialist infrastructure.
Recommended operating threshold: 0.50 (default), applied to the predicted probability of remaining active. Given the 93.6% active rate in this dataset, this cut-off is conservative: it flags few accounts as at-risk, which limits unnecessary contact with healthy customers, but it identifies only about 21% of truly inactive accounts (specificity 0.2096, Table 11), and the accuracy of 0.9402 is close to the majority-class rate. If the branch wants to cast a wider net and catch more at-risk accounts at the cost of more false alarms, it should raise the cut-off on predicted active probability above 0.50, choosing the value from the ROC curve. Lowering it to 0.35–0.40 would flag even fewer accounts.
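How the choice of cut-off changes the number of flagged accounts can be checked with a short sweep. The sketch below assumes an account is flagged as at-risk when its predicted probability of remaining active falls below the cut-off. The probabilities are simulated from a Beta distribution purely for illustration, and `flag_rates` is a hypothetical helper rather than anything from the report.

```r
set.seed(1)

# Hypothetical predicted probabilities of remaining active, skewed towards 1
# to mimic a portfolio where most accounts are active.
n        <- 5000
p_active <- rbeta(n, 8, 1)

# Outcomes drawn so that they are consistent with those probabilities.
actual <- rbinom(n, 1, p_active)

# For a given cut-off, an account is flagged when p_active < cutoff.
flag_rates <- function(cutoff) {
  flagged <- p_active < cutoff
  c(cutoff          = cutoff,
    share_flagged   = mean(flagged),
    inactive_caught = sum(flagged & actual == 0) / sum(actual == 0),
    active_flagged  = sum(flagged & actual == 1) / sum(actual == 1))
}

# One row per candidate cut-off.
sweep <- t(sapply(c(0.35, 0.50, 0.65, 0.80), flag_rates))
round(sweep, 3)
```

Under this flagging convention, a higher cut-off always flags at least as many accounts, so the share of inactive accounts caught and the share of active accounts flagged unnecessarily both rise with it. The branch's contact capacity would determine where along that trade-off to operate.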
10. Integrated Findings
10.1 How the Five Analyses Fit Together
The five analytical techniques applied in this study were not conducted in isolation — each builds on the previous and collectively answers a single overarching business question: what determines whether a newly opened account at the Providus Bank Port Harcourt Branch remains active, and what should the branch do about it?
Exploratory Data Analysis established the foundation. The branch opened 14,714 accounts between January 2024 and April 2026, dominated by individual savings accounts (57.9%) and marketer-driven acquisition (40.6% of depositor types). The balance distribution is severely right-skewed — the median balance is approximately ₦826, while the mean is over ₦1 million — confirming that a small number of high-value accounts mask a large volume of unfunded or near-zero accounts. Two data quality issues were identified and resolved: age outliers (impossible values below 0 and above 100) and systematic high missingness in five columns. The Anscombe demonstration confirmed that summary statistics alone would have given branch management a dangerously misleading picture of portfolio health.
Data Visualisation translated these patterns into a coherent narrative across five plots. Monthly openings peaked in January 2025 and declined through 2025, suggesting acquisition momentum has slowed. Corporate accounts show disproportionately higher dormancy relative to their volume. Savings accounts cluster overwhelmingly at zero balance. Closed and dormant accounts skew towards customers under 30. And the marketer channel dominates acquisition so completely that portfolio quality is structurally dependent on marketer incentive design.
Hypothesis Testing validated three critical assumptions with statistical rigour. Corporate accounts hold significantly higher balances than individual accounts (Welch’s t-test, p < 0.05; Cohen’s d indicating a meaningful effect size) — confirming that corporate onboarding quality deserves disproportionate management attention. Account status is significantly associated with account type (chi-squared, p < 0.05; Cramér’s V quantifying the strength) — meaning product-specific onboarding protocols are justified, not optional. The 2024 vs 2025 cohort comparison tested whether onboarding quality has drifted over time — a finding with direct implications for the marketer incentive structure.
Correlation Analysis mapped the relationship landscape. The most actionable finding was the negative correlation between restriction status and account activity (r = -0.120): restricted accounts are less likely to remain active, and although the association is weak in magnitude, it persists after controlling for account age via partial correlation. The very high correlation between CRNT_BAL and AVAILABLE_BALANCE (r = 0.898) was identified and handled by excluding the latter from regression. The comparison between Pearson and Spearman correlations confirmed that skew in the balance variable materially affects correlation estimates, validating the decision to use log-transformation throughout.
Logistic Regression synthesised all prior findings into a predictive model. Tested on a held-out 30% sample, the model achieves meaningful discriminative ability (AUC reported in Section 9.8). The five most actionable predictors — restriction status, log balance, having any balance at all, customer type, and account age — provide a clear operational playbook. The model is interpretable, auditable, and deployable within existing branch infrastructure without specialist tooling.
10.2 Single Integrated Recommendation
The five analyses collectively support one central recommendation:
Implement a structured 30-60-90 day post-onboarding engagement protocol, differentiated by account type and customer segment.
Specifically:
Day 7: Automated check on whether the account has received its first deposit. Zero-balance accounts at Day 7 should trigger a Relationship Manager call, not a system-generated SMS. The model indicates that the mere fact of being funded, regardless of amount, is among the strongest early predictors of long-term account activity.
Day 30: Restriction status review. Any account with a PND flag at Day 30 should be escalated immediately. The correlation and regression analyses both identify restriction status as a predictor of dormancy, and the window for intervention is narrow.
Day 60: Segment-specific engagement. Corporate accounts showing low balance at Day 60 should receive a dedicated Relationship Manager review given their disproportionate balance contribution. Individual savings accounts at zero balance at Day 60 should be flagged for a lighter-touch digital re-engagement campaign.
Day 90: Dormancy risk scoring. Apply the logistic regression model to score all accounts opened in the preceding quarter. Accounts with predicted active probability below 0.50 should be prioritised for the next relationship management cycle.
This protocol addresses the root cause identified across all five analyses: the branch currently treats all newly opened accounts identically after opening, regardless of funding status, restriction flags, customer type, or product type. The data shows clearly that these variables have materially different implications for account survival — and the branch’s engagement model should reflect that.
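The Day 7, 30, 60 and 90 checkpoints above could be expressed as a simple rule function, for example in a scoring script run against a daily extract. This is a sketch under stated assumptions: the function name `triage_account`, its arguments, the action labels, and the `low_balance_limit` of 10,000 are all hypothetical, and the choice to let the Day 60 rules take over from the Day 7 rule is a design decision made for this illustration.

```r
# Returns the actions due for one account, based on the checkpoint rules.
# All names, labels, and the low-balance limit are illustrative assumptions.
triage_account <- function(days_open, balance, is_restricted,
                           customer_type, account_type,
                           p_active = NA_real_,
                           low_balance_limit = 10000) {
  actions <- character(0)

  # Day 7 check: no first deposit yet (Day 60 rules take over from day 60).
  if (days_open >= 7 && days_open < 60 && balance == 0) {
    actions <- c(actions, "RM call: no first deposit")
  }

  # Day 30 check: any restriction (PND) flag is escalated.
  if (days_open >= 30 && is_restricted) {
    actions <- c(actions, "Escalate: restriction flag")
  }

  # Day 60 check: segment-specific follow-up.
  if (days_open >= 60) {
    if (customer_type == "CORPORATE" && balance < low_balance_limit) {
      actions <- c(actions, "RM review: low corporate balance")
    } else if (customer_type == "INDIVIDUAL" &&
               account_type == "SAVINGS" && balance == 0) {
      actions <- c(actions, "Digital re-engagement campaign")
    }
  }

  # Day 90 check: model score below the operating threshold.
  if (days_open >= 90 && !is.na(p_active) && p_active < 0.50) {
    actions <- c(actions, "Prioritise: low predicted activity")
  }

  if (length(actions) == 0) "No action" else actions
}

# A 10-day-old unfunded individual savings account.
triage_account(10, 0, FALSE, "INDIVIDUAL", "SAVINGS")

# A 95-day-old restricted corporate account with a low model score.
triage_account(95, 50000, TRUE, "CORPORATE", "CURRENT", p_active = 0.30)
```

Returning a vector of actions rather than a single label lets one account trigger several checkpoints at once, which seems closer to how the protocol is described.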
11. Limitations & Further Work
11.1 Data Limitations
Single branch: All 14,714 accounts are from the Port Harcourt Branch only. Findings may not generalise to other Providus Bank branches operating in different economic environments, customer demographics, or SBU structures. A multi-branch dataset would allow branch-level fixed effects and substantially stronger generalisability.
Cross-sectional snapshot: The dataset captures account status and balance at a single point in time (April 2026). It does not contain transaction history, balance trajectories over time, or the sequence of events leading to dormancy or closure. A longitudinal extract — monthly balance snapshots over the 28-month window — would enable survival analysis and time-to-dormancy modelling, which would be far more powerful than the static logistic regression applied here.
High missingness in key variables: RELIGION (62.7% missing), TELL_NAME (45.5%), and TELEPHONE1 (37.8%) could not be meaningfully used in analysis. The causes of missingness are unknown — it may be systematic (certain account types not requiring these fields) rather than random, which could bias conclusions drawn from the complete-case analysis.
Class imbalance: The outcome variable IS_ACTIVE is 93.6% positive. Logistic regression can still be fitted, but the model's ability to correctly identify the minority class (inactive accounts) is constrained: specificity is only 0.2096 at the 0.50 threshold (Table 11). Techniques such as SMOTE oversampling, class weighting, or cost-sensitive learning could improve minority-class prediction, which is precisely the class the branch most needs to identify correctly.
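One low-effort form of cost-sensitive learning is to weight the minority class more heavily when fitting the model. The sketch below illustrates the idea on simulated data; it is not the report's model. The variable names, the simulated coefficients, and the choice of an integer weight equal to the rounded active-to-inactive ratio are assumptions made for this example.

```r
set.seed(7)

# Simulated, heavily imbalanced outcome (roughly 94% active); hypothetical data.
n           <- 4000
log_balance <- rnorm(n)
is_active   <- rbinom(n, 1, plogis(3 + 1.0 * log_balance))
sim <- data.frame(is_active, log_balance)

# Integer weight for inactive accounts: the rounded active-to-inactive ratio.
ratio <- round(sum(sim$is_active == 1) / sum(sim$is_active == 0))
sim$w <- ifelse(sim$is_active == 1, 1, ratio)

# Same model fitted without and with class weights.
fit_plain    <- glm(is_active ~ log_balance, data = sim, family = binomial)
fit_weighted <- glm(is_active ~ log_balance, data = sim, family = binomial,
                    weights = w)

# Share of truly inactive accounts predicted inactive at the 0.50 threshold.
specificity_at_half <- function(fit) {
  predicted_active <- predict(fit, type = "response") >= 0.5
  sum(!predicted_active & sim$is_active == 0) / sum(sim$is_active == 0)
}

spec_plain    <- specificity_at_half(fit_plain)
spec_weighted <- specificity_at_half(fit_weighted)
c(plain = spec_plain, weighted = spec_weighted)
```

Weighting mainly shifts the intercept, so it behaves much like moving the decision threshold. The weighted model's probabilities are no longer calibrated to the true active rate, and would need recalibration before being reported as probabilities.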
11.2 Methodological Limitations
Correlation is not causation: The correlation and regression findings establish association, not causality. The relationship between restriction status and dormancy, for example, could reflect reverse causality — restrictions may be applied to already-inactive accounts rather than causing inactivity. A randomised or quasi-experimental design (e.g., difference-in-differences comparing accounts before and after a policy change) would be required to establish causality.
Logistic regression assumptions: The model assumes a linear relationship between continuous predictors and the log-odds of the outcome. This assumption was not formally tested for all predictors. Non-linear extensions (polynomial terms, spline regression) may improve fit, particularly for AGE_CLEAN and ACCOUNT_AGE_DAYS where non-linear effects are plausible.
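A simple way to probe the linearity-in-the-logit assumption is to compare the linear fit against one that uses a natural cubic spline for the same predictor. The sketch below does this on simulated data with a deliberately curved age effect. The data, the variable names, and the choice of three degrees of freedom are illustrative assumptions and do not come from the report.

```r
library(splines)  # ships with base R; provides ns() for natural cubic splines

set.seed(11)

# Simulated customers whose probability of staying active peaks in mid-life
# (a deliberately non-linear effect; hypothetical data).
n         <- 3000
age       <- runif(n, 18, 80)
is_active <- rbinom(n, 1, plogis(2 - 0.004 * (age - 45)^2))
sim <- data.frame(is_active, age)

# Model 1: age enters linearly on the log-odds scale.
fit_linear <- glm(is_active ~ age, data = sim, family = binomial)

# Model 2: age enters through a natural spline with 3 degrees of freedom.
fit_spline <- glm(is_active ~ ns(age, df = 3), data = sim, family = binomial)

# Likelihood-ratio test: does the spline fit significantly better?
lrt <- anova(fit_linear, fit_spline, test = "Chisq")
p_nonlinear <- lrt[["Pr(>Chi)"]][2]

lrt
p_nonlinear
```

A small p-value suggests the linear term is too restrictive for that predictor. The same comparison could be repeated for AGE_CLEAN and ACCOUNT_AGE_DAYS in the actual model.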
11.3 What Would Be Done Differently With More Resources
With access to full transaction history, a survival analysis (Cox proportional-hazards model) would replace logistic regression as the primary technique — modelling time-to-dormancy rather than static active status, and accounting for censoring (accounts still active at the observation cutoff). With a multi-branch dataset, multilevel modelling would separate within-branch effects from between-branch variation. With more computing power and a labelled historical dataset of intervention outcomes, a reinforcement learning approach could optimise the engagement protocol in real time based on observed customer responses.
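To make the survival-analysis proposal concrete, here is a minimal sketch of a Cox proportional-hazards fit using the `survival` package (a recommended package distributed with R). Everything about the data is assumed: dormancy times are simulated from an exponential distribution, the follow-up window of 30 to 850 days is invented, and restriction status is given a true hazard ratio of exp(0.9) purely for illustration.

```r
library(survival)

set.seed(3)

# Simulated accounts (hypothetical; the report has no time-to-dormancy data).
n             <- 1500
is_restricted <- rbinom(n, 1, 0.15)

# Latent days until dormancy; restricted accounts have a higher hazard.
days_to_dormancy <- rexp(n, rate = 0.0005 * exp(0.9 * is_restricted))

# Each account is observed only until the extract date (right-censoring).
follow_up_days <- runif(n, 30, 850)

sim <- data.frame(
  time          = pmin(days_to_dormancy, follow_up_days),
  went_dormant  = as.integer(days_to_dormancy <= follow_up_days),
  is_restricted = is_restricted
)

# Cox model: the hazard of dormancy as a function of restriction status.
fit_cox <- coxph(Surv(time, went_dormant) ~ is_restricted, data = sim)

# A hazard ratio above 1 means faster progression to dormancy.
hazard_ratio <- exp(coef(fit_cox))
hazard_ratio
```

Accounts still active at the cut-off contribute their observed time without being treated as "never dormant", which is the main advantage over the static active/inactive outcome used in the logistic model.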
References
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online
Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21. https://doi.org/10.1080/00031305.1973.10478966
R Core Team. (2026). R: A language and environment for statistical computing (Version 4.6.0). R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: An open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77. https://doi.org/10.1186/1471-2105-12-77
Fox, J., & Weisberg, S. (2019). An R companion to applied regression (3rd ed.). Sage. [car package]
Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3), 1–25. https://doi.org/10.18637/jss.v040.i03
Tierney, N., & Cook, D. (2023). Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations. Journal of Statistical Software, 105(7), 1–31. https://doi.org/10.18637/jss.v105.i07 [naniar package]
Zhu, H. (2024). kableExtra: Construct complex table with ‘kable’ and pipe syntax. https://CRAN.R-project.org/package=kableExtra
Claude (Anthropic, claude.ai) was used as a coding assistant throughout this project. Specifically, AI assistance was used to: (1) generate initial R code scaffolding for ggplot2 visualisations and logistic regression diagnostics; (2) debug rendering errors in the Quarto document; and (3) suggest appropriate chunk options and table formatting syntax.
All analytical decisions were made independently by the author. These include: the choice of Case Study 1 and the five techniques applied; the framing of the three research hypotheses and their business interpretation; the decision to use Welch’s t-test over Student’s t-test following the Levene test result; the selection of predictor variables for the logistic regression model and the exclusion of AVAILABLE_BALANCE due to collinearity; the interpretation of all model outputs; and the integrated recommendation in Section 10. The author can explain and defend every analytical choice, every line of code, and every conclusion drawn in this document.
Data collection, anonymisation, and the professional disclosure were authored entirely without AI assistance.